-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add llama C++ example #926
base: main
Are you sure you want to change the base?
Conversation
// Show usage of GetOutput | ||
std::unique_ptr<OgaTensor> output_logits = generator->GetOutput("logits"); | ||
|
||
// Assuming output_logits.Type() is float as it's logits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just fyi, the raw model output can be float16 if the model runs on cuda. Our internal "processed logits" are always float32
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is this correct, or not?
generator->GenerateNextToken(); | ||
|
||
// Show usage of GetOutput | ||
std::unique_ptr<OgaTensor> output_logits = generator->GetOutput("logits"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the std::unique_ptr<OgaTensor>
for clarity vs auto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So auto
would be simpler? I'll update it
std::cout << "Run Llama" << std::endl; | ||
std::cout << "-------------" << std::endl; | ||
|
||
#ifdef USE_CXX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the USE_CXX here for, given the whole file is C++?
What changes from one decoder-only model to another? Shall we instead create a standardized example that can work with other decoder-only models? |
I agree that one standardized example should be sufficient. From a user's perspective, the main change between decoder-only models is the chat template. |
@@ -6,6 +6,7 @@ set(CMAKE_CXX_STANDARD 20) | |||
option(USE_CUDA "Build with CUDA support" OFF) | |||
option(USE_CXX "Invoke the C++ example" ON) | |||
option(PHI3 "Build the Phi example" OFF) | |||
option(LLAMA "Build the Llama example" OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have something like the following?
option(LLAMA "Build the Llama example" OFF) | |
option(LLM "Build the large-language model example" OFF) | |
option(VLM "Build the vision-language model example" OFF) | |
option(ALM "Build the audio-language model example" OFF) |
The abbreviations and names can be different, but the idea would be to create examples grouped by input and output modality.
- The
LLM
could be for decoder-only models (e.g. LLaMA, Phi)- Inputs: text
- Outputs: text
- The
VLM
could be for vision-text models (e.g. Phi-3/Phi-3.5 vision)- Inputs: images, text
- Outputs: text
- The
ALM
could be for audio-text models (e.g. Whisper)- Inputs: audios, text
- Outputs: text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great proposal for a new PR!
I agree that we should have a base decoder class, maybe? And these scripts can inherit from that class. But I'd like to get this sample in soonish so that folks can run Llama |
No description provided.