onnxruntime-genai
generation speed very slow on int4 #1098
Comments
Your graph says "tokens per second", not "execution time". Your graph says int4 does the most tokens per second. So your graph seems to be saying the opposite of what you are saying, unless you labelled the axis wrong? 😕
Yes. That's a way to measure execution time -- or at least "speed" :)
Correct, for llama-cli it's the highest, at 140 tokens/s. Results in JSON: https://github.com/tarekziade/onnxruntime-test/blob/main/results.json
I don't think it does; maybe what is confusing is that the graph includes both onnx and llama.cpp results?
I see, so the ones labelled "onnx" are the ones you are running in genai and the ones labelled "llama" are the ones running in llama.cpp.
+1 to two of the issues raised. I am getting the exact same error on the fp16 version of my model:
Plus, I am also confused about the int8 support in the model builder. It seems it is supported to a certain extent (`io_dtype = TensorProto.FLOAT if precision in {"int8", "fp32"}`), but similarly I get an error if I actually attempt to use it:
Clarification would be helpful (especially as int8 can be supported through other means, such as exporting with GPTQ or using TensorRT's Model Optimizer).
+1 to the issues for fp16 versions of this model:
Subscribing to this thread 👍
+1. I have updated from 0.4.0 to 0.5.2 and am experiencing about a 3x slowdown 😟 with the phi3 int4 onnx model. (I haven't tested other models.)
FP16 CPU is not officially supported in ONNX Runtime. While an FP16 CPU model can be created, many of the model's operators still need to be implemented for FP16 CPU. When an operator is not implemented for FP16 CPU, the model fails to run with an error like the one reported above. Thus, although the model builder can create an FP16 CPU model, a warning is printed at the beginning (see onnxruntime-genai/src/python/py/models/builder.py, line 3288 at 7735e10).
Performance gains with INT4 precision should come from the ONNX Runtime version you have installed. This is because the quantized weights are handled within the ONNX model itself, while the global inputs and outputs to the ONNX model are still in FP16 or FP32 precision. Since ONNX Runtime GenAI only manages the global inputs and outputs to the model and relies upon ONNX Runtime to run the model, it should not be causing this issue. Can you try downgrading to an older ONNX Runtime version? When installing the newest ONNX Runtime GenAI package, it will try to upgrade you to ONNX Runtime v1.20.1. Before re-benchmarking, you may need to re-build the ONNX model by commenting out the line referenced in the model builder and removing the previously generated model files.
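As a quick sanity check, something like the following confirms which builds are actually installed before and after downgrading (a minimal sketch; the exact target version to pin depends on what you were running with 0.4.0):

```python
# Minimal sketch: report the installed ONNX Runtime and GenAI versions,
# so a downgrade/re-benchmark can be compared against a known baseline.
from importlib.metadata import version

import onnxruntime as ort

print("onnxruntime:", ort.__version__)
print("onnxruntime-genai:", version("onnxruntime-genai"))
```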
I have built a small example using the Python binding here: https://github.com/tarekziade/onnxruntime-test/blob/main/run.py
to measure the inference speed on my Apple M1 and on a Windows 11 box, using Qwen 2.5 0.5B Instruct.
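The measurement loop in that script is roughly of the following shape (a simplified sketch against the 0.5.x-style Python API; the model folder name here is a placeholder, and the real run.py differs in its details):

```python
# Simplified sketch of a tokens-per-second measurement with the
# onnxruntime-genai Python bindings (0.5.x-style API).
# The model folder name below is a placeholder.
import time

import onnxruntime_genai as og

model = og.Model("qwen2.5-0.5b-instruct-int4")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("Tell me a short story.")

generator = og.Generator(model, params)
start, generated = time.time(), 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    generated += 1

elapsed = time.time() - start
print(f"{generated / elapsed:.1f} tokens/s")
```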
To prepare the model, I used the CPU execution provider and the int4/fp16/fp32 precisions:
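The builder invocations were along these lines (a sketch; the output and cache paths are placeholders, and the same command was repeated with `-p fp16` and `-p fp32`):

```
python -m onnxruntime_genai.models.builder \
  -m Qwen/Qwen2.5-0.5B-Instruct \
  -o qwen2.5-0.5b-instruct-int4 \
  -p int4 \
  -e cpu \
  -c hf_cache
```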
I then compared the execution times with llama-cli, using a q4_0 GGUF of the same model.
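The llama.cpp side was run roughly as follows (a sketch; the GGUF filename is a placeholder, and llama-cli prints its own timing summary, including tokens per second, at the end of a run):

```
llama-cli -m qwen2.5-0.5b-instruct-q4_0.gguf -p "Tell me a short story." -n 256
```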
On the Apple M1, the int4 precision is extremely slow, and fp16 failed on both platforms with an error.
I was wondering if I did something wrong? I was also wondering if int8 precision is an option. It looks like onnxruntime_genai.models.builder can apply some int8 quantization through the int4 mode, but I am not entirely clear about this.