By default, the layers of the model are distributed among the GPUs based on their free memory. If both GPUs have the same amount of free memory, they should each get the same number of layers. This assumes that each layer is the same size, which I think is not the case with this model. You can change the default distribution with the -ts parameter, which should give you more even usage of the two GPUs.
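As a rough sketch of what that could look like with two identical GPUs (the binary name, model path, and quant suffix below are placeholders, not taken from this report):

```sh
# Override the default free-memory-based split and place tensors
# on the two GPUs in equal proportions.
./llama-cli -m ./Llama-3_1-Nemotron-51B-Instruct-Q4_K_M.gguf \
    -ngl 99 \
    -ts 1,1 \
    -p "Hello"
```

The proportions are relative, so -ts 1,1 and -ts 50,50 mean the same thing; if one GPU still fills up first, the split can be biased the other way, e.g. -ts 4,6.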
Name and Version
Operating systems
Linux
GGML backends
CUDA
Hardware
2× RTX 3090
Models
Llama-3_1-Nemotron-51B-Instruct
Problem description & steps to reproduce
Model download:
Bartowski's GGUF behaves the same:
When I try to load this model, llama.cpp places most of the layers on CUDA0 and only a few on CUDA1.
I had to reduce the context size considerably just to get it to load at all.
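For reference, a minimal way to reproduce this kind of setup (the exact command, context size, and file name here are assumptions, not the original invocation):

```sh
# Fully offload the model with the default GPU split; on this machine
# CUDA0 ends up holding far more of the model than CUDA1.
./llama-cli -m ./Llama-3_1-Nemotron-51B-Instruct-Q4_K_M.gguf \
    -ngl 99 \
    -c 8192
```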
First Bad Commit
No response
Relevant log output
nvidia-smi confirms this asymmetric usage pattern: