CPU Performance #3167
Replies: 4 comments 4 replies
-
LLMs are bound by memory bandwidth, not compute. If you get faster RAM (or a GPU), you will get more tokens per second. That said, 7 tokens per second is already quite good. You should also try it with 16 threads instead of 32: inference does not benefit from SMT; in fact, it hurts.
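To see why, here is a back-of-the-envelope sketch that treats generation as purely bandwidth-bound; the model size and bandwidth figures in the example are illustrative assumptions, not measurements:

```python
# Generating one token has to stream essentially all of the quantized weights
# from RAM, so tokens/s is capped at roughly memory_bandwidth / model_size.
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Example (assumptions, not measurements): a 13B Q4_0 model is roughly 7 GB,
# and an EPYC 7502P with all 8 channels of DDR4-3200 populated has about
# 205 GB/s of theoretical bandwidth. Real sustained bandwidth is well below that.
print(max_tokens_per_second(204.8, 7.0))  # ~29 tokens/s hard ceiling
```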
-
The really interesting question to ask is: how many sticks of RAM are in there? 8x16 GB will be roughly twice as fast as 4x32 GB, since only populated memory channels contribute bandwidth. The next question is the speed of those sticks; the best results would be with 8x DDR4-3200.
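A minimal sketch of that arithmetic (theoretical bandwidth is channels × transfer rate × 8 bytes per 64-bit transfer; the two configurations below are just examples):

```python
# Theoretical DDR bandwidth: channels * MT/s * 8 bytes per 64-bit transfer.
def ddr_bandwidth_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # MB/s -> GB/s

print(ddr_bandwidth_gb_s(4, 3200))  # 4 populated channels of DDR4-3200 -> 102.4 GB/s
print(ddr_bandwidth_gb_s(8, 3200))  # 8 populated channels of DDR4-3200 -> 204.8 GB/s
```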
-
I just got an AMD EPYC 9654 CPU with 96 cores and 192 threads. The CPU supports up to 12 memory channels and up to roughly 460 GB/s of memory bandwidth. At first I had only one 64 GB stick of RAM, which gave 1.5 tokens/s when inferencing a 34B Q4_0 model. I replaced the 64 GB stick with two 32 GB ones and now get 4 tokens/s on the same 34B model. Memory bandwidth really matters for inference speed. I suggest populating all of the memory slots on the server motherboard; in theory that maximizes the CPU's usable memory bandwidth.
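That scaling lines up with simple channel arithmetic. A rough sanity check (it assumes DDR5-4800 DIMMs and a 34B Q4_0 model of roughly 19 GB, both of which are guesses, not measurements):

```python
# Rough scaling check for the numbers above, assuming DDR5-4800 DIMMs and a
# 34B Q4_0 model of roughly 19 GB (both are assumptions, not measurements).
MODEL_GB = 19.0
PER_CHANNEL_GB_S = 4800 * 8 / 1000  # one DDR5-4800 channel ~= 38.4 GB/s

for channels in (1, 2, 12):
    bandwidth = channels * PER_CHANNEL_GB_S
    print(f"{channels:2d} channels: ~{bandwidth:5.1f} GB/s -> "
          f"ceiling ~{bandwidth / MODEL_GB:.1f} tokens/s")
# 1 channel  -> ~2 tokens/s ceiling (observed 1.5)
# 2 channels -> ~4 tokens/s ceiling (observed 4)
# 12 channels would raise the ceiling to ~24 tokens/s
```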
-
Hello, I was wondering if anyone has looked into whether memory bandwidth from both sockets is actually being used. My setup has 8 DIMMs, 4 per CPU socket (NUMA). I saw some posts saying this was done deliberately to avoid remote NUMA accesses. Any comments?
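One minimal way to check on Linux is to pin the process to a single node's cores and compare tokens/s with an unpinned run (a rough sketch, not how llama.cpp handles NUMA internally; `numactl --cpunodebind`/`--membind` does the same thing more conveniently):

```python
# A minimal Linux-only sketch: pin the current process to the cores of one
# NUMA node before launching inference, so memory gets allocated (and read)
# locally on that socket. Comparing tokens/s pinned vs. unpinned gives a
# quick hint whether the second socket's bandwidth was being used at all.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse the kernel's cpulist for a NUMA node, e.g. '0-15,64-79'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

os.sched_setaffinity(0, cpus_of_node(0))  # keep this process on node 0 only
```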
-
I am trying to set up the Llama-2 13B model for a client on their server. It has an AMD EPYC 7502P 32-core CPU with 128 GB of RAM. I am getting the following results when using 32 threads:
TL;DR: Is this performance good or bad? Can I do anything to improve it short of running the model on a GPU?
I have never worked with anything related to machine learning, so I don't know what the expected performance for this kind of model on this machine is. I've searched other topics, but most either don't mention explicit numbers or discuss things I am not familiar with. My client has tested the model on HuggingFace and they tell me it's much faster than what we are getting here. Is there some build option I am missing (I enabled OpenBLAS)? Or is this the expected performance? Can I do something to improve it? I have already converted the model using Q4_0 quantisation. If anyone has run the model on a GPU, what sort of performance improvement should we expect?
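For reference, a quick thread-count sweep might show whether 32 threads is actually the best setting. A rough sketch, assuming llama.cpp's example `main` binary; the model path and prompt are placeholders, and the wall time includes model-load time, so treat the numbers as relative only:

```python
# Hypothetical thread-count sweep over llama.cpp's example `main` binary.
# Paths and the prompt are placeholders; adjust for the actual build.
import subprocess
import time

MODEL = "models/llama-2-13b.Q4_0.gguf"   # placeholder path
PROMPT = "Write one sentence about memory bandwidth."

for threads in (8, 16, 24, 32):
    start = time.perf_counter()
    subprocess.run(
        ["./main", "-m", MODEL, "-p", PROMPT, "-n", "128", "-t", str(threads)],
        capture_output=True,
        check=True,
    )
    elapsed = time.perf_counter() - start
    print(f"{threads:2d} threads: {elapsed:.1f} s for 128 tokens "
          f"(~{128 / elapsed:.1f} tokens/s incl. load time)")
```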