CPU Performance #3167
Replies: 4 comments 4 replies
-
LLMs are bound by memory bandwidth, not compute. If you get faster RAM (or a GPU), you will get more tokens per second. That said, 7 tokens per second is already quite good. You should also try it with 16 threads instead of 32: inference does not benefit from SMT; in fact, it hurts.
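To see why, here is a back-of-the-envelope sketch that treats generation as purely bandwidth-bound; the model size and bandwidth figures in the example are illustrative assumptions, not measurements:

```python
# Generating one token has to stream essentially all of the quantized weights
# from RAM, so tokens/s is capped at roughly memory_bandwidth / model_size.
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Example (assumptions, not measurements): a 13B Q4_0 model is roughly 7 GB,
# and an EPYC 7502P with all 8 channels of DDR4-3200 populated has about
# 205 GB/s of theoretical bandwidth. Real sustained bandwidth is well below that.
print(max_tokens_per_second(204.8, 7.0))  # ~29 tokens/s hard ceiling
```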
-
The really interesting question to ask is: how many sticks of RAM are in there? 8x16 GB will be roughly twice as fast as 4x32 GB, since only populated memory channels contribute bandwidth. The next question is the speed of those sticks; the best results would be with 8x DDR4-3200.
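A minimal sketch of that arithmetic (theoretical bandwidth is channels × transfer rate × 8 bytes per 64-bit transfer; the two configurations below are just examples):

```python
# Theoretical DDR bandwidth: channels * MT/s * 8 bytes per 64-bit transfer.
def ddr_bandwidth_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000  # MB/s -> GB/s

print(ddr_bandwidth_gb_s(4, 3200))  # 4 populated channels of DDR4-3200 -> 102.4 GB/s
print(ddr_bandwidth_gb_s(8, 3200))  # 8 populated channels of DDR4-3200 -> 204.8 GB/s
```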
-
I just got an AMD EPYC 9654 CPU with 96 cores and 192 threads. The CPU supports up to 12 memory channels and up to roughly 460 GB/s of memory bandwidth. At first I had only one 64 GB stick of RAM, which gave 1.5 tokens/s when inferencing a 34B Q4_0 model. I replaced the 64 GB stick with two 32 GB ones and now get 4 tokens/s on the same 34B model. Memory bandwidth really matters for inference speed. I suggest populating all of the memory slots on the server motherboard; in theory that maximizes the CPU's usable memory bandwidth.
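That scaling lines up with simple channel arithmetic. A rough sanity check (it assumes DDR5-4800 DIMMs and a 34B Q4_0 model of roughly 19 GB, both of which are guesses, not measurements):

```python
# Rough scaling check for the numbers above, assuming DDR5-4800 DIMMs and a
# 34B Q4_0 model of roughly 19 GB (both are assumptions, not measurements).
MODEL_GB = 19.0
PER_CHANNEL_GB_S = 4800 * 8 / 1000  # one DDR5-4800 channel ~= 38.4 GB/s

for channels in (1, 2, 12):
    bandwidth = channels * PER_CHANNEL_GB_S
    print(f"{channels:2d} channels: ~{bandwidth:5.1f} GB/s -> "
          f"ceiling ~{bandwidth / MODEL_GB:.1f} tokens/s")
# 1 channel  -> ~2 tokens/s ceiling (observed 1.5)
# 2 channels -> ~4 tokens/s ceiling (observed 4)
# 12 channels would raise the ceiling to ~24 tokens/s
```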
-
Hello, I was wondering if anyone has looked into whether memory bandwidth from both sockets is actually being used. My setup has 8 DIMMs, 4 per CPU socket (NUMA). I saw some posts saying this was done deliberately to avoid remote NUMA accesses. Any comments?
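One minimal way to check on Linux is to pin the process to a single node's cores and compare tokens/s with an unpinned run (a rough sketch, not how llama.cpp handles NUMA internally; `numactl --cpunodebind`/`--membind` does the same thing more conveniently):

```python
# A minimal Linux-only sketch: pin the current process to the cores of one
# NUMA node before launching inference, so memory gets allocated (and read)
# locally on that socket. Comparing tokens/s pinned vs. unpinned gives a
# quick hint whether the second socket's bandwidth was being used at all.
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse the kernel's cpulist for a NUMA node, e.g. '0-15,64-79'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

os.sched_setaffinity(0, cpus_of_node(0))  # keep this process on node 0 only
```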
-
I am trying to set up the Llama-2 13B model for a client on their server. It has an AMD EPYC 7502P 32-core CPU with 128 GB of RAM. I am getting the following results when using 32 threads:
TL;DR: Is this performance good or bad? Can I do anything to improve it short of running the model on a GPU?
I have never worked with anything related to machine learning, so I don't know what the expected performance for this kind of model on this machine is. I've searched other topics, but most either don't mention explicit numbers or discuss things I am not familiar with. My client has tested the model on HuggingFace and they tell me it's much faster than what we are getting here. Is there some build option I am missing (I enabled OpenBLAS)? Or is this the expected performance? Can I do something to improve it? I have already converted the model using Q4_0 quantisation. If anyone has run the model on a GPU, what sort of performance improvement should we expect?
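For reference, a quick thread-count sweep might show whether 32 threads is actually the best setting. A rough sketch, assuming llama.cpp's example `main` binary; the model path and prompt are placeholders, and the wall time includes model-load time, so treat the numbers as relative only:

```python
# Hypothetical thread-count sweep over llama.cpp's example `main` binary.
# Paths and the prompt are placeholders; adjust for the actual build.
import subprocess
import time

MODEL = "models/llama-2-13b.Q4_0.gguf"   # placeholder path
PROMPT = "Write one sentence about memory bandwidth."

for threads in (8, 16, 24, 32):
    start = time.perf_counter()
    subprocess.run(
        ["./main", "-m", MODEL, "-p", PROMPT, "-n", "128", "-t", str(threads)],
        capture_output=True,
        check=True,
    )
    elapsed = time.perf_counter() - start
    print(f"{threads:2d} threads: {elapsed:.1f} s for 128 tokens "
          f"(~{128 / elapsed:.1f} tokens/s incl. load time)")
```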