Graphs in llama.cpp #11039

abhinav9629 · 2025-01-02T06:01:34Z

abhinav9629
Jan 2, 2025

Hi,
I'm trying to debug and understand the llama.cpp codebase.
I have a couple of questions regarding graph creation.

In llama_new_context_with_model, what is the need for two graphs, the pp graph and the tg graph?
What exactly is the worst-case scenario for which we reserve a worst-case graph?
Why do we need to reserve a worst-case graph again when the KV cache is defragmented or a K-shift is applied?

danbev · 2025-01-06T15:54:23Z

danbev
Jan 6, 2025
Collaborator

I'll try to answer, and hopefully others can correct me if I'm wrong about anything here.

The two computation graphs are for prompt processing (pp), i.e. the prefill of the initial prompt, which usually consists of multiple tokens, and for token generation (tg), which processes a single token.

The worst case is the one where the prompt-processing batch is completely filled with the maximum number of tokens.
I believe the token generation reservation is there to get the number of graph splits and the number of nodes. After this, the pp graph is reserved again so that the reservation does not have to be done at inference time; without this "re-reservation", the tensor allocation sizes would be those of the tg graph nodes.
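
To make the ordering concrete, here is a minimal toy model of the pp, tg, pp reservation sequence. It does not use the real ggml scheduler: `ToyGraph`, `ToySched`, `worst_case_tokens` and `replans_at_first_prompt` are hypothetical names made up for illustration, and the `min(n_ctx, n_ubatch)` rule for the worst-case batch size is my assumption rather than something stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a compute graph: only the batch size matters here.
struct ToyGraph {
    std::size_t n_tokens;
};

// Hypothetical stand-in for the backend scheduler.
struct ToySched {
    std::size_t planned_tokens = 0; // batch size the buffers are currently laid out for
    int         n_replans      = 0; // times compute() had to re-plan the buffers

    // Reserving lays the buffers out for exactly the measured graph.
    void reserve(const ToyGraph & g) { planned_tokens = g.n_tokens; }

    // Computing a graph larger than the current layout forces a re-plan.
    void compute(const ToyGraph & g) {
        if (g.n_tokens > planned_tokens) {
            planned_tokens = g.n_tokens;
            ++n_replans;
        }
    }
};

// Assumption: the worst-case pp batch is capped by both the context size
// and the micro-batch size.
inline std::size_t worst_case_tokens(std::size_t n_ctx, std::size_t n_ubatch) {
    assert(n_ctx > 0 && n_ubatch > 0);
    return std::min(n_ctx, n_ubatch);
}

// Runs the reservation sequence and returns how many re-plans the first
// full prompt batch triggers at inference time.
inline int replans_at_first_prompt(std::size_t n_ctx, std::size_t n_ubatch, bool rereserve_pp) {
    const ToyGraph pp{worst_case_tokens(n_ctx, n_ubatch)};
    const ToyGraph tg{1};

    ToySched sched;
    sched.reserve(pp);      // 1. worst-case pp graph
    sched.reserve(tg);      // 2. tg graph (measured for splits/nodes)
    if (rereserve_pp) {
        sched.reserve(pp);  // 3. pp again, so the layout fits a full batch
    }
    sched.compute(pp);      // first real prompt at inference time
    return sched.n_replans;
}
```

In this toy, skipping step 3 leaves the layout sized for the single-token graph, so the first full prompt batch triggers a re-plan; with step 3 it does not.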

Regarding the KV cache, I think this is because, if a k-shift or a defragmentation is needed, the scheduler is reset and a new graph is built (llama_build_graph_k_shift or llama_build_graph_defrag); the scheduler then allocates the new graph, which is computed. After this, the computation graph for pp is built and reserved again to reset the scheduler to the state it was in before the k-shift/defrag.
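
The flow I'm describing can be sketched roughly as follows. This is a toy model only: `MaintSched` and `kv_cache_update` are hypothetical names invented for this example, the real code works on ggml graphs and buffers rather than a token count, and the `need_reserve` flag is my assumption about how the two cases share the final re-reservation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the backend scheduler; it records what happens.
struct MaintSched {
    std::size_t              planned_tokens = 0; // batch size the layout fits
    std::vector<std::string> log;                // sequence of operations

    void reset() {
        planned_tokens = 0;
        log.push_back("reset");
    }

    // Allocating the small maintenance graph replaces the previous layout.
    void alloc_and_compute(const std::string & graph_name) {
        planned_tokens = 1; // toy value: no longer sized for a full pp batch
        log.push_back("compute:" + graph_name);
    }

    void reserve(std::size_t n_tokens) {
        assert(n_tokens > 0);
        planned_tokens = n_tokens;
        log.push_back("reserve:pp");
    }
};

// Toy version of the KV cache update: run the maintenance graphs that are
// needed, then restore the worst-case pp reservation once at the end.
inline void kv_cache_update(MaintSched & sched, bool need_k_shift, bool need_defrag,
                            std::size_t pp_tokens) {
    bool need_reserve = false;

    if (need_k_shift) {
        sched.reset();
        sched.alloc_and_compute("k_shift");
        need_reserve = true;
    }
    if (need_defrag) {
        sched.reset();
        sched.alloc_and_compute("defrag");
        need_reserve = true;
    }
    if (need_reserve) {
        sched.reserve(pp_tokens); // back to the state before the k-shift/defrag
    }
}
```

When neither operation is needed, the scheduler is left untouched and no re-reservation happens.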
