-
I am using the OpenAI v1/chat/completions API in Python, sending messages with system and user roles.
The system prompt is very long (~40k tokens) and fixed, while the user input varies. I want to cache the system prompt, because recomputing its KV-cache values again and again takes a lot of time.
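For reference, a minimal sketch of the setup described above, assuming an OpenAI Python client (v1.x) pointed at a locally running llama.cpp server; the base URL, port, model name, and prompt file are placeholders:

```python
from openai import OpenAI

# Assumes a llama.cpp server (or any OpenAI-compatible server) running locally on port 8080.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

SYSTEM_PROMPT = open("system_prompt.txt").read()  # the fixed ~40k-token instructions

def ask(user_input: str) -> str:
    response = client.chat.completions.create(
        model="local-model",  # placeholder; a local server typically ignores or overrides this
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
    )
    return response.choices[0].message.content
```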
-
Are you setting `cache_prompt` in the request? Example: llama.cpp/examples/server/chat.sh, line 51 at 3071c0a
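If you are going through the OpenAI-compatible /v1/chat/completions route rather than the raw completion endpoint that chat.sh uses, the same flag can be forwarded with the client's `extra_body` argument. A sketch, under the assumption that the server reads `cache_prompt` from the request body on this route as well:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[
        {"role": "system", "content": open("system_prompt.txt").read()},
        {"role": "user", "content": "What changed in the latest release?"},
    ],
    # extra_body forwards fields the OpenAI spec does not define straight to the
    # server; the assumption here is that the server picks up cache_prompt from the body.
    extra_body={"cache_prompt": True},
)
print(response.choices[0].message.content)
```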
-
I have a similar use case. As the OP mentioned, I am interested in caching only the static part of my prompt template (nearly 4k tokens), which could also be viewed as a system prompt (I am using Gemma 2, which doesn't support a system role). However, I am not interested in caching the other prompts generated through user interaction. Some user queries may be similar, but I don't want to store a cache entry for every one of them, because that could cause storage issues. I only want to cache one large prefix of the prompts, and that cache needs to be shared across different users/slots. So far I have found these caching-related parameters in the server docs file. These parameters can be set when spinning up the server:
Then this parameter can be set during generation / chat completion:
But after all this, I have no clue how I should proceed.
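For concreteness, here is my best guess at what a request would look like, assuming (as the answer at the bottom of this thread suggests) that reuse is based on the longest matching prefix of what is already in a slot's KV cache. The field names (`cache_prompt`, `id_slot`) and the `/completion` endpoint are how I read the server README, so please correct me if this is the wrong approach:

```python
import requests

STATIC_PREFIX = open("static_template.txt").read()  # the ~4k-token fixed part of my template

def generate(user_text: str, slot_id: int = 0) -> str:
    payload = {
        # Keep the static block byte-for-byte identical and at the very start,
        # so the common prefix with the cached tokens stays as large as possible.
        # (Gemma 2 turn markers omitted here for brevity.)
        "prompt": STATIC_PREFIX + "\n" + user_text,
        "n_predict": 256,
        "cache_prompt": True,  # ask the server to reuse the matching cached prefix
        "id_slot": slot_id,    # pin to one slot so other traffic doesn't overwrite it
    }
    r = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["content"]
```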
-
How does this cache work when there are several prompts? For example, I have a unique prompt for each user, within which the user conducts a dialogue with the model. How can I make it so that one user's request does not evict another user's cache?

In more detail: we have an NPC who talks to a player, and another NPC nearby who talks to a different player; there can be an unlimited number of such NPCs. Each NPC has a description of its character and backstory in the system message. When a message comes from one user, the system prompt is cached, and if that user sends another message the speed-up is noticeable. But if a message comes from another user, the cache is reset and the first user again waits longer for a response.

I also don't understand how this cache works with slots, because if I enable parallelism (command params "-c 64000 --parallel 24", which works out to roughly 2.7K of context per slot; the model supports 128K), it feels like the cache stops working at all. This happens even if I explicitly specify the slot number in the request and send messages from only one user.
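For reference, this is roughly how I am sending requests and pinning slots (field names are my reading of the server README, so they may be wrong):

```python
import requests

# One dedicated slot per NPC, so that another NPC's request should not
# overwrite this conversation's KV cache (assuming --parallel gives enough slots).
def npc_reply(npc_system_prompt: str, history: list[dict], slot_id: int) -> str:
    payload = {
        "model": "local-model",  # placeholder
        "messages": [{"role": "system", "content": npc_system_prompt}] + history,
        "cache_prompt": True,  # reuse the cached prefix (system prompt + earlier turns)
        "id_slot": slot_id,    # pin this NPC's dialogue to one slot
    }
    r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```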
-
OK, I'm confused. What is the meaning of -p, which the help describes as "a prompt to start with"? There doesn't seem to be more documentation than this, at least none that I can find.
Yes, it will cache both "system + user". But the logic for reusing the cached prompt looks for the longest common prefix between the cached data and the new input, so even if the user prompt changes, it will still reuse the system prompt.
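To illustrate the prefix-reuse behaviour described above, a rough sketch: the second call keeps the system message byte-for-byte identical and only changes the user message, so it should start generating noticeably faster than the first. The base URL, model name, and the assumption that `cache_prompt` can be passed via `extra_body` are placeholders to adapt:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
SYSTEM_PROMPT = open("system_prompt.txt").read()  # long, fixed

def timed_ask(user_input: str) -> float:
    t0 = time.time()
    client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical prefix on every call
            {"role": "user", "content": user_input},       # only this part changes
        ],
        max_tokens=32,
        extra_body={"cache_prompt": True},  # assumption: server accepts this extra field
    )
    return time.time() - t0

print("cold:", timed_ask("Summarize the rules."))  # pays the full prompt-processing cost
print("warm:", timed_ask("List the exceptions."))  # should reuse the system prompt's KV cache
```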