-
I am using the OpenAI v1/chat/completions API in Python, sending messages with system and user roles.
The system prompt is very long (~40k tokens) and fixed, while the user input varies. I want to cache the system prompt, because recomputing its KV-cache values again and again takes a lot of time.
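For reference, a minimal sketch of the setup described above, assuming an OpenAI Python client (v1.x) pointed at a locally running llama.cpp server; the base URL, port, model name, and prompt file are placeholders:

```python
from openai import OpenAI

# Assumes a llama.cpp server (or any OpenAI-compatible server) running locally on port 8080.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

SYSTEM_PROMPT = open("system_prompt.txt").read()  # the fixed ~40k-token instructions

def ask(user_input: str) -> str:
    response = client.chat.completions.create(
        model="local-model",  # placeholder; a local server typically ignores or overrides this
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
    )
    return response.choices[0].message.content
```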
-
Are you setting `cache_prompt` in the request? Example: llama.cpp/examples/server/chat.sh, line 51 at 3071c0a
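If you are going through the OpenAI-compatible /v1/chat/completions route rather than the raw completion endpoint that chat.sh uses, the same flag can be forwarded with the client's `extra_body` argument. A sketch, under the assumption that the server reads `cache_prompt` from the request body on this route as well:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder
    messages=[
        {"role": "system", "content": open("system_prompt.txt").read()},
        {"role": "user", "content": "What changed in the latest release?"},
    ],
    # extra_body forwards fields the OpenAI spec does not define straight to the
    # server; the assumption here is that the server picks up cache_prompt from the body.
    extra_body={"cache_prompt": True},
)
print(response.choices[0].message.content)
```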
-
I have a similar use case. As the OP mentioned, I am interested in caching only the static part of my prompt template (nearly 4k tokens), which could also be viewed as a system prompt (I am using Gemma 2, which doesn't support a system role). However, I am not interested in caching the other prompts generated through user interaction. Some user queries may be similar, but I don't want to store a cache entry for every one of them, because that could cause storage issues. I only want to cache one large prefix of the prompts, and that cache needs to be shared across different users/slots. So far I have found these caching-related parameters in the server docs file. These parameters can be set when spinning up the server:
Then this parameter can be set during generation / chat completion:
But after all this, I have no clue how I should proceed.
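For concreteness, here is my best guess at what a request would look like, assuming (as the answer at the bottom of this thread suggests) that reuse is based on the longest matching prefix of what is already in a slot's KV cache. The field names (`cache_prompt`, `id_slot`) and the `/completion` endpoint are how I read the server README, so please correct me if this is the wrong approach:

```python
import requests

STATIC_PREFIX = open("static_template.txt").read()  # the ~4k-token fixed part of my template

def generate(user_text: str, slot_id: int = 0) -> str:
    payload = {
        # Keep the static block byte-for-byte identical and at the very start,
        # so the common prefix with the cached tokens stays as large as possible.
        # (Gemma 2 turn markers omitted here for brevity.)
        "prompt": STATIC_PREFIX + "\n" + user_text,
        "n_predict": 256,
        "cache_prompt": True,  # ask the server to reuse the matching cached prefix
        "id_slot": slot_id,    # pin to one slot so other traffic doesn't overwrite it
    }
    r = requests.post("http://localhost:8080/completion", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["content"]
```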
-
How does this cache work when there are several prompts? For example, I have a unique prompt for each user, within which the user conducts a dialogue with the model. How can I make it so that one user's request does not evict another user's cache?

In more detail: we have an NPC who talks to a player, and another NPC nearby who talks to a different player; there can be an unlimited number of such NPCs. Each NPC has a description of its character and backstory in the system message. When a message comes from one user, the system prompt is cached, and if that user sends another message the speed-up is noticeable. But if a message comes from another user, the cache is reset and the first user again waits longer for a response.

I also don't understand how this cache works with slots, because if I enable parallelism (command params "-c 64000 --parallel 24", which works out to roughly 2.7K of context per slot; the model supports 128K), it feels like the cache stops working at all. This happens even if I explicitly specify the slot number in the request and send messages from only one user.
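For reference, this is roughly how I am sending requests and pinning slots (field names are my reading of the server README, so they may be wrong):

```python
import requests

# One dedicated slot per NPC, so that another NPC's request should not
# overwrite this conversation's KV cache (assuming --parallel gives enough slots).
def npc_reply(npc_system_prompt: str, history: list[dict], slot_id: int) -> str:
    payload = {
        "model": "local-model",  # placeholder
        "messages": [{"role": "system", "content": npc_system_prompt}] + history,
        "cache_prompt": True,  # reuse the cached prefix (system prompt + earlier turns)
        "id_slot": slot_id,    # pin this NPC's dialogue to one slot
    }
    r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```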
-
OK, I'm confused. What is the meaning of -p, which the help describes as "a prompt to start with"? There doesn't seem to be more documentation than this, at least none that I can find.
Yes, it will cache both "system + user". But the logic for reusing the cached prompt looks for the longest common prefix between the cached data and the new input, so even if the user prompt changes, it will still reuse the system prompt.
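To illustrate the prefix-reuse behaviour described above, a rough sketch: the second call keeps the system message byte-for-byte identical and only changes the user message, so it should start generating noticeably faster than the first. The base URL, model name, and the assumption that `cache_prompt` can be passed via `extra_body` are placeholders to adapt:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
SYSTEM_PROMPT = open("system_prompt.txt").read()  # long, fixed

def timed_ask(user_input: str) -> float:
    t0 = time.time()
    client.chat.completions.create(
        model="local-model",  # placeholder
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical prefix on every call
            {"role": "user", "content": user_input},       # only this part changes
        ],
        max_tokens=32,
        extra_body={"cache_prompt": True},  # assumption: server accepts this extra field
    )
    return time.time() - t0

print("cold:", timed_ask("Summarize the rules."))  # pays the full prompt-processing cost
print("warm:", timed_ask("List the exceptions."))  # should reuse the system prompt's KV cache
```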