server : add support for multiple responses #11142

Open
ggerganov opened this issue Jan 8, 2025 · 2 comments
@ggerganov
Owner

It would be very useful to add multi-response support per slot so that a single request would be able to generate n independent completions. This functionality is useful in different situations - for example, a FIM completion can provide multiple alternative suggestions at a smaller or equal compute cost compared to running them sequentially.

I think this can be implemented by adding multiple sequence ids per slot (instead of just one as we have currently). However, I am not yet sure how much complexity this would introduce.
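
For illustration, a minimal sketch of what a slot owning multiple sequence ids could look like (all struct and field names below are hypothetical, not the current server_slot):

```cpp
// Hypothetical sketch, not the actual server code.
#include <cstdint>
#include <vector>

using llama_token  = int32_t;   // matches the typedef in llama.h
using llama_seq_id = int32_t;   // matches the typedef in llama.h

struct server_slot_sketch {
    int id = -1;                                      // slot index

    // today a slot owns a single sequence id; the idea is to own n of them,
    // one per requested completion, all sharing the same prompt
    std::vector<llama_seq_id> seq_ids;

    // the prompt would be decoded once and attached to every sequence id;
    // each generation step then samples one token per sequence
    std::vector<llama_token>              prompt_tokens;
    std::vector<std::vector<llama_token>> generated;   // generated[i] belongs to seq_ids[i]
};
```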

@ggerganov ggerganov moved this to Todo in ggml : roadmap Jan 8, 2025
@ggerganov ggerganov changed the title from "llama-server : add support for multiple responses" to "server : add support for multiple responses" Jan 8, 2025
@ngxson
Collaborator

ngxson commented Jan 8, 2025

I think having multiple sequence ids per slot will make it far more difficult to keep track of the KV cache. It will also force all sequence ids to share the slot's n_ctx, reducing the maximum length of the generated text.

Instead, I'd suggest adding the notion of a pinned slot. A pinned slot is a slot that can only be freed once all tasks that depend on it have finished.

Upon receiving a request for N completions, we create N+1 tasks, but only one of them contains the prompt_tokens:

With N = 3

task 0: prompt_tokens = [......], required_by = [1, 2, 3]
task 1: prompt_tokens = []
task 2: prompt_tokens = []
task 3: prompt_tokens = []

Because server_queue.post(tasks) is already atomic, these tasks will always be consecutive in the queue.
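
A rough sketch of the task construction under this scheme; server_task_sketch and make_tasks are hypothetical names rather than the real server_task, and only prompt_tokens, required_by and server_queue.post come from the description above:

```cpp
// Hypothetical sketch of building 1 prompt task + N generation tasks.
#include <cstdint>
#include <utility>
#include <vector>

using llama_token = int32_t;

struct server_task_sketch {
    int id = -1;
    std::vector<llama_token> prompt_tokens;  // only the first task carries the prompt
    std::vector<int>         required_by;    // ids of the tasks that reuse this prompt
};

static std::vector<server_task_sketch> make_tasks(int first_id, std::vector<llama_token> prompt, int n) {
    std::vector<server_task_sketch> tasks(n + 1);
    tasks[0].id            = first_id;
    tasks[0].prompt_tokens = std::move(prompt);
    for (int i = 1; i <= n; ++i) {
        tasks[i].id = first_id + i;
        tasks[0].required_by.push_back(tasks[i].id);
    }
    // the caller then posts all of them with a single server_queue.post(tasks),
    // which is atomic, so they stay consecutive in the queue
    return tasks;
}
```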

Whichever slot takes task 0:

  • It will be marked as pinned
  • Once it finishes processing the prompt, its state becomes SLOT_STATE_DONE_PROMPT
  • The slot is then ready to take any task in the required_by list, changing its state to SLOT_STATE_GENERATING
  • When it is done generating, check the required_by list:
    • If the list is empty, release()
    • Otherwise, call llama_kv_cache_seq_rm to remove all generated tokens, then accept another task from required_by

For each slot that takes task 1, 2, or 3:

  • Find the pinned slot that it depends on (by checking slot[i].required_by)
  • If the pinned slot is in SLOT_STATE_DONE_PROMPT or SLOT_STATE_GENERATING, call llama_kv_cache_seq_cp and then start generating text
  • When it is done generating, it removes itself from pinned_slot.required_by, then calls release()
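
To make the two code paths above concrete, a condensed sketch of the dispatch logic; all slot fields and helper functions are assumed names, while SLOT_STATE_DONE_PROMPT, SLOT_STATE_GENERATING, llama_kv_cache_seq_cp and llama_kv_cache_seq_rm are the states and calls mentioned above (kept here as comments):

```cpp
// Hypothetical sketch of the pinned-slot scheme, not the actual server code.
#include <algorithm>
#include <vector>

enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING,
};

struct slot_sketch {
    int              id          = -1;
    slot_state       state       = SLOT_STATE_IDLE;
    bool             pinned      = false;
    std::vector<int> required_by;       // task ids still depending on this slot's prompt
    int              depends_on  = -1;  // id of the pinned slot (for the task 1..N slots)
};

// pinned slot (the one that took task 0) finished generating
void on_pinned_slot_done(slot_sketch & slot) {
    if (slot.required_by.empty()) {
        // release(slot);  -- nothing depends on the cached prompt anymore
        slot.pinned = false;
        slot.state  = SLOT_STATE_IDLE;
        return;
    }
    // llama_kv_cache_seq_rm(ctx, slot.id, n_prompt, -1);  -- drop generated tokens, keep the prompt
    slot.state = SLOT_STATE_DONE_PROMPT;  // ready to accept another task from required_by
}

// a dependent slot picks up one of tasks 1..N
bool on_dependent_slot_start(slot_sketch & slot, slot_sketch & pinned) {
    if (pinned.state != SLOT_STATE_DONE_PROMPT && pinned.state != SLOT_STATE_GENERATING) {
        return false;  // the prompt is not in the KV cache yet, try again later
    }
    // llama_kv_cache_seq_cp(ctx, pinned.id, slot.id, 0, n_prompt);  -- reuse the cached prompt
    slot.depends_on = pinned.id;
    slot.state      = SLOT_STATE_GENERATING;
    return true;
}

// a dependent slot finished generating
void on_dependent_slot_done(slot_sketch & slot, slot_sketch & pinned, int task_id) {
    auto & deps = pinned.required_by;
    deps.erase(std::remove(deps.begin(), deps.end(), task_id), deps.end());
    // release(slot);
    slot.state = SLOT_STATE_IDLE;
}
```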

@ggerganov
Owner Author

Nice! This seems like a reasonable way to do it.
