server : add support for multiple responses #11142

Open
ggerganov opened this issue Jan 8, 2025 · 2 comments
@ggerganov
Owner

It would be very useful to add multi-response support per slot so that a single request would be able to generate n independent completions. This functionality is useful in different situations - for example, a FIM completion can provide multiple alternative suggestions at a smaller or equal compute cost compared to running them sequentially.

I think this can be implemented by adding multiple sequence ids per slot (instead of just one as we have currently). However, I am not yet sure how much complexity this would introduce.
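
For illustration, a minimal sketch of what a slot owning multiple sequence ids could look like (all struct and field names below are hypothetical, not the current server_slot):

```cpp
// Hypothetical sketch, not the actual server code.
#include <cstdint>
#include <vector>

using llama_token  = int32_t;   // matches the typedef in llama.h
using llama_seq_id = int32_t;   // matches the typedef in llama.h

struct server_slot_sketch {
    int id = -1;                                      // slot index

    // today a slot owns a single sequence id; the idea is to own n of them,
    // one per requested completion, all sharing the same prompt
    std::vector<llama_seq_id> seq_ids;

    // the prompt would be decoded once and attached to every sequence id;
    // each generation step then samples one token per sequence
    std::vector<llama_token>              prompt_tokens;
    std::vector<std::vector<llama_token>> generated;   // generated[i] belongs to seq_ids[i]
};
```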

@ggerganov ggerganov moved this to Todo in ggml : roadmap Jan 8, 2025
@ggerganov ggerganov changed the title from "llama-server : add support for multiple responses" to "server : add support for multiple responses" Jan 8, 2025
@ngxson
Collaborator

ngxson commented Jan 8, 2025

I think having multiple sequence ids per slot will make it far more difficult to keep track of the KV cache. It will also force all sequence ids to share the slot's n_ctx, reducing the maximum length of the generated text.

Instead, I'd suggest adding the notion of a pinned slot. A pinned slot is a slot that can only be freed once all tasks that depend on it have finished.

Upon receiving a request for N completions, we create N+1 tasks, but only one of them contains the prompt_tokens:

With N = 3

task 0: prompt_tokens = [......], required_by = [1, 2, 3]
task 1: prompt_tokens = []
task 2: prompt_tokens = []
task 3: prompt_tokens = []

Because server_queue.post(tasks) is already atomic, these tasks will always be consecutive in the queue.
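
A rough sketch of the task construction under this scheme; server_task_sketch and make_tasks are hypothetical names rather than the real server_task, and only prompt_tokens, required_by and server_queue.post come from the description above:

```cpp
// Hypothetical sketch of building 1 prompt task + N generation tasks.
#include <cstdint>
#include <utility>
#include <vector>

using llama_token = int32_t;

struct server_task_sketch {
    int id = -1;
    std::vector<llama_token> prompt_tokens;  // only the first task carries the prompt
    std::vector<int>         required_by;    // ids of the tasks that reuse this prompt
};

static std::vector<server_task_sketch> make_tasks(int first_id, std::vector<llama_token> prompt, int n) {
    std::vector<server_task_sketch> tasks(n + 1);
    tasks[0].id            = first_id;
    tasks[0].prompt_tokens = std::move(prompt);
    for (int i = 1; i <= n; ++i) {
        tasks[i].id = first_id + i;
        tasks[0].required_by.push_back(tasks[i].id);
    }
    // the caller then posts all of them with a single server_queue.post(tasks),
    // which is atomic, so they stay consecutive in the queue
    return tasks;
}
```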

Whichever slot takes task 0:

  • It will be marked as pinned
  • Once it finishes processing the prompt, its state becomes SLOT_STATE_DONE_PROMPT
  • The slot is then ready to take any task in the required_by list, changing its state to SLOT_STATE_GENERATING
  • When it is done generating, check the required_by list:
    • If the list is empty, release()
    • Otherwise, call llama_kv_cache_seq_rm to remove all generated tokens, then accept another task from required_by

For each slot that takes task 1, 2, or 3:

  • Find the pinned slot that it depends on (by checking slot[i].required_by)
  • If the pinned slot is in SLOT_STATE_DONE_PROMPT or SLOT_STATE_GENERATING, call llama_kv_cache_seq_cp and then start generating text
  • When it is done generating, it removes itself from pinned_slot.required_by, then calls release()
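
To make the two code paths above concrete, a condensed sketch of the dispatch logic; all slot fields and helper functions are assumed names, while SLOT_STATE_DONE_PROMPT, SLOT_STATE_GENERATING, llama_kv_cache_seq_cp and llama_kv_cache_seq_rm are the states and calls mentioned above (kept here as comments):

```cpp
// Hypothetical sketch of the pinned-slot scheme, not the actual server code.
#include <algorithm>
#include <vector>

enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_DONE_PROMPT,
    SLOT_STATE_GENERATING,
};

struct slot_sketch {
    int              id          = -1;
    slot_state       state       = SLOT_STATE_IDLE;
    bool             pinned      = false;
    std::vector<int> required_by;       // task ids still depending on this slot's prompt
    int              depends_on  = -1;  // id of the pinned slot (for the task 1..N slots)
};

// pinned slot (the one that took task 0) finished generating
void on_pinned_slot_done(slot_sketch & slot) {
    if (slot.required_by.empty()) {
        // release(slot);  -- nothing depends on the cached prompt anymore
        slot.pinned = false;
        slot.state  = SLOT_STATE_IDLE;
        return;
    }
    // llama_kv_cache_seq_rm(ctx, slot.id, n_prompt, -1);  -- drop generated tokens, keep the prompt
    slot.state = SLOT_STATE_DONE_PROMPT;  // ready to accept another task from required_by
}

// a dependent slot picks up one of tasks 1..N
bool on_dependent_slot_start(slot_sketch & slot, slot_sketch & pinned) {
    if (pinned.state != SLOT_STATE_DONE_PROMPT && pinned.state != SLOT_STATE_GENERATING) {
        return false;  // the prompt is not in the KV cache yet, try again later
    }
    // llama_kv_cache_seq_cp(ctx, pinned.id, slot.id, 0, n_prompt);  -- reuse the cached prompt
    slot.depends_on = pinned.id;
    slot.state      = SLOT_STATE_GENERATING;
    return true;
}

// a dependent slot finished generating
void on_dependent_slot_done(slot_sketch & slot, slot_sketch & pinned, int task_id) {
    auto & deps = pinned.required_by;
    deps.erase(std::remove(deps.begin(), deps.end(), task_id), deps.end());
    // release(slot);
    slot.state = SLOT_STATE_IDLE;
}
```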

@ggerganov
Owner Author

Nice! This seems like a reasonable way to do it.
