It would be very useful to add multi-response support per slot so that a single request would be able to generate n independent completions. This functionality is useful in different situations - for example, a FIM completion can provide multiple alternative suggestions at a smaller or equal compute cost compared to running them sequentially.
I think this can be implemented by adding multiple sequence ids per slot (instead of having just one like we currently do). However, I am not sure yet how much complexity would be introduced to support this.
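At the batch level, the prompt-sharing part is roughly what `examples/batched` already does: evaluate the prompt once on one sequence and then copy its KV cache to the other sequences. A minimal sketch of that pattern (error handling and the sampling loop omitted; the KV-cache helper names may differ between versions):

```cpp
// Sketch: evaluate the prompt once on sequence 0, then share its KV cache
// with n_completions sequences so they can be sampled in parallel.
// This mirrors examples/batched; it is not the proposed server change itself.
#include "llama.h"

#include <vector>

static void prepare_n_completions(
        llama_context * ctx,
        const std::vector<llama_token> & prompt_tokens,
        int n_completions) {
    // the context must have been created with n_seq_max >= n_completions
    llama_batch batch = llama_batch_init((int32_t) prompt_tokens.size(), 0, 1);

    for (size_t i = 0; i < prompt_tokens.size(); ++i) {
        batch.token   [batch.n_tokens]    = prompt_tokens[i];
        batch.pos     [batch.n_tokens]    = (llama_pos) i;
        batch.n_seq_id[batch.n_tokens]    = 1;
        batch.seq_id  [batch.n_tokens][0] = 0;
        batch.logits  [batch.n_tokens]    = (i == prompt_tokens.size() - 1);
        batch.n_tokens++;
    }

    llama_decode(ctx, batch); // single prompt evaluation

    // copy the prompt's KV cache to the other sequences instead of
    // re-evaluating it for each completion
    for (llama_seq_id s = 1; s < n_completions; ++s) {
        llama_kv_cache_seq_cp(ctx, 0, s, -1, -1);
    }

    llama_batch_free(batch);

    // sampling loop (not shown): each step batches one new token per sequence
    // and calls llama_decode() once for all n_completions streams
}
```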
I think having multiple sequence ids per slot will make it far more difficult to keep track of the KV cache. It will also force all sequence ids to share the same n_ctx of the slot, reducing the maximum length of the generated text.
Instead, I'd suggest adding the notion of a pinned slot. A pinned slot is a slot that can only be freed once all tasks that depend on it have finished.
Upon receiving a request for N completions, we create N+1 tasks, but only one of them contains the prompt_tokens:
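A sketch of the split (the struct and field names other than prompt_tokens are placeholders, not the actual server.cpp types):

```cpp
// Placeholder types to illustrate the proposed N+1 task split -- not actual server.cpp code.
#include "llama.h"

#include <utility>
#include <vector>

struct pending_task {
    int                      id        = -1;
    int                      parent_id = -1;  // the pinned task this one depends on
    std::vector<llama_token> prompt_tokens;   // only the parent task carries these
};

// Task 0 carries the prompt and occupies (pins) the slot.
// Tasks 1..N each generate one completion against the pinned slot's prompt.
// The slot is freed only after all N+1 tasks have finished.
static std::vector<pending_task> split_request(int n, std::vector<llama_token> prompt) {
    std::vector<pending_task> tasks(n + 1);
    tasks[0].id            = 0;
    tasks[0].prompt_tokens = std::move(prompt);
    for (int i = 1; i <= n; ++i) {
        tasks[i].id        = i;
        tasks[i].parent_id = 0;
    }
    return tasks;
}
```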