red2awn | 2 months ago
Correct, it breaks the single prompt, single completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for a low-latency response you have to do streaming KV cache prefill with a websocket server.
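A minimal sketch of the streaming-prefill idea the comment describes, not any particular framework's API (the class and method names here are hypothetical): as the user's tokens arrive over the connection, they are prefilled into the KV cache immediately, so that when the turn ends the whole prompt is already cached and decoding can start right away.

```python
# Hypothetical sketch of streaming KV-cache prefill. The kv_cache list
# stands in for the per-layer key/value tensors a real model maintains.

class StreamingPrefill:
    def __init__(self):
        self.kv_cache = []  # placeholder for real per-layer K/V tensors

    def feed(self, tokens):
        # Run the prefill forward pass for just the newly arrived tokens,
        # appending their keys/values to the existing cache.
        for t in tokens:
            self.kv_cache.append(("kv", t))  # placeholder for real K/V

    def finish_turn(self):
        # By end of turn the cache already covers the whole prompt, so
        # producing the first response token is a single decode step.
        return len(self.kv_cache)

session = StreamingPrefill()
# Chunks arriving incrementally, e.g. over a websocket connection:
for chunk in (["Hello", ","], ["how"], ["are", "you", "?"]):
    session.feed(chunk)            # prefill as each chunk arrives
assert session.finish_turn() == 6  # whole prompt cached before decoding
```

The point of the sketch is only the ordering: prefill work overlaps with the user's input instead of happening in one batch after the prompt is complete.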
whimsicalism | 2 months ago
I imagine you have to start decoding many speculative completions in parallel to have true low latency.
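A toy illustration of that parallel-speculation idea, with entirely hypothetical names (`decode`, `speculative_decode`): start a decode from several guesses about where the user's turn will end, then keep whichever branch matches the prompt that actually arrived and discard the rest.

```python
# Toy sketch: speculatively decode from several candidate turn
# boundaries; if one guess matches the real prompt, its reply is
# already available with no extra decode latency.

def decode(prompt):
    # Stand-in for a real model call; deterministic on the prompt.
    return prompt + " -> reply"

def speculative_decode(candidate_prompts, final_prompt):
    # Kick off a decode per candidate boundary (sequential here for
    # simplicity; a real server would run these in parallel).
    branches = {p: decode(p) for p in candidate_prompts}
    # Keep the branch whose guessed prompt matches the real one;
    # otherwise fall back to a fresh decode.
    return branches.get(final_prompt) or decode(final_prompt)

candidates = ["how are", "how are you", "how are you?"]
result = speculative_decode(candidates, "how are you?")
assert result == "how are you? -> reply"
```

The trade-off is wasted compute on the discarded branches in exchange for hiding the decode latency of the winning one.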