We actually deployed working speech-to-speech inference that builds on vLLM as the backbone. The main piece of work was supporting the "Talker" module, which is currently not supported on the qwen3-omni branch of vLLM.
Correct, it breaks the single-prompt, single-completion assumption baked into these frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV-cache prefill behind a websocket server.
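To make the idea concrete, here's a minimal toy sketch of streaming prefill: instead of waiting for the full prompt before running prefill, each chunk is folded into the KV cache as it arrives, so decode can start the moment the last chunk lands. All names here (`StreamingPrefill`, `prefill_chunk`, `handle_stream`) are hypothetical stand-ins, not vLLM's actual API, and the websocket handler is simulated with a plain async loop.

```python
import asyncio

class StreamingPrefill:
    """Toy model of streaming KV-cache prefill (hypothetical names,
    not the vLLM API): each incoming prompt chunk is folded into the
    cache as it arrives, so decode starts immediately after the
    final chunk instead of after a full-prompt prefill."""

    def __init__(self):
        self.kv_cache = []  # one entry per prefilled token

    def prefill_chunk(self, tokens):
        # In a real engine this would run the forward pass for these
        # tokens and store their key/value tensors.
        self.kv_cache.extend(tokens)

    def decode(self):
        # By the time decode runs, every prior chunk is already cached.
        return len(self.kv_cache)

async def handle_stream(chunks):
    """Stands in for a websocket handler: consume chunks as they
    arrive instead of buffering the whole prompt."""
    engine = StreamingPrefill()
    for chunk in chunks:
        await asyncio.sleep(0)       # yield to the loop, as a recv() would
        engine.prefill_chunk(chunk)  # overlap prefill with chunk arrival
    return engine.decode()

cached = asyncio.run(handle_stream([[1, 2], [3], [4, 5, 6]]))
print(cached)  # 6 tokens already prefilled before decode begins
```

The latency win is that prefill compute overlaps with network/audio arrival, so time-to-first-token after the user stops speaking is roughly one chunk's prefill rather than the whole prompt's.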
AndreSlavescu|2 months ago
Check it out here: https://models.hathora.dev/model/qwen3-omni