red2awn|2 months ago

None of the inference frameworks (vLLM/SGLang) supports the full model, let alone on non-NVIDIA hardware.

AndreSlavescu|2 months ago

We actually deployed working speech-to-speech inference that builds on top of vLLM as the backbone. The main work was supporting the "Talker" module, which is currently not supported on vLLM's qwen3-omni branch.

Check it out here: https://models.hathora.dev/model/qwen3-omni

sosodev|2 months ago

Is your work open source?

red2awn|2 months ago

Nice work. Are you working on streaming input/output?

sosodev|2 months ago

That's unfortunate but not too surprising. This type of model is very new to the local hosting space.

whimsicalism|2 months ago

Makes sense, I think streaming audio->audio inference is a relatively big lift.

red2awn|2 months ago

Correct, it breaks the single-prompt, single-completion assumption baked into the frameworks. Conceptually it's still prompt/completion, but for low-latency responses you have to do streaming KV-cache prefill behind a websocket server: the prompt's KV cache is built incrementally while the user is still speaking, so decode can start the moment they stop.
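
To make that concrete, here's a minimal sketch of the websocket loop in Python (using the websockets library). The StreamingEngine interface is hypothetical, since vLLM doesn't expose incremental prefill publicly; the point is just the shape of the loop, where each incoming audio chunk extends the KV cache immediately instead of waiting for the full prompt.

    # Sketch only: StreamingEngine is a hypothetical stand-in, not a real
    # vLLM API. It models an engine that supports incremental KV prefill.
    import asyncio
    import websockets

    class StreamingEngine:
        """Hypothetical engine with per-connection KV-cache state."""
        def __init__(self):
            self.kv_cache = []  # placeholder for the session's KV cache

        async def prefill_chunk(self, audio_chunk: bytes) -> None:
            # A real engine would encode the audio frame and extend the
            # KV cache here, overlapping prefill with the user's speech.
            self.kv_cache.append(audio_chunk)

        async def generate(self):
            # Placeholder decode loop: yield response audio as produced.
            # (Echoes the input; a real engine would run the decoder.)
            for chunk in self.kv_cache:
                yield chunk

    async def handler(websocket):
        engine = StreamingEngine()  # one KV-cache session per connection
        async for message in websocket:
            if message == b"<eos>":  # client marks end of utterance
                # Prefill already happened chunk-by-chunk, so time to
                # first output token is just the decode latency.
                async for out_chunk in engine.generate():
                    await websocket.send(out_chunk)
            else:
                await engine.prefill_chunk(message)  # binary audio frame

    async def main():
        async with websockets.serve(handler, "localhost", 8765):
            await asyncio.Future()  # run forever

    if __name__ == "__main__":
        asyncio.run(main())

The single prompt/completion frameworks can't express this because their request lifecycle assumes the whole prompt is available up front; here the prompt and the connection are the same long-lived object.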