
Show HN: a Rust-based multimodal inference server

1 point | Beefin | 1 month ago | github.com

We built a production-grade multimodal inference server in Rust for serving vision–language models (image + text → streamed text).

The goal was to explore what a Rust-native control plane looks like for modern multimodal inference: continuous batching, KV-aware admission control, predictable behavior under load, and proper streaming semantics.
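To make "KV-aware admission control" concrete: a minimal sketch (hypothetical names, not the repo's actual code) of the idea that a request is only admitted when the KV cache has enough free blocks for its prompt plus a reserved decode budget, so the scheduler never overcommits memory:

```rust
// Sketch of KV-aware admission: a request is admitted only when the
// KV cache can hold its prompt plus a reserved decode budget right now;
// otherwise it stays queued instead of overcommitting GPU memory.
struct KvPool {
    block_size: usize,  // tokens per KV block
    free_blocks: usize, // blocks currently unallocated
}

impl KvPool {
    // Blocks needed to hold `tokens` tokens, rounded up.
    fn blocks_for(&self, tokens: usize) -> usize {
        (tokens + self.block_size - 1) / self.block_size
    }

    // Admit only if prompt + a decode reservation fits immediately.
    fn try_admit(&mut self, prompt_tokens: usize, decode_reserve: usize) -> bool {
        let needed = self.blocks_for(prompt_tokens + decode_reserve);
        if needed <= self.free_blocks {
            self.free_blocks -= needed;
            true
        } else {
            false // request stays queued; no OOM-by-overcommit
        }
    }
}
```

The point of checking at admission time (rather than on cache miss mid-decode) is that rejection is cheap and happens before any GPU work is done.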

The system exposes an OpenAI-compatible API, supports multi-image inputs, and is designed to degrade gracefully under overload rather than OOM or stall. It’s organized as a single monorepo with a gateway, GPU workers, scheduler, and pluggable engine adapters.
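"Degrade gracefully rather than OOM or stall" mostly comes down to bounding every queue and shedding load early. A minimal sketch of that pattern (illustrative only, not the gateway's real types): a bounded admission queue that rejects new work once full, which the HTTP layer can map to a fast 429/503 instead of letting requests pile up unboundedly:

```rust
use std::collections::VecDeque;

// Sketch of overload shedding: a bounded queue that refuses new work
// once full, so the server fails fast under load instead of
// accumulating unbounded state and eventually stalling or OOMing.
struct BoundedQueue<T> {
    items: VecDeque<T>,
    cap: usize,
}

enum Enqueue {
    Accepted,
    Shed, // caller maps this to e.g. HTTP 429/503
}

impl<T> BoundedQueue<T> {
    fn new(cap: usize) -> Self {
        Self { items: VecDeque::with_capacity(cap), cap }
    }

    fn push(&mut self, item: T) -> Enqueue {
        if self.items.len() >= self.cap {
            Enqueue::Shed
        } else {
            self.items.push_back(item);
            Enqueue::Accepted
        }
    }
}
```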

We’ve also included a benchmark suite focused on real-world scenarios (TTFT, cancellation, overload, fairness) rather than synthetic tokens/sec numbers.
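For readers unfamiliar with why TTFT gets measured separately from throughput: given per-token arrival times for one streamed response, time-to-first-token is the gap to the first token, while decode tokens/sec only describes the steady state afterwards; a single tokens/sec number hides the first. A hypothetical helper (not from the repo) showing the split:

```rust
use std::time::{Duration, Instant};

// Given the request start and each streamed token's arrival time,
// return (TTFT, steady-state decode tokens/sec). TTFT captures queueing
// and prefill latency that an aggregate tokens/sec number averages away.
fn ttft_and_decode_rate(
    start: Instant,
    token_times: &[Instant],
) -> Option<(Duration, f64)> {
    let first = *token_times.first()?;
    let last = *token_times.last()?;
    let ttft = first.duration_since(start);
    let decode_secs = last.duration_since(first).as_secs_f64();
    let rate = if decode_secs > 0.0 {
        (token_times.len() - 1) as f64 / decode_secs
    } else {
        0.0
    };
    Some((ttft, rate))
}
```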

Would love feedback from folks building or operating inference infrastructure.
