qntty|9 months ago
It sounds like you might be confusing different parts of the stack. NVIDIA Dynamo, for example, supports vLLM as the inference engine. Think of something like vLLM as more akin to Gunicorn, and llm-d as an application load balancer. And I guess something like NVIDIA Dynamo would be like Django.
smarterclayton|9 months ago
1. Balancing / scheduling of incoming requests to the right backend
2. Model server replicas that can run on multiple hardware topologies
3. A prefix caching hierarchy with well-tested variants for different use cases
So it's a 3-tier architecture. The biggest difference from Dynamo is that llm-d is using the inference gateway extension - https://github.com/kubernetes-sigs/gateway-api-inference-ext... - which brings Kubernetes-owned APIs for managing model routing, request priority and flow control, LoRA support, etc.
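To make tier 1 concrete, here is a toy sketch of prefix-cache-aware request routing: a scheduler that sends each request to the replica holding the longest matching cached prefix, falling back to the least-loaded replica when no cache hit exists. This is an illustrative assumption about the general technique, not llm-d's actual scheduler; all names (`Replica`, `route`, `record`, the block size) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    load: int = 0  # in-flight requests
    cached_prefixes: set = field(default_factory=set)  # hashes of cached prefix blocks

def prefix_match_len(tokens, replica, block=4):
    """Count how many leading `block`-sized prefixes of `tokens` this replica has cached."""
    n = 0
    for i in range(0, len(tokens) - block + 1, block):
        if hash(tuple(tokens[: i + block])) in replica.cached_prefixes:
            n += 1
        else:
            break
    return n * block

def route(tokens, replicas):
    """Pick the replica with the longest cached prefix; break ties by lowest load."""
    return max(replicas, key=lambda r: (prefix_match_len(tokens, r), -r.load))

def record(tokens, replica, block=4):
    """After serving a request, remember which prefix blocks the replica now caches."""
    for i in range(0, len(tokens) - block + 1, block):
        replica.cached_prefixes.add(hash(tuple(tokens[: i + block])))
    replica.load += 1
```

For example, after `record(list(range(16)), a)`, a new request sharing those first 16 tokens is routed back to replica `a` so its KV cache can be reused, while an unrelated request goes to whichever replica is least loaded. Real schedulers also weigh queue depth, memory pressure, and hardware topology, which this sketch ignores.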
rdli|9 months ago
qntty|9 months ago