As someone who has utilized Nvidia Triton Inference Server for years it's really interesting to see people publicly disclosing use of TensorRT-LLM (almost certainly in conjunction with Triton).
Up until TensorRT-LLM Triton had been kind of an in-group secret amongst high scale inference providers. Now you can readily find announcements, press releases, etc of Triton (TensorRT-LLM) usage from the likes of Mistral, Phind, Cloudflare, Amazon, etc.
kkielhofner|2 years ago
Up until TensorRT-LLM Triton had been kind of an in-group secret amongst high scale inference providers. Now you can readily find announcements, press releases, etc of Triton (TensorRT-LLM) usage from the likes of Mistral, Phind, Cloudflare, Amazon, etc.
brucethemoose2|2 years ago
I still see post of people running ollama on H100s or whatever, and that's just because its so easy to set up.
jxy|2 years ago
tarruda|2 years ago
https://www.youtube.com/watch?v=dJ69gY0qRbg