item 39471491

SethTro | 2 years ago

> Phind-70B is significantly faster than GPT-4 Turbo ... We're able to achieve this by running NVIDIA's TensorRT-LLM library on H100 GPUs

kkielhofner | 2 years ago

As someone who has used Nvidia Triton Inference Server for years, it's really interesting to see people publicly disclosing use of TensorRT-LLM (almost certainly in conjunction with Triton).

Up until TensorRT-LLM, Triton had been something of an in-group secret among high-scale inference providers. Now you can readily find announcements, press releases, etc. of Triton (TensorRT-LLM) usage from the likes of Mistral, Phind, Cloudflare, Amazon, etc.
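For context on what "Triton in conjunction with TensorRT-LLM" looks like from a client's perspective: Triton exposes an HTTP `generate` endpoint for models served through the TensorRT-LLM backend. The sketch below just builds such a request; the model name (`ensemble`), port, and field names follow the public tensorrtllm_backend examples and are assumptions that may differ per deployment.

```python
import json

# Hedged sketch: construct a request for Triton's HTTP generate endpoint
# as exposed by the TensorRT-LLM backend's example "ensemble" model.
# Endpoint path, model name, and field names are assumptions based on
# the tensorrtllm_backend examples, not from the thread above.

TRITON_URL = "http://localhost:8000"  # Triton's default HTTP port


def build_generate_request(prompt: str, max_tokens: int = 64):
    """Return (url, json_body) for a Triton generate call."""
    url = f"{TRITON_URL}/v2/models/ensemble/generate"
    payload = {
        "text_input": prompt,
        "max_tokens": max_tokens,
    }
    return url, json.dumps(payload)


url, body = build_generate_request("Write a haiku about GPUs.")
print(url)
print(body)
```

In practice you would POST `body` to `url` (e.g. with `curl` or `requests`) and read the generated text from the JSON response.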

brucethemoose2 | 2 years ago

Being accessible is huge.

I still see posts of people running ollama on H100s or whatever, and that's just because it's so easy to set up.

jxy | 2 years ago

How many H100 GPUs does it take to serve 1 Phind-70B model? Are they serving it with bf16, or int8, or lower quants?
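The thread doesn't answer this, but a weights-only back-of-envelope estimate is easy to do: 70B parameters at 2 bytes each (bf16) is 140 GB, which already exceeds one 80 GB H100. The sketch below runs that arithmetic for a few precisions; real deployments also need KV cache, activations, and CUDA overhead, so these counts are lower bounds, and the per-parameter byte counts are standard assumptions, not figures from Phind.

```python
# Back-of-envelope VRAM math for serving a 70B-parameter model on
# 80 GB H100s. Weights only -- KV cache and activations are ignored,
# so the GPU counts are lower bounds.

H100_GB = 80      # usable-ish VRAM per H100 (assumption: full 80 GB)
PARAMS_B = 70     # model size in billions of parameters


def min_gpus(bytes_per_param: float) -> tuple[float, int]:
    """Return (weights_gb, minimum_gpu_count) for a given precision."""
    weights_gb = PARAMS_B * bytes_per_param  # 1e9 params * bytes = GB
    gpus = -(-weights_gb // H100_GB)         # ceiling division
    return weights_gb, int(gpus)


for name, bpp in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb, n = min_gpus(bpp)
    print(f"{name}: {gb:.0f} GB of weights -> at least {n}x H100")
```

So bf16 needs at least two H100s for the weights alone, while int8 or lower quants can fit the weights on a single card, leaving the rest for KV cache and batching.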