ml_hardware|4 years ago
At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)
But keep in mind the model won't fit on a single H100 (80GB): at 175B params the weights are ~90GB even in sparse FP8, and you need more on top of that for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you'd rent an 8xH100 cloud instance.
But yeah, the latency will be insanely fast given how massive these models are!
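A quick back-of-the-envelope check of that memory estimate (a sketch only; the 2:4-sparsity keep ratio and the activation allowance below are assumptions, not measured numbers):

    import math

    PARAMS = 175e9        # GPT-3-scale parameter count
    FP8_BYTES = 1         # FP8 stores 1 byte per parameter
    SPARSE_KEEP = 0.5     # assume 2:4 structured sparsity keeps half the weights
    H100_MEM_GB = 80      # memory on one 80GB H100

    # Weight memory alone already exceeds a single 80GB card (sparsity metadata ignored)
    weights_gb = PARAMS * FP8_BYTES * SPARSE_KEEP / 1e9
    print(f"sparse FP8 weights: ~{weights_gb:.0f} GB")

    # Ballpark allowance for activations / KV cache (a guess, workload-dependent)
    activations_gb = 20
    total_gb = weights_gb + activations_gb
    print(f"total: ~{total_gb:.0f} GB -> at least {math.ceil(total_gb / H100_MEM_GB)} H100s")

This prints roughly "~88 GB" for weights and a 2-GPU minimum, which lines up with the ~90GB and 2+ H100 figures above; the 8xH100 recommendation is about practical headroom rather than the bare memory floor.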
TOMDM|4 years ago
Sounds doable in a generation or two.