ml_hardware|4 years ago
At inference time it will be possible to do 4000 TFLOPS using sparse FP8 :)
But keep in mind the model won't fit on a single H100 (80GB): at 175B params the weights are ~90GB even in sparse FP8, and you need more on top of that for live activation memory. So you'll still want at least 2 H100s to run inference, and more realistically you'd rent an 8xH100 cloud instance.
But yeah, the latency will be insanely fast given how massive these models are!
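A quick back-of-the-envelope check of that memory estimate (a sketch only; the 2:4-sparsity keep ratio and the activation allowance below are assumptions, not measured numbers):

    import math

    PARAMS = 175e9        # GPT-3-scale parameter count
    FP8_BYTES = 1         # FP8 stores 1 byte per parameter
    SPARSE_KEEP = 0.5     # assume 2:4 structured sparsity keeps half the weights
    H100_MEM_GB = 80      # memory on one 80GB H100

    # Weight memory alone already exceeds a single 80GB card (sparsity metadata ignored)
    weights_gb = PARAMS * FP8_BYTES * SPARSE_KEEP / 1e9
    print(f"sparse FP8 weights: ~{weights_gb:.0f} GB")

    # Ballpark allowance for activations / KV cache (a guess, workload-dependent)
    activations_gb = 20
    total_gb = weights_gb + activations_gb
    print(f"total: ~{total_gb:.0f} GB -> at least {math.ceil(total_gb / H100_MEM_GB)} H100s")

This prints roughly "~88 GB" for weights and a 2-GPU minimum, which lines up with the ~90GB and 2+ H100 figures above; the 8xH100 recommendation is about practical headroom rather than the bare memory floor.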
TOMDM|4 years ago
Sounds doable in a generation or two.