crashocaster | 2 years ago
A100 specs:
- 312e12 BF16 FLOPS
- 1555 GB/s HBM bandwidth
H100:
- 1000e12/2000e12 BF16/INT8 FLOPS
(apply a ~0.7 FLOPS efficiency multiplier, because H100s power-throttle extremely quickly)
- 3000 GB/s HBM bandwidth
---
For a 13B model on an A100, this nets:
13e9 params * 2 bytes per param = 26 GB of HBM required (at bf16)
26e9 / 1555e9 ≈ 17 ms/token small-batch latency (~60 tokens/second)
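A minimal sketch of that memory-bound estimate, assuming each generated token streams all weights from HBM exactly once:

```python
# Memory-bound latency for a 13B bf16 model on an A100 (assumption:
# every decoded token reads every parameter from HBM once).
PARAMS = 13e9
BYTES_PER_PARAM = 2           # bf16
HBM_BANDWIDTH = 1555e9        # bytes/s, A100

weight_bytes = PARAMS * BYTES_PER_PARAM      # HBM needed for weights
latency = weight_bytes / HBM_BANDWIDTH       # seconds per token
print(f"{weight_bytes / 1e9:.0f} GB, {latency * 1e3:.1f} ms/token, "
      f"{1 / latency:.0f} tokens/s")
# prints: 26 GB, 16.7 ms/token, 60 tokens/s
```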
What about large batches?
latency for some batch size B is 13e9 params * 2 FLOP per param * B / 312e12
We want the B at which we're just about no longer HBM-bound, i.e. compute time equals the memory time:
26e9/312e12 * B = 17e-3 <=> B = 17e-3 / (26e9/312e12)
giving a batch size of 204.
At that batch size (and all larger batch sizes), the A100 delivers a throughput of B / 17 ms = 12000 tokens/second
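The crossover arithmetic above, as a sketch (using the rounded 17 ms memory time from earlier; exact numbers shift B slightly):

```python
# Batch size at which per-token compute time catches up with the
# ~17 ms weight-streaming time, and the resulting throughput.
PARAMS = 13e9
FLOP_PER_PARAM = 2            # one multiply + one add per weight
COMPUTE = 312e12              # A100 bf16 FLOP/s
MEM_TIME = 17e-3              # rounded HBM-bound latency, seconds

compute_time_per_seq = PARAMS * FLOP_PER_PARAM / COMPUTE  # s per sequence
B = MEM_TIME / compute_time_per_seq     # batch size at the crossover
throughput = B / MEM_TIME               # tokens/s at or above that B
print(f"B = {B:.0f}, throughput = {throughput:.0f} tokens/s")
# prints: B = 204, throughput = 12000 tokens/s
```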
---
KV caching, multi-GPU and multi-node comms, and matmul efficiencies left as an exercise for the reader :)