Training a 1B model on 1T tokens is cheaper than people might think.
An H100 GPU can be rented for about $2.50 per hour and can train a 1B-parameter model at around 63k tokens per second.
So you would need around 4,400 GPU-hours, costing only about $11k.
And costs will keep going down.
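A quick back-of-envelope sketch of that estimate in Python (the 63k tokens/second throughput and $2.50/hour rental price are the figures quoted above; everything else follows from them):

```python
# Back-of-envelope cost estimate for training a 1B model on 1T tokens,
# using the throughput and rental price quoted in the comment above.
tokens_total = 1e12        # 1T training tokens
tokens_per_second = 63_000 # claimed H100 throughput for a 1B model
usd_per_gpu_hour = 2.50    # claimed H100 rental price

gpu_seconds = tokens_total / tokens_per_second
gpu_hours = gpu_seconds / 3600          # ~4,400 GPU-hours
total_cost = gpu_hours * usd_per_gpu_hour  # ~$11k

print(f"{gpu_hours:,.0f} GPU-hours, ${total_cost:,.0f}")
```

Note this is a single-GPU-equivalent figure; a real run would shard across many GPUs and finish in days, but the dollar total is the same to first order (ignoring multi-GPU communication overhead).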
lumost|1 year ago
YetAnotherNick|1 year ago
This repo[2] by Meta achieves 48% MFU, or 80k tokens/second.
[1]: https://arxiv.org/pdf/2001.08361
[2]: https://github.com/facebookresearch/lingua
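The 48% MFU and 80k tokens/second figures are consistent with each other under the standard ~6N FLOPs-per-token approximation for a dense transformer. A hedged sanity check (the H100 SXM BF16 dense peak of ~989 TFLOPS is my assumption, not stated in the thread):

```python
# Sanity-check: does 80k tokens/s on a 1B model imply ~48% MFU on one H100?
# Assumes the common ~6N training-FLOPs-per-token approximation and an
# H100 SXM BF16 dense peak of ~989 TFLOPS (assumed, not from the thread).
params = 1e9
tokens_per_second = 80_000
peak_flops = 989e12  # assumed H100 BF16 dense peak

achieved_flops = 6 * params * tokens_per_second  # ~4.8e14 FLOP/s
mfu = achieved_flops / peak_flops

print(f"MFU ≈ {mfu:.1%}")  # lands right around the quoted 48%
```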
codetrotter|1 year ago
(1T tokens / 63k tokens per second) / (60 seconds per minute × 60 minutes per hour)
is approximately 4,400 hours.
So I guess that’s how the calculation went.
Or did you mean a source for the number of tokens per second?