Training a 1B model on 1T tokens is cheaper than people might think.
An H100 GPU can be rented for about $2.50 per hour and can train a 1B-parameter model at around 63k tokens per second.
So you would need around 4,400 GPU-hours, costing only about $11k.
And costs will keep going down.
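A quick back-of-envelope sketch of that estimate in Python (the 63k tokens/second throughput and $2.50/hour rental price are the figures quoted above; everything else follows from them):

```python
# Back-of-envelope cost estimate for training a 1B model on 1T tokens,
# using the throughput and rental price quoted in the comment above.
tokens_total = 1e12        # 1T training tokens
tokens_per_second = 63_000 # claimed H100 throughput for a 1B model
usd_per_gpu_hour = 2.50    # claimed H100 rental price

gpu_seconds = tokens_total / tokens_per_second
gpu_hours = gpu_seconds / 3600          # ~4,400 GPU-hours
total_cost = gpu_hours * usd_per_gpu_hour  # ~$11k

print(f"{gpu_hours:,.0f} GPU-hours, ${total_cost:,.0f}")
```

Note this is a single-GPU-equivalent figure; a real run would shard across many GPUs and finish in days, but the dollar total is the same to first order (ignoring multi-GPU communication overhead).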
lumost|1 year ago
YetAnotherNick|1 year ago
This repo[2] by Meta achieves 48% MFU, or 80k tokens/second.
[1]: https://arxiv.org/pdf/2001.08361
[2]: https://github.com/facebookresearch/lingua
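The 48% MFU and 80k tokens/second figures are consistent with each other under the standard ~6N FLOPs-per-token approximation for a dense transformer. A hedged sanity check (the H100 SXM BF16 dense peak of ~989 TFLOPS is my assumption, not stated in the thread):

```python
# Sanity-check: does 80k tokens/s on a 1B model imply ~48% MFU on one H100?
# Assumes the common ~6N training-FLOPs-per-token approximation and an
# H100 SXM BF16 dense peak of ~989 TFLOPS (assumed, not from the thread).
params = 1e9
tokens_per_second = 80_000
peak_flops = 989e12  # assumed H100 BF16 dense peak

achieved_flops = 6 * params * tokens_per_second  # ~4.8e14 FLOP/s
mfu = achieved_flops / peak_flops

print(f"MFU ≈ {mfu:.1%}")  # lands right around the quoted 48%
```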
codetrotter|1 year ago
(1T tokens / 63k tokens per second) / (60 seconds per minute × 60 minutes per hour)
is approximately 4,400 hours.
So I guess that’s how the calculation went.
Or did you mean a source for the number of tokens per second?