Training a 1B model on 1T tokens is cheaper than people might think. An H100 GPU can be rented for about $2.50 per hour and can train around 63k tokens per second for a 1B model, so you would need roughly 4,400 GPU-hours of training, costing only about $11k. And costs will keep going down.
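A quick back-of-the-envelope check of that arithmetic, as a minimal sketch in Python. It assumes the $2.50/hour rental rate and ~63k tokens/second throughput quoted above, which are the comment's illustrative figures rather than measured numbers:

```python
# Back-of-the-envelope cost of training a 1B-parameter model on 1T tokens.
# Assumptions (taken from the comment above, not measured): a single rented
# H100 at $2.50/hour sustaining ~63k training tokens per second.

tokens = 1e12                 # 1T training tokens
tokens_per_second = 63_000    # assumed throughput for a 1B model on one H100
cost_per_gpu_hour = 2.50      # assumed rental price in USD

gpu_hours = tokens / tokens_per_second / 3600
total_cost = gpu_hours * cost_per_gpu_hour

print(f"GPU-hours:  {gpu_hours:,.0f}")    # ~4,409
print(f"Total cost: ${total_cost:,.0f}")  # ~$11,023
```

Using more GPUs shortens the wall-clock time but leaves the GPU-hour total, and therefore the bill, roughly the same, assuming near-linear scaling.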
"Furthermore, AMD OLMo models were also able to run inference on AMD Ryzen™ AI PCs that are equipped with Neural Processing Units (NPUs). Developers can easily run Generative AI models locally by utilizing the AMD Ryzen™ AI Software."
Hope these AI PCs will also run something better than a 1B model. What is it useful for? Spellcheck?
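For concreteness, running a 1B model like this locally takes only a few lines with standard tooling. The sketch below uses the Hugging Face transformers API as a generic CPU/GPU stand-in, not the AMD Ryzen AI / NPU path the quote describes, and the checkpoint id is an assumption that may not match the actual release name:

```python
# Minimal local-inference sketch for a ~1B-parameter model.
# Generic Hugging Face `transformers` example, NOT the AMD Ryzen AI Software /
# NPU workflow mentioned in the quote. The model id below is assumed, and some
# checkpoints may additionally require trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/AMD-OLMo-1B-SFT"  # assumed checkpoint id; adjust to the real one
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "List three tasks a 1B-parameter language model handles well:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```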
The point is that AMD is doing the legwork to ensure that AI models can run on their chips. While they could settle for inference workloads (porting Llama to AMD), it is unlikely that many teams will widely adopt their silicon unless it can be used in the end-to-end ML stack. Many pure OSS efforts have tried and failed to make AMD work for this use case.
As a chip maker, they will also have some undersold, QA, or otherwise wasted parts available for these training efforts, so the capex is likely less severe for them than for a random startup betting on AMD.
Some use cases require a small memory footprint, e.g. parallel inferences. I suppose there are also dark patterns like tracking, where you don't want the load to stand out.
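To make the footprint argument concrete, here is a rough sketch of how much memory a 1B model's weights take at common precisions and how many independent copies would fit in a given RAM budget. The 16 GB budget is a hypothetical figure, and KV cache, activations, and runtime overhead are ignored:

```python
# Rough weight-memory footprint of a 1B-parameter model at common precisions,
# and how many independent copies fit in a hypothetical RAM budget.
# Ignores KV cache, activations, and runtime overhead.

PARAMS = 1_000_000_000
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
ram_budget_gb = 16  # hypothetical budget

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    copies = int(ram_budget_gb // weights_gb)
    print(f"{precision}: ~{weights_gb:.1f} GB of weights -> "
          f"~{copies} parallel instances in {ram_budget_gb} GB")
```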
Havoc|1 year ago
Which means you can go larger, but it'll become ever slower.
sireat|1 year ago
It seems actual domain-specific usefulness (say, a specific programming language, translation, etc.) starts at 3B models.