ahzhou | 1 year ago
Interesting to note - we have no idea how much R1 cost to train. To speculate - maybe DeepSeek’s release made an upcoming Llama release moot in comparison.
pptr | 1 year ago
FP8 training and GRPO make sense to me, but that only gets you a 4x improvement total, right?
ahzhou | 1 year ago
We don’t know how much of a speed improvement GRPO represents. They didn’t say how many GPU hours went into RLing DeepSeek-R1, and we don’t have o1 numbers to compare against.
There’s definitely lots of misinformation spreading, though. The $5.5M number refers to DeepSeek-V3, not DeepSeek-R1. I don't want to take away from HighFlyer's accomplishment, though. I think a lot of these innovations were forced by the need to work around H800 networking limitations, and it's impressive what they've done.
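For context on why GRPO cuts cost: it drops PPO's learned value critic and instead standardizes rewards within a group of samples per prompt. A minimal sketch of that advantage computation (illustrative only, not DeepSeek's actual implementation):

```python
# Sketch of GRPO's group-relative advantage (assumed from the paper's
# description, not DeepSeek's code). For each prompt, sample a group of
# G completions, score them, and normalize rewards within the group.

def grpo_advantages(rewards, eps=1e-8):
    """rewards: list of scalar rewards for G samples of one prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = (var + eps) ** 0.5
    # Each sample's advantage is its reward standardized within the group,
    # so no separate value network (critic) needs to be trained or served.
    return [(r - mean) / std for r in rewards]

# Example: two good and two bad completions for one prompt.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Skipping the critic roughly halves the model memory and forward/backward compute of the RL stage, which is one concrete (if hard to quantify exactly) source of savings.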