top | item 42856869


ahzhou | 1 year ago

They slightly restructure their MoE [1], but I think the main difference is that other big models (e.g. Llama 405B) are dense and have higher FLOP requirements. MoE should represent a ~5x improvement; FP8 should be about a ~2x improvement.

We don’t know how much of a speed improvement GRPO represents. They didn’t say how many GPU hours went into RLing DeepSeek-R1, and we don’t have o1 numbers to compare against.

There’s definitely lots of misinformation spreading, though. The $5.5m number refers to DeepSeek-V3, not DeepSeek-R1. I don't want to take away from High-Flyer's accomplishment, though. I think a lot of these innovations were forced by the need to work around H800 networking limitations, and it's impressive what they've done.

[1] https://arxiv.org/abs/2401.06066
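As a rough sketch of where the MoE saving comes from, here is the common C ≈ 6·N·D training-compute approximation applied to a dense model versus an MoE model. The parameter and token counts below are illustrative assumptions (Llama 3.1 405B dense; DeepSeek-V3 activating ~37B of 671B parameters per token), not exact training figures:

```python
# Back-of-envelope training-FLOP comparison using the standard
# C ~= 6 * N_active * D rule of thumb, where N_active is the number
# of parameters used per token and D is the number of training tokens.

def train_flops(active_params: float, tokens: float) -> float:
    """Approximate total training compute via 6 * N * D."""
    return 6 * active_params * tokens

D = 15e12  # assumed ~15T training tokens for both runs (illustrative)

# Dense model: every parameter is active for every token.
dense_llama = train_flops(405e9, D)

# MoE model: only the routed experts' parameters are active per token.
moe_deepseek = train_flops(37e9, D)

ratio = dense_llama / moe_deepseek
print(f"dense / MoE FLOP ratio: {ratio:.1f}x")
```

The raw active-parameter ratio overstates the realized wall-clock speedup, since routing overhead, expert load imbalance, and memory traffic all eat into it, which is how a ~11x FLOP gap can land closer to the ~5x practical improvement mentioned above.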



karmakaze | 1 year ago

It's interesting that having access only to less powerful hardware motivated, even necessitated, more efficient training; it's a bit like how tariffs can backfire if left in place too long.