(no title)
ogrisel | 1 year ago
The fact that DeepSeek-R1 is so much better than DeepSeek-V3 at various important tasks shows that chain-of-thought / thinking-before-answering models are better. But they are also more compute-intensive at inference time than their non-thinking instruction-tuned counterparts.
So even if the DeepSeek-V3 pretraining + GRPO chain-of-thought post-training procedure was cheaper than anticipated to reach o1-grade performance, inference is still costly, even with a distilled model.
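The inference-cost point can be made concrete with a back-of-the-envelope sketch. All numbers below are illustrative assumptions (token counts and the per-token price are made up, not measured from DeepSeek's API); the point is only that the chain-of-thought tokens dominate the cost of an answer.

```python
# Back-of-the-envelope comparison of per-answer output-token cost
# between a non-thinking model and a thinking (CoT) model.
# All constants are hypothetical, chosen only for illustration.
PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # assumed $/1K output tokens

def answer_cost(output_tokens: int) -> float:
    """Dollar cost of generating `output_tokens` tokens for one answer."""
    return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

plain = answer_cost(300)            # assume ~300 tokens for a direct answer
thinking = answer_cost(4000 + 300)  # assume ~4000 extra chain-of-thought tokens

print(f"non-thinking: ${plain:.4f} per answer")
print(f"thinking:     ${thinking:.4f} per answer")
print(f"ratio:        {thinking / plain:.1f}x")
```

Under these assumed numbers the thinking model costs over an order of magnitude more per answer, which is why even a distilled R1-style model stays expensive to serve.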
bildung | 1 year ago