This made me compare the figures: did they accidentally swap them, or are the post-training Reasoning and Factuality scores actually significantly lower than the pre-training ones?

Edit: Just noticed this note:

> Also note pre-training and post-training benchmarks are different, so scores are not comparable across plots.

The paper gives more detail on the specific benchmarks and the scores obtained on them: https://arxiv.org/html/2512.14856v1#S4
