miraculixx | 8 months ago

As any AI researcher knows, if a model does 4x better than the naive baseline (the humans, in this case), you are likely looking at overfitting, not real-life performance. This study is just slop, and you can tell by the mere fact that they did not submit a paper but only published a PR article.

brandonb|8 months ago

In the paper, they say they used the most recent 56 cases (from 2024–2025) as a holdout set. The majority of those cases happened after the o4 training cutoff of May 31, 2024.

miraculixx|8 months ago

Are these 56 cases distinct from all other cases in the data?
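
For what it's worth, the check being asked about could be sketched roughly like this. This is only an illustration: the `id` and `published` fields, the toy cases, and the helper names are all made up, and the cutoff date is simply the one quoted above. Nothing here is taken from the study itself.

```python
from datetime import date

# Cutoff as quoted in the comment above; treat it as an assumption.
TRAINING_CUTOFF = date(2024, 5, 31)


def split_by_date(cases, n_holdout):
    """Hold out the n_holdout most recent cases (n_holdout must be > 0)."""
    ordered = sorted(cases, key=lambda c: c["published"])
    return ordered[:-n_holdout], ordered[-n_holdout:]


def check_holdout(development, holdout, cutoff=TRAINING_CUTOFF):
    """Report ID overlap with the development set and how many cases postdate the cutoff."""
    dev_ids = {c["id"] for c in development}
    overlap = [c["id"] for c in holdout if c["id"] in dev_ids]
    after_cutoff = sum(c["published"] > cutoff for c in holdout)
    return {
        "overlap": overlap,
        "after_cutoff": after_cutoff,
        "holdout_size": len(holdout),
    }


# Hypothetical toy data, not the study's cases.
cases = [
    {"id": "a", "published": date(2023, 11, 2)},
    {"id": "b", "published": date(2024, 3, 14)},
    {"id": "c", "published": date(2024, 8, 1)},
    {"id": "d", "published": date(2025, 1, 9)},
]

development, holdout = split_by_date(cases, n_holdout=2)
report = check_holdout(development, holdout)
print(report)
```

An empty `overlap` only rules out exact duplicates by ID. It would not catch the same case appearing under a different ID, or a case that was discussed publicly before the cutoff.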