I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0], which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd originally set out to run it on Llama 4 but abandoned that after estimating the cost.
* - I also reproduced the Llama 3.1 8B result to check my setup.
daemonologist|10 months ago
[0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa