uep | 10 months ago

What is the reason you included Claude 3.5 instead of 3.7 in this?

I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0], which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage. I'd originally set out to run it on Llama 4 but abandoned that after estimating the cost.

* - I also reproduced the Llama 3.1 8B result to check my setup.

[0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa