Might look small, but the needle-in-a-haystack numbers they report in the model card addenda at 200k are also a massive improvement toward "proving a negative", i.e. recognizing that the answer does not exist in the text: 99.7% vs 98.3% for Opus.
https://cdn.sanity.io/files/4zrzovbb/website/fed9cc193a14b84...
Could you explain how these two are related? That benchmark seems to ask for very specific information inside a large body of text, which for LLMs seems quite a different task from proving a negative. Any improvement on proving a negative would mean fewer hallucinations, which would be a huge deal.
maeil | 1 year ago