oofbaroomf | 9 months ago

The SWE-Bench scores are very high for an open source model of this size. 46.8% is better than o3-mini (with Agentless-lite) and Claude 3.6 (with AutoCodeRover), though a little lower than Claude 3.6 with Anthropic's proprietary scaffold. And considering you can run this almost for free, this is an extraordinary model.

AstroBen|9 months ago

extraordinary.. or suspicious that the benchmarks aren't doing their job

echelon|9 months ago

I wasn't considering Mistral for anything, but this show of goodwill to open source is amazing. I'll have to give this a try.

sagarpatil|9 months ago

They are referring to SWE-bench Lite. Just want to make sure you are too.

svantana|9 months ago

Where did you get that idea? In the post they are repeatedly referring to SWEBench-Verified and nothing else.

falcor84|9 months ago

Just to confirm, are you referring to Claude 3.7?

oofbaroomf|9 months ago

No. I am referring to Claude 3.5 Sonnet New, released October 22, 2024, with model ID claude-3-5-sonnet-20241022, colloquially referred to as Claude 3.6 Sonnet because of Anthropic's confusing naming.