top | item 46902371


osti | 24 days ago

Somehow it regresses on SWE-bench?


lkbm|24 days ago

I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.

That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.
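For a rough sense of scale, here is a back-of-the-envelope sketch of the sampling noise on a single benchmark run. The numbers are assumptions, not from the thread: 500 tasks (the size of the SWE-bench Verified subset) and a pass rate around 80%. It also treats every task as an independent coin flip with the same success probability, which is a simplification. `pass_rate_std_error` is just a helper name made up for illustration.

```python
import math


def pass_rate_std_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate, treated as a binomial proportion."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# Assumed values for illustration only (not stated in the thread).
N_TASKS = 500      # assumed task count (SWE-bench Verified size)
PASS_RATE = 0.80   # assumed pass rate, roughly where current scores sit

se = pass_rate_std_error(PASS_RATE, N_TASKS)
print(f"one-run standard error: {se * 100:.2f} percentage points")
print(f"a 0.1 point change = {0.001 * N_TASKS:.1f} tasks")
```

Under those assumptions the one-run standard error comes out near 1.8 percentage points, and a 0.1 point change corresponds to half a task, so a difference that small can only show up when scores are averaged over several runs.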

usaar333|24 days ago

I'd interpret that as rounding error, i.e. effectively unchanged.

SWE-bench seems really hard once you are above 80%.

Squarex|24 days ago

It's not a great benchmark anymore, starting with it being primarily Python / Django. The industry should move to something more representative.