item 46902371 (no title)
osti | 24 days ago Somehow regresses on SWE bench?
lkbm|24 days ago I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.
SubiculumCode|24 days ago That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.
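For a rough sense of how much noise to expect here, a back-of-the-envelope sketch follows. It treats each task as an independent pass/fail trial and computes the binomial standard error of the pass rate. The task count (500) and the pass rate (80%) are assumptions for illustration, not figures stated in this thread, and the function name is made up.

```python
import math


def pass_rate_std_error(p: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured over n_tasks independent tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)


# Assumed values (not from the thread): 500 tasks, roughly 80% pass rate.
n_tasks = 500
p = 0.80

se = pass_rate_std_error(p, n_tasks)
gap = 0.001  # the 0.1 percentage point "regression" being discussed

print(f"standard error of a single run: {se * 100:.2f} percentage points")
print(f"0.1 pt gap measured in standard errors: {gap / se:.2f}")
```

Under those assumptions, a single run carries a standard error of a bit under 2 percentage points, so a 0.1 point gap is a small fraction of one standard error. Averaging over several runs would shrink that, but tasks are not truly independent, so treat this as an order-of-magnitude check only.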
usaar333|24 days ago I'd interpret that as rounding error, that is, unchanged. SWE-bench seems really hard once you are above 80%.
Squarex|24 days ago It's not a great benchmark anymore... starting with it being Python / Django primarily... the industry should move to something more representative.