Show HN: Arenas suck, here's why we just added one to Windsurf

4 points | agtestdvn | 1 month ago | windsurf.com

Benchmarks don't reflect real-world coding ability. So we made real-world coding the benchmark.

2 comments

swyx | 1 month ago
(team member) My comparison matrix of how Product Arenas differ from Global Arenas is here: https://x.com/swyx/status/2017342647963431363

The trick is to make it usable within context. What started out as a simple evals concept quickly turned into a lot of debate over how to properly present worktrees in an IDE. Hope to hear your feedback.

agtestdvn | 1 month ago
I work at Windsurf and would love to discuss, in a product-agnostic way, any ideas people have about how we as a community can evaluate models better. I feel like benchmarks such as SWE-bench are all saturated and gamed or trained on. I also feel like online arenas are mostly used by vibecoders. And our arena mode definitely isn't the final form factor either!