ObnoxiousProxy | 1 year ago

Misleading headline, and fairly pointless without digging into how the benchmark was constructed and what kinds of programming questions were asked.

On the HumanEval benchmark (https://paperswithcode.com/sota/code-generation-on-humaneval), GPT-4 can generate code that passes the tests on the first attempt (pass@1) 76.5% of the time.

Meanwhile, on SWE-bench (https://www.swebench.com/), GPT-4 with RAG resolves only about 1% of the GitHub issues used in the benchmark.
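To make the gap concrete: a HumanEval problem is a short, self-contained function stub, roughly like the sketch below (a paraphrase, not a verbatim benchmark item), whereas a SWE-bench task hands the model an entire repository plus a real GitHub issue and only counts success if the generated patch makes the repo's tests pass.

    # Rough sketch of a HumanEval-style task: the model is given the
    # signature and docstring and must produce the body; pass@1 means
    # its first completion passes the hidden unit tests.
    from typing import List

    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        """Return True if any two numbers in the list are closer to
        each other than the given threshold."""
        for i, a in enumerate(numbers):
            for b in numbers[i + 1:]:
                if abs(a - b) < threshold:
                    return True
        return False

Scoring isolated ten-line functions like this is a very different test from navigating a multi-file codebase, which is why the two numbers are so far apart.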
