ObnoxiousProxy | 1 year ago
On the HumanEval benchmark (https://paperswithcode.com/sota/code-generation-on-humaneval), GPT-4 can generate code that passes the tests on the first attempt 76.5% of the time.
Yet on SWE-bench (https://www.swebench.com/), GPT-4 with RAG can only resolve about 1% of the GitHub issues used in the benchmark.
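For anyone unfamiliar with how that 76.5% number is scored: HumanEval measures "pass@1", i.e. a completion only counts if the assembled program passes all of the problem's unit tests on the first try. A minimal sketch of that scoring loop (with made-up toy problems and completions, not the real benchmark harness):

```python
# Toy illustration of HumanEval-style pass@1 scoring.
# The problems and "model completions" below are hypothetical.

problems = [
    {
        "prompt": "def add(a, b):\n",
        "completion": "    return a + b\n",      # correct completion
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "prompt": "def is_even(n):\n",
        "completion": "    return n % 3 == 0\n",  # deliberately buggy
        "tests": "assert is_even(4)\nassert not is_even(7)\n",
    },
]

def passes(problem):
    # Assemble prompt + completion + tests into one program and run it.
    program = problem["prompt"] + problem["completion"] + problem["tests"]
    try:
        exec(program, {})  # fresh namespace; any failed assert raises
        return True
    except Exception:
        return False

pass_at_1 = sum(passes(p) for p in problems) / len(problems)
print(f"pass@1 = {pass_at_1:.1%}")
```

SWE-bench scoring is analogous in spirit (the patched repo must make the issue's reference tests pass), but the input is an entire real codebase plus an issue description rather than a self-contained function stub, which is a big part of why the numbers diverge so sharply.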