kumama | 1 year ago | on: We might be overestimating coding agent performance on SWE-Bench
kumama's comments
kumama | 1 year ago | on: How to Improve Code Completion LLMs with Repo-Specific Finetuning
Hey everyone! We've been working on helping engineering teams fine-tune custom code LLMs on their internal repos, for tasks across the SDLC.
We wrote a blog post about how we're doing it for code completion. We essentially fine-tune the model as if it were a developer going from a blank slate to the full repo, one diff at a time. Instead of treating a codebase as a static list of raw files, we treat it as a time series of diffs over a graph of code objects (functions, classes, etc.).
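To make the idea concrete, here's a minimal sketch of what a diff-ordered training dataset could look like. This is a hypothetical illustration, not our actual pipeline: commits are stubbed in memory (a real version would read them from git history), and the `Commit` class and `build_training_samples` function are names made up for this example.

```python
# Hypothetical sketch: replay a repo's history oldest-to-newest and
# emit (context, next-diff) pairs, mimicking a developer growing the
# repo from a blank slate one diff at a time.
from dataclasses import dataclass

@dataclass
class Commit:
    timestamp: int
    diff: str  # unified diff introduced by this commit

def build_training_samples(commits):
    """Each sample pairs the repo's state so far (here, just the
    concatenated earlier diffs) with the next diff to predict."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    samples, history = [], []
    for commit in ordered:
        samples.append({
            "context": "\n".join(history),  # everything written so far
            "target": commit.diff,          # what the model should predict
        })
        history.append(commit.diff)
    return samples

# Stubbed commits, deliberately out of order to show the sort.
commits = [
    Commit(2, "+def add(a, b):\n+    return a + b"),
    Commit(1, "+# utils module"),
]
samples = build_training_samples(commits)
assert samples[0]["context"] == ""  # first sample starts from a blank slate
```

In practice the context would be a richer representation (the code-object graph rather than raw concatenated diffs), but the key property is the same: training examples follow the repo's actual chronological evolution.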
The results are very encouraging.
Would love to answer questions and hear any cool ideas y'all might have!
kumama | 2 years ago | on: Show HN: Generate webpages with GPT-4-Turbo
cool tool
This led us to think more deeply about SWE-Bench as an evaluation tool. We've put together a blog post that reviews this paper and other relevant research, and shares our thoughts on additional gaps in SWE-Bench.
Blog: https://www.cgft.io/blog/swe-bench-evals
Paper: https://openreview.net/forum?id=pwIGnH2LHJ
Would love your thoughts as well! This post isn't meant to criticize SWE-Bench; it's still the best dataset out there for evaluating coding agents. Instead, we hope this discussion sparks ideas on how to make it even better!