kumama | 1 year ago | on: We might be overestimating coding agent performance on SWE-Bench
kumama's comments
kumama | 1 year ago | on: How to Improve Code Completion LLMs with Repo-Specific Finetuning
Hey everyone! We've been working on helping engineering teams fine-tune custom code LLMs on their internal repos, for tasks across the SDLC.
We wrote a blog post about how we're doing it for code completion. We essentially fine-tune the model as if it were a developer going from a blank slate to the full repo, one diff at a time. Instead of treating a codebase as a static list of raw files, we treat it as a time series of diffs over a graph of code objects (functions, classes, etc.).
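To make the idea concrete, here's a minimal sketch of what a diff-ordered training dataset could look like. This is a hypothetical illustration, not our actual pipeline: commits are stubbed in memory (a real version would read them from git history), and the `Commit` class and `build_training_samples` function are names made up for this example.

```python
# Hypothetical sketch: replay a repo's history oldest-to-newest and
# emit (context, next-diff) pairs, mimicking a developer growing the
# repo from a blank slate one diff at a time.
from dataclasses import dataclass

@dataclass
class Commit:
    timestamp: int
    diff: str  # unified diff introduced by this commit

def build_training_samples(commits):
    """Each sample pairs the repo's state so far (here, just the
    concatenated earlier diffs) with the next diff to predict."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    samples, history = [], []
    for commit in ordered:
        samples.append({
            "context": "\n".join(history),  # everything written so far
            "target": commit.diff,          # what the model should predict
        })
        history.append(commit.diff)
    return samples

# Stubbed commits, deliberately out of order to show the sort.
commits = [
    Commit(2, "+def add(a, b):\n+    return a + b"),
    Commit(1, "+# utils module"),
]
samples = build_training_samples(commits)
assert samples[0]["context"] == ""  # first sample starts from a blank slate
```

In practice the context would be a richer representation (the code-object graph rather than raw concatenated diffs), but the key property is the same: training examples follow the repo's actual chronological evolution.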
The results are very encouraging.
Would love to answer questions and hear any cool ideas y'all might have!
kumama | 2 years ago | on: Show HN: Generate webpages with GPT-4-Turbo
cool tool
This led us to think more deeply about SWE-Bench as an evaluation tool. We've put together a blog post that reviews this paper and other relevant research, and shares our thoughts on additional gaps in SWE-Bench.
Blog: https://www.cgft.io/blog/swe-bench-evals
Paper: https://openreview.net/forum?id=pwIGnH2LHJ
Would love your thoughts as well! This post isn't meant to criticize SWE-Bench; it's still the best dataset out there for evaluating coding agents. Instead, we hope this discussion sparks ideas on how to make it even better!