Current code generation systems are evaluated on static benchmarks like HumanEval, which consist of isolated code snippets and lack real-world aspects of programming such as working within large codebases, managing dependencies, and using execution environments. While GitHub repositories provide a rich source of real-world codebases, evaluating code generation systems on them is challenging due to the lack of test harnesses associated with the code.
We present R2E, a scalable framework that turns any GitHub repository into an environment for programming agents. These environments can be used to benchmark programming agents that interact with interpreters on repository-level problems. The system is designed to be scalable and can be used to evaluate code generation, optimization, and refactoring on public and _private_ repos. Further, R2E also enables the collection of large-scale execution traces to improve LLMs themselves.
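To make the repository-as-environment idea concrete, here is a minimal sketch of what an agent's interaction loop with such an environment might look like. Everything below (the `RepoEnv` class, its `reset`/`step` methods, and the toy test harness) is a hypothetical illustration under assumed semantics, not R2E's actual API.

```python
# Hypothetical sketch of a repository-as-environment interaction loop.
# All names here are illustrative assumptions, not R2E's real interface.

class RepoEnv:
    """Wraps a checked-out repository plus a generated test harness."""

    def __init__(self, tests):
        # tests: mapping of test name -> callable(candidate) -> bool
        self.tests = tests

    def reset(self):
        # Return the problem specification the agent must solve.
        return "Implement `add(a, b)` returning the sum of two integers."

    def step(self, candidate):
        # Execute the agent's candidate against the harness and report
        # per-test pass/fail feedback (the agent's "observation").
        results = {name: test(candidate) for name, test in self.tests.items()}
        done = all(results.values())
        return results, done


# Toy harness: two tests for a hypothetical `add` function.
env = RepoEnv({
    "adds_positives": lambda f: f(2, 3) == 5,
    "adds_negatives": lambda f: f(-1, -4) == -5,
})

spec = env.reset()
results, solved = env.step(lambda a, b: a + b)  # agent's candidate solution
```

The key design point this sketch illustrates is that the environment, not the agent, owns the execution harness: the agent only sees a specification and pass/fail feedback, which is what makes benchmarking on arbitrary repositories (including private ones) tractable.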