
Show HN: Measuring how AI agent teams improve issue resolution on SWE-Verified

2 points | NBenkovich | 23 days ago | arxiv.org

Hi HN,

We recently ran an experiment to answer a simple question:

Does coordinating multiple AI agents as a team actually help with real software engineering tasks, compared to a single strong agent?

To test this, we evaluated our system on SWE-bench Verified. The benchmark consists of real GitHub issues that require understanding codebases, modifying multiple files, running tests, and iterating.

Instead of treating software engineering as a single-agent patch generation problem, we model it as an organizational process.

Our system uses a team of agents with explicit roles:

* manager: plans work, assigns tasks, integrates results
* researcher: explores the codebase, issue history, constraints
* engineer: implements fixes in isolated environments
* reviewer: inspects changes, requests revisions, validates results

There is no fixed pipeline and no predefined number of steps. Agents communicate via structured artifacts (plans, diffs, reviews) and produce real GitHub pull requests with full history.
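The role split and artifact flow above can be sketched in a few lines. This is a hypothetical toy sketch, not the actual system: the artifact types (Plan, Diff, Review) and the `run_team` loop are illustrative assumptions about how a manager might route work between an engineer and a reviewer.

```python
from dataclasses import dataclass, field

# Illustrative structured artifacts (names are assumptions, not the real schema)
@dataclass
class Plan:
    tasks: list[str]

@dataclass
class Diff:
    task: str
    patch: str

@dataclass
class Review:
    approved: bool
    comments: list[str] = field(default_factory=list)

def run_team(issue: str) -> list[Diff]:
    """Toy manager loop: plan, implement, review, revise until approved."""
    plan = Plan(tasks=[f"fix: {issue}"])                  # manager plans
    merged: list[Diff] = []
    for task in plan.tasks:
        diff = Diff(task=task, patch=f"patch for {task}")  # engineer implements
        review = Review(approved=True)                     # reviewer validates
        while not review.approved:                         # revision loop
            diff = Diff(task=task, patch=diff.patch + " (revised)")
            review = Review(approved=True)
        merged.append(diff)                                # manager integrates
    return merged
```

The point of the sketch is the shape of the loop, not the logic: each role only consumes and produces artifacts, so any role can be backed by a different model.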

For evaluation, we compared three setups on SWE-bench Verified:

* single-agent baseline: GPT-5 medium reasoning + shell
* agent team (ours): GPT-5 (manager, researcher) + GPT-5 Codex (engineer, reviewer), both medium reasoning
* stronger single-model reference: GPT-5.2 (high reasoning)

Results:

* the agent team resolves ~7% more issues than the single-agent GPT-5 medium-reasoning baseline
* despite using medium-reasoning models, the agent team shows ~0.5% better quality than a single GPT-5.2 (high-reasoning) agent

Beyond resolution rate, the main benefits are cleaner responsibility boundaries, context isolation, easier debugging, and the ability to use different models for different roles.

Code + trajectories are open source: https://github.com/agynio/platform

Paper with methodology and results: https://arxiv.org/abs/2602.01465

Would love to hear your thoughts.
