(no title)
richardblythman | 6 months ago
For those just catching up: the problem is that existing benchmarks focus on self-contained codegen. StackBench tests how well AI coding agents (like Claude Code, and now Cursor) use your library by:
• Parsing your documentation automatically
• Extracting real usage examples
• Having agents regenerate those examples from scratch, given only a spec
• Logging every mistake and analyzing the patterns
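To make the pipeline above concrete, here is a minimal sketch of what such a docs-to-benchmark loop could look like. This is not StackBench's actual implementation; the function names (`extract_examples`, `evaluate`), the regex-based doc parsing, and the naive "same function calls" grading are all invented for illustration.

````python
import re
from dataclasses import dataclass, field

@dataclass
class UsageExample:
    description: str  # the prose line preceding the code block in the docs
    code: str         # the reference snippet pulled from the docs

@dataclass
class RunLog:
    mistakes: list = field(default_factory=list)

def extract_examples(markdown: str) -> list:
    """Pull each fenced code block plus the line of prose just before it."""
    pattern = re.compile(r"(?P<desc>[^\n]*)\n```\w*\n(?P<code>.*?)```", re.DOTALL)
    return [
        UsageExample(m.group("desc").strip(), m.group("code").strip())
        for m in pattern.finditer(markdown)
    ]

def evaluate(agent_output: str, reference: UsageExample, log: RunLog) -> bool:
    """Naive check: does the agent's code call the same functions as the docs?"""
    wanted = set(re.findall(r"\w+(?=\()", reference.code))
    produced = set(re.findall(r"\w+(?=\()", agent_output))
    missing = wanted - produced
    if missing:
        log.mistakes.append(f"missing calls: {sorted(missing)}")
    return not missing

# Toy documentation with one usage example.
docs = """To connect, create a client first.
```python
client = connect("localhost")
client.ping()
```
"""

examples = extract_examples(docs)
log = RunLog()
# Stand-in for the agent's regenerated code.
ok = evaluate('c = connect("localhost")\nc.ping()', examples[0], log)
print(ok, log.mistakes)  # True []
````

A real harness would of course run the agent against a sandboxed environment and grade by execution rather than by matching call names, but the shape of the loop (extract, regenerate, compare, log) is the same.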
You can find more information about how it works and how to run it in the docs: https://docs.stackbench.ai/
Next up, we're planning to add more:
• Coding agents
• Ways of providing docs as context (e.g. Mintlify vs Cursor doc search)
• Benchmark tasks (e.g. use of APIs via API docs)
• Metrics
We're also working on automating in-editor testing and maybe even using an MCP server.
Contributions and suggestions are very welcome. What should we prioritize next? The issues tab is open.