(no title)
richardblythman | 6 months ago
For those just catching up: the problem is that existing benchmarks focus on self-contained codegen. StackBench tests how well AI coding agents (like Claude Code, and now Cursor) use your library by:
• Parsing your documentation automatically
• Extracting real usage examples
• Having agents regenerate those examples from scratch, given only a spec
• Logging every mistake and analyzing the patterns
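To make the pipeline above concrete, here is a minimal sketch of what such a docs-to-benchmark loop could look like. This is not StackBench's actual implementation; the function names (`extract_examples`, `evaluate`), the regex-based doc parsing, and the naive "same function calls" grading are all invented for illustration.

````python
import re
from dataclasses import dataclass, field

@dataclass
class UsageExample:
    description: str  # the prose line preceding the code block in the docs
    code: str         # the reference snippet pulled from the docs

@dataclass
class RunLog:
    mistakes: list = field(default_factory=list)

def extract_examples(markdown: str) -> list:
    """Pull each fenced code block plus the line of prose just before it."""
    pattern = re.compile(r"(?P<desc>[^\n]*)\n```\w*\n(?P<code>.*?)```", re.DOTALL)
    return [
        UsageExample(m.group("desc").strip(), m.group("code").strip())
        for m in pattern.finditer(markdown)
    ]

def evaluate(agent_output: str, reference: UsageExample, log: RunLog) -> bool:
    """Naive check: does the agent's code call the same functions as the docs?"""
    wanted = set(re.findall(r"\w+(?=\()", reference.code))
    produced = set(re.findall(r"\w+(?=\()", agent_output))
    missing = wanted - produced
    if missing:
        log.mistakes.append(f"missing calls: {sorted(missing)}")
    return not missing

# Toy documentation with one usage example.
docs = """To connect, create a client first.
```python
client = connect("localhost")
client.ping()
```
"""

examples = extract_examples(docs)
log = RunLog()
# Stand-in for the agent's regenerated code.
ok = evaluate('c = connect("localhost")\nc.ping()', examples[0], log)
print(ok, log.mistakes)  # True []
````

A real harness would of course run the agent against a sandboxed environment and grade by execution rather than by matching call names, but the shape of the loop (extract, regenerate, compare, log) is the same.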
You can find more information about how it works and how to run it in the docs: https://docs.stackbench.ai/
Next up, we're planning to add more:
• Coding agents
• Ways of providing docs as context (e.g. Mintlify vs Cursor doc search)
• Benchmark tasks (e.g. use of APIs via API docs)
• Metrics
We're also working on automating in-editor testing and maybe even using an MCP server.
Contributions and suggestions are very welcome. What should we prioritize next? The issues tab is open.