We're ex Google DeepMind researchers who have been doing research on improving the reliability of agents. It turns out to be pretty non-trivial, but there are a lot of techniques that help. The most important thing is running rigorous evals that are representative of what your users actually do in your product. Often that is not the same as academic benchmarks, so we built our own benchmarks to measure progress.

Plug: we just posted a demo of our agent doing sophisticated reasoning over a huge dataset (the JFK assassination files -- 80,000 PDF pages): https://x.com/peterjliu/status/1906711224261464320
Even on small numbers of files, I think there's quite a palpable difference in reliability/accuracy versus the big AI players.
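To make the "benchmarks representative of what your users do" point concrete, here is a minimal sketch of a product-specific eval harness. Everything in it is hypothetical: run_agent stands in for whatever agent you're testing, and the example task and expected answer are made up for illustration.

```python
# Minimal sketch of a product-specific eval harness: run the agent over tasks
# drawn from real user workflows and score each with a deterministic check.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalTask:
    prompt: str                   # the task as a user would actually phrase it
    check: Callable[[str], bool]  # deterministic pass/fail check on the agent's answer


def run_agent(prompt: str) -> str:
    """Placeholder for the agent under test; swap in your own agent call."""
    raise NotImplementedError


# Tasks should be sampled from real product usage, not copied from academic benchmarks.
tasks = [
    EvalTask(
        prompt="Across these filings, which company reported the highest 2023 revenue?",
        check=lambda answer: "Acme Corp" in answer,  # illustrative expected answer
    ),
]


def evaluate(task_set: List[EvalTask]) -> float:
    """Return the pass rate; track this number release over release."""
    passed = sum(1 for t in task_set if t.check(run_agent(t.prompt)))
    return passed / len(task_set)
```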
ai-christianson|11 months ago
OMFG thank you for saying this. As a core contributor to RA.Aid, I suspect optimizing it for SWE-bench would actively hurt performance on real-world tasks. RA.Aid came about in the first place as a pragmatic programming tool (I created it while building another software startup, Fictie). It works well because it was literally made and tested by building other software, and these days it mostly creates its own code.
Do you have any tips or suggestions on how to do more formalized evals, but on tasks that resemble real-world work?
peterjliu|11 months ago
And before going to crowd-workers (maybe you can skip them entirely), try LLMs.
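One common way to apply this is an LLM-as-judge grader in the eval loop (my reading of "try LLMs", not necessarily what the author had in mind). This is a sketch under that assumption: call_llm and the grading prompt are placeholders for whatever model API and rubric you use.

```python
# Sketch of an LLM-as-judge grader: instead of a crowd-worker, ask an LLM to
# decide whether the agent's answer satisfies the task, given reference notes.
import json


def call_llm(prompt: str) -> str:
    """Placeholder for a call to your preferred LLM API."""
    raise NotImplementedError


GRADER_TEMPLATE = """You are grading an AI agent's answer.
Task: {task}
Reference notes: {reference}
Agent answer: {answer}
Reply with JSON: {{"pass": true, "reason": "..."}} or {{"pass": false, "reason": "..."}}"""


def llm_grade(task: str, reference: str, answer: str) -> bool:
    """Return True if the grader LLM marks the answer as passing."""
    reply = call_llm(GRADER_TEMPLATE.format(task=task, reference=reference, answer=answer))
    try:
        return bool(json.loads(reply)["pass"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False  # treat unparseable grades as failures and inspect them manually
```

Whatever grader you use, it's worth hand-checking a sample of its pass/fail decisions before trusting the aggregate numbers.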