arauhala | 28 days ago

Hmm.. So, if I understand correctly, you are maintaining QA suites for 10k agents, along with their prompts, toolboxes, and some scenarios.

That sounds like an absolutely massive scale to do QA over. Running the test suite must cost a fortune and take ages.

As such, I don't think data storage is that big a problem. E.g. if you store 10-100 requests per agent, that's 100k-1M snapshots. Booktest normally stores the states in Git, but there is also some kind of DVC support. If you need to recreate all snapshots regularly, e.g. if you often change some system-wide properties, that may or may not become a problem. Git and DVC can manage quite a high scale. Git PR reviews won't work, though, with e.g. 1M changed files.
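To spell out that estimate, here is a quick back-of-the-envelope sketch. The function name is just illustrative, and the numbers are the ones from this thread (10k agents, 10-100 stored requests each), not measurements:

```python
def snapshot_count(agents: int, requests_per_agent: int) -> int:
    """Total stored snapshots if every agent keeps the same number of requests."""
    return agents * requests_per_agent


# Figures from the discussion: 10k agents, 10 to 100 stored requests per agent.
low = snapshot_count(10_000, 10)
high = snapshot_count(10_000, 100)
print(f"{low:,} to {high:,} snapshots")
```

The storage side scales linearly with the request count per agent, so the upper bound is what matters if you plan to recreate everything at once.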

Our scale in a monorepo was maybe 10k LLM snapshots in several hundred files, which worked technically. Recreating all evaluations was a bit slow (e.g. 5-10 minutes), and merges often forced recreation of snapshots and re-reviews. This is ofc a massively smaller scale than what you have. We did use booktest for an assistant, but it was just one assistant over dozens of use case flows.

I guess it could be somehow manageable if you avoid test-level manual review and only review some aggregated results in the tool, but I cannot really promise anything. It may be worth a try.
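Roughly what I mean by reviewing aggregates instead of individual tests, as a minimal sketch. This is plain Python, not booktest's API; the `(agent_id, passed)` result format, the `failing_agents` helper, and the 0.9 threshold are all assumptions made up for illustration:

```python
from collections import defaultdict


def failing_agents(results, min_pass_rate=0.9):
    """Return {agent_id: pass_rate} for agents below the threshold.

    `results` is an iterable of (agent_id, passed) pairs, a hypothetical
    format standing in for whatever the evaluation run produces.
    """
    totals = defaultdict(lambda: [0, 0])  # agent_id -> [passed, total]
    for agent_id, passed in results:
        totals[agent_id][0] += int(passed)
        totals[agent_id][1] += 1
    return {
        agent_id: passed / total
        for agent_id, (passed, total) in totals.items()
        if passed / total < min_pass_rate
    }


results = [
    ("agent-1", True),
    ("agent-1", True),
    ("agent-2", True),
    ("agent-2", False),
]
flagged = failing_agents(results)
print(flagged)
```

With something like this, a human would only look at the flagged agents rather than at every changed snapshot.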