(no title)
arauhala | 28 days ago
This doesn't only save costs: the main goal was to force determinism and save time. Limited changes may need only the new/changed tests to be rerun against the LLM. CI typically doesn't have LLM API keys and only reruns against snapshots, with zero cost and no delays.
All LLM operations tend to be notoriously slow, and at least on our side we are often more interested in how our code interacts with the LLM. Having the LLM fully snapshotted makes iterating on the code delightfully fast.
If you want to do sampling, this can be implemented in the test code. Booktest is a bit like pytest in the sense that the heavy lifting of the actual testing logic is left to the developer. A lot of LLM test suites are more opinionated, but also more intrusive in that sense.
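For anyone curious what record/replay snapshotting looks like in principle, here is a minimal sketch. This is a generic illustration, not booktest's actual API; `snapshot_completion`, the snapshot directory layout, and the `call_llm` parameter are all made up for the example:

```python
import hashlib
import json
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")  # hypothetical location; booktest's real layout differs

def snapshot_completion(prompt, call_llm=None):
    """Return an LLM response, recording it on first use and replaying after.

    If a snapshot exists for this prompt it is replayed: deterministic,
    zero cost, no network delay. Otherwise `call_llm` is invoked once and
    the response is stored. A CI run without API keys passes call_llm=None
    and simply fails loudly if a snapshot is missing.
    """
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = SNAPSHOT_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    if call_llm is None:
        raise RuntimeError("no snapshot for prompt and no LLM available (CI mode)")
    response = call_llm(prompt)
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response
```

The snapshot files being plain JSON keyed by a prompt hash is what makes the Git/DVC storage and PR-review workflow mentioned below possible: changed behavior shows up as changed files.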
clawsyndicate | 28 days ago
[deleted]
arauhala | 28 days ago
That sounds like an absolutely massive scale to do QA over. Running the test suite must cost a fortune and take ages.
As such, I don't think data storage is that big a problem. E.g. if you have 10-100 requests stored per agent, that's 100k-1M snapshots. Booktest normally stores the states in Git, but there is also some kind of DVC support. If you need to recreate all snapshots regularly, e.g. if you change some system-wide properties often, that may or may not become a problem. Git and DVC can manage quite high scale, though Git PR reviews won't work with e.g. 1M changed files.
Our scale in a monorepo was maybe 10k LLM snapshots across several hundred files, which worked technically. Recreating all evaluations was a bit slow (e.g. 5-10 minutes), and merges often forced recreation of snapshots and re-reviews. This is ofc a massively smaller scale than yours. We did use booktest for an assistant, but it was just one assistant over dozens of use case flows.
I guess it could be somehow manageable if you try to avoid test-level manual review and only review some aggregated results in the tool, but I can't really promise anything. It may be worth a try.