jryio | 17 days ago
If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high-token-throughput inference seems like a needed improvement.
Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1]; a rough sketch of that kind of batch call is below.
Also, I expect both Nvidia and Google to deploy custom silicon for inference [2].
1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...
2: https://www.tomshardware.com/tech-industry/semiconductors/nv...
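
For concreteness, a minimal sketch of queueing one low-priority task through the Anthropic Python SDK's Message Batches endpoint (the model ID, custom_id, and prompt are illustrative, not from the plugin itself):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Queue a low-priority edit as a batch request; custom_id lets you
    # match the result back to the originating task later.
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": "refactor-auth-module",  # illustrative task id
                "params": {
                    "model": "claude-sonnet-4-20250514",  # illustrative model
                    "max_tokens": 4096,
                    "messages": [
                        {
                            "role": "user",
                            "content": "Refactor auth.py to use the new session abstraction.",
                        }
                    ],
                },
            }
        ]
    )
    print(batch.id, batch.processing_status)  # e.g. "in_progress"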
zozbot234 | 17 days ago
Overall, batches do have quite a bit of potential for agentic work as-is, but you have to cope with a single roundtrip through your local agent harness taking up to 24h.
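
A rough sketch of how a harness might cope with that window, assuming the same Anthropic SDK batch endpoints as above (the polling interval and helper name are illustrative):

    import time

    import anthropic

    client = anthropic.Anthropic()

    def wait_for_batch(batch_id: str, poll_seconds: int = 300) -> None:
        # Batches can take up to 24h, so poll at a coarse interval
        # rather than blocking the agent loop on a single request.
        while True:
            batch = client.messages.batches.retrieve(batch_id)
            if batch.processing_status == "ended":
                break
            time.sleep(poll_seconds)
        # Stream per-request results; each carries its submitted custom_id.
        for entry in client.messages.batches.results(batch_id):
            print(entry.custom_id, entry.result.type)  # succeeded / errored / expired / canceled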
Doohickey-d | 16 days ago
For me, flex processing works quite well for low-priority things, without the hassle of using the batch API. Usually the added latency is only a few seconds, so it still works in an agent loop (and requests that fail can be retried at the "normal" priority tier); see the sketch after the link.
https://developers.openai.com/api/docs/guides/flex-processin...
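
A minimal sketch of that pattern with the OpenAI Python SDK, using the chat completions service_tier parameter (the model choice, timeout, and fallback logic are illustrative assumptions):

    import openai

    client = openai.OpenAI()

    def flex_then_default(prompt: str) -> str:
        try:
            # Flex requests run slower and can be shed under load,
            # so give them a generous timeout.
            resp = client.with_options(timeout=900.0).chat.completions.create(
                model="o4-mini",  # flex is only offered on certain models
                messages=[{"role": "user", "content": prompt}],
                service_tier="flex",
            )
        except (openai.APITimeoutError, openai.RateLimitError):
            # Fall back to the normal priority tier, as described above.
            resp = client.chat.completions.create(
                model="o4-mini",
                messages=[{"role": "user", "content": prompt}],
            )
        return resp.choices[0].message.content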
dehugger | 17 days ago
I've had great success with it, and it speeds up development considerably at fairly minimal cost.
cheema33 | 17 days ago