jryio | 17 days ago
If 60% of the work is "edit this file with this content" or "refactor according to this abstraction", then low-latency, high-token-throughput inference seems like a needed improvement.
Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1]; a rough sketch of that kind of batch call is below.
Also, I expect both Nvidia and Google to deploy custom silicon for inference [2].
1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...
2: https://www.tomshardware.com/tech-industry/semiconductors/nv...
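
For concreteness, a minimal sketch of queueing one low-priority task through the Anthropic Python SDK's Message Batches endpoint (the model ID, custom_id, and prompt are illustrative, not from the plugin itself):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Queue a low-priority edit as a batch request; custom_id lets you
    # match the result back to the originating task later.
    batch = client.messages.batches.create(
        requests=[
            {
                "custom_id": "refactor-auth-module",  # illustrative task id
                "params": {
                    "model": "claude-sonnet-4-20250514",  # illustrative model
                    "max_tokens": 4096,
                    "messages": [
                        {
                            "role": "user",
                            "content": "Refactor auth.py to use the new session abstraction.",
                        }
                    ],
                },
            }
        ]
    )
    print(batch.id, batch.processing_status)  # e.g. "in_progress"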
zozbot234 | 17 days ago
Overall, batches do have quite a bit of potential for agentic work as-is, but you have to cope with a single roundtrip through your local agent harness taking up to 24h.
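
A rough sketch of how a harness might cope with that window, assuming the same Anthropic SDK batch endpoints as above (the polling interval and helper name are illustrative):

    import time

    import anthropic

    client = anthropic.Anthropic()

    def wait_for_batch(batch_id: str, poll_seconds: int = 300) -> None:
        # Batches can take up to 24h, so poll at a coarse interval
        # rather than blocking the agent loop on a single request.
        while True:
            batch = client.messages.batches.retrieve(batch_id)
            if batch.processing_status == "ended":
                break
            time.sleep(poll_seconds)
        # Stream per-request results; each carries its submitted custom_id.
        for entry in client.messages.batches.results(batch_id):
            print(entry.custom_id, entry.result.type)  # succeeded / errored / expired / canceled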
Doohickey-d | 16 days ago
For me, flex processing works quite well for low-priority things, without the hassle of using the batch API. Usually the added latency is only a few seconds, so it still works in an agent loop (and requests that fail can be retried at the "normal" priority tier); see the sketch after the link.
https://developers.openai.com/api/docs/guides/flex-processin...
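
A minimal sketch of that pattern with the OpenAI Python SDK, using the chat completions service_tier parameter (the model choice, timeout, and fallback logic are illustrative assumptions):

    import openai

    client = openai.OpenAI()

    def flex_then_default(prompt: str) -> str:
        try:
            # Flex requests run slower and can be shed under load,
            # so give them a generous timeout.
            resp = client.with_options(timeout=900.0).chat.completions.create(
                model="o4-mini",  # flex is only offered on certain models
                messages=[{"role": "user", "content": prompt}],
                service_tier="flex",
            )
        except (openai.APITimeoutError, openai.RateLimitError):
            # Fall back to the normal priority tier, as described above.
            resp = client.chat.completions.create(
                model="o4-mini",
                messages=[{"role": "user", "content": prompt}],
            )
        return resp.choices[0].message.content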
dehugger | 17 days ago
I've had great success with it, and it speeds up development considerably at fairly minimal cost.
cheema33 | 17 days ago