
Show HN: Stop Losing LangGraph Progress to 429 Errors

1 point | rjpruitt16 | 14 days ago | ezthrottle.network

Hey HN, I built this because I kept losing progress in LangGraph workflows when OpenRouter or OpenAI returned 429s.

The problem: You're 7 steps into an agent workflow. Step 7 hits a rate limit. Everything crashes. Restart from step 1.

Client-side retries don't help at scale:

- 100 workers all retry independently → retry storm
- Sequential fallbacks are slow (try OpenRouter, wait 5s, try Anthropic, wait 5s)
- No coordination across instances
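To make the retry-storm point concrete, here is a rough sketch (not taken from the project) of what happens when every worker backs off on the same schedule. It assumes all workers get their 429 at t=0 and use identical exponential backoff with no jitter; the function names and the 1s base delay are made up for illustration.

```python
from collections import Counter

def retry_schedule(start, base_delay, attempts):
    """Times (seconds) at which one worker sends a request.

    Assumption: plain exponential backoff, no jitter, every attempt fails.
    """
    times, t = [], start
    for attempt in range(attempts):
        times.append(t)
        t += base_delay * 2 ** attempt
    return times

def requests_per_instant(num_workers, base_delay=1, attempts=4):
    """Count how many requests land on the provider at each instant."""
    load = Counter()
    for _ in range(num_workers):
        # Every worker starts at t=0 and follows the same schedule.
        for t in retry_schedule(0, base_delay, attempts):
            load[t] += 1
    return dict(load)

load = requests_per_instant(100)
print(load)
```

Under those assumptions every retry wave arrives as a synchronized burst of 100 requests, which is the pattern a shared coordinator is meant to avoid.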

So I built a coordination layer that:

- Races multiple providers simultaneously (OpenRouter + Anthropic + OpenAI)
- Coordinates retries across all workers (no retry storms)
- Resumes workflows via webhooks (idempotent keys = checkpoints)
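For a feel of what "racing" means, here is a minimal client-side sketch using asyncio. It is not the service's implementation: `call_provider` is a hypothetical stub that sleeps instead of making an HTTP request, `RateLimited` stands in for a 429, and the delays are invented. The first provider to succeed wins and the rest are cancelled.

```python
import asyncio

class RateLimited(Exception):
    """Stand-in for an HTTP 429 from a provider (hypothetical)."""

async def call_provider(name, delay, fails):
    # Hypothetical stub: a real version would send the LLM request here.
    await asyncio.sleep(delay)
    if fails:
        raise RateLimited(name)
    return f"response from {name}"

async def race(calls):
    """Return the first successful result and cancel the remaining calls."""
    tasks = [asyncio.create_task(c) for c in calls]
    errors = []
    try:
        for fut in asyncio.as_completed(tasks):
            try:
                return await fut
            except RateLimited as exc:
                errors.append(exc)  # this provider lost; keep waiting
        raise RuntimeError(f"all providers failed: {errors}")
    finally:
        for t in tasks:
            t.cancel()  # no-op for tasks that already finished
        await asyncio.gather(*tasks, return_exceptions=True)

async def demo():
    return await race([
        call_provider("openrouter", 0.05, fails=True),
        call_provider("anthropic", 0.01, fails=False),
        call_provider("openai", 0.03, fails=False),
    ])

result = asyncio.run(demo())
print(result)
```

Racing trades extra requests for lower latency, so in practice it would need the cross-worker coordination described above to avoid multiplying load on every provider.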

It runs on Fly.io's anycast network + BEAM for distributed coordination. Architecture deep dive: https://www.ezthrottle.network/blog/making-failure-boring-ag... Happy to answer questions about the approach or why BEAM made this possible when other languages would struggle.

