How do you handle lost webhooks in production?

14 points| everydaydev | 3 months ago

I've worked at several companies where we'd discover hours later that critical webhooks from Stripe/Shopify never arrived (deployment, timeout, bug, etc.).

Every team ended up building the same solution: retry logic, dead letter queue, monitoring.
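For readers who have not built one, the retry-plus-dead-letter pattern can be sketched roughly as below. This is a minimal in-memory illustration, not any particular team's implementation: the `drain` function, the `max_attempts` parameter, and the `(event, attempts)` tuple layout are all assumptions made for the example, and a real system would persist the queue and wait between attempts.

```python
from collections import deque


def drain(queue, handler, max_attempts=3):
    """Process (event, attempts) pairs until the queue is empty.

    Events whose handler keeps failing are moved to a dead-letter list
    after max_attempts tries instead of being retried forever.
    """
    dead_letters = []
    while queue:
        event, attempts = queue.popleft()
        try:
            handler(event)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Keep the event and the last error for later inspection.
                dead_letters.append((event, repr(exc)))
            else:
                # Requeue at the back so other events are not blocked.
                queue.append((event, attempts))
    return dead_letters
```

The length of the returned dead-letter list is the natural thing to monitor: a non-empty list means some event needs a human or a replay.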

Curious how others handle this:

- Do you rely on the provider's retry policy?
- Built your own reliability layer?
- Use a service?
- Just manually reconcile when it happens?

(Context: Building https://relaehook.com to solve this, but genuinely curious what the norm is)

11 comments

renewiltord|3 months ago

Yeah, common problem. But trivial to solve. Just have a minimal webhook server that records the full request and returns 200. Then process async.

Trivial Go program, day’s work. Stick it in Postgres, run continuously.
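The record-then-acknowledge idea described above might look something like the following sketch. It is written in Python rather than Go, and it uses the standard-library `sqlite3` module as a stand-in for Postgres so it runs anywhere; the `inbox` table, the `receive` and `process_pending` functions, and the column names are hypothetical. The HTTP layer is left out: `receive` is what a request handler would call before replying.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    # sqlite3 stands in for Postgres here; the schema is illustrative only.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " headers TEXT NOT NULL,"
        " body BLOB NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def receive(db, headers, body):
    """Persist the raw request, then acknowledge. No business logic here."""
    db.execute(
        "INSERT INTO inbox (headers, body) VALUES (?, ?)",
        (json.dumps(headers), body),
    )
    db.commit()
    return 200  # returned only after the row has been committed


def process_pending(db, handler):
    """One pass of the processing loop; returns how many rows were handled."""
    rows = db.execute(
        "SELECT id, headers, body FROM inbox WHERE processed = 0 ORDER BY id"
    ).fetchall()
    done = 0
    for row_id, headers, body in rows:
        try:
            handler(json.loads(headers), body)
        except Exception:
            continue  # leave the row unprocessed; the next pass retries it
        db.execute("UPDATE inbox SET processed = 1 WHERE id = ?", (row_id,))
        db.commit()
        done += 1
    return done
```

Because the row is committed before the 200 goes out, a crash in the processing code never loses the payload; it just stays in the table until a later pass succeeds.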

Bizarrely there are vendors who are weird about webhooks. Lifefile, as an example, charges pharmacies a dollar per webhook firing. So the pharmacies are crappy about retry policy.

Tbh I wouldn’t buy any product in this space. It’s too simple with a dedicated HTTP server plus Postgres plus a processing loop. And with an already delicate thing, I would rather not introduce more vendors.

No, not even if you converted it into event queue via websocket or zmq or what have you.

everydaydev|3 months ago

Your approach works, and lots of teams do exactly that. The tradeoff is that you’re now on the hook for uptime, retries, backpressure, tooling, on-call, metrics, etc.

Relae exists for teams who’d rather outsource that operational surface, similar to why people use managed queues instead of running their own RabbitMQ. Not everyone needs it — but some prefer not to own that part of the stack.

super256|3 months ago

Ofc I rely on the retry policy. Stripe retries with exponential backoff for three days. If Stripe can't reach our endpoint in 3 days we probably went bankrupt or a solar flare ate IT.
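To make "exponential backoff for three days" concrete, here is a generic sketch of how such a schedule can be computed. The base delay, growth factor, and cap are assumptions chosen for illustration; this is not Stripe's actual schedule, only the three-day window is taken from the comment above.

```python
def backoff_schedule(base=60.0, factor=2.0, cap=6 * 3600.0,
                     window=3 * 24 * 3600.0):
    """Return the retry delays (in seconds) that fit inside the window.

    Each delay is the previous one multiplied by factor, capped at cap.
    Retries stop once the next delay would overrun the window.
    """
    delays = []
    elapsed = 0.0
    delay = base
    while elapsed + delay <= window:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * factor, cap)
    return delays
```

Real senders usually add random jitter to each delay so that many failed deliveries do not all retry at the same instant.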

everydaydev|3 months ago

Stripe does retries right, no argument there.

Where things get messy is when you have a mix of providers with wildly different retry behaviors, or internal services that have their own rate limits or downtime windows. A relay layer keeps the intake consistent even when the rest of the system isn’t.

samarthr1|3 months ago

Wait, so your product moves the point of failure from my infra to your infra?

Plus trusts y'all with contents of said webhook?

everydaydev|3 months ago

Fair question — we’re not eliminating failure so much as isolating it behind a system that’s purpose-built for durability. Our infra is built with redundant queues, retry pipelines, and observability you typically wouldn’t stand up for a single product team.

And on the data side, we don’t use webhook payloads for anything other than delivery. They’re encrypted at rest and in transit, and automatically purged based on retention settings.

nickphx|3 months ago

Yeaaaaaaaaaaaaah.. I am not sure adding an additional third party and point of potential failure would help mitigate the issue of receiving data from third parties... but good luck.

everydaydev|3 months ago

Fair point. The value isn’t in reducing the number of components, it’s in swapping a fragile one (your app endpoint) for something built specifically to stay up, queue, retry, and give you visibility when the rest of your stack isn’t. Plenty of other products on the market offer similar services.

journal|3 months ago

anomaly detection, checks to make sure something is still happening.
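One simple form of the "is something still happening" check is a staleness test: record when each source last delivered a webhook and flag any source that has been quiet longer than expected. The sketch below assumes hypothetical `last_seen` and `max_gaps` dictionaries keyed by source name; the thresholds are whatever is normal for your traffic.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen, max_gaps, now=None):
    """Return the sources whose latest webhook is older than allowed.

    last_seen maps a source name to the timestamp of its last webhook.
    max_gaps maps the same names to the longest acceptable silence.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name for name, ts in last_seen.items()
        if now - ts > max_gaps[name]
    )
```

Run on a schedule, a non-empty result is the signal to alert or to kick off a reconciliation against the provider.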

phillipseamore|3 months ago

svix.com

everydaydev|3 months ago

Svix is a solid managed webhook solution, and their platform is clearly geared toward enterprise teams. For smaller teams or startups, the same reliability patterns (durable delivery, retries, replay) are just as valuable but usually have to come at a lower price point. That’s where products like Relae aim to make sense: providing similar operational guarantees in a way that’s more accessible for non-enterprise use cases.