> we accidentally created an infinite event loop between two Lambdas. Racked up a several-hundred-thousand dollar bill in a couple of hours

May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?

Btw, this reminds me a lot of one of my own early career screw-ups, where I had a batch job uploading images that was set up with unlimited retries. It failed halfway through, and the unlimited retries caused it to upload the same three images 100,000 times each. We emailed Cloudinary, the image CDN we were using, and they graciously forgave the costs we had incurred for my mistake.
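For what it's worth, the fix for the unlimited-retries failure mode is usually just a hard cap on attempts. Here is a minimal sketch in Python, assuming a generic `operation` callable; the names `retry_with_cap` and `RetriesExhausted` are made up for illustration and are not from any library or from the original batch job.

```python
import time


class RetriesExhausted(Exception):
    """Raised when an operation still fails on the final allowed attempt."""


def retry_with_cap(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts
    (exponential backoff). `sleep` is injectable so tests need not wait.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                # Stop here instead of retrying forever.
                raise RetriesExhausted(
                    f"gave up after {attempt} attempts"
                ) from exc
            sleep(base_delay * 2 ** (attempt - 1))
```

A cap alone would not stop a job from re-uploading files it had already finished, so you would probably also want to record which items completed, but it does bound the worst case.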

calmlynarczyk|4 years ago

> May I ask how you dealt with this? Were you able to explain it to Amazon support and get some of these charges forgiven? Also, how would you recommend monitoring for this type of issue with Lambda?

AWS support caught it before we did, so they did something on their end to throttle the Lambda invocations. We asked for billing forgiveness from them; last I heard that negotiation was still ongoing over a year after it occurred.

Part of the problem was that we had temporarily disabled our billing alarms at the time for some reason, which caused our team to miss this spike. We've since enabled alerts on both billing and Lambda invocation counts to see if either goes outside normal thresholds. That still doesn't hard-stop this from occurring again, but we at least get notified before it gets as bad as it did. I don't think we've ever found a solution to cut off resource usage if something like this is detected.
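An invocation-count alert like that could look roughly like the sketch below. It builds the arguments for CloudWatch's `put_metric_alarm` call on the `AWS/Lambda` `Invocations` metric. The function names, the alarm name format, the threshold and the SNS topic ARN are all assumptions for illustration, not our actual setup.

```python
def invocation_alarm_params(function_name, threshold, sns_topic_arn,
                            period_seconds=60):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when the function's summed Invocations in one
    period exceed `threshold`. All values here are illustrative.
    """
    return {
        "AlarmName": f"{function_name}-invocation-spike",
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # No data means the function was idle, which is not an alarm.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, params):
    """Create or update the alarm using a boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**params)


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   params = invocation_alarm_params("my-function", 10000, "<sns-topic-arn>")
#   create_alarm(boto3.client("cloudwatch"), params)
```

The threshold has to come from what normal traffic looks like for that function, so treat the number in the usage comment as a placeholder.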

genewitch|4 years ago

Earlier in the week there were threads about how AWS will never implement resource blocking like you're talking about, because big companies don't want to be shut off in the middle of a spike of traffic, small companies don't pay enough money, and it's not like it hurts Amazon's bottom line.

BackBlast|4 years ago

We use memory safe languages, type safe languages. AWS is not fundamentally billing safe.

Just to give you nightmares. There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

I don't know how you monitor it; part of the issue is the sheer complexity. How do you know what to monitor? The billing page is probably the place to start - but it is too slow for many of these events.

I guess you could start with the common problems. Keep watchdogs on the number of Lambdas being invoked, and on any resource you spin up or that autoscales. Egress bandwidth is definitely another I'd watch.

Dunno, just seems to me you'd need to watch every metric and report any spikes to someone who can eyeball the system.
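To make "report any spikes" concrete, here is a rough sketch of the kind of check a watchdog could run over any per-minute metric series (invocations, egress bytes, whatever). It flags a sample when it exceeds a multiple of the average of the samples just before it. The function name, window size, factor and floor are arbitrary assumptions, and a real detector would need tuning per metric.

```python
from statistics import mean


def find_spikes(samples, window=5, factor=3.0, min_baseline=1.0):
    """Return indexes of samples that exceed factor * rolling baseline.

    The baseline for sample i is the mean of the `window` samples
    before it, floored at `min_baseline` so a previously idle metric
    that suddenly becomes active is still flagged.
    """
    spikes = []
    for i in range(window, len(samples)):
        baseline = max(mean(samples[i - window:i]), min_baseline)
        if samples[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Per-minute invocation counts: steady around 100, then a runaway loop.
counts = [100, 110, 90, 105, 95, 100, 5000, 5200]
print(find_spikes(counts))  # [6, 7]
```

Note the baseline gets polluted by the spike itself after a few samples, so this only tells you when something starts going wrong, which is the part you want a human paged for anyway.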

For me? I limit my exposure to AWS as much as I reasonably can. The possibilities, combined with the known nightmare scenarios and a "recourse" that isn't always effective, don't make for good sleep at night.

aynsof|4 years ago

> There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

AWS Shield Advanced actually offers DDoS cost protection to mitigate this specific risk: https://aws.amazon.com/shield/features/

rileymat2|4 years ago

> There's been DDoS in the news lately, I'm surprised nobody has yet leveraged those bot nets to bankrupt orgs they don't like who use cloud autoscaling services.

That's interesting because it seems like it would happen, but what is in it for the attacker, when under threat they can implement caps?

jamesfinlayson|4 years ago

I think you're limited to 1,000 concurrent Lambda invocations by default anyway. That said, it's not easy to get an overview of what's going on in an AWS account (except through Billing, but I don't know how up to the moment that is).
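You can also set a lower cap per function with reserved concurrency, and setting it to 0 throttles every invocation of that function, which works as a crude kill switch. A minimal sketch using the Lambda API's `put_function_concurrency` call follows; the wrapper names `cap_concurrency` and `emergency_stop` are my own, and wiring this to an alarm is left as an assumption.

```python
def cap_concurrency(lambda_client, function_name, limit):
    """Set reserved concurrency for one function via a boto3 Lambda client.

    `limit` is the maximum number of concurrent executions allowed
    for this function; 0 throttles all invocations.
    """
    if limit < 0:
        raise ValueError("limit must be zero or greater")
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )


def emergency_stop(lambda_client, function_name):
    """Throttle the function completely by reserving zero concurrency."""
    cap_concurrency(lambda_client, function_name, 0)


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   emergency_stop(boto3.client("lambda"), "my-runaway-function")
```

It stops the function rather than the bill as a whole, so anything else the loop touched keeps costing money until you deal with it separately.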

jamesfinlayson|4 years ago

I've been able to get AWS support to waive fees for a runaway Lambda that no one spotted for a few weeks - they wanted an explanation of what happened and a mitigation strategy from us and that was it. It is still unresolved because AWS wants us to pay the bill so they can then issue a credit but the company credit card doesn't have a high enough limit to cover the bill.