
I want to have an AWS region where everything breaks with high frequency

792 points | caiobegotti | 5 years ago | twitter.com | reply

185 comments

[+] jedberg|5 years ago|reply
For those saying "Chaos Engineering", first off, the poster is well aware of Chaos Engineering. He's an AWS Hero and the founder of Tarsnap.

Secondly, this would help make CE better. I actually asked Amazon for an API to do this ten years ago when I was working on Chaos Monkey.

I asked for an API to do a hard power off of an instance. To this day, you can only do a graceful power off. I want to know what happens when the instance just goes away.

I also asked for an API to slow down networking, set a random packet drop rate, EBS failures, etc. All of these things can be simulated with software, but it's still not exactly the same as when it happens outside the OS.

Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!
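
The in-OS approximations of the network faults above usually boil down to Linux tc/netem; a minimal sketch, assuming a Linux guest with an eth0 interface and root (and, as noted, this still happens inside the OS rather than outside it):

```shell
# In-OS approximation of "slow down networking + random packet drop"
# with tc/netem (assumes a Linux guest, an eth0 interface, and root).
tc qdisc add dev eth0 root netem delay 200ms 50ms loss 5%   # ~200ms +/- 50ms latency, 5% drops

# ... run the experiment ...

tc qdisc del dev eth0 root   # restore normal networking
```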

[+] dmurray|5 years ago|reply
> For those saying "Chaos Engineering", first off, the poster is well aware of Chaos Engineering. He's an AWS Hero and the founder of Tarsnap.

Yeah, but did he win the Putnam?

[+] breatheoften|5 years ago|reply
Fascinating!

The mere existence of such an API would be an interesting source of problems when used accidentally or via buggy code ...

I wonder to what degree this functionality would end up being relied on as an "in the worst case, hard-kill things to recover" behavior that folks use for bad engineering reasons as opposed to good ones ...

[+] aloknnikhil|5 years ago|reply
> I asked for an API to do a hard power off of an instance. To this day, you can only do a graceful power off. I want to know what happens when the instance just goes away.

Wouldn't just running "halt -f" do the same?

[+] DonHopkins|5 years ago|reply
I want a stress testing feature where I can accidentally misconfigure an instance or resource somehow, then when I get an unexpectedly $10,000 higher bill at the end of the month, I can declare that I was only testing, and I don't have to pay it!
[+] asdff|5 years ago|reply
I wonder how much you would have to pay Amazon for them to send a tech down to the datacenter and pull the plug on your running node
[+] hiyer|5 years ago|reply
Today you can use spot block instances for this. They are guaranteed to die off after your chosen time block of 1-6 hours.
[+] eru|5 years ago|reply
Nearly a decade ago I was working on Xen (or rather XenServer). We had lots of fun implementing these kinds of wonky devices for the virtual machines that would randomly drop network packets or fail to read the hard disk in arbitrarily bad ways.
[+] exdsq|5 years ago|reply
Dumb question here from a CE beginner but can’t you have a Docker image for that service and turn it off?
[+] peterwwillis|5 years ago|reply
> Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!

  # cat > dropme.sh <<'EOFILE'
  #!/bin/sh
  # Read a duration from the client, save the firewall state, drop all
  # traffic for that long, then restore the saved state.
  set -eu
  read -r SLEEP
  tmp=$(mktemp) ; tmp6=$(mktemp)
  iptables-save > "$tmp"
  ip6tables-save > "$tmp6"
  for t in iptables ip6tables ; do
    for c in INPUT OUTPUT FORWARD ; do $t -P $c DROP ; done
    $t -t nat -F ; $t -t mangle -F ; $t -F ; $t -X
  done
  sleep "$SLEEP"
  iptables-restore < "$tmp"
  ip6tables-restore < "$tmp6"
  rm -f "$tmp" "$tmp6"
  EOFILE
  # chmod 755 dropme.sh
  # addr=$(ifconfig eth0 | grep 'inet ' | awk '{print $2}')
  # ncat -k -l -c ./dropme.sh "$addr" 12345 &
  # echo "60" | ncat -v "$addr" 12345
If you're lucky the existing connections won't even die, but the box will be offline for 60 seconds.
[+] pojzon|5 years ago|reply
Did you try to set up an on-premises Eucalyptus cloud for that? Eucalyptus has an API compatible with the AWS API.
[+] simonebrunozzi|5 years ago|reply
> Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!

And one day there will be PETA [0] for EC2 instances!

[0]: https://www.peta.org/

[+] dijit|5 years ago|reply
Isn’t us-east-1 exactly that?

All jokes aside, I actually asked my google cloud rep about stuff like this; they came back with some solutions but often the problem with that is, what kind of failure condition are you hoping for?

Zonal outage (networking)? Hypervisor outage? Storage outage?

Unless it’s something like S3 giving high error rates, most things can actually be done manually. (And this was the advice I got back, because faulting the entire set of APIs and tools in unique and interesting ways is essentially impossible.)

[+] caymanjim|5 years ago|reply
Yeah, us-east-1 is pretty good at failing already. We lost us-east-1c for most of the day about a week ago due to a fiber line being cut. I'd estimate that AWS manages fewer than "three 9s" in us-east-1 on average. Not across the board, but at any given time something has a decent chance of not working, be it an entire AZ, or regional S3, etc. They're still pretty reliable, and I like the idea of a zone with built-in failure for testing things, but your joke about us-east-1 is based in solid fact.
[+] londons_explore|5 years ago|reply
> Unless it’s something like s3 giving high error rates

Just firewall off the real s3, and point clients at a proxy which forwards most requests to the real s3 and returns errors or delays to the rest.

[+] FridgeSeal|5 years ago|reply
ap-southeast-2-b in my experience hahaha
[+] davidrupp|5 years ago|reply
[Disclaimer: I work as a software engineer at Amazon (opinions my own, obvs)]

The chaos aspect of this would certainly increase the evolutionary pressure on your systems to get better. You would need really good visibility into what exactly was going on at the time your stuff fell over, so you could know what combination(s) to guard against next time. But there is definitely a class of problems this would help you discover and solve.

The problem with the testing aspect, though, is that test failures are most helpful when they're deterministic. If you could dictate the type, number, and sequence of specific failures, then write tests (and corresponding code) that help make your system resilient to that combination, that would definitely be useful. It seems like "us-fail-1" would be more helpful for organic discovery of failure conditions, less so for the testing of specific conditions.

[+] gregdoesit|5 years ago|reply
When I worked at Skype / Microsoft and Azure was quite young, the Data team next to me had a close relationship with one of the Azure groups who were building new data centers.

The Azure group would ask them to send large loads of data their way, so they could get some "real" load on the servers. There would be issues at the infra level, and the team had to detect and respond to them. In return, the data team would ask the Azure folks to just unplug a few machines - power them off, take out network cables - helping them test what happens.

Unfortunately, this was a one-off, and once the data center was stable, the team lost this kind of "insider" connection.

However, as a fun fact, at Skype we could use Azure for free for about a year - every dev in the office, for work purposes (including work pet projects). We spun up way too many instances during that time, as you'd expect, and only got around to turning them off when Azure changed billing to charge internal customers 10% of the "regular" pricing.

[+] eru|5 years ago|reply
When I was at Google, as a developer you officially got unlimited space in the internal equivalent of Google Drive.

I always wondered how many people got some questions from the storage team, if they really needed all those exabytes.

[+] bob1029|5 years ago|reply
It sounds to me like what some people want is a magical box they can throw their infrastructure into that will automatically shit-test everything that could potentially go wrong for them. This is poor engineering. Arbitrary, contrived error conditions do not constitute a rational test fixture. If you are not already aware of where failures might arise in your application and how to explicitly probe those areas, you are gambling at best. Not all errors will generate stack traces, and not all errors will be detectable by your users. What one application considers an error condition may be a completely acceptable outcome for another.

This is the reliability engineering equivalent of building a data warehouse when you don't know what sorts of reports you want to run or how the data will generally be used after you collect it.

[+] cogman10|5 years ago|reply
I disagree.

Not handling failures correctly is a time honored tradition in programming. It is so easy to miss.

For example, how often have you seen a malloc check for `ENOMEM`?

Even though that's something that could be semi-common, and definitely something you might be able to handle, most code will simply blow chunks when that sort of condition happens. Is the person that wrote it "wrong"? That's debatable.

Some languages like Go make it even trickier to detect that someone forgot to handle an error condition. Nothing obvious in the code review (other than knowledge of the API in question) would let even someone senior catch those sorts of issues.

So the question is, HOW do you catch those problems?

The answer seems obvious to me: you simulate problems in integration tests. What happens when Service X simply disappears? What happens when a server restarts mid-communication? Is everything handled, or does this put the apps into a non-recoverable state?

These are all great infrastructure tests that can catch a lot of edge-case problems that may have been missed in code review. Even better, that sort of infrastructure testing can be generalized to apply to many applications. Making rare events common in an environment makes it a lot easier to catch the hard-to-notice bugs that everyone writes.

It's basically just Fuzz testing but for infrastructure. Fuzz testing has been shown to have a ton of value, infrastructure fuzzing seems like a natural valuable extension of that. Especially when high reliability and low maintenance is something everyone should want.
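
A crude way to script "Service X simply disappears" in an integration environment (root assumed; 5432 stands in for whatever port the dependency uses):

```shell
# Make a dependency vanish for the duration of a test run.
iptables -I OUTPUT -p tcp --dport 5432 -j DROP     # blackhole: clients see hangs/timeouts
# ./run_integration_tests.sh                       # placeholder for your test suite
iptables -D OUTPUT -p tcp --dport 5432 -j DROP     # restore connectivity
```

Swapping DROP for REJECT approximates a crashed process (instant "connection refused") rather than a vanished host - and clients often handle those two failures very differently.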

[+] irjustin|5 years ago|reply
Okay I'll bite - I have problems with this line of thinking.

You're right that you'll never be able to cover 100% of all cases, but using that logic, your specs will never test 100% of scenarios, so you shouldn't write specs.

I think the problem is you assumed it was "a magical box that people can rely to cover 100% of test cases of".

That's a poor leap. It's clearly not. It's a good system to test a set of network failures at varying degrees. Just like any engineered system, it needs to be documented what it can and cannot do.

I also eagerly await my do it all, 100% magic box.

[+] riskneutral|5 years ago|reply
> building a data warehouse when you don't know what sorts of reports you want to run or how the data will generally be used after you collect it.

Hi Bob, I can't tell you what reports I want or what we'll do with the data until you've first collected the data for analysis. Thanks!

[+] ben509|5 years ago|reply
I don't see a us-fail-1 region being set up for a number of reasons.

One, this is not how AWS regions are designed to work. What they're thinking of is a virtual region with none of its own datacenters, but AWS has internal assumptions about what a region is that are baked into their codebase. I think it would be a massive undertaking to simulate a region like this.

(I don't think a fail AZ would work either, arguably it'd be worse because all the code that automatically enumerates AZs would have to skip it, which is going to be all over the place.)

Two, set up a region with deliberate problems, and idiots will run their production workload in it. It doesn't matter how many banners and disclaimers you set up on the console, they'll click past them.

When customer support points out they shouldn't be doing this, the idiot screams at them, "but my whole business is down! You have to DO something!" This would be a small number of customers, but the support guys get all of them.

Three, AWS services depend on other AWS services. There are dozens of AWS services, each like little companies with varying levels of maturity. They ought to design all their stuff to gracefully respond to outages, but they have business priorities and many services won't want to set up in us-fail-1. When a region adds special constraints, it has a high likelihood of being a neglected region like GovCloud.

[+] falcolas|5 years ago|reply
I don't work with the group directly, but one group at our company has set up Gremlin, and the breadth and depth of outages Gremlin can cause is pretty impressive. Chaos Testing FTW.
[+] robpco|5 years ago|reply
I’ve also had a customer who used Gremlin to dramatically improve their stability.
[+] jiggawatts|5 years ago|reply
Along the same vein, instead of the typical "debug" and "release" configurations in compilers, I'd love it if there was also an "evil" configuration.

The evil configuration should randomise anything that isn't specified. No string comparison type selected? You get Turkish. All I/O and networking operations fail randomly. Any exception that can be thrown, is, at some small rate.

Or to take things to the next level, I'd love it if every language had an interpreted mode similar to Rust's MIR interpreter. This would tag memory with types, validate alignment requirements, enforce the weakest memory model (e.g.: ARM rules even when running on Intel), etc...

[+] msla|5 years ago|reply
A zone not only of sight and sound, but of CPU faults and RAM errors, cache inconsistency and microcode bugs. A zone of the pit of prod's fears and the peak of test's paranoia. Look, up ahead: Your root is now read-only and your page cache has been mapped to /dev/null! You're in the Unavailability Zone!
[+] missosoup|5 years ago|reply
That region is called Microsoft Azure. It will even break the control UI with high frequency.
[+] llama052|5 years ago|reply
I was going to post this but you beat me to it.

We are forced to use Azure for business reasons where I work, and the frequency of one off failures and outages is insane.

[+] moooo99|5 years ago|reply
Thank you, this is the exact comment I was looking for
[+] rob-olmos|5 years ago|reply
I imagine AWS and other clouds have a staging/simulation environment for testing their own services. I seem to recall them discussing that for VPC during re:Invent or something.

I'm on the fence though if I would want a separate region for this with various random failures. I think I'd be more interested in being able to inject faults/latencies/degradation in existing regions, and when I want them to happen for more control and ability to verify any fixes.

Would be interesting to see how they price it as well. High per-API cost depending on the service being affected, combined with a duration. Eg, make these EBS volumes 50% slower for the next 5min.

Then after or in tandem with the API pieces, release their own hosted Chaos Monkey type service.

[+] bigiain|5 years ago|reply
Show HN! Introducing my new SPaaS:

Unreliability.io - Shitty Performance as a Service.

We hook your accounting software up to api.unreliability.io and when a client account becomes delinquent, our platform instantly migrates their entire stack into the us-fail-1 region. Automatically migrates back again within 10 working days after full payment has cleared - guaranteed downtime of no less than 4 hours during migration back to production region. Register now for a 30 day Free Trial!

[+] kentlyons|5 years ago|reply
I want this at the programming language level too. If a function call can fail, I want to set a flag and have it (randomly?) fail. I hacked my way around this by adding a wrapper that would randomly err for a bunch of critical functions. It was great for working through a ton of race conditions in golang with channels, remote connections, etc. But hacking it in manually was annoying and not something I'd want to commit.
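
At the shell level the same idea fits in a few lines; a hedged sketch, where `maybe_fail` and `FAULT_RATE` are made-up names rather than any real tool:

```shell
# maybe_fail: run a command, but fail it roughly FAULT_RATE percent of the time.
# (Illustrative only; the slight bias from 256 % 100 is fine for a fault injector.)
maybe_fail() {
  roll=$(( $(od -An -N1 -tu1 /dev/urandom) % 100 ))
  if [ "$roll" -lt "${FAULT_RATE:-0}" ]; then
    echo "maybe_fail: injected failure for: $*" >&2
    return 1
  fi
  "$@"
}

# e.g. FAULT_RATE=30 maybe_fail curl -s http://service/health
```
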
[+] imhoguy|5 years ago|reply
Failing individual computes isn't hard - some chaos script to kill VMs is enough. Worst are the situations where things seem to be up but aren't behaving acceptably: abnormal network latency, random packet drops, random but repeatable service errors, lagging eventual consistency. Not even mentioning any hardware woes.
[+] vemv|5 years ago|reply
While these are not exclusive, personally I'd look instead into studying my system's reliability in a way that is independent of a cloud provider, or even of performing any side-effectful testing at all.

There's extensive research and prior work on all things resilience. One could say: if one builds a system that is proven to be theoretically resilient, that model should extrapolate to real-world resilience.

This approach is probably intimately related to pure-functional programming, which I feel has not been explored enough in this area.

[+] terom|5 years ago|reply
There are multiple methods for automating AWS EC2 instance recovery for instances in the "system status check failed" or "scheduled for retirement event" cases.

I've yet to figure out how to test any of those CloudWatch alarms/rules, though. I've had them deployed in my dev/test environments for months now, after having to manually deal with a handful of these events in a short time period. They've yet to trigger even once.

Umbrellas when it's raining etc.
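
One knob that at least exercises the alarm-to-action plumbing (though not the underlying detection) is forcing the alarm state via the CLI; the alarm name below is hypothetical, and whether a given action fires on a forced transition is worth verifying for your setup:

```shell
# Flip an alarm into ALARM to fire its actions; CloudWatch sets it back
# based on real data at the next evaluation. Needs AWS credentials.
aws cloudwatch set-alarm-state \
  --alarm-name "ec2-system-status-check-failed" \
  --state-value ALARM \
  --state-reason "manual test of recovery automation"
```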

[+] wmf|5 years ago|reply
This is why it seems like it would be good to have explicit fault injection APIs instead of assuming that the normal APIs behave the same as a real failure.
[+] haecceity|5 years ago|reply
Why does Twitter often fail to load when I open a thread and if I refresh it works. Does Twitter use us-fail-1?
[+] caymanjim|5 years ago|reply
I don't know why, but it happens to everyone and it's been that way for a long time. Either their engineers are failing, or there's some sketchy monetary reason for it. You're not the only one.
[+] saagarjha|5 years ago|reply
I think they don't like browsers they can't fingerprint, or something like that.
[+] mschuster91|5 years ago|reply
You mean the "Click here to reload Twitter" thing? That's spam protection, I hit this regularly after restarting Chrome (with ~600 tabs).
[+] georgewfraser|5 years ago|reply
I think people overestimate the importance of failures of the underlying cloud platform. One of the most surprising lessons of the last 5 years at my company has been how rarely single points of failure actually fail. A simple load-balanced group of EC2 instances, pointed at a single RDS Postgres database, is astonishingly reliable. If you get fancy and build a multi-master system, you can easily end up creating more downtime than you prevent when your own failover/recovery system runs amok.