arturhoo | 10 years ago

We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests.

As much as I admire and rely on AWS' scale to build architectures and fault-tolerant applications, it can't be ignored that the marketing towards going "full cloud" doesn't take into account how hard it is to build resilient architectures in the cloud.

I see those disruption events as stop signs: when the cloud itself fails to scale, I rethink a few of the decisions we all make when surfing those trends.

http://yourdatafitsinram.com/ also comes to mind.

LoSboccacc|10 years ago

infrastructure is hard, and it gets exponentially harder with the number of nodes you need to scale.

that said, even with those disruptions and whatnot happening on Amazon as a warning, I have neither the skill nor the time to build a resilient non-cloud infrastructure.

I was looking to go with redundant VPSes at first, because Amazon does have a high cost for us. however, just learning all the things that can go wrong in the very first part, the load balancer, and all the gritty details one has to consider for just this little component to support interruption-free failover, made me rethink the cost-benefit of going managed.

it is true that going cloud doesn't remove outage risk completely, and it will not be as resilient as an infrastructure built with skill and love by the best out there, but how many shops can actually roll their own solution and get an equivalent level of availability?

scaling web nodes is within my capabilities. building an HA database is already quite above my skill, but I may manage. testing database failover, making sure it works, making sure that it can actually recover from one node dying and that the application stays live meanwhile? that's way above what I can reasonably do and what my company can afford to pay maintenance for.

chillydawg|10 years ago

How is it any harder to do in the cloud than on a rack in a warehouse? At least you don't have to muck about with cables and phoning power companies up.

antirez|10 years ago

Just an example: during the issue even people serving 10 ops/sec, but very important 10 ops/sec, were affected by a huge overall load which was, for the most part, not theirs. It's true that when you "go cloud" you don't have to manage your operations, but you are basically putting everything in the hands of other ops people, and what happens to you depends on a much wider set of conditions.

So managing your stuff is hard, but you are in control and can do things in a way you believe is completely safe for you. Or you may at least run into the same events sometimes, but perhaps while paying a lot less for the same services. Or you can create your deployment with characteristics that are often impossible to make cost-effective in the cloud (a lot of RAM for each server is an example).

It's not stupid to use AWS services, but it's not stupid to manage your own operations either, whether on your own hardware or at least using just the bare metal and/or virtual machine services certain providers give you, while still being in part accountable, responsible, and in control of your system software deployment and operations.

toomuchtodo|10 years ago

I used to do infrastructure on physical hardware, and we'd go years without an outage sometimes (generators in the datacenter, diesel fuel contracts, redundant fiber providers using BGP). Doing it in the cloud is harder, because you're at the mercy of the provider when things go south, and you have no transparency into why it went wrong except what they're willing to publish. Why did it happen? Will it happen again?

I mean, you can argue that the cloud is better. But how often are Heroku and AWS down? About the same as physical providers (I concede S3 is pretty solid though).

nickpsecurity|10 years ago

You call up IBM. You ask for a mainframe solution for two sites. You get experts to set it up for you with your application and such. You don't worry about downtime again for at least 30 years.

You call up Bull, Fujitsu, or Unisys for the same thing.

You call up HP. You ask for a NonStop solution. You get the same thing for at least 20 years.

You call up VMS Software. You ask for an OpenVMS cluster. You get the same thing for at least 17 years.

Well-designed OSes, software, and hardware did cloud-style stuff for a long time before cloud existed, without the downtime. Cloud certainly brought the price down and flexibility up. Yet these clouds haven't matched 1970s-80s technology in uptime yet, despite all the brains and money thrown at them. That's a fact.

So, they shouldn't be used for anything mission critical where downtime costs lots of money.

bbrazil|10 years ago

Given they had ~300 minutes of outage in 3 years, you're looking at ~99.98% reliable in just that region. That's pretty good for a stateful serving system, and indeed you'd be pushed to do better.
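
The arithmetic behind that figure can be checked with a rough sketch like the one below. It assumes 365-day years and treats the ~300 minutes as the total downtime over the whole 3-year window; the function name `availability_percent` is just an illustrative choice, not something from the thread.

```python
# Back-of-the-envelope availability from total outage minutes.
# Assumption: 365-day years (leap days ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def availability_percent(outage_minutes: float, years: float) -> float:
    """Return availability as a percentage over the given number of years."""
    total_minutes = years * MINUTES_PER_YEAR
    return 100.0 * (1.0 - outage_minutes / total_minutes)


# ~300 minutes of outage over 3 years, as quoted above.
print(f"{availability_percent(300, 3):.2f}%")  # prints "99.98%"
```

Three years is about 1,576,800 minutes, so 300 minutes of outage is roughly 0.019% of the window, which is where the ~99.98% comes from.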