> People and processes you’ll have to manage to achieve SLAs like Amazon’s?

In reality, you can have almost any people and processes. The trick is to put your servers and data in more than one place. If a single server has uptime of just 99% (about 3.65 days of downtime a year) and you run it in 2 unrelated places, you will get 99.99% uptime, assuming the failures are independent. 3 places will give you six 9's. The only thing that has to be ensured by people and processes is graceful fallback.
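A quick sketch of the arithmetic behind those figures, assuming site failures are fully independent and that the service counts as up whenever at least one site is up. The function names are made up for illustration.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up."""
    # All sites must be down at once for the service to be down.
    return 1.0 - (1.0 - per_site) ** sites


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)
        print(f"{n} site(s): {a:.4%} up, {downtime_hours_per_year(a):.3f} h/year down")
```

Any correlation between sites (shared software, shared config, shared operators) makes the real number worse than this model suggests.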

Notice how I say uptime and not SLA. SLA just means that you will get a little bit of money back if uptime dips below the SLA level. Oh, and for EC2 it is just 99.95%. So, if you really care about your users, you will engineer your systems to stay up rather than hoping that a third-party provider's SLA will save you.

acdha|1 year ago

That assumes the only causes of failure are environmental. I’ve definitely seen plenty of hardware failures, but software failures are common too, and keeping things synchronized is going to require more than “any people and processes”. That’s how you learn your backup has never been tested and database replication stopped working 3 days before the failure.

pantulis|1 year ago

Not forgetting operational failures due to human mistakes when doing delicate stuff in complex environments. And setting up on-prem infra to work like a hyperscaler does... well, it's not easy.

bad416f1f5a2|1 year ago

All those points OP raised as difficult - physical space, staffing, capex, etc. - and your response is “yeah, now do it twice”.

spwa4|1 year ago

Well, dedicated servers (for which you can have a private cage, or VISA compliance, or ...) are a markup of, say, 30% over base cost, which is still 1/5th the cost of AWS. And even Hetzner will just deliver Kubernetes clusters these days.

These avoid all of the costs you were talking about.

michaelt|1 year ago

> In reality, you can have almost any people and processes.

You've never tried it, huh?

The reality is you will need some very specific processes.

You'll want a test environment, so you can make sure that proposed router reconfiguration actually does what it's supposed to do, and a process that says to use the test environment, and a process for keeping it in a consistent enough state that the tests are representative.

You'll want a process to make sure every production change can be reversed, and that an undo procedure has been figured out and tested before deployment. When that's impossible, you'll need careful review.

You'll want a process to make sure configuration changes are made in all three production data centres, avoiding the risk of a distracted employee leaving a change part-way rolled out.

But you can't roll out to all three sites at the same time: what if the change has some typo that breaks it? So you'll want a gradual process.
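As a rough sketch of what such a gradual, reversible process might look like in code: apply the change to one site at a time, check health, and undo everything applied so far if a check fails. The `apply`, `undo`, and `healthy` callables are hypothetical stand-ins for whatever tooling actually pushes and verifies the change.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    sites: Sequence[str],
    apply: Callable[[str], None],
    undo: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Roll a change out site by site; return the sites left changed.

    If any site fails its health check, every site touched so far is
    reverted in reverse order and an empty list is returned.
    """
    done: List[str] = []
    for site in sites:
        apply(site)
        done.append(site)
        if not healthy(site):
            # Revert newest first, so no site is left half-changed.
            for changed in reversed(done):
                undo(changed)
            return []
    return done


if __name__ == "__main__":
    changed = set()
    # Pretend the change breaks data centre B.
    result = staged_rollout(["A", "B", "C"], changed.add, changed.discard,
                            lambda s: s != "B")
    print(result, changed)
```

This only works if the undo step has itself been figured out and tested beforehand, which is the point of the earlier paragraph.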

You'll want to monitor the load on the three systems, to make sure if one goes down that the other two have enough capacity to take over the workload. You'll have to keep monitoring this, to keep ahead of user growth.
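A minimal sketch of that capacity check, assuming load and capacity are measured in the same unit (say, requests per second) and that a failed site's load is spread freely across the survivors. The function name and the numbers are invented for illustration.

```python
from typing import Dict


def survives_single_failure(
    load: Dict[str, float], capacity: Dict[str, float]
) -> Dict[str, bool]:
    """For each site, report whether the others could absorb the total load
    if that site went down."""
    total_load = sum(load.values())
    result = {}
    for down in capacity:
        remaining = sum(c for site, c in capacity.items() if site != down)
        result[down] = remaining >= total_load
    return result


if __name__ == "__main__":
    load = {"A": 40.0, "B": 40.0, "C": 40.0}
    capacity = {"A": 70.0, "B": 70.0, "C": 70.0}
    print(survives_single_failure(load, capacity))
```

Rerunning this with projected load instead of current load is one way to notice in time that more servers need ordering.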

Did I mention the user growth? Oh yeah, we're expecting a surge in demand just before Christmas. The extra servers we got last Christmas have absorbed our user growth, so we'll need more. Of course it'll take time to get them racked and set up, and there will be a lead time on getting them delivered, and of course a back-and-forth sales process. So of course we'll have to kick off the server ordering process in August.

Of course, there's a chance of a partial failover. What if the web servers are still working in all data centres, but the SQL server in data centre B has failed, while the replicas in A and C are fine? If there's a software hiccup you'll need to figure out who to call - yet another process...

traceroute66|1 year ago

Take off those rose-tinted cloud spectacles and put them in the nearest trash can, @michaelt!

> You'll want a test environment

You need that in the cloud too...

> You'll want a process to make sure every production change can be reversed

You need that in the cloud too...

> You'll want a process to make sure configuration changes

You need that in the cloud too...

> you'll want a gradual process.

You need that in the cloud too...

> You'll want to monitor

You need to do that in the cloud too...

> user growth / surge in demand

The problem with the cloud is everyone thinks they need to design for Google-scale from day zero.

Sure, the cloud providers don't mind: more money for them...

> there's a chance of a partial failover.

Could, and does, happen in the cloud too...