bananaquant|1 year ago
In reality, you can have almost any people and processes. The trick is to put your servers and data in more than one place. If a single server has 99% uptime (~3.5 days down per year) and you run servers in two unrelated places, you get 99.99% combined uptime; three places give you six 9's (assuming the failures are independent). The only thing that has to be ensured by people and processes is graceful fallback.
Notice how I say uptime and not SLA. SLA just means that you will get a little bit of money back if uptime dips below the SLA level. Oh, and for EC2 it is just 99.95%. So, if you really care about your users, you will engineer your systems to stay up rather than hoping that a third-party provider's SLA will save you.
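The availability math in the parent comment can be sketched in a few lines, assuming each site fails independently of the others (the whole argument rests on that assumption):

```python
def combined_availability(p: float, n_sites: int) -> float:
    """Probability that at least one of n independent sites is up,
    given per-site availability p."""
    return 1 - (1 - p) ** n_sites

# One site at 99%: ~3.65 days of downtime per year.
# Two independent sites: 99.99%. Three: 99.9999% ("six 9's").
for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    downtime_hours = (1 - a) * 365 * 24
    print(f"{n} site(s): {a:.6%} up, ~{downtime_hours:.2f} h down/year")
```

Note that correlated failures (shared DNS, shared deploy pipeline, the same bad config pushed everywhere) break the independence assumption, which is exactly what the replies below are getting at.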
spwa4|1 year ago
These avoid all of the costs you were talking about.
michaelt|1 year ago
You've never tried it, huh?
The reality is you will need some very specific processes.
You'll want a test environment, so you can make sure that proposed router reconfiguration actually does what it's supposed to do, and a process that says to use the test environment, and a process for keeping it in a consistent enough state that the tests are representative.
You'll want a process to make sure every production change can be reversed, and that an undo procedure has been figured out and tested before deployment. When that's impossible, you'll need careful review.
You'll want a process to make sure configuration changes are made in all three production data centres, avoiding the risk of a distracted employee leaving a change part-way rolled out.
But you can't roll out to all three sites at the same time: what if the change has a typo that breaks it? So you'll want a gradual rollout process.
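The staged-rollout-with-rollback process described above can be sketched like this (a minimal illustration; the site names, callbacks, and `staged_rollout` helper are hypothetical, not anything from the thread):

```python
def staged_rollout(sites, apply_change, revert_change, health_check):
    """Apply a change one site at a time; if a health check fails,
    revert every site already changed (in reverse order) and stop."""
    applied = []
    for site in sites:
        apply_change(site)
        applied.append(site)
        if not health_check(site):
            for done in reversed(applied):
                revert_change(done)
            return False  # rollout aborted and rolled back
    return True  # change is live everywhere

# Toy usage: site "B" fails its health check, so B then A are reverted.
changed, reverted = [], []
healthy = {"A": True, "B": False, "C": True}
ok = staged_rollout(["A", "B", "C"],
                    apply_change=changed.append,
                    revert_change=reverted.append,
                    health_check=lambda s: healthy[s])
print(ok, changed, reverted)  # False ['A', 'B'] ['B', 'A']
```

This is also why the "tested undo procedure" process matters: the rollback branch is useless if `revert_change` has never been exercised.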
You'll want to monitor the load on the three systems, to make sure if one goes down that the other two have enough capacity to take over the workload. You'll have to keep monitoring this, to keep ahead of user growth.
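The capacity check behind that monitoring can be sketched as an N-1 test: for each site, ask whether the surviving sites have enough headroom to absorb its load (the `survives_single_failure` helper and the numbers below are illustrative assumptions, not from the thread):

```python
def survives_single_failure(loads: dict, capacity: dict) -> bool:
    """Return True if, for every single-site failure, the remaining
    sites' spare capacity can absorb the failed site's load."""
    for failed in loads:
        shifted = loads[failed]
        headroom = sum(capacity[s] - loads[s] for s in loads if s != failed)
        if headroom < shifted:
            return False
    return True

# Three data centres each running at 50% of capacity: any one can fail.
loads = {"A": 50.0, "B": 50.0, "C": 50.0}
capacity = {"A": 100.0, "B": 100.0, "C": 100.0}
print(survives_single_failure(loads, capacity))  # True

# The same sites at 90% utilisation can no longer lose any one site.
print(survives_single_failure({k: 90.0 for k in loads}, capacity))  # False
```

As user growth pushes utilisation up, this check is what quietly flips from True to False, which is why it has to be re-run continuously rather than at design time.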
Did I mention the user growth? Oh yeah, we're expecting a surge in demand just before Christmas. The extra servers we got last Christmas have absorbed our user growth, so we'll need more. Of course it'll take time to get them racked and set up, there will be a lead time on getting them delivered, and a back-and-forth sales process before that. So we'll have to kick off the server ordering process in August.
Of course, there's a chance of a partial failover. What if the web servers are still working in all data centres, but the SQL server in data centre B has failed, while the replicas in A and C are fine? If there's a software hiccup you'll need to figure out who to call - yet another process...
traceroute66|1 year ago
> You'll want a test environment
You need that in the cloud too...
> You'll want a process to make sure every production change can be reversed
You need that in the cloud too...
> You'll want a process to make sure configuration changes
You need that in the cloud too....
> you'll want a gradual process.
You need that in the cloud too...
> You'll want to monitor
You need to do that in the cloud too....
> user growth / surge in demand
The problem with the cloud is everyone thinks they need to design for Google-scale from day zero.
Sure the cloud providers don't mind, more money for them ...
> there's a chance of a partial failover.
Could, and does, happen in the cloud too....