top | item 29078239

gnur | 4 years ago

> Less than 100% reliability is essential

This is actually a take most SREs would, and should, hold. Every added nine of reliability increases the cost exponentially. Finding the correct level of reliability is something most companies should spend more time on, because sometimes a single physical machine that might go down once a year for a few hours is perfectly capable of providing all the resources a medium-sized business needs. Proper backups, monitoring, and recovery runbooks can even cut the downtime of such a simple system to minutes, while easily saving you thousands per month.
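The "each added nine" framing is easy to make concrete with a few lines of arithmetic. This is a sketch, not anything computed in the thread itself:

```python
# How much downtime per year each availability target actually permits.
HOURS_PER_YEAR = 365.25 * 24  # ~8766

def downtime_hours(availability: float) -> float:
    """Hours of downtime per year allowed by an availability target."""
    return (1 - availability) * HOURS_PER_YEAR

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{target:.4%} uptime -> {downtime_hours(target):6.2f} h/year of downtime")
```

Two nines leaves roughly 88 hours a year of slack; four nines leaves under one hour. Each extra nine shrinks the allowance tenfold, while the engineering cost to meet it tends to grow much faster than that.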

sokoloff|4 years ago

I was surprised by the difficulty of getting a company to accept a target of "three nines five" (0.9995) at a time when they were growing rapidly and launching new physical and digital products on a rapid and continuous basis. I prevailed, but what I expected would be a five-minute conversation took a couple of 45-minute discussions (reducing the work uptime of the people in those discussions to 0.9993 for the year... :) )

Slowing your young company down in order to turn 0.9995 to 0.9998 is almost always a terrible trade. Even turning 0.995 to 0.999 is hard to justify in most places. (That improvement saves about 35 hours of downtime per year.)

auggierose|4 years ago

Is there a rigorous framework to arrive at those targets? How do you know what you built has 0.9995 uptime, and not just 0.99?

rglullis|4 years ago

Around 2012-2013 I was working on an online education platform. We had a whole web application that would serve video content, collect student answers, and analyze student progress in (near) real time to decide the next action for the student - e.g., if a student started getting questions wrong that they had been getting right before, we'd take it as a sign of fatigue and recommend a break. Or if a student showed they had mastered a topic, we would jump ahead in the lesson to something else that needed more work.
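The comment doesn't give the actual rules or thresholds, but the kind of heuristic described can be sketched in a few lines. Everything here (the window size, the 0.8/0.5 cutoffs, the action names) is a hypothetical illustration, not the real system:

```python
# Hypothetical sketch of a fatigue/mastery heuristic like the one described.
# All thresholds and names are invented for illustration.

def next_action(recent_results: list[bool], window: int = 5) -> str:
    """Pick the next step from the student's answer history.

    recent_results: True = correct, ordered oldest -> newest.
    """
    recent = recent_results[-window:]
    if len(recent) < window:
        return "continue"            # not enough signal yet
    correct = sum(recent)
    if correct == window:
        return "advance"             # topic looks mastered, jump ahead
    earlier = recent_results[:-window]
    if earlier and sum(earlier) / len(earlier) > 0.8 and correct / window < 0.5:
        return "suggest_break"       # was doing well, now slipping: fatigue?
    return "continue"
```

The appeal of this style of logic is that it runs happily as a Celery task on the same single box as everything else - no streaming infrastructure required at that scale.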

So we needed a web server, a database, a queue system to run these heuristics and we needed to host/distribute ~100GB worth of content, most of it video.

We were bootstrapping, so I was trying to (1) save as much as possible on operational costs and (2) punt on all the "scaling issues" that would require more of my devops time, which was better spent developing and adding features. I deployed the whole system on a single server from Hetzner: Django app, PostgreSQL, Redis for caching and session management, RabbitMQ for Celery. All in one machine with 32GB of RAM and a RAID array with enough capacity to hold the data. I think it was costing us less than 50€/month. That is all we needed to (easily) serve ~800 students and the staff who authored new content.

In the end we delivered everything we promised to our first customer, but we were not able to grow our revenue as much as we expected, so by the end of 2013 we put the whole company on the back burner, got a small maintenance contract with the main customer, and went on to find other jobs.

From end-2013 until 2018, all I needed to do was make sure our domains and SSL certificates were renewed every six months, upgrade Django packages when security issues came up, and deal with ONE incident (in 2016, IIRC) where a disk failure put the array in degraded mode. I solved that by getting a new server at Hetzner (better specs and cheaper, after all those years), warning the customer that the service would be taken offline for a couple of hours later that day, rsyncing the content, restoring the database, and redeploying the application with the Fabric script.

This is one of the projects I am most proud of, given all the constraints, and it made me realize the difference between a Software Developer and an Engineer. Yet it translates to a very poor entry on a CV. We are too used to asking in interviews what people have done and what technologies they have used, but we rarely ask about the moments when it was best to avoid doing something.

y4mi|4 years ago

Especially if you consider bare-metal servers. I'm currently paying 45€ for a Ryzen server with 64 GB ECC RAM and 1 TB NVMe storage (RAID 1).

The speed is incredible compared to EC2 or root-server performance from other vendors, even if they have dedicated resources.

_3u10|4 years ago

The cache misses alone mean the cloud should be cheaper than bare metal, not more expensive. In general you can buy the equivalent hardware outright for about three months' worth of cloud pricing.

Why anyone would run their pointer-chasing code in a heavy cache-eviction environment is beyond me. The code is slow to start with, and then you make sure that none of your data is in the cache. Paying 10x for slower hardware makes no sense.

What people should be doing is running on bare metal and turning off all the garbage Meltdown mitigations that kill performance. If you're not a cloud provider and you're allowing people to execute arbitrary code on your hardware, you've got much bigger problems than Meltdown.

KronisLV|4 years ago

> I'm currently paying 45€ for a Ryzen server with 64 GB ECC RAM and 1 TB NVMe storage (RAID 1).

That does sound like a really good deal!

Until now I've only been using VPSes (apart from homelab servers as CI nodes etc.) because they're cheaper at smaller sizes, but for comparison's sake, the cheapest VPS offering (from a provider I know of and trust) with 64 GB of RAM and 640 GB of storage would cost ~260 euros a month: https://www.time4vps.com/?affid=5294

Well, I guess there are also other VPS providers out there that can nearly match the price, like Contabo, though they have mixed reviews: https://contabo.com/en/ (personally I just found their UI to be extremely dated, and there are setup fees, but otherwise they were decent). Even then they'd cost anywhere from 30 to 90 euros a month.

bennyp101|4 years ago

And anything static that needs to stay up can just be cached at the edge somewhere, which costs peanuts, and means that if your bare metal goes down, you can still keep something up.

kuon|4 years ago

May I ask where you rent it?

goodpoint|4 years ago

No amount of money makes a system 100% reliable.

On small platforms we are still stuck in the 1990s approach of having one reliable system.

We need distributed[1] systems and protocols even in small applications. Easy to use and self-healing.

[1] No, I'm not talking about blockchains

Spooky23|4 years ago

My former employer used to target 99% uptime for non-essential systems. It made a ton of sense: the cost of downtime was often incredibly low, while the cost and complexity of making it four nines was really high.

handrous|4 years ago

There's a huge jump in cost and operational style to go from two nines to three, because it means you have to have 24/7 support coverage or an on-call rotation for nights (and good alerting, or else it's all for naught). Two nines just means you need someone to check their messages sometimes, during the day, on weekends. One nine, and you can forget about the weekends too - and that's actually sorta OK for certain applications.

Three nines also means you can't afford to intentionally take a system down to work on it, or you'll burn all your "oopsie" downtime. That means a ton more work on infrastructure and deployment processes than two nines requires.
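The "can't afford to take it down" point is an error-budget calculation. A rough sketch (plain arithmetic, not from the comment): at three nines, a single planned two-hour maintenance window eats roughly a quarter of the whole year's allowance.

```python
# Error budget remaining after one planned maintenance window.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def error_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

budget = error_budget_minutes(0.999)      # three nines: ~526 minutes/year
remaining = budget - 2 * 60               # after one 2-hour planned outage
print(f"budget: {budget:.0f} min/year, left for actual incidents: {remaining:.0f} min")
```

At two nines the same window is barely 2% of the budget, which is why maintenance downtime stops being a planning concern there.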

jamespwilliams|4 years ago

The Google SRE book, which I think is a reasonable reflection of SRE culture generally, actually mentions this in the very first chapter.