
Build vs. Buy: What This Week's Outages Should Teach You

49 points | toddgardner | 3 months ago | toddhgardner.com

43 comments


onion2k|3 months ago

You can't build your own Cloudflare in any meaningful sense. You can choose to forgo the functionality Cloudflare provides because you judge the risk of a Cloudflare outage to outweigh the benefits Cloudflare gives you, but that probability tree is going to land in Cloudflare's favor for 99.99% of businesses.

If you can build a system with redundancy to continue working even if Cloudflare is unavailable then you should, but most years that's going to be a waste of time.

I think you'd be better off spending the time building good relationships with your customers and users so that in the event of an outage that's beyond your control they trust your business and continue to be happy customers when you're back up and running.

mrweasel|3 months ago

Exactly, CloudFlare falls squarely in the "Buy" category. This is not a product you just build; you'd overpay massively for global capacity.

In general I think people are overreacting to the CloudFlare outage, and most of these types of articles aren't really thought all the way through.

Also, the conclusion on Jurassic Park is wrong. Hammond "spared no expense", yet Nedry was a single point of failure? Seems like they spared at least some expense in the IT department.

toddgardner|3 months ago

Yea agreed. I don't build my own CDNs.

But I don't choose Cloudflare either, because it's too complicated and I don't need that. So I choose the simplest possible thing with as little complexity as possible (for me, that was BunnyCDN). If it goes down, it's usually obvious why. And I didn't rely on anything special about it, so I can move away painlessly.

arbol|3 months ago

Your customers are also likely down if they run online services.

dan353hehe|3 months ago

> Here’s the thing, if your core business function depends on some capability, you should own it if at all possible.

If I'm building something that allows my customers to do X, then yes I will own the software that allows my customers to do X. Makes sense.

> They’ll craft artisanal monitoring solutions while their actual business logic—the thing customers pay for—runs on someone else’s computer.

So instead I should build an artisanal hosting solution on my own hardware that I purchase and maintain? I could drop proxmox on them and go from there, or K8s, or even just bare metal and systemd scripts.

But my business isn't about any of those things, it's about X. How does owning and running my own hardware get me closer to delivering on X?

wrs|3 months ago

The OP's point is that if your monitoring solution dies, your customers don't even notice, so you shouldn't build it yourself. But if the service running your actual business logic dies, your customers get cut off, so you should build and maintain that part more directly. (And obviously this is a spectrum — you probably don't need to design your own CPU.)

jmull|3 months ago

The advice here is contradictory. It suggests you should build and own things your business depends on wherever possible, but also that you should buy things that aren't core to your business's value.

There would very typically be a large overlap here.

Probably very few companies should build and run their own CDN and internet-scale firewall, for example. It doesn't have to be Cloudflare, but no provider will have zero outages (a homegrown one is likely to be orders of magnitude worse and more expensive).

gwbas1c|3 months ago

> if your core business function depends on some capability, you should own it

I fear this is easy to misconstrue.

For example, I was at a company that, as I learned how everything worked, realized that we were spending $20k / month for cloud services to basically process about as much real-time data as a CD player processes.

I joked that we should be able to run our entire product on a single server running in the office. (Then I pointed out that this was a joke and that running in the cloud gave us amazing redundancy that we didn't have to implement ourselves.) My point was to show that our architecture was massively bloated and overengineered for what we were doing. (i.e., the cost of serialization to send messages was more than the actual processing that was happening. The cost was both money, and the fact that we were spending more time working on messaging than the actual product.)

BUT: There are many times where we could easily say, "this would be so much easier if we had our own server in the office." And, if we misconstrue the above quote, we could convince ourselves to run our own server in the office.

toddgardner|3 months ago

Yea, totally. This is a balance.

Very few times should you manage the actual hardware yourself.

But often a cloud is overly complex for what you need. 10 years ago we left MS Azure and started leasing dedicated hardware in OVH. Our costs were cut by 90%, our performance tripled, and our reliability improved. We did have to take on some effort to make our systems portable with ansible and containers, but we greatly simplified our vendor stack.

I am never confused why something goes down, and I have confidence that I can stand up with another vendor without re-writing anything.

If I can't own it, it should be as simple and commoditized as possible. Most clouds are not that.

vivzkestrel|3 months ago

Instead we need a startup that builds over every cloud provider. Think of a web server, for example. AWS has EC2, GCP has its own equivalent, Azure has its own, and so on. What if we had a startup that virtualizes a layer on top of these, such that when AWS has an outage you lose 1/3rd of your operating capacity, and when Azure has an outage you lose 1/3rd of your operating capacity? For your startup's virtual web server to go down, all of AWS, GCP, and Azure would have to go down simultaneously. Basically, build on top of everyone's cloud services into one single unified virtual layer that offers end products to consumers. A 6GB RAM server that the end consumer purchases has 2GB of RAM running on AWS, 2GB on Azure, and 2GB on GCP. I am sure we can also strategize something along the same lines for a database server, with the added question of the database sharding strategy at play.

bradly|3 months ago

This is what Fog and other cloud-agnostic libraries promise. The problem is that you get tied to the lowest common feature set, or end up writing different code paths to take advantage of the latest features.

renewiltord|3 months ago

In practice, you're better off just having one cloud, but if you ever reach the point where you care about this, you're better off running some cloud-agnostic platform like Kubernetes in a multi-cloud setup (i.e. one cluster per cloud) and then load-balancing or failing over via DNS.
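The failover half of that setup can be sketched as a health probe that picks the first responsive cluster (hostnames, addresses, and the `/healthz` path below are hypothetical; a real deployment would use the DNS provider's managed health checks instead of a hand-rolled script):

```python
# Minimal sketch of DNS-style failover across per-cloud clusters:
# probe each cluster's health endpoint and answer with the first
# healthy one's address.
import urllib.request

CLUSTERS = [
    ("aws-cluster.example.com", "203.0.113.10"),   # hypothetical
    ("gcp-cluster.example.com", "198.51.100.20"),  # hypothetical
]

def healthy(host, timeout=2):
    """Return True if the cluster's health endpoint answers 200."""
    try:
        with urllib.request.urlopen(f"https://{host}/healthz", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

def pick_target(clusters, probe=healthy):
    """Return the address of the first healthy cluster, or None if all are down."""
    for host, address in clusters:
        if probe(host):
            return address
    return None
```

With a short TTL on the DNS record, repointing it at `pick_target(CLUSTERS)` bounds how long clients keep resolving to a dead cluster.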

servercobra|3 months ago

It's great in theory, it's just relatively expensive. You'll need to pay to run on all the clouds, plus pay for the extra traffic to keep databases synced. Distributed systems are hard.

gwbas1c|3 months ago

> What if we had a startup that virtualizes a layer on top of these, such that when AWS has an outage you lose 1/3rd of your operating capacity, and when Azure has an outage you lose 1/3rd of your operating capacity?

And then when your startup goes down we lose 3/3rds of our operating capacity!

---

There are certain kinds of errors and failures that it's not worth protecting against, because the costs (and consequences) are more than just accepting that things fail from time to time.

It's easy to forget that services used to go down all the time in the 1990s and early 2000s. In this case, we still have super-impressive resiliency with modern cloud hosting.

IMO: The best way to improve the situation is for the cloud hosts to take their lessons learned and improve themselves, and for us (their customers) to vote with our feet if/when a cloud provider has problems.

dlisboa|3 months ago

> A 6GB RAM server that the end consumer purchases has 2GB of RAM running on AWS, 2GB on Azure and 2GB on GCP.

That'd be a very inefficient use of compute. Memory access now has network latency, cache locality doesn't exist, and a single process can't span providers. You're basically subverting how computers fundamentally work today. There's no benefit.

I know Kubernetes and containers have everyone thinking servers don't matter, but we should have less virtualization, not more. Redundancy and virtualization are not the same thing.

realityking|3 months ago

Many (most) companies don’t even manage to split their application across multiple cloud regions with the same provider. Doing it across providers is an order of magnitude harder.

codingdave|3 months ago

Redundancy is a proven way to build resilience into your infrastructure. Ownership does not mean you have to build it. OP is correct that you need to understand it all, but that understanding also allows for solid DR plans that use multiple providers for a resilient infrastructure.

toddgardner|3 months ago

An alternative to multiple providers is to use commoditized providers. By using simple infrastructure rather than cloud platforms, I can redeploy my infrastructure with Ansible on another provider in hours, rather than re-building my platform if I decide the cloud is the wrong fit.
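The portability claim above boils down to keeping one playbook and swapping only the host inventory per vendor. A minimal sketch, assuming a conventional Ansible layout (the `inventories/<provider>/hosts.ini` paths and `site.yml` name are hypothetical):

```python
# Sketch: one deploy playbook, many provider inventories. Only the
# host list changes when moving between commodity providers.

def deploy_command(provider: str) -> list[str]:
    """Build the ansible-playbook invocation for a given provider's hosts."""
    inventory = f"inventories/{provider}/hosts.ini"  # e.g. ovh, hetzner
    return ["ansible-playbook", "-i", inventory, "site.yml"]

# Moving vendors is then just running the same playbook elsewhere:
#   subprocess.run(deploy_command("ovh"), check=True)
```

The design point is that the playbook encodes what the system needs, while the inventory encodes where it runs, so switching providers never touches application code.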

righthand|3 months ago

Yeah, but my DevOps only know the AWS or Cloudflare UIs and refuse to consider any other platforms. The leadership sees multiple bills as bad. Back to square one? No one will learn anything, because people enjoy the pseudo-holiday for problems they set themselves up to do nothing about.

janalsncm|3 months ago

For data analysis and medium-sized ML jobs, my personal computer is so much faster and more responsive than any cloud solution. Of course you get none of the resiliency or security guarantees of the cloud, but it’s a data point. I genuinely hate using cloud and avoid using it if at all possible. Even a MacBook Pro is faster.

juancn|3 months ago

There's no easy answer, but you should definitely model what happens when X goes down if you depend on X.

It may even be a rational decision to take the downtime if the cost of avoiding it exceeds the expected cost of an eventual downtime, but that's a business decision that requires some serious thought.
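That expected-cost comparison can be made concrete with a back-of-the-envelope model (all figures below are made up for illustration, not from the thread):

```python
# Compare the expected annual cost of outages against the annual cost
# of mitigating them; accept the downtime when mitigation costs more.

def expected_outage_cost(outages_per_year, hours_per_outage, revenue_per_hour):
    """Expected annual revenue lost to outages."""
    return outages_per_year * hours_per_outage * revenue_per_hour

# Hypothetical figures:
cost_of_downtime = expected_outage_cost(
    outages_per_year=2, hours_per_outage=3, revenue_per_hour=500
)  # 3000
cost_of_redundancy = 12_000  # e.g. second provider + engineering time, per year

if cost_of_redundancy > cost_of_downtime:
    print("Accepting the downtime is the rational choice")
else:
    print("Redundancy pays for itself")
```

The hard part in practice is estimating the inputs (outage frequency and the real cost of an hour down, including reputation), which is why it deserves the serious thought the parent mentions.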

chasd00|3 months ago

> It may even be a rational decision to take the downtime if the cost of avoiding it exceeds the expected cost of an eventual downtime, but that's a business decision that requires some serious thought.

That's at the root of all infrastructure decisions, not just web app tech stacks but even something like utility service. I think it gets lost on a lot of technology people because we love to work on big technical things. No one wants a boring answer like a couple of webservers and Postgres with a backup in a different datacenter when there's a wall of knobs and switches to play with at the hyperscalers.

serjester|3 months ago

If cloudflare goes down, you can blame them. If your hand rolled solution fails when cloudflare exists, you’re going to have a tough pitch to leadership why you’re in charge of the technical roadmap. Choose your battles, and this is not a hill worth dying on.

mannyv|3 months ago

What this outage teaches you is that when a third party vendor fails and the internet breaks you can point the finger at them with no issues.

If your shit breaks and everyone else's shit is still working that's a problem.

skeezyjefferson|3 months ago

> you can point the finger at them with no issues.

yeah sure, if your business is one of the 500 startups on HN creating inane shit like a notes app or a calendar, but outages can affect genuine companies that people rely on

iso1631|3 months ago

I learned that with crowdstrike. It completely changed my understanding of what's important for C-Suite and C-Suite wannabes that want to be in middle management

toddgardner|3 months ago

I tend to sell to a wide variety of customers. They tend not to give a crap if a cloud provider is down; it's still our problem to make it right.

dylan604|3 months ago

Any company offering services with an SLA that does not include this as a caveat is just crazy to me. "We guarantee our services will be up and running as long as the 3rd-party services we run on top of are running."

erikpukinskis|3 months ago

If I build my own CDN, it will go down. And I will have to fix it at 2am.

If I use CloudFlare, it will also go down, but probably for less time, and someone else has to be up at 2am fixing it.

> Build what delivers your value.

Like Hershey builds grocery stores?

Like Budweiser builds bars?

This can’t be serious.

We live in a society.

almosthere|3 months ago

Recoverable master and short DNS TTL.

1970-01-01|3 months ago

Meh. This opinion highlights the fact that availability is the least understood pillar in security. The Right Way to Think About It is having good security analysis and doing proper Risk Management. That means it's the security team's job to do business impact analysis, 3rd-party assessments, and tabletop exercises on all your critical systems, to tell you what is rock solid and what is a house of cards.

toddgardner|3 months ago

How you approach this is very different depending on the size of organization. We're a small shop (3), but we deliver big services to lots of people.

We do this by owning everything we can, and using simple vendors for what we can't.

4ndrewl|3 months ago

Wardley Mapping is a framework for better understanding Build vs. Buy (vs. Rent) at a more strategic level. tl;dr: it's much more nuanced than 'if you depend on it, own it'.

toddgardner|3 months ago

Does anyone read articles before commenting? lol