top | item 41193045

How we migrated onto K8s in less than 12 months

298 points | ianvonseggern | 1 year ago | figma.com | reply

385 comments

[+] julienmarie|1 year ago|reply
I personally love k8s. I run multiple small but complex custom e-commerce shops and handle all the tech on top of marketing, finance and customer service.

I was running on dedicated servers before. My stack is quite complicated and deploys were a nightmare. In the end the dread of deploying was slowing down the little company.

Learning and moving to k8s took me a month. I run around 25 different services (front ends, product admins, logistics dashboards, delivery route optimizers, OSRM, ERP, recommendation engine, search, etc.).

It forced me to clean up my act and structure things in a repeatable way. Having all your cluster config in one place lets you know exactly the state of every service and which version is running.

It allowed me to do rolling deploys with no downtime.

Yes it's complex. As programmers we are used to complex. An Nginx config file is complex as well.

But the more you dive into it, the more you understand the architecture of k8s and how it makes sense. It forces you to respect the twelve factors to the letter.

And yes, HA is more than nice, especially when your income is directly linked to the availability and stability of your stack.

And it's not that expensive. I pay around 400 USD a month in hosting.

[+] maccard|1 year ago|reply
Figma were running on ECS before, so they weren't just running dedicated servers.

I'm a K8S believer, but it _is_ complicated. It solves hard problems. If you're multi-cloud, it's a no brainer. If you're doing complex infra that you want a 1:1 mapping of locally, it works great.

But if you're less than 100 developers and are deploying containers to just AWS, I think you'd be insane to use EKS over ECS + Fargate in 2024.

[+] belter|1 year ago|reply
> I run multiple small but complex custom e-commerce shops

How do you handle the lack of multi-tenancy in Kubernetes?

[+] wrs|1 year ago|reply
A migration with the goal of improving the infrastructure foundation is great. However, I was surprised to see that one of the motivations was to allow teams to use Helm charts rather than converting to Terraform. In practice I haven’t seen teams consistently able to use random Helm charts unmodified, so by encouraging its use you end up with teams forking and modifying the charts. And Helm is such a horrendous tool that you don’t really want to be maintaining your own bespoke Helm charts. IMO you’re actually better off rewriting in Terraform so that at least your local version is maintainable.

Happy to hear counterexamples, though — maybe the “indent 4” insanity and multi-level string templating in Helm are gone nowadays?

[+] cwiggs|1 year ago|reply
Helm charts and Terraform are different things IMO. Terraform is better suited to deploying cloud resources (S3 buckets, EKS clusters, EKS workers, RDS, etc.). Sure, you can manage your k8s workloads with Terraform, but I wouldn't recommend it. Terraform keeping its own state when your state already lives in k8s makes working with Terraform + k8s a pain. Helm is purpose-built for k8s; Terraform is not.

I'm not a fan of Helm either, though; templated YAML sucks, and you still have the "indent 4" insanity. Kustomize is nice when things are simple, but once your app is complex, Kustomize is worse than Helm IMO. Try to deploy an app that has an Ingress, a TLS cert, and external-dns with Kustomize for multiple environments; you have to patch the resources 3 times instead of just having 1 variable you use in 3 places.
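To make that comparison concrete, here is a rough sketch (the chart layout, hostname, and value name are all hypothetical, not from the comment) of how a single Helm value can feed all three places that plain Kustomize would need to patch separately per environment:

```yaml
# values.yaml (hypothetical)
domain: shop.example.com

# templates/ingress.yaml (excerpt) — the one value flows into the
# external-dns annotation, the TLS host, and the routing rule
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop
  annotations:
    external-dns.alpha.kubernetes.io/hostname: {{ .Values.domain }}
spec:
  tls:
    - hosts:
        - {{ .Values.domain }}
      secretName: shop-tls
  rules:
    - host: {{ .Values.domain }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: shop
                port:
                  number: 80
```

With Kustomize, each environment overlay would instead carry a patch for the annotation, the TLS block, and the rule host — three patches per environment versus one value.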

Helm is popular and Terraform is popular, so they both get talked about a lot, but IMO there is a tool yet to become popular that will replace both of them.

[+] solatic|1 year ago|reply
My current employer (BigCo) has Terraform managing both infra and deployments, at (ludicrous) scale. It's a nightmare. The problem with Terraform is that you must plan your workspaces so that you don't exceed the best-practice number of resources per workspace (~100-200), or else plans will drastically slow down your time-to-deploy, checking stuff like databases and networking that you haven't touched and have no desire to touch. In practice this means creating a latticework of Terraform workspaces that trigger each other, and there are currently no good open-source tools that support it.

Best practice as I can currently see it is to have Terraform set up what you need for continuous delivery (e.g. ArgoCD) as part of the infrastructure, then use the CD tool to handle day-to-day deployments. Most CD tooling then asks you to package your deployment in something like Helm.

[+] gouggoug|1 year ago|reply
Talking about Helm - I personally have come to profoundly loathe it. It was amazing when it came out and filled a real gap.

However, it is loaded with so many footguns that I spend my time redoing and debugging other engineers' work.

I’m hoping this new tool called "timoni" picks up steam. It fixes pretty much every qualm I have with Helm.

So if, like me, you’re looking for a better solution, go check out timoni.

[+] smellybigbelly|1 year ago|reply
Our team also suffered from the problems you describe with public Helm charts. There is always something you need to customise to make things work in your own environment. Our approach has been to use the public Helm chart as-is and do any customisation with `kustomize build --enable-helm`.
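For anyone unfamiliar with that flow, a kustomization along these lines (the chart, version, and values here are hypothetical) inflates a public chart while keeping your overrides in plain Kustomize files:

```yaml
# kustomization.yaml (hypothetical example)
helmCharts:
  - name: ingress-nginx
    repo: https://kubernetes.github.io/ingress-nginx
    version: 4.10.1
    releaseName: ingress-nginx
    namespace: ingress
    valuesInline:
      controller:
        replicaCount: 2

# Render with the Helm integration enabled:
#   kustomize build --enable-helm .
```

Any further environment-specific tweaks can then live in ordinary Kustomize patches on top of the inflated chart output.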
[+] mnahkies|1 year ago|reply
Whilst I'll agree that writing helm charts isn't particularly delightful, consuming them can be.

In our case we have a single application/service base helm chart that provides sane defaults and all our deployments extend from. The amount of helm values config required by the consumers is minimal, and there has been very little occasion for a consumer to include their own templates - the base chart exposes enough knobs to avoid this.

When it comes to third-party charts, many we've been able to deploy as is (sometimes with some PRs upstream to add extra functionality), and occasionally we've needed to wrap/fork them. We've deployed far more third-party charts as-is than not though.

One thing probably worth mentioning w.r.t. maintaining our custom charts is the use of helm-unittest (https://github.com/helm-unittest/helm-unittest) - it's been a big help in avoiding regressions.
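For readers who haven't seen it, a helm-unittest spec is itself just YAML; a minimal sketch (the template path, value names, and image are hypothetical) looks something like:

```yaml
# tests/deployment_test.yaml (hypothetical)
suite: deployment
templates:
  - deployment.yaml
tests:
  - it: renders the image from values
    set:
      image:
        repository: example/app
        tag: "1.2.3"
    asserts:
      - equal:
          path: spec.template.spec.containers[0].image
          value: example/app:1.2.3
```

Running `helm unittest .` in the chart directory then asserts against the rendered manifests, which catches template regressions before anything reaches a cluster.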

We do manage a few kubernetes resources through terraform, including Argocd (via the helm provider which is rather slow when you have a lot of CRDs), but generally we've found helm chart deployed through Argocd to be much more manageable and productive.

[+] xiwenc|1 year ago|reply
I’m baffled to see so many anti-k8s sentiments on HN. Is it because most commenters are developers used to services like Heroku, fly.io, render.com, etc., or run their apps on VMs?
[+] elktown|1 year ago|reply
I think some are just pretty sick and tired of the explosion of needless complexity we've seen in the last decade or so in software, and rightly so. This is an industry-wide problem of deeply misaligned incentives (& some amount of ZIRP gold rush), not specific to this particular case - if this one is even a good example of this to begin with.

Honestly, as it stands, I think we'd be seen as pretty useless craftsmen in any other field due to an unhealthy obsession with our tooling and meta-work - consistently throwing any kind of sensible resource usage out the window in favor of just getting to work with certain tooling. It's some kind of a "temporarily embarrassed FAANG engineer" situation.

[+] moduspol|1 year ago|reply
For me personally, I get a little bit salty about it due to imagined, theoretical business needs of being multi-cloud, or being able to deploy on-prem someday if needed. It's tough to explain just how much longer it'll take, how much more expertise is required, how much more fragile it'll be, and how much more money it'll take to build out on Kubernetes instead of your AWS deployment model of choice (VM images on EC2, or Elastic Beanstalk, or ECS / Fargate, or Lambda).

I don't want to set up or maintain my own ELK stack, or Prometheus. Or wrestle with CNI plugins. Or Kafka. Or high availability Postgres. Or Argo. Or Helm. Or control plane upgrades. I can get up and running with the AWS equivalent almost immediately, with almost no maintenance, and usually with linear costs starting near zero. I can solve business problems so, so much faster and more efficiently. It's the difference between me being able to blow away expectations and my whole team being quarters behind.

That said, when there is a genuine multi-cloud or on-prem requirement, I wouldn't want to do it with anything other than k8s. And it's probably not as bad if you do actually work at a company big enough to have a lot of skilled engineers that understand k8s--that just hasn't been the case anywhere I've worked.

[+] caniszczyk|1 year ago|reply
Hating is a sign of success in some ways :)

In some ways, it's nice to see companies move to use mostly open source infrastructure, a lot of it coming from CNCF (https://landscape.cncf.io), ASF and other organizations out there (on top of the random things on github).

[+] maayank|1 year ago|reply
It’s one of those technologies where there’s merit to use them in some situations but are too often cargo culted.
[+] tryauuum|1 year ago|reply
For me it is about VMs. I feel uneasy knowing that any kernel vulnerability will allow malicious code to escape the container and explore the Kubernetes host.

There are Kata Containers, I think; they might solve my angst and make me enjoy k8s.

Overall... There's just nothing cool in kubernetes to me. Containers, load balancers, megabytes of yaml -- I've seen it all. Nothing feels interesting enough to try

[+] archenemybuntu|1 year ago|reply
Kubernetes itself is built around mostly solid distributed system principles.

It's the ecosystem around it which turns things needlessly complex.

Just because you have Kubernetes, you don't necessarily need Istio, Helm, Argo CD, Cilium, and whatever half-baked stuff CNCF pushed yesterday.

For example, take a look at Helm. Its templating is atrocious and, if I'm still correct, it doesn't have a way to order resources properly except hooks. Sometimes resource A (a deployment) depends on resource B (some CRD).
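The hooks mentioned above are the usual workaround: annotations that pull a resource out of Helm's normal install order. A sketch (the job name and image are hypothetical) that forces setup work to finish before the rest of the release installs:

```yaml
# templates/crd-setup-job.yaml (hypothetical)
apiVersion: batch/v1
kind: Job
metadata:
  name: register-crds
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"   # lower weights run earlier
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: register
          image: example/crd-installer:1.0
```

It works, but it is exactly the kind of side channel the parent is complaining about: ordering lives in string-valued annotations rather than in a declared dependency graph.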

The culture around Kubernetes dictates that you bring in everything pushed by CNCF, and most of this stuff is half-baked MVPs.

---

The word "devops" has created the expectation that backend developers should be fighting Kubernetes if something goes wrong.

---

Containerization is done poorly by many orgs, with no care for security or image size. That's a rant for another day; I suspect this isn't a big reason for the Kubernetes hate here.

[+] vouwfietsman|1 year ago|reply
Maybe its normal for a company this size, but I have a hard time following much of the decision making around these gigantic migrations or technology efforts because the decisions don't seem to come from any user or company need. There was a similar post from Figma earlier, I think around databases, that left me feeling the same.

For instance: they want to go to k8s because they want to use etcd/helm, which they can't on ECS? Why do you want to use etcd/helm? Is it really this important? Is there really no other way to achieve the goals of the company than exactly like that?

When a decision is founded on a desire of the user, it's easy to validate that downstream decisions make sense. When a decision is founded on a technological desire, downstream decisions may make sense in the context of the technical desire, but do they still make sense in the context of the user?

Either I don't understand organizations of this scale, or it is fundamentally difficult for organizations of this scale to identify and reason about valuable work.

[+] dijksterhuis|1 year ago|reply
> When applied, Terraform code would spin up a template of what the service should look like by creating an ECS task set with zero instances. Then, the developer would need to deploy the service and clone this template task set [and do a bunch of manual things]

> This meant that something as simple as adding an environment variable required writing and applying Terraform, then running a deploy

This sounds less like a problem with ECS and more like an overcomplication in how they were using terraform + ECS to manage their deployments.

I get the generating templates part for verification prior to live deploys. But this seems... dunno.

[+] ianvonseggern|1 year ago|reply
Hey, author here, I totally agree that this is not a fundamental limitation of ECS and we could have iterated on this setup and made something better. I intentionally listed this under work we decided to scope into the migration process, and not under the fundamental reasons we undertook the migration because of that distinction.
[+] roshbhatia|1 year ago|reply
I'm with you here -- ECS deploys are pretty painless and uncomplicated, but I can picture a few scenarios where this ends up being necessary. For example, if they have a lot of services deployed on ECS, it ends up bloating the size of the Terraform state. That would slow down plans and applies significantly, which makes sharding the Terraform state by literally cloning the configuration from a template a lot safer.
[+] wfleming|1 year ago|reply
Very much agree. I have built infra on ECS with terraform at two companies now, and we have zero manual steps for actions like this, beyond “add the env var to a terraform file, merge it and let CI deploy”. The majority of config changes we would make are that process.
[+] datadeft|1 year ago|reply
> Migrating onto Kubernetes can take years

What the heck am I reading? For whom? I am not sure why companies even bother with such migrations. Where is the business value? Where is the gain for the customer? Is this one of those "l'art pour l'art" projects that Figma does just because they can?

[+] kevstev|1 year ago|reply
FWIW... I was pretty taken aback by this statement as well- and also the "brag" that they moved onto K8s in less than a year. At a very well established firm ~30 years old and with the baggage that came with it, we moved to K8s in far less time- though we made zero attempt to move everything to k8s, just stuff that could benefit from it. Our pitch was more or less- move to k8s and when we do the planned datacenter move at the end of the year, you don't have to do anything aside from a checkout. Otherwise you will have to redeploy your apps to new machines or VMs and deal with all the headache around that. Or you could just containerize now if you aren't already and we take care of the rest. Most migrated and were very happy with the results.

There were plenty of services that were latency-sensitive or in the HPC realm where it made no sense to force a migration, though, and there was no attempt to shoehorn them in.

[+] xorcist|1 year ago|reply
It solves the "we have recently been acquired and have a lot of resources that we must put to use" problem.
[+] tedunangst|1 year ago|reply
How long will it take to migrate off?
[+] hujun|1 year ago|reply
Depends on how much "k8s-native" code you have. There are applications designed to run on k8s that use a lot of the k8s API; also, if your app is already micro-serviced, it's not straightforward to change it back.
[+] breakingcups|1 year ago|reply
I feel so out of touch when I read a blog post which casually mentions 6 CNCF projects with kool names that I've never heard of, for gaining seemingly simple functionality.

I'm really wondering if I'm aging out of professional software development.

[+] renewiltord|1 year ago|reply
Nah, there’s lots of IC work. It just means you’re unfamiliar with one approach to org scaling: abstracting over hardware, logging, and retries, all handled by a platform team.

It’s not the only approach so you may well be familiar with others.

[+] JohnMakin|1 year ago|reply
I like how this article clearly and articulately states what it stands to gain from Kubernetes. Many make the jump without knowing what they even stand to gain, or whether they need to in the first place - the reasons given here are good.
[+] rayrrr|1 year ago|reply
Just out of curiosity, is there any other modern system or service anyone here can think of where someone in their right mind would brag about migrating to it in less than a year?
[+] jjice|1 year ago|reply
It's a hard question to answer. Not all systems are equal in size, scope, and impact. K8s as a system is often the core of your infra, meaning everything running will be impacted. That, coupled with the team constraints described in the article, makes it sound like a year isn't awful.

One system I can think of off the top of my head is when Amazon moved away from Oracle to fully Amazon/OSS RDBMSs a while ago, but that was multi year I think. If they could have done it in less than a year, they'd definitely be bragging.

[+] therealdrag0|1 year ago|reply
I’ve seen many migrations take over a year. It’s less about the technology and more about your tech debt, integration complexity, and resourcing.
[+] jokethrowaway|1 year ago|reply
In which universe migrating from docker containers in ECS to Kubernetes is an effort measured in years?
[+] surfingdino|1 year ago|reply
ECS makes sense when you are building and breaking stuff. K8s makes sense when you are mature (as an org).
[+] jb1991|1 year ago|reply
Can anyone advise what is the most common language used in enterprise settings for interfacing with K8s?
[+] gadflyinyoureye|1 year ago|reply
Depends on what you mean. Helm will control a lot, and you can generate the YAML files from any language. You can also admin it from command-line tools, so again, any language, but often zsh or bash.
[+] akdor1154|1 year ago|reply
On the platform consumer side (app infra description) - well schema'd yaml, potentially orchestrated by helm ("templates to hellish extremes") or kustomize ("no templates, this is the hill we will die on").

On the platform integration/hook side (app code doing specialised platform-specific integration stuff, extensions to k8s itself), golang is the lingua franca but bindings for many languages are around and good.

[+] JohnMakin|1 year ago|reply
IME almost exclusively golang.
[+] bithavoc|1 year ago|reply
If you’re talking about connecting to Kubernetes and creating resources programmatically, Pulumi allows you to interface with it from all the languages they support (JS, TS, Go, C#, Python), including wrapping up Helm charts and injecting secrets (my personal favorite).

If you want to build your own Kubernetes custom resources and controllers, Go works pretty well for that.

[+] mplewis|1 year ago|reply
I have seen more Terraform than anything else.
[+] strivingtobe|1 year ago|reply
> At the time we did not auto-scale any of our containerized services and were spending a lot of unnecessary money to keep services provisioned such that they could always handle peak load, even on nights and weekends when our traffic is much lower.

Huh? You've been running on AWS for how long and haven't been using auto scaling AT ALL? How was this not priority number one for the company to fix? You're just intentionally burning money at that point!

> While there is some support for auto-scaling on ECS, the Kubernetes ecosystem has robust open source offerings such as Keda for auto-scaling. In addition to simple triggers like CPU utilization, Keda supports scaling on the length of an AWS Simple Queue Service (SQS) queue as well as any custom metrics from Datadog.

ECS autoscaling is easy, and supports these things. Fair play if you just really wanted to use CNCF projects, but this just seems like you didn't really utilize your previous infrastructure very well.
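For context on the Keda approach the article describes, scaling rules are declared as a ScaledObject; a rough sketch scaling a worker deployment on SQS queue length (the names, queue URL, and thresholds here are hypothetical):

```yaml
# scaledobject.yaml (hypothetical)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker
spec:
  scaleTargetRef:
    name: worker          # the Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "100"   # target messages per replica
        awsRegion: us-east-1
```

ECS can do similar queue-based scaling via CloudWatch alarms and target tracking, which is the commenter's point: the capability existed on the old stack, just with different plumbing.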

[+] 05bmckay|1 year ago|reply
I don't think this is the flex they think it is...
[+] ravedave5|1 year ago|reply
Completely left out of this post and most of the conversation is that being on K8s makes it much, much easier to go multi-cloud. K8s is k8s.
[+] ko_pivot|1 year ago|reply
I’m not surprised that the first reason they state for moving off of ECS was the lack of support for stateful services. The lack of integration between EBS and ECS has always felt really strange to me, considering that AWS already built all the logic to integrate EKS with EBS in a StatefulSet compliant way.
[+] andrewguy9|1 year ago|reply
I look forward to the blog post where they get off K8s, in just 18 months.
[+] syngrog66|1 year ago|reply
k8s and "12 months" -> my priors likely confirmed. ha
[+] Ramiro|1 year ago|reply
I love reading these "reports from the field"; I always pick up a thing or two. Thanks for sharing @ianvonseggern!