
Kubernetes Needs an LTS

213 points | todsacerdoti | 2 years ago | matduggan.com

206 comments

[+] FridgeSeal|2 years ago|reply
I disagree.

Software is a garden that needs to be tended. LTS (and to a lesser extent, requirements for large amounts of backwards compatibility) arguments are the path to ossification and orgs running 10+ year out-of-date, unsupported legacy garbage that nobody wants to touch and nobody can migrate off because it’s so out of whack.

Don’t do this. Tend your garden. Do your upgrades and releases frequently, ensure that everything in your stack is well understood and don’t let any part of your stack ossify and “crust over”.

Upgrades (even breaking ones) are easier to handle when you do them early and often. If you let them pile up, and then have to upgrade all at once because something finally gave way, then you’re simply inflicting unnecessary pain on yourself.

[+] jl6|2 years ago|reply
That’s great if you have control over all the moving parts, but a lot of real-world (i.e. not exclusively software-based) orgs have interfaces to components and external entities that aren’t so amenable to change. Maybe you can upgrade your cluster without anybody noticing. Maybe you’re a wizard and upgrades never go wrong for you.

More likely, you will be constrained by a patchwork quilt of contracts with customers and suppliers, or by regulatory and statutory requirements, or technology risk policies, and to get approval you’ll need to schedule end-to-end testing with a bunch of stakeholders whose incentives aren’t necessarily aligned to yours.

That all adds up to $$$, and that’s why there’s a demand for stability and LTS editions.

[+] maximinus_thrax|2 years ago|reply
To paraphrase someone else's reaction, I'm violently shaking my head in disagreement. What you're saying only works when you have 100% full control of everything (including the customer data). As someone who spent years in the enterprise space, what you're describing is akin to 'Ivory Tower Architecture'.

LTS is a commitment. That is all. If someone is uncomfortable with such a commitment, then that's fine, let the free market sort it out. But what LTS does is tell everyone (including paying customers) that the formats/schemas/APIs/etc. in that version will be supported for a very long time, and that if I adopt it, I won't have to think about it or budget too much for its maintenance for a period of time measured in months/years.

I would go the extra mile here and say that offline formats should be supported FOREVER. None of that LTS bs for offline data, ever. LTS means that you're accountable for some of the costs in your partnership with the customers. If you move fast and break things, they will have to work extra just to keep up. If you move fast but are mindful of back-compat, you will work extra but your customer will be happier. That is all.

Re-reading your comment gives me chills and reinforces my belief that I will never pay money to Google (they have a similar gung-ho attitude against 'legacy garbage') or have any parts of my business depend on stuff which reserves the liberty of breaking shit early and often.

[+] baby_souffle|2 years ago|reply
> Don’t do this. Tend your garden. Do your upgrades and releases frequently, ensure that everything in your stack is well understood and don’t let any part of your stack ossify and “crust over”.

You can't see me, but I'm violently nodding in agreement. Faithfully adhering to these best practices isn't always possible, though; management gonna manage how they manage, and not how your ops team wants them to manage.

> Upgrades (even breaking ones) are easier to handle when you do them early and often. If you let them pile up, and then have to upgrade all at once because something finally gave way, then you’re simply inflicting unnecessary pain on yourself.

How different could things be if k8s had a "you don't break userland!" policy akin to the way the Linux kernel operates? Is there a better balance between new stuff replacing the old and never shipping new stuff that would make more cluster operators more comfortable with upgrades?

[+] mfer|2 years ago|reply
Consider the cases where people would use something for a long time and want to keep it relatively stable for long periods. Planes, trains, and automobiles are just a few of the examples. How should over-the-air updates for these kinds of things work? Where are all the places that k8s is being used?

If we only think of open source in datacenters we limit our thinking to just one of the many places it's used.

[+] cramjabsyn|2 years ago|reply
It's not a garden. Gardens are horizontal, with no dependencies between plants.

Infrastructure is a high-rise building. Long-term planning and careful maintenance are needed. And it shouldn't be necessary to replace the foundation every year or two.

[+] iwontberude|2 years ago|reply
Kubernetes doesn't improve sufficiently to justify the broken compatibility anymore. Projects slow down and become mature. This isn't a bad thing.
[+] wouldbecouldbe|2 years ago|reply
Kubernetes feels like JavaScript has reached the sysadmins: new updates, libraries, and build tools every week.

It's mainly good for keeping highly paid people employed, not keeping your servers stable.

I ran a cluster in production for 1.5 years, and it took so much energy. Especially that one night when a DigitalOcean managed cluster forced an update that crashed all the servers, and there was no sane way to fix it.

I'm back to stability with an old-school VPS; it just works. Every now and then you run a few patches. Simple & fast deploys; what a blessing.

[+] incahoots|2 years ago|reply
I agree with the principle of what you're laying out here, but reality is rarely, if ever, in step with principles and "best practices".

Manufacturing comes to mind: shutting a machine down to apply patches monthly is going to piss off the graph babysitters, especially if the business is a 24/7 operation, and most are these days.

In an ideal world there would be time every month to do proper machine maintenance, but that doesn't translate to big money gains for the glut of shareholders who don't understand anything, let alone that maintenance prolongs processes as opposed to running everything ragged.

[+] Waterluvian|2 years ago|reply
Sometimes I get this feeling that a lot of developers kind of want to be in their code all the time, tending to it. But it’s just not a good use of my time. I want it to work so I can ignore it and move on to the next best use of my time.

I trial-upgraded a years-old Django project today to 5.0 and it took almost zero work. I hadn’t touched versions (other than patch) in over a year. That’s the way I want it. Admittedly this was less about an LTS and more about sensible design with upgrading in mind.

[+] zzyzxd|2 years ago|reply
Yup. At my last company, Kubernetes was the only place where software was not 2+ years behind upstream because of this, and I appreciated it a lot. And its API deprecation policy [1] made upgrades not so painful (if upgrades frequently break your stuff, check whether your infra people really understand this policy and whether your software vendors are compliant).

1. https://kubernetes.io/docs/reference/using-api/deprecation-p...

[+] cultofmetatron|2 years ago|reply
> Do your upgrades and releases frequently

this sounds great in theory. after 5 years of running a startup, you learn to pick your battles. that db library you upgraded? works great except for this one edge case where they changed the interface in an obscure portion that affects 5% of your users. it got past QA but that 5% of users get REAL VOCAL about it. now it's in production and you're planning an emergency revert of that dependency after you verify that you aren't using any of the new features of the library.

sounds doable? now multiply that by the number of dependencies your app has.

I agree that it's good to periodically update deps and allocate time for keeping your system up to date, but it's easy to let tracking all your deps suck up all the time your startup would be better off spending on adding features that bring new users in and add to your bottom line.

[+] npacenop|2 years ago|reply
This, and very much this.

At the end, of course, the decision will be based on business and market analysis, as even the cloud native foundation abides more or less by the same rules as everyone else does... Introducing LTS for Kubernetes, however, will be a huge step towards pushing it down the enterprise products alley, where software is selected more for the number of people available in the market capable of working with it / operating it, and for the running costs generated, rather than for satisfying an actual need of the business.

Chances are that if your org needs LTS for kubernetes, then kubernetes is not the right solution to your problems. Which is probably the case anyway... but that's a whole different story.

[+] jvans|2 years ago|reply
I've spent so much time chasing down performance problems and bugs where the cause was an outdated dependency. Simply spending a few hours a month upgrading dependencies is a big win to avoid those situations.
[+] cpeterso|2 years ago|reply
At the far end of the update-frequency spectrum from LTS is Google's "live at head" philosophy. Google's Abseil C++ library, for example, recommends that consumers update to Abseil's latest commit from the master branch as often as possible. Abseil has no tagged releases besides master and an LTS branch (updated every 6-9 months).

https://abseil.io/about/philosophy#we-recommend-that-you-cho...

[+] quickthrower2|2 years ago|reply
I like the idea, and with something like say Chrome, this is excellent.

However, each upgrade of k8s needs planning. You need to check through the list to see which APIs are broken, figure out if anything needs changing, and prepare for it, including stuff the cloud provider has chucked into the mix.

It gets to the point where I feel like a cluster of clusters would be safer, so you can roll an upgrade out slowly.

[+] hot_gril|2 years ago|reply
Even if you could get everyone to tend the garden, it encourages unnecessary breaking changes or fragile design. LTS provides some much-needed backpressure. If a new feature truly requires breaking things, people who really want it will put in the effort.
[+] MuffinFlavored|2 years ago|reply
> long-term support

> out of date, unsupported legacy garbage

[+] jacurtis|2 years ago|reply
I both agree and disagree with you.

In theory, I agree. Tend your garden, do small upgrades often instead of major upgrades less often.

In a homelab this is easy to implement. In a small organization it is too.

But in the proverbial "real world", things are a lot more complicated. I work as an SRE manager and my team is basically ALWAYS upgrading Kubernetes. New releases drop about as fast as we can upgrade to the last ones.

When you work on a large cluster, doing an upgrade isn't a simple process. It requires a ton of testing, several steps, and being very slow and methodical to make sure it is all done properly. Where I currently work, we have 2 week sprints and infrastructure changes must align with sprint cycles. So to promote an upgrade at the fastest possible schedule it requires:

- Week 0: Upgrade Dev environment

- Week 2: Upgrade QA Environment

- Week 4: Upgrade Sandbox Environment

- Week 6: Upgrade Prod Environment

That is the fastest possible schedule. That assumes we do a cluster upgrade every sprint, which is 2 weeks. It also ignores other clusters for other business units. We have 4 primary product lines, so multiply all that work times 4. Plus we have supporting products (like self-hosted gitlab, codescene, and custom tools running in k8s clusters).

I say "fastest possible schedule" because we can't actually keep up with this schedule, but even if we could, it's the fastest we could go while still maintaining our deployment and infrastructure promotion policies.

With new releases every 3-4 months (12-16 weeks), we are essentially in a constant state of always upgrading kubernetes. Right now my team is 2 versions behind. Skipping versions doesn't make sense because you can't guarantee a safe upgrade when skipping versions.

This is why LTS releases are nice. When you run systems at large scale, it is impractical to upgrade that often. I'd prefer to limit upgrades to no more than twice a year, and personally I find annual upgrade cycles to be the best balance between "tending the garden" and "not drowning in upgrade work". LTS releases usually exist to let companies skip upgrades, going from LTS release to LTS release without the need to step through every minor version in between.
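
For what it's worth, the promotion schedule above can be written down as back-of-the-envelope arithmetic. The numbers are the parent comment's, not any official cadence, and the release interval is an assumption of roughly three Kubernetes minors per year:

```python
# Toy model of the promotion pipeline described above. All constants are
# assumptions taken from the comment, not from any Kubernetes policy.

SPRINT_WEEKS = 2
ENVIRONMENTS = ["dev", "qa", "sandbox", "prod"]
RELEASE_INTERVAL_WEEKS = 15  # assumption: ~3 upstream minors per year

def weeks_to_promote() -> int:
    """Weeks from upgrading dev to upgrading prod, one sprint per hop."""
    # dev starts at week 0; each later environment waits one more sprint
    return SPRINT_WEEKS * (len(ENVIRONMENTS) - 1)

def upgrades_in_flight(product_lines: int = 4) -> float:
    """Average number of concurrent upgrade pipelines across product lines."""
    return product_lines * weeks_to_promote() / RELEASE_INTERVAL_WEEKS
```

With four product lines, this works out to more than one upgrade pipeline in flight at any given time, which matches the "constant state of always upgrading" description.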

Remember, upgrading K8s clusters isn't what my bosses want to hear my team spends its time on. They want to know that observability is improving, devs are getting infrastructure support, we are building out new systems, deploying hardware for the product team, running our resiliency tests, etc. Sure, upgrading is part of the job, but I can't ALWAYS be upgrading. I have a lot of other responsibilities.

[+] stouset|2 years ago|reply
100% agreed.

Make upgrades easy, automate the tedious parts, and do them as often as possible.

If you do upgrades once per several years, yes, it is going to be excruciating.

[+] sofixa|2 years ago|reply
This, like a recent LTS discussion I saw for a different tool, ignores one tiny little detail that makes the whole discussion kind of moot.

LTS doesn't mean a release is immune to bugs or security vulnerabilities. It just means that the major release is updated and supported longer - but you still need to be able to apply patches and security fixes to that major release. Yes, it's easier to go from 1.20.1 to 1.20.5 than to 1.21, because there's less chance of breakage and fewer things will change, but the process is pretty much the same - check for breaking changes, read changelogs, apply everything. The risk is lower, and it might be slightly faster, but fundamentally it's the same process. If the process is too heavy and takes you too long, having it be slightly faster won't be a game-changer.
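
The "same process, lower risk" point boils down to where in the version tuple the change lands. An illustrative sketch (the version strings are just examples, and the helper name is made up):

```python
# Classify an upgrade by which semver component changes. Illustrative only:
# real risk assessment still means reading the changelog either way.

def upgrade_kind(current: str, target: str) -> str:
    """Return 'major', 'minor', or 'patch' for a semver-style upgrade."""
    cur = tuple(int(p) for p in current.split("."))
    tgt = tuple(int(p) for p in target.split("."))
    if cur[0] != tgt[0]:
        return "major"
    if cur[1] != tgt[1]:
        return "minor"
    return "patch"
```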

So LTS brings slight advantages to the operator, while adding potentially significant complexity to the developer (generally backporting fixes into years old versions isn't fun).

The specific proposed LTS flavour is also hardcore, without an upgrade path to the next LTS. The exact type of org that needs an LTS will be extremely reluctant to redo everything 2 years later, with potentially drastic breaking changes making that migration very hard.

[+] kevin_nisbet|2 years ago|reply
From my perspective as a former developer on a kubernetes distribution that no longer exists.

The model seems largely to be: the CNCF/Kubernetes authors have done a good job of setting clear expectations for the lifetime of their releases, but there are customers who for various reasons want extended support windows.

This doesn't prevent distributions from offering or selling extended support windows, so the customers of those distributions can put pressure on the distribution authors. This is something we offered as a reason to use our distribution: that we could backport security fixes or other significant fixes to older versions of Kubernetes. This was especially relevant for the customers we focussed on, who ran lots of clusters installed in places without remote access.

This created a lot of work for us, though: whenever a big security announcement came out, I'd need to triage whether we needed a backport. Even our extended support windows were in tension with customers, who wanted even longer windows, or would open support cases on releases out of support for more than a year.

So I think the question really should be: should LTS be left to the distributions, many of which will choose not to offer longer support than upstream, but which allow for more commercial or narrow offerings where it's important enough to a customer to pay for it? Or should it be the responsibility of the Kubernetes authors, and in that case, what do you give up in project velocity with the extra work of offering and supporting an LTS?

I personally resonate with the argument that this can be left with the distributors, and if it's important enough for customers to seek out, they can pay for it through their selected distribution, or by switching distributions.

But many customers lose out, because they're selecting distributions that don't offer this service, because it is time consuming and difficult to do.

[+] JohnFen|2 years ago|reply
As an industry, we need to get back to having security releases separate from other sorts of releases. There are tons of people who don't want to, or can't, take every feature release that comes down the pike (particularly since feature updates happen so insanely often these days), and this would be a huge win for them.
[+] edude03|2 years ago|reply
Maybe more importantly, you could get a distribution to support you, but what about upstream projects? It'd be a big lift (if not impossible) to get projects like cert-manager, Cilium, or whatever to adopt the longer release cycle as well.

Is it normal for a distribution to also package upstream projects that customers want?

[+] xnyanta|2 years ago|reply
This was my immediate thought while reading the article. Why should the Kubernetes authors be burdened with maintaining an LTS release?

That should be Red Hat's job, just like they do with RHEL.

[+] sgift|2 years ago|reply
> But many customers lose out, because they're selecting distributions that don't offer this service, because it is time consuming and difficult to do.

Sure, but if they really need that service they will gravitate to distributions that do provide it, so, I think, no harm done here. To me it's like JDK distributions: some give you six months, some give free LTS, and others give you LTS with a support contract. LTS with backports is work, someone has to pay for it, so let those who really need it pay. Everyone else can enjoy the new features.

tl;dr: I'm with you in the camp that you can leave it to the distributors.

[+] cyrnel|2 years ago|reply
Good overview! I'd personally rather have better tooling for upgrades. Recently the API changes have been minimal, but the real problem is the mandatory node draining that causes downtime/disruption.

In theory, there's nothing stopping you from just updating the kubelet binary on every node. It will generally inherit the existing pods. Nomad even supports this[1]. But apparently there are no guarantees about this working between versions. And in fact some past upgrades have broken the way kubelet stores its own state, preventing this trick.

All I ask is for this informal trick to be formalized in the e2e tests. I'd write a KEP but I'm too busy draining nodes!

[1]: https://developer.hashicorp.com/nomad/docs/upgrade

[+] barryrandall|2 years ago|reply
No open source package that's given away for free needs to or should pursue LTS releases. People who want LTS need a commercially-supported distribution so that they can pay people to maintain LTS versions of the product they're using.
[+] watermelon0|2 years ago|reply
Not saying that companies shouldn't pay for extended support, but a lot of other open source software has LTS releases with multi-year support (e.g. Ubuntu/Debian with 5 years for LTS releases, and Node.js with 2.5 years).

Additionally, I think one of the major reasons for LTS is that K8s (and related software) regularly introduces breaking changes. Out of all the software that we use at work, K8s probably takes the most development time to upgrade.

[+] waynesonfire|2 years ago|reply
Maybe my team of 15 engineers that manage the k8s stack can do it.
[+] airocker|2 years ago|reply
Maybe GKE and EKS should make LTS versions.
[+] fostware|2 years ago|reply
For a group so devoted to "cattle, not pets", so many responses here indicate an almost constant need for hands-on effort from upgrade testing, UAT, right through to post-upgrade hypercare.

I'd like a slightly longer LTS purely so I'm not having to spend all my time spinning the plates to keep things up. I don't need 10 years LTS, I need three so I can work with the rest of the enterprise that moves even slower.

[+] HPsquared|2 years ago|reply
To borrow from reliability engineering, software failures in practice can approximate a "bathtub curve".

That is: an initial high failure rate (teething problems), a low failure rate for most of the lifespan (when it's actively maintained), then gradually increasing failure rate (in hardware this is called wear-out).

Unlike hardware, software doesn't wear out but the interfaces gradually shift and become obsolete. It's a kind of "gradually increasing risk of fatal incompatibility". Something like that.

I wonder if anyone has done large-scale analysis of this type. Could maybe count CVEs, but that's just one type of failure.
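
As a toy illustration (not fitted to any real data), a bathtub-shaped hazard can be modeled as a decreasing "teething" term, a constant baseline, and a growing "obsolescence" term:

```python
# Toy bathtub hazard curve: illustrative only, with made-up coefficients.
import math

def hazard(t: float, infant: float = 0.5, base: float = 0.05,
           wearout: float = 0.002) -> float:
    """Failure rate at time t (arbitrary units) for a toy bathtub model."""
    # decreasing early-failure term + constant baseline + rising wear-out term
    return infant * math.exp(-t) + base + wearout * t * t
```

The curve is high at t=0, dips through the mid-life plateau, and rises again late, which is the shape the parent comment describes.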

[+] b7we5b7a|2 years ago|reply
Perhaps it's because I work in a small software shop and we do only B2B, but 99% of our applications consist of a frontend (JS served by an nginx image), a middleware (RoR, C#, Rust), nginx ingress and cert-manager. Sometimes we have PersistentVolumes, for 1 project we have CronJobs. SQL DBs are provisioned via the cloud provider. We monitor via Grafana Cloud, and haven't felt the need for more complex tools yet (yes, we're about to deploy NetworkPolicies and perform other small changes to harden the setup a bit).

In my experience:

- AKS is the simplest to update: select "update cluster and nodes", click ok, wait ~15m (though I will always remember vividly the health probe path change for LBs in 1.24 - perhaps a giant red banner would have been a good idea in this case)

- EKS requires you to manually perform all the steps AKS does for you, but it's still reasonably easy

- All of this can be easily scripted
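
As a sketch of "easily scripted": a wrapper can just assemble the provider CLI invocations. The command names below follow the az/aws CLIs as I understand them, but treat the exact flags as an assumption to verify against current CLI docs.

```python
# Hypothetical helper that builds managed-cluster upgrade commands.
# Flag spellings are assumptions; verify against the az/aws CLI references.

def aks_upgrade_cmd(resource_group: str, cluster: str, version: str) -> list[str]:
    """AKS: one command upgrades control plane and node pools together."""
    return ["az", "aks", "upgrade", "--resource-group", resource_group,
            "--name", cluster, "--kubernetes-version", version, "--yes"]

def eks_upgrade_cmd(cluster: str, version: str) -> list[str]:
    """EKS: this upgrades the control plane only; node groups are separate steps."""
    return ["aws", "eks", "update-cluster-version",
            "--name", cluster, "--kubernetes-version", version]
```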

I totally agree with the other comments here: LTS releases would doom the project to supporting >10y-old releases just because managers want to "create value" but don't want to spend a couple weeks a year caring for the stuff they use in production. Having reasonably up-to-date, maintainable infrastructure IS value to the business.

[+] oneplane|2 years ago|reply
No it doesn't.

If you can't keep up you have two options:

1. Pay someone else to do it for you (effectively an LTS)

2. Don't use it

Software is imperfect, processes are imperfect. An LTS doesn't fix that, it just pushes problems forward. If you are in a situation where you need a frozen software product, Kubernetes simply doesn't fit the use case and that's okay.

I suppose it's pretty much all about expectations, and managing those instead of trying to hide mismatches, bad choices, and ineptitude (most LTS use cases). It's essentially x509 certificate management all over again: if you can't do it right automatically, that's not the certificate lifetime's fault, it's the implementor's fault.

As for option 1: that can take many shapes, including abstracting away K8S entirely, replacing entire clusters instead of 'upgrading' them, or having someone do the actual manual upgrade. But in a world with control loops and automated reconciliation, adding a manual process seems a bit like missing the forest for the trees. I for one have not seen a successful use of K8S where it was treated like an application that you periodically manually patch. Not because it's not possible to do, but because it's a symptom of a certain company culture.

[+] mschuster91|2 years ago|reply
As someone working in managing a bunch of various Kubernetes clusters - on-prem and EKS - I agree a bit. Managing Kubernetes versions can be an utter PITA, especially keeping all of the various addons and integrations one needs to keep in sync with the current Kubernetes version.

But: most of that can be mitigated by keeping a common structure and baseline templates. You only need to validate your common structure against a QA cluster and then roll out necessary changes onto the production cluster... but most organizations don't bother and let every team roll their own k8s, pipelines and whatnot. This will lead to tons of issues inevitably.

Asking for a Kubernetes LTS is in many cases just papering over organizational deficiencies.

[+] yrro|2 years ago|reply
For comparison, Red Hat denotes every other minor release of OpenShift (4.8, 4.10, 4.12) as an Extended Update Support release, with 2 years of support.
[+] chitraa|2 years ago|reply
What are your thoughts on the feasibility of an LTS for Kubernetes? Do you think it's something the community would embrace?
[+] gtirloni|2 years ago|reply
Kubernetes LTS goes by different names: AWS EKS, Azure AKS, Google GKE, SUSE Rancher, etc.
[+] voytec|2 years ago|reply
> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

> This bot triages issues according to the following rules: ...

> /close

[+] solatic|2 years ago|reply
The economic math is very simple: organizations that release faster are more responsive to the market, have lower operational costs, and are therefore more efficient. Market players that release slower will get squeezed by more efficient players. It may happen later rather than sooner but it is inevitable.

As a business, you can decide to become more efficient, or you can decide to try to support the status quo that enjoys internal political equilibrium. Smart businesses go for the former, most businesses go for the latter.

[+] tomjen3|2 years ago|reply
Let's Encrypt will only give you a cert that is good for 3 months, because they want you to automate the renewal.

I don't think K8S should create an LTS; I think they should make it dirt simple to update.

[+] simiones|2 years ago|reply
Honestly, rather than an LTS, I think k8s needs a much better upgrade process. Right now it is really poorly supported, without even the ability to jump between versions.

If you want to migrate from 1.24 to 1.28, you need to upgrade to 1.25, then 1.26, then 1.27, and only then can you go to 1.28. This alone is a significant impediment to the normal way a project upgrades (lag behind, then jump to latest), and would need to be fixed before any discussion of an actual LTS process.
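
A sketch of the constraint being described, assuming (as the comment says) that control-plane minor versions cannot be skipped:

```python
# Compute the sequence of minors you must step through. Illustrative helper,
# not a real kubeadm/upstream tool.

def upgrade_path(current_minor: int, target_minor: int) -> list[str]:
    """List every 1.x minor that must be upgraded to, in order."""
    if target_minor < current_minor:
        raise ValueError("downgrades not supported")
    return [f"1.{m}" for m in range(current_minor + 1, target_minor + 1)]
```

So a 1.24 cluster that lagged behind has four full upgrades ahead of it, not one.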

[+] nezirus|2 years ago|reply
The cynical voice inside me says it works as intended. The purpose of k8s is not to help you run your business/project/whatever, but to be a way to ascend to DevOps Nirvana. That means a never-ending cycle of upgrades for the purpose of upgrading.

I guess too many people are using k8s where they should have used something simpler. It's fashionable to follow the "best practices" of FAANGs, but I'm not sure that's healthy for the vast majority of other companies, which are simply not at the same scale and don't have armies of engineers (guardians of the holy "Platform").

[+] airocker|2 years ago|reply
This is sorely needed. Weird upgrade schemes in GKE (or any other vendor built on top of Kubernetes) cause service disruption for weeks. There are too many options with non-intuitive defaults for how the control plane and the node pools will be repaired or upgraded. Clusters get upgraded without our knowledge and break services arbitrarily. Plus, if you are doing infrastructure-level changes, you have to put in extreme effort to keep upgrading.

IMHO, infrastructure is too low-level for frequent updates. Older versions need LTS.