
davidopp__ | 7 years ago

[Disclaimers: I worked on Borg and Omega, and currently work on Kubernetes/GKE. Everything here is my personal opinion.]

There's a lot to unpack here, but I'll do my best.

I don't see Kubernetes locking people into GKE. There's an extensive conformance program (https://github.com/cncf/k8s-conformance) administered by the CNCF. AWS and Azure both have certified hosted Kubernetes offerings. Portability is in Google's best interest.

Go, Docker, and etcd were the best open-source technologies for the job at the time Kubernetes was created (and arguably still are). Open-sourcing Borg would have been impossible, due to its use of many Google-specific libraries (though a number of those have been open-sourced since then), and its close coupling to the Google production environment. Commenting more specifically on each of the pieces you mentioned:

* Go was chosen over C++ because, like C++, it is a systems language, but is much more accessible for building an open-source community.

* Docker was (and still is) by far the most popular container runtime, and the slimmer containerd makes it even more appropriate to serve as the container runtime for a system like Kubernetes. While it's true that in Borg the container runtime and "package" (container image) management systems are separate, the tradeoffs between packaging more in the image vs. pre-installing dependencies on the host are exactly the same as with Docker images. In any event, it's entirely feasible to build very slim Docker images (you definitely don't need getty in your image :-).

* You can read the reasons etcd was chosen in this recent comment (https://news.ycombinator.com/item?id=17476142) from a Red Hat employee who is one of the earliest contributors to Kubernetes and one of the most prolific. Regarding consensus, I didn't understand your comment; Borg uses Paxos and etcd uses Raft, but those are basically equivalent algorithms.

Regarding scalability, we do continuous scalability testing as part of the Kubernetes CI pipeline, at a cluster size of 5000 nodes. If you're interested in learning more, I'd encourage you to join the scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...). I'm not aware that "messaging around Kubernetes has gravitated toward smaller, targeted clusters." It's true that a lot of people do use small-ish clusters, but AFAICT that's not because of scalability limitations, but rather because (1) the hosted Kubernetes offerings make it so easy to spin up clusters on demand, and (2) until recently, Kubernetes was lacking critical multi-tenancy features that would allow, say, multiple teams within a company to safely share a cluster.

Regarding mixing batch and interactive/serving applications in a single cluster managed by a single control plane, this has been the intention of Kubernetes from the beginning. It's true that open-source batch systems like Hadoop and Spark have traditionally shipped with their own orchestrators/schedulers, but that's starting to change as Kubernetes becomes more popular; for example, Spark now supports Kubernetes natively (https://kubernetes.io/blog/2018/03/apache-spark-23-with-nati...). In terms of features that enable batch and serving workloads to share a node and a cluster, Kubernetes has had the concept of QoS classes (https://kubernetes.io/docs/tasks/configure-pod-container/qua...) from the beginning, and as of the most recent Kubernetes release we now have priority/preemption (https://cloudplatform.googleblog.com/2018/02/get-the-most-ou...). QoS classes and priority/preemption are the two main concepts that allow batch and interactive/serving applications to share nodes and clusters in Borg, and we now have both in Kubernetes.
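To make that concrete, here is a minimal sketch (the class name, priority value, and resource numbers are all made up for illustration, and the priority API was still scheduling.k8s.io/v1beta1 around this release): a pod whose requests equal its limits lands in the Guaranteed QoS class, and priorityClassName ties it to a PriorityClass.

    # Hypothetical PriorityClass for latency-sensitive serving work.
    apiVersion: scheduling.k8s.io/v1beta1
    kind: PriorityClass
    metadata:
      name: serving-critical        # illustrative name
    value: 1000000                  # higher value schedules first and may preempt
    globalDefault: false
    description: "Serving workloads that may preempt batch pods."
    ---
    # Guaranteed QoS: requests == limits for every resource.
    apiVersion: v1
    kind: Pod
    metadata:
      name: frontend
    spec:
      priorityClassName: serving-critical
      containers:
      - name: app
        image: example/frontend:1.0   # placeholder image
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"

A batch pod would instead use a lower-value PriorityClass, and perhaps set only requests (Burstable QoS), so it yields to serving pods under resource pressure.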

On your fifth point, I agree that this is one of the strengths of the Google production environment, but Kubernetes is limited in how prescriptive it can be in dictating how people write applications, since we want Kubernetes to work with essentially any application. This is why we have, for example, extremely flexible liveness/readiness probes in Kubernetes (https://kubernetes.io/docs/tasks/configure-pod-container/con...) rather than the expectation that every application has a built-in web server that exports a predefined /statusz endpoint. That said, we have been more prescriptive in how to build Kubernetes control plane components (for example, such components generally have /healthz endpoints and export Prometheus instrumentation according to the guidelines outlined at https://github.com/kubernetes/community/blob/master/contribu...). Over time, as containers and the "cloud native" architecture become more popular, I think there will be more standardization along the lines you described, once people see the benefit of being able to plug their applications directly into standard container ecosystems. To some extent Istio (https://github.com/istio/istio) is a step in that direction, and in some sense even better, because it interposes transparently rather than requiring you to build your application a particular way.
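For context, here is what a probe fragment on a container spec looks like (the path, port, and timings are illustrative; the app just needs to answer an HTTP GET on whatever endpoint it chooses, rather than implement a fixed /statusz convention):

    # Fragment of a container spec; the endpoint and timings are made up.
    livenessProbe:
      httpGet:
        path: /healthz        # any path your app already serves
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:
      exec:                   # probes can also run a command or open a TCP socket
        command: ["cat", "/tmp/ready"]
      periodSeconds: 5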

For anyone interested in learning more about the evolution of cluster management systems at Google, I recommend this paper: https://ai.google/research/pubs/pub44843

While Kubernetes is definitely not the same codebase as Borg, I do think it's accurate to say that it is the descendant of Borg.

elvinyung | 7 years ago

Dumb question: why does K8s use a centralized architecture like Borg, if the perf gains from an Omega-style decentralized shared-state scheduler (and maybe a Mesos-style two-level scheduler for batch with multiple frameworks) were already known, and Omega was already being folded back into Borg?

Is this related to (I'm assuming) the fact that K8s was originally architected "mostly" with serving rather than batch in mind, and a monolithic scheduler was "good enough"?

(Disclaimer: I haven't really followed K8s stuff in the last few months. How is multi-scheduler support for K8s nowadays, anyways?)

davidopp__ | 7 years ago

You can actually build an Omega vertical / Mesos framework architecture on Kubernetes, as described in this doc[1]. That doc pre-dated CRDs; the way you'd do it today is to build the application lifecycle management part of the framework using a CRD + controller, and run an application-specific scheduler (for pods created by that controller) alongside the default scheduler. The Kubernetes documentation page explaining how to run multiple/custom schedulers is here[2].
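As a rough sketch of the CRD half (the SparkApplication group and kind here are purely hypothetical, and apiextensions.k8s.io/v1beta1 was the CRD API at the time), the framework would declare its application type like this and pair it with a controller that creates pods:

    # Hypothetical custom resource for a framework's application objects.
    apiVersion: apiextensions.k8s.io/v1beta1
    kind: CustomResourceDefinition
    metadata:
      name: sparkapplications.example.com   # must be <plural>.<group>
    spec:
      group: example.com
      version: v1alpha1
      scope: Namespaced
      names:
        kind: SparkApplication
        plural: sparkapplications
        singular: sparkapplication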

Borg only worked with a single scheduler, but Kubernetes allows you to build Omega/Mesos style verticals/frameworks and associated scheduling as user extensions to the control plane (as described above).

[1] https://github.com/kubernetes/community/blob/master/contribu...

[2] https://kubernetes.io/docs/tasks/administer-cluster/configur...
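For reference, attaching a pod to a non-default scheduler is just one field on the pod spec (my-custom-scheduler is a hypothetical name for a scheduler you would deploy yourself; left unset, it defaults to default-scheduler):

    apiVersion: v1
    kind: Pod
    metadata:
      name: batch-worker
    spec:
      schedulerName: my-custom-scheduler   # hypothetical custom scheduler
      containers:
      - name: worker
        image: example/batch-worker:1.0    # placeholder image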

[Disclaimer: I work on Kubernetes/GKE at Google.]

repolfx | 7 years ago

As a Xoogler myself, I have always wondered about the logic of "we can't open source X because it uses too many libraries and is too integrated". The obvious answer is, OK, open source the libraries and refactor the integrations to make them more flexible.

Reimplementing all of Borg from scratch seems crazy to me given the huge effort that went into it. Does Google want an open source cluster infrastructure or not? If yes, in what universe is it less effort to write a totally new one from scratch vs just progressively open sourcing things?

jsnell | 7 years ago

What's the size of the transitive dependency graph of Borg? 10MLOC? 50MLOC? 100MLOC? I have no idea. But it's a lot of code no matter what. Open sourcing that much code is a huge undertaking, unless you're just planning to throw it over the wall with no expectation of external people working on it.

On the other hand starting from scratch you get to grow the community and the codebase in lockstep.

justicezyx | 7 years ago

Google software generally takes a bottom-up approach that is completely different from industry norms and standards, and that divergence dates back to Google's very beginning.

Open sourcing system software from its internal state requires the same amount of work as rewriting it, plus the effort to morph interfaces and internals to fit external needs, plus changes to internal workloads (assuming a unified internal and external stack).

jsmthrowaway | 7 years ago

So it descends from Borg, which is fine. But it does not replace Borg, nor does it indicate a Google strategy to replace Borg with Kubernetes; that was my entire point, with supporting points on why, and explaining why the choices in Kubernetes were made does not dispute it at all.

I note you were careful to use the word descendant instead of my word, successor.

What I mean is simple: Borg has borgmaster. Kubernetes approached the same concept like a Web application, and now Kubernetes has an entire SIG to play on the same field as Borg. It was a poor architectural decision, along with many others in Kubernetes, but I wasn’t discussing that. I was discussing why Google won’t replace Borg with it.