top | item 25908350

benchess | 5 years ago

Hi! Co-author here. We do keep the nodes running 24/7, so Kubernetes still provides the scheduling that decides which nodes are free at any given time. Generally, starting a container on a pre-warmed node is still much, much faster than booting a VM. Also, some of our servers are bare metal.

EDIT: Also don't discount the rest of the Kubernetes ecosystem. It's more than just a scheduler. It provides configuration, secrets management, healthchecks, self-healing, service discovery, ACLs... there are absolutely other ways to solve each of these things. But when starting from scratch there's a wide field of additional questions to answer, problems to solve.
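To make the feature list above concrete, here's a minimal sketch (all names, images, and values are hypothetical, not from the article) showing two of those pieces in one manifest: a Secret supplying configuration and a liveness probe providing healthchecks/self-healing:

```yaml
# Hypothetical sketch: a Secret for configuration plus a liveness
# probe; the kubelet restarts the container when the probe fails.
apiVersion: v1
kind: Secret
metadata:
  name: worker-credentials        # hypothetical secret name
type: Opaque
stringData:
  api-token: changeme
---
apiVersion: v1
kind: Pod
metadata:
  name: render-worker             # hypothetical workload name
spec:
  containers:
    - name: worker
      image: example.com/render-worker:latest   # hypothetical image
      env:
        - name: API_TOKEN
          valueFrom:
            secretKeyRef:         # configuration injected from the Secret
              name: worker-credentials
              key: api-token
      livenessProbe:              # healthcheck -> self-healing on failure
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
```

Each of these could be solved separately (Vault, consul, custom supervisors), but they come answered out of the box here.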

xorcist | 5 years ago

Isn't Kubernetes a pretty lousy scheduler when it doesn't take this into consideration? There are a number of schedulers used in high performance computing that should be able to do a better job.

chubot | 5 years ago

Yeah exactly... This seems closer to an HPC problem, not a "cloud" problem.

Related comment from 6 months ago about Kubernetes use cases: https://lobste.rs/s/kx1jj4/what_has_your_experience_with_kub...

Summary: scale has at least 2 different meanings. Scaling in resources doesn't really mean you need Kubernetes. Scaling in terms of workload diversity is a better use case for it.

Kubernetes is basically a knockoff of Borg, but Borg is designed (or evolved) to run diverse services (search, maps, gmail, etc.; batch and low latency). Ironically most people who run their own Kube clusters don't seem to have much workload diversity.

On the other hand, HPC is usually about scaling in terms of resources: running a few huge jobs across many nodes. A single job will occupy entire nodes (thousands of them), which is what's happening here.
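For the whole-node case, scheduling really does collapse to "is this node free or not." A hedged sketch (hypothetical job name, image, and node sizes) of how that looks in Kubernetes terms, with resource requests sized to fill a node so at most one job lands per machine:

```yaml
# Hypothetical sketch: a batch Job whose requests consume essentially
# an entire 64-core / 256Gi node, leaving a little headroom for system
# daemons, so the scheduler places at most one such job per node.
apiVersion: batch/v1
kind: Job
metadata:
  name: hpc-style-job             # hypothetical job name
spec:
  template:
    spec:
      restartPolicy: Never        # batch semantics: don't restart on completion
      containers:
        - name: solver
          image: example.com/solver:latest   # hypothetical image
          resources:
            requests:
              cpu: "63"
              memory: 250Gi
            limits:
              cpu: "63"
              memory: 250Gi
```

An HPC scheduler like Slurm or HTCondor would add things Kubernetes lacks out of the box, like gang scheduling and fair-share queueing across thousands of such jobs.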

I've never used these HPC systems but it looks like they are starting to run on the cloud. Kubernetes may still have been a defensible choice for other reasons, but as someone who used Borg for a long time, it's weird what it's turned into. Sort of like protobufs now have a weird "reflection service". Huh?

https://aws.amazon.com/blogs/publicsector/tag/htcondor/

https://aws.amazon.com/marketplace/pp/Center-for-High-Throug...

AlphaSite | 5 years ago

If all you care about is whether a node is in use or not, I think it’s fine. You don’t need anything complex from the scheduler.

hamandcheese | 5 years ago

Not to mention it’s a well-known skill set that can more easily be hired for, as opposed to “come work on our crazy-sauce job scheduler, you’ll love it!”

stonogo | 5 years ago

Are you starting from scratch? This architecture seems like a pretty standard HPC deployment with unnecessary containerization involved.

dijit | 5 years ago

I feel like we solved this problem over a decade ago (if you’re keeping machines warm anyway) with job brokers. Am I somehow mistaken?

torbital | 5 years ago

> self-healing, service discovery

For a second I read that as self-discovery

Damn kubernetes is some good shit

yongjik | 5 years ago

Well, considering how looking at kubernetes config makes me question the choices I have made in my life that led me into this moment, "self-discovery" is not too far off, I think.