Bare Metal K8s Clustering at Scale

88 points | tdurden | 7 years ago | medium.com | reply

67 comments

[+] zbentley|7 years ago|reply
To everyone asking "why would you ever do this", ask them, or look for videos of the talk/etc., but don't assume the decision was poorly thought through. That seems prejudicial at best (and I say that as someone that regularly pushes back on web software folks I work with trying to build huge-company-scale orchestration layers when they don't need them).

I saw these guys talk at QCon. It was a fascinating talk, and an excellent example of SRE adaptability and unconventional innovation under unusual constraints.

Not speaking for them, just from my memories of the talk and the following Q/A, but their reasons for this stack were primarily:

- They couldn't run in the cloud because connectivity to their sites is often terrible.

- They mostly ran IoT stuff from the k8s clusters--automated kitchen equipment like fryers and fridges, order tracking/status screens, building control systems, and metrics aggregation so they can see how businesses are doing.

- Because of the bad connectivity, a "fetch"/"push" (from the k8s clusters at the edge) model was needed for deployments/logging/administration/getting business data back up to the cloud.

- They explicitly did not process payments.

- k8s was used primarily for ease of deployment and providing a base layer of clustered reliability for pretty simple services. Since the boxes in the cluster were running in often-unventilated racks/closets full of junk in random restaurants, having that base layer was very important to them. Other solutions were evaluated and they chose k8s after consideration.

- Unlike typical IoT/automation setups here, they wanted to be able to experiment, monitor, and deploy software without the traditional industrial control practice of "take shit down, flash your controller (call a tech if you don't understand that), spin it up, and if it breaks you're down until we ship a new control unit or you manually fail over to a backup".

- However, they didn't want to fall into the IoT over-the-air update security pitfalls (it would really suck if someone hacked your fridge's temperature control system and gave a week's worth of customers salmonella). As a result they spent a ton of time making very good (and simultaneously very simple) deployment/update authorization and tracking tools. They chose the "pull" model and keying/security layers explicitly to avoid having to think about tons of open remote-access vectors and/or site hijacking (a toy sketch of the pull idea follows this list).

- The k8s tooling (and some of their own) allowed easy, remote rollbacks to "default/clean state" in case something went wrong, which was critical given that downtime might compromise a restaurant and having a "reset button" automated in was important for ease-of-use by nontechnical, overworked site managers.

- The clustering allowed individual nodes to fail (which they will, given the environment), and let people yank a node manually with confidence.

- While, as some commenters pointed out, the leader (re)election system chosen might be unacceptably slow/randomized for, say, a cloud database, it is perfectly sufficient for failing over a control system in a restaurant. A few seconds of delay on an order tracking screen, or a system reboot/state-loss of in-flight orders, is vastly preferable to some split-brain situation making the restaurant accidentally cook 1.25x the correct number of sandwiches for hours, to go to waste.
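
Concretely, the "pull" model is roughly this shape (my sketch from memory of the talk, not their code; the endpoint, key path, and interval are all invented):

    # hypothetical pull-style update agent: poll a signed manifest, verify
    # the signature, and only then apply -- no inbound remote access needed
    while true; do
        curl -fsSL https://updates.example.com/site/manifest.json -o /tmp/manifest.json
        curl -fsSL https://updates.example.com/site/manifest.sig -o /tmp/manifest.sig
        if openssl dgst -sha256 -verify /etc/keys/updates.pub -signature /tmp/manifest.sig /tmp/manifest.json; then
            kubectl apply -f /tmp/manifest.json
        fi
        sleep 300
    done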

It's important to understand their use case: they needed to basically ship something with the reliability equivalent of a Comcast modem (totally nontechnical users unboxed it, plugged it in, turned it on, and their restaurant worked) to extremely poorly-provisioned spaces (not server rooms) in very unreliable network environments. For them, k8s is an (important) implementation detail. It lets them get close to the substrate-level reliability of a much more expensive industrial control system in their sites (with clustering/reset/making sure everything is containerized and therefore less likely to totally break a host), while also letting them deploy/iterate/manage/experiment with much more confidence and flexibility than such systems provide.

I think this is a great story of using new tools for a novel (or at least unusual) purpose, and getting big benefits from it.

Brian, Caleb: great talk, great writeup. Sorry HN is . . . being HN. Keep at it.

Edit: QCon talk summary is here: https://www.infoq.com/news/2017/07/iot-edge-compute-chick-fi.... If you have any employees/friends that went, they should have access to the video. It may be made public at some point, too.

[+] oblio|7 years ago|reply
> Brian, Caleb: great talk, great writeup. Sorry HN is . . . being HN. Keep at it.

I don't think HN was super vicious. They presented an out-of-the-box solution to a problem but they didn't define the problem fully. Based on what we saw, their solution seemed way overkill.

Glad to hear that there was a solid reason behind it, not just hype and recruiting buzz.

[+] mikehollinger|7 years ago|reply
Hey! Thanks for the comment here. As I said elsewhere in this thread I really do look forward to understanding the “why” for this choice.
[+] coffeesn0b|7 years ago|reply
Caleb here - you nailed it. I really don't have anything to add!
[+] reacharavindh|7 years ago|reply
Wish the article had more context for readers.

These days it feels like everybody needs to throw Kubernetes at everything, introducing complexity for the sake of being cool.

I guess those of us who like to run non-distributed software for small-scale applications are the new grumpy greybeards....

[+] eddieroger|7 years ago|reply
Maybe it's because I'm also in retail IT and have seen similar models, but I get it. At the risk of oversimplifying, they've got two problems: code written in an office somewhere has to run identically in 2,000 geographically separated places, and it needs to do so with some level of protection against failure. All other k8s gloss aside, it's a good way of making sure that there are multiple instances of a Docker container running. Assuming they set the cluster up correctly, they can suffer a hardware failure without downtime, and assuming they're using those little NUCs in the photo, they can do so for under $1500. Sure, a single server-grade piece of hardware would probably do just fine and have similar protection from failure, but this is essentially that on commodity hardware.
[+] chx|7 years ago|reply
> for small scale applications

I am a grumpy greybeard no doubt, but I still maintain: most websites do not need more than a single server -- certainly not more than a single database server. And, for most, a few-hundred-dollar dedicated server is plenty. Apply YAGNI until blue in the face.

[+] itchyouch|7 years ago|reply
In my particular environment, we have over 15k JVMs running across 2k hosts just for our US applications. K8s absolutely makes sense.

But for non-distributed software with only several clients, the traditional model is still fine. E.g. we still run GitLab as a pet to serve our cattle infrastructure.

[+] throwbacktictac|7 years ago|reply
I was curious about what problem was being solved too. I can't imagine what a chicken restaurant needs with a distributed k8s cluster.
[+] coffeesn0b|7 years ago|reply
(I'm re-pasting this intro to a few posters)

Hey! I'm Caleb, the SRE that helped build this solution...

Sorry about the lack of context in the article, it was intended for a specific audience (QCon) where we gave a lot more context to the problem at hand.

What we were trying to solve for was: 1) low latency, 2) high availability, 3) container-based, zero-downtime deployments, and 4) continued operations even in an internet-down event.

If you're running applications in a few locations, in small scale environments, our approach would be way overkill for that problem set.

[+] zimbatm|7 years ago|reply
This type of deployment is a perfect fit for NixOS: immutable deployments with zero configuration drift, easy rollback, and options to both push and pull system configuration updates. Unlike CoreOS or Rancher, it's also easy to customize the system to the hardware, while still providing pre-built binaries of all the dependencies.

Setting up a single-node Kubernetes is basically adding one line to the system config:

    services.kubernetes.roles = ["master" "node"];
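
Then a plain rebuild activates it, and rolling back is one command if it misbehaves:

    sudo nixos-rebuild switch              # build and activate the new generation
    sudo nixos-rebuild switch --rollback   # revert to the previous generation
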
[+] beepbeepbeep1|7 years ago|reply
The interesting bit they don't detail is why they are running k8s at the edge in restaurants.

The only reason I can think of is that they get to push point-of-sale software out from some central system by using k8s. I can't think of a worse use/abuse of k8s than as a software update system, if that's what they are doing.

The other reason would be that they distributed their compute and restaurants pay the power bill, but that sounds just as silly.

Curious to know why you would use k8s at the edge.

[+] ealexhudson|7 years ago|reply
They made a comment about having some kind of IoT infrastructure in each restaurant.

It absolutely smells of over-engineering, though. There are far easier ways of pushing software out than maintaining k8s locally, and they're almost certainly going to need to build a system which manages and monitors all these clusters...

[+] zbentley|7 years ago|reply
I saw them talk at QCon and asked them that question. They are not running PoS through this system. They are instead using it to manage all of the devices in their restaurants (from kitchen hardware like automated fridges/fryers/IoT stuff to order tracking, building temperature control, and metrics tracking).
[+] philip1209|7 years ago|reply
Another idea: Maybe they want the restaurant software to continue running locally if it is offline, but they want to sync data while online.

This would make a lot of sense with something like CockroachDB - if the restaurant was offline, their local data would be preserved. But, as soon as it goes back online, then corporate would have access to all of the data.

[+] bhouston|7 years ago|reply
Why don't restaurants run web apps on commodity hardware like iPads? But I guess then you are dependent on the internet being up.
[+] fredsted|7 years ago|reply
I'm looking forward to the article detailing why they decided to do this.
[+] FrancoisBosun|7 years ago|reply
I'm pretty sure this is related to being able to continue running the applications even if the venue loses Internet access. You can't stop processing orders if your Internet access has a hiccup.
[+] roncohen|7 years ago|reply
This caught my eye: a home-made leader election protocol that relies on UDP.
[+] madmax96|7 years ago|reply
This one was kind of troubling, if you ask me:

    >If the leader ever dies, a new leader will be elected
    >through a simple protocol that uses random sleeps and
    >leader declarations.
Why not have each node self-generate a UUID and engage in some gossip process that ends with the cluster agreeing that one node's UUID is uniquely significant (say, the lowest), therefore recognizing that node as the leader?

I have some really bad memories of "random sleeps" at scale.
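
For what it's worth, the scheme as described is roughly this (a toy on one machine, not their code; an atomic lock file stands in for the UDP "leader declaration"):

    # each "node" sleeps a random backoff, then tries to declare itself
    # leader; mkdir is atomic, so exactly one declaration wins here
    election() {
        sleep "0.$((RANDOM % 10))"
        if mkdir /tmp/leader.lock 2>/dev/null; then
            echo "node $1 declares itself leader"
        fi
    }
    rm -rf /tmp/leader.lock
    for n in 1 2 3; do election "$n" & done
    wait

Note that with UDP broadcast there is no atomic test-and-set, so two nodes drawing the same sleep is exactly the tie/split-brain case you have to break.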

[+] alainchabat|7 years ago|reply
Does anyone have a solution/tool to easily run Kubernetes on a single bare-metal server? Kubernetes or any other Docker container "orchestration" tool. I tried to google (certainly with the wrong keywords) and found either quite complex processes, or maintained tools that are mainly for AWS/GCP.
[+] striking|7 years ago|reply
If you want just one server, minikube is probably the way to go, even if it does run in a VM. Otherwise, use https://kubernetes.io/docs/setup/independent/create-cluster-... to set up a cluster (one node or otherwise).

I used flannel for pod networking, as it's really simple. If you want to run app pods on your master node, remember to untaint it. ingress-nginx is probably your best bet as an ingress controller, especially because of the amount of support given to it by the k8s Slack.
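
Roughly (from memory, so double-check the flags against the docs for your k8s version):

    # init the master with flannel's default pod CIDR
    kubeadm init --pod-network-cidr=10.244.0.0/16
    # install flannel as the pod network
    kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
    # single-node only: untaint the master so app pods can schedule on it
    kubectl taint nodes --all node-role.kubernetes.io/master-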

It is a non-zero amount of work. If it sounds like too much work for something you're throwing together, it probably is. It is generally unnecessary.

[+] tetraodonpuffer|7 years ago|reply
Not sure if you would classify it as a complex process, but I have a series of posts on my blog (link in my profile) that takes you from a bare-metal box to a k8s cluster running on Xen. A lot of it is explaining how things work, so it is a bit verbose, but it hopefully should not be too hard to follow. Given that it walks you through setting up Xen, then CoreOS, then etcd, and finally Kubernetes, it does take several posts...
[+] gnufied|7 years ago|reply
There is `./hack/local-up-cluster.sh` if you download the source code of Kubernetes, which will give you a local bare-metal cluster. The one caveat is that when you shut down the cluster, all files are deleted.

There is also "./cluster/get-kube-local.sh" that is supposed to give you a working local cluster. But it appears to be broken right now. Might be worth opening a GH issue for that.
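
Assumed usage, from a source checkout (it compiles first, so expect it to take a while):

    git clone https://github.com/kubernetes/kubernetes
    cd kubernetes
    ./hack/local-up-cluster.sh   # builds, then starts a local single-node cluster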

[+] bryanlarsen|7 years ago|reply
kubeadm. Pretty much any other recommendation (such as minikube or kubespray) is just using kubeadm under the hood. They still exist because they add something that kubeadm is missing, like multi-node cluster support, specific cloud support, or a VM for development. But you don't need any of those, so kubeadm is what you want.
[+] kryptk|7 years ago|reply
Minikube works great to get your feet wet, but it's not suitable for anything except a playground.
[+] yanslookup|7 years ago|reply
This is great. Does anyone have guides on how to do the cluster-creation bootstrap on public clouds where you don't get a known DNS name ahead of time and master nodes may come and go? I.e., I want to bake an AMI and create an ASG so that we can turn it on and it will self-cluster, create certs, etc., and can add and remove nodes at the whim of the ASG.
[+] coffeesn0b|7 years ago|reply
We haven't open sourced how we do this yet... we have an MVP way of doing it by using Ansible to provision the NUCs, and nmap (please don't laugh!) so that they can find each other on a specific virtual network at the restaurants.

We're replacing a lot of these solutions with "better ways" over the next weeks and months, but I'd be happy to share how we went about it. You can contact me on LinkedIn: https://www.linkedin.com/in/calebrhurd/

The biggest key was that we use RKE for the clustering/certs on bare metal. That's definitely our secret sauce (pun intended).
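
To give a flavor of the nmap part (the subnet here is made up, not our actual addressing), a ping sweep of the store's private network is enough for a node to find candidate peers:

    # discover live hosts on the (hypothetical) restaurant subnet
    nmap -sn 10.0.1.0/24 -oG - | awk '/Up$/ {print $2}'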

[+] stuff4ben|7 years ago|reply
This is the first I've heard of RKE as a K8s installer. I always just thought it was another name for Rancher 2.0. Would love to see a good comparison of Kubeadm vs RKE. This article briefly mentions kubeadm and that they didn't choose it.
[+] Symmetry|7 years ago|reply
I went into this thinking they were using old AMD K8s clustered into a budget supercomputer.
[+] alexmorse|7 years ago|reply
I don't understand why you would do this for a restaurant at all.

What challenge is this addressing, what problem does this solve? Is there a problem to solve here?

I do assume there's a good reason for this, but as presented it seems like a very stupid waste of money.

[+] coffeesn0b|7 years ago|reply
Caleb here (SRE at Chick-fil-A)... we actually just did it because we thought it would look cool :-P

Hah... no, but seriously... we wrote this article for QCon attendees, and gave a lot more context during our talk at that conference. We didn't realize it was going to be on here, otherwise we would have explained the "why" and not just dived in.

What we were trying to solve for was: 1) low latency, 2) high availability, 3) container-based, zero-downtime deployments, and 4) continued operations even in an internet-down event.

Also, as an interesting side note, the equivalent hardware has about a 6-month ROI versus putting the entire load on AWS... granted, AWS would be more efficient, so that's not an entirely fair comparison, but the hardware is unbelievably inexpensive.

[+] danpalmer|7 years ago|reply
It sounds like a huge cost saving to me. Being able to install a few dumb machines in the restaurant and then have remote installation and management of applications running on them would be great. I imagine that Kubernetes would be more reliable than PXE-booting images across the internet (that often requires physically rebooting machines, which requires involvement of the restaurant staff, will be error-prone, etc.), not to mention that building bootable images with your software on them is not a very modern practice.

Bear in mind that in terms of cost, this is competing with a person driving to each restaurant and fiddling around with computers for an hour, which is a very expensive process.

[+] ofrzeta|7 years ago|reply
Why do you assume that it's a waste of money? I take it as a given that they operate some computer infrastructure in every venue, so you would have the same capex without Kubernetes, and you would still need some kind of management, so you would spend money on operations either way. Maybe it's not such a bad idea to use an off-the-shelf container-management system to roll out and operate their containerized applications.
[+] s2g|7 years ago|reply

[deleted]