top | item 39211567

whorleater | 2 years ago

I would be curious why kube cron jobs didn't seem to fit the bill. My favorite part of these posts is when they have a section hinting that they explored other options and picked specific tradeoffs.

gtirloni|2 years ago

Yeah, the article raises more questions than it answers.

> When designing this new, more reliable service, we decided to leverage many existing services to decrease the amount we had to build

This might explain why they built from scratch. Maybe the existing solutions had dependencies they didn't want to maintain, so they opted to use their existing internal systems instead. It feels like that decision influenced all the rest.

nickjj|2 years ago

Kubernetes cron jobs are pretty good I must say.

I definitely don't come anywhere near Slack's scale, but I've managed systems where over 3,000 cron jobs ran per day, half of which came from a single cron job that ran every minute and usually finished in a few seconds. Some of these jobs run for X minutes too.

It's nice because there are properties you can configure for each cron job around retries and whether it should run uniquely. Some cron jobs should be retried if they fail; for others, it's fine to let them get picked up on the next interval instead.
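For anyone curious, those properties map onto real fields in the Kubernetes CronJob API. A minimal sketch (the job name, image, and args here are hypothetical placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-report          # hypothetical job name
spec:
  schedule: "*/1 * * * *"       # run every minute
  concurrencyPolicy: Forbid     # run uniquely: skip if the previous run is still active
  startingDeadlineSeconds: 30   # count a run as missed if it can't start within 30s
  jobTemplate:
    spec:
      backoffLimit: 3           # retry a failed run up to 3 times
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example/report:latest  # hypothetical image
              args: ["--run-once"]
```

`concurrencyPolicy: Allow` or `Replace` are the alternatives for jobs where overlap or preemption is acceptable.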

Overall it's been super stable for the almost 2 years I've been using them. Only a handful of jobs failed over that period, and they weren't the result of Kubernetes: the HTTP endpoint being hit by the cron job failed to respond, and the cron job's failure threshold was reached.

It's a good reminder that important jobs run on a schedule should be resilient to failure (saving progress, idempotent, etc.).
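To make the "idempotent, saves progress" point concrete, here's a minimal sketch (not from the article; the file name and items are hypothetical) of a scheduled job that checkpoints completed work, so a retried run skips what's already done instead of double-applying it:

```python
# Sketch of an idempotent scheduled job: completed item IDs are
# checkpointed to disk, so a retry after a failure resumes where
# the previous run left off.
import json
import os

STATE_FILE = "job_state.json"  # hypothetical checkpoint location

def load_done():
    """Return the set of item IDs completed by earlier attempts."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def save_done(done):
    """Persist the completed set after each item (the checkpoint)."""
    with open(STATE_FILE, "w") as f:
        json.dump(sorted(done), f)

def process(item):
    # Placeholder for the real work (e.g. calling an HTTP endpoint).
    return f"processed {item}"

def run_job(items):
    done = load_done()
    results = []
    for item in items:
        if item in done:
            continue  # already handled on a previous attempt; skip
        results.append(process(item))
        done.add(item)
        save_done(done)
    return results
```

A second invocation with the same items is then a no-op, which is exactly the property that makes "retry on failure" safe.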

djboz|2 years ago

Do you capture all of your job code in a single image and reference execution paths on container startup per job? Or, are you building an image per job?

paxys|2 years ago

Spinning up a new Kubernetes pod for every single job run is an expensive and wasteful operation, costing at least on the order of seconds (usually more) vs. milliseconds for a new process in an already-hot environment.

zerbinxx|2 years ago

Sure, but if you need that thing to run every hour for a few seconds, then seconds aren’t really the limiting factor. I don’t doubt that the resource management side of k8s would make it dicey at a certain volume of these things running, though, especially if they eat a lot of compute.