item 17805829

We Don’t Run Cron Jobs (2016)

177 points | bedros | 7 years ago | engblog.nextdoor.com

140 comments

[+] Cieplak|7 years ago|reply
Cron works great when you don't need to guarantee execution, e.g., if a server goes down. Unfortunately, all the alternatives are pretty heavyweight, e.g., Jenkins, Azkaban, Airflow. I've been working on a job scheduler that strives to work like a distributed cron. It works with very little code, because it leans heavily on Postgres (for distributed locking, parsing time interval expressions, configuration storage, log storage) and PostgREST (for the HTTP API). The application binary (~100 lines of Haskell) polls for new jobs, then checks out and executes tasks. The code is here if you're interested:

https://github.com/finix-payments/jobs

It compiles to machine code, so deploying the binary is easy. That said, I’d like to add some tooling to simplify deploying and configuring Postgres and PostgREST.
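Cieplak doesn't spell out the locking scheme above, but Postgres advisory locks are a common way to get the "only one node runs this job" guarantee a distributed cron needs. A minimal sketch (the key-derivation scheme and the SQL usage in the comment are my assumptions, not necessarily what the linked repo does):

```python
import hashlib

def advisory_lock_key(job_name):
    """Map a job name to a signed 64-bit key for pg_try_advisory_lock().

    Postgres advisory locks take a bigint, so hash the name and fold it
    into the signed 64-bit range. (Hypothetical helper for this sketch.)
    """
    digest = hashlib.sha256(job_name.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# With a DB driver such as psycopg2, a worker would then do roughly:
#   cur.execute("SELECT pg_try_advisory_lock(%s)", (advisory_lock_key("nightly"),))
#   got_lock, = cur.fetchone()
#   if got_lock:
#       run_job()   # only one node in the cluster gets here
```

The lock is session-scoped, so if the worker's connection dies the lock is released automatically, which is what makes this attractive for "guarantee execution even if a server goes down" setups.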

[+] JdeBP|7 years ago|reply
That is hardly "all the alternatives". Some of the alternatives to the Vixie cron family (https://news.ycombinator.com/item?id=17005677) are:

* Uwe Ohse's uschedule, http://jdebp.uk./Softwares/uschedule/ https://ohse.de/uwe/uschedule.html

* Bruce Guenter's bcron, http://untroubled.org./bcron/

* GNU mcron, https://news.ycombinator.com/item?id=17002098

* Thibault Godouet's fcron, http://fcron.free.fr/

* Matt Dillon's dcron, http://www.jimpryor.net/linux/dcron.html

Other toolsets include Paul Jarc's runwhen (http://code.dogmap.org/runwhen/) which is designed for first-class services individually scheduling themselves.

[+] freddie_mercury|7 years ago|reply
Jenkins is "heavyweight" but Postgres isn't?
[+] tbrock|7 years ago|reply
Use an ECS scheduled task or the equivalent in Kubernetes: problem solved.

Docker containers run using a task scheduler are perfect for this.

[+] swalsh|7 years ago|reply
The last company I worked at used Jenkins for all cron jobs. It has great reporting, supports complex jobs, has a million plugins. It worked really great.
[+] nullbyte|7 years ago|reply
My thoughts exactly, this is a perfect use case for Jenkins.

Talk about reinventing the wheel!

[+] amoshg|7 years ago|reply
Interesting, isn't Jenkins mostly used for internal builds? I have a hard time imagining using it for things like marketing email cron jobs.
[+] mullingitover|7 years ago|reply
Indeed, I was scratching my head reading this, wondering why they were reinventing Jenkins.
[+] ww520|7 years ago|reply
The blog said they are a Python shop. Jenkins being Java based probably won't sit well with them.
[+] rooam-dev|7 years ago|reply
How do you test those complex jobs? How do you monitor jobs that get slower and slower? Can you make sure a job is not executed too soon? Can you have dynamic jobs (add/remove/enable/disable 1k per day)?

Different requirements, different solutions.

[+] alexeiz|7 years ago|reply
Doesn't Jenkins use a web interface to create, edit and manage jobs? Is it possible to put jobs in a VCS? Or manage jobs automatically (say with Ansible)?
[+] amyjess|7 years ago|reply
We've been moving all our cronjobs into Jenkins lately. It works wonderfully, and I really recommend it.
[+] foobarian|7 years ago|reply
> Here is an example of a typical oncall experience: 1) get paged with the command line of the failed job; 2) ssh into the scheduler machine; 3) copy & paste the command line to rerun the failed job

No doubt this works if it's an established procedure, but if I were approaching a system I wasn't familiar with, I would never do (3), because the environment can differ wildly between crond and a login shell. It is safer to edit the cron schedule and duplicate the entry with a time set a few minutes in the future. (And clean that up afterwards.)

[+] _kevinspencer|7 years ago|reply
Not only that, but just ssh'ing in and blindly rerunning the failed job isn't the answer anyway. Research why it failed. If the job has to write a file to a full disk, you can rerun that thing a hundred times and it will never work. I'm sure they must have missed something from their write-up, as I can't imagine that's their on-call playbook.
[+] TheSoftwareGuy|7 years ago|reply
You could even write a cron job to clean up the crontab
[+] solutionyogi|7 years ago|reply
I worked at a company which wrote their own scheduler and it was fraught with bugs. Dealing with time and date is HARD. Really, really hard. Your custom scheduler will break, and at the worst possible time.

If cron doesn't work for you, get an open source or commercial solution. And who cares what tech the scheduler is written in? The scheduler's job is to run your programs and provide an API, and a nice GUI if you desire.

[+] jokh|7 years ago|reply
Yeah exactly. I don't understand why they wanted the scheduler to be written in Python, since the scheduler should be decoupled from the jobs it runs anyway.
[+] lhr0909|7 years ago|reply
I think they had a task worker system built back in 2014[1], so they needed something custom that worked with it as well. Back then I think they really didn't have many options, but if they were to do it again now, I think either AWS Lambda or AWS Batch would serve this type of scheduled job very well.

[1]: https://engblog.nextdoor.com/nextdoor-taskworker-simple-effi...

[+] amyjess|7 years ago|reply
NMS engineer at an enterprise telecom here. At my company, we've been switching over to Jenkins for job scheduling. Most of what used to be cronjobs has been fully Dockerized, and now we have Jenkins run periodic "builds" via pipelines. The pipelines themselves just run a Docker image.

The single biggest advantage this has gotten us is centralized logging. I can check on the console output of any cronjob just by going to Jenkins and clicking on the job.

Moving from cron to Jenkins wasn't my idea, but the implementation is mine. I've built a few base Docker images. One is just the standard Python 3.6 Docker image. Another is the CentOS image equipped with Python 3.6 and Oracle RPMs for jobs that need database access. Another is the aforementioned image plus a number of Perl dependencies for jobs that need to call into our legacy Perl scripts.

For many scripts, I can use identical Dockerfiles. I just copy the directory containing the script, requirements.txt, Dockerfile, and Jenkinsfile, then I change out the script, edit the Jenkinsfile to reference the new script's name, and make any needed changes to the requirements.txt.

[+] ChuckMcM|7 years ago|reply
That was a difficult read for me. The blog post starts out with the four main problems with cron (their use wasn't scalable, editing the text file was hard, running their jobs was complex, and they didn't have any telemetry).

That's great, what does that have to do with cron?

As a result what I read was:

"We don't understand what cron does, nor do we understand how job scheduling is supposed to work, and we don't understand how to write 'service' based applications, so somebody said 'Just use cron' and we did some stuff and it didn't work how we liked, and we still haven't figured out really what is going on with schedulers so we wrote our own thing which works for us but we don't have any idea why something as broken as cron has persisted as the way to do something for longer than any of us has been alive."

I'm not sure that is the message they wanted to send. So let's look at their problems and their solution for a minute and figure out what is really going on here.

First problem was 'scalability', which is described in the blog post as "cron jobs pushed the machine to its limit". Their scalability solution was to write a program that put a layer between the starting of the jobs and the jobs themselves (it sends messages to SQS), and they used a new scheduler (APScheduler) to implement the core scheduler.

So what was the real win here? (Since they have recreated cron :-)) The win is that instead of forking and execing as cron does, allowing things like standard in and whatnot to be connected to the process, their version of cron sends a message to another system to actually start jobs. Guess what: if they wrote a bit of Python code that all it did was send a message to SQS and exit, that would run pretty simply. If they did it in C or C++, so they weren't loading an entire interpreter and its runtime every time, it would be lightning fast and add no load to the "cron server". This is basically being unaware of how cron works, so not knowing what would be the best way to use it.
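The "bit of Python code that all it did was send a message to SQS and exit" could look roughly like this. The payload schema and the boto3 usage are illustrative assumptions; the blog post doesn't document Nextdoor's actual message format:

```python
import json
import sys

def build_job_message(job_name, args):
    """Serialize a job request. This payload schema is hypothetical --
    just enough structure for a downstream worker to know what to run."""
    return json.dumps({"job": job_name, "args": list(args)})

def main():
    # boto3 is imported lazily so the pure helper above stays testable
    # without AWS credentials.
    import boto3
    queue_url, job_name, *args = sys.argv[1:]
    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=build_job_message(job_name, args))

if __name__ == "__main__" and len(sys.argv) >= 3:
    main()
```

A crontab entry would then just invoke this enqueuer with the job name, keeping the cron host's load at "serialize one JSON blob and make one HTTP call".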

Their second beef was that crontabs are hard to edit reliably. Their solution was to write a giant CGI script in a web server that would read in the data structure used by their scheduler for jobs, show it as a web page, let people make changes to it, and then re-write the updated data structure to the scheduler. Guess what, the crontab is just the data structure for cron in a text form so you can edit it by hand if necessary. Or you can use crontab -e which does syntax checking, or you could even write a giant CGI script that would read in the cron file, display it nicely on a web page, and then re-write a syntactically correct crontab when it was done.
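The syntax checking crontab -e does can be approximated in a few lines. A rough sketch of a crontab line validator (numeric fields only; named months/days like JAN or MON and @reboot-style shortcuts are deliberately out of scope):

```python
import re

FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]  # min hour dom mon dow

def validate_crontab_line(line):
    """Rough validity check for one crontab line (a sketch, not a full parser)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return True  # blank lines and comments are fine
    if re.fullmatch(r"\w+\s*=\s*.*", line):
        return True  # environment assignment such as MAILTO=...
    fields = line.split(None, 5)
    if len(fields) < 6:
        return False  # need five time fields plus a command
    for field, (lo, hi) in zip(fields[:5], FIELD_RANGES):
        for part in field.split(","):
            # accept *, */n, a, a-b, a-b/n
            m = re.fullmatch(r"(\*|\d+(?:-\d+)?)(?:/\d+)?", part)
            if not m:
                return False
            if m.group(1) != "*":
                nums = [int(n) for n in m.group(1).split("-")]
                if any(n < lo or n > hi for n in nums):
                    return False
    return True
```

A web front-end that calls something like this before re-writing the file gets most of the safety of crontab -e with none of the custom data structure.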

Problem three was that their jobs were complex and failed often. This forced their poor opsen to log in, cut and paste a complex command line, and restart the job. The real problem there is jobs are failing, which is going to require someone to figure out why they failed. If you don't give a crap about why they failed, the standard idiom is a program that forks the child to do the thing you want done and, if it catches a signal that the child has died, forks it again[1]. But really what is important here is that you have a configuration file under source code control that contains the default parameters for the jobs you are running, so that starting them is just typing in the job name if you need to restart, or maybe overriding a parameter like a disk that is too full if that is why it failed. Again, nothing to do with cron and everything to do with writing code that runs on servers.
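The fork-and-restart idiom from the footnote, sketched with subprocess (the function name, restart budget, and backoff policy are mine):

```python
import subprocess
import time

def run_with_restart(cmd, max_restarts=3, backoff=1.0):
    """Re-run cmd until it exits 0 or the restart budget is spent.

    Naive on purpose -- as the footnote says, the real problem is the
    failing, not the forking, so we cap the restarts rather than loop
    forever."""
    for attempt in range(max_restarts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt  # number of restarts it took to succeed
        time.sleep(backoff * (2 ** attempt))  # crude exponential backoff
    raise RuntimeError(f"{cmd!r} still failing after {max_restarts} restarts")
```

The raise at the end is where a real system would page a human with the job name from the version-controlled config, rather than a raw command line to paste.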

And finally there is no telemetry, no way to tell what is going on. Except that UNIX and Linux have like a zillion ways to get telemetry out, the original one is syslog, where jobs can send messages that get collected over the network even, of what they are up to, how they are feeling and what, if anything, is going wrong. There are even different levels like INFO, or FATAL which tell you which ones are important. Another tried and true technique is to dump a death rattle into /tmp for later collection by a mortician process (also scheduled by cron).
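The "death rattle into /tmp" technique can be this small; the file-naming scheme and record fields are invented for the sketch. (The syslog route is even shorter: Python's stdlib syslog module makes it a one-liner, e.g. `syslog.syslog(syslog.LOG_INFO, msg)`.)

```python
import json
import os
import time

def write_death_rattle(job_name, exit_code, detail, directory="/tmp"):
    """Dump a small JSON record for a later 'mortician' process
    (itself cron-scheduled) to collect and report on."""
    record = {"job": job_name, "exit": exit_code,
              "detail": detail, "ts": int(time.time())}
    path = os.path.join(directory,
                        "rattle-%s-%d.json" % (job_name, record["ts"]))
    with open(path, "w") as f:
        json.dump(record, f)
    return path
```

The mortician job just globs the directory, ships or aggregates the records, and deletes them, which is exactly the kind of thing cron is good at scheduling.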

At the end of the day, I can't see how cron had anything to do with their problems. Understanding the problem they were trying to solve in a general way would have let them see many solutions, both with cron and with other tools that solve similar problems. That would have saved them from re-inventing the wheel yet again, and from re-experiencing the bugs those other systems have already fixed over their millions (if not billions) of hours of collective run time.

[1] Yes this is lame and you get jobs that just loop forever restarting again and again which is why the real problem is the failing not the forking.

[+] pjungwir|7 years ago|reply
This rant seems out of character, but personally I appreciate it. People are always suggesting getting rid of cron, but I have always liked and trusted it. I prefer using tools that are old, battle-tested, and standard, but I do try to appreciate the advantages of new things. Cron seems to be a favorite target of NIH, for as long as I remember but especially in these days of "serverless", so it's easy to second-guess my appreciation for it. Thanks for clarifying where their problems really were. There seems to be a common temptation to think a new tool will solve your problems, when really many are in the irreducible specificity of your own code or systems.

EDIT: Oh by the way, Ruby folks struggling with cron might appreciate this: https://github.com/pjungwir/cron2english/

[+] _kevinspencer|7 years ago|reply
This. If I had too many jobs running on underpowered single point of failure hardware, I wouldn't immediately think cron is my problem and rewrite it. Rethink your architecture, write meaningful log messages, collect stats, figure out why so many of your services fail that often.
[+] mmt|7 years ago|reply
> ways to get telemetry out, the original one is syslog, where jobs can send messages that get collected over the network

The re-invention of this one is one of my pet peeves, especially since the vast majority of the (legitimate) complaints about early implementations have been addressed in modern (last.. 10ish years?) implementations.

Syslog is lightweight, flexible, and plain text.

Back when the ELK stack first started gaining popularity, I'd get asked in interviews if I had experience with provisioning "high volume" logging. I was reasonably convinced that none of those people had ever seen (or would ever see) a high enough volume to be remarkable.

In hindsight, it was probably the same as thinking one has "big data" if it doesn't fit in RAM on one's laptop (or even, more charitably, on a server, even a decently large one).

[+] busterarm|7 years ago|reply
It's a symptom of people developing and not understanding the environment they develop in/for.

I've used attitudes towards cron, specifically, as a personal metric for judging someone's skill and output with systems. What this team did is exactly the wrong thing to do.

I would rather use the tool with over 40 years of development work gone into it than whatever this is.

[+] parliament32|7 years ago|reply
Exactly, they solved a bunch of problems that weren't really related to cron and overall, IMO, made the situation worse.

High resource usage so they moved the actual jobs off-server. Not cron related. Crontab is hard to edit so we'll create an arbitrary data structure that'll be both hard to edit and non-standard. Good job. Our jobs are failing so instead of better error handling we'll just run them again manually. Great. And apparently they've never heard of cron logging to syslog.

[+] thdxr|7 years ago|reply
We schedule jobs on our Elixir cluster. Nice to not need anything on top of what you're already running
[+] oneeyedpigeon|7 years ago|reply
> Second, editing the plain text crontab is error prone

Doesn't every crontab in existence have a comment line giving the order of the time columns? I know I always rely on it.
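For reference, Debian-derived systems put a header along these lines at the top of a fresh user crontab (exact spacing varies by distribution):

```
# m h  dom mon dow   command
```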

Although the article didn't touch on it, this point reminded me of yesterday's discussion about manpages and command-line options. I think it's still the case today that `crontab -e` is the way to edit the crontab whilst `crontab -r` (the key right nextdoor in case this part needed stressing!) removes it altogether.

[+] Pete_D|7 years ago|reply
I got tired of writing the time specs manually, so now I keep this script in my PATH:

    #!/bin/sh
    # print a crontab entry for 2 minutes into the future
    # (GNU date syntax; on BSD/macOS use: date -v+2M +"%M %H %d %m * /path/to/command")

    date -d "+2 minutes" +"%M %H %d %m * /path/to/command"
[+] sergiotapia|7 years ago|reply
Using Elixir we just scheduled work using a simple genserver, the initial naive version has no tracking of jobs done/failed/etc, but you can append those easily since they are just language constructs.
[+] chrisferry|7 years ago|reply
Kubernetes CronJobs resource would be my go to for this.
[+] meddlepal|7 years ago|reply
Good answer in 2018, but in 2016 the development of ScheduledJob (now CronJob) had only just begun around November/December of that year, so it was not an option for these guys.
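For readers landing here later, a minimal CronJob manifest (stable as batch/v1 in current Kubernetes) looks roughly like this; the names and image are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report        # hypothetical job name
spec:
  schedule: "0 4 * * *"       # standard five-field cron syntax
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: example.com/report:latest   # placeholder image
            args: ["--full"]
          restartPolicy: OnFailure
```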
[+] whalesalad|7 years ago|reply
The CPU comparisons are kinda funny to see. At first glance the low CPU usage looks good, but to me that's wasted resources. Good to see a more efficient system though. Hopefully those instances get allocated to different problem sets.
[+] haney|7 years ago|reply
I use Apache Airflow with BashOperator for tons of stuff like this, simple web UI for logs/retries, supports dependencies between jobs and when tasks get more complex it’s Python and it supports extensions.
[+] andscoop|7 years ago|reply
Apache Airflow is a great way to future proof your cron jobs. Existing cron jobs can be easily migrated and with that you'll get access to built in logging, distributed execution, connection management, web ui for simple monitoring and task retries and more.
[+] sideproject|7 years ago|reply
A little tangential, but I recently created a small tool called "tsk" (pronounced same as task)

https://www.tsk.io

I'm calling it a "speed-dial for your APIs". I used to have a VPS that would run out of memory, so the easiest way to resolve this was to restart the server. But even that was annoying, so I created a button that would call the API to restart the box. Now I just have to click the button whenever I want to reboot.

Something similar to CRON, I'm currently building a feature in tsk to schedule the tasks.

[+] andromedavision|7 years ago|reply
My scraping VPS keeps running out of memory too and I restart it every couple of days.

> so I created a button that would call the API to restart the box

Been thinking about building a 'button' that does this as well. Will check out TSK to see how well it addresses this. Sound idea for sure.

[+] awiesenhofer|7 years ago|reply
sorry to say, but this whole premise seems just ... wrong.

why reinvent the wheel and not use cron or systemd? why reboot instead of searching for the offending script/process and fixing it?

[+] zrail|7 years ago|reply
We switched from Heroku Scheduler (very limited cron) to a system called Sidekiq Cron[1] (if we used Sidekiq Enterprise we would use the built-in scheduler). All Sidekiq Cron does is drop a job into the queue on a given interval. We also use HireFire to auto-scale our workers as necessary to keep things running.

[1]: https://github.com/ondrejbartas/sidekiq-cron

[2]: https://hirefire.io

[+] lykr0n|7 years ago|reply
Where I work we have a similar product: we run all scheduled tasks on our Mesos cluster. Same idea as a plain Unix cron. You have a task that is executed every N minutes/hours/days, it runs on a box, and does its thing.

It doesn't replace every cron job, but it is distributed, fault tolerant, and only breaks when the Hadoop cluster backs up. The product mentioned here seems like a good solution for a team that needs to execute Linux crons without much overhead.

[+] saganus|7 years ago|reply
A bit of a side-topic but, has anyone tried APS (Advanced Python Scheduler - https://apscheduler.readthedocs.io/en/latest/) in production?

I've been evaluating it as it seems to provide fault tolerance, but IMO the documentation could be much better, with more examples (e.g. mixing different triggers, configs, etc.)

Can anyone comment on it?