top | item 23844749


nmyk | 5 years ago

I'm gonna go out on a limb and say any purported workflow tool that comes with a data model you have to memorize (i.e. "Before we start building a workflow, let’s learn a little about the components of an SWF") is too complex to be effective.

My problem with tools like these is that I already know the components of an "SWF" or whatever—these are the tasks I have that need to be run/managed. When a tool starts telling me what the architecture needs to look like, then it stops being a helpful tool and starts being a little know-it-all.

My favorite workflow tool is actually two pieces of software: cron and postgres. Cron schedules tasks and postgres handles shared state. It's easy enough to whip up an ACID-compliant task queue in SQL that has whatever bells and whistles you want, and all cron wants is a command to run and a schedule. No need to read a bunch of documentation about what a "task" is supposed to be vs. an "activity" vs. an "execution" or anything like that.
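The ACID task queue idea can be sketched in a few lines. This is just an illustration of the pattern, using sqlite3 so it's self-contained; with Postgres you'd typically claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` instead, and the table layout here is made up:

```python
import sqlite3

# Minimal task-queue table: cron (or anything else) inserts rows,
# workers atomically claim the oldest pending one.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; we manage txns explicitly
conn.execute("""
    CREATE TABLE tasks (
        id      INTEGER PRIMARY KEY,
        command TEXT NOT NULL,
        status  TEXT NOT NULL DEFAULT 'pending'  -- pending | running | done | failed
    )
""")

def claim_next_task(conn):
    """Atomically claim the oldest pending task, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    row = conn.execute(
        "SELECT id, command FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        conn.execute("COMMIT")
        return None
    conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return row

conn.execute("INSERT INTO tasks (command) VALUES ('nightly-report')")
task = claim_next_task(conn)  # -> (1, 'nightly-report'); a second claim finds nothing
```

The transaction is what buys you the "bells and whistles": two workers can poll concurrently and each task is handed out exactly once.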

Of course, what my setup does not do is provide common functionality out of the box like "just gimme a way to kick off a series of FS-dependent tasks every day and record errors/halt if anything fails." I don't mind. It's not like Apache Airflow (just to give another example) has saved me from having to think about and express my system's dependencies and failure modes—it has only put a lot of unnecessary and unhelpful constraints on how I am able to express them.
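For what it's worth, the "run a series of dependent tasks, record errors, halt if anything fails" glue is small enough to write yourself and hang off a cron entry. A rough sketch (the task commands are placeholders):

```python
import datetime
import subprocess
import sys

# Hypothetical pipeline stages; in practice these would be your real commands.
TASKS = [
    ["echo", "extract"],
    ["echo", "transform"],
    ["echo", "load"],
]

def run_pipeline(tasks):
    """Run tasks in order; log and halt at the first failure."""
    for cmd in tasks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Record the error and stop; later tasks never run.
            stamp = datetime.datetime.now().isoformat()
            print(f"{stamp} FAILED {cmd}: {result.stderr}", file=sys.stderr)
            return False
    return True

ok = run_pipeline(TASKS)  # True only if every stage exits 0
```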



teraku | 5 years ago

Can you clarify what you mean by Airflow introducing unnecessary and unhelpful constraints? I'm very interested.

I'm currently working on a standard format for defining such workflows [0] and my own scheduling engine, which aims to be as non-imposing as possible. It's supposed to be a "cron" for task scheduling with a dependency graph. The only added thing is that you can specify what kind of environment you want to run your tasks on.
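To make the "cron plus a dependency graph" idea concrete: this is not the OpenWorkflow format itself, just a sketch of the underlying concept using Python's standard-library `graphlib`, with made-up task names:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
graph = {
    "report": {"transform"},   # report depends on transform
    "transform": {"extract"},  # transform depends on extract
    "extract": set(),          # no dependencies
}

# A scheduler would run tasks in some topological order,
# so every task runs only after its dependencies finish.
order = list(TopologicalSorter(graph).static_order())
# -> ['extract', 'transform', 'report']
```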

So I would appreciate it if you could tell me what annoyed you in particular. I use Airflow at work and can list a million things myself, but I don't know exactly what you meant by that sentence.

[0] https://github.com/OpenWorkflow/OpenWorkflow

nmyk | 5 years ago

Sure! To start, just fundamentally—why assume workflows are DAG-shaped? Why no cycles? Lots of real-world processes contain unscheduled repetition that arises at "runtime."

Or what if I can only find out what the rest of the workflow looks like once I'm halfway through it? Why must workflow definitions be static? No "decision" elements as in a flowchart?

Someone might read these complaints and think I'm asking for a programming environment rather than a workflow tool, and that's kinda my point :P
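As an illustration of that point: a "workflow" with runtime repetition (a cycle) and a decision step is trivial to express as ordinary code, but doesn't fit a static DAG. Everything here is hypothetical:

```python
import random

def fetch_batch():
    """Stand-in for a task whose outcome is only known at runtime."""
    return random.choice(["more", "more", "done"])

def run_workflow():
    steps = []
    while True:                  # unscheduled repetition, decided at runtime
        status = fetch_batch()
        steps.append(status)
        if status == "done":     # a "decision" element, as in a flowchart
            break
    return steps

history = run_workflow()  # some number of "more" steps, ending with "done"
```

A static workflow definition has to know its shape up front; plain code gets loops and branches for free.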

The "unnecessary" side is typically project-specific, but I tend not to need a separate notion of `backfill`, or any of the `Executor` functionality for distributed execution. I suppose if I needed to run stuff on multiple nodes I would just schedule jobs on Kubernetes directly.

mfateev | 5 years ago

This mentality leads to systems that are a mess of callbacks, don't scale, and are practically impossible to maintain.

You could make the same argument about SQL databases. Why do you need to understand their architecture, learn arcane SQL syntax, and learn to operate them? Instead, your program could write and read files directly from disk. But you still choose to use the DB as it has hundreds of man-years invested in it and gives you a higher level of abstraction.

Think of SWF and its newest incarnation, temporal.io, as a higher-level way to write task orchestrations. It requires some learning but immediately lets you leverage dozens of man-years invested in the technology.

nmyk | 5 years ago

> This mentality leads to systems that are a mess of callbacks, don't scale, and practically impossible to maintain.

Why would it? I can put the same types of abstractions into my application layer in the form of a common library. The only difference is that they can be a lot fewer and simpler, because they only need to meet my exact requirements.

I often do make the same argument about SQL databases in cases where RDBMSs are not an appropriate tool for the job. In the case I mentioned, where I'm using it as a shared datastore that supports ACID transactions with concurrent access, I find Postgres (and many others, including many NoSQL stores) to be suitably placed in the abstraction spectrum to be worth using rather than rolling my own solution.