Airflow and the Future of Data Engineering: A Q&A

[+] nl|9 years ago|reply

Airflow. This framework is used by numerous companies and several of the biggest unicorns — Spotify, Lyft, Airbnb, Stripe, and others to power data engineering at massive scale.

Is that correct? I've been using (and enjoying) Luigi[1] which came out of Spotify. I haven't seen anything about them switching to Airflow.

Edit: Now I see in the interview there is this:

About Luigi, it is simpler in scope than Airflow, and perhaps we’re more complementary than competition. From what I gather, the main maintainer of the product has left Spotify and apparently they are now using Airflow internally for [at least] some of their use cases. I do not have the full story here and would like to hear more about it. I’m thinking that many of the companies choosing Luigi today might also choose Airflow later as they develop the need for the extra set of features that Airflow offers.

But there are 2 day old commits in the Luigi directory, so I don't know. I like Airflow too, but it did seems a lot more complicated the Luigi when I played with it.

[1] https://github.com/spotify/luigi

[+] erikbern|9 years ago|reply

I'm one of the original Luigi authors, and used to maintain it for a while, merging PR's daily etc. I left Spotify a couple of years ago. Since then it's been maintained mostly by Arash Rouhani, who is increasingly getting busy with other things.

But it's very much in active development and there are multiple pull requests merged every day.

I haven't had a lot of time to check out Airflow, but it seems great. Data engineering and thinking of data processing as functional pipelines is a great paradigm, think we're going to see a lot of future development in this area. Luigi will probably evolve a lot over the next few years. Eventually I think there will be better frameworks. No idea if Airflow is a step change, I think there are still projects yet to be built that unifies everything beautifully

[+] ktamura|9 years ago|reply

If simplicity and non-Python-centricity matter, I encourage folks to look into Digdag [1][2].

It's Ansible for Workflow Management.

While both Luigi and Airflow (somewhat rightfully) assume the user to know/have affinity for Python, Digdag focuses on ease of use and helping enterprises move data around many systems.

If we learned one thing from today's S3 outage, it's not enough to use multiple cloud infrastructure providers: you should probably have your data in multiple cloud providers as well.

[1] https://www.digdag.io

[2] https://github.com/treasure-data/digdag

[+] concam|9 years ago|reply

They've been listed as contributors on the Apache Airflow repo[0]

[0] https://github.com/apache/incubator-airflow/blame/master/REA...

[+] smooc|9 years ago|reply

As far as we know Airflow is used in one of the teams (I think it maybe advertising). It does not mean the whole company has switched to Airflow, but some team decided it fitted their job better.

[+] caravel|9 years ago|reply

I also hear that Uber forked Airflow early on [twice]. I'm curious to hear where they're at at this time.

[+] jaz46|9 years ago|reply

One of the important paradigms that I think Luigi and Airflow miss is that they treat pipelines as a DAG of tasks, when it really should be thought of as a DAG of Data.

It's a subtle difference, but has huge impacts when you're trying to dynamically scale tasks based on cluster resources and track data lineage throughout your system.(Disclosure: I'm the founder of Pachyderm[0], a containerized data pipeline framework where we version control data in this way).

Check out Samuel Lampa's post[1] about dynamically scaling data pipelines for more details.

[0] github.com/pachyderm/pachyderm [1] http://bionics.it/posts/dynamic-workflow-scheduling

[+] caravel|9 years ago|reply

[Airflow author] The task is centric to the workflow engine. It's an important entity, and it's complementary to the data lineage graph (not necessarily a DAG btw).

At Airbnb we have another important tool (not open source at the moment) that is a UI and search engine to understand all of the "data objects" and how they relate. It includes datasets, tables, charts, dashboards and tasks. The edges are usage, attribution, sources, ... This tool shows [amongst other things] data lineage and is complementary to Airflow.

[+] detroitcoder|9 years ago|reply

Apache Nifi. I always make the comparison between the two when introducing Airflow as it often times takes awhile to under stand the nuanced (but critical) difference

[+] tedmiston|9 years ago|reply

Hey HN, I co-wrote the article with Maxime. A little late to the party here but happy to answer any questions you might have about it. I'll send him the link as well.

By the way, thank you to Maxime for sharing his thoughts, and the Astronomer team for contributing great questions.

[+] caravel|9 years ago|reply

Maxime reporting for duty here, I'll go through the thread and answer questions / add comments.

[+] kozikow|9 years ago|reply

Airflow works well for "static" jobs, but I miss something like airflow for dynamic jobs.

By dynamic, I mean something like "user sent us some new data to process, create a custom graph just for this data". I can create new airflow graph per each processing pipeline with new dag id every time, but airflow was not created for use case like this and it's not working well in such scenario.

[+] gumoro|9 years ago|reply

I work on such a system, to organize data processing (HPC) projects in the oil & gas industry, and I try to follow this space. I remember I got excited when I heard of Airflow for the first time, but quickly got frustrated with its "static flow" nature: many "flow" systems are like this, you first design the flow, then "deploy" it and let it run (usually many times).

What our tool does is allow users to organize the flow of their processing jobs on an infinite 2D layout, have some jobs run at the beginning of the flow while they organize another part to run later.

Unfortunately it's a big pile of messy code that depends too much on other "internal" systems so we can't open source it... I'd like to add "yet" because I try to gradually clean it up, simplify and make it more generic, but I'm not sure I'll see that day myself.

In the meantime... maybe Node-RED ? https://nodered.org/

[+] gdulli|9 years ago|reply

I've started working on a code generator for Airflow. Not primarily because I needed dynamic jobs, but more because I didn't want to keep writing the Airflow boilerplate.

I imagine I'll eventually need to add some sort of management system to move these dynamic jobs in and out of Airflow to keep them from bloating the database or cluttering the UI.

[+] nl|9 years ago|reply

This is one part of Luigi that I haven't tried yet, but https://luigi.readthedocs.io/en/stable/tasks.html#dynamic-de... might help.

[+] thibaut_barrere|9 years ago|reply

If you are also using Ruby, you can use Kiba ETL (disclaimer: I'm Kiba ETL author) in conjunction with Sidekiq (quick example: https://gist.github.com/thbar/ec90bbd5877d1aae40510fadd43206...).

This allows to build ETL jobs which react to all sorts of external application triggers (such as an upload made by a user in that case, but it could be an API notification / webhook etc).

[+] tedmiston|9 years ago|reply

While it's not the exact use case you've described (and it's hard to pinpoint without a concrete example), you could handle related tasks by writing a more generic static job that pulls a dynamic config, eg from a mongo db, and uses the branch operator.

[+] classybull|9 years ago|reply

How about a custom sensor class that never timesout and you define its poke method to listen for data and processes it? Granted, you'll have to make sure you allocate enough workers for that..

[+] chishaku|9 years ago|reply

Airflow is just the workflow management layer on top of your data pipeline. The flexibility to generate custom graphs based on user-specific parameters should be handled within a pipeline task.

Based on your example, I would have a single dag that would 1. get user data and 2. generate a graph.

All the flexibility should be defined in whatever function, script or program you define to generate the graph.

[+] unknown|9 years ago|reply

[deleted]

[+] unknown|9 years ago|reply

[deleted]

[+] classybull|9 years ago|reply

We're currently in a PoC phase of implementing Airflow and testing it out versus Luigi. So far, what I've liked, is that Airflow seems to be much more extensible and modular than Luigi. Getting Luigi to play nicely with our particular set of constraints was painful, and subclassing the Task was also tricky because it was way more opinionated about the structure of the class. Airflow seems way less so. There also seems to be way more right out of the gate in terms of built in task types. And the UI looks nicer.

We haven't made our final determination yet, but Airflow at the current moment feels better.

[+] gregn610|9 years ago|reply

How relevant are Airflow and similar to those of us who aren't operating at unicorn scale but are shuffling hundreds of CSVs & Excels and wrangling RDBMS with SQL?

[+] botswana99|9 years ago|reply

[Bias Alert: I'm Head Chef of DataKitchen]. Our perspective is that the DAG abstraction should not apply only to data engineering, but the whole analytic process of data engineering, data science, and data visualization. Analytic teams love to work with their favorite tools -- Python, SQL, ETL, Jupyter, R, Tableau, Alteryx, etc. The question is how do you get those diverse teams and tools to work together to deliver fast, with high quality, and reusable components?

We've identified seven steps taken from DevOps, CI, Agile and Lean Manufacturing (https://www.datakitchen.io/platform.html#sevensteps) that you can start to apply today. We also created a 'DataOps' platform that incorporates those principles into a software: https://www.datakitchen.io.

The challenge is that there are many separate DAGs (and code and configuration) involved in producing complete production analytics embedded in each of the tools the team has selected. So what is needed is a “DAG of DAGs” that encompasses the whole analytic tool chain.

[+] jcalabro|9 years ago|reply

[Disclaimer: I work for Composable] My team and I are working on a project that I would consider a competitor to Airflow. I'm not overly familiar with Airflow, but Composable seems to be fit for a much wider variety of use cases.

In Composable's DAG execution engine, you can pull in data from various sources (SQL, NoSQL, csv, json, restful endpoints, etc.) into our common data format. You can then easily transform, orchestrate, or analyze your data using our built-in Modules (blocks) or you can easily write your own. You can then view your resulting data all within the webapp.

Reading the comments, it seems like Composable supports a lot of the things people are asking for here that Airflow is lacking. Maybe check us out and let us know what you think!

For more information: Composable Site - https://composableanalytics.com/ Try it yourself - https://cloud.composableanalytics.com/ Composable's Blog - http://blog.composable.ai/

[+] caravel|9 years ago|reply

[author of Airflow here] as I wrote in another comment, I'd argue for a programmatic approach to workflows/dataflows as opposed to drag and drop. It turns out that code is a better abstraction for software: https://medium.freecodecamp.com/the-rise-of-the-data-enginee...

I'd also argue for open source over proprietary, mostly to allow for a framework that is "hackable" and extensible by nature. You can also count on the community to build a lot of the operators & hooks you'll need (Airflow terms).

[+] batbomb|9 years ago|reply

It sounds like you are more of a Beam/dataflow competitor than an airflow/workflow competitor.

[+] batmansmk|9 years ago|reply

A lot of name dropping and unprovable statements. I'm really interested by the domain and progress, but Airflow needs to be more generous in real information and less in marketing bs. Can someone share a more introductory article about what makes Airflow different from the current state of the art?

[+] smooc|9 years ago|reply

Here is, a slightly outdated, article that compares several ETL workflow tools http://bytepawn.com/luigi-airflow-pinball.html . Why we choose Airflow was because of the following reasons:

* Scheduler that knows how to handle retries, skipped tasks, failing tasks

* Great UI

* Horizontal scaleable

* Great community

* Extensible; we could make it work in an enterprise context (kerberos, ldap etc)

* No XML

* Testable and debug-able workflows

[+] caravel|9 years ago|reply

[author] sorry you feel that way. I understand that the section on other similar tools is controversial, especially if you work on one of those. I'm repeating what I hear being very active in this community, and answering the question that was asked to the best of my knowledge. I'm open to editing the article with better information if anyone wants to share more insight.

As mentioned about Luigi, I do not have the whole story, one fact I know is that someone from Spotify gave a talk I did not attend in NYC at an Airflow meetup, and I've heard the original author had left the company. Those are provable statements, happy to debunk on the article if needed.

What do you mean by [current state of the Art]?

[+] glial|9 years ago|reply

+1 for Airflow. I use it every day to handle tasks with many components and dependencies. I love that everything is code & version-controlled.

I do wish it had a REST API though.

[+] papercruncher|9 years ago|reply

I've been keeping close tabs on the project for a while now and it seems that version 1.8, which should be released in a few days, has the beginning of a rudimentary API. It also looks like more endpoints are being planned for subsequent releases

[+] fuzzylearner|9 years ago|reply

You can always expose the REST API. Its pretty easy considering they are just flask blueprints. Since, you can make your own custom plugin - You can build a lot using existing infrastructure that Airflow provides.

[+] zodiac|9 years ago|reply

We're working on moving more functionality to a REST API - there's already some code for this in master :)

[+] dataops|9 years ago|reply

Data engineering is converging under the umbrella of DataOps. For those interested, there's a DataOps Summit in Boston this June https://www.dataopssummit.com/

[+] batbomb|9 years ago|reply

This is just Data Management, a term which predates "DataOps" by more than a decade in both research and enterprise. I don't really think it needs a rebranding.

[+] cardosof|9 years ago|reply

I'm used to running R scripts with cron to handle some batch jobs and I'm no python dev. Would it be easy to start using Airflow?

[+] smooc|9 years ago|reply

DAGS in Airflow can just be a few lines. Some understanding of the syntax of python is required. But you can start simple and add complexity as you require it.

[+] mtrn|9 years ago|reply

Luigi is nice, because it is really simple to get started and it gradually allows you to do more complex things like (custom) parameter types, custom targets, enhanced super classes and dynamic dependencies, event hooks, task history and more.

One thing that I missed a bit was automatic task output naming based on the parameters of a task, so I wrote a thin wrapper for that [1]. This helps, but mostly for smaller deployments.

Airflow and luigi seemed to me like two side of the same thing: fixed graphs vs data flow. One fixates the DAG, the other puts more emphasis on composition.

That said, I am excited about the data processing tools to come - I believe this is an exciting space and choosing or writing the right tool can make a real difference between a messy data landscape and an agile part of business and business development.

[1] https://github.com/miku/gluish

[+] throwaway_374|9 years ago|reply

Minor gripe - why can I not execute an entire DAG (end to end) from the UI? Also trying to execute single tasks from the UI using the "run" functionality gives a CeleryExecutor requirement error... sorry, I know this isn't the help forums but it sounds like the most trivial tasks were overlooked.

[+] LevonK|9 years ago|reply

Disney Studios uses Azkaban because it's language agnostic, and it believes that in the data space there is a huge advantage to static (and strong, but that's beside the point in this argument) typing in the data space.

The language agnostic aspect means that non software engineers can also use the orchestration platform for runbook automation.

[+] sandGorgon|9 years ago|reply

MHO There seems to be quite some conceptual overlap between Airflow's DAG and Spark RDD.

It seems to me that Airflow is Spark-on-a-db ... or rather Spark is Airflow-on-Hadoop.

Does anyone know what the difference is ?

[+] gallamine|9 years ago|reply

Airflow doesn't have anything to do with data storage, movement or processing. It's a way to chain commands together in such a way so that you can define "do Z after Y a Z finish", for example. Many people use it like a nice version of cron with a UI, alerting, and retries.

[+] fuzzylearner|9 years ago|reply

Amazing framework with a lot of functionalities. A tool built to be extensible just what an open source software should look like :)

[+] somewhatoff|9 years ago|reply

How would you see Airflow in relation to Apache Beam / GC Dataflow?

[+] caravel|9 years ago|reply

[author] Airflow is not a data flow engine, though you can use it to do some of that, but we typically defer on doing data transformations using/coordinating external engines (Spark, Hive, Cascading, Sqoop, PIG, ...).

We operate at a higher level: orchestration. If we were to start using Apache Beam at Airbnb (and we very well may soon!), we' use Airflow to schedule and trigger batch beam jobs alongside the rest of our other jobs.

[+] vicaya|9 years ago|reply

Who has used both Airflow and Spinnaker? Feedbacks?

[+] unknown|9 years ago|reply

[deleted]

90 comments