Introduction to Apache Airflow

[+] dmayle|5 years ago|reply

I use Airflow, and am a big fan. I don't think it's particularly clear, however, as to when to use airflow.

The single best reason to use airflow is that you have some data source with a time-based axis that you want to transfer or process. For example, you might want to ingest daily web logs into a database. Or maybe you want weekly statistics generated on your database, etc.

The next best reason to use airflow is that you have a recurring job that you want not only to happen, but to track it's successes and failures. For example, maybe you want to garbage-collect some files on a remote server with spotty connectivity, and you want to be emailed if it fails for more than two days in a row.

Beyond those two, Airflow might be very useful, but you'll be shoehorning your use case into Airflow's capabilities.

Airflow is basically a distributed cron daemon with support for reruns and SLAs. If you're using Python for your tasks, it also includes a large collection of data abstraction layers such that Airflow can manage the named connections to the different sources, and you only have to code the transfer or transform rules.

[+] javajosh|5 years ago|reply

Yes, this seems to be yet another tool that falls prey to what I think of as "The Bisquick Problem". Bisquick is a product that is basically pre-mixed flour, salt, baking powder that you can use to make pancakes, biscuits, and waffles. But why would you buy this instead of its constituent parts? Does Bisquick really save that much time? Is it worth the loss of component flexibility?

Worst of all, if you accept Bisquick, then you open the door to an explosion of Bisquick options. Its a combinatorial explosion of pre-mixed ingredients. In a dystopian future, perhaps people stop buying flour or salt, and the ONLY way you can make food is to buy the right kind of Bisquick. Might make a kind of mash up of a baking show and Black Mirror.

Anyway, yeah, Airflow (and so many other tools) feel like Bisquick. It has all the strengths, but also all the weaknesses, of that model.

[+] simo7|5 years ago|reply

I would add these gotchas/recommendations:

- Airflow the ETL framework is quite bad. Just use Airflow the scheduler/orchestrator: delegate the actual data transformation to external services (serverless, kubernetes etc.).

- Don't use it for tasks that don't require idempotency (eg. a job that uses a bookmark).

- Don't use it for latency-sensitive jobs (this one should be obvious).

- Don't use sensors or cross-DAG dependencies.

So yeah unfortunately it's not a good fit for all the use cases, but it has the right set of features for some of the most common batch workloads.

Also python as the DAG configuration language was a very successful idea, maybe the most important contributor to Airflow success.

[+] gravitas|5 years ago|reply

> The next best reason to use airflow is that you have a recurring job that you want not only to happen, but to track it's successes and failures.

For this specific use case, I use healthchecks.io - trivial to deploy in almost any context which can ping a public URL. Generous free tier limits so I've got that going for me which is nice.

[+] chrischen|5 years ago|reply

I thought the point of airflow is for orchestration in an event driven microservices architecture? That’s what Uber uses Cadence for at least.

[+] aequitas|5 years ago|reply

I recently started investigating Airflow for our use-case and it seems exactly what you describe and not more. But in it niche it excels featurewise, at least regarding the features I need to expose to the users.

[+] sdepablos|5 years ago|reply

There're so many alternatives to Airflow nowadays that you really need to make sure that Airflow is the best solution (or even a solution) to your use case. There's plenty of use cases better resolved with tools like Prefect or Dagster, but I suppose the inertia to install the tool everyone knows about is really big.

BTW, here's https://github.com/pditommaso/awesome-pipeline a list of almost 200 pipeline toolkits.

[+] anordin95|5 years ago|reply

I've been using Airflow for nearly a year in production and I'm surprised by all the positive commentary in this thread about the tool. To be fair, I've been using the GCP offered version of Airflow - Composer. I've found various components to be flaky and frustrating. For instance, large scale backfills don't seem to be well supported. I find various components of the system break when trying to large scale backfills, for instance 1M DAG runs. As another note, the scheduler also seems rather fragile and prone to crashing. My team has generally found the system good, but not rock-solid nor something to be confident in.

[+] rywalker|5 years ago|reply

Yes, we need to improve the backfills in Airflow.

We're working on making the scheduler HA and more performant, reach out to me if you'd like to collaborate on your use case (ry at astronomer dot io)

[+] chrisjc|5 years ago|reply

This mirrors my experience with Airflow by another SaaS. Ended up getting rid of it when our Data Warehouse introduced tasks and stored procedures, since most of our work was ELT.

[+] Peteris|5 years ago|reply

Airflow is an incredibly powerful framework to use in production, but a little unweildy for anything else.

You can use something like Kedro (https://github.com/quantumblacklabs/kedro) to get started building pipelines with pure Python functions. Kedro has its own pipeline visualiser and also has an Airflow plugin that can automatically help you generate airflow pipelines from Kedro pipelines.

https://github.com/quantumblacklabs/kedro-airflow

[+] zukzuk|5 years ago|reply

Airflow is great, right up to the point where you try to feed date/time-based arguments to your operators (a crucial bit of functionality not covered in the linked article). The built-in API for that is a random assortment of odd macros and poorly designed python snippets, with scoping that never quite makes sense, and patchy and sometimes misleading documentation.

[+] diogofranco|5 years ago|reply

Agree that this is a bit confusing. I ended up writing a small guide on how date arguments work in airflow (https://diogoalexandrefranco.github.io/about-airflow-date-ma...) and I always end up consulting it myself, as I just can't seem to memorize any of these macros.

[+] Grimm1|5 years ago|reply

Airflow is great, honestly the biggest gotchas are passing time to operators which someone has mentioned in the thread already and setting up the initial infra is a bit annoying too. Other than that though as a batch-etl scheduler and all around job scheduler it's pretty great, it's really a very user friendly and it's graphical interface simplifies a lot of the management process. I see a lot of people prefer the non graphical libs here like Luigi or Prefect and to each there own but I really do prefer having that interface in addition to the pipelines as code line of thinking.

I also see a lot of people saying it's a solution for big companies and the like, I heavily disagree it's useful for any size company that wants to have better organization of their pipelines and provide an easy way for non technical users to check on their health.

[+] hn2017|5 years ago|reply

FYI - Prefect has a GUI too.

I also agree Airflow is good for smaller companies too if they're familiar with Python.

[+] naveedn|5 years ago|reply

I don’t think this blogpost provides any value over the official documentation. You can run through the airflow tutorial in about 30 minutes and understand all the main principles pretty quickly.

[+] bthomas|5 years ago|reply

I recently went through the airflow docs for the first time - agree. But this comment thread has much more helpful than any of the docs!

[+] unknown|5 years ago|reply

[deleted]

[+] zomglings|5 years ago|reply

I'm actually considering using Airflow. Have never used it before, and I have the impression that setting up the required infrastructure could be problematic.

Since a lot of you use Airflow, I am curious about your experience with it:

1. Are you hosting Airflow yourselves or using a managed service?

1. a. If managed, which one? (Google Cloud Composer, Astronomer.io, something else?)

1. b. If self-hosted, how difficult was the setup? It seems daunting to get a stable setup (external database, rabbit or redis, etc.).

2. Do you use one operator (DockerOperator looks like the right choice) or do you allow yourself freedom in operators? Do you build your own?

3. How do you pass data from one task to the next? Do the tasks themselves have to be aware of external storage conventions or do you use the built-in xcom mechanism? It seems like xcom stores messages in the database, so you run the risk of blowing through storage capacity this way?

[+] pyrophane|5 years ago|reply

1. Managed, Cloud Composer. Cloud Composer is getting there. It feels much less buggy then just 8 months ago when I started using it, and it is improving rather quickly.

One downside with Composer, though, is that it must be run in its own GKE cluster, and it deploys the Airflow UI to App Engine. These two things can make it a bit of a pain to use alongside infrastructure deployed into another GKE cluster if you need the two to interact.

I would probably still recommend Composer over deploying your own Airflow into GKE, as having it managed is nice.

2. Freedom. For some tasks we run containers in GKE, for other we use things like the PythonOperator or PostgresOperator.

A note here: Using containers with Airflow is not trivial. In addition to needing some CI process to manage image building/deployment, having the ability to develop and test DAGs locally takes some extra work. I would only recommend it if you are already invested in containers and are willing to devote the time to ops to get it all working.

3. X-com is useful for small amounts of data, like if one task needs to pass a file path, IDs, or other parameters to a downstream task. For everything else have a task write its output to something like S3 or a database that another task will read from.

All in all, I would say use Airflow if you need the visibility and dependency management. Don't use it if you could get away with something like cron and some scripts or a simple pool of celery workers.

Also, don't use it if your workflows are highly dynamic. For example, if you have a situation where you need to run a task to get a list of things, then span x downstream tasks based on the contents of the list. Airflow wants the shape of the DAG to be defined before it is run.

Hope that helps.

[+] mrshu|5 years ago|reply

1. Managed, on AWS ECS -- it was not that difficult to set up using AWS services (Aurora for the DB, ElastiCache for Redis)

2. Freedom. We generally use PythonOperators but it is not uncommon to run containers as well. I agree with pyrophane, setting up containerized operators really is a non-trivial and far from straightforward to test locally. Still, it seems worth doing, particularly if you do not want various execution to affect one another.

3. Again, echoing what pyrophane said, a custom solution is needed for anything that's more than a couple hundred bytes in size. There even exists (now mostly abandoned) plugin that allows you to streamline the whole process: https://github.com/industrydive/fileflow Writing directly to something like S3 is almost always sufficient in combination with passing the path for the file from one task to another.

Having said that, I would encourage you to try it out, even if the setup may sound daunting at first. If you can model your tasks as DAGs whose graphs are known in advance, I would argue it almost always makes sense because you get many things "for free" that were not mentioned before: logging, backfill, standard handling of connections/secrets and a ton of metrics in a nice UI that helps a lot with the visibility part.

[+] ies7|5 years ago|reply

1. self hosted on ec2.

1b. Pip install airflow[all|what you need] The airflow itself is easy to install. I’d say that installing the external tools is also easy. I believe installing pg, redis, or celery should be categorized in easy. It’s not the kafka or k8s level of installation.

2. Freedom.

3. Custom scripts

[+] nojito|5 years ago|reply

Airflow is a great example of technology being used at a massive company for massive company problems...which is now being pushed as a solution to everything

Papermill is another example.

[+] slap_shot|5 years ago|reply

> Airflow is an ETL(Extract, Transform, Load) workflow orchestration tool, used in data transformation pipelines.

Apologies if this is pedantic, but the orchestration of jobs transcends ETL workflows. There's countless usecases of scheduling dependent jobs that aren't ETL workloads.

[+] unixhero|5 years ago|reply

Thread a few weeks back with Apache Airflow's cousin Apache Nifi [0] . A lot of great discussion in that thread, just like in this one.

[0] https://news.ycombinator.com/item?id=23144450

[+] chartpath|5 years ago|reply

Happy user of Prefect here. I prefer it for being more programmable and able to run on Dask. If you just want dynamic distributable DAGs and not necessarily an "ops" appliance feel (like Airflow), check them out: https://docs.prefect.io/core/getting_started/why-not-airflow...

Not knocking Airflow, it is great. Luigi too.

[+] hn2017|5 years ago|reply

Airflow isn't perfect but it's in active development and one of the biggest pros compared to other toolkits is that it's an Apache high-level project AND it's being offered by Google as Cloud Composer. This will make sure it sticks around and maintains development for some time.

https://cloud.google.com/composer/

[+] ForHackernews|5 years ago|reply

Airflow has major limitations that don't become obvious until you're already deep into it. I'd advise avoiding it myself.

It's only useful if you have workloads that are very strictly time-bounded (Every day, do X for all the data from yesterday). It's virtually impossible to manage an event-driven or for-each-file-do-Y style workflow with Airflow.

[+] gtrubetskoy|5 years ago|reply

If you are using BigQuery and your "workflow" amounts to importing data from Postgres/MySQL databases into BQ and then running a series of SQL statements into other BigQuery tables - you might want to look at Maestro, it written in Go and is SQL-centric, there is no Python dependency hell to sort out:

https://github.com/voxmedia/maestro/

With the SQL-centric approach you do not need to specify a DAG because it can be inferred automatically, all you do is maintain your SQL and Maestro takes care of executing it in correct order.

[+] throwaway7281|5 years ago|reply

Having used a variety of modern ETL frameworks in the past years, I consider writing a hands-on book about what I have learned on the way.

If I may ask, what questions do you find most difficult to solve in the context of real-world ETL setups?

[+] knite|5 years ago|reply

There are too many different tools in the space. I've been heavily researching workflow / ETL frameworks this week, and even after culling the ones that seemed like poor fits, I'm still left with:

- https://github.com/getpopper/popper

- https://docs.pachyderm.com/

- https://github.com/lyft/flyte

- https://aws.amazon.com/step-functions/

- https://github.com/spotify/luigi

- https://docs.metaflow.org/

- https://github.com/dagster-io/dagster

- https://github.com/argoproj/argo

- https://github.com/prefecthq/prefect

[+] aequitas|5 years ago|reply

> KubernetesExecutor runs each task in an individual Kubernetes pod. Unlike CeleryCelery, it spins up worker pods on demand, hence enabling maximum usage of resources.

You'll probably use up a lot of resources indeed, depending on how big your tasks are you will have quite some overhead to run each and every one in a seperate pod, compared to running them in a Celery Multiprocessing "thread" on an already running worker container.

[+] Vaslo|5 years ago|reply

Has anyone who has used this also used SSIS? Curious as to how the two compare as I use SSIS currently and have gained some experience with Python.

[+] kfk|5 years ago|reply

I always felt neither Airflow nor Superset solved any of the foundational problems with data analytics today. If we take Airflow, it is relatively easy to schedule runs of scripts using cron (or more fancy Nomad jobs with a period stanza). What else does Airflow give me that cron doesn't? Is the parallelization stuff working? Dask is built from the ground up with parallelization in mind, sure it seems to solve a more foundational problem than Airflow. Is triggering and listening to events working? Doesn't look like. Is collaboration working? Doesn't seem to be the case since after writing your python script you need to basically rewrite it into an Airflow dag.

[+] naveedn|5 years ago|reply

Airflow is 100% better at chaining jobs together than cron. The Airflow scheduler makes it so you don’t have to worry about putting “sleep” calls in your bash scripts to wait for some conditions to be met, and allows for non-linear orchestration of jobs. You might have different flows that need to run on different schedules, and with airflow, you can wait until one flow is done before the other starts.

[+] bosie|5 years ago|reply

dependency management. if task #3 fails, any task depending on it shouldn't run. not easy to do with cron based triggers

[+] jillesvangurp|5 years ago|reply

I've spent the past month+ setting airflow up. To be honest, I don't like it for a lot of reasons:

1) it's not cloud native in the sense that running this on e.g. AWS is an easy and well trodden path. Cloud is left as an exercise to the reader of the documentation and at best vaguely hinted at as a possibility. This is weird because that kind of is the whole point of this product. Sure, it has lots of things that are highly useful in the cloud (like an ECS operator or EMR operator); but the documentation is aimed at python hackers running this on their laptop; all the defaults are aimed at this as well. This is a problem because essentially all of that is wrong for a proper cloud native type environment. We've looked at quite a few third party repos for terraform, kubernetes, cloudformation, etc that try to fix this. Ultimately we ended up spending non trivial amounts of time on devops. Basically, this involved lots of problem solving for things that a combination of wrong, poorly documented, or misguided by default. Also, we're not done by a long shot.

2) The UX/UI is terrible and I don't use this word lightly. Think hudson/jenkins, 15 years ago (and technically that's unfair to good old Hudson because it never was this bad). It's a fair comparison because Jenkins kind of is a drop in replacement or at least a significant overlap in feature set. And it arguably has a better ecosystem for things like plugins. Absolutely everything in Airflow requires multiple clicks. Also you'll be doing CMD+R a lot as there is no concept of autorefresh. Lots of fiddly icons. And then there's this obsession with graphs and this being the most important thing ever. There are two separate graph views, only one of which has useful ways of getting to the logs (which never requires less than 4-5 mouse clicks). And of course the other view is the default under most links so you have to learn to click the tiny graph icon to get to the good stuff.

3) A lot of the defaults are wrong/misguided/annoying. Like catch up defaulting to true. There's this weird notion of tasks (dags in airflow speak) running on a cron pattern and requiring a start date in the past. Using a dynamic date is not recommended (i.e. now would be a sane default). So typically you just pick whatever fixed time in the past. When you turn a dag on it tries to 'backfill' from that date unless you set catchup to false. I don't know in what universe that's a sane default. Sure, I want to run this task 1000 times just because I unpaused it (everything is paused by default). There is no way to unschedule that. Did I mention the default parallism is 32. That in combination with the docker operator is a great way to instantly run out of memory (yep that happened to us).

4) The UI lacks ways to group tasks like by tag or folders, etc. This gets annoying quickly.

5) Dag configs as code in a weakly typed language without a good test harness leads to obvious problems. We've sort of gobbled together our own tests to somewhat mitigate repeated deploy screw ups.

6) implementing a worker architecture in a language that is still burdened with the global interpreter lock and that has no good support for either threading or light weight threads (aka co-routines) or doing things asynchronously, leads to a lot of complexity. The celery worker is a PITA to debug.

7) IMHO the python operator is a bad idea because it gives data scientists the wrong idea about, oh just install this library on every airflow host please so I can run my thingy. We use the Docker operator a lot and are switching to the ECS operator as soon as we can figure out how to run airflow in ECS (we currently have a snow flaky AMI running on ec2).

8) the logging UI is terrible compared to what I would normally use for logging. Looking at logs of task runs is kind of the core business the UI has to do.

9) Airflow has a DB where it keeps track of state. Any change to dags basically means this state gets stale pretty quickly. There's no sane way to get rid of this stale data other than a lot of command-line fiddling or just running some sql scripts directly against this db. I've manually deleted hundreds of jobs in the last month. Also there's no notion of having a sane default for number of execution runs to preserve. Likewise the there is built in way to clean up logs. Again, Jenkins/Hudson had that always. I have jobs that run every 10 minutes and absolutely no need to keep months of history on that.

There are more things I could list. Also, there are quite a few competing products; this is a very crowded space. I've given serious thought to using Spring Batch or even just firing up a Jenkins. Frankly the only reason we chose airflow is that it's easier for data scientists who are mostly only comfortable with python. So far, I've been disappointed with how complex and flaky this setup is.

If you go down the path of using it, think hard about which operators you are going to use and why. IMHO dockerizing tasks means that most of what Airflow does is just ensuring your dockerized tasks run. Limiting what it does is a good thing. Just because you can doesn't mean you should in airflow. IMHO most of the operators naturally lead to your airflow installs being snow flakes.

Not dockerizing means you are mixing code and orchestration. Just like installing dependencies on CI servers is not a great idea is also the reason why doing the same on an airflow system is a bad idea.

[+] rywalker|5 years ago|reply

1) An official Helm chart is coming very soon. We (Astronomer) have a commercial platform that aims to solve this completely including observability, configuration and deployment. Happy to team up to improve Airflow's open-source k8s story if you have some ideas.

2) Yes the UI is outdated, and not responsive. We're going to kick off a process to build a new modern UI in Q3 (a full-featured Swagger API is being built now, which the new UI will rely upon.)

3) I personally think catchup true is a fine default, but whatever. Generally when I launch a new DAG I want to generate some historical data using the DAG.

4) Airflow has tags now since 1.10.8 https://airflow.readthedocs.io/en/latest/howto/add-dag-tags.... - we decided not to do folders.

5) That's true, but it also provides a low bar to entry. There are some guides written on unit testing DAGs, but I agree we should be a test-first community. On my roadmap.

6) Celery w/ KEDA is pretty nice - check out https://www.astronomer.io/blog/the-keda-autoscaler/

7) I personally love the PythonOperator for simple DAGs. Agree that DockerOperator (KubernetesPodOperator if you're running Airflow in K8s)

8) Yes, logging UI will be improved in the UI rewrite. What's your favorite UI for this?

9) That feature is in the queue https://github.com/apache/airflow/issues/7911

[+] opportune|5 years ago|reply

Other than 5+6 this seems like basically a spec for a managed airflow product. So basically run Airflow on public cloud, manage all the operational bits, create a better UI, and fix some upstream bugs.

[+] blakeburch|5 years ago|reply

I couldn't agree more with you on most of these points. You may be interested in trying out Shipyard (www.shipyardapp.com). Fair disclosure, I'm the co-founder. While we don't address all of these issues, we're building with these key focuses.

- Simplicity is key. Data Teams should focus on creating solutions, not fighting infrastructure and limitations. - Workflows shouldn't change how code is written. Your code should run the same locally as on our platform, with no extra packages or proprietary setup files required. - Templates are a first-class object. The modern data pipeline should be built with repeatability in mind. - Data solutions should be usable and visible beyond the walls of technical teams.

We're in a private beta and rapidly trying to improve the product. I would love to chat more if you're interested. Details in profile.

For your specific problems:

1) We're cloud-native and handle hosting/scaling on our side. You don't have to worry about setup. Just log in and launch your code.

2) Our UI is pretty slick (built in AntD) and built to reduce the overwhelming options when setting jobs up.

3) If you want to run a job on-demand, just press "Run now". If you schedule a job, then change the schedule or status, we'll automatically add/update/remove the schedules. Other technical defaults aren't options right now because we're trying to abstract those choices away so Data Teams can just focus on building solutions that work.

4) We let you group your scripts into "Projects" (essentially folders) with a high level overview of the quantity of jobs, as well as how many recently failed or succeeded.

5) Workflows get made directly in the UI. This makes it easier for any less technical users to set up jobs on their own. We still have a lot of improvement to go in this area though.

6) Our product is written in Go (the language of the cloud). We don't force the worker to be in a language and managed by a process in that language. We manage at the process level.

7) Every job creates a new container on the fly, installing package dependencies that the user specifies. You can connect your scripts together without worrying about conflicting packages, conflicting language versions, or without needing to know how to make and manage Docker containers.

8) Not sure how we compare on the logging front. However, we separate out logs for each time a script runs so you're not having to search for a needle in a haystack. You can filter and search for specific logs in the UI.

152 comments