How is this different from, or better than, existing tools and workflows? I don't like to criticise new frameworks or tools without first understanding them, but I'd like to know some key differences, without the marketing/PR fluff, before giving one a go.
Hey, I'm one of the authors of Metaflow. Happy to answer any questions! Netflix has been using Metaflow internally for about two years, so we have many war stories :)
Can you say a little about which niche this would occupy, and what the motivation is? Is it intended to compete with TensorFlow and PyTorch, or to be an industrial-strength version of scikit-learn?
I looked through the tutorial on my mobile and the answer was not immediately clear.
Is the benefit that it auto scales on AWS without having to think through the infrastructure?
G'day, seems like a cool tool, thanks. The links to the GitHub tutorials are currently broken...
I'm of the opinion that adopting some standard DAG meta-format for data science could make a positive impact on the reproducibility issues we have in science generally. So it's good to see the idea has real-world merit as well.
Hey I meant to track one of y'all down at the MLOps conference, but didn't get the chance. I've built a very shitty version of a cached-execution DAG thing internally, and one of the design decisions I made was to have it so that parent nodes in the DAG don't need to know anything about child nodes. This allows for larger DAG builders to be more easily subclassed.
Metaflow doesn't do that -- instead each 'step' has to know what to call next, which means that if I wanted to subclass e.g. the MovieStatsFlow in [here](https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) and, say, add some sort of input pre-processing before the compute_statistics call, I'd essentially end up having to either override what compute_statistics does to not match its name _or_ copy-paste that first step just to replace that last line.
I'm sure this design decision was considered and/or that use-case doesn't come up a lot at Netflix (although I've encountered that a lot), or maybe I'm missing something very obvious, but I'd love to hear your thoughts on that.
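To make the trade-off concrete, here's a hypothetical sketch (not Metaflow's real API; the class and step names are made up for illustration) contrasting a flow whose steps hardcode their successors with one whose edges live in a single table that subclasses can re-route:

```python
# Style 1: each step names its successor, so a subclass that wants to splice
# in a new step must also override the parent step that points past it.
class MovieStatsFlow:
    def run(self):
        step = self.start
        while step:
            step = step()

    def start(self):
        self.data = ["raw"]
        return self.compute_statistics   # successor hardcoded in the step

    def compute_statistics(self):
        self.stats = len(self.data)
        return None                      # end of flow

# Style 2: steps are parent-agnostic; the edges live in one table, so a
# subclass can re-route the DAG without touching the steps themselves.
class TableDrivenFlow:
    edges = {"start": "compute_statistics"}

    def run(self):
        step = "start"
        while step:
            getattr(self, step)()
            step = self.edges.get(step)

    def start(self):
        self.data = ["raw"]

    def compute_statistics(self):
        self.stats = len(self.data)

class PreprocessedFlow(TableDrivenFlow):
    # splice preprocessing in front of compute_statistics by editing edges only
    edges = {"start": "preprocess", "preprocess": "compute_statistics"}

    def preprocess(self):
        self.data = [d.upper() for d in self.data]
```

With style 2, `PreprocessedFlow` adds its step without overriding `start` or `compute_statistics`; with style 1 it would have to override `start` just to change the return value.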
I have been looking for something exactly like this, but I use GCP not AWS. Is there a way to deploy outside of AWS? What would be involved in getting it to work with a different cloud?
Hi Ville, I cannot express how happy I am that you open-sourced Metaflow. I have three questions:
1) Do you know the ETA of the R API release?
2) Would you recommend using both Metaflow and MLflow in projects? Could you please explain why or why not? :)
3) Do you plan to release integration with Spark/YARN?
Thanks in advance, Roko
Thank you for sharing this. Would this be useful for me if I only need the deployment management part? Don't really need to track experiments, just looking for an easy way to deploy my models to Fargate.
This looks exciting! I'll play around with the tutorial and try to set up the AWS environment this weekend. I have several questions.
1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist working by himself? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow is likely to become useful?
2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.
3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?
4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.
Is there a reason to use this over DVC[1] which is language and framework agnostic and supports a large number of storage backends? It works with any git repo and even polyglot implementations and can run the DAG on any system.
Currently using DVC, MLflow just for metadata visualization and notes on experiments, and Anaconda for (Python) dependency management. We are an embedded shop so we don't deploy to the "cloud."
[1]: https://dvc.org/
We are good friends with the DVC folks! If the DVC + MLFlow + Anaconda stack works for you, that's great. Metaflow provides similar features. The cloud integration is really important at Netflix's scale.
My team has a similar library called Loman[1], which we open-sourced. Instead of nodes representing tasks, they represent data, and the library keeps track of which nodes are up-to-date or stale as you provide new inputs or change how nodes are computed. Each node is either an input node with a provided value, or a computed node with a function to calculate its value. Think of it as a grown-up Excel calculation tree. We've found it quite useful for quant research, and in production it works nicely because you can serialize the entire computation graph, which gives an easy way to diagnose what failed and why in hundreds of interdependent computations. It's also useful for real-time displays, where you can bind market and UI inputs to nodes and calculated nodes back to the UI - some things you want to recalculate frequently, whereas some are slow and need to happen infrequently in the background.
[1] Github: https://github.com/janushendersonassetallocation/loman
[2] Docs: https://loman.readthedocs.io/en/latest/
[3] Examples: https://github.com/janushendersonassetallocation/loman/tree/...
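The nodes-as-data idea can be sketched in a few lines. This is a toy illustration of staleness propagation, not Loman's actual API:

```python
# Minimal data-node dependency graph: input nodes hold values, computed
# nodes hold a function plus dependencies; changing an input marks every
# downstream node stale, and compute() recalculates only what it must.
class Graph:
    def __init__(self):
        self.values, self.funcs, self.deps = {}, {}, {}
        self.stale = set()

    def add_input(self, name, value):
        self.values[name] = value

    def set_input(self, name, value):
        self.values[name] = value
        self._invalidate(name)           # downstream nodes become stale

    def add_computed(self, name, func, deps):
        self.funcs[name], self.deps[name] = func, deps
        self.stale.add(name)             # not yet calculated

    def _invalidate(self, name):
        for node, deps in self.deps.items():
            if name in deps and node not in self.stale:
                self.stale.add(node)
                self._invalidate(node)   # propagate staleness transitively

    def compute(self, name):
        if name in self.stale or name not in self.values:
            args = [self.compute(d) for d in self.deps[name]]
            self.values[name] = self.funcs[name](*args)
            self.stale.discard(name)
        return self.values[name]         # cached value if still fresh
```

For example, after `g.add_input("x", 2)` and `g.add_computed("y", lambda x: x * 10, ["x"])`, calling `g.set_input("x", 3)` marks `y` stale so the next `g.compute("y")` recalculates it.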
I am disappointed that when I click on documentation, "why metaflow," I get a bunch of cartoony BS instead of a simple text explanation. Glad these folks don't write RFCs.
Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all like that.
We are on Azure using Spark via Databricks. We had to abandon scikit-learn because of this choice. Does your service require AWS, and can it be used in conjunction with Spark? Thank you for your time and consideration.
btw, if you happen to be at AWS re:Invent right now, you can get a stylish, collector's edition Metaflow t-shirt if you drop by the Netflix booth at the expo hall, or ping us otherwise!
The fact that metaflow works directly in Python piques my interest. I can lint it, I can test it, I can format it, I can easily extend it.
I've been hesitant to commit myself and my collaborators to yet another DSL -- and that's part of why snakemake and nextflow haven't had much to offer me.
Can anybody provide a good comparison e.g. with Meltano?
I am not affiliated with the Meltano people, but I like the idea of keeping the system modular, which seems to make it easier to replace components.
I have no doubt that we will see better replacements for every component of a data pipeline in the coming years. If there is only one thing to do right, it's to not bet on one tool but to keep the whole stack flexible.
I am still missing well-established standards for data formats, workflow definitions, and project descriptions. Hopefully open-source ninjas will deliver on this front before proprietary pirates destroy the field with progress-inhibiting closed things. It seems to be too late to create an "AutoCAD" or "Word" file format for data science, and I see no clear winner at the moment - but hopefully my sight is bad; please enlighten me!
Seems like a cool addition to the DAG ML tooling family. Thanks for sharing! Do you support, or plan to support, features commonly found in data science platform tools like Domino (https://www.dominodatalab.com/)? I'm thinking of container management, automatic publishing of web apps and API endpoints, providing a search for artifacts like code or projects, etc.
- Container management: see https://docs.metaflow.org/metaflow/dependencies
- Search for artifacts: see https://docs.metaflow.org/metaflow/client
- Automatic publishing of web apps: we have this internally but it is not open-source yet. If it interests you, react to this issue (https://github.com/Netflix/metaflow/issues/3)
Let us know if you notice any other interesting features missing! Feel free to open a GitHub issue or reach out to us on http://chat.metaflow.org
Metaflow comes with a built-in scheduler. If your company has an existing scheduler, e.g. for ETL, you can translate Metaflow DAGs to the production scheduler automatically. This is what we do at Netflix.
We could provide similar support e.g. for Airflow, if there's interest.
For monitoring, we have relied on notebooks thus far. Their flexibility and customizability are unbeatable. We might build some lightweight discovery/debugging UI later, if there's demand. We are a bit on the fence about it internally.
Hey tristanz - we had similar needs and built up a lightweight scheduling, monitoring, and data-flow UI. Originally for Airflow, but we're excited about Metaflow and will be integrating. Shoot me a note if you're interested in seeing what we built.
Very interesting project. I love that this allows you to transparently switch the "runtime" from local to cloud, like Spark does, but integrated with common Python tools like sklearn/TF etc. Looking forward to testing Metaflow out myself.
I looked over the tutorials and am curious to know whether they are representative of how Netflix does ML.
Is data really being read in .csv format and processed in memory with pandas?
Because I see "petabytes of data" being thrown around everywhere, and I am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas. Shouldn't a simple SQL DWH do the same thing more efficiently, with partitioned tables, clustered indexes, and the power of the SQL language?
I would love to take a look at one representative ML pipeline (even with masked names of datasets and features) just to see how "terabytes" of data get processed into a model.
Good question! A typical Metaflow workflow at Netflix starts by reading data from our data warehouse, either by executing a (Spark)SQL query or by fetching Parquet files directly from S3 using the built-in S3 client. We have some additional Python tooling to make this easy (see https://github.com/Netflix/metaflow/issues/4)
After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like scikit-learn or TensorFlow. Many workflows train a suite of models using the foreach construct.
The results can be pushed to various other systems. Typically they are either pushed to another table or deployed as a microservice (see https://github.com/Netflix/metaflow/issues/3)
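The fan-out/join shape that foreach gives such workflows can be sketched in plain Python (this is an illustration of the pattern, not Metaflow code; `train` is a hypothetical stand-in for a real training call to e.g. scikit-learn):

```python
# Foreach pattern: fan out one training branch per parameter value,
# then join the branches by picking the best-scoring model.
from concurrent.futures import ThreadPoolExecutor

def train(learning_rate):
    # stand-in for real model training; returns (score, model)
    score = 1.0 - abs(learning_rate - 0.1)   # pretend 0.1 is optimal
    return score, {"lr": learning_rate}

def run_suite(learning_rates):
    with ThreadPoolExecutor() as pool:        # foreach: one branch per value
        results = list(pool.map(train, learning_rates))
    return max(results, key=lambda r: r[0])   # join: keep the best model

best_score, best_model = run_suite([0.01, 0.1, 1.0])
```

In a real flow, each branch would be a separate task (potentially its own cloud container) rather than a thread, but the data flow is the same.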
Orchestrating a workflow, which is what Dagster does, is just one part of Metaflow. Other important parts are dependency management, cloud integration, state transfer, inspecting and organizing results - features that are central to data science workflows.
Metaflow helps data scientists build and manage data science workflows, not just execute a DAG.
amirathi | 6 years ago
Here's my understanding:
- It's a python library for creating & executing DAGs
- Each node is a processing step & the results are stored after each step, so you can restart failed workflows from where they failed
- Tight integration with AWS ECS to run the whole DAG on cloud
I don't know why, but their .org site oddly feels like a paid SaaS tool. Anyway, thank you Netflix for open-sourcing Metaflow.
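The restart-from-failure idea in the second bullet can be sketched in a few lines. This is a toy illustration of checkpointed execution, not Metaflow's actual implementation:

```python
# Persist each step's result to a state file; on a re-run, steps that
# already completed are skipped and execution resumes from the failure.
import json, os

def run_flow(steps, state_file="state.json"):
    # load the checkpoint from a previous (possibly failed) run, if any
    state = {"done": [], "_data": None}
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)
    data = state["_data"]
    for name, fn in steps:
        if name in state["done"]:
            continue                      # step already ran in a prior attempt
        data = fn(data)
        state["done"].append(name)
        state["_data"] = data             # results must be JSON-serializable here
        with open(state_file, "w") as f:  # checkpoint after every step
            json.dump(state, f)
    return data
```

If a later step raises, rerunning `run_flow` with the same state file skips the completed steps and retries only from the failed one.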
Thorentis | 6 years ago
For instance, this tutorial example here (https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) does not look substantially different from what I could achieve just as easily in R, or in other Python data-wrangling frameworks.
Is the main feature the fact I can quickly put my workflows into the cloud?
rxin | 6 years ago
Disclaimer: Databricks cofounder.
cpintomammee | 6 years ago
[1] https://snakemake.readthedocs.io/en/stable/ [2] https://www.nextflow.io/
tristanz | 6 years ago
It would be great to have a scheduler and monitoring UI that are equally lightweight.
manojlds | 6 years ago
https://github.com/dagster-io/dagster