How is this different from, or better than, existing tools and workflows? I don't like to criticise new frameworks or tools without first understanding them, but I'd like to know some key differences, without the marketing/PR fluff, before giving one a go.
Hey, I'm one of the authors of Metaflow. Happy to answer any questions! Netflix has been using Metaflow internally for about two years, so we have many war stories :)
Can you say a little about which niche this would occupy, and what the motivation is? Is it intended to compete with TensorFlow and PyTorch, or to be an industrial-strength version of scikit-learn?
I looked through the tutorial on my mobile and the answer was not immediately clear.
Is the benefit that it auto scales on AWS without having to think through the infrastructure?
G'day, seems like a cool tool, thanks. The links to the GitHub tutorials are currently broken...
I'm of the opinion that adopting some standard DAG meta-format for data science could make a positive impact on the reproducibility issues we have in science generally. So it's good to see the idea has real-world merit as well.
Hey I meant to track one of y'all down at the MLOps conference, but didn't get the chance. I've built a very shitty version of a cached-execution DAG thing internally, and one of the design decisions I made was to have it so that parent nodes in the DAG don't need to know anything about child nodes. This allows for larger DAG builders to be more easily subclassed.
Metaflow doesn't do that -- instead each 'step' has to know what to call next, which means that if I wanted to subclass e.g. the MovieStatsFlow in [here](https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) and, say, add some sort of input pre-processing before the compute_statistics call, I'd essentially end up having to either override what compute_statistics does to not match its name _or_ copy-paste that first step just to replace that last line.
I'm sure this design decision was considered and/or that use-case doesn't come up a lot at Netflix (although I've encountered that a lot), or maybe I'm missing something very obvious, but I'd love to hear your thoughts on that.
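To make the trade-off concrete, here's a hypothetical sketch (not Metaflow's real API; the class and step names are made up for illustration) contrasting a flow whose steps hardcode their successors with one whose edges live in a single table that subclasses can re-route:

```python
# Style 1: each step names its successor, so a subclass that wants to splice
# in a new step must also override the parent step that points past it.
class MovieStatsFlow:
    def run(self):
        step = self.start
        while step:
            step = step()

    def start(self):
        self.data = ["raw"]
        return self.compute_statistics   # successor hardcoded in the step

    def compute_statistics(self):
        self.stats = len(self.data)
        return None                      # end of flow

# Style 2: steps are parent-agnostic; the edges live in one table, so a
# subclass can re-route the DAG without touching the steps themselves.
class TableDrivenFlow:
    edges = {"start": "compute_statistics"}

    def run(self):
        step = "start"
        while step:
            getattr(self, step)()
            step = self.edges.get(step)

    def start(self):
        self.data = ["raw"]

    def compute_statistics(self):
        self.stats = len(self.data)

class PreprocessedFlow(TableDrivenFlow):
    # splice preprocessing in front of compute_statistics by editing edges only
    edges = {"start": "preprocess", "preprocess": "compute_statistics"}

    def preprocess(self):
        self.data = [d.upper() for d in self.data]
```

With style 2, `PreprocessedFlow` adds its step without overriding `start` or `compute_statistics`; with style 1 it would have to override `start` just to change the return value.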
I have been looking for something exactly like this, but I use GCP not AWS. Is there a way to deploy outside of AWS? What would be involved in getting it to work with a different cloud?
Hi Ville, I cannot express how happy I am that you open-sourced Metaflow. I have three questions:
1) Do you know the ETA of the R API release?
2) Would you recommend using both Metaflow and MLflow in projects? Could you please explain why or why not? :)
3) Do you plan to release integration with Spark/YARN?
Thanks in advance, Roko
Thank you for sharing this. Would this be useful for me if I only need the deployment management part? Don't really need to track experiments, just looking for an easy way to deploy my models to Fargate.
This looks exciting! I'll play around with the tutorial and try to set up the AWS environment this weekend. I have several questions.
1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist working by himself? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow is likely to become useful?
2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.
3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?
4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.
Is there a reason to use this over DVC[1] which is language and framework agnostic and supports a large number of storage backends? It works with any git repo and even polyglot implementations and can run the DAG on any system.
Currently using DVC, MLflow just for metadata visualization and notes on experiments, and Anaconda for (Python) dependency management. We are an embedded shop so we don't deploy to the "cloud."
[1]: https://dvc.org/
We are good friends with the DVC folks! If the DVC + MLFlow + Anaconda stack works for you, that's great. Metaflow provides similar features. The cloud integration is really important at Netflix's scale.
My team has a similar library called Loman[1], which we open-sourced. Instead of nodes representing tasks, they represent data, and the library keeps track of which nodes are up-to-date or stale as you provide new inputs or change how nodes are computed. Each node is either an input node with a provided value, or a computed node with a function to calculate its value. Think of it as a grown-up Excel calculation tree. We've found it quite useful for quant research, and in production it works nicely because you can serialize the entire computation graph, which gives an easy way to diagnose what failed and why in hundreds of interdependent computations. It's also useful for real-time displays, where you can bind market and UI inputs to nodes and calculated nodes back to the UI - some things you want to recalculate frequently, whereas some are slow and need to happen infrequently in the background.
[1] Github: https://github.com/janushendersonassetallocation/loman
[2] Docs: https://loman.readthedocs.io/en/latest/
[3] Examples: https://github.com/janushendersonassetallocation/loman/tree/...
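The nodes-as-data idea can be sketched in a few lines. This is a toy illustration of staleness propagation, not Loman's actual API:

```python
# Minimal data-node dependency graph: input nodes hold values, computed
# nodes hold a function plus dependencies; changing an input marks every
# downstream node stale, and compute() recalculates only what it must.
class Graph:
    def __init__(self):
        self.values, self.funcs, self.deps = {}, {}, {}
        self.stale = set()

    def add_input(self, name, value):
        self.values[name] = value

    def set_input(self, name, value):
        self.values[name] = value
        self._invalidate(name)           # downstream nodes become stale

    def add_computed(self, name, func, deps):
        self.funcs[name], self.deps[name] = func, deps
        self.stale.add(name)             # not yet calculated

    def _invalidate(self, name):
        for node, deps in self.deps.items():
            if name in deps and node not in self.stale:
                self.stale.add(node)
                self._invalidate(node)   # propagate staleness transitively

    def compute(self, name):
        if name in self.stale or name not in self.values:
            args = [self.compute(d) for d in self.deps[name]]
            self.values[name] = self.funcs[name](*args)
            self.stale.discard(name)
        return self.values[name]         # cached value if still fresh
```

For example, after `g.add_input("x", 2)` and `g.add_computed("y", lambda x: x * 10, ["x"])`, calling `g.set_input("x", 3)` marks `y` stale so the next `g.compute("y")` recalculates it.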
I am disappointed that when I click on documentation, "why metaflow," I get a bunch of cartoony BS instead of a simple text explanation. Glad these folks don't write RFCs.
Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all like that.
We are on Azure using Spark via Databricks. We had to abandon scikit-learn because of this choice. Does your service require AWS, and can it be used in conjunction with Spark? Thank you for your time and consideration.
btw, if you happen to be at AWS re:Invent right now, you can get a stylish, collector's edition Metaflow t-shirt if you drop by the Netflix booth at the expo hall, or ping us otherwise!
The fact that metaflow works directly in Python piques my interest. I can lint it, I can test it, I can format it, I can easily extend it.
I've been hesitant to commit myself and my collaborators to yet another DSL -- and that's part of why snakemake and nextflow haven't had much to offer me.
Can anybody provide a good comparison e.g. with Meltano?
I am not affiliated with the Meltano people, but I like the idea of keeping the system modular, which seems to make it easier to replace components.
I have no doubt that we will see better replacements for every component of a data pipeline in the coming years. If there is only one thing to do right, it's to not bet on one tool but to keep the whole stack flexible.
I am still missing well-established standards for data formats, workflow definitions, and project descriptions. Hopefully open-source ninjas will deliver on this front before proprietary pirates destroy the field with progress-inhibiting closed things. It seems to be too late to create an "AutoCAD" or "Word" file format for data science, and I see no clear winner at the moment - but hopefully my sight is bad; please enlighten me!
Seems like a cool addition to the DAG ML tooling family. Thanks for sharing! Do you support, or plan to support, features commonly found in data science platform tools like Domino (https://www.dominodatalab.com/)? I'm thinking of container management, automatic publishing of web apps and API endpoints, providing a search for artifacts like code or projects, etc.
- Container management: see https://docs.metaflow.org/metaflow/dependencies
- Search for artifacts: see https://docs.metaflow.org/metaflow/client
- Automatic publishing of web apps: we have this internally but it is not open-source yet. If it interests you, react to this issue (https://github.com/Netflix/metaflow/issues/3)
Let us know if you notice any other interesting features missing! Feel free to open a GitHub issue or reach out to us on http://chat.metaflow.org
Metaflow comes with a built-in scheduler. If your company has an existing scheduler, e.g. for ETL, you can translate Metaflow DAGs to the production scheduler automatically. This is what we do at Netflix.
We could provide similar support e.g. for Airflow, if there's interest.
For monitoring, we have relied on notebooks thus far. Their flexibility and customizability are unbeatable. We might build some lightweight discovery/debugging UI later, if there's demand. We are a bit on the fence about it internally.
Hey tristanz - we had similar needs and built up a lightweight scheduling, monitoring, and data-flow UI. Originally for Airflow, but we're excited about Metaflow and will be integrating. Shoot me a note if you're interested in seeing what we built.
Very interesting project. I love that this allows you to transparently switch the "runtime" from local to cloud, like Spark does, but integrated with common Python tools like sklearn/TF etc. Looking forward to testing Metaflow out myself.
I looked over the tutorials and am curious to know whether they are representative of how Netflix does ML.
Is data really being read in .csv format and processed in memory with pandas?
Because I see "petabytes of data" being thrown around everywhere, and I am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas. Shouldn't a simple SQL DWH do the same thing more efficiently, with partitioned tables, clustered indexes, and the power of the SQL language?
I would love to take a look at one representative ML pipeline (even with masked names of datasets and features) just to see how "terabytes" of data get processed into a model.
Good question! A typical Metaflow workflow at Netflix starts by reading data from our data warehouse, either by executing a (Spark)SQL query or by fetching Parquet files directly from S3 using the built-in S3 client. We have some additional Python tooling to make this easy (see https://github.com/Netflix/metaflow/issues/4)
After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like scikit-learn or TensorFlow. Many workflows train a suite of models using the foreach construct.
The results can be pushed to various other systems. Typically they are either pushed to another table or deployed as a microservice (see https://github.com/Netflix/metaflow/issues/3)
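The fan-out/join shape that foreach gives such workflows can be sketched in plain Python (this is an illustration of the pattern, not Metaflow code; `train` is a hypothetical stand-in for a real training call to e.g. scikit-learn):

```python
# Foreach pattern: fan out one training branch per parameter value,
# then join the branches by picking the best-scoring model.
from concurrent.futures import ThreadPoolExecutor

def train(learning_rate):
    # stand-in for real model training; returns (score, model)
    score = 1.0 - abs(learning_rate - 0.1)   # pretend 0.1 is optimal
    return score, {"lr": learning_rate}

def run_suite(learning_rates):
    with ThreadPoolExecutor() as pool:        # foreach: one branch per value
        results = list(pool.map(train, learning_rates))
    return max(results, key=lambda r: r[0])   # join: keep the best model

best_score, best_model = run_suite([0.01, 0.1, 1.0])
```

In a real flow, each branch would be a separate task (potentially its own cloud container) rather than a thread, but the data flow is the same.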
Orchestrating a workflow, which is what Dagster does, is just one part of Metaflow. Other important parts are dependency management, cloud integration, state transfer, inspecting and organizing results - features that are central to data science workflows.
Metaflow helps data scientists build and manage data science workflows, not just execute a DAG.
amirathi | 6 years ago
Here's my understanding:
- It's a python library for creating & executing DAGs
- Each node is a processing step & the results are stored after each step, so you can restart failed workflows from where they failed
- Tight integration with AWS ECS to run the whole DAG on cloud
I don't know why, but their .org site oddly feels like a paid SaaS tool. Anyway, thank you Netflix for open-sourcing Metaflow.
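The restart-from-failure idea in the second bullet can be sketched in a few lines. This is a toy illustration of checkpointed execution, not Metaflow's actual implementation:

```python
# Persist each step's result to a state file; on a re-run, steps that
# already completed are skipped and execution resumes from the failure.
import json, os

def run_flow(steps, state_file="state.json"):
    # load the checkpoint from a previous (possibly failed) run, if any
    state = {"done": [], "_data": None}
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)
    data = state["_data"]
    for name, fn in steps:
        if name in state["done"]:
            continue                      # step already ran in a prior attempt
        data = fn(data)
        state["done"].append(name)
        state["_data"] = data             # results must be JSON-serializable here
        with open(state_file, "w") as f:  # checkpoint after every step
            json.dump(state, f)
    return data
```

If a later step raises, rerunning `run_flow` with the same state file skips the completed steps and retries only from the failed one.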
Thorentis | 6 years ago
For instance, this tutorial example here (https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) does not look substantially different from what I could achieve just as easily in R, or in other Python data-wrangling frameworks.
Is the main feature the fact I can quickly put my workflows into the cloud?
rxin | 6 years ago
Disclaimer: Databricks cofounder.
cpintomammee | 6 years ago
[1] https://snakemake.readthedocs.io/en/stable/ [2] https://www.nextflow.io/
tristanz | 6 years ago
It would be great to have a scheduler and monitoring UI that are equally lightweight.
manojlds | 6 years ago
https://github.com/dagster-io/dagster