We Need DevOps for ML Data

215 points | amargvela | 6 years ago | tecton.ai

87 comments

[+] softwaredoug|6 years ago|reply
In my experience, the problem has a lot to do with how teams organize around ML.

When you have an engineering team separate from a data science team, you'll inevitably have unproductive conflict & politics. One team might be incentivized for stability and speed (engineering or ops) and the other for model accuracy (data science). The end result can be disastrous... an engineering team that refuses to bend at all to help data scientists get their work into production, or a data science team that only cares about maximizing accuracy, even if it might destroy prod or be impractical to implement in a performant way.

To hit the sweet spot on accuracy, speed, and stability, you need to have one team that focuses on the end feature. It needs to be cross-functional and accountable for doing a great job at that feature. And the data scientists may need to be more focused on measuring and analyzing the feature's success, rather than just building models for models' sake.

I'd recommend the book Agile IT Organization Design if you're interested in good team design patterns.

[+] tixocloud|6 years ago|reply
This. In my experience across larger enterprises, data science teams rarely hold the keys to production environments and therefore rely heavily on IT to productionize ML. And I completely agree that data scientists need to be focused on measuring and analyzing success as opposed to churning out more and more models.
[+] winrid|6 years ago|reply
I agree! We had this org structure and had tons of problems, but certain small teams that worked cross-functionally were very productive.
[+] c3534l|6 years ago|reply
> When you have an engineering team separate from

Which is the whole idea behind DevOps: to break down the barriers between development and deployment by focusing on rapid iteration to production by continuously integrating changes into that pipeline.

It's ironic that DevOps has become a specialty in and of itself. The idea is to get rid of separate teams, not create a new one!

[+] simonw|6 years ago|reply
I see this as more of an organizational challenge than a technology challenge.

Getting ML models into production isn't particularly hard... if you put an engineering team on it that knows how to write automated release procedures, design architectures that can scale, and build robust APIs to surface the data.

But in many companies the engineers with those operations-level skills and the researchers who work on machine learning live completely separate lives. And then the researchers are expected to deploy and scale their models to production!

That's not to say that this organizational problem cannot be solved with technology/entrepreneurship. If a company can afford it, it's likely much cheaper to pay an external company to solve your "ML in production" problems than to redesign your organization so that your internal ML teams have the skills they need to go to prod.

[+] calebkaiser|6 years ago|reply
I agree that a lot of the challenges around production ML are organizational, but I think in many companies, it has more to do with a lack of engineering resources than it does the separation of eng and data science (though that certainly happens).

Building and maintaining ML infrastructure from scratch is a big project. That's why you see FAANG companies hiring for ML infrastructure/platform engineers. Most startups don't have the extra cycles for that big of an undertaking, and so you see a lot of slapped-together, hacky solutions to putting models into production.

I'm biased in that I work on Cortex ( https://github.com/cortexlabs/cortex ), but I think that open source, modular tooling that removes the need to reinvent the wheel is going to have a big impact in terms of making production ML more accessible.

[+] Cacti|6 years ago|reply
I disagree. It’s not about getting the data where it needs to be. It’s about data version control at a very fine level with very large datasets (in a way that is efficient). It’s about detecting changes in model results based on changes in data. It’s about tracking the provenance of data in the datasets. It’s about potentially controlled access to the data (e.g. allowing models to use health care data without actually exposing the underlying data). It’s about detecting bias in datasets over time.

It’s actually quite complex, which is why generally speaking very few people do anything like this. I am unaware of any general solution to this problem, either in industry or academia.
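One concrete piece of "detecting changes in model results based on changes in data" is drift detection on input features. A minimal Population Stability Index sketch (an illustrative heuristic, not the general solution the parent says is missing):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index: compare the binned distribution of a
    numeric feature in a baseline sample vs. a new sample. A common rule
    of thumb treats PSI > 0.2 as meaningful drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def fracs(xs):
        counts = [0] * bins
        for x in xs:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # smooth empty bins so the log below is always defined
        return [(c or 0.5) / len(xs) for c in counts]
    return sum((a - e) * math.log(a / e)
               for e, a in zip(fracs(expected), fracs(actual)))

baseline = [i / 10 for i in range(1000)]
print(psi(baseline, baseline))                  # identical data: 0.0
print(psi(baseline, [x + 50 for x in baseline]) > 0.2)  # shifted data drifts
```

This only covers one bullet of the parent's list; provenance, fine-grained versioning, and controlled access each need their own machinery.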

[+] kevinstumpf|6 years ago|reply
(Tecton CTO here) You’re absolutely right that ML projects can’t be solved with technology alone. Besides the right tooling, they also require process, organizational setup, buy-in from multiple stakeholders, etc. By itself, no technology will turn a company into an “ML-first” company. Both technology and organizational problems need to be solved.

A while back, we published a blog post that discusses how we approached these organizational challenges at Uber: https://eng.uber.com/scaling-michelangelo/. With Michelangelo, we found that the right tooling can both solve technical challenges and help with some organizational challenges. For example: If a standardized and centralized platform is the path of least resistance to get ML into production and solve your business problem, you get the organizational benefits of that centralization (governance/visibility/collaboration) along the way.

[+] mmq|6 years ago|reply
> if you put an engineering team on it that knows how to write automated release procedures

I think surfacing the data is just the first step. Oftentimes data scientists need to run some data exploration, and the process is generally iterative: they need to run several experiments, resume or restart some of them, scale training across several machines with distributed learning, or run hyperparameter tuning, which means handling failures and visualizing and debugging results before deciding whether to deploy a model. Once a model is deployed the story doesn't end there, because models become stale and need to be retrained. There are other issues related to compliance that need to be handled as well, and many other problems related to governance, a/b testing, ...

The good news is that there are several open source initiatives tackling these problems; at Polyaxon [0] we are trying to solve some of the aspects related to the experimentation phase.

[0] https://github.com/polyaxon/polyaxon

[+] moandcompany|6 years ago|reply
Fig 4 looks like it's derived from Hidden Technical Debt in Machine Learning (2015).

https://papers.nips.cc/paper/5656-hidden-technical-debt-in-m...

As someone else says in this comment thread, this is very much an organizational problem, and cannot be viewed as just a technology problem.

The common behavior of individuals and teams is to pursue solutions that solve problems for them. The problem with ML, as we've seen with "Data Science" and other magic technologies, is that having an appreciation for the domain or context goes a long way. Being familiar with the entire process, or "pipeline," is valuable, and role/functional silos often lead to the problems people experience.

For some classes of machine learning problems and associated data, sourcing solutions from vendors can work, but as with any tools you can procure, you need the right people to use them appropriately. This also applies to "DevOps" which is used for comparison in the blog post.

--> DevOps example -- the philosophy seems to be about having software developers also share build/release and infrastructure responsibility. But some organizations have created "DevOps" teams that silo build/release and infrastructure work... they ended up just renaming what used to be called their Build/Release or SysAdmin teams. Siloing things as "someone else's" problem doesn't result in the major transformations that are needed.

Now imagine what happens if we substitute MLDevOps for DevOps above.

I'll continue to say "The Role of a Data Engineer on a Team is Complementary and Defined By The Tasks That Others Don’t (Want To) Do (Well)"

[+] amznthrowaway5|6 years ago|reply
> The Role of a Data Engineer on a Team is Complementary and Defined By The Tasks That Others Don’t (Want To) Do (Well)

Those types of tasks are also often not recognized or rewarded by management, despite being a hugely critical part of the system. I believe the incorrect hiring of scientists who are often strong in terms of core theory or number of papers published but have no clue about building real production ML systems is a huge organizational problem, often causing ML teams to fail to deliver any real value.

[+] gas9S9zw3P9c|6 years ago|reply
Wow, I've probably seen 10 of these kinds of companies over the past few months. Personally I believe (and hope) that the winners in this space are going to be modular open-source companies/products as opposed to the "all-in-one enterprise solutions".
[+] _mdb|6 years ago|reply
CEO of Tecton here, and happy to give more context. Tecton is specifically focused on solving a few key data problems to make it easier to deploy and manage ML in production. e.g.:

- How can I deliver these features to my model in production?

- How do I make sure the data I'm serving to my model is similar to what it was trained on?

- How can I construct my training data with point-in-time accuracy for every example?

- How can I reuse features that another DS on my team built?

We've found that there's a ton of complexity getting data right for real-time production use cases. These problems can be solved, but require a lot of care and are hard to get right. We're building production-ready feature infrastructure and managed workflows that "just work" for teams that can’t or don’t want to dedicate large engineering teams to these problems.
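The point-in-time accuracy question above amounts to an "as-of" join. A minimal sketch in plain Python (illustrative only, not Tecton's implementation): for each label timestamp, pick the latest feature value observed at or before it, so no future data leaks into training.

```python
from bisect import bisect_right

def point_in_time_join(feature_rows, label_times):
    """For each label timestamp, attach the latest feature value observed
    at or before that timestamp.

    feature_rows: iterable of (timestamp, value) pairs, in any order
    label_times:  iterable of label timestamps
    """
    feature_rows = sorted(feature_rows)
    feature_times = [ts for ts, _ in feature_rows]
    joined = []
    for label_ts in label_times:
        # number of feature rows with timestamp <= label_ts
        i = bisect_right(feature_times, label_ts)
        value = feature_rows[i - 1][1] if i else None
        joined.append((label_ts, value))
    return joined

# avg_spend observed on days 1, 3, 5; training labels on days 2, 4, 6
features = [(1, 10.0), (3, 12.0), (5, 15.0)]
print(point_in_time_join(features, [2, 4, 6]))
# -> [(2, 10.0), (4, 12.0), (6, 15.0)]
```

A production feature store additionally has to do this per entity key and at scale, which is where most of the complexity lives.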

At the core of Tecton is a managed feature store, feature pipeline automation, and a feature server. We’re building the platform to integrate with existing tools in the ML ecosystem.

We’re going to share more about the platform in the next few months. Happy to answer any questions. I’d also love to hear what challenges folks on this thread have encountered when putting ML into production.

[+] jdoliner|6 years ago|reply
Pachyderm is probably one of the companies you've seen in this space. Full disclosure: I'm the founder, but I feel that we've stayed pretty true to the idea of being a modular open-source tool. We have customers who just use our filesystem, and customers who just use our pipeline system, and of course many more who use both. We've also integrated best-in-class open-source projects; for example, Kubeflow's TFJob is now the standard way of doing TensorFlow training on Pachyderm, and we're working on integrating Seldon as the serving component. We find this architecture a lot more appealing than an all-in-one web interface that you load your data into.
[+] minimaxir|6 years ago|reply
Additionally, all of Google, Amazon, and Microsoft are pushing very heavily in the ML DevOps space. And if you are training/deploying ML models at such a frequency that you need to utilize DevOps, chances are you are already using their platforms for server compute.
[+] peterwwillis|6 years ago|reply
Open Source companies are like open source car manufacturers. When the company dies and stops making the car, will the customers start a new car manufacturing business just to support their cars? Or buy a new car?

As AWS shows, proprietary all-in-one [platform] is fine as long as it's a-la-carte.

[+] yanovskishai|6 years ago|reply
Could you mention what other solutions you've seen in this space?
[+] tixocloud|6 years ago|reply
Completely spot-on. Too many "all-in-one" platforms are just too rigid, and with AI infrastructure tooling still in its early stages, the companies that adopt modular products will be able to capitalize on new advances.
[+] alfalfasprout|6 years ago|reply
Yeah, we're releasing our platform as open source soon too... kinda feel bad for these guys but it'll be tough to compete with platforms that have a larger open source following and plenty of end-users.
[+] fizixer|6 years ago|reply
I wonder what the business model is for teams/startups offering open-source solutions that they developed in-house.
[+] remmargorp64|6 years ago|reply
I was the main data science engineer at one of my previous companies. We used tools like Airflow for running Python scripts to import data, clean/transform it, train models, and even test various models against datasets. We also used Azure for similar things.

It's easy to do "dev ops" for machine learning. Basically, just automate everything and implement gatekeeping mechanisms along with active monitoring.

It's true, though. I had to cobble together a lot of custom things at the time, but it wasn't that hard to do.
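As a sketch of the "gatekeeping mechanisms" mentioned above (the metric names and thresholds here are hypothetical, not the commenter's actual setup), a promotion gate can be as simple as a check that a candidate model hasn't regressed before it is allowed into production:

```python
def should_promote(candidate, baseline, max_auc_drop=0.01, max_p99_ms=50.0):
    """Gate a model deployment: reject the candidate if its accuracy
    regresses past a tolerance or its serving latency breaks the SLA."""
    if candidate["auc"] < baseline["auc"] - max_auc_drop:
        return False  # accuracy regression beyond tolerance
    if candidate["p99_latency_ms"] > max_p99_ms:
        return False  # too slow to serve
    return True

baseline = {"auc": 0.85, "p99_latency_ms": 35.0}
print(should_promote({"auc": 0.86, "p99_latency_ms": 40.0}, baseline))  # True
print(should_promote({"auc": 0.80, "p99_latency_ms": 40.0}, baseline))  # False
```

A check like this would typically run as the last task in an orchestrated pipeline (e.g. an Airflow DAG), with active monitoring continuing after the deploy.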

[+] nik_s|6 years ago|reply
I'm the CTO at a data science company, and this has been my experience too. I've been lucky enough to have quite a few engineers go from zero practical experience to being able to train and deploy complex ML solutions, and the most successful solutions have always involved a combination of just a couple of tools:

- Airflow and/or Celery for running data extraction and transformation jobs
- pandas and NumPy for data wrangling
- scikit-learn, XGBoost, LightGBM, PyTorch, or TensorFlow for training/inference
- Flask or Django to serve results

It's a handful of technologies, but they're (generally) mature, battle tested, and well documented.

[+] starpilot|6 years ago|reply
Good god it's hard to do this at a non-tech company. MLOps would be great, but we don't really have "Ops," just IT, since our main business is not software. And we don't have Dev either, so we don't have anyone to really emulate on the inside. Our data scientists are foremost analysts who can write some Python, they don't know OO or memory optimizations or anything. They've never used a bash prompt or know what one is. Management thought we could orchestrate this huge waterfall schedule for a project and now it's falling apart as we open each new box of surprises...
[+] proverbialbunny|6 years ago|reply
If you don't have a dev, how are you collecting any data to begin with?
[+] kostas_f|6 years ago|reply
I'll disagree with most comments that it's mainly an organizational problem. Creating tooling for things like:

- managing different data sources

- versioning data

- monitoring how new data affects the model

- testing that certain SLAs are met before new features are deployed

- ability to rollback

- data & model quality monitoring

is technically challenging.
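For the data-versioning item in particular, a minimal content-addressing sketch (illustrative only) shows the core idea: give each dataset snapshot a reproducible id derived from its contents.

```python
import hashlib
import json

def dataset_version(records):
    """Content-address a dataset: identical records always map to the
    same version id, which makes runs reproducible and rollbacks cheap."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"user": 1, "spend": 10.0}])
v2 = dataset_version([{"user": 1, "spend": 10.0}])
v3 = dataset_version([{"user": 1, "spend": 11.0}])
assert v1 == v2  # same data, same version
assert v1 != v3  # any change produces a new version
```

The hard part, per the parent's list, is doing this efficiently at very large scale and wiring it into monitoring, SLAs, and rollback, which is exactly why hacked-together versions fall short.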

Obviously there are engineers that will quickly hack something together and will falsely think that they have a good enough MLOps solution. I have been part of such teams.

Most companies are not Google, Facebook, or Uber. The large ones very often don't have the know-how to create a robust technical solution around this process, and even when they do, it can take them years; the smaller ones lack both the resources and the technical expertise.

I'm always looking for new ideas that can become successful businesses, and when I saw Uber's Michelangelo here on HN a few years ago, I thought that selling similar tooling to other companies had great potential. It seems the right team to create that company was the one that built Michelangelo itself :)

[+] smeeth|6 years ago|reply
I really find it difficult to put into words just how little I care to pay for a web ui so I can "manage" my data.

Data pipelines are a real problem though, and I'm very interested in what startups do with this space.

[+] seddonm1|6 years ago|reply
We have been thinking about these problems for a few years now and have built Arc https://arc.tripl.ai (fully open source), an abstraction layer on top of Apache Spark that helps end users rapidly build and deploy data pipelines without having to know about #dataops. Ultimately we decided that giving users a decent interface https://github.com/tripl-ai/arc-starter (based on Jupyter Notebooks) and encouraging a 'SQL first' approach means we can give users flexibility but also have a standardised way of deploying jobs with many of the devops attributes (like logging and reliability). You can run Arc as a standard docker run command, or on Kubernetes using Argo Workflows https://argoproj.github.io/ as the orchestrator, which plays nicely with Arc and makes it easy to build resilient pipelines (retries etc.)
[+] factorialboy|6 years ago|reply
> Data pipelines are a real problem though

Can you please elaborate more, thanks.

[+] ska|6 years ago|reply
"We need fewer data scientists and more data janitors" - anon
[+] dnautics|6 years ago|reply
Honest question (though I suppose the clickthroughs to the comments are likely to be a biased sample): is "getting to prod" really the gatekeeper/bottleneck for most ML? I would have thought "a model that works" is much harder, especially given how hyped the field is and how many people are trying to tackle problems that are ill-suited to the current batch of ML techniques.

Unless the issue here is data collection in prod to start training your model.

[+] proverbialbunny|6 years ago|reply
>is "getting to prod" really the gatekeeper/bottleneck for most ML?

The most common bottleneck is collecting the right data. It can take years, or even a task force just to get the right data before the data scientist can begin.

>I would have thought "a model that works" is much harder

It depends how experienced the data scientist is. Early on in a project a data scientist can do a feasibility assessment. They should identify what is possible, and how possible. Some data science projects are heavy on the research side, where it can take 2 weeks to 3 months to figure out if something is possible. Sometimes the feasibility assessment ends up being incorrect and a goal is shown to be impossible.

Once research is done it usually takes 4 weeks to 6 months for a data scientist to build a model. The upper bound is rare and happens because of recursive refinement to increase accuracy, trying to get every last drop out of what is possible.

In contrast, it can take months to years for the company to begin to collect the right data for a data scientist to be able to begin to do what benefits the company. Sometimes crowdsourcing projects need to be created just to collect the required data. It then takes an average of 3 to 6 months for productionization if there is clever feature engineering in the model. Note: when I say productionization, I mean all the way to the end customer, so setting up and maintaining pipelines, frontend devs updating websites to add the service, and whatever else is necessary. There is more work involved on the production side, but it can be split up among multiple engineers.

[+] Jaruzel|6 years ago|reply
Managing data is not an IT job. Data is just unformatted information, and should be managed and governed by those who are trained in Information Management: Modern day Librarians.

IT own the platform, and the software. They should never own the data as well.

[+] alexilliamson|6 years ago|reply
I agree with this, and I think data librarian is a role that any "data-driven" company needs. IMO it makes a lot of sense for data scientists to fill that role, but I think that's an issue for many. Data scientists may think being a librarian and organizing the knowledge base is beneath them, or maybe management thinks it's beneath them. Execs tend to not care about the state of knowledge infrastructure as long as their reports get to them when they expect.
[+] akarve|6 years ago|reply
This is close to home. Our approach to DevOps for ML Data is to use S3 as the git core and build immutable datasets and models on top of S3 object versioning. I wrote the piece below on "Versioning data and models for faster iteration in ML" earlier this year. The key idea is for every model iteration to be a pure function F(code, environment, data). Ideas welcome: https://medium.com/pytorch/how-to-iterate-faster-in-machine-...
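The pure-function idea above can be sketched as deriving a model id from content hashes of its three inputs (the names here are illustrative; the linked post describes the actual approach on S3):

```python
import hashlib

def model_id(code: bytes, environment: bytes, data: bytes) -> str:
    """Treat a model build as F(code, environment, data): hashing the
    three inputs yields an id that is stable when nothing changed and
    new whenever any input changed."""
    h = hashlib.sha256()
    for part in (code, environment, data):
        h.update(hashlib.sha256(part).digest())
    return h.hexdigest()[:12]

a = model_id(b"train.py@abc123", b"python3.8+torch1.4", b"dataset-v1")
b = model_id(b"train.py@abc123", b"python3.8+torch1.4", b"dataset-v1")
c = model_id(b"train.py@abc123", b"python3.8+torch1.4", b"dataset-v2")
assert a == b  # reproducible: same inputs, same id
assert a != c  # new data, new model version
```

With immutable, versioned inputs, any past model iteration can be reproduced by replaying the function on the recorded versions.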
[+] beckingz|6 years ago|reply
Data is hard to automate and standardized pipelines and processes are really helpful. This is interesting.
[+] tkyjonathan|6 years ago|reply
Isn't this DataOps?
[+] schnitsel|5 years ago|reply
I was thinking the same, it could be that the OP isn't familiar with the term yet.
[+] flaxton|6 years ago|reply
Rule number one: define your terms as you introduce them. On and on about ML. But what is it?

I had to search to see it was Machine Learning.

How hard is it to define it the first time you use it?

I can bet lots of people were scratching their heads but didn’t bother to look it up or continue reading...

[+] oplav|6 years ago|reply
Genuinely curious, did you think "ML" stood for anything else? My day to day work is not machine learning but if I ever see ML, "machine learning" is the first thing I think of.