Nice article; however, I disagree on a couple of things:
- "Platform MLE" is just regular DevOps and software engineering. It's not because we're dealing with models and their accuracies that it is fundamentally different from what used to be DevOps before. We don't need to make a "special" title out of everything in Data Science.
- I still like to introduce "MLOps" into the conversation, thus making it special and breaking the rule above. Oops.
- Overfitting slightly on recent data, without accounting for gaps in modeling, is likely to lead to dire situations. See also: March 2020, when all forecasts went bananas. And no, retraining at that time did not improve the situation. That's what the MLE was talking about here: "I know it's not really addressing the data drift problem"; they were right.
- Everything about data and model drift is just the tip of the iceberg: What happens when your model ends up in production? It will start affecting the behaviour of the very thing you're trying to predict. A prime example is the retail markdown case: did you sell more of that article because the product was rated better (as qualified by the marketer to compensate for a lack of control over markdowns) and the in-store stock was higher, or did your markdown on that item have the effect you were looking for and the rating was actually fine? Did sales go down this markdown season because your model had a perfect but unattractive markdown strategy last season? These feedback effects are already very difficult to model properly, let alone trying to measure their contribution to the drift. Good luck with that.
> What happens when your model ends up in production? It will start affecting the behaviour of the very thing you're trying to predict
Indeed, and this is again part of the reason I think MLOps is a problem. One ideally needs a causal theory and rational expectations regarding how and why a model will work: you should be able to explain it. The reason is that there are many more ways for things to look like they work than ways for them to actually work. Forcing -- what should be -- the scientific effort of modelling into a pre-cut tune, re-train, monitor workflow is reductionist and will invariably make it near impossible to exercise proper due diligence, IMO.
Scientific discovery is generally not a sausage machine -- it's messy and dependent on what happens in brains more than anything else. I think it's a fool's errand to expect differently from -- what should be -- the same abstract toolbox applied to commerce.
Definitely a lot of interesting insights, but I fail to see what one year of a PhD has to do with it. If anything, this undermines the article! We used to tell incoming grad students never to ask first-year PhDs anything, or believe anything they tell them. Either they're trying to paint too good or too bad a picture, depending on how they want to convince themselves of their big decision a year in.
My experience of grad school was that first- and second-year students tended to have an inflated sense of their own abilities - they had usually been the smartest kid in their classes at school and university, and it took a year or two for the arrogance to wear off when they were hit with the realities of actual research work. Not always, but more often than not.
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain.
> It puzzles me when people say that small companies can’t retrain every day because they don’t have FAANG-style budgets. It costs a few dollars, at best, to retrain many xgboost or scikit-learn models.
If retraining every day actually gives you a significant material business benefit (excepting cases where you specifically want to significantly over-weight recent data - phone keyboard autocomplete predictions, for example), you likely don't have a model that is actually picking signal from noise or generalizing on the data. This is memorization, not learning. This is essentially how you get expensive ML disasters.
The back half of the essay was unrecognizable to me, but that's probably because I'm working in different contexts and on different product types. It would be helpful to know what systems the author has worked on and why these findings applied to that context. That way, if I ever find myself in one of those contexts, I can leverage the author's experience.
Generalizing individual experiences to general prognostications sells well as a genre, especially in software. This writing pattern was epitomized by OOP thought leaders in the 90s/00s, but was also present in the process evangelists who followed and the more iconoclastic folks like Dijkstra who preceded. However, actually following such general advice without taking into account the author's context can easily end in tears.
Ideally, these sorts of pieces should be written up as case studies rather than as general prognostications. With the context that a case study provides, these pieces can teach us a lot about how to engineer systems. Without that context, the reader is left to blindly follow advice that may not be relevant to their context. Or, best case, reverse engineer the context.
What is the difference between memorization and learning? Could you please elaborate on this? It always seemed to me that a lot of learning is in fact memorization; otherwise you wouldn't need a large dataset of car photos from every angle (or at least some angles, so that ML can work out the in-between poses; no amount of 'learning' from photos of the front can work out what a car looks like from the side) to be able to recognise them. Also, in what context would you get expensive ML disasters? If you keep retraining on cars as new car models come out, then you get 100% recognition, memorization notwithstanding, which in the end is what you would want.
No, this is because "online/incremental" learning doesn't work, for a wide variety of reasons, most notably the phenomenon of "Catastrophic Forgetting".
This article overstates the distinction between Task MLE and Platform MLE.
From my experience managing data science teams (small teams - max 20 people), I always preferred for my Task MLE team mates to also do the Platform MLE work. I would not hire someone just to be a Platform MLE because they would be too distant from the day-to-day Task MLE needs.
I like the way that Google SREs think about this - there's toil (Task MLE parts of the job) and then there's automation (Platform MLE parts of the job). Every programmer on the team should have toil and should be given enough time and freedom to address their most painful toil through automation.
Distinguishing between Task MLEs and Platform MLEs so strictly is dangerous for anyone that applies this dichotomy in practice.
I guess the article author never explicitly stated that they have to be different people, but I got the sense from reading the article that this was an unstated assumption on their part.
> It puzzles me when people say that small companies can’t retrain every day because they don’t have FAANG-style budgets. It costs a few dollars, at best, to retrain many xgboost or scikit-learn models. Most models are not large language models
So true. This is a pretty easy flag that someone hasn't actually been down and dirty with trying to train any actual models and doesn't know what they're talking about.
Quite the opposite. The running time of sklearn's or xgb's training procedure is not the main obstacle to retraining daily (not even close).
In some industries, retraining has to go through rigorous manual QA. And you need a solid ML pipeline to begin with.
Great insight here that a lot of performance drift is down to data and engineering, not "natural" non-stationarity. The corollary is that one of the hidden strengths of ML systems is the ability to adapt to imperfect inputs, and even changing inputs with retraining.
Re: monitoring -- When I was doing automated trading, we of course had automated alerts to stop systems that were outside expected parameters. However, as OP describes, there was also a long tail of degradations where there was no great precision/recall tradeoff on flagging "real" issues. Instead, we put effort into "calibration reports" that visually surfaced as much information about training and performance as our eyes could handle. These would include things like over-time plots and PCA plots for recent feature histories, where it was more an opportunity to spot patterns than an explicit metric. Reviewing these for 15 minutes each day with our own eyes was much more effective at detecting a long tail of unanticipated degradations than we would have been at anticipating and coding for each one explicitly -- and it helped build intuition that fueled new research ideas.
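For concreteness, a rough sketch of what one of those reports could look like in Python - the file path, column layout, and plot choices below are made up for illustration, not taken from any real trading system:

    # Minimal daily "calibration report": per-feature drift over time plus a PCA
    # view of recent rows. Paths and column names are placeholders.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Assumed layout: one row per scored example, a 'date' column, numeric features.
    df = pd.read_parquet("recent_feature_history.parquet")
    feature_cols = [c for c in df.columns if c != "date"]

    # 1) Per-feature daily means: slow drifts and sudden jumps are easy to spot by eye.
    daily_means = df.groupby("date")[feature_cols].mean()
    daily_means.plot(subplots=True, figsize=(10, 2 * len(feature_cols)))
    plt.tight_layout()
    plt.savefig("daily_feature_means.png")

    # 2) PCA of the last two weeks of rows, coloured by day: clusters that move or
    # split often point to pipeline changes rather than "natural" drift.
    recent = df[df["date"] >= df["date"].max() - pd.Timedelta(days=14)]
    X = StandardScaler().fit_transform(recent[feature_cols].fillna(0.0))
    coords = PCA(n_components=2).fit_transform(X)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=pd.factorize(recent["date"])[0], s=4)
    plt.title("PCA of recent feature rows (colour = day)")
    plt.savefig("recent_feature_pca.png")

The point is not any single number coming out of this; it is giving a human eyes on the same slices of data every day.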
I think 'becoming a historian of the craft' is very much what a first-year PhD student is supposed to do, and a large part of what doing a literature review is all about. It helps crystallise what exactly your research questions should be, and what experiments you should do to answer them. I think this article will be useful for the author to see how their thinking has evolved when they read it back in three or four years, because a lot of change happens to a person's thinking over the course of a PhD.
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.
We all know that overfitting is bad (e.g. in time-series forecasting, the past isn't always representative of the future). Depending on your domain, more recent data may be more valuable than older data, sure. The solution is not to overfit to recent data!
In my experience, it is to design features that take recency into account. For example, for a particular quantity we wanted to forecast, we found that using ~7 days' worth of data was better than using multiple months, because the data was non-stationary (the mean of the quantity was changing over time). What we did was combine features with an exponential decay using the appropriate decay constant, with great results.
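To illustrate the idea (this is a toy sketch, not our production code - the half-life, column layout, and synthetic data are made up):

    # A recency-weighted feature: an exponentially decayed mean of a daily quantity,
    # so recent observations dominate without throwing away older history.
    import numpy as np
    import pandas as pd

    def decayed_mean(series: pd.Series, half_life_days: float = 7.0) -> float:
        """Weight each observation by exp(-age / tau), with tau set by the half-life."""
        ages = (series.index.max() - series.index).days.astype(float)
        tau = half_life_days / np.log(2)
        weights = np.exp(-ages / tau)
        return float(np.average(series.values, weights=weights))

    # Toy example: a quantity whose mean drifts upward over two months.
    dates = pd.date_range("2022-01-01", periods=60, freq="D")
    quantity = pd.Series(np.linspace(100, 140, 60) + np.random.default_rng(0).normal(0, 5, 60), index=dates)
    print("plain mean:  ", round(quantity.mean(), 1))         # dragged down by old data
    print("decayed mean:", round(decayed_mean(quantity), 1))  # tracks the recent level

The decay constant is just another thing to validate against held-out future data, of course.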
Slight tangent: does anyone know of a good source for more useful blog posts on ML in the wild? Many that I come across are a funnel into some product, are very short, too theoretical, for beginners, or all of the above. This post strikes a nice balance in simply sharing some experiences and opinions, like you would see in blogposts on how to do software engineering well.
"Suppose every organization is able to clearly define their data and model quality SLOs."
I laughed out loud. The rest of the essay is so on point that I'm awarding the author full marks for a classic example of the dark, laconic understatement that builds so many bridges amongst technology professionals.
I think that MLOps somewhat subdues the modelling and research effort in commercial settings. At least anecdotally, I only seem to see ML pipelines paired with the “feature engineering” + fit(X,y) approach to data “science”. What I don't see is first-principles research, leading to insight, leading to models that exploit the insight, which often requires heavily augmenting how and what data is collected: that part almost never fits neatly anywhere. In its place, DS often starts with the data and then looks for models to fit it. MLOps reinforces that pattern, and we end up with the old joke:
A man is searching for his keys under a street lamp in the dark. A policeman asks, “Is this where you dropped your keys?” The man says, “No, but this is where the light is.”
MLOps is about enabling reproducibility of previous outputs; it has nothing to do with reinforcing any pattern. It's literally "DevOps for Data Science". It's also a sane set of best practices to follow even if you're working on an optimisation case.
Maybe you're conflating this with something else?
In practice, the Platform MLE is nonexistent in most cases, except at places like OpenAI and maybe Google. The key thing not addressed here is that in software, the right levels of abstraction can serve a plurality of downstream tasks. That level of abstraction has not really been found in ML (yet). Every time you start building a larger system, researchers then have to onboard to that system, and I find they really don't like how much time it wastes.
> I should have taken the hyperparameters that yielded the best model for the latest evaluation set.
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.
I do a lot of data science work for various business problems. No deep learning image whatevers. Just understanding how processes work and influencing decisions. I can't relate to a lot of what this person is saying.
> Sometimes, I was so scientifically sound that the business lost money. I automated a hyperparameter tuning procedure that split training and validation sets into many folds based on time and picked hyperparameters that averaged best performance across all the sets. I only realized how silly this was in hindsight. I should have taken the hyperparameters that yielded the best model for the latest evaluation set.
What's silly here is thinking that the minor adjustment of hyperparameters from set to set is likely to make a difference. This might hold for some niche deep learning problems, but it sounds later on like she isn't doing this. I rarely see an Optuna hyperparameter optimization affect the AUC of a model by more than 0.02 versus arbitrary choices. Most business problems tend to be pretty simple. I just can't imagine this parameter choice makes a difference.
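For what it's worth, the two selection rules are easy to compare side by side. A hedged sketch (synthetic data, a toy grid, GradientBoosting as a stand-in for whatever model is actually in play):

    # Pick hyperparameters by (a) average score across time-ordered folds vs.
    # (b) score on only the most recent fold. Data and grid are placeholders.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import TimeSeriesSplit

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=2000) > 0).astype(int)

    grid = [{"max_depth": d, "learning_rate": lr} for d in (2, 4) for lr in (0.05, 0.1)]
    splits = list(TimeSeriesSplit(n_splits=5).split(X))

    results = []
    for params in grid:
        fold_scores = []
        for train_idx, val_idx in splits:
            model = GradientBoostingClassifier(n_estimators=100, **params)
            model.fit(X[train_idx], y[train_idx])
            fold_scores.append(roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))
        results.append((params, np.mean(fold_scores), fold_scores[-1]))

    best_by_average = max(results, key=lambda r: r[1])[0]  # the author's original procedure
    best_by_latest = max(results, key=lambda r: r[2])[0]   # what the author now argues for
    print("chosen by average over folds:", best_by_average)
    print("chosen by latest fold only:  ", best_by_latest)

On most tabular problems I'd expect the two choices to land on models whose metrics differ by far less than the day-to-day noise, which is the point above.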
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.
Nonsense. Dangerous nonsense.
> It puzzles me when people say that small companies can’t retrain every day because they don’t have FAANG-style budgets. It costs a few dollars, at best, to retrain many xgboost or scikit-learn models. Most models are not large language models. I learned this circuitously, while doing research in ML monitoring.
This is accurate. I find it weird that people feel the need to retrain models so frequently though. Context matters I guess. But then they talk about data drift...
> Anecdotes like this really get me in a tizzy. I think it’s because what I thought were important and interesting problems are now, sadly, only interesting. Researchers think distribution shift is very important, but model performance problems that stem from natural distribution shift suddenly vanish with retraining.
It sounds like this person is just retraining their models at a high frequency until a hyperparameter search randomly produces something that looks good on an evaluation set. I don't see any evidence of them ACTUALLY EVALUATING THEIR PERFORMANCE. The missing piece in all of this is taking the predictions made with this process and seeing, in retrospect, how they compared to the unseen future outcomes. If you think you're fine because you make a new model every day that works for yesterday's data, which they openly imply is overfit to that period, you might just be making a new mistake every day. A good AUC or R2 on your evaluation set doesn't guarantee that it'll be good on new data.
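Concretely, the check I'm describing is a rolling backtest: retrain on everything up to day t, score only day t+1, and look at how that out-of-sample metric behaves over weeks. A sketch, with the DataFrame layout and model as placeholders:

    # Rolling backtest: each day's model is evaluated only on the next, unseen day.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    def rolling_backtest(df: pd.DataFrame, feature_cols, label_col="label", date_col="date"):
        days = sorted(df[date_col].unique())
        scores = []
        for i in range(30, len(days) - 1):             # warm up on the first 30 days
            train = df[df[date_col] <= days[i]]
            test = df[df[date_col] == days[i + 1]]     # genuinely unseen future day
            if test[label_col].nunique() < 2:          # AUC needs both classes present
                continue
            model = RandomForestClassifier(n_estimators=200, random_state=0)
            model.fit(train[feature_cols], train[label_col])
            preds = model.predict_proba(test[feature_cols])[:, 1]
            scores.append((days[i + 1], roc_auc_score(test[label_col], preds)))
        return pd.DataFrame(scores, columns=[date_col, "out_of_sample_auc"])

A flat, healthy out_of_sample_auc over time would actually support the daily-retraining claim; a sawtooth or a slow slide would mean the process is just making a fresh mistake every day.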
> I had a hard time finding the gold nugget in data drift, the practical problem. Now it seems obvious: data and engineering issues—instances of sudden drifts—trigger model performance drops. Maybe yesterday’s job to generate content-related features failed, so we were stuck with old content-related features. Maybe a source of data was corrupted, so a bunch of features became null-valued.
What? This seems pretty crazy to me. You should have alarm bells built in that freak out if important features are turning up null or corrupted. This person seems to have no idea if their model is working! They just know it worked for yesterday. They allude to doing this later on, but even then... broken data pipelines are not what people typically mean by data drift.
I see no discussion of real performance evaluation, understanding how the models work, or how to make real use of the model outputs.
Original post author here. Thanks for taking the time to write this---it made me realize that I should have discussed the nature of tasks I was working with in my post, especially because different tasks/datasets have different behaviors under the same model choices.
The overfitting phenomenon I've seen is at companies that are all doing some kind of ranking (the ones I work with or have worked with). None use AUC to evaluate---in fact, I don't see many production models evaluated with AUC, because usually some threshold is fixed before deployment, or the client has their own threshold they will act on.
Re: hyperparameters varying the metric---a point on the ROC curve can change a lot while the AUC stays relatively the same, especially in class-imbalance settings.
Re: data validation---at large companies with tens to hundreds of models and hundreds of thousands of features, data often gets corrupted. Alarm bells do go off, but there is alarm fatigue. Furthermore, not all Task MLEs are Platform MLEs and vice versa...a lot of people have to build tools to monitor ML pipelines they don't own.
Anyways I really appreciate your comment & it made me realize how ML people have vastly different experiences. I will clarify assumptions on the task & evaluation next time I write :)
Thanks for writing this. I had it on my mind but am usually too lazy to write long comments. I am a data scientist and I do some ML at work that is actually core to the business (learning to rank for an aggregator). Whenever I read an article about data science on HN, I can never relate to what they say.
'Data drift' sounds like some BS one tells their manager when they don't understand what's really happening.
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.
As long as they are not A/B tested, statements like these are worthless. I recently tried updating a model daily vs. weekly and it made absolutely no difference, so why would I care?
Anyway, thanks for your comment. I know I'm not alone.
Interesting post, with many insightful comments, but regarding data validation, I think this fella needs a bit of strongly typed language training and some functional programming skills. A lot of the problems in the current ML ecosystem come from the fact that many people use Python, simple as that.
Data validation in this context means checking the data you're training your model with (or validating the data you're about to predict on). Things like checking whether the format of the data matches what you encountered during development (same columns, same values in categorical columns, ...), whether the proportion of missing values is roughly the same as in the original training set, and whether the means and variances of numerical values are also roughly the same as in the original training set, ...
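Roughly, in Python (the thresholds and column handling are illustrative, not canonical):

    # Compare a fresh batch of data against the training set it should resemble.
    import pandas as pd

    def validate_batch(train: pd.DataFrame, batch: pd.DataFrame, tol: float = 0.25) -> list:
        problems = []
        # Same columns.
        if set(batch.columns) != set(train.columns):
            problems.append(f"column mismatch: {set(batch.columns) ^ set(train.columns)}")
        for col in train.columns.intersection(batch.columns):
            # Unseen categorical values.
            if train[col].dtype == object:
                unseen = set(batch[col].dropna()) - set(train[col].dropna())
                if unseen:
                    problems.append(f"{col}: unseen categories {unseen}")
                continue
            # Missing-value rate creeping upward.
            if batch[col].isna().mean() > train[col].isna().mean() + 0.05:
                problems.append(f"{col}: missing rate {batch[col].isna().mean():.1%}")
            # Mean shifting by more than a fraction of the training spread
            # (variance can be checked the same way).
            mu, sigma = train[col].mean(), train[col].std()
            if sigma > 0 and abs(batch[col].mean() - mu) > tol * sigma:
                problems.append(f"{col}: mean moved from {mu:.3g} to {batch[col].mean():.3g}")
        return problems

Libraries like Great Expectations or pandera package up exactly these kinds of checks so you don't have to hand-roll them.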
I believe what he means by "data validation" is validating that your data matches up with reality. For example, if you do real estate price predictions but your dataset doesn't include any low-price sales because your broker didn't list them, then your data doesn't cover the lower-price part of reality, so all of your predictions are going to drift upwards due to the gap in your data.
I'm not convinced it actually pays, though.
Ideally, we would simply never stop training.
> I have done enough research on production ML now to know that it pays to simply overfit to the most recent data and constantly retrain. Successful companies do this.
I don't understand why that is.