> Moreover, you may quickly realize much of this work is repetitive and, while time-consuming, “easy”. In fact, most analyses involve a great deal of time spent understanding the data, cleaning it and organizing it. You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.
This. Universities and online challenges provide clean labeled data, and score on model performance. The real world will provide you... “real data” and score you (hopefully) by impact. Real data work requires much more than modeling. Understanding the data, the business, and the value you create is just as important.
As per #6, better data and model infrastructure is crucial in keeping the time spent on these activities manageable, but I do think they’re important parts of the job.
I’ve seen data science teams at other companies working for years on topics that never see production because they only saw modeling as their responsibility. Even the best data and infrastructure in the world won’t help if data scientists do not feel co-responsible for the realization of measurable value for their business.
Training integrative data professionals could be a great opportunity for bootcamps. Universities will (understandably) focus on the academically interesting topic of models, while companies will increasingly realize they need people with skills across the data value chain. I know I would be interested in such profiles. :)
I took a data visualisation class in uni that handled this really cleverly. The second assignment sounded very easy. The teacher provided links to the sources where we could find data.
Most people figured that with such a simple assignment (not significantly harder than the first one, which was also easy-ish) they could put off doing it until the last moment.
Most people failed.
This real world data needed hours upon hours of cleaning before it was in any way usable. Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had intended to from the start.
Never again will I underestimate the dirtiness of real world data. One of the best teachers I had.
>You may spend a minimal amount of time doing the “fun” parts that data scientists think of: complex statistics, machine learning and experimentation with tangible results.
I don't get why people consider building a model to be the "fun" part. That's mostly feeding data in, watching a loading screen, and then observing the output.
That's not fun, that's boring. The fun part is looking at the data and gleaning all these potential patterns from it, seeing what potential is there and what could be. Likewise, learning the business side and seeing what's possible that no one has considered is great fun too.
My favorite part is feature engineering. Pre-processing and cleaning is fun too, but morphing the data into formats that extract a diamond from the coal is a lot of fun, and it's what data science is all about. Clicking go on some ML algo is just the icing on the cake, seeing it reveal bits that maybe even I overlooked in the data.
If you like ML, why not be an MLE? That's what MLEs do, and it's a more desirable job. DS is all about the research: discovering and learning new information, and making the impossible possible.
I have been in the data analytics space for 15+ years. The one mantra I always try to focus on is: what's the business impact of what our team is creating?
This is a simple yet very powerful rule that helps us quickly discard ideas that:
1. Do not have a robust testing mechanism. No model is useful unless it performs in the real world. Measuring this is a severely non-trivial problem with multiple operational considerations.
For example: are you able to run and manage true control/test groups? How do you build a “reverse” data pipeline to verify your models? And if you need to update model weights constantly, where and how will you update the model parameters?
2. Conversely, some of the most impactful products I worked on were probably delivered as simple Excel sheets or as just under 20 lines in a Jupyter notebook. Not every business problem demands a deep learning network. For example, we worked on a data-driven capacity forecasting exercise for a call centre, and I can tell you that the sophistication of the model was the last thing on my mind; careful interpretation and data collection took all my attention.
3. Data Science departments should sit closer to the business than the current trend suggests, at least the business-facing data science teams (as opposed to technical data teams focusing on product analytics, performance improvements, etc.). Courses and academic programs, I think, have developed a bias towards tools and techniques without the underlying analytical and interpretative skills needed to work with data. For example, a new data scientist on my team delivered excellent code but couldn't detect logical misses in the data (e.g. losing some data during processing, or using columns with almost all values missing).
On the other end of this spectrum, we are still on the lagging end of the hype bubble, so there are many top leaders expecting to plug in “data science” and realise billions of dollars in savings, new sales, etc.
There was a remark in the old-school linear algebra book we had in university (Edwards & Penney) that stuck with me, to the effect (I probably recall the details wrong) that one of the authors was once involved in analyzing data from water samples collected from a bunch of rivers by 15 engineers, and it turned out no 6 of these engineers' measurements were internally consistent. The moral of the story was that real world data is messy, and you need to learn least squares and related methods to make sense of it.
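To make the least-squares moral concrete, here is a toy sketch (numpy; the data is invented, not from the book): fit one consensus line through several engineers' mutually inconsistent measurements rather than trusting any single series.

```python
# Toy sketch: three "engineers" measure the same linear trend with
# inconsistent noise; least squares recovers a consensus fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.tile(np.arange(10.0), 3)                    # each engineer samples the same points
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.5, x.size)   # shared trend, messy measurements

A = np.column_stack([x, np.ones_like(x)])          # design matrix for y = a*x + b
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"consensus fit: y = {a:.2f}x + {b:.2f}")    # close to the true y = 2x + 1
```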
Now with "data science" you've taken a step further, and instead of applying the math to lab reports on meticulously filled out forms, you're going to aggregate all the messy sources you can get your hands on. Of course your headaches will multiply.
>This. Universities and online challenges provide clean labeled data, and score on model performance.
First homework assignment in the stats class I teach is to clean data that the class generated from directions they all perceived as clear. It's just about the most hated assignment I have ever given. Amazing how many ways there are to encode the gender of an experimental participant:
Male, M, m, male, Man, ...
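For anyone who hasn't fought this battle, a minimal sketch of the normalization step (pandas; the column name and mapping are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "M", "m", "male", "Man", " F ", "female"]})

canonical = {"male": "M", "m": "M", "man": "M", "female": "F", "f": "F", "woman": "F"}
df["gender_clean"] = (
    df["gender"]
    .str.strip()     # survey exports are full of stray whitespace
    .str.lower()     # fold case variants together
    .map(canonical)  # anything unmapped becomes NaN, flagged for manual review
)
print(df)
```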
This rings true to me. I've seen a lot of models get built that are never used. Although in my experience it wasn't that data scientists didn't care about business value, it's just that data science often requires breaking down silos and asking other teams to change their behavior.
This article mentions that leadership often doesn't support data science, but I think it actually doesn't go far enough. Leadership doesn't just have to support the data scientists, it has to actually tell other teams to prioritize data science projects over what they are currently doing. Since these data science projects are riskier than standard projects, it makes sense that leadership doesn't often do this (and focusing on the standard projects could be the right call). However, it also means that it's very hard for data scientists to create business value.
As a research-oriented data scientist at one of the larger tech companies, I can confirm that even here, a lot of people are unsure about what exactly data scientists are supposed to do. My most frequent request is "tell us why metric X dropped", to which the answer is often a subtle combination of many different factors (often random fluctuation) that doesn't lead to a pleasing actionable result in the sense of "here's why it dropped; go do this to fix it".
The really interesting research type work (Bayesian modeling, convolutional neural networks, etc.) takes a long time to implement and may produce no useful results, which is a really bad outcome at a company that measures performance in six month units of work and highly values scheduled deliverables and concrete impact. Many of the data scientists I work with tend to stick to methods that are actually quite simple (e.g., logistic regression, ARIMA) because these at least deliver something quickly, despite the fact that many of my coworkers come from research-heavy backgrounds.
In my org, there's nothing stopping anyone from pursuing advanced machine learning; for the most part we set our own agenda (in fact, determining priorities is part of the job role). And some people do in fact go after state-of-the-art ML, with some really cool results to show for it. But in terms of career progression and job safety, the risk is just way too high, at least for me personally. I save the highly mathematical stuff for a hobby.
Edit: while this may sound a bit negative, I will add that my description of data science isn't a complaint per se; I am mainly trying to inform those who are seeking a career in data science of what to expect compared to what is often promised. The work that is most valuable to a business is not exciting all of the time, but I don't think there is another job in the tech industry that I would find more enjoyable than my current one at the moment.
>But in terms of career progression and job safety, the risk is just way too high, at least for me personally. I save the highly mathematical stuff for a hobby.
I think the sad truth is that this is the reality of work no matter if you are a Data Scientist or not. What you thought you would be doing to show your worth and climb the ladder gets blurred in with KPIs you didn't set, politics you didn't create, goals and deadlines you had no input into, etc. One of the unique challenges you can face as a Data Scientist is that you may interface with people in many different groups, all of which have different goals which may be in conflict with each other. Compare this to other roles where you ultimately only follow the goals of the organization you report into.
This is very accurate. I've found that the simplest model with good enough results is often the best in the business world. On the one hand, that means I spend less time pushing the boundaries of what we're capable of doing as an organization. On the other hand, most business questions don't need massively complex answers so a quick regression may suffice.
This probably describes just about every job in a for-profit business.
This article is pretty spot on. As someone who has worked in data science/analytics for over 6 years, I have found that the field is filled with hype, managers who are not sure what data science actually is, and an absurdly wide range of skills that jobs expect you to do well.
Applying for and interviewing for data science jobs is a total nightmare. You are competing against 100s or even 1000s of applicants for every job posting because someone said it was one of the sexiest careers of the 21st century. Further exacerbating this, everyone believes that data is the new oil, and that large profit multipliers are just waiting to be discovered in the virgin data companies are sitting on. All that is missing is someone to run some neural network or deep learning algo on it to discover the insights that nobody else can see.
The reality is that there is an army of people who know how to run these algos. MOOCs, blogs, YouTube, etc. have been teaching everyone how to use these Python/R packages for years. The lucky few who get that coveted data science job can't wait to apply these libraries to the virgin data, only to find that they have to do all kinds of data manipulation just to make the algos work, which takes days and weeks of mundane effort. Finally they find out the data is so lacking that their deep learning model provides very little actual business value. It is overly complicated, computationally expensive, and in the back of their minds they know they could get the same results using some simple logic.
Managers who don't understand data science fundamentals learn from the news and have their data scientists implement those buzzwords so they can look good in front of their bosses.
I think there is a place for data scientists who understand the fundamentals of the models out there and know when not to use them. Data science is also increasingly a subset of software engineering, and a good data scientist in a tech company should be able to code well. I also don't think there is some huge unmet demand for data scientists. Just a huge amount of hype, and managers wanting to look good by saying they managed a data science team.
Any work is dull and depressing when done under the supervision of idiots. Some companies, though probably fewer than claim it, are genuinely data-driven rather than HiPPO-driven. That might be the key thing to look for if you want to do interesting work in data science.
Data science is correctly valued when you realize how relatively unimportant it is. It is a small cog in a larger machinery (or at least it ought to be).
You see, decision-making involves (1) getting data, (2) summarizing and predicting, and (3) taking action. Continuous decision-making -- the kind that leads to impact -- involves doing this repeatedly in a principled fashion, which means creating a system around the decision process.
For systems thinkers, this is analogous to a feedback control loop which includes sensor measurements + filters, controllers and actuators.
(1) involves programmers/data engineers who have to create/manage/monitor data pipelines (that often break). This is the sensor + filters part, which is ~40% of the system.
(2) involves data scientists creating a model that guides the decision-making process. This is the model of the controller (not even the controller itself!), which is ~20% of the system. Having the right model is great, but as most control engineers will tell you, even having the wrong model is not as terrible as most people think because the feedback loop is self-correcting. A good-enough model is all you need.
(3) involves business/front-line people who actually implement decisions in real life. This is the controller + actuator part, which makes the decisions and carries them out, and it is where impact is delivered. ~40% of the system.
Most data scientists think their value is in creating the most accurate model possible in Jupyter. This is nice, but in real-life not really that critical because the feedback-loop inherently moderates the error when deployed in a complex, stochastic environment. The right level of optimization would be to optimize the entire decision-making control feedback loop instead of just the small part that is "data science".
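To make that concrete, a schematic sketch of the loop (every function here is a placeholder, not a real API):

```python
# Schematic decision loop: sensors -> model -> controller -> actuator -> feedback.
def run_decision_loop(fetch_data, predict, decide, act, horizon=100):
    state = {}
    for _ in range(horizon):
        raw = fetch_data()                # (1) sensors + filters: pipelines, cleaning
        forecast = predict(raw, state)    # (2) the model: guides the decision, nothing more
        action = decide(forecast, state)  # (3) controller: business rules + model output
        outcome = act(action)             # (3) actuator: front-line implementation
        state["last_error"] = outcome - forecast  # feedback is what lets a
                                                  # good-enough model suffice
    return state
```

Improving `predict` in isolation optimizes one line of this loop; the leverage is in the loop itself.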
p.s. data scientists who have particularly low impact are those who focus on producing once-off reports (like consultant reports). Reports are rarely read, and often forgotten. Real impact comes from continuous decision-making and implementing actions with feedback.
Source: practicing data scientist
Had to make an account to upvote this. Absolutely dead-on. I think you can generalize this comment to almost any specialist skill. "No Silver Bullet" should be a business doctrine as well as a technical one. You need to do a lot of things well to succeed in business. Specialists just provide you a capability. You have to implement and use those capabilities as part of a larger system if you want to create a machine that generates profit.
I should add that programmers have the crucial albeit boring role of creating CRUD front ends (forms) for data input.
That is akin to a sensor input, and one that is surprisingly important. Without a good CRUD form, data either doesn't get entered at all, or is entered in crude, unvalidated ways, like loose Excel files with formatting that is all over the place.
I stood up a data science operation at my company over the last few years, and have noticed a key difference between data-science projects that have succeeded and those that have failed. It hits on a number of points brought up in the article, namely where data science "fits" in an organization delivering software and how its value is realized by the business.
The worst cases I have seen are when executives take a problem and ask data scientists to "do some of that data science" on the problem, looking for trends, patterns, automating workflows, making recommendations, etc. This is high-level pie in the sky stuff that works well in pitch meetings and client meetings, but when it comes down to brass tacks it leaves very little vision of what is trying to be achieved and even less of a viable execution path.
More successful deployments have had a few items in common:
1. A reasonably solid understanding of what the data could and couldn't do. What can we actually expect our data to achieve? What does it do well? What does it do poorly? Will we need to add other data sets? Propagate new data? How will we get or generate that data?
2. The business case or user problem was understood up front. In our most successful project, we saw users continuously miscategorizing items on input, so we built a model to make recommendations. It greatly improved the quality of our ingested user data.
3. Break it into small chunks and wins. Promising a mega-model that will do all the things is never a good way to deliver aspirational data goals. Little model wins were celebrated regularly and we found homes and utility for those wins in our codebase along the way.
4. Make it accessible to other members of the company. We always ensure our models have an API that can be accessed by any other service in our ecosystem, so other feature teams can tap into data science work. There's a big difference between "I can run this model on my computer, let me output the results" and "this model can be called anywhere at any time." (A minimal serving sketch follows below.)
While not exhaustive, I think a few solid fundamentals like the above align data science capabilities with business objectives and let the organization get "smarter" over time about what is and isn't possible.
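As a rough illustration of point 4, a minimal serving sketch (FastAPI is assumed purely for illustration; the endpoint and dummy model are hypothetical):

```python
from fastapi import FastAPI
from pydantic import BaseModel

class DummyModel:
    """Stand-in for whatever trained classifier gets loaded at startup."""
    def predict(self, texts):
        return ["miscellaneous" for _ in texts]

model = DummyModel()
app = FastAPI()

class Item(BaseModel):
    description: str

@app.post("/categorize")
def categorize(item: Item):
    # Any service in the ecosystem can POST here, instead of someone
    # running the model by hand on their laptop.
    return {"category": model.predict([item.description])[0]}
```

Run it with `uvicorn app:app` and every feature team gets the model at an HTTP call's distance.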
As a person who has been doing data science / ML for the last 4 years, I mostly agree with your points, especially about the hype-driven demand for DS/ML. One thing that is often neglected though is the exploration part of it. There really is a lot of data out there that your company doesn't know anything about, but can probably benefit from knowing. E.g. even a simple crawl of a popular jobs/ads/... site, done diligently for, say, 6 months, can reveal many interesting insights about market structure and trends. Google and its mission to organize all the data in the world exist for a reason. This however is in stark contrast with the approach that most executives take. Instead of managing it as a well-thought-out strategic/long-term investment, they want to time-box it, get immediate value and show off to senior management or customers. I've seen this tendency in both big corporations (mid-level management) and startups, which makes me think that the confounding variable is the fund/incentive management process. In both big corps and startups, there is limited time and budget to show meaningful results, and people optimize for that, which often involves taking shortcuts, neglecting strategy and outright lying.
In contrast to that, I've seen projects driven by wealthy individuals who don't look for immediate value but are scratching an itch (e.g. curiosity). These usually fare better than the former, as long as budgets don't get out of hand and exhaust the cash cow. I would argue that these are the most successful because of the better alignment between motivation (the person paying the bill) and execution (the person driving the process).
A math friend of mine often consulted for scientists. His least favorite clients were those who asked him to "make some clusters" (think k-means). "What are you looking for? What is your hypothesis?" "Just make some clusters and we'll see."
Not utterly without merit, but fairly blind fishing nonetheless.
>The worst cases I have seen are when executives take a problem and ask data scientists to "do some of that data science" on the problem...high-level pie in the sky stuff that works well in pitch meetings and client meetings...
I've been in various external- and internal-facing Data Science roles for 8+ years and this is spot on. IME it's the #1 reason Data Science projects "fail." If you can replace "do some of that data science" with "do some of that black magic" without changing the meaning, that probably means nobody actually checked to make sure the data and the problem made sense in the first place. But somebody somewhere already committed to it, so the Data Science team has to deliver it.
> The worst cases I have seen are when executives take a problem and ask data scientists to "do some of that data science" on the problem, looking for trends, patterns, automating workflows, making recommendations, etc.
While I agree on the point, there's a case that's arguably worse: When those executives hire Data Scientists and then ask them: "So what can we do with Data Science?"
Teams being small, data being crummy, infra being hard, and yet expectations being high aren't so much complaints as they are the job description.
The point of data scientists, and of the related roles listed in the article, is not just to churn out the fun stuff, but to wade through the institutional and technical muck and mire it takes to bring the fun stuff to bear on a relevant business problem, and to communicate the results in a way that people of all walks can understand.
Yeah this guy seems to think Data Science work should be like doing a problem set for CS class. Sorry that you have to deal with messy data, fragile infra, and limited resources - I know it's not "fun", but frankly that's what the money is for.
As somebody in an ML Engineering role, i.e. somebody who could be asked to either fix the logging infrastructure or build some models, I would have agreed with this.
But even in this day and age with ML being the new hotness, you will find people who are quite happy to work on infrastructure and don't have a huge amount of interest in training models themselves, and it is probably a lot easier to hire them than people who can do both, and you may get better results from actual specialists.
I'm generally confused by the hype around ML and 'data science'. It seems like CS has somehow regressed to the behaviourism era of psychology, or economics before the Lucas critique.
The problem with all this data talk isn't just about implementation or bad structure; the limitations of putting all your bets on inductive reasoning are systemic.
The insight that economists arrived at in the 70s and 80s was that reasoning from aggregated quantities is extremely limited. Without understanding, at a structural level, the generators of your data, trying to create policy based on outputs is like trying to reason about the inhabitants of a city by looking at its light pollution from the sky.
My guess as to why data science so rarely delivers what it promises: you can't get any value from historical data if your circumstances change to the point where past data is irrelevant, which in the world of business happens pretty quickly. To have a competitive advantage, one needs to figure out what has not been seen yet.
And trying to exploit signals suffers from the issue laid out above. There was a funny case of an AI hiring startup trying to predict good applicants, and the result was people putting "Oxford" in their applications in a font matching the background color.
There’s also the issue of data scientists just not having a seat at the table. Anyone can validate their point by using data to support their answer just like anyone can validate their opinion by doing a google search.
In my mind I see more data scientists being ignored or turned into “yes men” (https://www.interviewquery.com/blog-do-they-want-a-data-scie...)
I only see ML and data science as having real value when considered as a single component of a larger system, most of which will not consist of anything close to ML. Many real world environments are too entropic to see much accuracy from ML models except in very, very limited bands (facial recognition, for example).
As other commenters here have posted, without the integration of data science into both the business needs and the rest of the existing tech stack it will remain a fun school course activity.
As a scientist, I've worked with data for decades. There's always been a prevailing belief that scientists and engineers with specialized domain knowledge are mostly fumbling in the dark and can be replaced with a general purpose technique.
This was certainly the vibe that I got from "design of experiments" when it was the statistical method du jour. Then from "Bayesian everything" and now "data science." I remember "design of experiments" studies being conducted with great fanfare and success theater, while producing zero results.
The long term theme is that science is hard for reasons that managers don't understand, can't manage, and are reluctant to reward.
I've seen a few similar articles now. Does this represent the general view of folks working in data science? "Data Science" is such a meaningless catch-all term. The reality is that in many organizations it's simply advanced business intelligence or advanced business analytics. There are some industries that lend themselves well to this whole practice, and they tend to be industries born out of the internet age (e.g. social media, internet advertising, etc.).
Some other industries have been doing "data science" for ages. Credit Risk Modelling, insurance and so on.
Every time I read one of these articles, I feel it's just an individual who entered a kind of crummy situation and they're learning what it means to work in a corporate environment. Some are better than others. Some are more motivated than others. Some have better cultures than others. Some are more willing to make technology a key part of their business strategy. Some are more data driven than others.
My recommendation is to always ask the fundamental question before joining: what are you trying to achieve with data science, and is it actually achievable?
I always thought the non-specificity of the term Data Science was a strange criticism for those in the tech industry to make. How many types of SWE are there? Front-end, back-end, full-stack, devops, security, QA...
I agree wholeheartedly with your recommendation. Like any other job, each company has different needs and expectations and if you want something else out of the role you'd best avoid that company.
As a data dude in public/nonprofit healthcare-landia I agree with all this, plus:
- It's essential to have/develop domain expertise in your industry.
- Beware plausible, but incorrect (or poorly interpreted) data that supports yours (or others') assumptions/biases.
- To add on to #4: at least as bad as this is having well-intentioned people on your team who "know enough to be dangerous" (a bit of SQL or a low/no-code data tool). Um, why are you joining unnecessary tables, or using a different alias for the same columns/tables in different queries, with no comments or standard formatting?
- Hold your nose, but anything you do in SQL/R/Python/an even fancier tool or language is going to pass through MS Excel at least once sooner or later, and Excel can irreversibly bastardize CSVs (even just opening without saving!), truncate precision to 15 digits, change data types, etc. (See the snippet after this list.)
- So glad for the callout in #7: there are clearly devs/data folks out there who are happy to take on an "interesting programming project at a great paying job" that isn't serving the best interests of humanity.
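One small defensive habit against the Excel/CSV mangling above, sketched with pandas (file and column names are made up): ingest everything as text, then cast deliberately.

```python
import pandas as pd

# Read everything as strings so IDs with leading zeros and long numeric
# codes survive intact instead of being coerced by type inference.
df = pd.read_csv("export.csv", dtype=str)

# Cast only the columns you need, on purpose; bad cells become NaN
# instead of silently corrupting the whole column's type.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
```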
This rings very true to me. I'm working on moving over to an SWE role in the next few years for many of these reasons.
I'll just add one: the business absolutely doesn't care how you get your answer, only if they're reliable enough (hand grenade close is better than most companies have today).
While this seems obvious enough to anyone with a few years under their belt, it can be rather disillusioning to the new DS grad whose time series analysis gets canned in favor of slapping a simple moving average in place and shipping it.
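For what it's worth, the baseline that ships often amounts to a few lines (pandas; the series is invented):

```python
import pandas as pd

weekly_sales = pd.Series([10, 12, 9, 14, 13, 15, 11, 16])
forecast = weekly_sales.rolling(window=4).mean().iloc[-1]  # mean of the last 4 weeks
print(f"next-week forecast: {forecast:.2f}")               # 13.75
```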
Sometimes a young startup wants to advertise to the board, and they want you to make a presentation. I've made the mistake of showing near 100% accuracy solving a difficult problem important to the business, and expecting a strong positive result.
Instead I got "But, are we using deep neural networks?" type comments.
Sometimes a company just wants to market, be it to customers, or to the board. It's important to know your audience.
> the business absolutely doesn't care how you get your answer, only if they're reliable enough (hand grenade close is better than most companies have today).
One of the challenges with this is that "reliable" can mean a lot of things when the goalposts of success are constantly moving in large projects with many stakeholders, all of whom are clawing for attention. I've seen politics derail so many Data Science projects and destroy the morale of Data Scientists.
It's only natural that a lot of people will realize that a moving average that confirms what people wanted to see anyway will lead to more success (whatever that means).
That's pretty sad when you think about it, but it's painfully true.
I’ve been doing an MS in Data Science very slowly due to work and 2 new kids. Finishing the degree this year in year 4. I was very excited about the prospect of doing something different. A few things have changed for me.
1) I am hearing about Data Science teams being furloughed during these times. That isn't happening in my function (Corporate Finance). I am glad to be secure even though I enjoy much of the data sci work.
2) I'm able to apply Data Science concepts in my current role, and it's adding a lot of job security and providing me with exposure. I am much less interested now in moving to straight Data Science and instead am applying my learnings in my current role as a sort of in-house Data Science guy. But I have a lot to learn, to be honest.
3) There seem to be a lot of “thought leaders” acting like they are big experts in the area who really don't know anything many of us amateur scientists don't know. They pull perfect clean datasets and show magic transformations they just copied from others, to get YouTube hits or Twitter followers. That just never happens in real life, and many leaders are seeing this and losing interest in this function, given the returns they are getting from data science folks alone.
This isn’t unique to data science. I personally know people in finance that are poor coders and even worse quants, yet they go around lecturing at universities.
I work as a data scientist. Some of the author's points are workplace-specific: lack of leadership, being the only data person, ethical concerns. The others are just aspects of the job - communicating about your job and impact, dealing with vague specs or managing low-quality datasets.
Neither of those quite matches the article's title; perhaps it just refers to the author's personal expectations. Neither of them seems that specific to data science, or without parallels in other software jobs. And neither of the points reads like a slight towards data science to me, as some of the other commenters here suggest.
One issue might be that organizations subconsciously resist the data scientist, or more generally, the nerd, in his/her attempt to take over decisions. If these decisions are invariably tied to the goals and careers of managers, how can the data scientist have a "seat at the table" in all but the most enlightened and technical companies? The disorganized state of data and infrastructure suits the ambitious manager well, who can just put in enough effort to find data to have their project greenlit or to answer one specific question.
Progress may only come slowly, ideally through products bought from 3rd parties whose results are understood and controlled by management.
I did "data science" for about a decade, consulting with plaintiffs firms and state AGs on antitrust and fraud cases. For each case, the work flow was roughly this:
-- write discovery requests
-- review production, and check out data and documentation
-- write supplementary discovery requests
-- review production, and check out data and documentation
[repeat as needed]
-- analyze data, and write deposition questions
-- help attorneys wring answers from deponents
[repeat as needed]
-- analyze data, and produce required output
-- write parts of briefs and expert reports
I generally did that in consultation with testimonial experts and their data analysts. Sometimes that didn't happen until we'd documented the case enough to know that it was worth it. And occasionally small cases settled with just me as the "expert".
It's a small industry, and not easy to get into, unless you know key players at key firms. But the money's pretty good, and the work can be exciting. I loved being that guy in depositions whispering questions to the attorneys :)
This all involved pretty simple calculation of damages, through comparing what actually happened vs what would have happened but for the illegal behavior. But-for models were typically based on benchmarks.
After data cleanup in UltraEdit, I did most of the analysis in SQL Server. I used Excel for charting and final calculations.
This reads like Indiana Jones teaching Archeology. Yes, as a data-scientist you actually have to work, most of the work is digging in dirt, and mostly you won't find anything of interest.
It works well when subject matter experts exist in the org and collaborate with/supervise/drive the data folk to solve some issue the SMEs have spent enough of their own time thinking about.
If it's just data folk by themselves getting dumped with org data and told to find pirate gold... then it's a crapshoot.
This reads as a series of bad job experiences and I think is explained by a wide variety of job functions that all can have "Data Scientist" as a title. Someone else's experience could be totally different. You have to know what to look for and what to avoid. If you're trying to find a DS job, one of your top priorities is finding out what the actual job consists of. For instance, a Data Scientist at Facebook might be called a Data Analyst at many other places--no modeling required.
I know this because I've been on that journey. But there's no reason to expect some department head that's never been exposed to DS to know this. They just copy/paste some other company's job req. If you're more junior, here are my tips:
- If it's a "new DS team" that supports a variety of teams: beware. Bolt-on DS doesn't work well, as it's really hard to build a meaningful solution that's not deeply integrated.
- If it's an old company or in a conservative industry: beware. There are likely to be data silos and difficult ownership models that make it nearly impossible to get and join the data you need.
- If it's a small company: beware. You're likely going to need a broad set of knowledge that's won with several years of experience to be able to build end-to-end solutions that are integrated into the rest of the tech stack.
- If it's not an engineering-driven culture: beware. DS will often be used to provide evidence for someone who has already made up their mind, so they can pretend they're being data-driven, and you'll be the disrespected nerd that's expected to do what it takes to deliver the answer they want. Most companies claim to be "data-driven"; few are, and even fewer understand that data-driven isn't always possible or desirable.
Industry is still trying to figure out how to use ML and is still learning that it's not as easy as hiring someone who knows all the algorithms; rather, it takes deep changes to data infrastructure to enable the datasets that can then be used by the ML experts. But you don't have to be the person that helps them figure this out the hard way (i.e. by being paid to not accomplish much due to problems outside of your control). Better to find a place with a healthy data science team that can help you learn and contribute. They exist.
Great read. A lot of those problems are real, and some of them I've experienced myself. But I think at least some of them are related to the immaturity of the field. We're only at the beginning of creating the tools and platforms to facilitate DS, making it more reproducible and easier to measure.
For example, I'm working on a tool to make data management easier and convert datasets into a structured representation. If you have found that you spend a lot of time preparing and analyzing data, and that it is tedious, please reach out to me at michael at heartex.net; I would love to get your feedback on the product we have built so far.
> But I think at least some of them are related to the immaturity of the field.
I agree. What's more, I sometimes feel that in the end the field will break apart once things start settling down. Some roles will migrate towards engineering, some will go back towards data analysis.
The expectation that a data scientist is a funnel that can turn anything into magical insights and tools can't last forever.
A really easy way that I try to explain things to people is like this:
You can't compress information until you have it in a format that is appropriate for compression.
That is:
You can't compress (apply/create algorithms) information (data) until you have it (instrumented data collection) in a format (schema) that is appropriate for efficient compression (structured logging/cleaning).
99% of that is Data Engineering and building good engineering practices which have good data practices as a priority.
For any organization that has more than a handful of employees and more than one product, that is a non-trivial task, and it gets more difficult the larger the organization gets.
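A tiny example of what instrumented collection with a schema can look like (Python standard library only; the event fields are made up):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event_type: str, **fields):
    """Emit one parseable JSON object per line, not a free-form string."""
    record = {"ts": time.time(), "event": event_type, **fields}
    logging.info(json.dumps(record))

log_event("item_categorized", item_id="A123", category="hardware", confidence=0.92)
```

Structured events like this are what make the later "compression" possible at all.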
[+] [-] hectormalot|6 years ago|reply
This. Universities and online challenges provide clean labeled data, and score on model performance. The real world will provide you... “real data” and score you (hopefully) by impact. Real data work requires much more than modeling. Understanding the data, the business and value you create are important.
As per #6, better data and model infrastructure is crucial in keeping the time spent on these activities manageable, but I do think they’re important parts of the job.
I’ve seen data science teams at other companies working for years on topics that never see production because they only saw modeling as their responsibility. Even the best data and infrastructure in the world won’t help if data scientists do not feel co-responsible for the realization of measurable value for their business.
Training integrative data professionals could be a great opportunity for bootcamps. Universities will (understandably) focus on the academically interesting topic of models, while companies will increasingly realize they need people with skills across the data value chain. I know I would be interested in such profiles. :)
[+] [-] kqr|6 years ago|reply
Most people figured that with such a simple assignment (not significantly harder than the first one, which was also easy-ish) they could put off doing it until the last moment.
Most people failed.
This real world data needed hours upon hours of cleaning before it was in any way useable. Of course, the teacher knew this, gave bonus points to the ones who did start in time, and then extended the deadline as he had expected to from the start.
Never again will I underestimate the dirtiness of real world data. One of the best teachers I had.
[+] [-] proverbialbunny|6 years ago|reply
I don't get why building a model people consider to be the "fun" part. That's mostly spitting data in, watching a loading screen, and then observing the output.
That's not fun, that's boring. The fun part is looking at the data and gleaming all these potential patterns from it, seeing what potential is there and what could be. Likewise, learning the business side and seeing what is possible no one has considered is great fun too.
My favorite part is feature engineering. Pre-processing and cleaning is fun too, but morphing the data into formats that extract a diamond from coal is a lot of fun, and what data science is all about. Clicking go on some ML algo is just icing on the cake, seeing it reveal bits maybe even I overlooked in the data.
If you like ML why not be an MLE? That's what MLEs do, and they're a more desirable job. DS is all about the research, discovering and learning new information, and making the impossible possible.
[+] [-] roystonvassey|6 years ago|reply
This is a simple yet very powerful rule that helps us quickly disband ideas that:
1. Do not have a robust testing mechanism. No model is useful unless it performs in the real world. Measuring this is a severely non-trivial problem with multiple operational considerations.
For e.g. are you able to run manage true control/test groups? How do you build a “reverse” data pipeline to verify your models? And, if you are required to update model weights constantly, where and how will you update the model parameters?
2. Conversely, some of the most impactful products I worked on were probably delivered in simple excel sheets or had just under 20 lines in my Jupyter notebook. Not every business problem is demanding a deep learning network. For e.g. we worked on a data-driven capacity forecasting exercise for a call-centre. I can tell you that the sophistication of the model was the last thing on my mind as I had to work on careful interpretation and data collection.
3. Data Science departments should sit closer to business than what appears to be the trend correctly. At least business data science teams ( Apart from technical data teams focusing on product analytics to improve performance etc ). Courses and academic programs, I think, have developed a bias towards tools and techniques without the underlying analytical interpretative techniques needed to work with data. For e.g a new data scientist in my team delivered excellent code but she couldn’t detect logical misses in the data (for e.g losing some data during processing, using columns with almost all data missing)
On the other end of this spectrum, we are in the lagging end of the hype bubble still so there are many top leaders who are expecting to plug in “data science” and realise Billions of dollars in savings, new sales etc.
[+] [-] semi-extrinsic|6 years ago|reply
Now with "data science" you've taken a step further, and instead of applying the math to lab reports on meticulously filled out forms, you're going to aggregate all the messy sources you can get your hands on. Of course your headaches will multiply.
[+] [-] avs733|6 years ago|reply
First homework assignment in the stats class I teach is to clean data that the class generated with directions they all perceived as clear. It's near about the most hated assignment I have ever given. Amazing how many ways there are to encode gender of a experimental participant.
Male, M, m, male, Man, ...
[+] [-] twelfthnight|6 years ago|reply
This article mentions that leadership often doesn't support data science, but I think it actually doesn't go far enough. Leadership doesn't just have to support the data scientists, it has to actually tell other teams to prioritize data science projects over what they are currently doing. Since these data science projects are riskier than standard projects, it makes sense that leadership doesn't often do this (and focusing on the standard projects could be the right call). However, it also means that it's very hard for data scientists to create business value.
[+] [-] throwaway713|6 years ago|reply
The really interesting research type work (Bayesian modeling, convolutional neural networks, etc.) takes a long time to implement and may produce no useful results, which is a really bad outcome at a company that measures performance in six month units of work and highly values scheduled deliverables and concrete impact. Many of the data scientists I work with tend to stick to methods that are actually quite simple (e.g., logistic regression, ARIMA) because these at least deliver something quickly, despite the fact that many of my coworkers come from research-heavy backgrounds.
In my org, there's nothing stopping anyone from pursuing advanced machine learning; for the most part we set our own agenda (in fact, determining priorities is part of the job role). And some people do in fact go after state-of-the-art ML, with some really cool results to show for it. But in terms of career progression and job safety, the risk is just way too high, at least for me personally. I save the highly mathematical stuff for a hobby.
Edit: while this may sound a bit negative, I will add that my description of data science isn't a complaint per se; I am mainly trying to inform those who are seeking a career in data science of what to expect compared to what is often promised. The work that is most valuable to a business is not exciting all of the time, but I don't think there is another job in the tech industry that I would find more enjoyable than my current one at the moment.
[+] [-] apohn|6 years ago|reply
I think the sad truth is that this is the reality of work no matter if you are a Data Scientist or not. What you thought you would be doing to show your worth and climb the ladder gets blurred in with KPIs you didn't set, politics you didn't create, goals and deadlines you had no input into, etc. One of the unique challenges you can face as a Data Scientist is that you may interface with people in many different groups, all of which have different goals which may be in conflict with each other. Compare this to other roles where you ultimately only follow the goals of the organization you report into.
[+] [-] Breza|6 years ago|reply
[+] [-] Swizec|6 years ago|reply
This probably describes just about every job in a for-profit business.
[+] [-] resolaibohp|6 years ago|reply
Apply for and interviewing for data science jobs is a total nightmare. You are competing against 100s or even 1000s of applicants for every job posting because someone said it was one of the sexiest careers of the 21st century. Further exacerbating this, Everyone believes that data is the new oil, and large profit multipliers are just waiting to be discovered in this virgin data that companies are sitting on. All that is missing is someone to run some neural network, or deep learning algo on it to discover the insights that nobody else can see.
The reality is that there is an army of people who know how to run these algos. MOOC's, blogs, youtube, etc have been teaching everyone how to use these python/R packages for years. The lucky few who get that coveted data science job can't wait to apply these libraries to the virgin data only to find that they have to do all kinda of data manipulating to make the algos even work, which takes days and weeks of mundane work. Finally they find out the data is so lacking that their deep learning model does very little in providing actual business value. It is overly complicated, computationally expensive, and in the back of your mind know you can get the same results using some simple logic.
Managers who don't understand data science fundamentals learn from the news and have their data scientist implement those buzz words so they can look good in front of their bosses.
I think there is a place for data scientists who understand the fundamentals of the models out there, and know when you should not use them. Data science is also increasingly a subset of software engineering and a good data science in a tech company should be able to code well. I also think that there is not some huge unmet demand for data scientists. Just a huge amount of hype and managers wanting to look good by saying they managed a data science team.
[+] [-] rixed|6 years ago|reply
[+] [-] wenc|6 years ago|reply
You see, decision-making involves (1) getting data, (2) summarizing and predicting, and (3) taking action. Continuous decision-making -- the kind that leads to impact -- involves doing this repeatedly in a principled fashion, which means creating a system around the decision process.
For systems thinkers, this is analogous to a feedback control loop which includes sensor measurements + filters, controllers and actuators.
(1) involves programmers/data engineers who have to create/manage/monitor data pipelines (that often break). This the sensor + filters part, which is ~40% of the system.
(2) involves data scientists creating a model that guides the decision-making process. This is the model of the controller (not even the controller itself!), which is ~20% of the system. Having the right model is great, but as most control engineers will tell you, even having the wrong model is not as terrible as most people think because the feedback loop is self-correcting. A good-enough model is all you need.
(3) involves business/front-line people who actually implement decisions in real-life. This is where impact is delivered. ~40% of the system. This is the controller + actuator part, which makes the decisions and carries them out.
Most data scientists think their value is in creating the most accurate model possible in Jupyter. This is nice, but in real-life not really that critical because the feedback-loop inherently moderates the error when deployed in a complex, stochastic environment. The right level of optimization would be to optimize the entire decision-making control feedback loop instead of just the small part that is "data science".
p.s. data scientists who have particularly low-impact are those who focus on producing once-off reports (like consultant reports). Reports are rarely read, and often forgotten. Real impact comes from continuous decision-making and implementing actions with feedback.
Source: practicing data scientist
[+] [-] arborism|6 years ago|reply
[+] [-] wenc|6 years ago|reply
That is a akin to a sensor input, and one that is surprisingly important. Without a good CRUD form, data either doesn’t get entered at all, or is entered in crude, unvalidated ways like as loose Excel files with formatting that is all over the place.
[+] [-] danmostudco|6 years ago|reply
The worst cases I have seen is when executives take a problem and ask data scientists to "do some of that data science" on the problem, looking for trends, patterns, automating workflows, making recommendations, etc. This is high-level pie in the sky stuff that works well in pitch meetings and client meetings, but when it comes down to brass tacks this leaves very little vision of what is trying to be achieved and even less on a viable execution path.
More successful deployments have had a few items in common
1. A reasonably solid understanding of what the data could and couldn't do. What can we actually expect our data to achieve? What does it do well? What does it do poorly? Will we need to add other data sets? Propagate new data? How will we get or generate that data?
2. The business case or user problem was understood up front. In our most successful project, we saw users continuously miscategorized items on input and built a model to make recommendations. It greatly improved the efficacy of our ingested user data.
3. Break it into small chunks and wins. Promising a mega-model that will do all the things is never a good way to deliver aspirational data goals. Little model wins were celebrated regularly and we found homes and utility for those wins in our codebase along the way.
4. Make is accessible to other members of the company. We always ensure our models have an API that can be accessed by any other services in our ecosystem, so other feature teams can tap into data science work. There's a big difference between "I can run this model on my computer, let me output the results" and "this model can be called anywhere at any time."
While not exhaustive, a few solid fundamentals like the above I think align data science capabilities to business objectives and let the organization get "smarter" as time goes on as to what is possible and not possible.
[+] [-] kavalg|6 years ago|reply
[+] [-] downerending|6 years ago|reply
Not utterly without merit, but fairly blind fishing nonetheless.
[+] [-] apohn|6 years ago|reply
I'm been in various external and internal facing Data Science roles for 8+ years and this is spot on. IME it's the #1 reason Data Science projects "fail." If you can replace "do some of that data science" with "do some of that black magic" that probably means nobody actually checked to make sure the data and problem made sense in the first place. But somebody somewhere already committed to it, so the Data Science team has to deliver it.
[+] [-] datenhorst|6 years ago|reply
While I agree on the point, there's a case that's arguably worse: When those executives hire Data Scientists and then ask them: "So what can we do with Data Science?"
[+] [-] kristjansson|6 years ago|reply
The point of data scientists and the related roles listed in the article are not to just churn out the fun stuff, but to wade through the institutional and technical muck and mire it takes to bring the fun stuff to bear on a relevant business problem and to communicate the results in a way that people of all walks can understand.
[+] [-] tqi|6 years ago|reply
[+] [-] Eridrus|6 years ago|reply
But even in this day and age with ML being the new hotness, you will find people who are quite happy to work on infrastructure and don't have a huge amount of interest in training models themselves, and it is probably a lot easier to hire them than people who can do both, and you may get better results from actual specialists.
[+] [-] unknown|6 years ago|reply
[deleted]
[+] [-] Barrin92|6 years ago|reply
The problem with all this data talk isn't just about implementation or bad structure, the limitations of putting all your bets on inductive reasoning are systemic.
The insights that economists had in the 70s and 80s was that reasoning from aggregated quantities is extremely limited. Without understanding at a structural level the generators of your data, trying to create policy based on outputs is like trying to reason about inhabitants of a city by looking at light pollution from the sky.
My guess why data science so rarely delivers what it promises is because you can't get any value from historical data if your circumstances change to the point where past data is irrelevant. Which in the world of business happens pretty quickly. To have a competitive advantage, one needs to figure out what has not been seen yet.
And trying to exploit signals suffers from the issue laid out above. There was a funny case of an AI hiring startup trying to predict good applicants, and the result was people putting "Oxford" in their application in a font matching the background color
[+] [-] data4lyfe|6 years ago|reply
In my mind I see more data scientists being ignored or turned into “yes men”(https://www.interviewquery.com/blog-do-they-want-a-data-scie...)
[+] [-] michaelscott|6 years ago|reply
As other commenters here have posted, without the integration of data science into both the business needs and the rest of the existing tech stack it will remain a fun school course activity.
[+] [-] itsmefaz|6 years ago|reply
Can you please elaborate on this please?
[+] [-] analog31|6 years ago|reply
This was certainly the vibe that I got from "design of experiments" when it was the statistical method du jour. Then from "Bayesian everything" and now "data science." I remember "design of experiments" studies being conducted with great fanfare and success theater, while producing zero results.
The long term theme is that science is hard for reasons that managers don't understand, can't manage, and are reluctant to reward.
[+] [-] rafiki6|6 years ago|reply
Some other industries have been doing "data science" for ages. Credit Risk Modelling, insurance and so on.
Every time I read one of these articles, I feel it's just an individual who entered a kind of crummy situation and they're learning what it means to work in a corporate environment. Some are better than others. Some are more motivated than others. Some have better cultures than others. Some are more willing to make technology a key part of their business strategy. Some are more data driven than others.
My recommendation is to always ask the fundamental question before joining: what are you trying to achieve with data science, and is it actually achievable?
[+] [-] smeeth|6 years ago|reply
I agree wholeheartedly with your recommendation. Like any other job, each company has different needs and expectations and if you want something else out of the role you'd best avoid that company.
[+] [-] antipaul|6 years ago|reply
[+] [-] Optimal_Persona|6 years ago|reply
- It's essential to have/develop domain expertise in your industry.
- Beware plausible but incorrect (or poorly interpreted) data that supports your (or others') assumptions/biases.
- Adding on to #4: at least as bad is having well-intentioned people on your team who "know enough" (a bit of SQL or a low/no-code data tool) to be dangerous. Um, why are you joining unnecessary tables, or using a different alias for the same columns/tables in different queries, with no comments or standard formatting?
- Hold your nose, but anything you do in SQL, R, Python, or an even fancier tool/language is going to pass through MS Excel at least once sooner or later, which can irreversibly bastardize CSVs (even just opening without saving!), truncate precision to 15 significant digits, change data types, etc. (see the sketch after this list).
- So glad for the callout in #7 - there are clearly devs/data folks out there who are happy to take on an "interesting programming project at a great-paying job" that isn't serving the best interests of humanity.
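To make the Excel point concrete, here is a small Python sketch (the ID and amounts are made up) of the 15-significant-digit problem, plus one defensive habit: read CSVs as text first and choose types yourself.

    import csv, io

    # Excel stores numbers as IEEE-754 doubles (~15 significant digits),
    # so a long numeric identifier silently loses its tail on open.
    raw_id = "20200415123456789"        # a 17-digit ID as it sits in the CSV
    as_float = float(raw_id)            # roughly what Excel does when it opens the file
    print(f"{as_float:.0f}")            # -> 20200415123456788, the last digit is gone

    # Defensive read: the csv module never coerces types, so the ID survives.
    sample = io.StringIO("id,amount\n20200415123456789,3.50\n")
    rows = list(csv.DictReader(sample))
    assert rows[0]["id"] == raw_id      # still the exact original string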
Icathian | 6 years ago
I'll just add one: the business absolutely doesn't care how you get your answers, only whether they're reliable enough (hand-grenade close is better than most companies have today).
While this seems obvious enough to anyone with a few years under their belt, it can be rather disillusioning to the new DS grad who has their time-series analysis canned in favor of slapping a simple moving average in place and shipping it.
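For what it's worth, the "ship it" version really is just a few lines. A minimal sketch with invented numbers, using a trailing moving average as the forecast:

    import pandas as pd

    # Toy daily series standing in for whatever the business actually tracks.
    sales = pd.Series(
        [102.0, 98.0, 110.0, 120.0, 95.0, 105.0, 115.0],
        index=pd.date_range("2020-04-01", periods=7, freq="D"),
    )

    # The canned model's replacement: a trailing 3-day moving average used as
    # tomorrow's forecast. Crude, but often "hand grenade close".
    forecast = sales.rolling(window=3).mean().iloc[-1]
    print(round(forecast, 1))  # mean of the last three observations -> 105.0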
proverbialbunny | 6 years ago
Sometimes a young startup wants to advertise to the board, and they want you to make a presentation. I've made the mistake of showing near-100% accuracy on a difficult problem important to the business and expecting a strongly positive reaction.
Instead I got "But, are we using deep neural networks?" type comments.
Sometimes a company just wants to market, be it to customers or to the board. It's important to know your audience.
apohn | 6 years ago
One of the challenges with this is that "reliable" can mean a lot of things when the goalposts of success are constantly moving in large projects with many stakeholders, all of whom are clawing for attention. I've seen politics derail so many data science projects and destroy the morale of data scientists.
It's only natural that a lot of people will realize that a moving average confirming what everyone wanted to see anyway will lead to more success (whatever that means).
Der_Einzige | 6 years ago
That's pretty sad when you think about it, but it's painfully true.
Vaslo | 6 years ago
1) I am hearing about data science teams being furloughed during these times. That isn't happening in my function (Corporate Finance). I am glad to be secure, even though I enjoy much of the data science work.
2) I'm able to apply data science concepts in my current role, and it's adding a lot of job security and giving me exposure. I am now much less interested in moving into a straight data science role; instead I'm applying what I've learned as a sort of in-house data science guy. But I have a lot to learn, to be honest.
3) There seem to be a lot of "thought leaders" acting like big experts in the area who really don't know anything many of us amateurs don't know. They pull perfectly clean datasets and show magic transformations they just copied from others to get YouTube hits or Twitter followers. That just never happens in real life, and many leaders are seeing this and losing interest in the function, given the returns they are getting from standalone data science folks.
RayVR | 6 years ago
s1t5 | 6 years ago
Neither of those quite matches the article's title; perhaps it just refers to the author's personal expectations. Neither of them seems that specific to data science, or without parallels in other software jobs. And neither of the points reads like a slight toward data science to me, as some of the other commenters here suggest.
UweSchmidt | 6 years ago
Progress may only come slowly, ideally through products bought from 3rd parties whose results are understood and controlled by management.
mirimir | 6 years ago
-- write discovery requests
-- review production, and check out data and documentation
-- write supplementary discovery requests
-- review production, and check out data and documentation
[repeat as needed]
-- analyze data, and write deposition questions
-- help attorneys wring answers from deponents
[repeat as needed]
-- analyze data, and produce required output
-- write parts of briefs and expert reports
I generally did that in consultation with testimonial experts and their data analysts. Sometimes that didn't happen until we'd documented the case enough to know that it was worth it. And occasionally small cases settled with just me as the "expert".
It's a small industry, and not easy to get into, unless you know key players at key firms. But the money's pretty good, and the work can be exciting. I loved being that guy in depositions whispering questions to the attorneys :)
This all involved pretty simple calculation of damages: comparing what actually happened vs. what would have happened but for the illegal behavior. But-for models were typically based on benchmarks.
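A but-for damages calculation can be as small as this sketch (all numbers invented): price each actual purchase at the benchmark's but-for level and sum the differences.

    # Hypothetical purchases during the damages period: (units, price actually paid).
    purchases = [(100, 12.50), (80, 13.00), (120, 12.75)]

    # But-for price estimated from a benchmark market untouched by the conduct.
    but_for_price = 10.00

    # Damages = sum over purchases of units * (actual price - but-for price).
    damages = sum(units * (price - but_for_price) for units, price in purchases)
    print(f"${damages:,.2f}")  # -> $820.00

Real cases mostly differ in how the benchmark is chosen and defended, not in the arithmetic.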
After data cleanup in UltraEdit, I did most of the analysis in SQL Server. I used Excel for charting and final calculations.
avip | 6 years ago
op03 | 6 years ago
If it's just data folk by themselves getting dumped with org data and told to find pirate gold... then it's a crapshoot.
agentofoblivion | 6 years ago
I know this because I've been on that journey. But there's no reason to expect some department head who's never been exposed to DS to know this; they just copy/paste some other company's job req. If you're more junior, here are my tips:
- If it's a "new DS team" that supports a variety of teams: beware. Bolt-on DS doesn't work well, as it's really hard to build a meaningful solution that's not deeply integrated.
- If it's an old company or in a conservative industry: beware. There are likely to be data silos and difficult ownership models that make it nearly impossible to get and join the data you need.
- If it's a small company: beware. You're likely going to need a broad set of knowledge that's won with several years of experience to be able to build end-to-end solutions that are integrated into the rest of the tech stack.
- If it's not an engineering-driven culture: beware. DS will often be used to provide evidence to someone who's already made up their mind and wants to pretend they're being data-driven, and you'll be the disrespected nerd expected to do whatever it takes to deliver the answer they want. Most companies claim to be "data-driven"; few are, and even fewer understand that being data-driven isn't always possible or desirable.
Industry is still trying to figure out how to use ML, and is still learning that it's not as easy as hiring someone who knows all the algorithms; rather, it takes deep changes to data infrastructure to enable the datasets that ML experts can then use. But you don't have to be the person who helps them figure this out the hard way (i.e., by being paid to not accomplish much due to problems outside of your control). Better to find a place with a healthy data science team that can help you learn and contribute. They exist.
deppp | 6 years ago
For example, I'm working on a tool to make data management easier and convert datasets into a structured representation. If you've found that you spend a lot of time preparing and analyzing data, and that it's tedious, please reach out to me at michael at heartex.net; I'd love your feedback on the product we've built so far.
leto_ii | 6 years ago
I agree. What's more, I sometimes feel that in the end the field will break up once things start settling down. Some roles will migrate toward engineering; some will go back toward data analysis.
The expectation that a data scientist is a funnel that can turn anything into magical insights and tools can't last forever.
AndrewKemendo | 6 years ago
You can't compress information until you have it in a format that is appropriate for compression.
That is:
You can't compress (apply/create algorithms) information (data) until you have it (instrumented data collection) in a format (schema) that is appropriate for efficient compression (structured logging/cleaning).
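A minimal sketch of that mapping (names are hypothetical): an explicit schema plus instrumented, structured logging is what makes the later "compression" possible at all.

    import json
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone

    @dataclass
    class ClickEvent:        # the "format (schema)" step
        user_id: str
        page: str
        ts: str

    def log_event(user_id: str, page: str) -> str:
        # The "instrumented data collection" step: emit one structured,
        # machine-parseable record instead of a free-form log line.
        event = ClickEvent(user_id, page, datetime.now(timezone.utc).isoformat())
        return json.dumps(asdict(event))

    print(log_event("u123", "/pricing"))  # ready for structured cleaning downstream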
99% of that is data engineering: building good engineering practices that treat good data practices as a priority.
For any organization with more than a handful of employees and more than one product, that is a non-trivial task, and it gets more difficult the larger the organization gets.
antipaul | 6 years ago
It's not quite 99% of the effort but close enough ;)
Search "data science hierarchy of needs"