
What Data Scientists Really Do, According to Data Scientists

278 points| pseudolus | 7 years ago |hbr.org

136 comments

[+] sixdimensional|7 years ago|reply
"The only difference between screwing around and science is writing it down" - Adam Savage.

In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis? It doesn't need to be conflated with machine learning, BI, SQL, etc. It can just be about approaching data analysis with scientific rigor.

My opinion is that the term data science evolved when we started needing cross-functional people who are a blend of:

- domain experts;

- numerical/quantitative specialists (such as statisticians, mathematicians, physicists, STEM people);

- business analysts, business intelligence; and,

- those who traditionally deal with data management, platforms and tools.

That confluence of people was needed amidst the related trends:

- increased government funding for STEM education and brain research;

- marketing from companies such as IBM ("Watson"), the democratization of data and increase in the use of data in daily life;

- the big data wave, subsequent interest in "internet of things" and "digital transformation";

- renewed interest in machine learning and AI (recurrent neural networks and other breakthroughs);

- and others, of course.

We needed to apply more discipline to data analysis - thus data science was born. A formalizing of what many were already doing, to capture the need and changing paradigm. Or so I like to believe.

[+] JumpCrisscross|7 years ago|reply
> why can't data science simply be about applying the scientific method in the realm of data analysis?

Because then someone with a business school education (and zero formal statistical training) wouldn't be able to do it. I joke about waiting for finance's Excel models to be rebranded as AI, as I've already seen a handful of hedge funds rebrand their analysts as data scientists.

[+] digitalzombie|7 years ago|reply
> why can't data science simply be about applying the scientific method in the realm of data analysis?

That's what a statistician does.

I've seen these ML and data science people. Most of the time, how they tackle data is radically different from how a statistician would, and it is more of an art than a science compared to what a statistician does.

But this could be my biased opinion and just a small data sample from personal experience.

---

Actually, on the last day of my internship I met a few statistician interns, some of them from Cal (UC Berkeley), and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it). I wish statistics were better at marketing. Oh well.

[+] alexpetralia|7 years ago|reply
I've written about this problem before as well[1] - data science seems to be defined uniquely narrowly (and confusingly) compared to other fields.

--

"Data science" is the most natural name for this field. Though fields like "information science" and "political science" are broad, "data science," as it is popularly defined, is uniquely narrow. This is problematic because general fields typically serve as a roadmap for all subfields - a cursory glance at what they are and how they relate. "Data science" today does not provide this road map.

[1] https://alexpetralia.com/posts/2016/6/22/reclaiming-the-term...

[+] fouc|7 years ago|reply
Seems like "Data Scientist" as a job title is too generic.

It's like saying "Web Designer" nowadays, when in reality we have a variety of job specializations like UX strategist (UX only), graphic designer (Photoshop), UI developer (HTML/CSS only), front-end developer (HTML/CSS/JS), etc.

[+] mcrad|7 years ago|reply
> In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis?

I suppose someone who isn't very smart is staffing data analysts with little understanding of science. Most would assume that if you are getting paid to analyze all this valuable data, you would have some grounding in the scientific method.

[+] NPMaxwell|7 years ago|reply
Tangential: I loved the appearance of the term, "Data Scientist". Scientists are domain experts. A biologist knows cell membranes. A chemist knows valences. For decades statisticians had been pitching that they were helpful without knowing the domain: appear-deliver-and-run consultants. It was wonderful to see a group embrace the importance of knowing the system that is producing the data. I regret the more recent movement of "data science" consultants who try to run models and AI without understanding the systems the data came from.
[+] em500|7 years ago|reply
This is fairly accurate in my experience: https://twitter.com/thesmartjokes/status/684286479401652224

Day to day, it's mostly SQL or, worse, Hive queries, which make most things much slower than they should be.

[+] minimaxir|7 years ago|reply
One thing rarely discussed with the rise of big data is how to do efficient querying, especially at scale.

I've had a ton of data science interviews which ask how to reimplement binary search from scratch (which I would never do on the job), but not anything about how to do efficient JOINs and query nesting.

[+] tspike|7 years ago|reply
Dumb question: is Splunk implemented with Hive?
[+] appleiigs|7 years ago|reply
“It has been a common trope that 80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis.”

Is it really a trope? In my experience, I almost think collecting data is >80%.

[+] gaius|7 years ago|reply
80% of time is spent on getting and cleaning the data, 20% of time is spent preparing and delivering reports and presentations.

The super-cool ML stuff that attracts people to the field in the first place accounts for little more than a rounding error in how the time is really spent.

[+] visarga|7 years ago|reply
I get to spend 90% collecting, cleaning and tagging data.
[+] thibautg|7 years ago|reply
- years (continuous endeavor): find existing data sources

- months: convince management to give access to data source

- weeks: try to find the connection string

- days: clean up the data (mostly converting dates to yyyy-mm-dd) and importing/exporting csv files

- hours: load data in database, write simple SQL query and simple visualisation

- seconds: brief moment of satisfaction
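
A minimal sketch of the "days" and "hours" steps above, using only the Python standard library. The `orders` table, its column names, the sample rows, and the list of accepted date formats are all hypothetical; a real source will have its own.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical input formats, tried in order; extend for your own data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(raw):
    """Return raw as yyyy-mm-dd, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def load_csv(conn, csv_text):
    """Clean the date column and load rows into a hypothetical 'orders' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_date TEXT, amount REAL)")
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(to_iso_date(r["order_date"]), float(r["amount"])) for r in reader]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return len(rows)


# Made-up sample data with three different date spellings.
SAMPLE = "order_date,amount\n2018-08-25,10.0\n26/08/2018,5.5\n20180827,4.5\n"

conn = sqlite3.connect(":memory:")
load_csv(conn, SAMPLE)

# The "simple SQL query": total amount per (now normalized) day.
totals = conn.execute(
    "SELECT order_date, SUM(amount) FROM orders "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()
print(totals)
```

Unparseable dates come back as `None` and are stored as NULL rather than guessed at, so they can be counted and inspected afterwards.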

[+] cpeterso|7 years ago|reply
I just started reading Weapons of Math Destruction by mathematician Cathy O'Neil. She warns about big data systems that codify racism and classism from flawed data and self-fulfilling feedback loops. The systems' "unbiased" decisions are opaque, proprietary, and often unchallengeable.

https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction

[+] listenallyall|7 years ago|reply
If "big data" informs you that a group of people (race, geographic location, nationality, occupation, gender, etc.) is less likely to, say, pay back a loan, or successfully complete 4 years at university, or avoid insurance claims... must that be anything-ist? Or just a fact? Or, as your post suggests, are you obligated to throw out the conclusion and just assume the inputs must have been "flawed data"?
[+] gfarah|7 years ago|reply
I just finished building a credit risk ML classifier at the company I work for. The model will be used to decide whether or not we lend money to people/companies.

Adding profiling features does give more accurate predictions. However, I pitched not using these features as a competitive advantage to the founders and they (luckily) agreed. We won’t be using them, and we ended up (with more work, of course) getting a similarly performant model without them.

We can and should try to not use those kinds of features and be as fair as possible.

[+] minimaxir|7 years ago|reply
There has been a rise in romantic thought pieces lately about how Data Scientists are wizards and can solve any problem with the real superpower of teamwork. (here's an older example from Instacart: https://tech.instacart.com/data-science-at-instacart-dabbd2d...)

In the real world, the state of affairs in Data Science is more practical and pragmatic. And there's nothing wrong with that.

[+] rahimnathwani|7 years ago|reply
Not worth reading. For example, this makes no sense:

"(2) decision science, which is about “taking data and using it to help a company make a decision”; and (3) machine learning, which is about “how can we take data science models and put them continuously into production."

Machine learning isn't about putting models into production. It's about learning models directly from data.

And if decision science is 'taking data and using it to help a company make a decision', then pretty much any job involves data science, e.g. the guy comparing quotes for paperclips and picking a vendor.

[+] jamesblonde|7 years ago|reply
Who are Data Scientists' heroes? Seriously.

In AI, it's Hinton, Le Cun, Bengio.

In systems, it's D Ritchie, J Dean, Berners-Lee, Torvalds.

In distributed systems, it's Lampord, Chandy, J Dean.

In programming languages, it's D Ritchie, Gosling, Dijkstra, Knuth, Milner, etc.

Who are data scientists' heroes or role models?

[+] curiousgal|7 years ago|reply
Hadley Wickham probably.
[+] 77ko|7 years ago|reply
Florence Nightingale, maybe? Her data viz is really impressive, like her 'coxcomb' diagram on mortality in the army:

https://www.theguardian.com/news/datablog/2010/aug/13/floren...

> She was also a pioneer in the graphical presentation of data. At a time when research reports were only beginning to include tables, Nightingale was using bar and pie charts, which were colour coded to highlight key points (eg, high mortality rates under certain conditions). Nightingale was keen not only to get the science right but also to make it comprehensible to lay people, especially the politicians and senior civil servants who made and administered the laws.

https://ebn.bmj.com/content/4/3/68.full

[+] amrrs|7 years ago|reply
Nate Silver of 538
[+] em500|7 years ago|reply
Efron, Hastie, Tibshirani (basically Stanford stats).
[+] mailshanx|7 years ago|reply
Wouldn't you consider AI / ML to be a specialization of data science?
[+] louthy|7 years ago|reply
> Lampord

Lamport, in case anyone's googling :)

[+] didibus|7 years ago|reply
After reading this article, I still have no idea.

I mean, it made it sound like data scientist is just the same as a business analyst? Is this the new computer scientist vs software engineer?

[+] compcoffee|7 years ago|reply
>I mean, it made it sound like data scientist is just the same as a business analyst?

I think this demonstrates how hard "titles" are, because a "business analyst", in the sense that I learned, is not at all like a data scientist (or data analyst):

"Business Analysis is a research discipline of identifying business needs and determining solutions to business problems. Solutions often include a software-systems development component, but may also consist of process improvement, organizational change or strategic planning and policy development."

Most BA work I've done involved translating business requirements into technical or software requirements.

In other words, who knows...

[+] killjoywashere|7 years ago|reply
"Data scientist" is what they call a statistician on the West Coast. - an East Coast statistician
[+] snackematician|7 years ago|reply
This is true. I'm a Ph.D. statistician with a "data scientist" job position in San Francisco.
[+] crunchlibrarian|7 years ago|reply
Data science lost all of its appeal to me when I spent a weekend diving in and found it was about 70% fiddling with weights until you get the answer you want and 30% trying to figure out why the data was so wrong.
[+] pleasecalllater|7 years ago|reply
A friend of mine told me that in his company people write ETLs, send them to an external service for processing, and get that back - this is what they call "doing data science" :)
[+] dopeboy|7 years ago|reply
As a full stack engineer who knows very little about data science, what courses, libraries, etc. are worth my time to explore? What should I be well versed in to be competent in the future?
[+] TheAceOfHearts|7 years ago|reply
A boring response, but have you studied stats? Knowing stats and a bit of SQL is enough to get you pretty far with a lot of problems. I'd consider those skills an important pre-requisite to more advanced tools and techniques.
[+] itronitron|7 years ago|reply
Whatever you are using in your stack for data access / search probably has some associated capabilities that support data analysis, and hence data science. I would start there.
[+] gaius|7 years ago|reply
> As a full stack engineer

I’m confused, if you know the full stack already don’t you already know all this? It’s all part of the stack after all.

“As a webdev...”

[+] speedplane|7 years ago|reply
Learn how to export a database to CSV, then learn Excel inside and out. It’s scratching the surface, but a good start.
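
For the first step, here is a minimal sketch of dumping one table to CSV with Python's standard library, assuming a SQLite database. The `users` table, its rows, and the `export_table_to_csv` helper are made up for illustration; other databases ship their own export tools.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Write every row of `table` to the file-like object `out`, header first."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out, lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the column name.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


# Hypothetical in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "grace")])

buffer = io.StringIO()
export_table_to_csv(conn, "users", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a file that Excel can open, pass `open("users.csv", "w", newline="")` instead of the in-memory buffer.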
[+] reilly3000|7 years ago|reply
Check out Kaggle.com and Coursera's intro to machine learning.
[+] bane|7 years ago|reply
Mostly connecting to data, cleaning it and finding some place to stash all of it.

Only after that 90% is done can anybody think about modeling data, transforming it, processing it and lastly that glorious 5% of actually analyzing it.

Oh, and then somebody wants the results of the analysis to be put into a fully interactive scalable web application so now we're late.

[+] mcrad|7 years ago|reply
It is a hybrid of software engineering and stats.

Calling it science is a stretch. I can understand if you are solving problems in a traditional scientific field, but if you are doing economic modeling to manage investment risk and optimize profit for an internet company, it's hardly science. What a scam!

[+] natalyarostova|7 years ago|reply
Science isn't a noble and pure endeavor. It's just a methodology to construct predictive power from information.
[+] DrNuke|7 years ago|reply
Maybe data science attains scientific grade when ablation analysis becomes mandatory?
[+] sgt101|7 years ago|reply
I like the idea of ablation analysis, but when did fiddling with it until it changes become science?
[+] jblow|7 years ago|reply
There are some quotes missing from this title ... the proper spelling is Data “Scientists”.
[+] glup|7 years ago|reply
Their work can be perfectly scientific. My problem with it is the redundancy: imagine someone claiming to be a "food chef."