"The only difference between screwing around and science is writing it down" - Adam Savage.
In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis? It doesn't need to be conflated with machine learning, BI, SQL, etc. It can just be about approaching data analysis with scientific rigor.
My opinion is that the term data science evolved when we started needing cross-functional people who are a blend of:
- domain experts;
- numerical/quantitative specialists (such as statisticians, mathematicians, physicists, STEM people);
- business analysts, business intelligence; and,
- those who traditionally deal with data management, platforms and tools.
That confluence of people was needed amidst the related trends:
- increased government funding for STEM education and brain research;
- marketing from companies such as IBM ("Watson"), the democratization of data and increase in the use of data in daily life;
- the big data wave, subsequent interest in "internet of things" and "digital transformation";
- renewed interest in machine learning and AI (recurrent neural networks and other breakthroughs);
- and others, of course.
We needed to apply more discipline to data analysis - thus data science was born. A formalizing of what many were already doing, to capture the need and changing paradigm. Or so I like to believe.
> why can't data science simply be about applying the scientific method in the realm of data analysis?
Because then someone with a business school education (and zero formal statistical training) wouldn't be able to do it. I joke about waiting for finance's Excel models to be rebranded as AI, as I've already seen a handful of hedge funds rebrand their analysts as data scientists.
> why can't data science simply be about applying the scientific method in the realm of data analysis?
That's what a statistician does.
I've seen these ML and data science people, and most of the time the way they tackle data is radically different from how a statistician would; it is more of an art than a science compared to what a statistician does.
But this could be my biased opinion, based on a small sample of personal experiences.
---
Actually, on the last day of my internship I met a few statistician interns, some of them from Cal (UC Berkeley), and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it). I wish statistics were better at marketing. Oh well.
I've written about this problem before as well[1] - data science seems to be defined uniquely narrowly (and confusingly) compared to other fields.
--
"Data science" is the most natural name for this field. Though fields like "information science" and "political science" are broad, "data science," as it is popularly defined, is uniquely narrow. This is problematic because general fields typically serve as a roadmap for all subfields - a cursory glance at what they are and how they relate. "Data science" today does not provide this road map.
Seems like "Data Scientist" as a job title is too generic.
It's like saying "Web Designer" nowadays, when in reality we have a variety of job specializations like UX strategist (ux only), graphics designer (photoshop), UI developer (html/css only), front-end developer (html/css/js), etc.
> In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis?
I suppose someone who isn't very smart is staffing data analysts with little understanding of science. Most would assume that if you are getting paid to analyze all this valuable data, you would have some grounding in the scientific method.
Tangential: I loved the appearance of the term, "Data Scientist". Scientists are domain experts. A biologist knows cell membranes. A chemist knows valences. For decades statisticians had been pitching that they were helpful without knowing the domain: appear-deliver-and-run consultants. It was wonderful to see a group embrace the importance of knowing the system that is producing the data. I regret the more recent movement of "data science" consultants who try to run models and AI without understanding the systems the data came from.
One thing rarely discussed with the rise of big data is how to do efficient querying, especially at scale.
I've had a ton of data science interviews which ask how to reimplement binary search from scratch (which I would never do on the job), but not anything about how to do efficient JOINs and query nesting.
“It has been a common trope that 80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis.”
Is it really a trope? In my experience, I almost think collecting data is >80%.
80% of time is spent on getting and cleaning the data, 20% of time is spent preparing and delivering reports and presentations.
The super-cool ML stuff that attracts people to the field in the first place accounts for little more than a rounding error in how the time is really spent.
I just started reading Weapons of Math Destruction by mathematician Cathy O'Neil. She warns about big data systems that codify racism and classism from flawed data and self-fulfilling feedback loops. The systems' "unbiased" decisions are opaque, proprietary, and often unchallengeable.
If "big data" informs you that a group of people (race, geographic location, nationality, occupation, gender, etc.) is less likely to, say, pay back a loan, or successfully complete 4 years at university, or avoid insurance claims... must that be anything-ist? Or just a fact? Or, as your post suggests, are you obligated to throw out the conclusion and just assume the inputs must have been "flawed data"?
I just finished building a credit risk ML classifier at the company I work for. The model will be used to decide whether or not we lend money to people/companies.
Adding profiling features does give more accurate predictions. However, I pitched not using these features as a competitive advantage to the founders and they (luckily) agreed. We won’t be using them, and we ended up (with more work of course) getting a similarly performant model without them.
We can and should try to not use those kinds of features and be as fair as possible.
There has been a rise in romantic thought pieces lately about how Data Scientists are wizards and can solve any problem with the real superpower of teamwork. (here's an older example from Instacart: https://tech.instacart.com/data-science-at-instacart-dabbd2d...)
In the real world, the state of affairs in Data Science is more practical and pragmatic. And there's nothing wrong with that.
Not worth reading. For example, this makes no sense:
"(2) decision science, which is about “taking data and using it to help a company make a decision”; and (3) machine learning, which is about “how can we take data science models and put them continuously into production."
Machine learning isn't about putting models into production. It's about machine learning models directly from data.
And if decision science is 'taking data and using it to help a company make a decision', then pretty much any job involves data science, e.g. the guy comparing quotes for paperclips and picking a vendor.
> She was also a pioneer in the graphical presentation of data. At a time when research reports were only beginning to include tables, Nightingale was using bar and pie charts, which were colour coded to highlight key points (eg, high mortality rates under certain conditions). Nightingale was keen not only to get the science right but also to make it comprehensible to lay people, especially the politicians and senior civil servants who made and administered the laws.
>I mean, it made it sound like data scientist is just the same as a business analyst?
I think this demonstrates how hard "titles" are, because a "business analyst", in the sense that I learned, is not at all like a data scientist (or data analyst):
"Business Analysis is a research discipline of identifying business needs and determining solutions to business problems. Solutions often include a software-systems development component, but may also consist of process improvement, organizational change or strategic planning and policy development."
Most BA work I've done involved translating business requirements into technical or software requirements.
Data science lost all of its appeal to me when I spent a weekend diving in and found it was about 70% fiddling with weights until you get the answer you want and 30% trying to figure out why the data was so wrong.
A friend of mine told me that in his company people write ETLs, send them to an external service for processing, and get that back - this is what they call "doing data science" :)
As a full stack engineer who knows very little about data science, what courses, libraries, etc. are worth my time to explore? What should I be well versed in to be competent in the future?
A boring response, but have you studied stats? Knowing stats and a bit of SQL is enough to get you pretty far with a lot of problems. I'd consider those skills an important pre-requisite to more advanced tools and techniques.
Whatever you are using in your stack for data access / search probably has some associated capabilities that support data analysis, and hence data science. I would start there.
Mostly connecting to data, cleaning it and finding some place to stash all of it.
Only after that 90% is done can anybody think about modeling data, transforming it, processing it and lastly that glorious 5% of actually analyzing it.
Oh, and then somebody wants the results of the analysis to be put into a fully interactive scalable web application so now we're late.
Calling it science is a stretch. I can understand if you are solving problems in a traditional scientific field, but if you are doing economic modeling to manage investment risk and optimize profit for an internet company, it's hardly science. What a scam!
[+] [-] sixdimensional|7 years ago|reply
In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis? It doesn't need to be conflated with machine learning, BI, SQL, etc. It can just be about approaching data analysis with scientific rigor.
My opinion is that the term data science evolved when we started needing cross-functional people who are a blend of:
- domain experts;
- numerical/quantitative specialists (such as statisticians, mathematicians, physicists, STEM people);
- business analysts, business intelligence; and,
- those who traditionally deal with data management, platforms and tools.
That confluence of people was needed amidst the related trends:
- increased government funding for STEM education and brain research;
- marketing from companies such as IBM ("Watson"), the democratization of data and increase in the use of data in daily life;
- the big data wave, subsequent interest in "internet of things" and "digital transformation";
- renewed interest in machine learning and AI (recurrent neural networks and other breakthroughs);
- and others, of course.
We needed to apply more discipline to data analysis - thus data science was born. A formalizing of what many were already doing, to capture the need and changing paradigm. Or so I like to believe.
[+] [-] JumpCrisscross|7 years ago|reply
Because then someone with a business school education (and zero formal statistical training) wouldn't be able to do it. I joke about waiting for finance's Excel models to be rebranded as AI, as I've already seen a handful of hedge funds rebrand their analysts as data scientists.
[+] [-] digitalzombie|7 years ago|reply
That's what a statistician does.
I've seen these ML and data science people, and most of the time the way they tackle data is radically different from how a statistician would; it is more of an art than a science compared to what a statistician does.
But this could be my biased opinion, based on a small sample of personal experiences.
---
Actually, on the last day of my internship I met a few statistician interns, some of them from Cal (UC Berkeley), and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it). I wish statistics were better at marketing. Oh well.
[+] [-] alexpetralia|7 years ago|reply
--
"Data science" is the most natural name for this field. Though fields like "information science" and "political science" are broad, "data science," as it is popularly defined, is uniquely narrow. This is problematic because general fields typically serve as a roadmap for all subfields - a cursory glance at what they are and how they relate. "Data science" today does not provide this road map.
[1] https://alexpetralia.com/posts/2016/6/22/reclaiming-the-term...
[+] [-] fouc|7 years ago|reply
It's like saying "Web Designer" nowadays, when in reality we have a variety of job specializations like UX strategist (ux only), graphics designer (photoshop), UI developer (html/css only), front-end developer (html/css/js), etc.
[+] [-] mcrad|7 years ago|reply
I suppose someone who isn't very smart is staffing data analysts with little understanding of science. Most would assume that if you are getting paid to analyze all this valuable data, you would have some grounding in the scientific method.
[+] [-] NPMaxwell|7 years ago|reply
[+] [-] em500|7 years ago|reply
Day to day, it's mostly SQL or, worse, Hive queries, which make most things much slower than they should be.
[+] [-] minimaxir|7 years ago|reply
I've had a ton of data science interviews which ask how to reimplement binary search from scratch (which I would never do on the job), but not anything about how to do efficient JOINs and query nesting.
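To make the JOINs-versus-nesting point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `users` and `orders` tables, their columns, and the index name are all made up for illustration; they are not from any real interview or schema. It computes the same per-country revenue twice, once with a correlated subquery and once with a plain JOIN, then asks the engine for its query plan. Which form is actually faster depends on the engine and the data, so treat this as a way to compare, not as a verdict.

```python
import sqlite3

# Hypothetical toy schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'US'), (2, 'DE'), (3, 'US');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 3, 2.5);
    -- An index on the join key lets the engine look up orders per user.
    CREATE INDEX idx_orders_user_id ON orders (user_id);
""")

# Nested form: a correlated subquery that runs once per user row.
nested_sql = """
    SELECT u.country,
           SUM((SELECT COALESCE(SUM(o.amount), 0)
                FROM orders o
                WHERE o.user_id = u.id)) AS revenue
    FROM users u
    GROUP BY u.country
    ORDER BY u.country
"""

# Join form: the same result expressed as a single JOIN plus GROUP BY.
joined_sql = """
    SELECT u.country, SUM(o.amount) AS revenue
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.country
    ORDER BY u.country
"""

nested_rows = conn.execute(nested_sql).fetchall()
joined_rows = conn.execute(joined_sql).fetchall()

# The plan text differs between SQLite versions, so it is only printed here.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + joined_sql).fetchall()

print(nested_rows)
print(joined_rows)
for row in plan_rows:
    print(row)
```

Both queries return the same rows on this toy data; on real tables the interesting part is reading the plan output and checking whether the index on the join key is used.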
[+] [-] tspike|7 years ago|reply
[+] [-] appleiigs|7 years ago|reply
Is it really a trope? In my experience, I almost think collecting data is >80%.
[+] [-] gaius|7 years ago|reply
The super-cool ML stuff that attracts people to the field in the first place accounts for little more than a rounding error in how the time is really spent.
[+] [-] visarga|7 years ago|reply
[+] [-] narens|7 years ago|reply
[+] [-] frozenport|7 years ago|reply
[deleted]
[+] [-] thibautg|7 years ago|reply
- months: convince management to give access to data source
- weeks: try to find the connection string
- days: clean up the data (mostly converting dates to yyyy-mm-dd) and importing/exporting csv files
- hours: load data in database, write simple SQL query and simple visualisation
- seconds: brief moment of satisfaction
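For the "days" step in the list above, a rough sketch of the date clean-up might look like this. The list of input formats is an assumption (the comment does not say what the source data looked like), and `to_iso_date` is a hypothetical helper name, not something from a library.

```python
from datetime import datetime

# Assumed input formats, tried in order. For ambiguous values such as
# 01/02/2018 the first matching format wins, so the order is a real decision.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(raw):
    """Convert a date string in one of KNOWN_FORMATS to yyyy-mm-dd."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    # Fail loudly rather than silently guessing at an unknown format.
    raise ValueError(f"unrecognised date format: {raw!r}")


cleaned = [to_iso_date(value) for value in ["2018-07-04", " 31/12/2018", "5.3.2019"]]
print(cleaned)
```

Raising on unknown formats is a deliberate choice in this sketch: a date that is quietly parsed the wrong way is harder to find later than a row that fails during import.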
[+] [-] cpeterso|7 years ago|reply
https://en.wikipedia.org/wiki/Weapons_of_Math_Destruction
[+] [-] listenallyall|7 years ago|reply
[+] [-] gfarah|7 years ago|reply
Adding profiling features does give more accurate predictions. However, I pitched not using these features as a competitive advantage to the founders and they (luckily) agreed. We won’t be using them, and we ended up (with more work of course) getting a similarly performant model without them.
We can and should try to not use those kinds of features and be as fair as possible.
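As a rough illustration of what "not using those kinds of features" can mean in practice, here is a minimal sketch that strips a set of excluded keys from applicant records before they reach any training code. The feature names and the `drop_profiling_features` helper are hypothetical; the comment above does not say which features were dropped or how the pipeline is built.

```python
# Hypothetical names for "profiling" features, for illustration only.
PROFILING_FEATURES = frozenset({"gender", "nationality", "age", "postcode"})


def drop_profiling_features(applicants, excluded=PROFILING_FEATURES):
    """Return copies of the applicant records without the excluded keys."""
    return [
        {key: value for key, value in applicant.items() if key not in excluded}
        for applicant in applicants
    ]


# Made-up example records, not real data.
applicants = [
    {"income": 52000, "existing_debt": 4000, "gender": "F", "postcode": "1011"},
    {"income": 31000, "existing_debt": 9000, "gender": "M", "postcode": "9000"},
]

training_rows = drop_profiling_features(applicants)
print(training_rows)
```

Dropping columns is only the first step: other features can still act as proxies for the removed ones, so the remaining inputs need checking as well.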
[+] [-] minimaxir|7 years ago|reply
In the real world, the state of affairs in Data Science is more practical and pragmatic. And there's nothing wrong with that.
[+] [-] rahimnathwani|7 years ago|reply
"(2) decision science, which is about “taking data and using it to help a company make a decision”; and (3) machine learning, which is about “how can we take data science models and put them continuously into production."
Machine learning isn't about putting models into production. It's about machine learning models directly from data.
And if decision science is 'taking data and using it to help a company make a decision', then pretty much any job involves data science, e.g. the guy comparing quotes for paperclips and picking a vendor.
[+] [-] jamesblonde|7 years ago|reply
In AI, it's Hinton, Le Cun, Bengio.
In systems, it's D Ritchie, J Dean, Berners-Lee, Torvalds.
In distributed systems, it's Lampord, Chandy, J Dean.
In programming languages, it's D Ritchie, Gosling, Dijkstra, Knuth, Milner, etc.
Who are data scientists' heroes or role models?
[+] [-] curiousgal|7 years ago|reply
[+] [-] 77ko|7 years ago|reply
https://www.theguardian.com/news/datablog/2010/aug/13/floren...
> She was also a pioneer in the graphical presentation of data. At a time when research reports were only beginning to include tables, Nightingale was using bar and pie charts, which were colour coded to highlight key points (eg, high mortality rates under certain conditions). Nightingale was keen not only to get the science right but also to make it comprehensible to lay people, especially the politicians and senior civil servants who made and administered the laws.
https://ebn.bmj.com/content/4/3/68.full
[+] [-] amrrs|7 years ago|reply
[+] [-] em500|7 years ago|reply
[+] [-] mailshanx|7 years ago|reply
[+] [-] louthy|7 years ago|reply
Lamport, in case anyone's googling :)
[+] [-] didibus|7 years ago|reply
I mean, it made it sound like data scientist is just the same as a business analyst? Is this the new computer scientist vs software engineer?
[+] [-] compcoffee|7 years ago|reply
I think this demonstrates how hard "titles" are, because a "business analyst", in the sense that I learned, is not at all like a data scientist (or data analyst):
"Business Analysis is a research discipline of identifying business needs and determining solutions to business problems. Solutions often include a software-systems development component, but may also consist of process improvement, organizational change or strategic planning and policy development."
Most BA work I've done involved translating business requirements into technical or software requirements.
In other words, who knows...
[+] [-] killjoywashere|7 years ago|reply
[+] [-] snackematician|7 years ago|reply
[+] [-] crunchlibrarian|7 years ago|reply
[+] [-] pleasecalllater|7 years ago|reply
[+] [-] dopeboy|7 years ago|reply
[+] [-] TheAceOfHearts|7 years ago|reply
[+] [-] itronitron|7 years ago|reply
[+] [-] gaius|7 years ago|reply
I’m confused, if you know the full stack already don’t you already know all this? It’s all part of the stack after all.
“As a webdev...”
[+] [-] speedplane|7 years ago|reply
[+] [-] reilly3000|7 years ago|reply
[+] [-] bane|7 years ago|reply
Only after that 90% is done can anybody think about modeling data, transforming it, processing it and lastly that glorious 5% of actually analyzing it.
Oh, and then somebody wants the results of the analysis to be put into a fully interactive scalable web application so now we're late.
[+] [-] mcrad|7 years ago|reply
Calling it science is a stretch. I can understand if you are solving problems in a traditional scientific field, but if you are doing economic modeling to manage investment risk and optimize profit for an internet company, it's hardly science. What a scam!
[+] [-] natalyarostova|7 years ago|reply
[+] [-] DrNuke|7 years ago|reply
[+] [-] sgt101|7 years ago|reply
[+] [-] jblow|7 years ago|reply
[+] [-] glup|7 years ago|reply