Yes. My takeaway is that this is not a focus at all in most academic settings. That leaves a huge gap, and it leaves candidates with an academic DS background woefully unprepared and undesirable.
That seems strange to me. People on forums like this often describe Data Science practitioners as "statisticians that can code". If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?
> If academic Data Science programs aren't emphasizing data engineering as part of their curriculum, what differentiates a Data Science program from statistics or business intelligence?
In my experience, they're emphasizing software-based data work like machine learning, but not the (vital) peripherals like cleaning/studying/loading data or monitoring and sanity-checking outputs.
A data science student might get a process-first task like making predictions from data using KNN, regressions, t-tests, or neural nets, choosing a method and optimizing based on performance. A statistics student might focus on theory, choosing an appropriate analysis method in advance based on the dataset, and reasoning about the effects of error instead of just trying to reduce it.
But the data scientist could still be training on a clean, wholly-theoretical dataset or a highly predictable online-training environment. The result is a lot of entry-level data scientists who are mechanically talented but stymied by real-world hurdles. Issues handling dirty or inconstant data, for one. But there are a lot of others: a tendency to do analysis in a vacuum, without taking advantage of knowledge about the domain and data source; or judging output effectiveness based on training accuracy, without asking whether the dataset is (and will stay) well-matched to the actual task.
I don't mean that to sound dismissive; there are lots of people who do all of that well, even newly-trained. But it does seem to be a common gap in a lot of data science education.
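To make the training-accuracy point concrete, here's a minimal sketch. Everything in it is an assumption for illustration: the toy data, the 30% label noise, and the helper names (`make_data`, `nearest_label`, `accuracy`) are made up, not taken from any real course or pipeline. It fits a 1-nearest-neighbour "model" on noisy data and compares accuracy on the training set with accuracy on fresh data from the same source.

```python
import random

def nearest_label(train, x):
    """Label of the training point closest to x (1-nearest-neighbour)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, data):
    """Fraction of rows in `data` whose label matches the 1-NN prediction."""
    hits = sum(nearest_label(train, x) == y for x, y in data)
    return hits / len(data)

def make_data(rng, n, noise=0.3):
    """Hypothetical toy data: label is x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = x > 0.5
        if rng.random() < noise:
            y = not y  # simulate dirty labels
        rows.append((x, y))
    return rows

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(rng, 200)
held_out = make_data(rng, 200)

# Each training point is its own nearest neighbour, so this score is
# perfect no matter how noisy the labels are.
train_acc = accuracy(train, train)
held_out_acc = accuracy(train, held_out)

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

The training score looks perfect because the model has memorized the noise. The held-out score is the one that says anything about the actual task, and even that only holds while new data keeps looking like the data you collected.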
I'm not familiar with academic Data Science programs, but I've worked with statisticians for over fifteen years and they are usually very involved on the data engineering side. If they aren't running the systems themselves, they are working closely with the people who do to test and confirm inputs and outputs before running analyses.
barbecue_sauce|7 years ago
Bartweiss|7 years ago
itronitron|7 years ago
darkxanthos|7 years ago