2. Cleaning/understanding data - nearly all data sets I've used have duplicates, missing data, highly anomalous distributions in some fields that indicate we aren't measuring what we think we are, etc. So a lot of my time is spent figuring out what's going on in the data, cleaning up the issues, figuring out which subset of the data is reliable, and then dealing with the biases introduced by what is missing or wrong.
3. Dealing with people who don't understand or respect statistics and data science. For example, I've been brought in to "do the analysis" on an "A/B test" where a team didn't appropriately randomize their samples and also hadn't done a statistical power analysis beforehand, so the test was underpowered anyway. There was just no hope of validating that their change was an improvement.
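For anyone wondering what the "power test beforehand" amounts to, here is a minimal sketch using the standard normal-approximation formula for a two-sided two-proportion z-test. The function name `sample_size_per_arm` and the example rates are illustrative assumptions, not anything from the test described above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm to detect p_base -> p_variant.

    Hypothetical helper: normal approximation for a two-sided
    two-proportion z-test with equal-sized arms.
    """
    if p_base == p_variant:
        raise ValueError("the two rates must differ to size a test")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p_base + p_variant) / 2               # pooled rate under the null
    spread = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    )
    return ceil(spread ** 2 / (p_variant - p_base) ** 2)


# Assumed example: a 10% baseline conversion rate, hoping to detect 12%.
n = sample_size_per_arm(0.10, 0.12)
print(f"need roughly {n} users per arm")
```

Halving the effect you want to detect roughly quadruples the required sample, which is usually where an unplanned test turns out to be underpowered.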
Getting the dimensionality correct between layers in an untyped language is incredibly painful.
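One cheap workaround in an untyped language is to check the layer dimensions up front, before any tensors exist. This is a rough sketch in plain Python; `check_layer_shapes` and the layer tuples are hypothetical and not part of any framework.

```python
def check_layer_shapes(input_dim, layers):
    """Walk a stack of dense layers and fail fast on the first mismatch.

    `layers` is a list of (name, in_dim, out_dim) tuples (an assumed,
    illustrative format). Returns the output size of the final layer.
    """
    current = input_dim
    for name, in_dim, out_dim in layers:
        if in_dim != current:
            raise ValueError(
                f"{name}: expects input of size {in_dim}, got {current}"
            )
        current = out_dim
    return current


# Assumed example network: 784 -> 128 -> 10
final_dim = check_layer_shapes(784, [("fc1", 784, 128), ("fc2", 128, 10)])
print(final_dim)
```

It only covers a chain of fully connected layers, but the error then names the offending layer instead of surfacing deep inside a matrix multiply.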
Getting a prototype working on my personal machine and then having to play AMI roulette on AWS to get it running on a GPU is frustrating and expensive.
Every time I read the TensorFlow documentation I think, "I wish I could pay 500 dollars for a version of this that looked like someone cared about it."
Not related to TF, but the upcoming version of Gorgonia has a solution for that - we basically developed a calculus for dimensionality of data. Gorgonia is deep learning in Go, and is well typed as far as I can tell.
Work on the Coq program proving the calculus is taking longer than expected though.
If you're interested in contributing to Gorgonia, lmk.
Agreed. Many people don't want to think, they only want to know.
One of the things my team does is build data tools. The number of people who want to take data they have hardly looked at, put it through a tool they don't understand, and rely in important ways on the output, is astonishing to me.
At large companies there are lots of people who see themselves as decision makers, not analysts. Analyzing data is beneath them. The end result is that they expect the "Analytics People" to provide a set of different recommendations on what to do, and the business folks decide which recommendations to pursue.
If you have to interact with enough of these people your job as a Data Scientist will be miserable.
IME a number of Data Scientists reinforce this thinking. When people think you are a genius, it's painful to admit that 80% of project goals can be achieved with a simple heuristic.
It's incredible how many "Data Science" problems can be solved with a better dashboard that enables people to look at data in a more useful way.
That it is 90% cleaning, and only 10% modeling. Never had a dataset that was ready to use like the ones on Kaggle. Most of the time I get a mixture of Excel sheets with weird formatting, .csv files and SQL dumps that have questionable encoding, and lots of unnecessary information and missing values.
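A first pass over a file like that usually means counting duplicates and missing values before anything else. Below is a minimal standard-library sketch; the function `profile_rows`, the list of missing-value tokens, and the sample data are all assumptions for illustration.

```python
import csv
import io
from collections import Counter


def profile_rows(csv_text, missing_tokens=("", "NA", "N/A", "null")):
    """Count rows, exact duplicate rows, and missing values per column.

    Hypothetical helper: `missing_tokens` is an assumed list of
    placeholder strings; real exports will need their own list.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    # Normalize each row to a tuple of stripped strings so rows compare exactly.
    rows = [
        tuple((row.get(col) or "").strip() for col in fieldnames)
        for row in reader
    ]
    # Every repeat beyond the first copy of a row counts as a duplicate.
    duplicates = sum(count - 1 for count in Counter(rows).values())
    missing = {
        col: sum(1 for r in rows if r[i] in missing_tokens)
        for i, col in enumerate(fieldnames)
    }
    return {
        "rows": len(rows),
        "duplicate_rows": duplicates,
        "missing_by_column": missing,
    }


# Made-up sample data, only to show the call.
sample = "id,city,amount\n1,Oslo,10\n1,Oslo,10\n2,,NA\n3,Bergen,7\n"
print(profile_rows(sample))
```

This only catches exact duplicates and obvious placeholders; near-duplicates and encoding problems still need a human looking at the data.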
I think Kaggle participants miss out on some of the best parts of being a Data Scientist. Fiddling with data, writing scripts to clean/transform/ingest data, interacting with data owners and subject matter experts, etc. IME that's actually a lot more fun than fiddling with parameters and looking at model performance metrics.
[+] [-] magneticnorth|7 years ago|reply
[+] [-] chewxy|7 years ago|reply
I want to know if there is a way to spread the load for this.
[+] [-] airza|7 years ago|reply
[+] [-] gtrevize|7 years ago|reply
[+] [-] chewxy|7 years ago|reply
[+] [-] bsg75|7 years ago|reply
An expectation that data can eliminate the need for reason and thought is problematic.
I have tried to communicate the reasoning for things like judgemental forecasting but success is hard to achieve.
[+] [-] magneticnorth|7 years ago|reply
[+] [-] apohn|7 years ago|reply
[+] [-] sillyguy123|7 years ago|reply
[+] [-] apohn|7 years ago|reply
[+] [-] rajacombinator|7 years ago|reply
[+] [-] r0f1|7 years ago|reply
[+] [-] apohn|7 years ago|reply
[+] [-] p1esk|7 years ago|reply
[+] [-] avin_regmi|7 years ago|reply
[+] [-] Iwan-Zotow|7 years ago|reply