top | item 19635879

Ask HN: What are your biggest pain points as a data scientist?

39 points| uptownfunk | 7 years ago | reply

Some examples: productionizing code, cleaning data, documentation, etc.

16 comments

[+] magneticnorth|7 years ago|reply
1. Dealing with biased data

2. Cleaning/understanding data - nearly all data sets I've used have duplicates, missing data, highly anomalous distributions in some fields that indicate we aren't measuring what we think we are, etc. So a lot of my time is spent figuring out what's going on in the data, cleaning up the issues, figuring out what subset of the data is reliable, and then dealing with the biases introduced by what is missing or wrong.

3. Dealing with people who don't understand or respect statistics and data science. For example, I've been brought in to "do the analysis" on an "A/B test" where the team hadn't appropriately randomized their samples, and hadn't run a statistical power calculation beforehand, so the test was underpowered anyway; there was just no hope of validating that their change was an improvement.
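The power calculation the commenter mentions can be sketched with nothing but the standard library. This is a minimal illustration of the normal-approximation sample-size formula for a two-proportion A/B test; the function name and defaults are illustrative, not from any specific library.

```python
# Illustrative sketch: minimum sample size per arm for a two-proportion
# A/B test, via the standard normal-approximation formula.
from statistics import NormalDist

def required_n(p_base, mde, alpha=0.05, power=0.8):
    """Samples per arm needed to detect an absolute lift `mde` over `p_base`."""
    p_new = p_base + mde
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_b = NormalDist().inv_cdf(power)          # desired power
    # Sum of the variances of the two Bernoulli arms.
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_a + z_b) ** 2 * var / mde ** 2) + 1

# Detecting a 1-point lift on a 10% baseline takes thousands of users per arm,
# which is exactly why skipping this step produces underpowered tests.
n = required_n(0.10, 0.01)
```

Running the calculation before the experiment tells you whether the traffic you have can support the effect size you care about; running it afterward only tells you why the result was inconclusive.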

[+] chewxy|7 years ago|reply
I tried scaling the task of #2 across multiple people recently. It took far longer than if I had done it myself.

I want to know if there is a way to spread the load for this.

[+] airza|7 years ago|reply
Getting the dimensionality correct between layers in an untyped language is incredibly painful. Getting a prototype working on my personal machine and then having to play AMI roulette on AWS to get it running on a GPU is frustrating and expensive. Every time I read the tensorflow documentation I think, "I wish I could pay 500 dollars for a version of this that looked like someone cared about it."
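One common workaround for the dimensionality problem in an untyped setting is cheap runtime shape assertions between layers, so mismatches fail loudly with a readable message rather than deep inside a framework stack trace. A minimal sketch (the helper name and wildcard convention are made up for illustration):

```python
# Illustrative sketch: runtime shape checks between layers.
import numpy as np

def check_shape(x, expected, name="tensor"):
    """Compare x.shape to `expected`; None acts as a wildcard (e.g. batch size)."""
    actual = x.shape
    ok = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    if not ok:
        raise ValueError(f"{name}: expected shape {expected}, got {actual}")
    return x

batch = np.zeros((32, 784))
check_shape(batch, (None, 784), "input")    # passes for any batch size
hidden = batch @ np.zeros((784, 128))
check_shape(hidden, (None, 128), "hidden")  # passes
```

It's no substitute for a type system that tracks dimensions, but it turns a silent broadcasting bug into an immediate, named failure.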
[+] chewxy|7 years ago|reply
Not related to TF, but the upcoming version of Gorgonia has a solution for that - we basically developed a calculus for dimensionality of data. Gorgonia is deep learning in Go, and is well typed as far as I can tell.

Work on the Coq proof of the calculus is taking longer than expected, though.

If you're interested in contributing to Gorgonia, lmk.

[+] bsg75|7 years ago|reply
That other groups in the business want simple, black and white answers to very complex questions.

An expectation that data can eliminate the need for reason and thought is problematic.

I have tried to communicate the reasoning behind things like judgemental forecasting, but with limited success.

[+] magneticnorth|7 years ago|reply
Agreed. Many people don't want to think, they only want to know.

One of the things my team does is build data tools. The number of people who want to take data they have hardly looked at, put it through a tool they don't understand, and rely in important ways on the output, is astonishing to me.

[+] apohn|7 years ago|reply
At large companies there are lots of people who see themselves as decision makers, not analysts. Analyzing data is beneath them. The end result is that they expect the "Analytics People" to provide a set of different recommendations on what to do, and the business folks decide which recommendations to pursue.

If you have to interact with enough of these people your job as a Data Scientist will be miserable.

[+] sillyguy123|7 years ago|reply
High expectations from stakeholders for some 'AI' magic when a heuristic will get us 80% of the way there in 10% of the time.
[+] apohn|7 years ago|reply
IME a number of Data Scientists reinforce this thinking. When people think you are a genius, it's painful to admit that 80% of project goals can be achieved with a simple heuristic.

It's incredible how many "Data Science" problems can be solved with a better dashboard that enables people to look at data in a more useful way.

[+] rajacombinator|7 years ago|reply
Heuristics are way underrated! I’d wager it’s more like 90% there in 1% of the time.
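The heuristic-baseline point above can be made concrete: for forecasting tasks, a "predict the last observed value" persistence rule is a few lines and is the bar any model should have to clear. A minimal sketch (function names are illustrative):

```python
# Illustrative sketch: a persistence ("last value") forecasting baseline.
def persistence_forecast(series, horizon=1):
    """Predict the most recent observation for every future step."""
    return [series[-1]] * horizon

def mae(pred, actual):
    """Mean absolute error, for comparing the baseline against a model."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

history = [3, 5, 7]
forecast = persistence_forecast(history, horizon=2)  # [7, 7]
```

If a trained model can't beat this on held-out data, the "AI" effort isn't paying for itself yet.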
[+] r0f1|7 years ago|reply
That it is 90% cleaning and only 10% modeling. I've never had a dataset that was ready to use like the ones on Kaggle. Most of the time I get a mixture of Excel sheets with weird formatting, .csv files and SQL dumps with questionable encodings, and lots of unnecessary information and missing values.
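A defensive first pass over a file like the ones described - unknown encoding, blank cells, duplicate rows - can be sketched with the standard library alone. The encoding list and function name here are illustrative assumptions, not a general-purpose solution:

```python
# Illustrative sketch: defensive first pass over a CSV of unknown encoding.
import csv
import io

def load_messy_csv(raw_bytes):
    """Try common encodings, normalize blank cells to None, drop exact duplicates."""
    for enc in ("utf-8", "utf-8-sig", "latin-1"):
        try:
            text = raw_bytes.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    rows = [tuple(cell.strip() or None for cell in row)
            for row in csv.reader(io.StringIO(text))]
    seen, out = set(), []
    for row in rows:
        if row not in seen:      # keep first occurrence of each exact row
            seen.add(row)
            out.append(row)
    return out

raw = "a,b\n1, \n1, \nx,2\n".encode("utf-8")
rows = load_messy_csv(raw)  # header plus two unique rows
```

Real-world files usually need per-column type coercion and domain checks on top of this, which is where the 90% goes.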
[+] apohn|7 years ago|reply
I think Kaggle participants miss out on some of the best parts of being a Data Scientist. Fiddling with data, writing scripts to clean/transform/ingest data, interacting with data owners and subject matter experts, etc. IME that's actually a lot more fun than fiddling with parameters and looking at model performance metrics.
[+] p1esk|7 years ago|reply
Getting lots of data (for deep learning models), cleaning that data, and labeling it.
[+] avin_regmi|7 years ago|reply
What about playing with different hyperparameters? I always found that time-consuming. What do you guys think?