top | item 24460141

Data science interview questions with answers

242 points | fagnerbrack | 5 years ago | github.com

71 comments

[+] ryndbfsrw|5 years ago|reply
I’ve worked in the field for 7 years now, so not that long, but long enough to build some heuristics. The best data scientists are just people who try to understand the ins and outs of business processes and look at problems with healthy suspicion and curiosity. The ability to explain the nuances of manifolds in SVMs is not something that comes into it outside these contrived interviews. I prefer to ask candidates how they would approach solving a problem I’m facing at that moment rather than these cookie-cutter tests, which are easy to game and tell me nothing.
[+] y42|5 years ago|reply
>> I prefer to ask candidates how they would approach solving a problem

Word. Totally off topic, but: I've worked in information technology for more than 20 years now, more or less. Not always the same focus, not always full time, but always IT related. I consider myself a good problem solver because of my self-learning and analytical skills.

I recently applied for a job as a BI developer. The interview consisted of 10 questions about SQL. I more or less answered them, getting just 1 or 2 wrong. Not wrong as in "incorrect", but rather "not exactly what we expected" or "you did not see the little traps".

Turns out they didn't take me because of my lack of SQL skills. I do not understand how this kind of recruiting process helps anyone find skilled people, or how it is still common practice. It's frustrating for people like me, who do not have the complete SQL syntax in mind but are flexible in choosing their problem-solving approaches. A couple of years ago I started at a big data company, having never heard of MongoDB before and with little skill in Bash. If they had just asked me questions about those, hiring me would have been a total no-go. They did hire me. I improved processes, measurably, and mastered MongoDB. Nothing that one could predict from a questionnaire.

A second interview, same outcome. They didn't even ask me detailed questions, just wanted to know what my SQL skills are. I answered: intermediate, but I'm good at learning. They did not take me either.

Although I understand that it's hard to evaluate this kind of skill, I'm really frustrated when I face those "hiring techniques". Or maybe I'm just not good at SQL, and they anticipated it.. ;)

[+] Breza|5 years ago|reply
I completely agree! I run a data science department at a corporation and I give applicants real world data and ask them to answer real world questions. "Here's a dataset of the dates we sent an email and the open rate. Does one day of the week have a higher open rate than the others?" People have to know how to turn a string into a date, turn a date into the day of week, and then apply statistical reasoning and then give me a one sentence answer. That kind of thing is far more common at work than explaining the kernel trick.
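
The mechanics described above can be sketched in a few lines of Python (hypothetical column layout and numbers; a real answer would also apply a proper statistical test, e.g. ANOVA or a chi-square, rather than just comparing averages):

```python
# Sketch of the take-home task described above: parse date strings,
# derive the weekday, and compare average open rates per weekday.
# The dataset and numbers are made up for illustration.
from datetime import datetime
from collections import defaultdict
from statistics import mean

# Hypothetical dataset: (send_date, open_rate)
rows = [
    ("2020-06-01", 0.21),  # Monday
    ("2020-06-02", 0.19),  # Tuesday
    ("2020-06-04", 0.23),  # Thursday
    ("2020-06-08", 0.25),  # Monday
    ("2020-06-09", 0.18),  # Tuesday
    ("2020-06-11", 0.22),  # Thursday
]

by_weekday = defaultdict(list)
for date_str, open_rate in rows:
    # String -> date -> day-of-week name
    weekday = datetime.strptime(date_str, "%Y-%m-%d").strftime("%A")
    by_weekday[weekday].append(open_rate)

avg_by_weekday = {day: mean(rates) for day, rates in by_weekday.items()}
best_day = max(avg_by_weekday, key=avg_by_weekday.get)
```

The one-sentence answer is then just "emails sent on {best_day} had the highest average open rate", ideally qualified by whether the difference is statistically meaningful.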
[+] RobinL|5 years ago|reply
Agree. In addition to curiosity another quality I think is very important is persistence. A surprising amount of being an effective data scientist comes down to being able to learn new things quickly and being able to make the computer do what you want it to, especially when the first few things you try don't work.

Success is often a trial-and-error process, involving slowly building a deep understanding of the problem you're trying to solve (business and technical), hitting lots of problems, and not giving up too easily (at least, not giving up due to surmountable technical hurdles).

This often means spending many hours banging your head against a brick wall; but in my experience these are often the times I'm learning quickest, even if it doesn't feel like it at the time.

[+] Jugurtha|5 years ago|reply
Diving deep into a domain and interfacing with a subject matter expert goes a long, long way. We've been building custom data products for enterprises for about seven years, and a lot of the work is listening to and grokking large amounts of content about whatever domain we're helping with in general, and the specifics of our clients: retail, banking, telcos, energy, communication.

Having a background in acoustics, reservoir characterization, or telecom networks opens up clients because you 'get it', or at least you work hard to get it, which improves the experts' buy-in to sit down with you and answer your questions. You did your homework.

If you don't, and just storm in talking about something something neural nets, they'll see it as a waste of time, won't bother explaining nuances, and will delay sending data you desperately need. You won't have their cooperation even if you have executive support. There's no data in CSV form or an API to hit in most real-world projects, so you need their help getting data, and their expertise to understand it.

Another major point is specifying the metrics. The real world metrics, not AUC or F1 scores. You need collaboration to get there, too.

There's so much to be done before there's data to work with, let alone good data. And there's so much after the model building step.

It can drive people to quit. One reason is that when you storm in and assume people are morons, you get frustrated rapidly.

[+] Eugeleo|5 years ago|reply
> The best data scientists are just people who try to understand the ins and outs of business processes and look at problems with healthy suspicion and curiosity

I guess the unsaid part of this is “... and curiosity AND are often able to leverage this in data-driven solutions to business problems”. Because nobody cares whether you’re curious or not if you don’t bring any value to the company. With that said, the part you mentioned almost seems like an innate ability, while the part I filled in could probably be trained.

Suppose you were still in college and wanted to become a data scientist. Based on your current knowledge and values, how would you go about it, being a student? Which skills would you hone and how?

[+] starpilot|5 years ago|reply
This is why I'm switching from DS to SWE. The communication hurdles with the business people are so hard for me. I've talked with other nerds my whole life and struggle to connect with and dissect the other side. That and the pay is better.
[+] fractionalhare|5 years ago|reply
The quality and depth of answers here is pretty inconsistent. But this in particular is a pet peeve of mine:

> Plot a histogram out of the sampled data. If you can fit the bell-shaped "normal" curve to the histogram, then the hypothesis that the underlying random variable follows the normal distribution can not be rejected.

This is commonly taught in undergrad stats, but you shouldn't do this. I'm of the opinion that normality testing in general is usually a red herring, but this is specifically not a productive way of doing it. Use the other methods. A visual test that relies on how much the histogram approximates a bell curve is very prone to error, because a sample from a variety of other distributions can look visually normal even though it isn't.

More broadly, the reason I don't like this is that it's an example of the kind of formulaic, cargo-culted recipe that is often used in statistics without critical thinking. You should strive to obtain a deep understanding of your data and its distribution, and you should be deeply skeptical if the sample you happen to have looks normal. Nature abhors normality, and the central limit theorem can only promise a tendency towards normality as n approaches infinity. It says nothing about what size sample you'll practically need for your specific data before you can treat it as normal.
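
To illustrate the point (my own toy example, not from the repo): a Student-t sample is symmetric and bell-shaped, so its histogram can pass a visual "normality check", yet its tails are measurably heavier than any normal distribution's:

```python
# A bell-shaped histogram does not imply normality: compare the excess
# kurtosis of a Student-t(10) sample (heavy tails) with a true normal.
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
t_sample = rng.standard_t(df=10, size=n)   # bell-shaped, but NOT normal
normal_sample = rng.standard_normal(size=n)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a true normal)."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

# Both samples are symmetric and unimodal; the tails give the game away:
k_t = excess_kurtosis(t_sample)            # theory: 6 / (df - 4) = 1.0
k_normal = excess_kurtosis(normal_sample)  # theory: 0.0
```

A histogram of `t_sample` would look convincingly "normal" to the eye, which is exactly why the visual test is error-prone.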

[+] jpeloquin|5 years ago|reply
> You should strive to obtain a deep understanding of your data and its distribution, and you should be deeply skeptical if the sample you happen to have looks normal.

Although normality testing is useless in many situations, the parent comment somewhat overstates the degree of caution required. In many contexts the exact distribution doesn't matter; sort-of-normal is good enough. For example, the t-test is used ubiquitously. It assumes normality, so we would expect possible non-normality to be a major problem, right? Not so. The t-test is extremely robust to departures from normality given equal sample sizes [1,2,3]. Or you can use a so-called non-parametric test. Rather than investing great effort in specifying exactly what distribution you're dealing with, it's more productive to simply use a test that is robust against your unknowns and move on to pursuing your actual objectives.

It's true that if you are interested in predicting events in the tails of the distribution, you really do need to study the distribution in detail. Predicting rare events is very difficult. But if you're just interested in differences between group means, don't overthink it.

[1] http://www.jerrydallal.com/LHSP/student3.htm
[2] Posten, H.O., Yeh, H.C., and Owen, D.B. (1977). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption. Communications in Statistics - Theory and Methods 11, 109–126.
[3] Posten, H.O. (1992). Robustness of the two-sample t-test under violations of the homogeneity of variance assumption, part II. Communications in Statistics - Theory and Methods 21, 2169–2184.
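
One concrete example of a test that is "robust against your unknowns", in the spirit of the comment above (a sketch of my own, not something the cited papers prescribe): a permutation test on the difference in group means, which assumes nothing about the distribution's shape:

```python
# Permutation test for a difference in group means: under H0 the group
# labels are exchangeable, so compare the observed difference to the
# distribution of differences under random relabelings of the data.
import numpy as np

rng = np.random.default_rng(42)

def permutation_test(a, b, n_perm=10_000, rng=rng):
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one to avoid p == 0

# Skewed, decidedly non-normal data with a real difference in means:
a = rng.exponential(scale=1.0, size=200)
b = rng.exponential(scale=1.5, size=200)
p = permutation_test(a, b)
```

No normality assumption enters anywhere, which is the point: you spend your effort on the actual question (do the means differ?) rather than on certifying a distribution.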

[+] tgv|5 years ago|reply
There's a test for normality. Well, I knew one, Kolmogorov–Smirnov, but Wikipedia already lists 8. What data scientist doesn't know of such a test?

And how would you fit the data? There isn't one unique way to fit. And as you say, a small deviation can mean a lot: if you use the L2 distance, the errors in the outer regions of a normal distribution are probably dwarfed by any deviation that occurs closer to the center.
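
For the curious, the Kolmogorov–Smirnov statistic itself is simple to sketch: it's the largest gap between the sample's empirical CDF and the reference CDF (here, standard normal; a full test would also compare the statistic against critical values):

```python
# One-sample Kolmogorov-Smirnov statistic against the standard normal:
# sup_x |F_n(x) - F(x)|, where F_n is the empirical CDF of the sample.
import math
import numpy as np

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(sample):
    xs = np.sort(sample)
    n = len(xs)
    cdf = np.array([normal_cdf(v) for v in xs])
    # The empirical CDF jumps at each point, so check both sides of the jump:
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
d_normal = ks_statistic(rng.standard_normal(2_000))   # small gap
d_uniform = ks_statistic(rng.uniform(-1, 1, 2_000))   # large gap
```

Unlike the L2-fit-to-histogram approach, the statistic measures the worst-case CDF discrepancy directly, so it isn't dominated by any one region of the distribution.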

[+] minimaxir|5 years ago|reply
The weirdest thing about data science interviews (when I was actively interviewing) is SQL gotcha questions. Especially with window functions. Here's a long HN thread a few months ago about an annoying situation with an interviewer asserting uncommon syntax is common: https://news.ycombinator.com/item?id=23053981

Another related SQL gotcha I saw multiple times is finding the top n records of each group in a table. It's a know-it-or-you-don't implementation, and the interviewer can still be a jerk if they want by slamming the interviewee for including ties, or for not including them (RANK vs. ROW_NUMBER).
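
For readers who haven't hit this gotcha, here's a sketch using Python's built-in sqlite3 (window functions require SQLite >= 3.25; the table and column names are made up):

```python
# "Top n per group" with window functions: RANK keeps ties (can return
# more than n rows per group), ROW_NUMBER breaks them arbitrarily.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 100), ("a", 90), ("a", 90), ("a", 50),
     ("b", 70), ("b", 60), ("b", 10)],
)

# Top 2 per dept, ties included: dept "a" yields THREE rows (100, 90, 90).
with_ties = con.execute("""
    SELECT dept, amount FROM (
        SELECT dept, amount,
               RANK() OVER (PARTITION BY dept ORDER BY amount DESC) AS rk
        FROM sales
    ) WHERE rk <= 2
    ORDER BY dept, amount DESC
""").fetchall()

# Top 2 per dept, ties broken arbitrarily: exactly two rows per dept.
no_ties = con.execute("""
    SELECT dept, amount FROM (
        SELECT dept, amount,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY amount DESC) AS rn
        FROM sales
    ) WHERE rn <= 2
    ORDER BY dept, amount DESC
""").fetchall()
```

Whether the "right" answer is RANK, DENSE_RANK, or ROW_NUMBER depends entirely on how the interviewer feels about ties, which is exactly the problem.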

It's telling that there aren't any SQL window function examples in this repo.

Another fun aspect of SQL interviews is dialect-specific questions, particularly around how dates/times are handled. Years ago, a company famous/infamous for primarily using MySQL explicitly noted in a take-home assignment's problem definition that the database was PostgreSQL, which allowed them to ask the aforementioned window function problem, use the AT TIME ZONE syntax for filtering, and rely on a specific definition of "beginning of week", which required me to download the database and test it manually.

[+] claudeganon|5 years ago|reply
These kinds of stories seem insane to me. I currently have to do some data science-y stuff in the more generalist consulting dev role I have. I used to do a lot of SQL work years ago, but have forgotten most of the syntax beyond the basics. That said, all it took was some minor googling and reading a few blog posts to solve some middling-hard problems for my client.

What kind of companies make people jump through these nerd hoops? Do they actually have real work that needs doing or is this all just posturing by interviewers?

[+] bobdosherman|5 years ago|reply
Linear regression does not require errors to be iid, normal, and homoscedastic for it to "work". One way to separate candidates is to push on which assumptions can be weakened (and how), what the consequences are for estimation and inference, and what sort of corrections can be incorporated for maintaining consistency, improving efficiency, correcting biases, etc. An entry-level candidate may not have (nor need to have) a complete understanding of asymptotic theory, but they should know what the purpose of robust standard errors is and how to use them.
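
As a concrete sketch of that last point (plain NumPy, White/HC0 form; the data and coefficients here are invented for illustration): fit OLS on heteroscedastic data, then compare classical and robust standard errors.

```python
# OLS with classical vs. heteroscedasticity-robust (HC0) standard errors.
# The error variance grows with x, violating homoscedasticity; the OLS
# point estimates are still fine, but the classical SEs are misleading.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])                 # design with intercept
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 + 0.5 * x)     # heteroscedastic noise

# OLS estimate: beta = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SEs assume a single error variance sigma^2:
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 robust SEs put the squared residuals in the "meat" of the sandwich:
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
```

Here the slope estimate stays close to the true 3.0 either way; what changes is the inference, since the robust SEs account for the variance growing with x.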
[+] bonoboTP|5 years ago|reply
Interesting and perhaps shows the cultural differences between ML and stats people. I took a machine learning course in my bachelor's and two more ML courses in my master's (CS). These weren't some "deep learning lite", mess-around-in-Keras courses, because DL wasn't even big back then. We covered lots of stuff, Bayesian linear regression, Gaussian processes, Gibbs sampling, Metropolis-Hastings, hierarchical Dirichlet processes, SVMs, multi-class SVM, PCA, kernel PCA, perceptrons, CMAC neural nets, Hebbian learning, AdaBoost, Fisher vectors, EM algorithms for various distributions, fuzzy logic, optimization methods like conjugate gradients etc etc.

But not once were the "Gauss-Markov conditions" mentioned. Frequentist theory was only marginally addressed. I taught myself some of that stuff from the Internet, such as hypothesis testing theory, p-values, t statistic, ANOVA, etc.

Also, I'd say I'm good with data structures and algorithms, complexity theory, graph theory etc.

I thought these skills would be a good fit for data science jobs, but I guess it's really such a wide umbrella term, that probably you're more looking for people trained in the frequentist, statistical side of it. What application field are you in, if it's no secret?

[+] notafraudster|5 years ago|reply
Not really sure why your comment is marked dead when it is correct. Maybe it's a new account thing? As you say, the Gauss-Markov conditions are necessary (and sufficient) to make OLS BLUE, but OLS still works fine under a variety of pathological conditions, and many of those conditions can be tested for and adjusted for using common techniques.
[+] kkoncevicius|5 years ago|reply
I only read the first question in theory.md and think the answer is quite weak.

> What is regression? Which models can you use to solve a regression problem?

The current answer only lists some model names that have "regression" in them, and its description of what a regression is doesn't say anything that distinguishes it from classification.

It fails to mention that regression (in ML terminology) is prediction of a continuous variable, and that almost any method can be used to do regression: k-NN, neural networks, random forests, SVMs.

If other answers are of similar nature you might fail the interview.
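
To make the parent's point concrete (a toy illustration of my own, not a repo answer): even a method usually introduced for classification, k-nearest neighbours, does regression simply by averaging the targets of nearby points.

```python
# Toy k-nearest-neighbours regressor in plain NumPy: predict a continuous
# target by averaging the targets of the k closest training points.
import numpy as np

def knn_regress(X_train, y_train, X_query, k=3):
    preds = []
    for q in X_query:
        dist = np.linalg.norm(X_train - q, axis=1)
        nearest = np.argsort(dist)[:k]        # indices of the k closest points
        preds.append(y_train[nearest].mean())
    return np.array(preds)

# Noisy samples of a smooth continuous function:
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.05, 200)

X_query = np.array([[2.0], [5.0]])
y_hat = knn_regress(X_train, y_train, X_query, k=5)   # close to sin(2), sin(5)
```

The only thing that makes this "regression" rather than "classification" is that the output is a continuous number, which is exactly the distinction the repo's answer omits.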

[+] savagedata|5 years ago|reply
One of my favorite data science factoids is how "regression" (to return to a former state) came to mean "prediction of a continuous variable".

In the late 1800s, Sir Francis Galton noticed that extremely tall or short parents usually had children that were not as tall or short as themselves, i.e. the children's heights were regressing (returning) to the mean. He collected hundreds of data points, graphed them, and estimated a coefficient describing this relationship, thereby inventing "linear regression."

We call them "regression" models simply because the first linear regression model was created to demonstrate the concept of regression to the mean.

https://en.wikipedia.org/wiki/Regression_toward_the_mean#His...

[+] notafraudster|5 years ago|reply
The error here is ML terminology; the article answers correctly, if shallowly, the statistical definition of regression.

Classical regression techniques can be (and are correctly) used on binary, ordinal, or categorical dependent variables. I know we teach people doing ML that the two forms of supervised ML are classification and regression, but that does a disservice mostly in order to make visual examples in teaching easier by making every topic a binary classification question and then saying "oh yeah, this works for regression too".

Granted in an interview you'd probably want to use context in case the hiring people were trained on a specific vocab, but that maybe speaks more to the folly of using these dial-an-answer systems in place of actually learning the material.

[+] dgellow|5 years ago|reply
From the README:

> The answers are given by the community

> If you know how to answer a question — please create a PR with the answer

> If there's already an answer, but you can improve it — please create a PR with improvement suggestion

> If you see a mistake — please create a PR with a fix

[+] atty|5 years ago|reply
I read two answers from technical.md, and one of them was inconsistent with what was asked for...

I would be very suspicious of using this repo for studying.

[+] latentdeepspace|5 years ago|reply
Do people realize that these interview question collections do not help? I think there are two things to address here:

- Interviewers will know "what is known" by every candidate (with the help of these pages), and harder questions will be asked

- If these questions are asked at above-junior levels, then RUN! The work will not satisfy you. The interview should be fun and show the creativity of the candidate. These questions could be answered by anyone who has read them a few times. I would not like to work with somebody who only knows the answers to these questions and nothing more

[+] marcusabu|5 years ago|reply
I think it's a bit far-fetched to assume that recruiters tailor interview questions based on these GitHub repos. For me as a junior data scientist it's really useful to test my own knowledge and highlight areas I need to study more.
[+] lumost|5 years ago|reply
Having somewhat standard "objective" questions helps even with senior candidates. You'd be surprised how many seniors struggle with the basics, or aren't as senior as their resume would lead you to believe.
[+] zwaps|5 years ago|reply
Very interesting. I wonder in what type of interviews would these answers be considered ideal? Do people at OpenAI ask these sorts of questions, or is this more targeted toward other industries that require data scientists but are not populated by "stats" experts?

For example, consider "What is regression?". The answer given is one way to go, but if the job involves causal analysis or more statistical know-how, for example, it would probably be insufficient. I would want the candidate to speak about linear projections, sampling assumptions, the probability models underlying the process, and so on.

On the other hand, I could imagine that it would not be a good strategy to lecture about the minute details of regression analysis when applying for a standard data scientist position, sitting in front of "applied data scientists" or even HR folks.

Does anyone have any insights?

[+] data4lyfe|5 years ago|reply
I have to agree that these interview questions function more as a cheat-sheet review than anything practical you would see in an interview. Data science interviews don't function like a biology test where you're just rattling off memorized facts about how neural networks or linear models work.

Ultimately, questions like "What is feature selection?" are more likely to be encapsulated in case studies where the answer to the question itself will be to use feature selection.

For example: "Let's say you have thousands of categorical features for an anonymized dataset involving human traits, how would you figure out which predictors are the most important?"

Source: https://www.interviewquery.com/
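
One hedged sketch of how that case study might be attacked (hypothetical data and feature names; real answers could equally use model-based importances): rank the categorical features by their mutual information with the target.

```python
# Score categorical features by mutual information with a discrete target,
# then rank. Plain NumPy, discrete variables only; a toy illustration.
import numpy as np

def mutual_information(x, y):
    """MI (in nats) between two discrete arrays via their joint frequencies."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    for i, j in zip(x_idx, y_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
target = rng.integers(0, 2, 1_000)
informative = np.where(target == 1, "x", "y")   # strongly predictive feature
informative[rng.random(1_000) < 0.1] = "z"      # ...with some noise mixed in
noise = rng.choice(list("abc"), 1_000)          # unrelated feature

scores = {"informative": mutual_information(informative, target),
          "noise": mutual_information(noise, target)}
```

In the real case study you'd compute this for all of the thousands of features, keep the top-scoring ones, and then talk about the caveats (MI estimates are biased upward for high-cardinality features, redundancy between features, and so on).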

[+] marcinzm|5 years ago|reply
>I have to agree that these interview questions function more as a cheat-sheet review than anything practical you would see in an interview. Data science interviews don't function like a biology test where you're just rattling off memorized facts about how neural networks or linear models work.

My co-workers and I have been asked exactly that by top companies, although it was more for machine learning engineer / applied scientist positions. Many textbook questions, asked one after the other. It's not the only interview type they did, but it definitely mattered and was often the first filter, so if you didn't answer well enough you were out.

[+] rhacker|5 years ago|reply
These are at best machine learning questions, not data science. ML is definitely a sub-category of data science, but if you see a job posting for data science, you're 100% not going to be doing machine learning. That would have been labeled as a machine learning job posting.
[+] teleforce|5 years ago|reply
For Machine Learning theory related to Data Science I'd highly recommend "The Hundred-Page Machine Learning Book" by Andriy Burkov [1].

According to the author, by reading one hundred pages of the book (plus some bonus pages), you will be ready to build complex AI systems, pass an interview, or start your own business.

It is a read-first book: buy it later if you think it is good enough.

[1] http://themlbook.com/

[+] marcinzm|5 years ago|reply
I've seen multiple companies ask candidates to write working machine learning code end-to-end from scratch. As in, write a stochastic-gradient-descent logistic regression model with training, inference, etc., without any libraries beyond pandas/numpy. If you're lucky they'll provide the equations or let you google them. So it's something to memorize, including the various numpy/pandas gotchas.
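
For reference, the exercise described is roughly the following (a sketch under my own arbitrary choices of learning rate, epochs, and toy data, not any specific company's rubric):

```python
# Logistic regression trained with stochastic gradient descent, from
# scratch in plain NumPy: one example per update, log-loss gradient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, y, lr=0.1, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # one example at a time: "stochastic"
            p = sigmoid(X[i] @ w + b)
            # Gradient of the log-loss for a single example is (p - y) * x:
            w -= lr * (p - y[i]) * X[i]
            b -= lr * (p - y[i])
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Linearly separable toy data: the class is the sign of x0 + x1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w, b = train_sgd(X, y)
accuracy = (predict(X, w, b) == y).mean()
```

The interview gotchas tend to live in the details this sketch glosses over: numerical overflow in the sigmoid, shuffling per epoch, vectorizing to mini-batches, and a sensible stopping criterion.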
[+] conjectures|5 years ago|reply
That's a great question. Because it lets you as a candidate screen out places run by morons.