No matter the industry or the role, interviewing for a job can be stressful and awkward. Through fairly limited interactions, you’re trying to convince a bunch of strangers to hire you—to spend eight hours a day with you—instead of the dozens of other people they’ve considered.
And when you’re on the hunt for a data science role, you have the added pressure of tackling tough technical tests. You may have to solve probability puzzles and write some SQL, then quickly pivot to more casual conversations designed to determine whether you’re a cultural fit.
Even if you’re fully confident in your skills, it’s typically a tremendously taxing experience.
The key to handling pressure and managing stress during the interview process is preparation. And while there’s no way to practice every possible question that might come your way, you can increase your confidence by working through sample scenarios and getting guidance from data scientists who have successfully navigated the process.
To help you nail your next interview, we curated a list of 25 data science interview questions that fall into six different categories.
We then asked four data scientists currently working in the field to weigh in with direct answers and/or insights into what would make an answer stand out. For each question, there will be at least two answers, giving you different perspectives on how to construct your response.
Before we get to the questions, let’s introduce the data scientists:
- Michael Beaumier is a data scientist at Google who previously worked in machine learning and data science at Mercedes Benz Research.
- Ramkumar Hariharan is a senior machine learning scientist at macro-eyes and a mentor for Springboard’s Data Science Career Track.
- Mansha Mahtani is a data scientist at Instagram who also was a data scientist at Blue Apron.
- Danny Wells is a senior data scientist at the Parker Institute for Cancer Immunotherapy and a mentor for Springboard’s Data Science Career Track.
What are the assumptions required for linear regression?
Ramkumar: Some of the key assumptions are (1) low or no correlation between any two independent variables, (2) a linear relationship between the independent variables and the dependent variable, and (3) normally distributed residual errors (the differences between model-predicted Ys and actual Ys).
Michael: Linear regression assumes that the relationship between the input feature space and an outcome is parameterized with a set of weights that never change. This is another way of stating that the outcome variable is simply a linear combination of the input features. “Never change” means that the same linear combination always predicts equally well, i.e., that the data is not heteroskedastic. Finally, linear regression assumes that the features themselves are not correlated with each other.
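One way to make the "features are not correlated" assumption concrete is to compute pairwise correlations before fitting. The snippet below is a minimal sketch using only the standard library; the feature names, the values, and the 0.9 cutoff are all hypothetical and chosen purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical features: two of them carry the same information in different units.
features = {
    "height_cm": [150, 160, 165, 172, 180, 188],
    "height_in": [h / 2.54 for h in [150, 160, 165, 172, 180, 188]],
    "commute_km": [12, 3, 25, 8, 30, 5],
}

CUTOFF = 0.9  # assumed threshold for "too correlated"; tune for your problem

names = list(features)
flagged = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(features[a], features[b])
        if abs(r) > CUTOFF:
            flagged.append((a, b))

print("Highly correlated feature pairs:", flagged)
```

In an interview, mentioning a check like this and what you would do next (drop one of the features, or combine them) shows you can apply the assumption rather than just recite it.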
Mansha: When articulating assumptions in a linear regression, it is often helpful to include examples.
Assumption: Linear regressions assume a linear relationship between the independent variable and the dependent variable.
Example: Age and height could be strongly related, though not in a linear fashion. The relationship looks more like a logistic curve, because height tends to plateau once someone reaches a certain age. An interviewer may ask how you would account for this in your model, and a common answer is to transform the feature so that it reflects the relationship. For example, instead of including age in your model, you could use log(age).
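To see why a transform like log(age) can help, here is a small sketch that fits a one-feature least-squares line twice, once on raw age and once on log(age), and compares R². The data is synthetic and deliberately built to follow a logarithmic curve, so the comparison is illustrative only; real height data would be noisier and would not fit this cleanly.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic, hypothetical data: heights (cm) that rise quickly, then flatten with age.
ages = list(range(1, 19))
heights = [75 + 35 * math.log(age) for age in ages]

r2_raw = r_squared(ages, heights)
r2_log = r_squared([math.log(age) for age in ages], heights)

print(f"R^2 with age as the feature:      {r2_raw:.3f}")
print(f"R^2 with log(age) as the feature: {r2_log:.3f}")
```

The model is still linear in its weights; only the feature changed. That distinction is often what the interviewer is probing for.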
How would you sort a large list of numbers?
Mansha: Although this question is more typical in software engineering interviews, understanding it can be helpful when evaluating which functions to use in your analysis. A common sorting algorithm is mergesort. In simple terms, mergesort divides the list into halves, sorts each half independently by applying the same process to it, and then merges the sorted halves back into a single sorted list.
Sorting algorithms that compare each number against every other number are less efficient but still accomplish the same goal. The interviewer is interested in knowing whether you appreciate how different approaches to solving a problem can require different amounts of computational effort.
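The divide, sort, and merge idea can be sketched in a few lines of Python. This is a teaching version rather than something to use in production, where the built-in sort is the better choice; the function name `merge_sort` is our own.

```python
def merge_sort(values):
    """Return a new sorted list using the divide-and-merge approach."""
    # A list of zero or one items is already sorted.
    if len(values) <= 1:
        return list(values)

    # Divide: split the list in half and sort each half independently.
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])

    # Merge: repeatedly take the smaller front item from the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1

    # One half is exhausted; the rest of the other half is already in order.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))
```

Each level of splitting touches every element once during merging, and there are about log2(n) levels, which is where the O(n log n) running time comes from. Comparing every number against every other number takes O(n²) comparisons instead.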
Ramkumar: Either mergesort or quicksort can be used. While quicksort is often faster in practice, mergesort may be the more dependable choice for very large arrays of numbers, since its running time does not degrade on unfavorable inputs.
Michael: It depends on how large the list of numbers is and how much memory the computer doing the sorting has. For most cases, I would just use a pre-built sorting routine, such as Python’s built-in “sorted” function or the list “sort” method. If the list of numbers is very large, I might need to use a method that can do out-of-core operations (i.e., sorting a subset of the list, serializing that intermediate result, sorting another part of the list, and so on) and then merge the sorted pieces back together.
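The out-of-core approach can be sketched with the standard library alone: sort fixed-size chunks in memory, write each sorted chunk to a temporary file, then stream-merge the files with `heapq.merge`. The function name `external_sort`, the `chunk_size` parameter, and the one-integer-per-line file format are assumptions made for this illustration, not a description of any particular tool.

```python
import heapq
import os
import tempfile

def external_sort(numbers, chunk_size):
    """Yield integers in ascending order, sorting at most chunk_size at a time in memory."""
    chunk_paths = []

    def write_sorted_chunk(chunk):
        # Sort one chunk in memory and serialize it, one integer per line.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{value}\n" for value in chunk)
        chunk_paths.append(path)

    chunk = []
    for value in numbers:
        chunk.append(value)
        if len(chunk) >= chunk_size:
            write_sorted_chunk(chunk)
            chunk = []
    if chunk:
        write_sorted_chunk(chunk)

    # Merge step: heapq.merge reads lazily, keeping only one pending item
    # per chunk file in memory at a time.
    files = [open(path) for path in chunk_paths]
    try:
        streams = [map(int, handle) for handle in files]
        yield from heapq.merge(*streams)
    finally:
        for handle in files:
            handle.close()
        for path in chunk_paths:
            os.remove(path)

print(list(external_sort([5, 3, 8, 1, 9, 2, 7], chunk_size=3)))
```

Here the input is a small in-memory list for demonstration; in a real out-of-core setting, `numbers` would itself be an iterator over a file or database cursor so the full dataset never sits in memory at once.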
What are your favorite data visualization techniques?
Ramkumar: My favorite data visualization technique depends on the problem we are intending to solve! It also depends, obviously, on the kind of data we are trying to visualize (e.g., continuous vs. categorical).
That said, I love using clustermaps in some of my analysis. Clustermaps can be very useful for visualizing multiple dimensions. For one, you can see a color-coded variation across three different features or dimensions on a 2D plot. And when you apply clustering on either dimension, you get to see correlation-based structures in the data.
I also love simple bar plots that can show fundamental trends in the data. And you can see the mean and standard deviation very clearly in a well-constructed bar plot.
Michael: I like using matplotlib with seaborn to visualize data. Generally, I find that statistical summaries such as box plots or violin plots communicate relationships most clearly.
Mansha: As a data scientist, a large part of your role will be to communicate insights in an understandable way. The visualization technique you choose will be highly dependent on the context of the problem, the message you are trying to land, and your audience. In general, there is no compulsion to choose one tool over the other as long as the visual is simple to digest by the expected audience.
Is it better to have too many false positives or too many false negatives?
Mansha: It depends on the problem and what is at stake. If the cost of a false positive is higher than the cost of a false negative, it is preferable to go for a model that reduces false-positive rates.
Michael: It depends on the needs of the model. If the cost of a false positive is huge (an autonomous car kills someone, for example) then you should minimize false positives to zero, even at the expense of more false negatives.
Danny: This is very application-dependent and really comes down to the comparative cost of false negatives and false positives. In cancer diagnostics, you may be OK with having two false positives for every true positive, since a false negative potentially means cancer going undiagnosed (very very bad), while a false positive might lead to an unnecessary biopsy (bad, but not as bad as missing the cancer).
Alternatively, say you’re building a movie recommendation engine. In this case, an excess of false positives (movies you recommend that a user hates) may lead to users losing trust in your tool (bad) while false negatives (missing a movie a user would like) are less bad since there are only so many movies a person will watch.
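The cost comparison in these answers can be turned into a small calculation: assign a cost to each kind of error, then pick the decision threshold that minimizes total cost. The sketch below uses made-up scores, labels, and cost values; in practice the costs would come from the business or clinical context.

```python
def total_cost(scores, labels, threshold, fp_cost, fn_cost):
    """Total cost of errors when predicting positive for scores >= threshold."""
    false_positives = sum(
        1 for score, label in zip(scores, labels) if score >= threshold and label == 0
    )
    false_negatives = sum(
        1 for score, label in zip(scores, labels) if score < threshold and label == 1
    )
    return false_positives * fp_cost + false_negatives * fn_cost

def best_threshold(scores, labels, fp_cost, fn_cost, candidates):
    """Return the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, fp_cost, fn_cost),
    )

# Hypothetical model scores and true labels (1 = positive, 0 = negative).
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
candidates = [0.2, 0.5, 0.7]

# Screening-style setting: a missed positive is assumed far costlier than a false alarm.
screening = best_threshold(scores, labels, fp_cost=1, fn_cost=20, candidates=candidates)

# Recommender-style setting: a bad recommendation is assumed costlier than a missed one.
recommender = best_threshold(scores, labels, fp_cost=5, fn_cost=1, candidates=candidates)

print("Threshold when false negatives are expensive:", screening)
print("Threshold when false positives are expensive:", recommender)
```

With the same model and data, the preferred threshold moves lower when false negatives are expensive and higher when false positives are, which is the trade-off the answers above describe.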
Tell me about a time when you resolved a conflict.
Michael: In middle school, I was a peer mediator. As part of this experience, I learned it was important to first listen to the grievances of the parties involved individually. Then, I would invite each party to repeat the concerns of the other party in their own words. I would find common ground and point out areas for compromise.
Ramkumar: We once had a situation when our team was waiting on another team’s data. After repeated requests, the team did not respond. So, I decided to have a 1:1 conversation with that team lead. I understood that they were short-staffed and had reservations about making any comments public. I resolved the conflict by offering to extend my time and effort to help the other team gather the data. This led to a happy situation and reinforced inter-team support and respect at my organization.
Mansha: In similar behavioral questions, you are not only expected to provide a structured answer but also expected to articulate what you learned from the experience. The STAR framework can be handy to help structure your answer: situation, task, action, result.