Data Science Interview Questions GitHub: Your Ultimate Guide
Explore top GitHub repositories for data science interview questions. Practice coding, ML, stats, and real-world problems to ace your interviews in 2025.

The demand for skilled data scientists continues to surge. According to recent statistics, over 80% of Fortune 500 companies have prioritized hiring data scientists to leverage data-driven decision-making. Aspiring data professionals are expected to demonstrate not only technical skills but also problem-solving abilities, statistical knowledge, and coding proficiency.
While preparing for data science interviews can seem overwhelming, GitHub has emerged as an invaluable resource for interview preparation. The platform hosts many repositories dedicated to data science interview questions, providing aspiring professionals with practice problems, solutions, and real-world examples. These repositories range from beginner-friendly SQL and Python questions to advanced machine learning and deep learning challenges. Using GitHub as a study tool allows candidates to explore code snippets, collaborate with a community of learners, and develop a hands-on understanding of key data science concepts.
1. Python Basics
Q1. How to handle missing data in Python?
Missing data is common in datasets. In Python, we handle it using Pandas:
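import pandas as pd
df = pd.DataFrame({'Age': [25, None, 30, None]})
# Option 1: fill missing values with the column mean.
# Assigning the result back avoids chained in-place modification,
# which newer pandas versions warn about.
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Option 2: drop rows that still contain missing values
# (none remain here after the fill above)
df = df.dropna()
print(df)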
Explanation:
- fillna() replaces NaN with a specified value (for example the mean, median, or mode).
- dropna() removes rows with missing values.
Q2. Explain Python’s list comprehension with an example.
numbers = [1,2,3,4,5]
squared = [x**2 for x in numbers if x % 2 == 0]
print(squared)
Output: [4, 16]
Explanation:
List comprehension is a concise way to create lists. Here, we squared only even numbers.
Q3. Difference between shallow copy and deep copy in Python
- Shallow copy: Creates a new outer object but keeps references to the same nested objects; changes to a nested object show up in both copies.
- Deep copy: Recursively copies the nested objects too, so the copy is fully independent of the original.
import copy
original = [[1,2],[3,4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)
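To make the difference visible, here is a small sketch that mutates the original after copying. The specific values (99 and [5, 6]) are only illustrative.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)    # new outer list, same inner lists
deep = copy.deepcopy(original)   # new outer list and new inner lists

original[0][0] = 99              # mutate a nested list in place
original.append([5, 6])          # mutate only the outer list

print(shallow)  # [[99, 2], [3, 4]]  nested change is shared, the append is not
print(deep)     # [[1, 2], [3, 4]]   fully independent of the original
```

The shallow copy sees the nested change because both outer lists point at the same inner list objects.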
Q4. Explain Python’s lambda function.
A lambda function is an anonymous function:
square = lambda x: x**2
print(square(5))
Output: 25
2. Pandas & NumPy
Q5. How do you merge two DataFrames?
import pandas as pd
df1 = pd.DataFrame({'ID':[1,2,3],'Name':['A','B','C']})
df2 = pd.DataFrame({'ID':[1,2,4],'Score':[90,85,75]})
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)
Output:
   ID Name  Score
0   1    A     90
1   2    B     85
Explanation: Inner join returns only matching rows. Other types: left, right, outer.
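As a rough sketch of the other join types, the same two frames can be merged with how='left' and how='outer'. The indicator=True argument is an optional extra that is handy when debugging joins.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [90, 85, 75]})

# Left join: keep every row of df1; Score is NaN where df2 has no match (ID 3)
left = pd.merge(df1, df2, on='ID', how='left')

# Outer join: keep rows from both sides; indicator=True adds a '_merge' column
# recording whether each row came from the left frame, the right frame, or both
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(left)
print(outer)
```

A right join (how='right') mirrors the left join, keeping every row of df2 instead.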
Q6. How to filter DataFrame rows based on a condition?
filtered = df1[df1['ID'] > 1]
Q7. What is vectorization in NumPy?
Vectorization applies an operation to whole arrays at once instead of looping over elements in Python, which is usually much faster.
import numpy as np
a = np.array([1,2,3])
b = a * 2 # multiplies all elements by 2
Q8. How to handle outliers in Pandas?
- Using the IQR method:
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Age'] >= Q1 - 1.5*IQR) & (df['Age'] <= Q3 + 1.5*IQR)]
3. SQL Questions
Q9. How to find the second-highest salary?
SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee
WHERE Salary < (SELECT MAX(Salary) FROM Employee);
Q10. Count employees in each department
SELECT Department, COUNT(*)
FROM Employee
GROUP BY Department;
Q11. Difference between INNER JOIN and LEFT JOIN
- INNER JOIN: Returns only rows that have a match in both tables.
- LEFT JOIN: Returns all rows from the left table plus the matched rows from the right table; right-side columns are NULL where there is no match.
Q12. How to find duplicates in SQL?
SELECT Name, COUNT(*)
FROM Employee
GROUP BY Name
HAVING COUNT(*) > 1;
Q13. Explain window functions in SQL
- Used to perform calculations across a set of rows related to the current row, without collapsing those rows into one as GROUP BY does.
- Examples: ROW_NUMBER(), RANK(), LEAD(), LAG()
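A quick way to try window functions without a database server is Python's built-in sqlite3 module. This is a minimal sketch: the sample rows are made up for illustration, and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Department TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [("Ann", "Eng", 120), ("Bob", "Eng", 100), ("Cid", "Eng", 100),
     ("Dee", "Ops", 90), ("Eve", "Ops", 80)],  # illustrative data
)

query = """
SELECT Name, Department, Salary,
       -- rank within each department; ties share a rank
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS SalaryRank,
       -- salary of the previous row in the same department
       LAG(Salary) OVER (PARTITION BY Department ORDER BY Salary DESC, Name) AS PrevSalary
FROM Employee
ORDER BY Department, Salary DESC, Name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every input row is still present in the result, which is the key difference from an aggregate with GROUP BY.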
4. Statistics & Probability
Q14. Explain the Central Limit Theorem (CLT)
- The distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the samples are independent and the population has finite variance.
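The theorem is easy to check with a small simulation. The sketch below draws samples from a skewed exponential distribution with mean 1 and standard deviation 1; the sample size, number of trials, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)
n, trials = 50, 2000  # sample size and number of repeated samples

# Each entry is the mean of one sample of size n from an exponential(1) population
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)

print(mean_of_means)  # close to the population mean, 1.0
print(spread)         # close to sigma / sqrt(n) = 1 / sqrt(50)
```

Plotting a histogram of sample_means would show a roughly bell-shaped curve even though the underlying population is strongly skewed.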
Q15. Difference between Type I and Type II errors
- Type I: Rejecting a true null hypothesis (false positive)
- Type II: Failing to reject a false null hypothesis (false negative)
Q16. What is a p-value?
- The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
- By convention, p < 0.05 is often treated as statistically significant; the threshold is a choice, not a law.
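One way to connect p-values with Type I errors is to simulate many experiments in which the null hypothesis really is true. The sketch below uses a two-sided z-test with known standard deviation; the helper name two_sided_p_value, the sample size, and the seed are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean

def two_sided_p_value(sample, mu0, sigma):
    """p-value of a z-test for the mean, with known population sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
trials = 4000

# The null (mean = 0) is true in every trial, so each rejection is a Type I error
false_positives = sum(
    two_sided_p_value([random.gauss(0, 1) for _ in range(30)], mu0=0, sigma=1) < 0.05
    for _ in range(trials)
)
type_i_rate = false_positives / trials
print(type_i_rate)  # roughly 0.05, matching the chosen threshold
```

With a 0.05 threshold, about 5% of true-null experiments are rejected by chance alone, which is exactly what the threshold controls.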
Q17. Difference between correlation and covariance
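- Covariance measures joint variability; its magnitude depends on the units of the variables.
- Correlation normalizes covariance by the product of the standard deviations; it is unitless and ranges over [-1, 1].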
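The unit dependence can be illustrated with a few lines of plain Python. The height and weight values below are made up, and the functions use population (divide by n) formulas for simplicity.

```python
from statistics import fmean, pstdev

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))

heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]        # illustrative data
weights_kg = [50, 58, 63, 72, 80]
heights_cm = [h * 100 for h in heights_m]    # same heights, different unit

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.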
Q18. Explain the z-score and its use
- The z-score measures how many standard deviations a data point is from the mean.
- Formula: $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
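As a small worked example, the formula can be applied to a toy dataset whose mean is 5 and whose population standard deviation is 2. The numbers are chosen only to keep the arithmetic clean.

```python
from statistics import fmean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values
mu = fmean(data)                  # 5.0
sigma = pstdev(data)              # 2.0 (population standard deviation)

z_scores = [(x - mu) / sigma for x in data]
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A common use is flagging points with an absolute z-score above some cutoff, such as 3, as potential outliers.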
5. Machine Learning
Q19. Supervised vs Unsupervised Learning
- Supervised: Labeled data (regression, classification)
- Unsupervised: Unlabeled data (clustering, PCA)
Q20. Overfitting vs Underfitting
- Overfitting: Model too complex; fits training data but fails on test data
- Underfitting: Model too simple; poor performance on both training and test data
Q21. How to handle imbalanced datasets?
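- Resampling: Over-sampling the minority class or under-sampling the majority class
- SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic minority-class samples
- Change evaluation metrics: Use F1-score or ROC-AUC instead of plain accuracy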
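To see why accuracy alone can mislead on imbalanced data, consider a toy classifier that always predicts the majority class. The 5/95 class split and the helper precision_recall_f1 are illustrative; treating undefined precision as 0 is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred):
    """Metrics for the positive class (label 1); undefined ratios count as 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1] * 5 + [0] * 95   # 5% positives
y_pred = [0] * 100            # model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, f1)  # 0.95 0.0
```

The model looks strong by accuracy yet never finds a single positive case, which the F1-score exposes.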
Q22. What is regularization?
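- Penalizes large coefficients to reduce overfitting
- L1 (Lasso): Penalizes the absolute values of coefficients and can shrink some to exactly zero
- L2 (Ridge): Penalizes the squared values of coefficients and shrinks them toward zero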
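For a single feature with no intercept, both penalties have closed-form solutions, which makes the shrinkage easy to see. This is a simplified sketch, assuming the objectives sum((y - w*x)^2) + lam * w^2 for ridge and sum((y - w*x)^2) + lam * |w| for lasso; the function names are illustrative.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold, clamping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def ridge_weight(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)                     # L2: shrinks, never exactly zero

def lasso_weight(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2) / sxx    # L1: can become exactly zero

xs, ys = [1, 2, 3], [2, 4, 6]   # unregularized fit is w = 2
for lam in (0, 14, 60):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

At lam = 60 the lasso weight is exactly zero while the ridge weight is small but still positive, which is why L1 is often used for feature selection.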
Q23. Explain Decision Trees and Random Forest
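- Decision Tree: A single tree of feature-based splits; easy to interpret but prone to overfitting
- Random Forest: An ensemble of trees trained on bootstrapped samples with random feature subsets; averaging their predictions reduces overfitting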
Q24. What is gradient descent?
- Optimization algorithm that minimizes a cost function by repeatedly stepping in the direction of the negative gradient
- Types: Batch, Stochastic, Mini-batch
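Here is a minimal sketch of batch gradient descent fitting a line by minimizing mean squared error. The data, learning rate, and iteration count are arbitrary choices for illustration.

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]    # points on the line y = 2x + 1

w, b = 0.0, 0.0                 # initial parameters
learning_rate = 0.05
n = len(xs)

for _ in range(2000):
    # Batch variant: the gradient uses every data point at each step
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2 / n) * sum(errors)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # approaches 2.0 and 1.0
```

Stochastic gradient descent would compute the gradient from one randomly chosen point per step, and mini-batch from a small random subset.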
Q25. PCA (Principal Component Analysis)
- Dimensionality reduction technique
- Retains maximum variance in fewer features
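One common way to compute PCA is through the singular value decomposition of the centered data. The sketch below uses NumPy on synthetic two-feature data; the data and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Two strongly correlated features, so one direction carries most of the variance
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
# Rows of vt are the principal directions, ordered by variance explained
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)

projected = centered @ vt[0]    # 1-D representation along the first component
print(explained_ratio)
```

Here the first component explains almost all of the variance, so dropping the second loses very little information.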
6. Deep Learning
Q26. Difference between CNN and RNN
- CNN: Good for image processing; uses convolution layers
- RNN: Good for sequence data; uses memory to capture previous steps
Q27. Explain backpropagation in neural networks
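- Algorithm that computes the gradient of the loss with respect to each weight; an optimizer such as gradient descent then uses these gradients to update the weights
- Uses the chain rule for derivatives, working backward from the output layer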
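The chain rule step can be sketched for a single sigmoid neuron with squared-error loss, then checked against a numerical derivative. The input values and function names below are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def grad_w_chain_rule(w, b, x, y):
    a = sigmoid(w * x + b)
    dloss_da = 2 * (a - y)      # derivative of the squared error
    da_dz = a * (1 - a)         # derivative of the sigmoid
    dz_dw = x                   # derivative of w*x + b with respect to w
    return dloss_da * da_dz * dz_dw

w, b, x, y = 0.5, -0.2, 1.5, 1.0
analytic = grad_w_chain_rule(w, b, x, y)

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)
```

In a multi-layer network the same product of local derivatives is accumulated layer by layer, starting from the output.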
Q28. What is dropout, and why is it used?
- Randomly sets a fraction of neuron activations to zero during training
- Reduces overfitting by discouraging the network from relying on any single neuron
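A minimal sketch of the "inverted dropout" variant, one common way to implement it: surviving activations are scaled up during training so that nothing needs to change at inference time. The function signature is an illustrative assumption, not a specific library's API.

```python
import random

def dropout(activations, p, training=True):
    """Zero each activation with probability p (0 <= p < 1) during training."""
    if not training or p == 0:
        return list(activations)          # inference: pass values through unchanged
    keep = 1.0 - p
    # Scale survivors by 1/keep so the expected activation stays the same
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
acts = [1.0] * 10
train_out = dropout(acts, p=0.5)
eval_out = dropout(acts, p=0.5, training=False)
print(train_out)
print(eval_out)
```

During training each value is either 0.0 or 2.0 here, while in evaluation mode the activations are returned as they are.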
Q29. LSTM vs GRU
- Both are gated RNN variants for sequence modeling
- GRU: Fewer parameters; often faster to train
- LSTM: Uses a separate cell state; often preferred for capturing long-term dependencies, though results vary by task
7. Data Visualization
Q30. Explain different types of plots and their uses
- Histogram: Distribution of numerical data
- Boxplot: Median, quartiles, outliers
- Scatter plot: Relationship between 2 variables
- Line plot: Trends over time
- Bar plot: Categorical comparisons
Python Example:
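import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Illustrative data; replace with your own DataFrame
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Sales': [10, 15, 7, 12]})
sns.boxplot(x='Category', y='Sales', data=df)
plt.show()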
Preparing for a data science interview can be challenging, given the breadth of topics involved: Python, SQL, statistics, machine learning, deep learning, and data visualization. The 30 questions above offer a roadmap to the core concepts, practical applications, and commonly used techniques. Working through the questions and answers helps candidates practice problem-solving approaches, efficient coding, and analytical reasoning, all of which matter in real-world data science interviews. Use the examples and explanations to understand why each answer works rather than to memorize it, so that you can tackle unseen problems and answer with confidence.