Data Science Interview Questions GitHub: Your Ultimate Guide

Explore top GitHub repositories for data science interview questions. Practice coding, ML, stats, and real-world problems to ace your interviews in 2025.

Nikhil Hegde

Aug 29, 2025

The demand for skilled data scientists continues to surge. According to recent statistics, over 80% of Fortune 500 companies have prioritized hiring data scientists to leverage data-driven decision-making. Aspiring data professionals are expected to demonstrate not only technical skills but also problem-solving abilities, statistical knowledge, and coding proficiency.

Preparing for data science interviews can seem overwhelming, but GitHub has become a valuable resource for it. The platform hosts thousands of repositories dedicated to data science interview questions, with practice problems, solutions, and real-world examples. These repositories range from beginner-friendly SQL and Python questions to advanced machine learning and deep learning challenges. Studying from GitHub lets candidates run and modify code snippets, collaborate with a community of learners, and build a hands-on understanding of key data science concepts.

1. Python Basics

Q1. How to handle missing data in Python?
Missing data is common in datasets. In Python, we handle it using Pandas:

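import pandas as pd

df = pd.DataFrame({'Age': [25, None, 30, None]})

# Option 1: fill missing values with the column mean.
# Assigning the result avoids the chained inplace call, which newer pandas versions warn about.
filled = df.assign(Age=df['Age'].fillna(df['Age'].mean()))

# Option 2: drop rows that contain missing values
dropped = df.dropna()

print(filled)
print(dropped)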

Explanation:

fillna() replaces NaN with a specific value (mean, median, mode).
dropna() removes rows with missing values.

Q2. Explain Python’s list comprehension with an example.

numbers = [1,2,3,4,5]

squared = [x**2 for x in numbers if x % 2 == 0]

print(squared)

Output: [4, 16]

Explanation:
List comprehension is a concise way to create lists. Here, we squared only even numbers.

Q3. Difference between shallow copy and deep copy in Python

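Shallow copy: Creates a new outer object but reuses references to the nested objects; changes to a nested mutable object show up in both the copy and the original.
Deep copy: Recursively copies the nested objects as well; changes to one do not affect the other.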

import copy

original = [[1,2],[3,4]]

shallow = copy.copy(original)

deep = copy.deepcopy(original)
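The difference only becomes visible once a nested object is mutated. The short sketch below reuses the same variable names and shows which copy sees the change; the appended values are arbitrary illustrations.

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # new outer list and new inner lists

# Mutating a nested list is visible through the shallow copy, not the deep copy.
original[0].append(99)

# Appending to the outer list affects neither copy, because each copy
# has its own outer list.
original.append([5, 6])

print(shallow)  # [[1, 2, 99], [3, 4]]
print(deep)     # [[1, 2], [3, 4]]
```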

Q4. Explain Python’s lambda function.

A lambda function is an anonymous function:

square = lambda x: x**2

print(square(5))

Output: 25

2. Pandas & NumPy

Q5. How do you merge two DataFrames?

import pandas as pd

df1 = pd.DataFrame({'ID':[1,2,3],'Name':['A','B','C']})

df2 = pd.DataFrame({'ID':[1,2,4],'Score':[90,85,75]})

merged = pd.merge(df1, df2, on='ID', how='inner')

print(merged)

Output:

   ID Name  Score
0   1    A     90
1   2    B     85

Explanation: Inner join returns only matching rows. Other types: left, right, outer.
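To see how the other join types behave on the same two DataFrames, a minimal sketch follows. It uses the `how` and `indicator` parameters of `pd.merge`; the variable names `left` and `outer` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [90, 85, 75]})

# Left join: keep every row of df1; Score is NaN where df2 has no match.
left = pd.merge(df1, df2, on='ID', how='left')

# Outer join: keep rows from both sides. indicator=True adds a _merge column
# recording whether each row is 'left_only', 'right_only', or 'both'.
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(left)
print(outer)
```

The `_merge` column is a quick way to audit which keys failed to match before deciding on a join type.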

Q6. How to filter DataFrame rows based on a condition?

filtered = df1[df1['ID'] > 1]

Q7. What is vectorization in NumPy?

Vectorization allows element-wise operations without loops, improving efficiency.

import numpy as np

a = np.array([1,2,3])

b = a * 2 # multiplies all elements by 2
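As a rough illustration, the sketch below computes the same result with an explicit Python loop and with a vectorized expression. The array contents are arbitrary; on large arrays the vectorized form is typically much faster because the loop runs in compiled code.

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Loop version: one Python-level multiplication per element.
looped = np.array([v * 2 for v in values])

# Vectorized version: a single expression applied to the whole array.
vectorized = values * 2

# Vectorized operations also combine two arrays element by element.
total = values + vectorized

print(vectorized)  # [ 2  4  6  8 10]
print(total)       # [ 3  6  9 12 15]
```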

Q8. How to handle outliers in Pandas?

Using the IQR method:

Q1 = df['Age'].quantile(0.25)

Q3 = df['Age'].quantile(0.75)

IQR = Q3 - Q1

df = df[(df['Age'] >= Q1 - 1.5*IQR) & (df['Age'] <= Q3 + 1.5*IQR)]

3. SQL Questions

Q9. How to find the second-highest salary?

SELECT MAX(Salary) AS SecondHighestSalary

FROM Employee

WHERE Salary < (SELECT MAX(Salary) FROM Employee);

Q10. Count employees in each department

SELECT Department, COUNT(*)

FROM Employee

GROUP BY Department;

Q11. Difference between INNER JOIN and LEFT JOIN

INNER JOIN: Returns only matching rows
LEFT JOIN: Returns all rows from the left table plus the matching rows from the right table; right-side columns are NULL where there is no match
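The two joins can be compared side by side with Python's built-in sqlite3 module. This is a small sketch: the Employee and Department tables, their columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one employee has no department assigned.
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, DeptID INTEGER);
    CREATE TABLE Department (DeptID INTEGER, DeptName TEXT);
    INSERT INTO Employee VALUES ('Asha', 1), ('Ben', 2), ('Chen', NULL);
    INSERT INTO Department VALUES (1, 'Sales'), (2, 'Engineering');
""")

inner_rows = conn.execute("""
    SELECT e.Name, d.DeptName
    FROM Employee e
    INNER JOIN Department d ON e.DeptID = d.DeptID
    ORDER BY e.Name
""").fetchall()

left_rows = conn.execute("""
    SELECT e.Name, d.DeptName
    FROM Employee e
    LEFT JOIN Department d ON e.DeptID = d.DeptID
    ORDER BY e.Name
""").fetchall()
conn.close()

print(inner_rows)  # [('Asha', 'Sales'), ('Ben', 'Engineering')]
print(left_rows)   # [('Asha', 'Sales'), ('Ben', 'Engineering'), ('Chen', None)]
```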

Q12. How to find duplicates in SQL?

SELECT Name, COUNT(*)

FROM Employee

GROUP BY Name

HAVING COUNT(*) > 1;

Q13. Explain window functions in SQL

Used to perform calculations across rows related to the current row.
Example: ROW_NUMBER(), RANK(), LEAD(), LAG()
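A minimal sketch of RANK() and LAG() follows, run through Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions); the Employee rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two departments, with a salary tie in Sales.
conn.executescript("""
    CREATE TABLE Employee (Name TEXT, Department TEXT, Salary INTEGER);
    INSERT INTO Employee VALUES
        ('Asha', 'Sales', 50000),
        ('Ben', 'Sales', 60000),
        ('Chen', 'Sales', 60000),
        ('Dana', 'Engineering', 80000),
        ('Eli', 'Engineering', 70000);
""")

rows = conn.execute("""
    SELECT Name,
           Department,
           Salary,
           RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS SalaryRank,
           LAG(Salary) OVER (PARTITION BY Department ORDER BY Salary DESC, Name) AS PrevSalary
    FROM Employee
    ORDER BY Department, Salary DESC, Name
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Unlike GROUP BY, the window functions keep one output row per employee. RANK() gives tied salaries the same rank and skips the next one, and LAG() returns NULL for the first row of each partition.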

4. Statistics & Probability

Q14. Explain the Central Limit Theorem (CLT)

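The distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the observations are independent and the population has finite variance.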
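A quick simulation can make this concrete. The sketch below draws sample means from an exponential distribution (chosen here only because it is strongly skewed) and records how their spread shrinks as the sample size grows; the sample sizes, repetition count, and seed are arbitrary.

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    # Mean of n draws from an exponential distribution with mean 1.
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

spread = {}
for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(2000)]
    # The standard deviation of the sample means should be close to 1 / sqrt(n).
    spread[n] = statistics.stdev(means)
    print(n, round(statistics.fmean(means), 3), round(spread[n], 3))
```

Plotting a histogram of `means` for each `n` would show the skew fading and the bell shape emerging as `n` increases.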

Q15. Difference between Type I and Type II errors

Type I: Rejecting a true null hypothesis (False Positive)
Type II: Failing to reject a false null hypothesis (False Negative)

Q16. What is a p-value?

Probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
p < 0.05 is a common threshold for statistical significance.

Q17. Difference between correlation and covariance

Covariance measures joint variability.
Correlation normalizes covariance; ranges [-1,1].
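The relationship between the two can be sketched in a few lines of plain Python. The helper functions and the data below are illustrative; the sample (n - 1) denominator is an assumption, since the text does not say which convention it has in mind.

```python
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # y = 2x, a perfect linear relationship

def covariance(a, b):
    # Sample covariance with the n - 1 denominator.
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    # Pearson correlation: covariance divided by the product of the standard deviations.
    return covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))

cov_xy = covariance(x, y)
corr_xy = correlation(x, y)

# Rescaling y changes the covariance but leaves the correlation unchanged.
y_scaled = [100 * v for v in y]
print(cov_xy, covariance(x, y_scaled))
print(round(corr_xy, 6), round(correlation(x, y_scaled), 6))
```

This is why correlation is easier to compare across variables measured in different units.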

Q18. Explain the z-score and its use

The Z-score measures how many standard deviations a data point is from the mean.
Formula: $z = \frac{x - \mu}{\sigma}$
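Applied to a small list of values, the formula looks like this. The ages are made-up data, and the population standard deviation is used on the assumption that the list is the whole population of interest.

```python
import statistics

ages = [22, 25, 27, 30, 31, 35, 40]
mu = statistics.fmean(ages)
sigma = statistics.pstdev(ages)   # population standard deviation

# One z-score per value: distance from the mean in units of sigma.
z_scores = [(x - mu) / sigma for x in ages]

for age, z in zip(ages, z_scores):
    print(age, round(z, 2))
```

Standardized values have mean 0 and standard deviation 1, which is why z-scores are often used to compare features on different scales or to flag values that lie unusually far from the mean.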

5. Machine Learning

Q19. Supervised vs Unsupervised Learning

Supervised: Labeled data (regression, classification)
Unsupervised: Unlabeled data (clustering, PCA)

Q20. Overfitting vs Underfitting

Overfitting: Model too complex; fits training data but fails on test data
Underfitting: Model too simple; poor performance on both training and test data

Q21. How to handle imbalanced datasets?

Resampling: Over-sampling or under-sampling
SMOTE
Change evaluation metrics (F1-score, ROC-AUC)
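The reason for changing the metric is easy to show with a toy example. In the sketch below the labels are hypothetical (95 negatives, 5 positives) and the "model" always predicts the majority class.

```python
# Hypothetical labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A model that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = sum(t == p for t, p in pairs) / len(pairs)
# Guard against division by zero when there are no positive predictions.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, f1)  # 0.95 0.0
```

Accuracy looks strong even though the model never finds a positive case, while the F1-score exposes the problem.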

Q22. What is regularization?

Penalizes large coefficients to reduce overfitting
L1 (Lasso) and L2 (Ridge)

Q23. Explain Decision Trees and Random Forest

Decision Tree: Single tree; prone to overfitting
Random Forest: Ensemble of trees; reduces overfitting

Q24. What is gradient descent?

Optimization algorithm to minimize cost function
Types: Batch, Stochastic, Mini-batch
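The update rule itself is short. The sketch below minimizes a one-variable quadratic, so every step uses the exact gradient (the batch case); the function name, learning rate, and step count are illustrative choices.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient, scaled by the learning rate.
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))  # 3.0
```

Stochastic and mini-batch variants keep the same update but estimate the gradient from one example or a small subset of the data at each step.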

Q25. PCA (Principal Component Analysis)

Dimensionality reduction technique
Retains maximum variance in fewer features
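One way to sketch PCA from first principles is an eigendecomposition of the covariance matrix. The data below is synthetic and strongly correlated by construction; in practice a library implementation would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 points in 2-D where the second feature is about 2x the first.
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Center the data, then find the directions of greatest variance.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns ascending order

# Sort components from largest to smallest variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
projected = centered @ eigenvectors[:, :1]        # keep only the first component

print(explained_ratio)
print(projected.shape)  # (200, 1)
```

Because the two features are nearly collinear, the first component captures almost all of the variance, so dropping the second loses very little information.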

6. Deep Learning

Q26. Difference between CNN and RNN

CNN: Good for image processing; uses convolution layers
RNN: Good for sequence data; uses memory to capture previous steps

Q27. Explain backpropagation in neural networks

Algorithm to update weights based on error gradient
Uses the chain rule for derivatives
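For a single neuron the chain rule can be written out by hand. The sketch below assumes a sigmoid activation and a squared-error loss (neither is specified in the text) and checks the analytic gradient against a finite-difference estimate.

```python
import math

def forward(w, b, x, y):
    z = w * x + b
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
    loss = 0.5 * (a - y) ** 2        # squared-error loss
    return a, loss

def backward(w, b, x, y):
    a, _ = forward(w, b, x, y)
    dloss_da = a - y                 # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)            # derivative of the sigmoid
    # Chain rule: dloss/dw = dloss/da * da/dz * dz/dw, with dz/dw = x and dz/db = 1.
    return dloss_da * da_dz * x, dloss_da * da_dz * 1.0

w, b, x, y = 0.5, -0.1, 2.0, 1.0
grad_w, grad_b = backward(w, b, x, y)

# Finite-difference check of the weight gradient.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, y)[1] - forward(w - eps, b, x, y)[1]) / (2 * eps)
print(grad_w, numeric_w)
```

In a multi-layer network the same product of local derivatives is propagated backward layer by layer, and the resulting gradients feed the weight update.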

Q28. What is dropout, and why is it used?

Randomly drops neurons during training
Prevents overfitting
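A minimal sketch of the mechanism, using the "inverted dropout" variant in which surviving activations are rescaled during training so nothing needs to change at inference time. The function name and parameters are illustrative and not taken from any particular framework.

```python
import random

def dropout(activations, p_drop, training=True, rng=random):
    # At inference time (or with p_drop == 0) the activations pass through unchanged.
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Zero each activation with probability p_drop; scale survivors by 1 / keep
    # so the expected value of each activation stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
layer_output = [1.0] * 10
print(dropout(layer_output, p_drop=0.5))                   # mix of 0.0 and 2.0
print(dropout(layer_output, p_drop=0.5, training=False))   # unchanged
```

Because a different random subset of neurons is silenced on each training step, the network cannot rely on any single neuron, which is what discourages overfitting.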

Q29. LSTM vs GRU

Both are RNN variants for sequence modeling
GRU: Fewer parameters; faster training
LSTM: Separate cell state and more gates; often preferred for long-term dependencies, though results depend on the task

7. Data Visualization

Q30. Explain different types of plots and their uses

Histogram: Distribution of numerical data
Boxplot: Median, quartiles, outliers
Scatter plot: Relationship between 2 variables
Line plot: Trends over time
Bar plot: Categorical comparisons

Python Example:

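import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data; replace with your own DataFrame
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'], 'Sales': [10, 14, 7, 9]})

sns.boxplot(x='Category', y='Sales', data=df)
plt.show()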

Preparing for a data science interview is challenging because the topics range from Python and SQL to statistics, machine learning, deep learning, and data visualization. The 30 questions above cover core concepts, practical applications, and commonly used techniques. Working through them, rather than memorizing the answers, helps candidates practice problem-solving, write efficient code, and reason analytically about unfamiliar problems during interviews.

Nikhil Hegde is a proficient data science professional with four years of experience specializing in Machine Learning, Data Visualization, Predictive Analytics, and Big Data Processing. He is skilled at transforming complex datasets into actionable insights, driving data-driven decision-making, and optimizing business outcomes.