10 Data Analytics Interview Questions and Answers

by finnstats

Data analytics interview questions and answers are a major part of the data science interview and of the path to becoming a data analyst, data scientist, machine learning engineer, data engineer, or statistician.

Here are the top 10 data analytics interview questions and answers.

1. How is KNN different from k-means clustering?

KNN (K-Nearest Neighbours) is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm.

K-Nearest Neighbours needs labeled data. K-means clustering requires only a set of unlabeled data points.
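To make the contrast concrete, here is a minimal sketch in plain Python. The function names, the toy points, and the starting centroids are all illustrative assumptions, not a reference implementation: `knn_predict` needs the labels, while `kmeans` never sees them.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: classify `query` by majority vote of its k nearest labeled points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def kmeans(points, centroids, iterations=10):
    """Unsupervised: group unlabeled points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Toy data (made up for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["low", "low", "low", "high", "high", "high"]

predicted = knn_predict(points, labels, query=(2.0, 2.0), k=3)   # uses the labels
final_centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])  # ignores them
```

In practice you would usually reach for a library implementation; the point here is only that the first function cannot run without `labels` and the second never asks for them.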

2. What is the difference between supervised and unsupervised methods?

Supervised learning requires labeled training data. In a classification example, each observation first needs a class label; the labeled data are then typically split into training and test data sets.

Unsupervised learning does not require any labeling.

3. What is Bayes’ Theorem?

Bayes’ Theorem gives you the posterior probability of an event given prior knowledge: P(A|B) = P(B|A) P(A) / P(B), where P(A) is the prior probability and P(A|B) is the posterior probability.
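A small worked example may help. The numbers below (1% prevalence, 95% sensitivity, 5% false positive rate) are hypothetical and chosen only for illustration, and the function name `posterior` is an assumption of this sketch.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded by total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical screening test: A = "has the condition", B = "tests positive".
p_condition_given_positive = posterior(
    prior=0.01,                # P(A): prevalence
    likelihood=0.95,           # P(B|A): sensitivity
    false_positive_rate=0.05,  # P(B|not A)
)
```

With these inputs the posterior works out to roughly 16%, which shows how strongly a low prior pulls down the result even when the test itself looks accurate.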


4. What is meant by “Naive” in Naive Bayes?

Naive Bayes is considered “Naive” because it makes an assumption that is virtually never true in real-life data: the conditional probability is calculated as the pure product of the individual probabilities of the components.

In simple words, a “Naive” Bayes classifier assumes that the presence of a particular feature of a class is unrelated to the presence of any other feature, given the class variable.
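The independence assumption shows up directly in the arithmetic. In the sketch below, the class names, feature names, and probabilities are all hypothetical stand-ins for values you would estimate from labeled training data.

```python
import math

# Hypothetical probabilities, as if estimated from labeled training data.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.7, "meeting": 0.1},   # P(feature | spam)
    "ham": {"offer": 0.2, "meeting": 0.5},    # P(feature | ham)
}

def naive_bayes_posteriors(features):
    """Score each class as prior * product of P(feature | class), then normalize."""
    scores = {
        label: priors[label] * math.prod(likelihoods[label][f] for f in features)
        for label in priors
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

posteriors = naive_bayes_posteriors(["offer", "meeting"])
```

The `math.prod` call is the “naive” part: each feature contributes its own factor, with no term for how the features interact.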

5. What’s the difference between Type I and Type II errors?

A Type I error is a false positive, while a Type II error is a false negative.

Type I error: rejecting H0 when it is true. Type II error: failing to reject H0 when H1 is true.
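The four possible outcomes of a single test can be laid out as a tiny lookup. This is only a sketch for keeping the definitions straight; the function name `error_type` is made up for illustration.

```python
def error_type(null_is_true, null_rejected):
    """Name the outcome of one hypothesis test decision."""
    if null_is_true and null_rejected:
        return "Type I error (false positive)"
    if not null_is_true and not null_rejected:
        return "Type II error (false negative)"
    return "correct decision"

# Enumerate every combination of (H0 is true, H0 was rejected).
outcomes = {
    (truth, decision): error_type(truth, decision)
    for truth in (True, False)
    for decision in (True, False)
}
```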

6. Which is more important: model accuracy or model performance?

Model accuracy is only a subset of model performance; in other words, accuracy is just one of several measures of how well a model performs.

Accuracy is most informative when the groups are roughly equally represented. For example, if you are predicting a color-changing pattern, the training data sets should contain an almost equal representation of changed and unchanged colors.
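Here is a minimal sketch of why accuracy alone can mislead when the groups are not balanced. The 90/10 split and the always-predict-the-majority “model” are hypothetical, and recall is brought in here simply as one other performance measure to compare against.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model found."""
    true_positives = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_positives = sum(t == positive for t in y_true)
    return true_positives / actual_positives

# Hypothetical labels: 90 "not changed" (0) and 10 "changed" (1).
y_true = [0] * 90 + [1] * 10
# A trivial model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)   # 0.9
rec = recall(y_true, y_pred)     # 0.0
```

The model scores 90% accuracy while never detecting a single changed color, which is why accuracy should be read alongside other measures.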

7. Explain the ROC curve.

The ROC curve is a graphical illustration of the trade-off between the true positive rate and the false positive rate at various thresholds.

It’s frequently used as a proxy for the trade-off between the sensitivity of the model (true positives) and the fall-out, or the probability that it will trigger a false alarm (false positives).
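The curve can be traced by hand: sweep a threshold over the model’s scores and record the two rates at each step. The sketch below assumes binary labels (1 = positive) and a handful of made-up scores; the helper names `roc_points` and `auc` are illustrative.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up labels and model scores for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
```

Lowering the threshold moves the curve from (0, 0) toward (1, 1): more true positives are caught, at the cost of more false alarms.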

8. How would you deal with an imbalanced dataset?

First, understand what an imbalanced dataset is: for example, 90% of the observations fall in one group and the remaining 10% in another.

1. Collect more data points to reduce the imbalance in the dataset.

2. Resample the dataset (oversample the minority group or undersample the majority group) to correct for the imbalance.

3. If the algorithm’s accuracy is still not adequate, try a different algorithm on your dataset.
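The resampling option can be sketched with random oversampling, one simple choice among several. The function name, the fixed seed, and the 90/10 toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Sample with replacement; the majority class needs zero extra rows.
        extra = rng.choices(pool, k=target - count)
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy dataset: 90 rows in one group, 10 in the other.
rows = list(range(100))
labels = ["majority"] * 90 + ["minority"] * 10

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

Resampling like this is normally applied to the training data only, so that the test set still reflects the real class proportions.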

9. When should you use classification over regression?

Classification predicts discrete categories, while regression predicts a continuous value. You would use classification over regression if you wanted your results to reflect certain explicit categories.

10. What are your thoughts on the best data visualization and source datasets?

Describe your views on how to visualize data properly and your personal preferences when it comes to software tools.

Popular tools include R’s ggplot2, Python’s seaborn and matplotlib, and tools such as Plotly and Tableau.

Questions like these try to get at the heart of your interest in machine learning. For source datasets, Quandl is used for economic and financial data, and Kaggle’s Datasets collection is another great list.
