How to Remove Outliers in R

What is an outlier? It’s an observation that differs significantly from the other values in a dataset. Outliers can skew results and lead to misleading conclusions.

In this article, we’ll go over how to remove outliers from a dataset in R.

To begin, we must first identify the outliers in a dataset. Two methods are commonly used: the interquartile range and z-scores.

1. Interquartile range.

The interquartile range (IQR) of a dataset is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

It measures the spread of the middle 50% of the values.

An observation is considered an outlier if it lies more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1).

Outlier = observations > Q3 + 1.5*IQR or < Q1 - 1.5*IQR
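
As a minimal sketch of this rule, the fences can be computed for a small made-up vector. The values in x are illustrative only and are not taken from the article's data.

```r
# Illustrative values only; 30 sits far from the rest
x <- c(2, 4, 5, 6, 7, 8, 9, 30)

Q1  <- quantile(x, .25, names = FALSE)  # 25th percentile
Q3  <- quantile(x, .75, names = FALSE)  # 75th percentile
iqr <- Q3 - Q1                          # same value as IQR(x)

lower <- Q1 - 1.5 * iqr                 # lower fence
upper <- Q3 + 1.5 * iqr                 # upper fence

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
outliers
```

With R's default quantile method, Q1 is 4.75 and Q3 is 8.25 here, so the fences are -0.5 and 13.5 and only the value 30 is flagged.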

2. Z-scores.

The z-score indicates the number of standard deviations a given value deviates from the mean. A z-score is calculated using the following formula:

z = (X - μ) / σ

where:

X is a single raw data value

μ is the population mean

σ is the population standard deviation

If an observation’s z-score is less than -3 or greater than 3, it’s considered an outlier. In practice, μ and σ are estimated by the sample mean and the sample standard deviation.

Outlier = values with z-scores > 3 or < -3
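
The z-score rule can be sketched on a small made-up vector as follows. The values in x are illustrative only, and the sample mean and standard deviation stand in for μ and σ.

```r
# Illustrative values only; the last value is far from the rest
x <- c(9, 10, 11, 10, 9, 11, 10, 10, 9, 11,
       10, 9, 11, 10, 10, 9, 11, 10, 10, 50)

# z-score of each value, using the sample mean and standard deviation
z <- (x - mean(x)) / sd(x)

# Values whose z-score is below -3 or above 3
outliers <- x[abs(z) > 3]
outliers
```

Note that this rule needs a reasonably large sample: with n values, the largest possible absolute z-score is (n - 1) / sqrt(n), so no value can exceed 3 when n is 10 or fewer.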

How to Remove Outliers in R

Once you’ve decided how to define an outlier, you can find and remove outliers from a dataset. We’ll use the following data frame to demonstrate how to do so:

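# Make the example reproducible
set.seed(123)
# Three normally distributed columns with 100 rows each
data <- data.frame(Apperance = rnorm(100, mean = 8, sd = 4),
                   Thickness = rnorm(100, mean = 15, sd = 2.3),
                   Softness  = rnorm(100, mean = 29, sd = 2.5))
head(data)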
  Apperance Thickness Softness
1 16.037138  15.53825 30.37900
2 10.553282  17.06525 25.33928
3  7.857956  21.40918 30.56349
4 12.113920  12.98547 29.77937
5  4.156319  17.92857 29.54414
6 10.438148  13.15290 31.03692

Method 1: Z-score

The code below demonstrates how to calculate the z-score of each value in each column in the data frame, then eliminate rows having at least one z-score with an absolute value greater than 3.

# Absolute z-score of every value, computed column by column
z_scores <- as.data.frame(sapply(data, function(x) abs(x - mean(x)) / sd(x)))

Only rows in the data frame with all z-scores less than 3 are kept.

# Subset the original data (not the z-scores), keeping rows with no z-score above 3
no_outliers <- data[rowSums(z_scores > 3) == 0, ]
head(no_outliers)
  Apperance Thickness Softness
1 16.037138  15.53825 30.37900
2 10.553282  17.06525 25.33928
3  7.857956  21.40918 30.56349
4 12.113920  12.98547 29.77937
5  4.156319  17.92857 29.54414
6 10.438148  13.15290 31.03692

Let’s check the dimensions of both data frames.

dim(data)
100   3
dim(no_outliers)
99  3

One row contained an outlier and was removed, leaving 99 rows for further analysis.
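
If the same filter is needed more than once, it could be wrapped in a small function. The sketch below is one possible way to do this. The helper name remove_z_outliers, its threshold argument, and the toy data frame are all assumptions made for illustration and are not part of the original example. It assumes every column is numeric and contains no missing values.

```r
# Hypothetical helper (not from the article): drop rows in which
# any column has an absolute z-score above the threshold
remove_z_outliers <- function(df, threshold = 3) {
  # Absolute z-score of every value, computed column by column
  z <- sapply(df, function(x) abs(x - mean(x)) / sd(x))
  # Keep a row only if none of its z-scores exceeds the threshold
  df[rowSums(z > threshold) == 0, , drop = FALSE]
}

# Made-up data: row 20 holds an extreme value in column a
toy <- data.frame(a = c(1:19, 100), b = 20:1)

cleaned <- remove_z_outliers(toy)
nrow(cleaned)
```

Here the value 100 has a z-score of about 4.1, so row 20 is dropped and 19 rows remain.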

Method 2: Interquartile Range

The code below shows how to remove rows from the data frame whose value in the ‘Apperance’ column is more than 1.5 times the interquartile range below the first quartile (Q1) or more than 1.5 times the interquartile range above the third quartile (Q3).

Q1 <- quantile(data$Apperance, .25)
Q3 <- quantile(data$Apperance, .75)
iqr <- IQR(data$Apperance)   # lower-case name avoids masking the IQR() function

Now we keep only the rows whose Apperance value lies between Q1 - 1.5*IQR and Q3 + 1.5*IQR.

no_outliers <- subset(data, Apperance > (Q1 - 1.5*iqr) & Apperance < (Q3 + 1.5*iqr))
dim(no_outliers)
99   3

The result has 99 rows, so one row had an outlier in the Apperance column.
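
The example above filters on a single column. As a rough sketch, the same rule could be applied to every column at once. The helper name within_fences, its argument k, and the toy data frame below are assumptions made for illustration and do not come from the original example.

```r
# Hypothetical helper (not from the article): TRUE where a value
# lies strictly inside the k * IQR fences of its column
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(.25, .75), names = FALSE)  # Q1 and Q3
  spread <- q[2] - q[1]                         # interquartile range
  x > q[1] - k * spread & x < q[2] + k * spread
}

# Made-up data: the last row has an extreme value in column a
toy <- data.frame(a = c(2, 4, 5, 6, 7, 8, 9, 30),
                  b = c(1, 2, 3, 4, 5, 6, 7, 8))

# A row is kept only if it is inside the fences in every column
keep <- Reduce(`&`, lapply(toy, within_fences))
cleaned <- toy[keep, ]
nrow(cleaned)
```

In this toy data only the last row falls outside the fences, so 7 of the 8 rows are kept.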

To visualize the outliers, you can draw a boxplot of each column with the code below.

boxplot(data)
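
To list the values that the boxplot draws as individual points, one option is boxplot.stats() from the grDevices package, which ships with R. The sketch below recreates the data frame from earlier so that it runs on its own. The variable name flagged is an assumption made for illustration.

```r
# Recreate the data frame used in this article
set.seed(123)
data <- data.frame(Apperance = rnorm(100, mean = 8, sd = 4),
                   Thickness = rnorm(100, mean = 15, sd = 2.3),
                   Softness  = rnorm(100, mean = 29, sd = 2.5))

# Values that boxplot() would draw beyond the whiskers, per column
flagged <- lapply(data, function(col) boxplot.stats(col)$out)

# Number of flagged values in each column
lengths(flagged)
```

boxplot.stats() works from the hinges of the five-number summary rather than from quantile(), so its results can differ slightly from the Q1 and Q3 rule used in Method 2.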
