How to Identify Outliers: Grubbs’ Test in R

Grubbs’ test is a statistical test that can be used to detect outliers in a dataset.

To use this test, the dataset should be approximately normally distributed and contain at least 7 observations.
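Both preconditions can be checked before running the test. The sketch below uses only base R and the sample vector that appears later in this article. The Shapiro-Wilk test is an assumed choice of normality check; the article itself does not prescribe one.

```r
# Sample data used later in this article
data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# Precondition 1: at least 7 observations
n_obs <- length(data)
enough_obs <- n_obs >= 7

# Precondition 2: approximate normality.
# Shapiro-Wilk is an assumed choice here; a small p-value suggests non-normality.
sw <- shapiro.test(data)

cat("Observations:", n_obs, "| at least 7:", enough_obs, "\n")
cat("Shapiro-Wilk p-value:", round(sw$p.value, 4), "\n")
```

A normal Q-Q plot (qqnorm(data) followed by qqline(data)) is a common visual alternative to a formal normality test.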

This article shows how to use R to run Grubbs’ Test to find outliers in a dataset.

How to Identify Outliers: Grubbs’ Test

We can use the grubbs.test() function from the outliers package, which has the following syntax:

grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)

where:

x: a numeric vector of data values

type: 10 = test for one outlier (by default the value farthest from the mean, which is the maximum in the examples below), 11 = test whether the minimum and maximum values are both outliers (two outliers on opposite tails), 20 = test for two outliers in the same tail.

opposite: a logical value. If FALSE (the default), the test checks the value with the largest difference from the mean; if TRUE, it checks the extreme value on the opposite side (the lowest if the most suspicious value is the highest, and vice versa).

two.sided: a logical value indicating whether the test should be treated as two-sided.

Test Hypotheses:

H0: There is no outlier in the data.

H1: There is an outlier in the data.

Case 1: Testing whether the maximum value of a dataset is an outlier

First, load the package:

library(outliers)

Let’s start by creating a sample numeric vector.

data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

Running the test is as simple as calling the grubbs.test() function:

grubbs.test(data)

Grubbs test for one outlier

data:  data
G = 2.21065, U = 0.72925, p-value = 0.1867
alternative hypothesis: highest value 42 is an outlier

G = 2.21065 is the test statistic, and p = 0.1867 is the corresponding p-value. Because this p-value is greater than 0.05, we cannot reject the null hypothesis.

We do not have sufficient evidence to say that the maximum value of ‘42’ is an outlier.
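To see where these numbers come from, the statistics can be recomputed by hand in base R. This is a minimal sketch, not the package’s source code: G is the distance of the maximum from the mean in standard deviations, and U is the ratio of the sum of squared deviations without the suspect value to the full sum. The p-value formula is the usual t-distribution relationship for Grubbs’ test as I understand it, so treat it as an assumption. The names G_max, U_max, and p_max are illustrative.

```r
data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)
n <- length(data)

# G: how many standard deviations the maximum lies above the mean
G_max <- (max(data) - mean(data)) / sd(data)

# U: sum of squares without the suspect value divided by the full sum of squares
ss <- function(v) sum((v - mean(v))^2)
U_max <- ss(data[-which.max(data)]) / ss(data)

# One-sided p-value via the t-distribution (assumed formula), capped at 1
t_stat <- sqrt(n * (n - 2) * G_max^2 / ((n - 1)^2 - n * G_max^2))
p_max <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

print(round(c(G = G_max, U = U_max, p = p_max), 5))

# Optional cross-check against the outliers package, if it is installed
if (requireNamespace("outliers", quietly = TRUE)) {
  res <- outliers::grubbs.test(data)
  print(res$p.value)
}
```

The values of G and U should match the output above, and the p-value should come out close to the reported 0.1867.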

Case 2: Testing whether the minimum value of a dataset is an outlier

Here we need to set opposite = TRUE:

grubbs.test(data, opposite=TRUE)

Grubbs test for one outlier

data:  data
G = 1.68376, U = 0.84293, p-value = 0.8365
alternative hypothesis: lowest value 8 is an outlier

G = 1.68376 is the test statistic, and p = 0.8365 is the corresponding p-value. Because this p-value is greater than 0.05, we cannot reject the null hypothesis.

We don’t have enough evidence to say that the minimum value of ‘8’ is an outlier.
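The reason opposite = TRUE is needed here can be shown with a short base R sketch. By default the one-outlier test targets whichever extreme lies farther from the mean, which is the maximum for this data. The variable names are illustrative, and the handling of an exact tie (>=) is my assumption rather than documented package behaviour.

```r
data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# Distance of each extreme from the mean
dev_max <- max(data) - mean(data)
dev_min <- mean(data) - min(data)

# Default target: the extreme farther from the mean (tie handling assumed);
# opposite = TRUE switches to the other extreme
default_target  <- if (dev_max >= dev_min) max(data) else min(data)
opposite_target <- if (dev_max >= dev_min) min(data) else max(data)

# Test statistic for the minimum, as reported in Case 2
G_min <- dev_min / sd(data)

cat("Default target:", default_target, "\n")
cat("Target with opposite = TRUE:", opposite_target, "\n")
cat("G for the minimum:", round(G_min, 5), "\n")
```

For a dataset whose minimum is the more extreme value, the roles would be reversed and the default call would test the minimum.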

Case 3: Testing whether the two largest values are outliers

sort(data)
[1]  8 12 12 15 16 17 18 18 19 20 25 25 25 26 27 29 32 33 35 42

Here 35 and 42 are the two largest values, so we need to specify type = 20:

grubbs.test(data, type=20)

Grubbs test for two outliers

data:  data
U = 0.60002, p-value = 0.247
alternative hypothesis: highest values 35 , 42 are outliers

The test has a p-value of 0.247. We can’t reject the null hypothesis because this is more than 0.05, so we conclude that we don’t have enough evidence to determine that the values 35 and 42 are outliers.

Case 4: Testing whether the two smallest values are outliers

From the sorted output, the two smallest values are 8 and 12, and Case 2 already showed no evidence that the lowest value is an outlier. For completeness, we can still apply the same approach to the lower tail by adding opposite = TRUE.

grubbs.test(data, type=20,opposite=TRUE)

Grubbs test for two outliers

data:  data
U = 0.74698, p-value = 0.7964
alternative hypothesis: lowest values 8 , 12 are outliers

The test has a p-value of 0.7964. We can’t reject the null hypothesis because this is more than 0.05, so we conclude that we don’t have enough evidence to determine that the values 8 and 12 are outliers.
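The U statistics reported in Cases 3 and 4 can be reproduced in base R as a ratio of sums of squared deviations: the sum computed without the two suspect values, divided by the sum over all values. This is only a sketch of the statistic; the p-value for this variant is not recomputed here, and the names U_high and U_low are illustrative.

```r
data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# Sum of squared deviations from the mean
ss <- function(v) sum((v - mean(v))^2)

sorted <- sort(data)
n <- length(sorted)

# Case 3: drop the two largest values (35 and 42)
U_high <- ss(sorted[1:(n - 2)]) / ss(sorted)

# Case 4: drop the two smallest values (8 and 12)
U_low <- ss(sorted[3:n]) / ss(sorted)

print(round(c(U_high = U_high, U_low = U_low), 5))
```

A U value close to 1 means that removing the two values barely changes the spread of the data, which is consistent with the large p-values above.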

Case 5: Testing whether the smallest and largest values are both outliers

grubbs.test(data, type=11)

Grubbs test for two opposite outliers

data:  data
G = 3.89441, U = 0.59277, p-value = 0.5336
alternative hypothesis: 8 and 42 are outliers

The test has a p-value of 0.5336. We can’t reject the null hypothesis because this is more than 0.05, so we conclude that we don’t have enough evidence to determine that the values 8 and 42 are outliers.
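For this variant, the reported G is the range of the data measured in standard deviations, and U again compares the sum of squared deviations with and without the two suspect values. The base R sketch below recomputes both statistics under that reading; the names G_range and U_both are illustrative, and the p-value is not recomputed.

```r
data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# G: distance between the maximum and the minimum in standard deviations
G_range <- (max(data) - min(data)) / sd(data)

# U: sum of squares without the minimum and maximum, over the full sum of squares
ss <- function(v) sum((v - mean(v))^2)
sorted <- sort(data)
n <- length(sorted)
U_both <- ss(sorted[2:(n - 1)]) / ss(sorted)

print(round(c(G = G_range, U = U_both), 5))
```

Both values should agree with the G = 3.89441 and U = 0.59277 shown in the output above.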
