How to Identify Outliers: Grubbs’ Test in R
Grubbs’ test is a statistical test that can be used to detect outliers in a dataset.
To use this test, the data should be approximately normally distributed and contain at least 7 observations.
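Both conditions can be checked before running the test. The following is a minimal sketch in base R, using the sample values from the example later in this article. The Shapiro-Wilk test and the 0.05 cutoff are assumptions chosen for illustration; they are one common way to check normality, not a requirement of Grubbs’ test.

```r
# Sample values taken from the example later in this article
x <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15,
       25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# Condition 1: at least 7 observations
enough_obs <- length(x) >= 7

# Condition 2: approximate normality.
# Shapiro-Wilk is one common choice (an assumption here);
# qqnorm(x) followed by qqline(x) gives a visual alternative.
sw <- shapiro.test(x)
looks_normal <- sw$p.value > 0.05  # assumed cutoff

print(enough_obs)
print(looks_normal)
```

A small Shapiro-Wilk p-value would suggest the data depart from normality, in which case the result of Grubbs’ test should be read with caution.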
This article shows how to use R to run Grubbs’ Test to find outliers in a dataset.
How to Identify Outliers: Grubbs’ Test
We can use the grubbs.test() function from the outliers package, which has the following syntax:
grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)
where:
x: a numeric vector of data values
type: 10 = test for a single outlier (the default), 11 = test for two outliers on opposite tails (the minimum and the maximum), 20 = test for two outliers in the same tail.
opposite: a logical value. By default the test checks the value with the largest difference from the mean; set opposite = TRUE to check the value at the opposite end instead (the lowest value if the most suspicious one is the highest, and vice versa).
two.sided: a logical value indicating whether the test should be treated as two-sided.
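To make the meaning of the opposite argument concrete, here is a small base R sketch that finds which value the default test would examine and which value the opposite setting would switch to. The variable names (suspect, other_end) are hypothetical and used only for illustration; the values are the sample data used later in this article.

```r
# Sample values taken from the example later in this article
x <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15,
       25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# Deviation of each value from the sample mean
dev <- x - mean(x)

# Default behaviour: the value with the largest absolute deviation
suspect <- x[which.max(abs(dev))]

# With opposite = TRUE: the extreme value at the other end
other_end <- if (suspect == max(x)) min(x) else max(x)

print(suspect)    # 42
print(other_end)  # 8
```

Here the mean is 22.7, so 42 lies further from it than 8 does, which is why the default call examines the maximum.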
Test Hypotheses:
H0: There is no outlier in the data.
H1: There is an outlier in the data.
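For reference, the statistic behind these hypotheses can be written out as follows. The symbols are standard notation and do not appear in the original article: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t$ the upper critical value of the t-distribution with $n-2$ degrees of freedom.

```latex
% Two-sided Grubbs statistic
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s}

% One-sided versions (testing the maximum or the minimum)
G_{\max} = \frac{x_{\max} - \bar{x}}{s}, \qquad
G_{\min} = \frac{\bar{x} - x_{\min}}{s}

% Reject H0 at significance level \alpha when
G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^{2}}{n - 2 + t^{2}}}
% with t = t_{\alpha/(2n),\, n-2} for the two-sided test
% and  t = t_{\alpha/n,\, n-2}   for a one-sided test
```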
Case 1: Identifying whether a dataset’s maximum value is an outlier
Load the library first:
library(outliers)
Let’s start by creating a numeric vector of test data.
data <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15, 25, 17, 18, 19, 29, 33, 25, 20, 18, 26)
Running the test is as simple as calling the grubbs.test() function:
grubbs.test(data)
Grubbs test for one outlier
data: data
G = 2.21065, U = 0.72925, p-value = 0.1867
alternative hypothesis: highest value 42 is an outlier
G = 2.21065 is the test statistic, and p = 0.1867 is the corresponding p-value. Because the p-value is greater than 0.05, we cannot reject the null hypothesis.
We do not have sufficient evidence to say that the maximum value of ‘42’ is an outlier.
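The numbers in this output can be reproduced by hand, which helps show what the test measures. The sketch below uses only base R. G is the distance of the maximum from the mean in standard deviations, and U is the ratio of the sum of squared deviations without the suspect value to the sum with it. The p-value line uses the usual t-distribution approximation for a one-sided Grubbs’ test; treating it as equivalent to the package’s internal calculation is an assumption.

```r
# Same values as the data vector defined above
x <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15,
       25, 17, 18, 19, 29, 33, 25, 20, 18, 26)
n <- length(x)

# Test statistic: how many standard deviations the maximum lies above the mean
G <- (max(x) - mean(x)) / sd(x)

# U: sum of squares without the maximum, divided by the full sum of squares
rest <- x[-which.max(x)]
U <- sum((rest - mean(rest))^2) / sum((x - mean(x))^2)

# Approximate one-sided p-value via the t-distribution (assumed approach)
t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
p_approx <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

print(round(G, 5))       # compare with G in the output above
print(round(U, 5))       # compare with U in the output above
print(round(p_approx, 4)) # should land close to the reported p-value
```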
Case 2: Identifying whether a dataset’s minimum value is an outlier
Here we need to set opposite=TRUE:
grubbs.test(data, opposite=TRUE)
Grubbs test for one outlier
data: data
G = 1.68376, U = 0.84293, p-value = 0.8365
alternative hypothesis: lowest value 8 is an outlier
G = 1.68376 is the test statistic, and p = 0.8365 is the corresponding p-value. Because the p-value is greater than 0.05, we cannot reject the null hypothesis.
We don’t have enough evidence to declare the minimum value of ‘8’ to be an outlier.
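The same decision rule is applied in every case in this article, so it can be wrapped in a small helper. This is a sketch only: decide() is a hypothetical function name, and the 0.05 threshold is the one used throughout the article. The second part runs only if the outliers package is installed and relies on grubbs.test() returning a standard "htest" object whose p-value is stored in the p.value element.

```r
# Hypothetical helper: turn a p-value into a decision at level alpha
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

# Applied to the p-values reported in Case 1 and Case 2
print(decide(0.1867))
print(decide(0.8365))

# Applied directly to a test result, if the package is available
if (requireNamespace("outliers", quietly = TRUE)) {
  x <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15,
         25, 17, 18, 19, 29, 33, 25, 20, 18, 26)
  res <- outliers::grubbs.test(x, opposite = TRUE)
  print(decide(res$p.value))
}
```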
Case 3: Identifying whether the two largest values are outliers
sort(data)
[1] 8 12 12 15 16 17 18 18 19 20 25 25 25 26 27 29 32 33 35 42
In this case, 35 and 42 are the two largest values, so we need to specify type=20.
grubbs.test(data, type=20)
Grubbs test for two outliers
data: data
U = 0.60002, p-value = 0.247
alternative hypothesis: highest values 35 , 42 are outliers
The test has a p-value of 0.247. Because this is greater than 0.05, we cannot reject the null hypothesis: we do not have enough evidence to say that the values 35 and 42 are outliers.
Case 4: Identifying whether the two smallest values are outliers
The two smallest values are 8 and 12. Case 2 already showed that the lowest value, 8, is not flagged as an outlier on its own, but for the sake of completeness we can apply the same approach to the lower tail.
grubbs.test(data, type=20, opposite=TRUE)
Grubbs test for two outliers
data: data
U = 0.74698, p-value = 0.7964
alternative hypothesis: lowest values 8 , 12 are outliers
The test has a p-value of 0.7964. Because this is greater than 0.05, we cannot reject the null hypothesis: we do not have enough evidence to say that the values 8 and 12 are outliers.
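For the two-outlier tests in Cases 3 and 4, the output reports only U. A short base R sketch shows where that number comes from: it is the sum of squared deviations after dropping the two suspect values, divided by the sum of squared deviations of the full sample. The function name u_two_same_tail() is hypothetical.

```r
# Same values as the data vector defined above
x <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15,
       25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# Hypothetical helper: U statistic for two suspect values in the same tail
u_two_same_tail <- function(x, lowest = FALSE) {
  # Put the two suspect values first, then drop them
  sorted <- sort(x, decreasing = !lowest)
  rest <- sorted[-(1:2)]
  sum((rest - mean(rest))^2) / sum((x - mean(x))^2)
}

u_high <- u_two_same_tail(x)                 # drops 35 and 42
u_low  <- u_two_same_tail(x, lowest = TRUE)  # drops 8 and 12

print(round(u_high, 5))  # compare with U reported in Case 3
print(round(u_low, 5))   # compare with U reported in Case 4
```

A value of U close to 1 means that removing the two values barely changes the spread of the data, which is consistent with the large p-values above.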
Case 5: Identifying whether the smallest and largest values are outliers
Here we need to specify type=11:
grubbs.test(data, type=11)
Grubbs test for two opposite outliers
data: data
G = 3.89441, U = 0.59277, p-value = 0.5336
alternative hypothesis: 8 and 42 are outliers
The test has a p-value of 0.5336. Because this is greater than 0.05, we cannot reject the null hypothesis: we do not have enough evidence to say that the values 8 and 42 are outliers.
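As with the earlier cases, the G and U values for the opposite-tails test can be checked in base R. In this sketch G is the range of the data expressed in standard deviations, and U compares the sum of squared deviations with and without the two extremes. The variable names are hypothetical.

```r
# Same values as the data vector defined above
x <- c(35, 12, 27, 32, 16, 42, 25, 12, 8, 15,
       25, 17, 18, 19, 29, 33, 25, 20, 18, 26)

# G: distance between the maximum and minimum, in standard deviations
G_range <- (max(x) - min(x)) / sd(x)

# U: sum of squares without both extremes, divided by the full sum of squares
rest <- x[-c(which.min(x), which.max(x))]
U_range <- sum((rest - mean(rest))^2) / sum((x - mean(x))^2)

print(round(G_range, 5))  # compare with G in the output above
print(round(U_range, 5))  # compare with U in the output above
```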