How to Calculate Mahalanobis Distance in R
The Mahalanobis distance measures how far a point lies from the center of a multivariate distribution, taking the variances and correlations of the variables into account. It is frequently used to detect outliers in statistical analyses involving several variables.
This tutorial describes how to calculate the Mahalanobis distance in R.
Mahalanobis Distance in R
Step 1: Create the Dataset
First, we need to create a data frame. The example dataset holds, for 20 students, an exam score, the number of hours spent studying, a preparation count (prep), and the current grade.
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)
head(data)
  score hours prep grade
1    81     7    3    80
2    83     8    4    78
3    92     3    0    80
4    87     1    3    80
5    96     4    5    84
6    73     3    0    85
Step 2: Calculate the Mahalanobis Distance for Each Observation
We can use the built-in mahalanobis() function in R, which returns the squared Mahalanobis distance of each row.
Its syntax is as follows:
mahalanobis(x, center, cov)
where:
x: a matrix (or data frame) of data, with one observation per row
center: the mean vector of the distribution
cov: the covariance matrix of the distribution
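These three arguments map onto the usual definition of the squared distance, (x - center)' cov^-1 (x - center). The following is a minimal sketch, not part of the original tutorial, that computes this quantity by hand for the first student and compares it with what mahalanobis() returns. The variable names x, center, S and diff1 are illustrative choices.

```r
# Same data frame as in Step 1
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

x      <- as.matrix(data)   # observations in rows
center <- colMeans(x)       # mean vector
S      <- cov(x)            # sample covariance matrix

# Manual squared distance for observation 1: diff' S^-1 diff
diff1     <- x[1, ] - center
d2_manual <- as.numeric(t(diff1) %*% solve(S) %*% diff1)

# Built-in result for the same observation
d2_builtin <- mahalanobis(x, center, S)[1]

all.equal(d2_manual, as.numeric(d2_builtin))  # the two should agree
sqrt(d2_builtin)                              # unsquared distance, if needed
```

Because mahalanobis() returns squared distances, take the square root only when the distance itself is needed; the Chi-Square comparison in Step 3 works on the squared values.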
Now we can calculate the distance for each observation.
mahalanobis(data, colMeans(data), cov(data))
 [1] 3.3431887 5.7202321 7.3521513 3.1990061 4.2208239 3.4181516 3.1017453
 [8] 2.8156955 1.9605904 5.6692191 5.3856421 3.5954695 3.9963068 5.9551989
[15] 2.4928251 2.4151973 4.3417003 0.9334786 1.4406139 4.6427634
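To see which students sit farthest from the center, the distances can be ranked. This is a small sketch under the same setup as above; d2 and top3 are illustrative names. Going by the output printed above, observations 3, 14 and 2 have the largest values.

```r
# Same data frame as in Step 1
data <- data.frame(
  score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
  hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
  prep  = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
  grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79)
)

# Squared Mahalanobis distance of every row from the column means
d2 <- mahalanobis(data, colMeans(data), cov(data))

# Row indices of the three largest distances, largest first
top3 <- order(d2, decreasing = TRUE)[1:3]
data[top3, ]

# Sanity check: with the sample covariance, the squared distances
# always sum to (n - 1) * k, here 19 * 4 = 76
sum(d2)
```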
Step 3: Calculate the p-value
Based on the Step 2 result, some of the distances are much larger than others. To determine whether any of them are statistically significant, we need to calculate p-values.
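The p-value for each distance is the upper-tail probability of a Chi-Square distribution evaluated at the (squared) Mahalanobis distance. This tutorial uses k-1 degrees of freedom, where k is the number of variables (here k = 4, so df = 3). Note that many references use k degrees of freedom instead, which gives somewhat larger p-values.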
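As a quick illustration of how much the degrees of freedom matter, the sketch below computes the p-value for the largest distance from Step 2 (7.3521513, observation 3) under both choices. The names d2_max, p_km1 and p_k are illustrative, and treating k as an alternative is a common convention rather than something this tutorial's results rely on.

```r
d2_max <- 7.3521513  # largest squared distance printed in Step 2
k      <- 4          # number of variables: score, hours, prep, grade

# Upper-tail Chi-Square probability with k - 1 degrees of freedom
p_km1 <- pchisq(d2_max, df = k - 1, lower.tail = FALSE)

# Same distance evaluated with k degrees of freedom (alternative convention)
p_k <- pchisq(d2_max, df = k, lower.tail = FALSE)

c(df_k_minus_1 = p_km1, df_k = p_k)
```

Either way, this observation stays well above a 0.001 cutoff, so the choice does not change the conclusion for this dataset.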
data$mahalnobis <- mahalanobis(data, colMeans(data), cov(data))
data
   score hours prep grade mahalnobis
1     81     7    3    80  3.3431887
2     83     8    4    78  5.7202321
3     92     3    0    80  7.3521513
4     87     1    3    80  3.1990061
5     96     4    5    84  4.2208239
6     73     3    0    85  3.4181516
7     68     2    1    88  3.1017453
8     77     5    2    94  2.8156955
9     78     5    1    91  1.9605904
10    97     5    2    95  5.6692191
11    99     2    3    79  5.3856421
12    86     3    5    82  3.5954695
13    84     4    3    95  3.9963068
14    96     8    2    84  5.9551989
15    70     3    2    81  2.4928251
16    80     3    1    93  2.4151973
17    83     7    5    83  4.3417003
18    83     3    3    80  0.9334786
19    73     4    2    89  1.4406139
20    70     1    3    79  4.6427634
Next, let's compute the p-value for each distance:
data$pvalue <- pchisq(data$mahalnobis, df=3, lower.tail=FALSE)
data
   score hours prep grade mahalnobis     pvalue
1     81     7    3    80  3.3431887 0.34167668
2     83     8    4    78  5.7202321 0.12604387
3     92     3    0    80  7.3521513 0.06148152
4     87     1    3    80  3.1990061 0.36194826
5     96     4    5    84  4.2208239 0.23858527
6     73     3    0    85  3.4181516 0.33153375
7     68     2    1    88  3.1017453 0.37620253
8     77     5    2    94  2.8156955 0.42092267
9     78     5    1    91  1.9605904 0.58062647
10    97     5    2    95  5.6692191 0.12886057
11    99     2    3    79  5.3856421 0.14564075
12    86     3    5    82  3.5954695 0.30858950
13    84     4    3    95  3.9963068 0.26186321
14    96     8    2    84  5.9551989 0.11381036
15    70     3    2    81  2.4928251 0.47658914
16    80     3    1    93  2.4151973 0.49081192
17    83     7    5    83  4.3417003 0.22685238
18    83     3    3    80  0.9334786 0.81734205
19    73     4    2    89  1.4406139 0.69604281
20    70     1    3    79  4.6427634 0.19990417
In general, an observation whose p-value is less than 0.001 is considered an outlier. In this case, all the p-values are greater than 0.001, so none of the observations is flagged as an outlier.
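The same rule can be applied programmatically, either on the p-values or by comparing the distances with the matching Chi-Square critical value. The sketch below is one possible way to do this, not part of the original tutorial. It reuses the distances printed in Step 2 and the tutorial's df of k-1 = 3; alpha, cutoff and the other names are illustrative.

```r
# Squared Mahalanobis distances copied from the Step 2 output
d2 <- c(3.3431887, 5.7202321, 7.3521513, 3.1990061, 4.2208239, 3.4181516,
        3.1017453, 2.8156955, 1.9605904, 5.6692191, 5.3856421, 3.5954695,
        3.9963068, 5.9551989, 2.4928251, 2.4151973, 4.3417003, 0.9334786,
        1.4406139, 4.6427634)

alpha <- 0.001  # outlier threshold used in this tutorial
df    <- 3      # k - 1 degrees of freedom, as in Step 3

# Route 1: flag rows whose p-value falls below alpha
pvalue        <- pchisq(d2, df = df, lower.tail = FALSE)
outliers_by_p <- which(pvalue < alpha)

# Route 2: flag rows whose distance exceeds the critical value
cutoff        <- qchisq(alpha, df = df, lower.tail = FALSE)
outliers_by_d <- which(d2 > cutoff)

cutoff          # distance a row would need to exceed to be flagged
outliers_by_p   # empty when no row is flagged
```

Both routes flag the same rows; the critical value is convenient when you want a single threshold to report alongside the distances.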