How to Calculate Mahalanobis Distance in R

by finnstats

The Mahalanobis distance measures the distance between two points in multivariate space, taking the variances and correlations of the variables into account. It is frequently used to detect outliers in statistical analyses involving several variables.

This tutorial describes how to calculate the Mahalanobis distance in R.

Mahalanobis Distance in R

First, we need to create a data frame.

Step 1: Create the dataset

We will work with a dataset of 20 students that records each student's exam score, the number of hours spent studying, a preparation count (prep), and the current grade.

# Exam data for 20 students
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

head(data)

  score hours prep grade
1    81     7    3    80
2    83     8    4    78
3    92     3    0    80
4    87     1    3    80
5    96     4    5    84
6    73     3    0    85

Step 2: Calculate the Mahalanobis distance for each observation

We can use the built-in mahalanobis() function in R, which has the following syntax:

mahalanobis(x, center, cov)

where:

x: a vector or matrix of data

center: the mean vector of the distribution

cov: the covariance matrix of the distribution

Now we can calculate the distance for each observation. Note that mahalanobis() returns the squared Mahalanobis distance.

mahalanobis(data, colMeans(data), cov(data))

[1] 3.3431887 5.7202321 7.3521513 3.1990061 4.2208239 3.4181516 3.1017453 2.8156955 1.9605904 5.6692191 5.3856421 3.5954695 3.9963068 5.9551989 2.4928251 2.4151973 4.3417003 0.9334786 1.4406139 4.6427634
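To make it clearer what mahalanobis() computes, the following is a minimal sketch that reproduces the squared distance for the first observation by hand, using (x - mu)' S^-1 (x - mu), and compares it with the built-in result. The names x, mu, S, d1, manual, and builtin are illustrative and not part of the original tutorial.

```r
# Rebuild the tutorial's data frame so this sketch runs on its own
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

x  <- as.matrix(data)  # numeric matrix of observations (illustrative name)
mu <- colMeans(x)      # mean vector
S  <- cov(x)           # sample covariance matrix

# Squared distance for observation 1: (x - mu)' S^-1 (x - mu)
d1 <- x[1, ] - mu
manual <- as.numeric(t(d1) %*% solve(S) %*% d1)

# Built-in result for the same observation
builtin <- unname(mahalanobis(x, mu, S)[1])

all.equal(manual, builtin)  # the two computations should agree
sqrt(manual)                # distance on the original (unsquared) scale
```

Taking the square root gives the distance on the unsquared scale, but the squared value is the one compared against the Chi-Square distribution in the next step.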

Step 3: Calculate the p-value

Based on the Step 2 result, some of the distances are much larger than others. To determine whether any of them are statistically significant, we need to calculate p-values.

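The p-value for each distance is the upper-tail probability of a Chi-Square distribution evaluated at the squared Mahalanobis distance. This tutorial uses k-1 degrees of freedom, where k is the number of variables; here k = 4, so df = 3. Note that under multivariate normality the squared distance is more commonly compared against a Chi-Square distribution with k degrees of freedom, which yields somewhat larger p-values.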

data$mahalnobis <- mahalanobis(data, colMeans(data), cov(data))
data

   score hours prep grade mahalnobis
1     81     7    3    80  3.3431887
2     83     8    4    78  5.7202321
3     92     3    0    80  7.3521513
4     87     1    3    80  3.1990061
5     96     4    5    84  4.2208239
6     73     3    0    85  3.4181516
7     68     2    1    88  3.1017453
8     77     5    2    94  2.8156955
9     78     5    1    91  1.9605904
10    97     5    2    95  5.6692191
11    99     2    3    79  5.3856421
12    86     3    5    82  3.5954695
13    84     4    3    95  3.9963068
14    96     8    2    84  5.9551989
15    70     3    2    81  2.4928251
16    80     3    1    93  2.4151973
17    83     7    5    83  4.3417003
18    83     3    3    80  0.9334786
19    73     4    2    89  1.4406139
20    70     1    3    79  4.6427634

Now let’s compute the p-values:

data$pvalue <- pchisq(data$mahalnobis, df=3, lower.tail=FALSE)
data

   score hours prep grade mahalnobis     pvalue
1     81     7    3    80  3.3431887 0.34167668
2     83     8    4    78  5.7202321 0.12604387
3     92     3    0    80  7.3521513 0.06148152
4     87     1    3    80  3.1990061 0.36194826
5     96     4    5    84  4.2208239 0.23858527
6     73     3    0    85  3.4181516 0.33153375
7     68     2    1    88  3.1017453 0.37620253
8     77     5    2    94  2.8156955 0.42092267
9     78     5    1    91  1.9605904 0.58062647
10    97     5    2    95  5.6692191 0.12886057
11    99     2    3    79  5.3856421 0.14564075
12    86     3    5    82  3.5954695 0.30858950
13    84     4    3    95  3.9963068 0.26186321
14    96     8    2    84  5.9551989 0.11381036
15    70     3    2    81  2.4928251 0.47658914
16    80     3    1    93  2.4151973 0.49081192
17    83     7    5    83  4.3417003 0.22685238
18    83     3    3    80  0.9334786 0.81734205
19    73     4    2    89  1.4406139 0.69604281
20    70     1    3    79  4.6427634 0.19990417

As a rule of thumb, an observation whose p-value is less than 0.001 is considered an outlier. In this case, all the p-values are greater than 0.001, so none of the observations is flagged as an outlier.
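Rather than scanning the table by eye, the 0.001 rule can be applied programmatically. The sketch below is one way to do it, assuming the same df = 3 used in this tutorial; the names vars, d2, pvalue, alpha, outliers, and cutoff are illustrative and not part of the original code.

```r
# Rebuild the tutorial's data frame so this sketch runs on its own
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

# Use only the four original variables when computing distances
vars <- c("score", "hours", "prep", "grade")
d2 <- mahalanobis(data[, vars], colMeans(data[, vars]), cov(data[, vars]))

# Upper-tail Chi-Square probability (df = 3, as assumed in this tutorial)
pvalue <- pchisq(d2, df = 3, lower.tail = FALSE)

alpha <- 0.001                     # outlier threshold from the rule of thumb
outliers <- which(pvalue < alpha)  # row indices of flagged observations
outliers                           # empty when nothing is flagged

# Equivalent check on the distance scale: squared distances above this cutoff
cutoff <- qchisq(1 - alpha, df = 3)
which(d2 > cutoff)
```

Selecting the columns by name keeps the calculation correct even after helper columns such as the distance or p-value have been added to the data frame.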
