How to Calculate Mahalanobis Distance in R

by finnstats

The Mahalanobis distance measures how far a point lies from the center of a multivariate distribution, taking the variances and correlations of the variables into account. It’s frequently used to detect outliers in statistical analyses involving several variables.
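
For reference, the standard definition can be written as below. The symbols are introduced here for illustration and are not used elsewhere in this tutorial: x is an observation, \mu is the mean vector, and \Sigma is the covariance matrix of the distribution.

```latex
D_M(x) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

When \Sigma is the identity matrix, this reduces to the ordinary Euclidean distance between x and \mu.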

This tutorial describes how to calculate the Mahalanobis distance in R.

Mahalanobis Distance in R

The calculation takes three steps, starting from a data frame.

Step 1: Create Dataset.

We will use a dataset of 20 students that records each student's exam score (score), the number of hours spent studying (hours), the preparation number (prep), and the current grade (grade).

data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

head(data)

  score hours prep grade
1    81     7    3    80
2    83     8    4    78
3    92     3    0    80
4    87     1    3    80
5    96     4    5    84
6    73     3    0    85

Step 2: For each observation calculate the Mahalanobis distance

We can use the built-in mahalanobis() function in R. Its syntax is as follows:

mahalanobis(x, center, cov)

where:

x: a vector or matrix of data, with one observation per row

center: the mean vector of the distribution

cov: the covariance matrix of the distribution
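
As a rough illustration of how the three arguments fit together, here is a minimal sketch on a small made-up matrix. The values and the names toy, ctr, and S are hypothetical and are not part of this tutorial's dataset. The sketch also uses the inverted argument of mahalanobis(), which lets you pass a covariance matrix that has already been inverted.

```r
# Toy data: hypothetical values used only to illustrate the arguments
toy <- cbind(a = c(2, 4, 4, 6, 9),
             b = c(1, 3, 2, 5, 7))

ctr <- colMeans(toy)   # center: mean of each column
S   <- cov(toy)        # cov: sample covariance matrix

# x as a matrix: one squared distance per row
d2_all <- mahalanobis(toy, ctr, S)

# x as a single vector: one squared distance
d2_one <- mahalanobis(c(4, 3), ctr, S)

# Same result when supplying the inverse covariance matrix directly
d2_inv <- mahalanobis(toy, ctr, solve(S), inverted = TRUE)

print(d2_all)
print(d2_one)
```

The vector c(4, 3) is the second row of toy, so d2_one should match the second element of d2_all.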

Now we can calculate the distance for each observation. Note that mahalanobis() returns the squared Mahalanobis distance.

mahalanobis(data, colMeans(data), cov(data))

[1] 3.3431887 5.7202321 7.3521513 3.1990061 4.2208239 3.4181516 3.1017453 2.8156955 1.9605904 5.6692191 5.3856421 3.5954695 3.9963068 5.9551989 2.4928251 2.4151973 4.3417003 0.9334786 1.4406139 4.6427634
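
To make clear what mahalanobis() is computing, the following sketch evaluates the same quadratic form by hand and compares it with the built-in result. The names X, mu, S, centered, d2_manual, and d2_builtin are chosen here for illustration. It is a sanity check under the assumption that the sample mean and sample covariance are used, as in the call above, and is not a replacement for the built-in function.

```r
# Student data from Step 1
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

X  <- as.matrix(data)
mu <- colMeans(X)   # mean vector
S  <- cov(X)        # sample covariance matrix

# Center each row, then evaluate (x - mu)' S^-1 (x - mu) row by row
centered  <- sweep(X, 2, mu)
d2_manual <- rowSums((centered %*% solve(S)) * centered)

d2_builtin <- mahalanobis(X, mu, S)

# TRUE if the manual and built-in squared distances agree
print(all.equal(unname(d2_manual), unname(d2_builtin)))

# Ordinary (unsquared) Mahalanobis distances
print(round(sqrt(d2_builtin), 3))
```

A useful check: when the sample mean and sample covariance come from the same data, the squared distances sum to (n - 1) * k, which is 19 * 4 = 76 here.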

Step 3: Calculate the p-value

The Step 2 results show that some of the distances are much larger than others. To determine whether any of them are statistically significant, we need to calculate p-values.

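The p-value for each distance is the upper-tail probability of a Chi-Square distribution evaluated at the squared Mahalanobis distance. This tutorial uses k-1 degrees of freedom, where k is the number of variables (here k = 4, so df = 3). Note that for multivariate normal data the squared distance is usually compared against a Chi-Square distribution with k degrees of freedom. With df = 4 every p-value would be larger than the ones shown here, so the conclusion below would not change.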

First, store the distances as a new column in the data frame and print it.

data$mahalnobis <- mahalanobis(data, colMeans(data), cov(data))
data

   score hours prep grade mahalnobis
1     81     7    3    80  3.3431887
2     83     8    4    78  5.7202321
3     92     3    0    80  7.3521513
4     87     1    3    80  3.1990061
5     96     4    5    84  4.2208239
6     73     3    0    85  3.4181516
7     68     2    1    88  3.1017453
8     77     5    2    94  2.8156955
9     78     5    1    91  1.9605904
10    97     5    2    95  5.6692191
11    99     2    3    79  5.3856421
12    86     3    5    82  3.5954695
13    84     4    3    95  3.9963068
14    96     8    2    84  5.9551989
15    70     3    2    81  2.4928251
16    80     3    1    93  2.4151973
17    83     7    5    83  4.3417003
18    83     3    3    80  0.9334786
19    73     4    2    89  1.4406139
20    70     1    3    79  4.6427634

Now let’s compute the p-values.

data$pvalue <- pchisq(data$mahalnobis, df=3, lower.tail=FALSE)
data

   score hours prep grade mahalnobis     pvalue
1     81     7    3    80  3.3431887 0.34167668
2     83     8    4    78  5.7202321 0.12604387
3     92     3    0    80  7.3521513 0.06148152
4     87     1    3    80  3.1990061 0.36194826
5     96     4    5    84  4.2208239 0.23858527
6     73     3    0    85  3.4181516 0.33153375
7     68     2    1    88  3.1017453 0.37620253
8     77     5    2    94  2.8156955 0.42092267
9     78     5    1    91  1.9605904 0.58062647
10    97     5    2    95  5.6692191 0.12886057
11    99     2    3    79  5.3856421 0.14564075
12    86     3    5    82  3.5954695 0.30858950
13    84     4    3    95  3.9963068 0.26186321
14    96     8    2    84  5.9551989 0.11381036
15    70     3    2    81  2.4928251 0.47658914
16    80     3    1    93  2.4151973 0.49081192
17    83     7    5    83  4.3417003 0.22685238
18    83     3    3    80  0.9334786 0.81734205
19    73     4    2    89  1.4406139 0.69604281
20    70     1    3    79  4.6427634 0.19990417

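In general, an observation whose p-value is less than 0.001 is considered an outlier. In this case, all the p-values are greater than 0.001 (the smallest is about 0.0615, for observation 3), so no observation is flagged as an outlier.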
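
The same decision can be made on the distance scale by comparing each squared distance with a Chi-Square critical value. The sketch below does that and also computes the p-values with k degrees of freedom for comparison. The names X, d2, k, p_km1, p_k, alpha, cutoff, and outliers are chosen here for illustration, and the 0.001 threshold is simply the one quoted above.

```r
# Student data from Step 1
data <- data.frame(score = c(81, 83, 92, 87, 96, 73, 68, 77, 78, 97, 99, 86, 84, 96, 70, 80, 83, 83, 73, 70),
                   hours = c(7, 8, 3, 1, 4, 3, 2, 5, 5, 5, 2, 3, 4, 8, 3, 3, 7, 3, 4, 1),
                   prep = c(3, 4, 0, 3, 5, 0, 1, 2, 1, 2, 3, 5, 3, 2, 2, 1, 5, 3, 2, 3),
                   grade = c(80, 78, 80, 80, 84, 85, 88, 94, 91, 95, 79, 82, 95, 84, 81, 93, 83, 80, 89, 79))

X  <- as.matrix(data)
d2 <- mahalanobis(X, colMeans(X), cov(X))   # squared distances
k  <- ncol(X)                               # number of variables

# p-values with k - 1 df (as in this tutorial) and with k df
p_km1 <- pchisq(d2, df = k - 1, lower.tail = FALSE)
p_k   <- pchisq(d2, df = k,     lower.tail = FALSE)

# Critical value on the distance scale for alpha = 0.001
alpha  <- 0.001
cutoff <- qchisq(1 - alpha, df = k - 1)

# Row numbers of flagged observations (an empty result means none)
outliers <- which(d2 > cutoff)
print(outliers)

# Range of the p-values under the k-df reference distribution
print(range(p_k))
```

Flagging rows where d2 exceeds cutoff is equivalent to flagging rows where the corresponding p-value falls below alpha, so either form can be used.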
