Correlation in R with Missing Values

When one or more variables contain missing values, you can still compute correlation coefficients in R by telling the cor() function how to handle them through its use argument. This post covers two options: use='complete.obs' and use='pairwise.complete.obs'.

The examples that follow demonstrate each technique in action.

Example 1: Compute a correlation coefficient when missing values are present

Suppose we try to compute the Pearson correlation coefficient between two variables with the cor() function when missing values are present.


Let’s create two variables:

x <- c(60, 78, 90, 87, 84, NA, 91, 94, 83, 95)
y <- c(50, NA, 79, 86, 80, 83, 88, 92, 76, 15)

Try to determine the correlation between x and y.

cor(x, y)
[1] NA

Since we didn’t indicate how to handle missing values, the cor() function returns NA.
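As a quick check of why this happens, the sketch below counts the missing values and locates the positions where a pair is incomplete. It relies only on base R; the variable names n_missing_x, n_missing_y, incomplete_rows, and default_result are illustrative and not part of the original example.

```r
x <- c(60, 78, 90, 87, 84, NA, 91, 94, 83, 95)
y <- c(50, NA, 79, 86, 80, 83, 88, 92, 76, 15)

# Count the missing values in each vector
n_missing_x <- sum(is.na(x))
n_missing_y <- sum(is.na(y))

# Positions where at least one value of the pair is missing
incomplete_rows <- which(is.na(x) | is.na(y))
print(incomplete_rows)
# [1] 2 6

# The default setting is use = "everything", so any NA propagates to the result
default_result <- cor(x, y, use = "everything")
print(default_result)
# [1] NA
```

Writing use = "everything" explicitly gives the same result as calling cor(x, y) with no use argument.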

To get around this problem, we can tell R to use only the observations in which both values are present by passing the argument use='complete.obs':

Determine the correlation between x and y using only complete observations.


cor(x, y, use='complete.obs')
[1] 0.1519416

The correlation coefficient between the two variables comes out to be 0.1519416.

Note that to compute this coefficient, the cor() function used only the eight pairs in which both values were present; the pairs at positions 2 and 6 were dropped.
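To make the behavior concrete, here is a minimal sketch that drops the incomplete pairs by hand with complete.cases() and compares the result with use='complete.obs'. The names keep, r_manual, and r_builtin are illustrative and not part of the original example.

```r
x <- c(60, 78, 90, 87, 84, NA, 91, 94, 83, 95)
y <- c(50, NA, 79, 86, 80, 83, 88, 92, 76, 15)

# TRUE for positions where both x and y are present
keep <- complete.cases(x, y)
print(sum(keep))
# [1] 8

# Correlation on the manually filtered pairs
r_manual <- cor(x[keep], y[keep])

# Correlation with the built-in handling of missing values
r_builtin <- cor(x, y, use = "complete.obs")

# Both approaches should agree
print(all.equal(r_manual, r_builtin))
# [1] TRUE
```

The manual version is handy when you also need the filtered data for something else, such as a scatter plot of the same pairs.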

Example 2: Create a correlation matrix when missing values are present

Now suppose we try to use the cor() function to build a correlation matrix for a data frame with three variables when missing values are present.

Create a data frame that contains some missing values.

df <- data.frame(x=c(10, 18, 30, 87, 44, NA, 41, 24, 83, 15),
                 y=c(50, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 50, 48, 59, 50, 78, 71, 33, NA, 50))

Try to construct a correlation matrix for the data frame’s variables.


cor(df)
   x  y  z
x  1 NA NA
y NA  1 NA
z NA NA  1

Because we didn’t specify how to handle missing values, the cor() function returns NA for every pair of variables in which either column contains a missing value.
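A short sketch, assuming the same data frame as above, shows why every off-diagonal entry is NA: each column contains a missing value. The name na_per_column is illustrative.

```r
df <- data.frame(x=c(10, 18, 30, 87, 44, NA, 41, 24, 83, 15),
                 y=c(50, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 50, 48, 59, 50, 78, 71, 33, NA, 50))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)
# x y z
# 1 1 1
```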

To get around this problem, we can tell R to use, for each pair of variables, only the observations in which both values are present by passing the argument use='pairwise.complete.obs':

Create the correlation matrix using only pairwise complete observations.

cor(df, use='pairwise.complete.obs')
          x           y           z
x 1.0000000  0.37440384  0.34186351
y 0.3744038  1.00000000 -0.06905567
z 0.3418635 -0.06905567  1.00000000

The correlation coefficient for each pairwise combination of variables in the data frame is now displayed.
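With pairwise deletion, each coefficient can rest on a different subset of rows. The sketch below, which assumes the same data frame, counts how many observations stand behind each entry and contrasts this with use='complete.obs', which keeps only rows that are complete across all columns. The names present, n_pairs, cor_listwise, and n_listwise are illustrative.

```r
df <- data.frame(x=c(10, 18, 30, 87, 44, NA, 41, 24, 83, 15),
                 y=c(50, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 50, 48, 59, 50, 78, 71, 33, NA, 50))

# 1 where a value is present, 0 where it is missing
present <- (!is.na(as.matrix(df))) * 1

# Entry [i, j] is the number of rows where both column i and column j are present
n_pairs <- crossprod(present)
print(n_pairs)
# 9 on the diagonal, 8 for every off-diagonal pair

# Listwise deletion: use only the rows with no missing value in any column
cor_listwise <- cor(df, use = "complete.obs")
n_listwise <- sum(complete.cases(df))
print(n_listwise)
# [1] 7
```

Pairwise deletion keeps more data per coefficient, but because each entry can come from different rows, the resulting matrix is not guaranteed to be positive semi-definite. If a later analysis needs that property, use='complete.obs' is the more conservative choice.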
