Correlation in R with Missing Values
When one or more variables have missing values, you can still compute correlation coefficients in R by passing one of the following values to the use argument of cor(): use='complete.obs' or use='pairwise.complete.obs'.
The examples that follow demonstrate each technique in action.
Example 1: Compute a Correlation Coefficient When Missing Values Are Present
Suppose we try to compute the Pearson correlation coefficient between two variables using the cor() function when missing values are present.
Let’s create two variables:
x <- c(60, 78, 90, 87, 84, NA, 91, 94, 83, 95)
y <- c(50, NA, 79, 86, 80, 83, 88, 92, 76, 15)
Try to determine the correlation between x and y:
cor(x, y)
[1] NA
Since we didn’t indicate how to handle missing values, the cor() function returns NA.
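As a minimal sketch of why the result is NA, the snippet below locates the incomplete pairs and confirms that calling cor() without a use argument behaves like use = "everything", which is the documented default. The variable names missing_pairs, r_default and r_everything are introduced here purely for illustration.

```r
x <- c(60, 78, 90, 87, 84, NA, 91, 94, 83, 95)
y <- c(50, NA, 79, 86, 80, 83, 88, 92, 76, 15)

# Positions where at least one of the two values is missing
missing_pairs <- which(is.na(x) | is.na(y))
print(missing_pairs)

# With no use argument, cor() falls back to use = "everything",
# so a single NA in either vector makes the whole result NA
r_default <- cor(x, y)
r_everything <- cor(x, y, use = "everything")
print(c(r_default, r_everything))
```

Here the second and sixth pairs are incomplete, which is enough to turn the result into NA.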
To get around this problem, we can tell R to use only the observations in which both values are present by passing the argument use='complete.obs':
Determine the correlation between x and y using only the complete pairs:
cor(x, y, use='complete.obs')
[1] 0.1519416
The correlation coefficient between the two variables is 0.1519416.
Note that the cor() function used only the pairs of observations in which both values were present, which here means 8 of the 10 pairs.
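To make the filtering explicit, here is a small sketch that drops the incomplete pairs by hand with complete.cases() and compares the result with use = "complete.obs". The names keep, n_used, r_manual and r_builtin are illustrative and do not come from the original example.

```r
x <- c(60, 78, 90, 87, 84, NA, 91, 94, 83, 95)
y <- c(50, NA, 79, 86, 80, 83, 88, 92, 76, 15)

# TRUE for positions where both x and y are observed
keep <- complete.cases(x, y)

# Number of pairs that actually enter the calculation
n_used <- sum(keep)
print(n_used)

# Correlation computed on the manually filtered vectors
r_manual <- cor(x[keep], y[keep])

# Same calculation, letting cor() drop the incomplete pairs
r_builtin <- cor(x, y, use = "complete.obs")

# The two approaches should agree
print(all.equal(r_manual, r_builtin))
```

Filtering by hand is mostly useful when you also want to report how many observations the coefficient is based on.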
Example 2: Create a Correlation Matrix When Missing Values Are Present
Suppose we use the cor() function to build a correlation matrix for a data frame with three variables when missing values are present.
Create a data frame that contains some missing values:
df <- data.frame(x=c(10, 18, 30, 87, 44, NA, 41, 24, 83, 15),
                 y=c(50, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z=c(57, 50, 48, 59, 50, 78, 71, 33, NA, 50))
Try to construct a correlation matrix for the data frame’s variables:
cor(df)
   x  y  z
x  1 NA NA
y NA  1 NA
z NA NA  1
Because we didn’t specify how to handle missing values, the cor() function returns NA for every pair of distinct variables.
To get around this problem, we can tell R to compute each coefficient from the observations in which both variables of that pair are present by passing the argument use='pairwise.complete.obs':
Use only pairwise complete observations to generate the correlation matrix for the variables:
cor(df, use='pairwise.complete.obs')
          x           y           z
x 1.0000000  0.37440384  0.34186351
y 0.3744038  1.00000000 -0.06905567
z 0.3418635 -0.06905567  1.00000000
The matrix now shows a correlation coefficient for every pairwise combination of variables in the data frame.
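The two use options differ once there are more than two variables. The sketch below, using the same data frame, counts how many rows each option works with: 'complete.obs' keeps only rows with no missing value in any column, while 'pairwise.complete.obs' decides separately for each pair of columns. The names observed, n_pairwise, n_complete, r_listwise and r_pairwise are illustrative.

```r
df <- data.frame(x = c(10, 18, 30, 87, 44, NA, 41, 24, 83, 15),
                 y = c(50, NA, 79, 86, 84, 83, 88, 92, 76, 75),
                 z = c(57, 50, 48, 59, 50, 78, 71, 33, NA, 50))

# Indicator matrix: 1 where a value is observed, 0 where it is NA
observed <- !is.na(df)
storage.mode(observed) <- "integer"

# Entry [i, j] counts the rows in which columns i and j are both observed
n_pairwise <- crossprod(observed)
print(n_pairwise)

# Rows with no missing value in any column
n_complete <- sum(complete.cases(df))
print(n_complete)

# 'complete.obs' drops every row that contains an NA in any column
r_listwise <- cor(df, use = "complete.obs")

# 'pairwise.complete.obs' drops rows separately for each pair of columns
r_pairwise <- cor(df, use = "pairwise.complete.obs")

print(r_listwise)
print(r_pairwise)
```

With this data, each pair of variables shares 8 observed rows, but only 7 rows are complete across all three columns, so the two matrices are computed from different subsets of the data. Pairwise deletion keeps more data per coefficient, at the cost that different entries of the matrix may be based on different rows.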