Missing Value Imputation in R

Every data user knows the problem:

Nearly all data sets contain some missing data, which can cause major issues like skewed estimations or decreased efficiency owing to a smaller data set.

Imputation techniques can be used to replace missing data with new values in order to lessen these problems.

Missing data imputation is a statistical technique that assigns substitute values to missing data items.

However, before we do anything, we first need to answer one question:


Why Is Missing Value Imputation Required?

In R, listwise deletion, which removes every row that has a missing value in one or more variables, is the default way many functions (for example lm) handle missing data.
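To make listwise deletion concrete, here is a minimal sketch in base R. The data frame df and its values are made up for illustration only. It assumes R's standard setting, in which lm() drops incomplete rows through its default na.action.

```r
# Toy data frame (hypothetical values) with one NA in each column
df <- data.frame(
  y = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
  x = c(1, 2, 3, NA, 5, 6)
)

# Explicit listwise deletion: keep only rows without any NA
df_complete <- na.omit(df)
nrow(df)           # 6 rows before deletion
nrow(df_complete)  # 4 rows after deletion

# lm() silently does the same through its default na.action (na.omit)
fit <- lm(y ~ x, data = df)
nobs(fit)          # 4 observations were actually used
```

Two of the six rows are lost even though only two single cells are missing.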

Why should we bother with more difficult concepts when that strategy is so simple to understand and apply?

The answer is the usual one:

Because we can improve the quality of our data analysis!

How missing values affect our data analysis depends on the response mechanism of our data, that is, on the process that determines which values are missing.

Data Imputation versus Listwise Deletion

Because imputing missing data does not reduce the sample size, the variance of analyses based on imputed data is typically lower.

Depending on the response mechanism, missing data imputation also outperforms listwise deletion in terms of bias.

To put it briefly: imputing missing data usually improves the quality of our data analysis!

As a result, it is usually worth replacing missing values by imputation.

And how does it work? I’ll show you exactly that right now!

Step 1: Apply Missing Data Imputation in R

Almost all statistical software today offers missing data imputation techniques.

Below, we’ll walk through an example in R (run in RStudio).

You could also apply imputation techniques in a variety of other programs, such as SPSS, Stata, or SAS.

We’ll use the airquality data set as an example. The data is already included in R and can be loaded as described below:

Let’s load the data

data(airquality)

Looking at the data structure, we can see that the data contains 153 observations of six variables (Ozone, Solar.R, Wind, Temp, Month, and Day). The variables Ozone and Solar.R have 37 and 7 missing values, respectively.
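The counts quoted above can be checked directly in base R. The following is a small sketch; the object name na_counts is only an illustrative choice.

```r
data(airquality)

# Dimensions of the data: 153 rows (observations) and 6 columns (variables)
dim(airquality)

# Number of missing values per variable
na_counts <- colSums(is.na(airquality))
na_counts
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0
```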

Data summaries

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

If we based our analysis on listwise deletion, our sample size would shrink to 111 observations.

That is a loss of 27.5%.

Check for the number of complete cases

sum(complete.cases(airquality))
[1] 111
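The size of that loss can be reproduced with a few lines of base R. This is just a sketch of the arithmetic; the variable names are illustrative.

```r
data(airquality)

n_total    <- nrow(airquality)                  # all observations
n_complete <- sum(complete.cases(airquality))   # rows without any NA
n_lost     <- n_total - n_complete              # rows dropped by listwise deletion

# Share of observations lost, in percent, rounded to one decimal place
loss_pct <- round(100 * n_lost / n_total, 1)
loss_pct
# [1] 27.5
```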

Such a drastic reduction of our sample size would reduce the precision of our estimates and, depending on the response mechanism, could also bias them.

Fortunately, missing data imputation lets us do better!


Using R to Impute Missing Values

The package “mice” (multivariate imputation by chained equations) is a powerful imputation tool for R (van Buuren, 2017).

For expert users, the mice package offers a variety of missing value imputation techniques and features.

However, it also comes with sensible default settings and is therefore very easy to use for newcomers.

Installing and loading the package should come first.

Let’s install and load the R package mice

#install.packages("mice")
library("mice")

Next, use the following code to impute the missing values.

Now we can impute the missing data

imp <- mice(airquality, m = 1)
     iter imp variable
  1   1  Ozone  Solar.R
  2   1  Ozone  Solar.R
  3   1  Ozone  Solar.R
  4   1  Ozone  Solar.R
  5   1  Ozone  Solar.R

After missing value imputation, we can simply store our imputed data in a new, completed data set.

Create a new data frame with the imputed data.

airquality_imputed <- complete(imp)

If you look at the structure of our imputed data, there are no missing values left. The imputation is complete.

head(airquality_imputed)
summary(airquality_imputed)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:120.0   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 32.00   Median :207.0   Median : 9.700   Median :79.00  
 Mean   : 41.76   Mean   :188.8   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 61.00   3rd Qu.:259.0   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0
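Because the imputation is stochastic, rerunning the code can give slightly different imputed values and summaries. The sketch below fixes the random seed through the seed argument of mice (the value 123 is arbitrary), silences the iteration log with printFlag = FALSE, and then checks that no missing values remain and that the observed values were left untouched.

```r
library(mice)
data(airquality)

# seed makes the imputation reproducible; 123 is an arbitrary choice
imp <- mice(airquality, m = 1, seed = 123, printFlag = FALSE)
airquality_imputed <- complete(imp)

# No missing values should remain in any column
anyNA(airquality_imputed)
colSums(is.na(airquality_imputed))

# Originally observed Ozone values are unchanged; only the NAs were filled in
obs <- !is.na(airquality$Ozone)
all(airquality_imputed$Ozone[obs] == airquality$Ozone[obs])
```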

That wasn’t really that difficult, was it?


That is because of the sensible default specifications of the mice function.

Let’s examine in more detail what actually happened during the imputation process:

m: The only argument I specified in the mice function was m. With m = 1, only one imputed data set is created.

Although multiple imputations of your data are usually preferable, I used a single one in the example above for simplicity.

If you want to perform multiple imputation, you can set m to the desired number of imputed data sets.

Alternatively, you can simply remove m = 1 from the function call to get the default of five imputed data sets.
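As a rough sketch of what working with several imputed data sets can look like: the code below creates five imputations, fits the same model on each with with(), and combines the results with pool(), which applies Rubin's rules. The regression of Ozone on Solar.R, Wind, and Temp is only an illustrative analysis model and is not part of the example above; the seed value is arbitrary.

```r
library(mice)
data(airquality)

# Five imputed data sets (m = 5 is also the default of mice)
imp5 <- mice(airquality, m = 5, seed = 123, printFlag = FALSE)

# Fit the same (illustrative) model on each of the five completed data sets
fit <- with(imp5, lm(Ozone ~ Solar.R + Wind + Temp))

# Combine the five sets of estimates into one set of pooled results
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors reflect the uncertainty about the imputed values, which a single imputation cannot capture.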

method: For each of your variables, you can choose a different imputation method using the method argument.

By default, mice uses predictive mean matching for numerical variables and logistic regression models for categorical data (multinomial logistic regression when a factor has more than two unordered levels).
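One common way to change the method for a single variable is sketched below. A dry run with maxit = 0 returns the default method vector without imputing anything, and one entry is then overwritten. Switching Ozone to "norm" (Bayesian linear regression) is purely an illustration, not a recommendation for this data.

```r
library(mice)
data(airquality)

# Dry run (maxit = 0): sets everything up but performs no imputation
ini  <- mice(airquality, maxit = 0, printFlag = FALSE)
meth <- ini$method
meth   # "pmm" for Ozone and Solar.R, "" for variables without missing values

# Illustrative change: impute Ozone by Bayesian linear regression instead of pmm
meth["Ozone"] <- "norm"

imp_norm <- mice(airquality, m = 1, method = meth, seed = 123, printFlag = FALSE)
imp_norm$method
```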


predictorMatrix: By default, mice builds the imputation model automatically and uses all available variables as predictors.

In our case, Solar.R was imputed using Ozone, Wind, Temp, Month, and Day, and Ozone was imputed using Solar.R, Wind, Temp, Month, and Day.

The predictorMatrix argument lets you specify the imputation models yourself, but it is often beneficial to use as many variables as you can.

The predictorMatrix argument can also be used to exclude organizational variables such as ID columns.
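A minimal sketch of editing the predictor matrix follows. In this matrix, each row is a variable to be imputed and each column a potential predictor, with 1 meaning "used" and 0 meaning "not used". The airquality data has no ID column, so Day is treated as if it were an organizational variable here, purely for illustration.

```r
library(mice)
data(airquality)

# Dry run to obtain the default predictor matrix
ini  <- mice(airquality, maxit = 0, printFlag = FALSE)
pred <- ini$predictorMatrix

# Illustrative assumption: do not use Day as a predictor for any variable
pred[, "Day"] <- 0
pred

imp_pred <- mice(airquality, m = 1, predictorMatrix = pred,
                 seed = 123, printFlag = FALSE)
```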

maxit: Another great feature of the R package mice is that the imputation is computed via multivariate imputation by chained equations (Azur et al., 2011).

Missing values are repeatedly replaced and deleted until the imputation algorithm has, ideally, converged.

By default, the mice function repeats this replacement and deletion cycle five times. You can change this number manually with the maxit argument.
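As a sketch, the number of iterations can be raised like this. The value 20 is an arbitrary choice for illustration, and the seed is arbitrary as well. The chainMean component of the result stores the mean of the imputed values at each iteration, which gives a rough view of how the algorithm behaved.

```r
library(mice)
data(airquality)

# Run 20 iterations instead of the default 5 (20 is an arbitrary example value)
imp_long <- mice(airquality, m = 1, maxit = 20, seed = 123, printFlag = FALSE)

# Number of iterations that were carried out
imp_long$iteration

# Mean of the imputed values per variable and iteration
dim(imp_long$chainMean)
```

Calling plot(imp_long) draws these per-iteration means and variances, so you can judge visually whether the chains have settled.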

The user can specify a wide range of additional arguments.

But the most crucial arguments are those mentioned above.

You can look at the R documentation of mice for more details.

Step 2: Find the Best Imputation Method for Your Data

The imputation method you use determines the imputed values, that is, the values that replace the missing data.

Over the past few decades, researchers have developed a wide variety of imputation techniques, from simple ones (e.g. mean imputation) to more complex ones (e.g. multiple imputation).

Different approaches can produce imputed values that differ greatly.

Therefore, it is worthwhile to take some time to choose an appropriate imputation technique for your data.
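To illustrate how much the choice of method can matter, the sketch below compares simple mean imputation (written by hand in base R) with the predictive mean matching used by mice for the Ozone variable. This comparison is an illustration added here, not a formal evaluation, and the seed is arbitrary. Mean imputation fills every gap with the same number, so it necessarily shrinks the spread of the variable.

```r
library(mice)
data(airquality)

ozone <- airquality$Ozone

# Mean imputation: every missing value is replaced by the observed mean
ozone_mean_imp <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)

# Predictive mean matching via mice (single imputation, arbitrary seed)
imp <- mice(airquality, m = 1, method = "pmm", seed = 123, printFlag = FALSE)
ozone_pmm_imp <- complete(imp)$Ozone

# Compare the spread of the observed and the two completed variables
sd(ozone, na.rm = TRUE)   # observed values only
sd(ozone_mean_imp)        # smaller: imputed values all sit at the mean
sd(ozone_pmm_imp)         # imputed values are drawn from observed donors
```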
