How to Clean Up Your Data in R

by finnstats

Data cleaning is the process of converting raw, messy data into clean data that can be used for analysis or model building.

In most cases, “cleaning” a dataset means dealing with duplicated rows and missing values.

The most common ways to clean a dataset in R are removing rows with missing values, replacing missing values with another value, and removing duplicate rows.

The following examples show how to use each of these techniques in practice with an R data frame that contains information about several basketball players.

library(dplyr)
library(tidyr)

Let’s create a data frame

df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

Now we can view the data frame

df

   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
3     B     NA        7      NA
4     C     28        6      15
5     D     26        8      61
6     E     22       NA      16
7     F     24        9      15
8     G     26       14      20
9     H     23       12      NA
10    I     28       11      12
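
Before choosing a cleaning method, it can help to check how many missing values and duplicate rows the data actually contains. The following is a minimal sketch using base R functions on the same data frame; the names na_counts and dup_count are only illustrative and not part of the original tutorial.

```r
# Rebuild the example data frame so this snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# Count the missing values in each column
na_counts <- colSums(is.na(df))
na_counts

# Count the rows that are exact copies of an earlier row
dup_count <- sum(duplicated(df))
dup_count
```

For this data frame the counts show one missing value in points, one in rebounds, and two in assists, and no fully duplicated rows.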

Example 1: Remove Rows with Missing Values

The following syntax can be used to eliminate rows that have missing values in any column:

library(dplyr)

Let’s remove rows with missing values

new_df <- df %>% na.omit()
new_df

   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
4     C     28        6      15
5     D     26        8      61
7     F     24        9      15
8     G     26       14      20
10    I     28       11      12

The new data frame contains no rows with missing values: rows 3, 6, and 9 have been dropped.
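
If only some columns matter for the analysis, dropping every row with any missing value may discard more data than necessary. As a possible alternative, the drop_na() function from tidyr accepts column names and removes rows only when those columns are missing. The sketch below assumes that only the points column needs to be complete; that choice is purely illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data frame so this snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# Drop rows only when `points` is missing (assumed column of interest);
# missing values in the other columns are left in place
new_df <- df %>% drop_na(points)
new_df
```

Here only row 3 is removed, so rows 6 and 9 remain even though they still contain NA in rebounds and assists.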

Example 2: Substitute a Different Value for Missing Values

The following syntax can be used to replace any missing values with the median value of each column:

library(dplyr)
library(tidyr)

Replace each numeric column’s missing values with the column’s median

new_df <- df %>% mutate(across(where(is.numeric), ~ replace_na(., median(., na.rm = TRUE))))
new_df

   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
3     B     24        7      15
4     C     28        6      15
5     D     26        8      61
6     E     22        9      16
7     F     24        9      15
8     G     26       14      20
9     H     23       12      15
10    I     28       11      12

Each numeric column’s missing values have been replaced with the column’s median: 24 for points, 9 for rebounds, and 15 for assists.

Note that you could also use mean() in place of median() in the code above to replace missing values with the average value of each column.
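
As a sketch of that variation, the code below uses the same data frame and the same across() pattern, changing only the summary function from median() to mean().

```r
library(dplyr)
library(tidyr)

# Rebuild the example data frame so this snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# Replace each numeric column's missing values with the column's mean,
# ignoring the NA values when the mean is computed
new_df <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(., mean(., na.rm = TRUE))))
new_df
```

One thing to keep in mind is that the mean is sensitive to extreme values. In the assists column, the single value of 61 pulls the mean up to 20.375, whereas the median is 15, so the two approaches fill the gaps quite differently.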

In this example, we also had to load the tidyr package because it provides the replace_na() function.

Example 3: Remove Duplicate Rows

The following syntax can be used to remove rows that are exact duplicates of an earlier row:

new_df <- df %>% distinct(.keep_all=TRUE)
new_df

   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
3     B     NA        7      NA
4     C     28        6      15
5     D     26        8      61
6     E     22       NA      16
7     F     24        9      15
8     G     26       14      20
9     H     23       12      NA
10    I     28       11      12

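Because rows 1 and 2 differ in the points column (24 versus 14), no row in this data frame is an exact duplicate of another, so distinct() returns all 10 rows unchanged. Had every value in the second row matched the first row, the second row would have been removed.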
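
If the goal is to keep a single row per team instead of dropping only fully identical rows, distinct() also accepts column names. The sketch below assumes that team alone defines a duplicate, which is an illustrative choice and not something the data itself dictates. With .keep_all=TRUE, the first row found for each team is kept along with all of its other columns.

```r
library(dplyr)

# Rebuild the example data frame so this snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# Treat rows as duplicates when `team` matches (assumed key column);
# keep the first row for each team together with its other columns
new_df <- df %>% distinct(team, .keep_all=TRUE)
new_df
```

In this case the second team A row (the one with 14 points) is dropped, leaving nine rows with one row per team.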