How to Clean Up Your Data in R
Data cleaning is the process of turning raw, messy data into clean data that can be used for analysis or model building.
Most of the time, “cleaning” a dataset means dealing with missing values and duplicated data.
The most common ways to clean a dataset in R are removing rows with missing values, replacing missing values with another value, and removing duplicate rows.
The following examples show how to use each of these methods in practice, using an R data frame that contains information about several basketball players.
library(dplyr)
library(tidyr)
Let’s create a data frame
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))
Now we can view the data frame
df
   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
3     B     NA        7      NA
4     C     28        6      15
5     D     26        8      61
6     E     22       NA      16
7     F     24        9      15
8     G     26       14      20
9     H     23       12      NA
10    I     28       11      12
Example 1: Remove Rows with Missing Values
The following syntax can be used to eliminate rows that have missing values in any column:
library(dplyr)
Let’s remove rows with missing values
new_df <- df %>% na.omit()
new_df
   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
4     C     28        6      15
5     D     26        8      61
7     F     24        9      15
8     G     26       14      20
10    I     28       11      12
You’ll see that there are no rows in the new data frame with missing values.
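Before dropping rows, it can help to check how many values are missing in each column. The sketch below is an illustration rather than part of the original example: it counts missing values with base R's colSums() and is.na(), then uses tidyr's drop_na() to remove rows only where a chosen column is missing. The names na_counts and points_df, and the choice of the points column, are assumptions made for this illustration.

```r
library(dplyr)
library(tidyr)

# same data frame as above, repeated so the snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# count the missing values in each column
na_counts <- colSums(is.na(df))
na_counts

# drop rows only where 'points' is missing (illustrative column choice)
points_df <- df %>% drop_na(points)
points_df
```

Restricting the check to one column keeps rows that are missing values elsewhere, which may be preferable when only some columns matter for the analysis.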
Example 2: Substitute a Different Value for Missing Values
The median value of each column can be used to fill in any missing values using the technique shown below:
library(dplyr)
library(tidyr)
Replace each numeric column’s missing values with the column’s median
new_df <- df %>%
  mutate(across(where(is.numeric), ~replace_na(., median(., na.rm=TRUE))))
new_df
   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
3     B     24        7      15
4     C     28        6      15
5     D     26        8      61
6     E     22        9      16
7     F     24        9      15
8     G     26       14      20
9     H     23       12      15
10    I     28       11      12
Each numeric column’s missing values have been replaced with the column’s median, as can be seen.
Note that you could also use mean in place of median in the formula to replace missing values with the average value of each column.
In this example we also have to load the tidyr package, because it provides the replace_na() function.
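The mean-based variant mentioned above could look like the following sketch. It reuses the same data frame and only swaps median() for mean(); the name mean_df is chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# same data frame as above, repeated so the snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# replace each numeric column's missing values with the column's mean
mean_df <- df %>%
  mutate(across(where(is.numeric), ~replace_na(., mean(., na.rm=TRUE))))
mean_df
```

Keep in mind that the mean is more sensitive to extreme values than the median. In this data the assists column contains a value of 61, so its mean (20.375) is noticeably higher than its median (15).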
Example 3: Remove Duplicate Rows
The following syntax can be used to remove rows that are exact duplicates of an earlier row:
new_df <- df %>% distinct(.keep_all=TRUE)
new_df
   team points rebounds assists
1     A     24        9      12
2     A     14        9      12
3     B     NA        7      NA
4     C     28        6      15
5     D     26        8      61
6     E     22       NA      16
7     F     24        9      15
8     G     26       14      20
9     H     23       12      NA
10    I     28       11      12
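Because no two rows in this data frame are identical in every column (the first and second rows share the same team, rebounds, and assists but differ in points), distinct() returns all 10 rows unchanged. A row is removed only when every one of its values matches an earlier row.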
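distinct() can also compare rows on selected columns only. The sketch below assumes, purely for illustration, that the team column alone identifies a duplicate; the name team_df is likewise chosen here. With .keep_all=TRUE, the first row found for each team is kept along with all of its other columns.

```r
library(dplyr)

# same data frame as above, repeated so the snippet runs on its own
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(24, 14, NA, 28, 26, 22, 24, 26, 23, 28),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(12, 12, NA, 15, 61, 16, 15, 20, NA, 12))

# keep only the first row for each team, retaining all other columns
team_df <- df %>% distinct(team, .keep_all=TRUE)
team_df
```

Here the second row for team A is dropped, leaving 9 rows. Whether a subset of columns is the right definition of a duplicate depends on the dataset, so this choice should be made deliberately.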