How to Perform Data Cleaning in R

by finnstats

This tutorial walks through three common data cleaning tasks in R, each with a worked example: removing rows with missing values, replacing missing values with another value, and removing duplicate rows.

1. Remove rows with missing values:

library(dplyr)
df <- data.frame(team=c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points=c(4, 4, NA, 8, 6, 12, 14, 86, 13, 8),
                 rebounds=c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists=c(2, 2, NA, 7, 6, 6, 9, 10, NA, 14))
df

   team points rebounds assists
1     A      4        9       2
2     A      4        9       2
3     B     NA        7      NA
4     C      8        6       7
5     D      6        8       6
6     E     12       NA       6
7     F     14        9       9
8     G     86       14      10
9     H     13       12      NA
10    I      8       11      14

# Remove rows with missing values

new_df <- df %>% na.omit()
new_df

   team points rebounds assists
1     A      4        9       2
2     A      4        9       2
4     C      8        6       7
5     D      6        8       6
7     F     14        9       9
8     G     86       14      10
10    I      8       11      14

In this example, we use the `na.omit()` function, which comes from base R's `stats` package (only the `%>%` pipe comes from `dplyr`), to remove any rows that contain missing values (`NA`).

This function removes any row that has at least one missing value, regardless of whether the missing value is in a numeric or non-numeric column.


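After removing the rows with missing values, we print the updated data frame by typing its name, `new_df`. Note that the original row numbers (1, 2, 4, 5, 7, 8, 10) are retained. To open the result in a data viewer instead, use `View(new_df)` with a capital V.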
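Because `na.omit()` drops a row for a missing value in any column, it can discard more data than an analysis needs. The following is a minimal sketch of a narrower alternative, assuming the `tidyr` package is installed: `drop_na()` limited to a single column. The name `points_complete` is only illustrative and does not appear in the original example.

```r
library(tidyr)

df <- data.frame(team = c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points = c(4, 4, NA, 8, 6, 12, 14, 86, 13, 8),
                 rebounds = c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists = c(2, 2, NA, 7, 6, 6, 9, 10, NA, 14))

# Count the missing values in each column before deciding what to drop
colSums(is.na(df))

# Drop only the rows where `points` is missing; NAs in other columns are kept
points_complete <- drop_na(df, points)

# Compare how many rows each approach keeps
nrow(points_complete)  # 9 rows: only team B is removed
nrow(na.omit(df))      # 7 rows: teams B, E and H are removed
```

Which approach is appropriate depends on which columns the later analysis actually uses.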

2. Replace missing values with another value:

library(dplyr)
library(tidyr)

# Replace missing values in each numeric column with the median value of that column

new_df <- df %>% mutate(across(where(is.numeric), ~replace_na(., median(., na.rm = TRUE))))
new_df

   team points rebounds assists
1     A      4        9     2.0
2     A      4        9     2.0
3     B      8        7     6.5
4     C      8        6     7.0
5     D      6        8     6.0
6     E     12        9     6.0
7     F     14        9     9.0
8     G     86       14    10.0
9     H     13       12     6.5
10    I      8       11    14.0

In this example, we replace each missing value with the median of its own column.

We load both `dplyr` and `tidyr`, then use `mutate()` with `across()` and `where(is.numeric)` from `dplyr` to apply the same function to every numeric column at once. The `team` column is not numeric, so it is left untouched.

Inside `across()`, the `replace_na()` function from `tidyr` replaces any missing values (`NA`) with the median of the column. The median is calculated using `median()` with `na.rm = TRUE`, so that the missing values themselves are ignored in the calculation.

After replacing the missing values, we print the updated data frame by typing its name, `new_df`. The `assists` column now prints with a decimal place because its median, 6.5, is not a whole number.
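To check where the fill values come from, the column medians can be computed directly. The sketch below also shows one way to fill with a fixed value instead of the median. It assumes `dplyr` 1.0 or later and `tidyr` are installed; the names `medians`, `filled` and `zero_filled`, and the choice of 0 as the fixed value, are illustrative assumptions rather than part of the original example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(team = c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points = c(4, 4, NA, 8, 6, 12, 14, 86, 13, 8),
                 rebounds = c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists = c(2, 2, NA, 7, 6, 6, 9, 10, NA, 14))

# Median of each numeric column, ignoring NAs
medians <- sapply(df[sapply(df, is.numeric)], median, na.rm = TRUE)
medians  # points 8, rebounds 9, assists 6.5

# Same median imputation as in the example above
filled <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, median(.x, na.rm = TRUE))))
anyNA(filled)  # FALSE: no missing values remain

# Alternative: fill with a fixed value per column (0 is an arbitrary choice here)
zero_filled <- replace_na(df, list(points = 0, rebounds = 0, assists = 0))
```

The median is a reasonable default when a column contains outliers, such as the 86 in `points`, because it is less sensitive to them than the mean.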

3. Remove duplicate rows:

library(dplyr)

# Remove duplicate rows

new_df <- df %>% distinct(.keep_all = TRUE)
new_df

  team points rebounds assists
1    A      4        9       2
2    B     NA        7      NA
3    C      8        6       7
4    D      6        8       6
5    E     12       NA       6
6    F     14        9       9
7    G     86       14      10
8    H     13       12      NA
9    I      8       11      14

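In the third example, we use the `distinct()` function from `dplyr` to remove duplicate rows from our data frame.

Called with no column names, `distinct()` compares entire rows, so the second `A` row is dropped because it is an exact copy of the first. The `.keep_all = TRUE` argument only has an effect when specific columns are named inside `distinct()`, where it keeps the remaining columns in the result. Here it is harmless but not required, since all columns are kept anyway.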

After removing duplicate rows, we print the updated data frame by typing its name, `new_df`. Rows that still contain `NA` remain, because this step only removes duplicates.
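The effect of `.keep_all` is easier to see when uniqueness is judged on a single column. The following is a minimal sketch, assuming `dplyr` is installed; the names `by_team` and `teams_only` are illustrative and do not appear in the original example.

```r
library(dplyr)

df <- data.frame(team = c('A', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'),
                 points = c(4, 4, NA, 8, 6, 12, 14, 86, 13, 8),
                 rebounds = c(9, 9, 7, 6, 8, NA, 9, 14, 12, 11),
                 assists = c(2, 2, NA, 7, 6, 6, 9, 10, NA, 14))

# Base R check: how many rows are exact copies of an earlier row?
sum(duplicated(df))  # 1 (the second team A row)

# Uniqueness judged on `team` only; .keep_all = TRUE retains the other columns
# and keeps the first row found for each team
by_team <- df %>% distinct(team, .keep_all = TRUE)
dim(by_team)  # 9 rows, 4 columns

# Without .keep_all, only the `team` column is returned
teams_only <- df %>% distinct(team)
dim(teams_only)  # 9 rows, 1 column
```

With this data both approaches keep the same nine rows, since the two `A` rows are identical. They would differ if the same team appeared twice with different statistics.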
