Summary statistics in R

This tutorial shows how to use the aggregate function in the R programming language to compute summary statistics by group.

Definition: The aggregate R function computes summary statistics for a data set’s subgroups.

R Syntax for Beginners: The aggregate function’s basic R programming syntax is provided below.

aggregate(x = any_data, by = group_list, FUN = any_function)  # Basic R syntax of aggregate function

In the following sections, I’ll show you three different ways to use the aggregate function in R.

Example Data

Let’s start by creating some sample data.

data <- data.frame(x1 = 1:5,
                   x2 = 2:6,
                   x3 = 1,
                   group = c("A", "A", "B", "C", "C"))
data
  x1 x2 x3 group
1  1  2  1     A
2  2  3  1     A
3  3  4  1     B
4  4  5  1     C
5  5  6  1     C

The previously displayed RStudio console output shows that the example data has five rows and four columns.

The variables x1, x2, and x3 have numerical values, while group is a grouping indicator that divides our data into subgroups.
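As a quick check of this description, the following sketch recreates the example data and inspects its columns. The helper names value_cols, is_num, and group_sizes are only illustrative and are not part of the original example.

```r
# Recreate the example data so this snippet runs on its own
data <- data.frame(x1 = 1:5,
                   x2 = 2:6,
                   x3 = 1,
                   group = c("A", "A", "B", "C", "C"))

# Check that the value columns are numeric
value_cols <- c("x1", "x2", "x3")   # illustrative helper name
is_num <- sapply(data[value_cols], is.numeric)
is_num

# Count how many rows fall into each subgroup
group_sizes <- table(data$group)
group_sizes
```

Groups A and C contain two rows each, while group B contains a single row, so any statistic for B is based on one observation.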

Example 1: Compute the Mean by Group Using the aggregate Function

In Example 1, I’ll show you how to use the aggregate function to get the mean of each variable within each subgroup of our example data.

Three arguments must be specified within the aggregate function:

The input data.

The grouping indicator.

The function that we want to apply to each subgroup.

Examine the following R code:

aggregate(x = data[ , colnames(data) != "group"],
          by = list(data$group),
          FUN = mean)
  Group.1  x1  x2 x3
1       A 1.5 2.5  1
2       B 3.0 4.0  1
3       C 4.5 5.5  1

As you can see, the RStudio console returned the mean of each of our numeric variables (i.e. x1, x2, and x3) for each subgroup (A, B, and C).

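It’s worth noting that we had to remove the grouping indicator from the data passed to x, and that we had to wrap it in a list for the by argument. With the syntax used here, by must be a list, and x should contain only columns that the function given to FUN can handle (a mean of the character column group would not be meaningful).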
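As an alternative sketch, aggregate also has a formula interface, which avoids subsetting the data frame and building the list by hand. In the formula below, the dot stands for all columns other than group. Naming the list element in by (here group) replaces the default Group.1 label. The object names means_formula and means_named are only illustrative.

```r
# Recreate the example data so this snippet runs on its own
data <- data.frame(x1 = 1:5,
                   x2 = 2:6,
                   x3 = 1,
                   group = c("A", "A", "B", "C", "C"))

# Formula interface: aggregate all remaining columns by group
means_formula <- aggregate(. ~ group, data = data, FUN = mean)
means_formula

# List interface with a named list element, so the output
# column is called "group" instead of "Group.1"
means_named <- aggregate(x = data[ , colnames(data) != "group"],
                         by = list(group = data$group),
                         FUN = mean)
means_named
```

Both calls should return the same group means as the code above; only the label of the grouping column differs from the Group.1 output.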

Example 2: Compute the Sum by Group Using the aggregate Function

In the previous example, we calculated the mean of each subgroup across multiple columns of our data frame.

However, other functions can be easily applied within the aggregate command. In Example 2, I’ll show how to use the aggregate function to return the sum by group:

aggregate(x = data[ , colnames(data) != "group"],
          by = list(data$group),
          FUN = sum)
  Group.1 x1 x2 x3
1       A  3  5  2
2       B  3  4  1
3       C  9 11  2

All we had to do was change the FUN argument in the aggregate function. The previous output shows the sum of each variable in our example data by group.
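The same pattern works for other summary functions. As a minimal sketch, passing length to FUN returns the number of observations per group instead of the sum. The object name counts is only illustrative.

```r
# Recreate the example data so this snippet runs on its own
data <- data.frame(x1 = 1:5,
                   x2 = 2:6,
                   x3 = 1,
                   group = c("A", "A", "B", "C", "C"))

# Count the observations of each variable within each group
counts <- aggregate(x = data[ , colnames(data) != "group"],
                    by = list(group = data$group),
                    FUN = length)
counts
```

Because the example data contains no missing values, every variable has the same count within a group: two rows for A, one for B, and two for C.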

Example 3: Using the aggregate Function on Data With NAs

Missing values in the input data frame are a common issue when using the aggregate function.

As a result, Example 3 demonstrates how to handle NA values with the aggregate function. To begin, let’s add some NA values to our example data:

dataNA <- data                                             
dataNA$x1[2] <- NA
dataNA$x2[4] <- NA
dataNA
  x1 x2 x3 group
1  1  2  1     A
2 NA  3  1     A
3  3  4  1     B
4  4 NA  1     C
5  5  6  1     C

The previous RStudio console output shows how our updated data looks. Some data cells, as you can see, were set to NA.

Let’s try applying the aggregate function again.

aggregate(x = dataNA[ , colnames(dataNA) != "group"],
          by = list(dataNA$group),
          FUN = mean)
  Group.1  x1  x2 x3
1       A  NA 2.5  1
2       B 3.0 4.0  1
3       C 4.5  NA  1

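As you can see, some of the output values are NA, because the mean of a group that contains a missing value is itself NA. Fortunately, we can ignore the NA values by adding na.rm = TRUE to the aggregate call. The aggregate function passes this additional argument on to the mean function.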

aggregate(x = dataNA[ , colnames(dataNA) != "group"],
          by = list(dataNA$group),
          FUN = mean,
          na.rm = TRUE)
  Group.1  x1  x2 x3
1       A 1.0 2.5  1
2       B 3.0 4.0  1
3       C 4.5 6.0  1
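Since na.rm is handed through to the function given in FUN, the same result can also be obtained with an explicit wrapper function. The following sketch assumes the example data with NAs created above. The names mean_no_na and means_wrapped are hypothetical and used only for illustration.

```r
# Recreate the example data with NAs so this snippet runs on its own
data <- data.frame(x1 = 1:5,
                   x2 = 2:6,
                   x3 = 1,
                   group = c("A", "A", "B", "C", "C"))
dataNA <- data
dataNA$x1[2] <- NA
dataNA$x2[4] <- NA

# Wrapper that computes the mean of the non-missing values only
mean_no_na <- function(v) mean(v, na.rm = TRUE)

# Apply the wrapper to each variable within each group
means_wrapped <- aggregate(x = dataNA[ , colnames(dataNA) != "group"],
                           by = list(group = dataNA$group),
                           FUN = mean_no_na)
means_wrapped
```

A wrapper like this can be convenient when the summary function needs several fixed settings, or when it does not have an na.rm argument of its own.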
