Transposing Data in R
Transposing Data in R, Data manipulation and transformation are essential tasks in data analysis.
R, being a popular programming language for data science, offers various packages to perform these operations efficiently.
In this article, we will explore two R packages, tidyr and reshape2, which are part of the tidyverse and help in reshaping data from long to wide format or vice versa.
We will also discuss the dplyr package and its limitations in terms of transposing data.
Transpose Data with R
Before we dive into the examples, let’s create a sample dataset for demonstration.
data <- read.table(text="ID X Y Z 1 1 ProductA 13377 1 2 ProductA 14069 1 1 ProductB 11426 1 2 ProductB 11750 2 1 ProductA 11122 2 2 ProductA 11202 2 1 ProductB 14712 2 2 ProductB 10169 ",header=TRUE)
Convert Long to Wide Format
Suppose you have data containing three variables (X, Y, and Z) where X contains IDs, Y contains dates, and Z contains income.
You need to convert this long-format data to wide format so that the dates move to column names, and the income information comes under these dates.
Reshape2 Package
The reshape2 package offers two functions, dcast and acast, to transform long-format data to wide format.
The primary difference between these two functions is that dcast returns a data frame as output, while acast returns a vector, matrix, or array as output.
# Install reshape2 package if not installed already if (!require(reshape2)){ install.packages('reshape2') } library(reshape2) # Transform Long to Wide Format mydt = dcast(data, X~Y, value.var = "Z")
tidyr Package
The tidyr package provides the spread() function to reshape data from long to wide format.
How to make a rounded corner bar plot in R? » Data Science Tutorials
library(tidyr) mydt = spread(data, Y, Z) mydt ID X ProductA ProductB 1 1 1 13377 11426 2 1 2 14069 11750 3 2 1 11122 14712 4 2 2 11202 10169
Example 2: More than 1 ID Variable
Let’s consider an example where we have more than one ID variable. The dataset contains information about income generated from two products (Product A and B) reported semi-annually.
data <- read.table(text="Year SemiYear Product Income 1 1 ProductA 13377 1 2 ProductA 14069 1 1 ProductB 11426 1 2 ProductB 11750 2 1 ProductA 11122 2 2 ProductA 11202 2 1 ProductB 14712 2 2 ProductB 10169 ",header=TRUE)
Transform Data
library(reshape2) xx=dcast(data, Year + SemiYear ~ Product, value.var = "Income") xx Year SemiYear ProductA ProductB 1 1 1 13377 11426 2 1 2 14069 11750 3 2 1 11122 14712 4 2 2 11202 10169
Using tidyr Package
library(tidyr) mydt = spread(data, Product, Income) mydt Year SemiYear ProductA ProductB 1 1 1 13377 11426 2 1 2 14069 11750 3 2 1 11122 14712 4 2 2 11202 10169
If you want the final output to be reported at the year level:
# Transform Data
dcast(data, Year ~ Product, fun.aggregate = sum, value.var = "Income")
Year ProductA ProductB
1 1 27446 23176
2 2 22324 24881
This will give you the total income for each product at the year level.
Convert Wide Format Data to Long Format
Suppose you have data containing information about species and their sepal length. The data of sepal length for species are in columns.
We will now convert this wide-format data to long format.
# Create Sample Data mydata = read.table(text= "ID setosa versicolor virginica 1 5.1 NA NA 2 4.9 NA NA 3 NA 7 NA 4 NA 6.4 NA 5 NA NA 6.3 6 NA NA 5.8 ", header=TRUE) # Convert Wide Format Data to Long Format library(reshape2) x = colnames(mydata[,-1]) t = melt(mydata, id.vars = "ID", measure.vars = x , variable.name="Species", value.name="Sepal.Length", na.rm = TRUE) t ID Species Sepal.Length 1 1 setosa 5.1 2 2 setosa 4.9 9 3 versicolor 7.0 10 4 versicolor 6.4 17 5 virginica 6.3 18 6 virginica 5.8
tidyr: gather() Function
The gather() function in the tidyr package transforms data from wide to long format. The parameters for this function are:
- data: Data frame
- key: Name of the new key column
- value: Name of the new value column
- na.rm: na.rm=TRUE means removing rows from the output where the value is missing
# Transform Data gather(mydata, Species, Sepal.Length, x, na.rm = TRUE) ID Species Sepal.Length 1 1 setosa 5.1 2 2 setosa 4.9 9 3 versicolor 7.0 10 4 versicolor 6.4 17 5 virginica 6.3 18 6 virginica 5.8
Conclusion
In this article, we explored the process of reshaping data from long to wide format and vice versa using the tidyr and reshape2 packages in R.
We also discussed the limitations of the dplyr package in terms of transposing data.
By understanding and applying these techniques, you can efficiently manipulate and explore your data using R.
Creating Relative Frequency in R » finnstats