How to read or export large datasets in R

The fwrite() function writes a data frame or data.table to a delimited text file such as a CSV. It is part of the data.table package, not base R, and its name stands for “fast write”.

Like write.csv(), fwrite() produces a human-readable text file, not a binary one. It is faster and more efficient for large datasets because it is implemented in C and can use multiple threads.
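As a rough illustration of the speed difference, the sketch below writes the same synthetic table with write.csv() and with fwrite() and times both. The table `big`, its column names, and the temporary file paths are made up for this example, and the timings depend on your machine, so no numbers are shown here.

```r
library(data.table)

# Hypothetical test data: 100,000 rows with three columns
set.seed(1)
big <- data.frame(
  id    = seq_len(1e5),
  value = rnorm(1e5),
  group = sample(letters, 1e5, replace = TRUE)
)

base_path <- tempfile(fileext = ".csv")
fast_path <- tempfile(fileext = ".csv")

# Time the base R writer and the data.table writer on the same data
base_time <- system.time(write.csv(big, base_path, row.names = FALSE))
fast_time <- system.time(fwrite(big, fast_path))
print(rbind(write.csv = base_time[1:3], fwrite = fast_time[1:3]))

# The fwrite() output is plain text: the first line holds the column names
header_line <- readLines(fast_path, n = 1)
print(header_line)
```

On most machines the gap widens as the table grows, which is why fwrite() is usually preferred for large exports.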

In this tutorial, we will explain how to use the fwrite() function in R with examples using an inbuilt dataset.

We will also discuss some important parameters and options of this function.

Prerequisites:

Before diving into the fwrite() function, make sure you have a basic understanding of the R programming language and its syntax. You also need the data.table package installed, because it provides both fwrite() and fread().

You should also have some familiarity with working with CSV files in R using functions like read.csv() and write.csv().
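A minimal setup sketch follows. It assumes you are allowed to install packages from CRAN on your machine, and the variable name `has_fast_io` is only illustrative.

```r
# Install data.table once if it is not already available, then load it
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table")
}
library(data.table)

# Confirm that the two functions used in this tutorial are available
has_fast_io <- exists("fwrite") && exists("fread")
print(has_fast_io)
```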

Step 1: Loading the Dataset

Let’s start by loading the “iris” dataset, which is an inbuilt dataset in R that contains information about different types of iris flowers.

We can load this dataset using the following code:

data(iris)
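Before exporting, it can help to check what the dataset contains. The short sketch below inspects its size and column types using base R only; the names `iris_dims` and `iris_types` are illustrative.

```r
data(iris)

# Number of rows and columns
iris_dims <- dim(iris)
print(iris_dims)

# Class of each column: four numeric measurements and one factor
iris_types <- sapply(iris, class)
print(iris_types)

# First few rows
print(head(iris, 3))
```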

Step 2: Preparing the Data for Writing

fwrite() needs very little preparation. It accepts a data frame, a data.table, or a list of equal-length columns, so the “iris” data frame can be passed to it as it is.

Converting the data frame into a list is therefore optional. We show it here only because a list with each column as a separate element is also valid input for fwrite().

# Optional: convert the iris dataset into a list with each column as a separate element

iris_list <- as.list(iris)
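To check that the list form and the data frame form are treated the same way, the sketch below writes both to temporary files and compares the results line by line. The file paths and the name `same_output` are assumptions made for this example.

```r
library(data.table)

iris_list <- as.list(iris)

frame_path <- tempfile(fileext = ".csv")
list_path  <- tempfile(fileext = ".csv")

# Write the data frame and the list of columns to separate files
fwrite(iris, frame_path)
fwrite(iris_list, list_path)

# Compare the two files line by line
same_output <- identical(readLines(frame_path), readLines(list_path))
print(same_output)
```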

Step 3: Specifying the Path and Name of the Output File

Next, we need to specify the path and name of the file where we want to write our data.

In this example, we will save our data in a folder called “data” in our working directory.

We will name our file “iris.csv”. You can replace these values with your own preferred path and filename.

# fwrite() takes a file path as a character string, not an open connection

dir.create("data", showWarnings = FALSE)
csv_file <- file.path("data", "iris.csv")

Step 4: Writing the Dataset Using the fwrite() Function

Now that we have specified the path and name of our file, we can write the whole dataset with a single call to fwrite(). There is no need to write each column separately.

By default, fwrite() writes a header line with the column names, followed by one comma-separated line per row. The most useful options are sep (the field separator), col.names and row.names (whether to write them), quote (when to quote fields), append (add to an existing file instead of overwriting it), and nThread (the number of threads to use). Here’s how you can do it:

# Load data.table, which provides fwrite() and fread()

library(data.table)

# Write all five columns of iris to a comma-separated text file

fwrite(iris, file = csv_file)

fwrite() opens and closes the file itself, so there is no connection to close with close() afterwards.
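The sketch below tries two of fwrite()'s options on the iris data: sep to write a tab-separated file, and append to write a file in two chunks. The temporary file paths and variable names are chosen only for illustration.

```r
library(data.table)

# Tab-separated output instead of the default comma
tsv_path <- tempfile(fileext = ".tsv")
fwrite(iris, tsv_path, sep = "\t")
first_two <- readLines(tsv_path, n = 2)
print(first_two)

# Write the first half, then append the second half to the same file.
# When appending to an existing file, fwrite() does not repeat the header.
chunk_path <- tempfile(fileext = ".csv")
fwrite(iris[1:75, ], chunk_path)
fwrite(iris[76:150, ], chunk_path, append = TRUE)
n_lines <- length(readLines(chunk_path))
print(n_lines)
```

Writing in chunks with append = TRUE is one way to export data that is produced piece by piece, without holding the whole table in memory at once.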

Step 5: Reading Back from the File Using fread() Function (Optional)

After writing our data successfully using fwrite(), we can read it back using fread(), the matching “fast read” function from the data.table package.

# Read the whole file; fread() detects the separator and the column types automatically

read_data <- fread("data/iris.csv")

# fread() returns a data.table and reads Species as character; use stringsAsFactors = TRUE to get a factor instead

read_data2 <- fread("data/iris.csv", stringsAsFactors = TRUE)
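A round trip is a simple way to confirm that nothing was lost on the way to disk. The sketch below writes iris to a temporary file, reads it back with fread(), and compares a numeric column and the species labels with the original. It also shows the select and nrows arguments, which are useful for sampling a large file. All variable names and the temporary path are illustrative.

```r
library(data.table)

csv_path <- tempfile(fileext = ".csv")
fwrite(iris, csv_path)

# Read the full file back as a data.table
iris_back <- fread(csv_path)

# Compare a numeric column and the species labels with the original data
same_numbers <- isTRUE(all.equal(iris_back$Sepal.Length, iris$Sepal.Length))
same_species <- identical(iris_back$Species, as.character(iris$Species))
print(c(numbers = same_numbers, species = same_species))

# Read only two columns and the first ten rows
iris_part <- fread(csv_path, select = c("Petal.Width", "Species"), nrows = 10)
print(dim(iris_part))
```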

Step 6: Cleaning Up Resources After Completion

Because fwrite() and fread() open and close files themselves, cleaning up mostly means deleting temporary files you no longer need with unlink() or file.remove() and dropping large objects from the workspace with rm(). This matters most in scripts or functions that write and read large volumes of data repeatedly.
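One possible cleanup routine is sketched below, assuming the export was only needed temporarily. The path `export_path` and the object names are hypothetical.

```r
library(data.table)

# Write and re-read a temporary export
export_path <- file.path(tempdir(), "iris_export.csv")
fwrite(iris, export_path)
iris_copy <- fread(export_path)

# Delete the file once it is no longer needed
unlink(export_path)
file_removed <- !file.exists(export_path)
print(file_removed)

# Drop the in-memory copy
rm(iris_copy)
copy_removed <- !exists("iris_copy")
print(copy_removed)
```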
