How to do Binning in R?

In this tutorial, you will learn about data binning in R. Binning creates discrete categories from numerical data, which is frequently continuous.

It’s very handy for comparing different groups of data. Binning is a pre-processing step that groups numerical values into a small number of intervals.

Why do we need binning?

Binning can sometimes increase the predictive model’s accuracy.

To have a better grasp of the data distribution, you can use data binning to group a set of numerical values into a smaller number of bins.

dim(data)
2855   21

For example, the variable “ArrDelay” has 2855 observations and a range of -73 to 682. You can categorize the “ArrDelay” variable as [0 to 5], [6 to 10], [11 to 15], and so on.
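As a minimal sketch of fixed-width bins like these, base R's cut() can assign each value to an interval. The delays vector and the labels below are made up for illustration and are not taken from the tutorial's dataset.

```r
# Illustrative delays in minutes (made-up values, not from the tutorial's data)
delays <- c(0, 3, 5, 6, 10, 11, 15, 42)

# Breaks at 0, 5, 10, 15 give [0,5], (5,10], (10,15]; for whole minutes
# these correspond to "0 to 5", "6 to 10", and "11 to 15"
fixed_bins <- cut(
  delays,
  breaks = c(0, 5, 10, 15),
  labels = c("0 to 5", "6 to 10", "11 to 15"),
  include.lowest = TRUE  # keep a delay of exactly 0 in the first bin
)

# Count per bin; 42 falls outside the breaks and becomes NA
table(fixed_bins, useNA = "ifany")
```

Values outside the supplied breaks become NA, so in practice the breaks should cover the full range of the variable.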

Binning in R

In this tutorial, arrival delays are divided into four bins by quartile.

Quartiles are the boundaries that divide the observations into four intervals. They are calculated from the values of the data points and their position relative to the rest of the dataset.

Binning is simple to implement with the tidyverse. Assume you want four bins with the same number of observations, in which case you’ll need three numbers as dividers:

These dividers are the 1st, 2nd, and 3rd quartiles (Q1, Q2, and Q3).

The dataset is divided into two halves by the median. The median of the lower half of the dataset is the 1st quartile or lower quartile. This quartile is referred to as Q1.

The median of the entire dataset is the second quartile, Q2.

The median of the upper half of the dataset is the upper quartile, or 3rd quartile, Q3.
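The three dividers can be computed with base R's quantile(). The sketch below uses a small made-up vector rather than the tutorial's data. Note that quantile() interpolates between data points by default, so its results can differ slightly from the "median of each half" rule described above.

```r
# Illustrative delays in minutes (made-up values, sorted for readability)
delays <- c(-12, -5, 0, 3, 7, 15, 22, 40, 120)

# Q1, Q2 (the median), and Q3: the three dividers for four bins
dividers <- quantile(delays, probs = c(0.25, 0.5, 0.75))
dividers

# Q2 is the median of the whole vector
median(delays)
```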

Histogram

Plotting a histogram before binning can give you an idea of how the data looks.

library(ggplot2)
# Histogram of arrival delays across the full observed range
ggplot(data = data, mapping = aes(x = ArrDelay)) +
  geom_histogram(bins = 100, color = "white", fill = "red") +
  coord_cartesian(xlim = c(-73, 682))

Based on the plot above, most of the flights experience little or no delay, and the distribution is roughly bell-shaped and right-skewed.

Let’s get binning now. To begin, divide “ArrDelay” into four buckets, each with an equal number of observations of flight arrival delays, using the dplyr ntile() function.

Then, create a column called “rank” that holds the bin labels 1, 2, 3, and 4.

This categorizes the data into different bins based on the number of minutes the planes were delayed.

The longer the flight was delayed, the larger the bin label. You can do all of this in one line of code.

library(dplyr)
binning <- data %>% mutate(rank = ntile(ArrDelay, 4))
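To check what ntile() produces, here is a small self-contained sketch on made-up data. The flights data frame and its values are hypothetical stand-ins for the tutorial's data object.

```r
library(dplyr)

# Illustrative data frame (made-up delays; the tutorial's `data` is not reproduced here)
flights <- data.frame(
  ArrDelay = c(15, -5, 40, 0, 120, 7, -12, 3, 60, 22, 9, 250)
)

# Assign each row to one of four equally sized buckets by arrival delay
binned <- flights %>% mutate(rank = ntile(ArrDelay, 4))

# Each bucket holds 3 of the 12 rows
count(binned, rank)

# Range of delays in each bucket: the larger the rank, the longer the delays
binned %>%
  group_by(rank) %>%
  summarise(min_delay = min(ArrDelay), max_delay = max(ArrDelay))
```

Because ntile() splits by rank rather than by value, the buckets have equal counts but can span very different ranges of minutes.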

Conclusion

Binning is a data pre-processing technique that groups a series of numerical values into a set of bins, as you learned in this tutorial.

Binning can help you better understand the distribution of your data and increase the accuracy of predictive models.

You also learned how to improve data analysis by using a binning method that separates numerical values into quartiles.

2 Responses

  1. Don MacQueen says:

    Base R has everything needed, and is simpler and more direct (in my opinion, that is):

    > x <- ...
    > bins <- cut(x, quantile(x))
    > table(bins)
    bins
    (-2.77,-0.687] (-0.687,0.0878] (0.0878,0.757] (0.757,3.27]
    124 125 125 125

    ## an alternative would be
    > ints <- findInterval(x, quantile(x))

    ## or using the example data
    bins <- with(data, cut(ArrDelay , quantile(ArrDelay) ) )

    ##
    For a small but interesting side trip, investigate the type argument of quantile(), for variations in how to interpolate the quantiles from the data.
