How to Deal with Class Imbalance in R (With Example)
When working with machine learning methods, the classes in a dataset are frequently imbalanced.
Consider the following scenarios:
A dataset on whether or not collegiate players are drafted into the NBA might contain 98% of players who are not drafted and only 2% who are.
A dataset on whether or not patients have cancer could comprise 99% of patients without cancer and only 1% with cancer.
A dataset on bank fraud may contain 96% legitimate transactions and only 4% fraudulent ones.
With classes this unequal, a prediction model can reach high overall accuracy by favoring the majority class while still performing poorly on the minority class.
Worse yet, the minority class is frequently the one we are most interested in predicting.
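To make this concrete, here is a minimal sketch in base R. The labels are hypothetical and simply mirror the 98% / 2% split from the NBA draft scenario; the "model" is a naive baseline that always predicts the majority class.

```r
# Hypothetical labels: 980 players not drafted ("No"), 20 drafted ("Yes")
actual <- factor(rep(c("No", "Yes"), times = c(980, 20)), levels = c("No", "Yes"))

# A naive baseline that always predicts the majority class
predicted <- factor(rep("No", length(actual)), levels = c("No", "Yes"))

# Overall accuracy: share of predictions that match the actual label
accuracy <- mean(predicted == actual)

# Minority-class recall: share of actual "Yes" cases predicted as "Yes"
minority_recall <- sum(predicted == "Yes" & actual == "Yes") / sum(actual == "Yes")

accuracy         # 0.98
minority_recall  # 0
```

The baseline is 98% accurate yet never identifies a single drafted player, which is why accuracy alone can be misleading on imbalanced data.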
How to Deal with Class Imbalance in R
Synthetic Minority Oversampling Technique, or SMOTE for short, is one way to address this imbalance problem.
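This method constructs a new dataset by generating synthetic minority-class observations, each created by interpolating between an existing minority observation and one of its nearest minority-class neighbors. Combined with undersampling of the majority class, this results in a dataset with more evenly distributed classes.
The SMOTE() function from the DMwR package is a straightforward way to use SMOTE in R.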
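The core idea of generating a synthetic minority observation can be sketched in a few lines of base R. This is only an illustration of the interpolation step for numeric predictors, not the DMwR implementation; the data, the function name synthesize_one, and its arguments are assumptions made for this example.

```r
set.seed(1)

# Hypothetical minority-class observations with two numeric predictors
minority <- matrix(rnorm(20), ncol = 2, dimnames = list(NULL, c("x1", "x2")))

# Create one synthetic observation from row i of X (illustrative helper)
synthesize_one <- function(X, i, k = 5) {
  # Euclidean distance from row i to every minority observation
  d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
  d[i] <- Inf                          # exclude the observation itself
  neighbors <- order(d)[seq_len(k)]    # indices of the k nearest neighbors
  j <- neighbors[sample.int(k, 1)]     # choose one neighbor at random
  gap <- runif(1)                      # random position between 0 and 1
  X[i, ] + gap * (X[j, ] - X[i, ])     # point on the segment from i to j
}

new_point <- synthesize_one(minority, i = 1)
new_point
```

Because the new point lies on the line segment between two real minority observations, it stays inside the region already occupied by the minority class rather than duplicating an existing row.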
The following is the fundamental syntax for this function:
SMOTE(form, data, perc.over = 200, perc.under = 200, ...)
where:
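form: A formula describing the prediction problem (response on the left, predictors on the right).
data: The data frame containing the original, imbalanced dataset.
perc.over: A percentage that controls how many synthetic cases are generated from the minority class. For example, perc.over = 200 creates two new cases for every original minority case.
perc.under: A percentage that controls how many majority-class cases are randomly selected for each synthetic minority case generated. For example, perc.under = 200 selects two majority cases for every new minority case.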
The example below demonstrates how to utilize this function in practice.
Example: How to Use SMOTE in R
Assume we have the following dataset in R with 100 observations, 90 of which have a ‘Yes’ class and 10 of which have a ‘No’ class for the response variable:
#make this example reproducible
set.seed(123)
#create a data frame with one response variable and two predictor variables
df <- data.frame(y = rep(as.factor(c('Yes', 'No')), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))
#view the first six rows of the data frame
head(df)
    y          x1          x2
1 Yes -0.56047565 -0.71040656
2 Yes -0.23017749  0.25688371
3 Yes  1.55870831 -0.24669188
4 Yes  0.07050839 -0.34754260
5 Yes  0.12928774 -0.95161857
6 Yes  1.71506499 -0.04502772
#view the distribution of the response variable
table(df$y)
 No Yes
 10  90
Because the response variable we’re predicting includes 90 observations with a class of ‘Yes’ and only 10 observations with a class of ‘No,’ this is an excellent example of an imbalanced dataset.
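If it helps to quantify the imbalance rather than just count the classes, the following sketch rebuilds the same data frame and computes class proportions and an imbalance ratio. The variable names class_counts, class_share, and imbalance_ratio are chosen here for illustration only.

```r
# Rebuild the example data frame so this snippet runs on its own
set.seed(123)
df <- data.frame(y = rep(as.factor(c("Yes", "No")), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

# Counts and proportions of each class
class_counts <- table(df$y)
class_share <- prop.table(class_counts)   # No: 0.1, Yes: 0.9

# Majority observations per minority observation
imbalance_ratio <- max(class_counts) / min(class_counts)   # 9
```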
Using the SMOTE() function from the DMwR package, we can create a more balanced dataset:
#install and load the DMwR package (if it is no longer available on CRAN, install it from the CRAN archive)
install.packages("DMwR")
library(DMwR)
#use SMOTE to create a new, more balanced dataset
new <- SMOTE(y ~ ., df, perc.over = 2000, perc.under = 400)
#view the distribution of the response variable in the new dataset
table(new$y)
 No Yes
210 800
The generated dataset contains 210 observations with the class ‘No’ and 800 observations with the class ‘Yes.’
This is how the SMOTE function generated the new dataset:
With perc.over = 2000, we asked SMOTE to generate 2000/100 (i.e. 20) synthetic observations for each existing minority observation. Because the original dataset contained 10 minority observations, 20 * 10 = 200 synthetic minority observations were added, for a total of 210.
With perc.under = 400, we asked SMOTE to select 400/100 (i.e. 4) majority observations for each synthetic minority observation that was generated.
Because 200 synthetic minority observations were generated, 200 * 4 = 800 majority observations were selected. The original dataset contained only 90 majority observations, so these were sampled with replacement and the new dataset contains repeated majority rows.
The result is a dataset that still has more majority-class than minority-class observations, but is considerably more balanced than the original dataset.
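The arithmetic above can be wrapped in a small helper to preview the class counts before running SMOTE. This is a sketch of the calculation only: smote_counts is a hypothetical function written for this example, not part of DMwR, and it assumes perc.over is at least 100.

```r
# Expected class counts after SMOTE (illustrative helper, not a DMwR function)
smote_counts <- function(n_minority, perc.over, perc.under) {
  # Synthetic minority cases generated from the existing minority cases
  n_synthetic <- n_minority * perc.over / 100
  # Majority cases selected per synthetic minority case
  n_majority <- n_synthetic * perc.under / 100
  c(minority = n_minority + n_synthetic, majority = n_majority)
}

# Settings used in this example: minority 210, majority 800
example_counts <- smote_counts(10, perc.over = 2000, perc.under = 400)

# Default settings (perc.over = 200, perc.under = 200): minority 30, majority 40
default_counts <- smote_counts(10, perc.over = 200, perc.under = 200)
```

Trying a few combinations this way makes it easier to pick values that give the class ratio you want.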
You can now fit your preferred classification algorithm to this new dataset. It will often perform better on the minority class, because the model sees many more minority observations during training.
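As one possible next step, the sketch below fits a logistic regression with glm() and checks how well the minority class is recovered. To keep it runnable without DMwR, it uses a hypothetical stand-in for the resampled data with the same class counts (210 'No', 800 'Yes') and an artificial shift in x1 so there is some signal to learn; it is not the output of SMOTE().

```r
set.seed(123)

# Hypothetical stand-in for the resampled dataset (not actual SMOTE output)
new <- data.frame(
  y  = factor(rep(c("No", "Yes"), times = c(210, 800)), levels = c("No", "Yes")),
  x1 = c(rnorm(210, mean = -1), rnorm(800, mean = 1)),
  x2 = rnorm(1010)
)

# Logistic regression; glm() models the probability of the second level, "Yes"
fit <- glm(y ~ x1 + x2, data = new, family = binomial)

# Predicted class using a 0.5 probability threshold
pred <- factor(ifelse(predict(fit, type = "response") > 0.5, "Yes", "No"),
               levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are actual classes
conf_mat <- table(predicted = pred, actual = new$y)

# Recall on the minority class "No"
recall_no <- conf_mat["No", "No"] / sum(conf_mat[, "No"])
conf_mat
recall_no
```

This measures performance on the training data only. For an honest estimate, the model should be evaluated on held-out data that was not resampled.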
Note: You can experiment with the perc.over and perc.under options in the SMOTE function to create a dataset that meets your requirements.