Random Forest Machine Learning: Introduction

When the relationship between a set of predictor variables and a response variable is highly complex, we frequently use non-linear methods to model it.

Classification and regression trees (commonly abbreviated CART) build decision trees from a set of predictor variables in order to predict the value of a response variable.

A small example of a regression tree is shown below.
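For illustration, here is a minimal regression tree sketch in R, assuming the rpart package is installed and using the built-in mtcars data; the variable choices are illustrative and not from the original post.

# Fit a small regression tree: predict miles per gallon (mpg)
# from weight, horsepower, and number of cylinders
library(rpart)

tree_fit <- rpart(mpg ~ wt + hp + cyl, data = mtcars, method = "anova")

# Inspect the splits and draw the tree
print(tree_fit)
plot(tree_fit, margin = 0.1)
text(tree_fit, use.n = TRUE)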

Decision trees have the advantage of being simple to understand and visualize. The disadvantage is that they are prone to high variance: if a dataset is split into two halves and a decision tree is fit to each half, the two trees can look substantially different.

Bagging

One way to reduce the variance of decision trees is a process known as bagging, which works as follows (a minimal sketch in R follows the steps):

1. From the original dataset, take b bootstrapped samples.

2. For each bootstrapped sample, create a decision tree.

3. Average the predictions of the individual trees to create a final model.
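Here is a minimal bagging sketch, assuming the randomForest package is installed and again using the built-in mtcars data (illustrative choices only). Setting mtry to the total number of predictors means every split considers all predictors, which reduces the procedure to plain bagging.

library(randomForest)

set.seed(1)
bag_fit <- randomForest(mpg ~ ., data = mtcars,
                        mtry = ncol(mtcars) - 1,  # consider all predictors at each split
                        ntree = 500)              # b = 500 bootstrapped trees

# The bagged prediction for an observation is the average of the 500 trees' predictions
head(predict(bag_fit, newdata = mtcars))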

A bagged model often delivers a lower test error rate than a single decision tree, which is a benefit of this technique.

The disadvantage is that if there is a very strong predictor in the dataset, the predictions from the collection of bagged trees can be strongly correlated. In this situation, this predictor will be used for the initial split by most or all of the bagged trees, resulting in trees that are similar to one another and have highly correlated predictions.

As a result, when we average the predictions of the trees to produce a final bagged model, that model may not reduce the variance of the predictions by much relative to a single decision tree.

One way to get around this problem is to use a method known as random forests.


What Are Random Forests?

Random forests, like bagging, use b bootstrapped samples from an initial dataset.

When generating a decision tree for each bootstrapped sample, however, only a random sample of m predictors from the whole set of p predictors is examined as split candidates each time a split in the tree is considered.

So, here is the full process that random forests use to build a model (a sketch in R follows the steps):

1. From the original dataset, take b bootstrapped samples.

2. For each bootstrapped sample, create a decision tree.

When building the tree, each time a split is evaluated, only a random sample of m predictors from the full set of p predictors is considered as split candidates.

3. Average the predictions of the individual trees to create a final model.
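Here is a minimal random forest sketch under the same assumptions (randomForest package, built-in mtcars data). The only change from the bagging sketch is that mtry is set to m < p, so each split only considers a random subset of the predictors.

library(randomForest)

set.seed(1)
p <- ncol(mtcars) - 1                        # total number of predictors
rf_fit <- randomForest(mpg ~ ., data = mtcars,
                       mtry = floor(p / 3),  # only m predictors sampled at each split
                       ntree = 500)          # b = 500 bootstrapped trees

# The final prediction is the average of the predictions from all 500 trees
predict(rf_fit, newdata = head(mtcars))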


Compared to the trees created by bagging, the trees in a random forest built this way are decorrelated.

As a result, when we average the predictions of the trees to create a final model, it has less variability and delivers a lower test error rate than a bagged model.

Out-of-Bag Error Estimation

Each bootstrapped sample typically leaves out about one-third of the observations in the original dataset; these left-out observations are said to be out-of-bag (OOB) for that tree. To predict the value of the ith observation in the original dataset, we average the predictions from the trees for which that observation was OOB.

Using this method, we can make a prediction for all n observations in the original dataset and calculate an error rate, known as the out-of-bag error, which is a reliable estimate of the test error.

The advantage of this method over k-fold cross-validation for estimating the test error is that it is significantly faster, especially when the dataset is large.
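As a sketch of how this looks in practice (same illustrative assumptions as above), the randomForest function reports the OOB error automatically, so no cross-validation loop is needed:

library(randomForest)

set.seed(1)
rf_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)

# For regression, the OOB mean squared error after all 500 trees
tail(rf_fit$mse, 1)

# Printing the model also reports the OOB error ("Mean of squared residuals")
print(rf_fit)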

Random Forests’ Benefits and Drawbacks

The following are some of the advantages of random forests:

Random forests will, in most circumstances, outperform bagged models and, in particular, single decision trees in terms of accuracy.

Random forests are robust to outliers.

Random forests require little pre-processing (for example, predictors do not need to be scaled).

Random forests, on the other hand, have the following possible drawbacks:

They are difficult to interpret.

They can be computationally expensive (i.e. slow) to build on large datasets.

Data scientists typically use random forests to maximize predictive accuracy, so the fact that they are not easily interpreted is usually not a concern.

