Model Selection in Machine Learning

by finnstats

In machine learning, we often want to build a model from a set of predictor variables and a response variable.

Our goal is to create a model that can effectively predict the response variable’s value using the predictor variables.

With a set of p predictor variables, there are many models we could build. Best subset selection is one way to choose among them.

Fitting models is fairly simple; choosing among them is the real challenge of applied machine learning.

To begin, we must abandon the notion of a “best” model. Given the statistical noise in the data, the incompleteness of the data sample, and the limits of each model type, all models have some predictive inaccuracy.

As a result, the concept of a perfect or best model is useless. Instead, we must look for a “good enough” model.

Model Selection in Machine Learning

Assume you have a dataset with p = 3 predictor variables (x1, x2, x3) and a response variable y. To perform best subset selection, we would fit the following 2^p = 2^3 = 8 models to this dataset:

A model with no predictors
A model with x1 as a predictor
A model with x2 as a predictor
A model with x3 as a predictor
A model with x1 and x2 as predictors
A model with x1 and x3 as predictors
A model with x2 and x3 as predictors
A model with x1, x2, and x3 as predictors
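
As a minimal sketch, the eight candidate models above can be enumerated with Python's standard library. The names x1, x2 and x3 come from the example; the variable names in the code are only illustrative.

```python
from itertools import combinations

predictors = ["x1", "x2", "x3"]  # predictor names from the example above

# Every subset of size k = 0..p; the empty tuple is the model with no predictors.
subsets = [
    combo
    for k in range(len(predictors) + 1)
    for combo in combinations(predictors, k)
]

for combo in subsets:
    print(combo if combo else "(no predictors)")

print(len(subsets))  # 2**3 = 8
```

Each tuple in `subsets` identifies one model to fit, so the amount of fitting work grows with the length of this list.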

Then, among the models with the same number of predictors k, we’d pick the one with the highest R^2. For instance, we might end up selecting:

A model with no predictors
A model with x2 as a predictor
A model with x1 and x2 as predictors
A model with x1, x2, and x3 as predictors
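
The "best model of each size" step could look roughly like the sketch below. It assumes ordinary least squares with an intercept, uses a small made-up dataset, and implements the fit in plain Python so that it runs without extra libraries. The helper names `solve` and `r_squared` are hypothetical.

```python
from itertools import combinations

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def r_squared(X, y, cols):
    """Fit least squares with an intercept on the chosen columns; return training R^2."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up toy data: columns are x1, x2, x3.
names = ["x1", "x2", "x3"]
X = [[1, 2, 5], [2, 1, 3], [3, 4, 8], [4, 3, 1],
     [5, 6, 7], [6, 5, 2], [7, 8, 6], [8, 7, 4]]
y = [4.7, 2.9, 9.8, 7.8, 14.6, 12.7, 19.7, 17.9]

# For each model size k, keep the subset with the highest training R^2.
best_by_size = {}
for k in range(len(names) + 1):
    scored = [(r_squared(X, y, cols), cols)
              for cols in combinations(range(len(names)), k)]
    best_by_size[k] = max(scored)

for k, (r2, cols) in best_by_size.items():
    print(k, [names[c] for c in cols], round(r2, 4))
```

Training R^2 can never decrease when a predictor is added, so it is only meaningful for comparing models of the same size. Choosing between sizes needs a different criterion.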

Following that, we’d compare these candidates, one per model size, using a criterion that accounts for model complexity: the lowest cross-validated prediction error, the lowest AIC or BIC, or the highest adjusted R^2.

If we use cross-validation, the candidate that gives the lowest cross-validated prediction error is chosen as the “good enough” model.
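
To illustrate the cross-validation step, here is a minimal sketch that compares just two candidates, the model with no predictors and a one-predictor model, by k-fold cross-validated mean squared error. The data are made up, the function names are hypothetical, and the folds are contiguous and unshuffled for simplicity (real data would normally be shuffled first).

```python
def fit_mean(xs, ys):
    """Model with no predictors: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """One-predictor least squares: y = a + b * x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, k=4):
    """Mean squared prediction error over k contiguous folds.

    Assumes len(xs) is divisible by k so every observation is held out once.
    """
    n = len(xs)
    fold = n // k
    errors = []
    for i in range(k):
        test_idx = range(i * fold, (i + 1) * fold)
        train = [j for j in range(n) if j not in test_idx]
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        errors.extend((ys[j] - model(xs[j])) ** 2 for j in test_idx)
    return sum(errors) / len(errors)

# Made-up toy data for a single predictor x2 and response y.
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [4.7, 2.9, 9.8, 7.8, 14.6, 12.7, 19.7, 17.9]

candidates = {"no predictors": fit_mean, "x2": fit_line}
scores = {name: cv_mse(fit, x2, y) for name, fit in candidates.items()}
chosen = min(scores, key=scores.get)  # lowest cross-validated error wins

print(scores)
print("chosen:", chosen)
```

The same loop extends to models with more predictors: only the `fit` function changes, while the held-out error is computed in the same way.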

When two models perform comparably, the one with fewer parameters is usually preferred: it is less complex and therefore more likely to generalize well on average.

The following are four commonly used probabilistic model selection measures:

Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Minimum Description Length (MDL)
Structural Risk Minimization (SRM).
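
For the first two measures, the standard definitions are AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below applies them to two hypothetical models; the log-likelihood values are invented for illustration only.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L). Lower is better."""
    return k * math.log(n) - 2 * log_likelihood

n = 100  # hypothetical sample size

# Hypothetical fits: model B has a slightly better likelihood but more parameters.
model_a = {"log_likelihood": -120.0, "k": 3}
model_b = {"log_likelihood": -118.0, "k": 6}

aic_a = aic(model_a["log_likelihood"], model_a["k"])
aic_b = aic(model_b["log_likelihood"], model_b["k"])
bic_a = bic(model_a["log_likelihood"], model_a["k"], n)
bic_b = bic(model_b["log_likelihood"], model_b["k"], n)

print("AIC:", aic_a, aic_b)
print("BIC:", round(bic_a, 2), round(bic_b, 2))
```

In this made-up case both criteria favor the smaller model, because the gain in likelihood does not offset the penalty for three extra parameters. BIC penalizes each parameter more heavily than AIC once n exceeds about 7, since ln(n) is then greater than 2.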

Benefits and Drawbacks

Best subset selection has the following benefit:

It is simple to understand and interpret.

This approach, however, has the following drawbacks:

It has the potential to be computationally intensive.

Because it considers so many models, it may find one that fits the training data well by chance but performs poorly on future data. In other words, it can overfit.

What counts as a “good enough” model depends on your project. It could mean any of the following:

A model that satisfies the project stakeholders’ requirements and constraints.
A model that is sufficiently skillful given the time and resources available.
A model that is skillful compared to naive models.
A model that performs well in comparison to other models that have been tested.
A model that is competitive with the current state of the art.

Conclusion

While best subset selection is simple to implement and understand, it can be impractical for a dataset with a large number of predictors, because the number of models to fit doubles with every added predictor, and it may result in overfitting.

Stepwise selection is an alternative to this method that is more computationally efficient.
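
To give a rough sense of the difference in cost, the sketch below counts the models each approach has to fit. It assumes the forward variant of stepwise selection, which starts from the model with no predictors and adds one predictor at a time, fitting 1 + p(p + 1)/2 models in total. The function names are hypothetical.

```python
def best_subset_count(p):
    """Best subset selection fits every subset of p predictors."""
    return 2 ** p

def forward_stepwise_count(p):
    """Forward stepwise: the empty model, then p, p - 1, ..., 1 candidate fits per step."""
    return 1 + p * (p + 1) // 2

counts = {p: (best_subset_count(p), forward_stepwise_count(p)) for p in (3, 10, 20)}

for p, (subset_fits, stepwise_fits) in counts.items():
    print(f"p={p}: best subset={subset_fits}, forward stepwise={stepwise_fits}")
```

The trade-off is that stepwise selection examines only a fraction of the possible models, so it is not guaranteed to find the subset that best subset selection would choose.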
