# Cross Validation in R with Example

## What Does Cross-Validation Mean?

Cross-validation is a statistical approach for estimating how well the results of an analysis generalize to an independent data set.

Cross-validation is commonly employed in situations where the goal is prediction and the accuracy of a predictive model’s performance must be estimated.
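As a minimal base-R sketch of the idea, consider a single hold-out split rather than full cross-validation (the split fraction, seed, and formula here are illustrative choices, not from the analysis below):

```r
# Hold out part of the data, fit on the rest, and measure prediction
# error on the held-out part -- the core idea behind cross-validation.
set.seed(1)
test_idx <- sample(nrow(mtcars), 8)        # hold out ~25% of the rows
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

fit <- lm(mpg ~ wt + hp, data = train)     # fit only on the training rows
mse <- mean((test$mpg - predict(fit, test))^2)
mse                                        # estimated out-of-sample error
```

K-fold cross-validation repeats this idea k times, so every row serves as test data exactly once.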

We explored different stepwise regressions in a previous article and came up with several candidate models. Now let's see how cross-validation can help us choose the best one.

Which model is the most accurate at forecasting?

To begin, we need to load our dataset:

```r
library(purrr)
library(dplyr)
head(mtcars)
```

```
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

There are several ways to accomplish this, but we’ll utilize the modelr package to assist us.

To begin, we split the data into k folds, each pairing a training set with a test set:


### K Fold Cross-Validation in R

```r
library(modelr)
cv <- crossv_kfold(mtcars, k = 5)
cv
```

```
  train                test                .id
  <named list>         <named list>        <chr>
1 <resample [25 x 11]> <resample [7 x 11]> 1
2 <resample [25 x 11]> <resample [7 x 11]> 2
3 <resample [26 x 11]> <resample [6 x 11]> 3
4 <resample [26 x 11]> <resample [6 x 11]> 4
5 <resample [26 x 11]> <resample [6 x 11]> 5
```

Our data has been divided into five sets, each with a training set and a test set.

We now use map to fit a model to each training set. In practice, each of our three candidate models is fitted to every fold separately.


### Model Fitting

```r
models1 <- map(cv$train, ~lm(mpg ~ wt + cyl + hp, data = .))
models2 <- map(cv$train, ~lm(mpg ~ wt + qsec + am, data = .))
models3 <- map(cv$train, ~lm(mpg ~ wt + qsec + hp, data = .))
```

Now it's time to make some predictions. To accomplish this, we write a small function that takes a model and test data and returns the predictions. Note that as.data.frame() is used to convert the resample object into a data frame.

```r
get_pred <- function(model, test_data){
  data <- as.data.frame(test_data)
  pred <- add_predictions(data, model)
  return(pred)
}
```

```r
pred1 <- map2_df(models1, cv$test, get_pred, .id = "Run")
pred2 <- map2_df(models2, cv$test, get_pred, .id = "Run")
pred3 <- map2_df(models3, cv$test, get_pred, .id = "Run")
```
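Under the hood, map2_df() walks the two lists in parallel, pairing each fitted model with the test fold it was *not* trained on, and stacks the results. The same pairing can be sketched in base R (using a hypothetical index-based fold split instead of modelr's resample objects):

```r
# Build 5 folds by index (a stand-in for crossv_kfold's resample objects)
set.seed(1)
fold_id <- sample(rep(1:5, length.out = nrow(mtcars)))

train_sets <- lapply(1:5, function(i) mtcars[fold_id != i, ])
test_sets  <- lapply(1:5, function(i) mtcars[fold_id == i, ])

# Fit one model per training fold
models <- lapply(train_sets, function(d) lm(mpg ~ wt + cyl + hp, data = d))

# Pair model i with test fold i, as map2_df(models1, cv$test, get_pred) does
pred_list <- Map(function(model, test) {
  test$pred <- predict(model, newdata = test)
  test
}, models, test_sets)
pred <- do.call(rbind, pred_list)   # one stacked data frame of predictions
```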

Now we will calculate the MSE for each fold:


```r
MSE1 <- pred1 %>%
  group_by(Run) %>%
  summarise(MSE = mean((mpg - pred)^2))
MSE1
```

```
  Run     MSE
  <chr> <dbl>
1 1      7.36
2 2      1.27
3 3      5.31
4 4      8.84
5 5     13.8
```

```r
MSE2 <- pred2 %>%
  group_by(Run) %>%
  summarise(MSE = mean((mpg - pred)^2))
MSE2
```

```
  Run     MSE
  <chr> <dbl>
1 1      6.45
2 2      2.27
3 3      7.71
4 4      9.56
5 5     15.4
```

```r
MSE3 <- pred3 %>%
  group_by(Run) %>%
  summarise(MSE = mean((mpg - pred)^2))
MSE3
```

```
  Run     MSE
  <chr> <dbl>
1 1      6.45
2 2      2.27
3 3      7.71
4 4      9.56
5 5     15.4
```

Please note that your machine draws different random numbers than mine to construct the folds, so your results may differ somewhat.
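Since the folds are drawn at random, setting a seed pins them down. A quick base-R illustration of the principle (calling the same set.seed() before crossv_kfold() makes the modelr folds reproducible too; the seed value is an arbitrary choice):

```r
# Same seed, same random fold assignment
set.seed(42)
folds_a <- sample(rep(1:5, length.out = nrow(mtcars)))

set.seed(42)
folds_b <- sample(rep(1:5, length.out = nrow(mtcars)))

identical(folds_a, folds_b)   # TRUE: the folds are reproducible
```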


Finally, consider the following comparison of the three models:

```r
mean(MSE1$MSE)
[1] 7.31312

mean(MSE2$MSE)
[1] 8.277929

mean(MSE3$MSE)
[1] 9.333679
```

In this case the values are quite close; however, model1 has the lowest mean MSE, so it appears to be the best model!
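The whole comparison can also be condensed into one loop over candidate formulas that shares the same folds (a base-R sketch; the three formulas come from the models above, while the seed and fold construction are illustrative choices):

```r
formulas <- list(
  model1 = mpg ~ wt + cyl + hp,
  model2 = mpg ~ wt + qsec + am,
  model3 = mpg ~ wt + qsec + hp
)

set.seed(42)
fold_id <- sample(rep(1:5, length.out = nrow(mtcars)))

# Mean cross-validated MSE for each formula, using identical folds
cv_mse <- sapply(formulas, function(f) {
  mean(sapply(1:5, function(i) {
    fit  <- lm(f, data = mtcars[fold_id != i, ])
    test <- mtcars[fold_id == i, ]
    mean((test$mpg - predict(fit, test))^2)
  }))
})

cv_mse
names(which.min(cv_mse))   # formula with the lowest cross-validated MSE
```

Using the same folds for every candidate keeps the comparison fair: each model is scored on exactly the same held-out rows.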
