# Cross-Validation in R with an Example

## What Does Cross-Validation Mean?

Cross-validation is a statistical approach for determining how well the results of a statistical investigation generalize to a different data set.

Cross-validation is commonly employed in situations where the goal is prediction and the accuracy of a predictive model’s performance must be estimated.
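The simplest instance of this idea is a single hold-out split: fit on one portion of the data and measure error on the portion the model never saw. K-fold cross-validation, used below, repeats this over several splits. A minimal base-R sketch (the 80/20 ratio and the seed value are arbitrary choices):

```r
set.seed(1)  # arbitrary seed, for a repeatable split
n     <- nrow(mtcars)
idx   <- sample(n, size = floor(0.8 * n))  # 80% of rows for training
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt, data = train)               # fit only on the training rows
mse <- mean((test$mpg - predict(fit, test))^2)  # error on the unseen rows
mse
```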

We explored different stepwise regressions in a previous article and came up with several models; now let's see how cross-validation can help us choose the best one.

Which model is the most accurate at forecasting?

To begin, we load the required packages and take a look at our dataset, the built-in mtcars:

```
library(purrr)
library(dplyr)
head(mtcars)
```
```
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

There are several ways to accomplish this; we'll use the modelr package to help us.

To begin, we split our data into k folds, each consisting of a training set and a test set:


### K Fold Cross-Validation in R

```
library(modelr)
cv <- crossv_kfold(mtcars, k = 5)
cv
```
```
  train                test                .id  
  <named list>         <named list>        <chr>
1 <resample [25 x 11]> <resample [7 x 11]> 1    
2 <resample [25 x 11]> <resample [7 x 11]> 2    
3 <resample [26 x 11]> <resample [6 x 11]> 3    
4 <resample [26 x 11]> <resample [6 x 11]> 4    
5 <resample [26 x 11]> <resample [6 x 11]> 5    
```

Our data has been divided into five sets, each with a training set and a test set.
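Each entry in the train and test columns is a resample object, which stores row indices into mtcars rather than copying the data. A quick sketch of how to materialize one fold for inspection (fold sizes depend on the random split):

```r
library(modelr)

cv <- crossv_kfold(mtcars, k = 5)

# A resample object holds pointers into mtcars; as.data.frame()
# pulls out the actual rows of the first training fold.
fold1 <- as.data.frame(cv$train[[1]])
ncol(fold1)   # 11 columns, same as mtcars
```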

For each training set, we now use purrr's map() to fit a model. In effect, each of our three candidate models is fitted five times, once per fold.


### Model Fitting

```
models1 <- map(cv$train, ~lm(mpg ~ wt + cyl + hp, data = .))
models2 <- map(cv$train, ~lm(mpg ~ wt + qsec + am, data = .))
models3 <- map(cv$train, ~lm(mpg ~ wt + qsec + hp, data = .))
```

Now it's time to make some predictions. To accomplish this, I wrote a small function that takes a model and a test set and returns the predictions. Note the use of as.data.frame() to convert the resample object into an ordinary data frame.

```
get_pred <- function(model, test_data){
  data <- as.data.frame(test_data)
  pred <- add_predictions(data, model)
  return(pred)
}
pred1 <- map2_df(models1, cv$test, get_pred, .id = "Run")
pred2 <- map2_df(models2, cv$test, get_pred, .id = "Run")
pred3 <- map2_df(models3, cv$test, get_pred, .id = "Run")
```

Now we will calculate the MSE for each fold:


```
MSE1 <- pred1 %>%
  group_by(Run) %>%
  summarise(MSE = mean((mpg - pred)^2))
MSE1
```
```
  Run     MSE
  <chr> <dbl>
1 1      7.36
2 2      1.27
3 3      5.31
4 4      8.84
5 5     13.8 
```
```
MSE2 <- pred2 %>%
  group_by(Run) %>%
  summarise(MSE = mean((mpg - pred)^2))
MSE2
```
```
  Run     MSE
  <chr> <dbl>
1 1      6.45
2 2      2.27
3 3      7.71
4 4      9.56
5 5     15.4 
```
```
MSE3 <- pred3 %>%
  group_by(Run) %>%
  summarise(MSE = mean((mpg - pred)^2))
MSE3
```
```
  Run     MSE
  <chr> <dbl>
1 1      6.45
2 2      2.27
3 3      7.71
4 4      9.56
5 5     15.4 
```

Please note that your machine uses a different random number stream than mine to construct the folds, so your numbers may differ somewhat from those shown here.
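If you want the folds (and therefore the per-fold MSEs) to be reproducible across runs, fix the random seed before creating them; a small sketch, with the seed value chosen arbitrarily:

```r
library(modelr)

set.seed(42)   # arbitrary seed; any fixed value makes the split repeatable
cv <- crossv_kfold(mtcars, k = 5)

# Re-running both lines with the same seed yields the identical split.
```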


Finally, consider the following comparison of the three models:

```
mean(MSE1$MSE)
 7.31312
```
```
mean(MSE2$MSE)
 8.277929
```
```
mean(MSE3$MSE)
 9.333679
```

In this case the values are quite close; however, model1 has the lowest average MSE, so it appears to be the best model!
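The same comparison can be done programmatically; a small sketch, assuming the MSE1, MSE2, and MSE3 data frames built above are still in scope:

```r
# Average cross-validated MSE per candidate model; lower is better.
avg_mse <- c(model1 = mean(MSE1$MSE),
             model2 = mean(MSE2$MSE),
             model3 = mean(MSE3$MSE))
names(which.min(avg_mse))   # the model with the smallest average MSE
```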
