# Regression Analysis in R: Model Comparison

For this regression analysis in R we use the Boston housing data, which contains 506 observations and 14 variables.

In this dataset, medv is the response variable, and the remaining are the predictors.

We want to build a regression model to predict medv from the other predictor variables.

All of the variables are numeric except one (chas, which is a factor).

First, we check for multicollinearity among the predictors; the factor variable is excluded from the correlation calculation.

In this dataset, some pairs of predictors are highly correlated, which can lead to unstable coefficient estimates and inaccurate results.
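One way to spot these problem pairs is to screen the correlation matrix of the numeric predictors. The helper below is a minimal sketch (the function name `high_cor_pairs` and the 0.7 cutoff are my own choices, not from the original post); it skips factor columns automatically, so it can be applied directly to the Boston housing data frame once it is loaded:

```r
# Sketch: list predictor pairs whose absolute correlation exceeds a cutoff,
# skipping non-numeric (factor) columns such as chas.
high_cor_pairs <- function(df, cutoff = 0.7) {
  num <- df[, sapply(df, is.numeric), drop = FALSE]  # numeric columns only
  cm <- cor(num)
  cm[upper.tri(cm, diag = TRUE)] <- NA               # keep each pair once
  idx <- which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = round(cm[idx], 2))
}
```

With the Boston data loaded, `high_cor_pairs(data)` flags the strongly correlated pairs (rad and tax, for example, are well known to be highly correlated in this dataset).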


## How to avoid collinearity issues?

Collinearity inflates the variance of the coefficient estimates and can lead to overfitting.

The first option is ridge regression, which shrinks the coefficients toward zero (but not exactly to zero) to prevent overfitting, while keeping all variables in the model.

The second option is lasso regression, which also shrinks the regression coefficients, but can shrink some of them exactly to zero. Thus, it also performs feature selection.

The third option is elastic net regression, a mix of ridge and lasso.

The elastic net penalty reduces to the ridge penalty when alpha equals 0 and to the lasso penalty when alpha equals 1.
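This mixing can be made explicit by writing out the penalized least-squares objective as glmnet parameterizes it (a sketch of the standard formulation, not taken from the original post):

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
+\alpha\,\lVert\beta\rVert_1\right]
```

Setting $\alpha=0$ leaves only the squared $\ell_2$ (ridge) penalty, and $\alpha=1$ leaves only the $\ell_1$ (lasso) penalty; intermediate values blend the two.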

Elastic net regression models are more flexible. When we fit an elastic net model, the best model may end up, say, 20% ridge and 80% lasso, or some other combination of the two.


## Regression analysis in R

```
library(caret)
library(glmnet)
library(mlbench)
library(psych)
```

### Getting Data

```
data("BostonHousing")
data <- BostonHousing
```

### Data Partition

```
set.seed(222)
ind <- sample(2, nrow(data), replace = T, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
```

Custom control parameters: 10-fold cross-validation repeated 5 times.

```
custom <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       verboseIter = T)
```

### Linear Model

```
set.seed(1234)
lm <- train(medv~., train, method='lm', trControl=custom)
lm
```
```
Linear Regression
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results:
RMSE     Rsquared  MAE
4.23222  0.778488  3.032342
Tuning parameter 'intercept' was held constant at a value
of TRUE
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min       1Q   Median       3Q      Max
-10.1018  -2.3528  -0.7279   1.7047  27.7868
```

You can see that the RMSE is 4.23 and the R-squared is 0.78. With 10-fold cross-validation, 9 folds are used to train the model and the remaining fold is used to estimate the error, and the whole procedure is repeated 5 times.
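To make the resampling scheme concrete, here is a small base-R sketch of how repeated 10-fold indexing works (this is only an illustration; caret's `trainControl` handles this internally, and the function name `repeated_folds` is my own):

```r
# Sketch of repeated k-fold CV indexing: each repeat shuffles the row
# indices and splits them into k folds, so every observation is used
# (k - 1) times for fitting and once for validation per repeat.
repeated_folds <- function(n, k = 10, repeats = 5, seed = 222) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    split(sample(n), rep(seq_len(k), length.out = n))
  })
}

folds <- repeated_folds(353)   # 353 training rows, as in the output above
length(folds)                  # 5 repeats
lengths(folds[[1]])            # 35 or 36 rows held out per fold
```

Each of the 5 repeats yields 10 disjoint validation folds covering all 353 rows, giving the 50 resamples reported later by `resamples()`.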


`summary(lm)`
```
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  25.742808   5.653389   4.554 7.37e-06 ***
crim         -0.165452   0.036018  -4.594 6.15e-06 ***
zn            0.047202   0.015401   3.065 0.002352 **
indus         0.013377   0.067401   0.198 0.842796
chas1         1.364633   0.947288   1.441 0.150630
nox         -13.065313   4.018576  -3.251 0.001264 **
rm            5.072891   0.468889  10.819  < 2e-16 ***
age          -0.028573   0.013946  -2.049 0.041247 *
dis          -1.421107   0.208908  -6.803 4.66e-11 ***
rad           0.260863   0.070092   3.722 0.000232 ***
tax          -0.013556   0.004055  -3.343 0.000922 ***
ptratio      -0.906744   0.139687  -6.491 3.03e-10 ***
b             0.008912   0.002986   2.985 0.003040 **
lstat        -0.335149   0.056920  -5.888 9.40e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.192 on 339 degrees of freedom
Multiple R-squared:  0.7874,     Adjusted R-squared:  0.7793
F-statistic: 96.59 on 13 and 339 DF,  p-value: < 2.2e-16
```

The variables without a significance star (here indus and chas1) are not statistically significant at the 0.05 level.

### Ridge Regression

```
set.seed(1234)
ridge <- train(medv~., train,
               method='glmnet',
               tuneGrid=expand.grid(alpha=0,
                                    lambda=seq(0.0001,1,length=5)),
               trControl=custom)
ridge
```
```
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results across tuning parameters:
lambda    RMSE      Rsquared   MAE
0.000100  4.242204  0.7782278  3.008339
0.250075  4.242204  0.7782278  3.008339
0.500050  4.242204  0.7782278  3.008339
0.750025  4.248536  0.7779462  3.012397
1.000000  4.265479  0.7770264  3.023091

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 0.50005.
```

Alpha is 0 because we are fitting ridge regression, and the selected lambda is 0.50005.

#### Plot Results

```
plot(ridge)
```

Increasing lambda beyond a point increases the error; an appropriate lambda here is about 0.5.

```
plot(ridge$finalModel, xvar = "lambda", label = T)
```

The x-axis shows log(lambda); when log(lambda) reaches about 9, all coefficients have been shrunk essentially to zero.

```
plot(ridge$finalModel, xvar = 'dev', label = T)
```

This plot shows the coefficients against the fraction of deviance explained: up to about 60% the model explains the data with small, stable coefficients, after which the coefficients inflate noticeably, which can signal overfitting.


```
plot(varImp(ridge, scale=T))
```

The most important variables appear at the top of the graph and the least important at the bottom.

### Lasso Regression

```
set.seed(1234)
lasso <- train(medv~., train,
               method='glmnet',
               tuneGrid=expand.grid(alpha=1,
                                    lambda=seq(0.0001,1,length=5)),
               trControl=custom)
lasso
```
```
glmnet
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results across tuning parameters:
lambda    RMSE      Rsquared   MAE
0.000100  4.230700  0.7785841  3.025998
0.250075  4.447615  0.7579974  3.135095
0.500050  4.611916  0.7438984  3.285522
0.750025  4.688806  0.7406668  3.362630
1.000000  4.786658  0.7366188  3.445216

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 1e-04.
```

In this case, the best lambda is essentially zero (1e-04).


#### Plot Results

```
plot(lasso)
```
```
plot(lasso$finalModel, xvar = 'lambda', label = T)
```

About 60% of the variability is explained by only 3 variables.

```
plot(varImp(lasso, scale=T))
```

Look at the three most important variables in the lasso regression.


### Elastic Net Regression

```
set.seed(1234)
en <- train(medv~., train,
            method='glmnet',
            tuneGrid=expand.grid(alpha=seq(0,1,length=10),
                                 lambda=seq(0.0001,1,length=5)),
            trControl=custom)
en
```
```
glmnet
353 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)
Summary of sample sizes: 316, 318, 318, 319, 317, 318, ...
Resampling results across tuning parameters:
alpha      lambda    RMSE      Rsquared   MAE
0.0000000  0.000100  4.242204  0.7782278  3.008339
0.0000000  0.250075  4.242204  0.7782278  3.008339
0.0000000  0.500050  4.242204  0.7782278  3.008339
0.0000000  0.750025  4.248536  0.7779462  3.012397
0.0000000  1.000000  4.265479  0.7770264  3.023091

0.1111111  0.000100  4.230292  0.7786226  3.025857
0.1111111  0.250075  4.239094  0.7778348  3.005382
0.1111111  0.500050  4.272822  0.7751270  3.024999
0.1111111  0.750025  4.314170  0.7719071  3.052562
0.1111111  1.000000  4.357845  0.7686150  3.085807
0.2222222  0.000100  4.230694  0.7785669  3.026161
0.2222222  0.250075  4.258991  0.7758849  3.015914
0.2222222  0.500050  4.330452  0.7695318  3.059968
0.2222222  0.750025  4.389640  0.7650387  3.106606
0.2222222  1.000000  4.443160  0.7613804  3.151750
0.3333333  0.000100  4.230795  0.7785677  3.026282
0.3333333  0.250075  4.285269  0.7732992  3.030452
0.3333333  0.500050  4.382444  0.7647643  3.096016
0.3333333  0.750025  4.457291  0.7590837  3.157815
0.3333333  1.000000  4.537080  0.7528068  3.229560
0.4444444  0.000100  4.230574  0.7785789  3.025987
0.4444444  0.250075  4.318752  0.7699550  3.049478
0.4444444  0.500050  4.426926  0.7608447  3.127902
0.4444444  0.750025  4.528733  0.7524128  3.216182
0.4444444  1.000000  4.610942  0.7461712  3.292246
0.5555556  0.000100  4.230656  0.7785681  3.026115
0.5555556  0.250075  4.353828  0.7665028  3.071586
0.5555556  0.500050  4.474680  0.7564421  3.164763
0.5555556  0.750025  4.591765  0.7464771  3.269433
0.5555556  1.000000  4.638309  0.7448745  3.323076
0.6666667  0.000100  4.230688  0.7785626  3.026161
0.6666667  0.250075  4.378865  0.7642222  3.087591
0.6666667  0.500050  4.522902  0.7518766  3.203910
0.6666667  0.750025  4.616421  0.7448532  3.295564
0.6666667  1.000000  4.668353  0.7434801  3.351792
0.7777778  0.000100  4.230768  0.7785606  3.026086
0.7777778  0.250075  4.400658  0.7622860  3.101157
0.7777778  0.500050  4.568780  0.7474490  3.243044
0.7777778  0.750025  4.636481  0.7438164  3.317472
0.7777778  1.000000  4.705950  0.7413472  3.383504
0.8888889  0.000100  4.230862  0.7785562  3.026279
0.8888889  0.250075  4.423849  0.7601929  3.117267
0.8888889  0.500050  4.599200  0.7446729  3.270369
0.8888889  0.750025  4.660298  0.7424824  3.338783
0.8888889  1.000000  4.746398  0.7389209  3.415104
1.0000000  0.000100  4.230700  0.7785841  3.025998
1.0000000  0.250075  4.447615  0.7579974  3.135095
1.0000000  0.500050  4.611916  0.7438984  3.285522
1.0000000  0.750025  4.688806  0.7406668  3.362630
1.0000000  1.000000  4.786658  0.7366188  3.445216
RMSE was used to select the optimal model using the
smallest value.
The final values used for the model were alpha = 0.1111111 and lambda = 1e-04.
```

Now you can see that the selected values are alpha = 0.1111 and lambda = 1e-04.


#### Plot Results

```
plot(en)
```
```
plot(en$finalModel, xvar = 'lambda', label = T)
```
```
plot(en$finalModel, xvar = 'dev', label = T)
```
```
plot(varImp(en))
```

### Compare Models

Now let's compare the models we created.


```
model_list <- list(LinearModel=lm, Ridge=ridge, Lasso=lasso, ElasticNet=en)
res <- resamples(model_list)
summary(res)
```
```
Call:
summary.resamples(object = res)
Models: LinearModel, Ridge, Lasso, ElasticNet
Number of resamples: 50
MAE
Min.  1st Qu.   Median     Mean  3rd Qu.
LinearModel 2.080208 2.767061 3.002455 3.032342 3.355281
Ridge       2.094151 2.736246 2.934350 3.008339 3.366834
Lasso       2.072408 2.764289 2.988132 3.025998 3.346437
ElasticNet  2.074008 2.762076 2.987955 3.025857 3.348605
Max. NA's
LinearModel 3.874270    0
Ridge       3.971337    0
Lasso       3.882800    0
ElasticNet  3.882943    0
RMSE
Min.  1st Qu.   Median     Mean  3rd Qu.
LinearModel 2.673817 3.495197 3.998562 4.232220 4.751509
Ridge       2.478993 3.477912 4.169422 4.242204 4.759265
Lasso       2.650331 3.490881 3.993362 4.230700 4.748958
ElasticNet  2.650603 3.489053 3.993227 4.230292 4.747517
Max. NA's
LinearModel 7.027551    0
Ridge       7.035089    0
Lasso       7.040494    0
ElasticNet  7.033125    0
Rsquared
Min.   1st Qu.    Median      Mean   3rd Qu.
LinearModel 0.4865769 0.7269864 0.7991104 0.7784880 0.8472274
Ridge       0.4796929 0.7339342 0.8018589 0.7782278 0.8459744
Lasso       0.4848588 0.7272700 0.8002386 0.7785841 0.8475939
ElasticNet  0.4855896 0.7271484 0.8002849 0.7786226 0.8476337
Max. NA's
LinearModel 0.9128278    0
Ridge       0.9141020    0
Lasso       0.9138499    0
ElasticNet  0.9134723    0

```

The elastic net regression model comes out as the best-fitting model based on RMSE.

### Best Model

```
en$bestTune
best <- en$finalModel
coef(best, s = en$bestTune$lambda)
```

The commands above return the coefficients of the best model.

### Prediction

```
p1 <- predict(en, train)
sqrt(mean((train$medv - p1)^2))  # 4.108671
p2 <- predict(en, test)
sqrt(mean((test$medv - p2)^2))   # 6.14675
```

## Conclusion

If we look at the RMSE, the lowest value comes from the elastic net model. Elastic net regression mitigates the multicollinearity issue and provides the best model here.