Best Regression Model in SAS Using PROC GLMSELECT
When building a regression model, selecting the best predictor variables is crucial for developing accurate and meaningful models.
In SAS, you can use PROC GLMSELECT to identify the most effective regression model from a list of potential predictors.
This article will guide you through the process of using PROC GLMSELECT for model selection, complete with a practical example.
Example: Using PROC GLMSELECT for Model Selection
Imagine you want to create a multiple linear regression model to predict students’ final exam scores based on three predictors: (1) the number of hours spent studying, (2) the number of prep exams taken, and (3) gender.
First, let’s start by creating a dataset that captures this information for 20 students.
Step 1: Create the Dataset
Using the following SAS code, we can create a dataset named exam_data:
/* Create dataset */
data exam_data;
input hours prep_exams gender $ score;
datalines;
1 1 0 76
2 3 1 78
2 3 0 85
4 5 0 88
2 2 0 72
1 2 1 69
5 1 1 94
4 1 0 94
2 0 1 88
4 3 0 92
4 4 1 90
3 3 1 75
6 2 1 96
5 4 0 90
3 4 0 82
4 4 1 85
6 5 1 99
2 1 0 83
1 0 1 62
2 1 0 76
;
run;
/* View dataset */
proc print data=exam_data;
run;
This dataset includes the number of hours studied (hours), the number of prep exams taken (prep_exams), gender (read as a character variable coded 0 or 1 and treated as categorical), and the final exam score (score).
Step 2: Perform Model Selection Using PROC GLMSELECT
Next, we will use PROC GLMSELECT to find the subset of predictor variables that yields the best regression model:
/* Perform model selection */
proc glmselect data=exam_data;
class gender;
model score = hours prep_exams gender;
run;
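In this code, we list gender in the class statement because it is categorical. Because no selection options are specified, PROC GLMSELECT uses its default method, stepwise selection: at each step it adds or removes the predictor that most improves the Schwarz Bayesian Criterion (SBC), and it stops when no change lowers the SBC further. It does not fit every possible combination of predictors.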
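The call above relies on the procedure's default settings. As a minimal sketch, the same model can be written with the selection settings spelled out. This assumes that stepwise selection with SBC as the selection criterion matches the defaults in your SAS release, which is worth confirming in the documentation for your version. The details=all option is an addition for illustration and only requests more output about each step.

```sas
/* Same model as above, with the selection settings written out (sketch) */
proc glmselect data=exam_data;
    class gender;
    /* selection=stepwise(select=sbc): at each step, add or drop the
       effect that most improves SBC (assumed to match the defaults).
       details=all: print extra information about every step. */
    model score = hours prep_exams gender
          / selection=stepwise(select=sbc) details=all;
run;
```

Writing the options explicitly makes it easier to swap in another method later, for example selection=forward or selection=backward, and to compare which predictors each one keeps.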
Understanding the Output
After executing PROC GLMSELECT, the output consists of several tables:
- Overview of the GLMSELECT Procedure: The first group of tables summarizes the procedure’s operations.
- Stepwise Selection Results: Here, you will find how the selection process proceeded and why it terminated. For instance, the SBC value starts at 93.4337 for a model with just the intercept. When adding hours, the SBC drops to 70.4452, indicating an improvement in model fit. However, including gender increases the SBC to 71.7383, which suggests that adding this variable does not enhance the model, so selection stops with hours as the only predictor.
- Final Model Summary: The summary table presents the fitted regression model’s details. Based on the output, you can state the fitted model as:
Exam Score = 67.161689 + 5.250257(hours studied)
This equation implies that for every additional hour studied, the exam score is expected to increase by approximately 5.25 points.
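To make the equation concrete, here is a small sketch that applies it to a range of study times. The coefficients are copied from the fitted model above. The dataset name predicted_scores and the variable name predicted_score are hypothetical names chosen for this example.

```sas
/* Apply the fitted equation to study times of 1 to 6 hours (sketch) */
data predicted_scores;   /* hypothetical dataset name */
    do hours = 1 to 6;
        /* Intercept and slope copied from the fitted model above */
        predicted_score = 67.161689 + 5.250257 * hours;
        output;
    end;
run;

/* List the predictions without the observation number column */
proc print data=predicted_scores noobs;
run;
```

For a student who studies 3 hours, for example, the equation gives 67.161689 + 5.250257 × 3, or roughly 82.9 points. Predictions are most trustworthy within the range of study times seen in the data, which here is 1 to 6 hours.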
Model Fit Metrics
Alongside the regression equation, you’ll also find various statistics indicating the model’s performance:
- R-Square Value: This metric shows that about 72.73% of the variation in exam scores can be explained by the number of hours studied, the only predictor in the final model.
- Root Mean Squared Error (Root MSE): This value estimates the typical distance between the observed values and the regression line. In this analysis, it is about 5.28 points.
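One way to cross-check these statistics is to refit the selected model directly. The following is a sketch using PROC REG rather than part of the original workflow. The dataset name fit_stats is a hypothetical choice, and the sbc and sse options on the model statement are used here on the assumption that they write those statistics to the OUTEST= dataset in your SAS release.

```sas
/* Refit the selected one-predictor model as a cross-check (sketch) */
proc reg data=exam_data outest=fit_stats;   /* fit_stats: hypothetical name */
    /* sbc and sse request that these statistics be saved to OUTEST= */
    model score = hours / sbc sse;
run;
quit;

/* Inspect the saved estimates and fit statistics */
proc print data=fit_stats;
run;
```

If the refit agrees with the PROC GLMSELECT output, the R-Square should be about 0.7273 and the Root MSE about 5.28. The SBC saved by PROC REG should be close to the 70.4452 reported earlier, assuming both procedures compute it with the same formula.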
Conclusion
Using PROC GLMSELECT in SAS allows you to efficiently select the best regression model from a set of potential predictors.
By understanding and interpreting the output, you can make informed decisions about which variables meaningfully improve the prediction of the outcome of interest.
This process enhances your analysis, empowering you to derive actionable insights from your data.