Best Regression Model in SAS Using PROC GLMSELECT

by finnstats

When you build a regression model, selecting the best predictor variables is crucial for developing accurate and meaningful models.

In SAS, you can use the GLMSELECT procedure (PROC GLMSELECT) to choose a regression model from a list of potential predictors.

This article will guide you through the process of using PROC GLMSELECT for model selection, complete with a practical example.

Example: Using PROC GLMSELECT for Model Selection

Imagine you want to create a multiple linear regression model to predict students’ final exam scores based on three predictors:

(1) the number of hours spent studying,

(2) the number of prep exams taken, and

(3) gender.

First, let’s start by creating a dataset that captures this information for 20 students.

Step 1: Create the Dataset

Using the following SAS code, we can create a dataset named exam_data:

/* Create dataset */
data exam_data;
    input hours prep_exams gender $ score;
    datalines;
1 1 0 76
2 3 1 78
2 3 0 85
4 5 0 88
2 2 0 72
1 2 1 69
5 1 1 94
4 1 0 94
2 0 1 88
4 3 0 92
4 4 1 90
3 3 1 75
6 2 1 96
5 4 0 90
3 4 0 82
4 4 1 85
6 5 1 99
2 1 0 83
1 0 1 62
2 1 0 76
;
run;

/* View dataset */
proc print data=exam_data;
run;

This dataset includes the number of hours studied (hours), the number of prep exams taken (prep_exams), gender (coded 0 or 1 and read as a character variable because of the $ in the input statement), and the final exam score (score).
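Before modeling, it can help to confirm that the variables were read as intended. The following is a minimal sketch, assuming the exam_data dataset created above; it only inspects the data and does not change it.

```sas
/* Check variable types: gender should be listed as Char, the others as Num */
proc contents data=exam_data;
run;

/* Count how many students fall into each gender level */
proc freq data=exam_data;
    tables gender;
run;

/* Summary statistics for the numeric variables */
proc means data=exam_data n mean std min max;
    var hours prep_exams score;
run;
```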

Step 2: Perform Model Selection Using PROC GLMSELECT

Next, we will use PROC GLMSELECT to find the subset of predictor variables that yields the best regression model:

/* Perform model selection */
proc glmselect data=exam_data;
    class gender;
    model score = hours prep_exams gender;
run;

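In this code, we list gender in the class statement because it is categorical. Since no selection method is specified, PROC GLMSELECT uses its default, stepwise selection: starting from the intercept-only model, it adds or removes one effect at a time and stops when no single change lowers the Schwarz Bayesian Criterion (SBC) any further. It does not fit every possible combination of predictors.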
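The call above relies on the procedure's default settings. The sketch below spells them out, on the assumption that the defaults are stepwise selection with SBC as both the selection and the stopping criterion; check the documentation for your SAS release before relying on this equivalence. The details=steps option is an addition that prints the candidates considered at each step.

```sas
/* Same model, with the selection settings written out explicitly */
proc glmselect data=exam_data;
    class gender;
    model score = hours prep_exams gender
          / selection=stepwise(select=sbc stop=sbc)  /* assumed to match the defaults */
            details=steps;                            /* show the candidates at each step */
run;
```

For reference, SBC in this setting is computed as n*log(SSE/n) + p*log(n), where n is the number of observations, SSE is the error sum of squares, and p is the number of parameters including the intercept. A lower value is better, and with n = 20 each extra parameter costs about 3 points of SBC.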

Understanding the Output

After executing PROC GLMSELECT, the output consists of several tables:

Overview of the GLMSELECT Procedure: The first group of tables summarizes the setup, such as the selection method and criteria, the number of observations, and the levels of the class variable.
Stepwise Selection Results: This table shows each step of the selection and why it stopped. A lower SBC indicates a better model. The SBC starts at 93.4337 for the model with just the intercept. Adding hours lowers the SBC to 70.4452, an improvement. Adding gender next would raise the SBC to 71.7383, so the selection stops with hours as the only predictor.
Final Model Summary: The summary table presents the fitted regression model’s details. Based on the output, you can state the fitted model as:

Exam Score = 67.161689 + 5.250257(hours studied)

This equation implies that for every additional hour studied, the exam score is expected to increase by approximately 5.25 points.
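To see the equation in use, the short data step below plugs in a value of hours. The value of 3 hours and the names predicted and predicted_score are hypothetical, chosen only for illustration; the coefficients are the ones reported above.

```sas
/* Plug a hypothetical value of hours into the fitted equation */
data predicted;
    hours = 3;                                        /* hypothetical student */
    predicted_score = 67.161689 + 5.250257 * hours;   /* intercept + slope * hours */
run;

proc print data=predicted;
run;
```

For a student who studies 3 hours, this gives a predicted score of about 82.91.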

Model Fit Metrics

Alongside the regression equation, you’ll also find various statistics indicating the model’s performance:

R-Square Value: About 72.73% of the variation in exam scores is explained by the number of hours studied, which is the only predictor in the final model.
Root Mean Squared Error (Root MSE): This value estimates the typical distance between the observed scores and the scores predicted by the regression line. In this analysis, that distance is about 5.28 points.
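One way to cross-check these numbers is to refit the selected model directly. The sketch below assumes the exam_data dataset from Step 1 and uses PROC REG, which is not part of the original example; its output should report the same coefficients, R-Square, and Root MSE as the final GLMSELECT model.

```sas
/* Refit the selected model (score on hours only) as a cross-check */
proc reg data=exam_data;
    model score = hours;
run;
quit;   /* PROC REG stays active until quit is issued */
```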

Conclusion

Using PROC GLMSELECT in SAS allows you to efficiently select the best regression model from a set of potential predictors.

By understanding and interpreting the output, you can make informed decisions about which variables are worth keeping when predicting the outcome of interest.

This process enhances your analysis, empowering you to derive actionable insights from your data.
