Regression Analysis in Statistics
Regression analysis is a technique for identifying patterns in data.
You might think there’s a link between how much you eat and how much you weigh, for example; regression analysis can help you test and quantify that link.
Regression analysis will provide you with an equation for a graph, which you can use to make predictions about your data.
For instance, if you’ve gained weight in recent years, it can forecast how much you’ll weigh in 10 years if you keep gaining weight at the same rate.
It will also provide you with a number of statistics (such as a p-value and a correlation coefficient) that will indicate the accuracy of your model.
Most introductory statistics courses cover basic techniques like scatter plots and linear regression. You may also encounter more complex techniques, such as multiple regression.
Introduction to Regression Analysis
In statistics, staring at a table of random numbers and trying to make sense of it is difficult.
For instance, global warming may be reducing average snowfall in your town, and you’ve been asked to forecast how much snow will fall this year.
Looking at the table below, you might guess between 10 and 20 inches. That’s a good guess, but regression could help you make a better one.
Regression is essentially the “best guess” at using a set of data to produce a forecast. It is the process of fitting a line (or curve) to a set of points on a graph.
You may fine-tune your best guess by looking at the regression line that runs through the data. You can see how far off the first estimate (20 inches or so) was.
For 2015, the line appears to fall somewhere between 5 and 10 inches! That may be “good enough,” but regression also provides an equation, which in this case is:
y = -2.2923x + 4624.4.
That is, you can plug in an x value (the year) and get a reasonably accurate estimate of snowfall for any year. Consider the year 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is pretty close to the actual figure of 30 inches for that year.
The equation can also be used to forecast future years. How much snow will fall, for example, in 2017?
y = -2.2923(2017) + 4624.4 = 0.8309 inches, or about 0.8 inches.
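As a quick check of the arithmetic, the fitted line can be evaluated in a few lines of Python. This is only a sketch: the slope and intercept come from the equation above, while the function name `predict_snowfall` is a hypothetical name chosen for illustration.

```python
SLOPE = -2.2923      # change in snowfall (inches) per year, from the fitted equation
INTERCEPT = 4624.4   # intercept from the fitted equation

def predict_snowfall(year):
    """Return the snowfall (in inches) predicted by the fitted line for a year."""
    return SLOPE * year + INTERCEPT

for year in (2005, 2015, 2017):
    # Round to two decimals for display only.
    print(year, round(predict_snowfall(year), 2))
```

Note the negative slope: each additional year lowers the predicted snowfall by about 2.3 inches, which is why the 2017 prediction is so much smaller than the 2005 one.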
You can also get an R squared value from regression, which in this case is 0.702. This figure indicates how good your model is.
The values range from 0 to 1, with 0 representing a bad model and 1 representing an ideal model.
At about 0.7, this is a fairly good model, so you can be reasonably confident in your snowfall forecast!
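To make the R squared idea concrete, here is a minimal sketch of a least-squares line fit and its R squared in plain Python. The data points are made up for illustration (they are not the snowfall table), and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y explained by the line: 1 - SSE / SST."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Illustrative data only (hypothetical values).
years = [1, 2, 3, 4, 5]
inches = [30, 27, 29, 22, 20]

slope, intercept = fit_line(years, inches)
print(slope, intercept, r_squared(years, inches, slope, intercept))
```

An R squared near 1 means the points sit close to the line; a value near 0 means the line explains little more than the average of y would.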
Multiple Regression Analysis
Multiple regression analysis is used to determine whether two sets of variables have a statistically significant relationship. It’s used to look for patterns in large amounts of data.
Simple linear regression and multiple regression analysis are nearly identical; the only difference is the number of predictors (“x” variables) used in the regression.
In simple regression analysis, each dependent “y” variable is paired with a single x variable. For instance: (x1, Y1).
In multiple regression, each dependent variable is paired with multiple “x” variables: ((x1)1, (x2)1, (x3)1, Y1).
In one-variable linear regression, one dependent variable (such as “sales”) is compared to an independent variable (such as “profit”).
However, you might be curious about the impact of different sorts of sales on the regression.
You could have X1 represent one type of sale, X2 represent another, and so on.
When Should Multiple Regression Analysis Be Used?
Ordinary linear regression is rarely sufficient to account for all of the real-world factors that influence a result. The graph below, for example, shows a single variable (number of doctors) against another (life expectancy of women).
It appears from this graph that there is a link between women’s life expectancy and the number of doctors in the population.
In fact, that is most likely accurate, and it suggests a simple solution: increase the number of doctors in the population to improve life expectancy.
However, other factors must be taken into account, such as the likelihood that doctors in rural areas have less education or experience.
Maybe they don’t have access to medical facilities such as trauma centers.
To account for those other factors, you’d have to add more independent variables to your regression analysis and build a multiple regression model.
Multiple Regression Analysis Output
Regression analysis is almost always performed in software, like Excel or SPSS. The output differs according to how many variables you have, but it’s essentially the same type of output you would find in a simple linear regression. There’s just more of it:
Simple regression: Y = b0 + b1 x. Multiple regression: Y = b0 + b1 x1 + b2 x2 + … + bn xn.
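For readers who want to see where the coefficients b0, b1, …, bn come from, the sketch below fits a multiple regression by solving the least-squares normal equations in plain Python. This is one simple way to do it, chosen for readability; statistical software typically uses more numerically robust methods. The data are hypothetical, and `solve_linear_system` and `fit_multiple_regression` are illustrative names.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bn] for Y = b0 + b1 x1 + ... + bn xn."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries b0
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from Y = 1 + 2*x1 + 3*x2 with no noise,
# so the fit should recover coefficients close to 1, 2 and 3.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, ys)
print([round(b, 6) for b in coefficients])
```

Each x variable gets its own coefficient, and there is a single intercept b0 shared by the whole equation.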
The output would include a summary, similar to the one for simple linear regression, containing:
R (multiple correlation coefficient), R squared (coefficient of determination), and adjusted R squared.
The standard error of the estimate.
These figures help you determine how well the regression model fits the data. The p-value and F-statistic can be found in the ANOVA table in the output.
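Two of the summary figures can be computed directly from quantities you already have, using their standard textbook formulas. The sketch below is illustrative only: the function names are hypothetical, and the sample size of 20 paired with the 0.702 R squared is an assumed value, since the snowfall example does not state how many years it covers.

```python
import math

def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """R squared penalized for the number of predictors in the model."""
    df_residual = n_observations - n_predictors - 1
    return 1 - (1 - r_squared) * (n_observations - 1) / df_residual

def standard_error_of_estimate(sse, n_observations, n_predictors):
    """Square root of the residual sum of squares over its degrees of freedom."""
    return math.sqrt(sse / (n_observations - n_predictors - 1))

# Assumed inputs for illustration: R squared 0.702, 20 observations, 1 predictor.
print(adjusted_r_squared(0.702, 20, 1))
# Assumed inputs for illustration: residual sum of squares 14.7, 5 observations.
print(standard_error_of_estimate(14.7, 5, 1))
```

Adjusted R squared is always a little lower than R squared, and the gap widens as you add predictors without adding observations.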
Minimum Sample Size
The required sample size depends in part on the researcher’s aims, the research questions being addressed, and the type of model being used.
Although various research articles and textbooks recommend minimum sample sizes for multiple regression, few agree on what size is large enough, and few discuss the prediction side of multiple linear regression (MLR).
Gregory Knofczynski’s paper is a useful read if you’re interested in obtaining accurate estimates of the squared multiple correlation coefficient, limiting its shrinkage, or achieving another specific aim.
However, many users only want to use MLR to get a basic sense of trends and don’t need precise estimates.
If this is the case, a rule of thumb can be applied. According to the literature, your sample should include more than 100 items.
While this is occasionally sufficient, you’ll be safer with at least 200 or, better yet, more than 400 observations.
Overfitting in Regression
Overfitting can result in a poor model of your data.
Overfitting occurs when your model is too complex for your data, which typically happens when your sample size is too small.
You will almost always get a model that seems significant if you include enough predictor variables in your regression model.
While an overfitted model may perfectly suit the peculiarities of your data, it will not fit subsequent test samples or the entire population.
The p-values, R-Squared, and regression coefficients of the model can all be deceiving. You’re basically asking too much of a limited quantity of data.
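The simplest possible illustration of asking too much of too little data is a straight line (two estimated terms) fitted to only two observations. The sketch below uses made-up numbers and hypothetical helper names; it scores the same fitted line on the two training points and on four held-out points.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """1 - SSE / SST, measured on whichever data set is passed in."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Hypothetical noisy data, split into a tiny training set and a holdout set.
train_x, train_y = [1, 2], [30, 27]
test_x, test_y = [3, 4, 5, 6], [29, 22, 20, 21]

slope, intercept = fit_line(train_x, train_y)
train_r2 = r_squared(train_x, train_y, slope, intercept)
test_r2 = r_squared(test_x, test_y, slope, intercept)
print(train_r2)  # a line always passes exactly through two points
print(test_r2)   # negative here: worse than guessing the holdout mean
```

The perfect fit on the training points says nothing about the model’s quality; it only reflects that there were as many estimated terms as observations.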
How to Avoid Overfitting
For each term you’re trying to estimate in linear modeling (including multiple regression), you need to have at least 10-15 observations. If you go lower than that, you risk overfitting your model.
Terms include:
Interaction effects, polynomial expressions (for modeling curved lines), and predictor variables.
While this rule of thumb is widely recognized, Green (1991) recommends a minimum sample size of 50 for any regression, plus an additional 8 observations per term. For example, with one interaction term and three predictor variables (four terms), the 10-15 rule calls for roughly 40-60 items in your sample to avoid overfitting; Green’s rule, applied to the three predictor variables, gives 50 + 3(8) = 74 items.
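Both rules of thumb reduce to simple arithmetic, sketched below. The function names are hypothetical, and the Green (1991) formula is coded exactly as it is described above (50 plus 8 per term), not taken from the original paper.

```python
def rule_of_thumb_range(n_terms, low=10, high=15):
    """Sample size range under the 10-15 observations-per-term rule."""
    return n_terms * low, n_terms * high

def green_minimum(n_terms):
    """Green (1991) as described in the text: 50 observations plus 8 per term."""
    return 50 + 8 * n_terms

# One interaction term plus three predictor variables is four terms.
print(rule_of_thumb_range(4))
# The three-term case worked in the text: 50 + 3(8).
print(green_minimum(3))
```

For small models Green’s rule is the more demanding of the two, because of its fixed starting point of 50 observations.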
There are exceptions to the “10-15” rule of thumb. They include:
When your data has multicollinearity or the effect size is small. If that’s the case, you’ll need more observations per term (although there’s no rule of thumb for how many more!).
If you’re using logistic regression or survival models, you might be able to get away with as few as 10 observations per predictor if you don’t have extreme event probabilities, small effect sizes, or predictor variables with truncated ranges.
How to Detect and Avoid Overfitting
Increasing your sample size by gathering additional data is the easiest approach to avoid overfitting.
If you can’t accomplish that, you can reduce the number of predictors in your model by combining or deleting them. One way of identifying related predictors that might be candidates for merging is factor analysis.
1. Cross-Validation
Use cross-validation to detect overfitting: it partitions your data to estimate how well your model generalizes, and it helps you choose the best model.
Predicted R-squared is one type of cross-validation. This statistic is calculated by:
Removing one observation from your data at a time,
Estimating the regression equation for each iteration, and
Using the regression equation to predict the removed observation.
Cross-validation isn’t a panacea for tiny data sets, and even with sufficient sample size, a clear model isn’t always established.
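The leave-one-out procedure behind predicted R-squared can be sketched in plain Python for a simple linear regression. The sketch uses the common definition of predicted R-squared as 1 minus PRESS divided by the total sum of squares, where PRESS is the sum of the squared leave-one-out prediction errors. The data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predicted_r_squared(xs, ys):
    """1 - PRESS / SST, using leave-one-out prediction errors."""
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    press = 0.0
    for i in range(len(xs)):
        # Remove observation i, refit, then predict the removed point.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1 - press / sst

# Illustrative data only (hypothetical values).
xs = [1, 2, 3, 4, 5]
ys = [30, 27, 29, 22, 20]
print(predicted_r_squared(xs, ys))
```

A predicted R-squared that is much lower than the ordinary R squared is a warning sign that the model is fitting noise in the sample.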
2. Shrinkage & Resampling
Techniques like shrinkage and resampling can help you determine how well your model will fit a new sample.
3. Automated Methods
Automated stepwise regression should not be used as a remedy for overfitting in small data sets. According to Babyak (2004),
“The issues with automated selection handled in this very usual method are so numerous that cataloging them all [in a scientific article] would be difficult.”
Babyak also advises against using univariate pretesting or screening (which he describes as “a disguised kind of automated selection”), dichotomizing continuous variables, which can drastically increase Type I errors, and multiple testing of confounding variables (although this may be ok if used judiciously).