Multivariate Logistic Regression in R
That’s an excellent segue into what to do when there are multiple variables.
The two models we’ve looked at thus far only do single-variable logistic regression.
However, if we have several variables, we want to account for all of them at once.
In that situation, we fit a multivariate logistic regression model.
So the probability transformation is the same as before, only now the linear part has an intercept and a coefficient for each variable.
If you reverse that transformation, you have a representation for the probability that is guaranteed to be between zero and one.
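In symbols (the notation here is mine, not taken from the lecture): with predictors X1 through Xp, the model is

```latex
% Multivariate logistic regression: the log-odds are linear in the predictors.
\log\frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
% Reversing the transformation gives a probability guaranteed to lie in (0, 1):
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
            {1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
```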
So, just like before, we can use glm in R to fit that model, this time including the variables balance, income, and student.
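A minimal sketch of that fit, assuming the Default data set from the ISLR package (the lecture doesn’t name its data source, so that package is an assumption):

```r
library(ISLR)  # assumed source of the Default data: default, balance, income, student

# Fit the multivariate logistic regression with all three predictors
fit <- glm(default ~ balance + income + student,
           data = Default, family = binomial)
summary(fit)  # coefficients, standard errors, z statistics, p values
```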
The summary now shows three coefficients, three standard errors, three z statistics, and three p values, one set for each variable.
The first thing we notice is that balance and student are still significant, as they were in the single variable scenario.
Income, on the other hand, is not significant. So it appears that two of the three variables matter. But one thing stands out: the coefficient for student is now negative, whereas it was previously positive.
When we measured student on its own, it had a positive coefficient; in the multivariate model, the coefficient is negative. How can that happen?
Remember from our discussion of multiple linear regression how difficult it can be to interpret coefficients, because correlations among the variables can affect their signs?
The same thing is happening here, so let’s examine the role those correlations play.
So, here’s a plot.
On the horizontal axis we have the credit card balance, and on the vertical axis the default rate. Student status is coded by color: brown for students, blue for non-students.
Students tend to carry higher balances than non-students, and since we just saw that balance matters, they have a higher marginal default rate than non-students.
However, as the graph on the left shows, at each level of balance students actually default less than non-students.
So if you look at student on its own, it is confounded with balance: the strong effect of balance makes it look as though students are more likely to default.
This plot explains the apparent contradiction: when we compare default rates for students and non-students at each level of credit card balance, students have the lower rate.
Multiple logistic regression lets us tease this out by accounting for these relationships.
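As a hypothetical check of this confounding story (again assuming the ISLR Default data), one can compare the marginal and adjusted coefficients for student directly:

```r
# Student on its own: the marginal coefficient is positive
coef(glm(default ~ student, data = Default, family = binomial))["studentYes"]

# Student with balance and income in the model: the coefficient flips negative
coef(glm(default ~ balance + income + student,
         data = Default, family = binomial))["studentYes"]
```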
Now let’s look at a different scenario with more variables. This example was mentioned in the introduction.
This is a data set about cardiac disease in South Africa. Keep in mind that South Africans consume a lot of meat.
It was a retrospective study. The researchers found 160 white men who had suffered a myocardial infarction, the medical term for a heart attack.
They also took a sample of 302 controls from the many people who hadn’t experienced a heart attack.
This is called a case-control sample. All of these men were between the ages of 15 and 64.
They were from the Western Cape region of South Africa, and the study was done in the early 1980s.
The overall prevalence of heart disease in this region was quite high at the time, 5.1 percent, so this was a very high-risk group.
So we have measurements on seven predictors, or risk factors, in the study.
They are shown in the scatter plot matrix, which we’ll look at now.
Remember that the scatter plot matrix is a handy way to plot each pair of variables against each other.
Since this is a classification problem, we can also code the heart disease status into the plot.
The brown (red) points indicate those who had heart disease, and the blue points are the controls.
If you look at the top plot, people who smoked a lot and had high systolic blood pressure tend to show up as brown points; those are the people who had heart attacks.
So each panel is a pairwise plot of two risk factors, with the points color-coded by heart disease status.
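Here’s one way such a plot could be produced in R; a sketch assuming a data frame heart whose chd column is the 0/1 response:

```r
# Scatter plot matrix of the risk factors, color-coded by heart disease status
pairs(heart[, names(heart) != "chd"],
      col = ifelse(heart$chd == 1, "brown", "blue"),
      pch = 20)
```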
There’s one funny variable in here: family history. It is, after all, a categorical variable.
It turns out to be an important risk factor: your risk of heart disease is increased if you have a family history of it.
You can see in the plot that it’s coded as a zero-one variable, and the right-hand category clearly has a higher proportion of brown points than the left-hand one.
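A hypothetical way to see the same thing numerically is to cross-tabulate family history against heart disease status (same assumed heart data frame):

```r
# Counts of controls (0) and cases (1) within each family-history category
table(heart$famhist, heart$chd)

# Proportion of cases within each family-history category
prop.table(table(heart$famhist, heart$chd), margin = 1)
```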
Note that in a case-control study we’re not trying to estimate the overall probability of developing heart disease; cases are deliberately over-represented in the sample, so absolute probabilities would be distorted.
What we’re really trying to figure out is how the risk factors affect the probability of heart disease, and those effects can still be estimated from this design.
In fact, this was part of an intervention study aimed at educating the public about the importance of eating a healthy diet.
So, here’s the glm result for the heart disease data, along with some of the code used to produce it.
We’ll dive into the coding in more detail later, but it’s worth seeing how simple it is to do.
Here’s the call to glm. We tell it that the response is chd, the name of the response variable, and the tilde means “is modeled as.”
The dot refers to all the other variables in the data frame, which here is heart, a data frame containing all of the study’s variables.
We also give it family = binomial, which tells glm to fit the logistic regression model.
We fit that model, save it in the heartfit object, and then call summary on heartfit.
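Put together, the calls just described look like this (assuming the heart data frame is already loaded; the loading step itself is not shown in the lecture):

```r
# Fit logistic regression of chd on all other variables in the heart data frame
heartfit <- glm(chd ~ ., data = heart, family = binomial)
summary(heartfit)  # coefficients, standard errors, z values, p values
```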
And we get this summary printed out, which is the same information we’ve seen before.
So we now have a coefficient for each variable.
We get standard errors, z values, and p values in the remaining columns. And here’s where the story gets a little murkier.
The intercept isn’t something we’re particularly interested in.
Tobacco use is significant. LDL, low-density lipoprotein, a measure of cholesterol, is significant.
Keep in mind that there are two kinds of cholesterol: HDL, the healthy high-density kind, and LDL, the harmful kind.
This variable measures the harmful, low-density kind. Family history is very important. And there’s age: we all know that the risk of heart disease increases as we get older.
Now, obesity and alcohol consumption are not significant, which is a little surprising.
However, this is another case of correlated variables. You can see in the scatter plot matrix that many of the variables are correlated with one another.
For example, age and tobacco use are correlated.
Alcohol consumption and LDL cholesterol appear to be negatively correlated.
So there are many interconnections.
And those correlations matter. For example, LDL is a strong factor in the model.
Once LDL is included, it’s possible that alcohol is no longer needed;
its contribution is already taken care of. Correlated variables can serve as stand-ins for one another.
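As a hypothetical follow-up, one can test whether alcohol still earns its place once the other variables are in, by dropping it and comparing the two fits with a likelihood ratio test:

```r
# Refit without alcohol, then compare the nested models
heartfit2 <- update(heartfit, . ~ . - alcohol)
anova(heartfit2, heartfit, test = "Chisq")  # chi-squared (likelihood ratio) test
```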