Simple Linear Regression in R
In simple linear regression we build a model to investigate and forecast the relationship between two variables, and the most basic relationship we can think of is a straight line.
Let’s take a look at the first linear relationship that we are going to create.
Let’s load the Boston Housing data from the mlbench package.
library(mlbench)
data("BostonHousing2")
head(BostonHousing2)
dim(BostonHousing2)
The data set contains 506 rows and 19 columns.
Now we can check the association between the average number of rooms in a house and the median house price from this data set.
Scatter Plot
Now we can use ggplot2 to draw a scatterplot.
library(ggplot2)
ggplot(BostonHousing2, mapping = aes(x = rm, y = medv)) +
  geom_point() +
  xlab("Average number of Rooms") +
  ylab("Median House Price")
The median house price and the average number of rooms show a clear positive, roughly linear relationship.
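One way to put a number on what the scatterplot shows is the Pearson correlation, together with a quick straight-line fit. The following is only a sketch, assuming the mlbench package is installed; the object names `r` and `boston_fit` are illustrative and do not appear in the original post.

```r
library(mlbench)
data("BostonHousing2")

# Pearson correlation between average rooms (rm) and median price (medv)
r <- cor(BostonHousing2$rm, BostonHousing2$medv)
r

# A straight-line fit of median price on average number of rooms
boston_fit <- lm(medv ~ rm, data = BostonHousing2)
coef(boston_fit)
```

A correlation close to 1 would indicate a tight positive linear relationship, while the slope estimates how much the median price changes for one additional room on average.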
Ok, let’s see another example: the relationship between the price of a diamond and its weight in carats, shown with a hexbin plot.
Let’s look at the dataset first. The diamonds data set ships with ggplot2.
head(diamonds)
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
library(hexbin)
ggplot(diamonds, mapping = aes(x = carat, y = price)) +
  geom_hex(bins = 50)
The relationship in this plot does not look linear, but we can straighten it out with a small change: taking the log of both variables.
ggplot(diamonds, mapping = aes(x = log10(carat), y = log10(price))) + geom_hex(bins=50)
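We’ll use the log of carat to forecast the log of a diamond’s price. The model below uses the natural log rather than the log10 used in the plot; changing the base rescales the intercept but leaves the slope and the quality of the fit unchanged.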
lm <- lm(log(price) ~ log(carat), data = diamonds)
summary(lm)
Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max
-1.50833 -0.16951 -0.00591  0.16637  1.33793

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
log(carat)  1.675817   0.001934   866.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared:  0.933, Adjusted R-squared:  0.933
F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
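Because both sides of the model are logged, the slope can be read as an elasticity: a 1% increase in carat goes with roughly a 1.68% increase in price. The short sketch below works through that arithmetic using the slope reported in the summary; the helper `price_ratio` is a hypothetical name introduced only for illustration.

```r
# Slope of log(carat) as reported in the summary output
slope <- 1.675817

# Multiplicative change in predicted price when carat is multiplied
# by carat_ratio: log(p2) - log(p1) = slope * (log(c2) - log(c1))
price_ratio <- function(carat_ratio, b = slope) {
  carat_ratio^b
}

price_ratio(1.10)  # effect of a 10% heavier diamond
price_ratio(2)     # effect of doubling the weight
```

Under this model a diamond that is 10% heavier is predicted to cost about 17% more, and doubling the weight slightly more than triples the predicted price.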
Assumptions
For the linear model we have the following assumptions:
- Linearity (the relationship between log price and log carat is a straight line)
- Homoscedasticity (the noise terms have the same variance)
- Normality (the noise terms are normally distributed)
- Independence (the noise terms are independent)
1. Linearity
Plot the residuals against the fitted values.
plot(lm, which = 1)
The red line is a smoother that helps reveal any pattern in the residuals. It is essentially straight in this example, which indicates no trend in the residuals, so the linearity assumption appears to be satisfied.
2. Homoscedasticity
Let’s look at the spread of the residuals with the scale-location plot.
plot(lm, which = 3)
In this scenario, we’d like to see an even spread of points as we move from left to right. There are no obvious tendencies here.
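As a rough numeric companion to the plot, the residual spread can be compared across groups of fitted values. This is only a sketch, assuming ggplot2 is installed; the model is refitted under the illustrative name `fit` (rather than `lm`) so the block runs on its own, and the choice of five groups is arbitrary.

```r
# Refit the model from the post under a different name
diamonds <- ggplot2::diamonds
fit <- lm(log(price) ~ log(carat), data = diamonds)

# Cut the fitted values at their quintiles; unique() guards against
# tied break points, so there may be fewer than five groups
breaks <- unique(quantile(fitted(fit), probs = seq(0, 1, by = 0.2)))
groups <- cut(fitted(fit), breaks = breaks, include.lowest = TRUE)

# Standard deviation of the residuals within each group
spread_by_group <- tapply(resid(fit), groups, sd)
round(spread_by_group, 3)
```

Similar standard deviations across the groups are consistent with homoscedasticity; a steady increase or decrease from left to right would be a warning sign.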
3. Normality
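Here we use which = 2, which draws a normal Q-Q plot of the standardized residuals. If the noise terms are normally distributed, the points lie close to the reference line.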
plot(lm, which = 2)
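To see what goes into this plot, a similar Q-Q plot can be built by hand from the standardized residuals. The sketch below assumes ggplot2 is installed and refits the model as `fit`, an illustrative name; the line style and labels are choices made here, not taken from the post.

```r
# Refit the model from the post under a different name
diamonds <- ggplot2::diamonds
fit <- lm(log(price) ~ log(carat), data = diamonds)

# Standardized residuals: residuals scaled to have roughly unit variance
std_res <- rstandard(fit)

# Sample quantiles against theoretical normal quantiles
qq <- qqnorm(std_res, main = "Normal Q-Q", ylab = "Standardized residuals")
qqline(std_res, lty = 2)  # reference line through the quartiles
```

Points that bend away from the reference line in the tails suggest heavier or lighter tails than a normal distribution would produce.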
4. Independence
Independence cannot be read off a diagnostic plot. Checking it requires some understanding of the data’s origins, meaning, and collection methods, so there are no shortcuts.
Conclusion
Based on these checks we treat the assumptions as met, and we can now use the formula below to forecast log(price), where log is the natural logarithm.
log(price) = 8.448661 + 1.675817 × log(carat)
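The formula can be turned into a small forecasting helper. The sketch below hard-codes the two reported coefficients and back-transforms with `exp()` to get a price on the original scale; the function names `predict_log_price` and `predict_price` and the example carat values are assumptions made for illustration.

```r
# Coefficients reported in the model summary
intercept <- 8.448661
slope <- 1.675817

# Forecast on the log scale, exactly as in the formula
predict_log_price <- function(carat) {
  intercept + slope * log(carat)
}

# Back-transform to the price scale
predict_price <- function(carat) {
  exp(predict_log_price(carat))
}

carats <- c(0.5, 1, 2)
data.frame(
  carat = carats,
  log_price = predict_log_price(carats),
  price = predict_price(carats)
)
```

Note that exponentiating a log-scale forecast is better read as a typical (median) price than as a mean price, since the noise is assumed normal on the log scale. With the fitted model object available, `predict()` with a `newdata` data frame should give the same log-scale forecasts.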