Simple Linear Regression in R

by finnstats

In simple linear regression, we build a model to investigate and forecast the relationship between two variables, and the most basic relationship that we can think of is a straight line.

Let’s take a look at the first linear relationship that we are going to create.

Simple Linear Regression in R

Let’s load the Boston housing data from the mlbench package.

library(mlbench)
data("BostonHousing2")
head(BostonHousing2)
dim(BostonHousing2)

The data set contains 506 rows and 19 columns.

Now we can check the association between the average number of rooms per house (rm) and the median house price (medv) in this data set.

Scatter Plot

We can use ggplot2 to draw a scatterplot.

library(ggplot2)
ggplot(BostonHousing2, mapping = aes(x = rm, y = medv)) +
  geom_point() +
  xlab("Average number of Rooms") +
  ylab("Median House Price")

The median house price and the average number of rooms show a clear positive, roughly linear relationship.
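To put a number on what the scatterplot suggests, here is a minimal sketch that computes the Pearson correlation and fits a straight line to the same two columns. This step is not part of the original walkthrough, and the object names r_rooms_price and boston_fit are only illustrative choices.

```r
library(mlbench)
data("BostonHousing2")

# Pearson correlation between average rooms (rm) and median value (medv)
r_rooms_price <- cor(BostonHousing2$rm, BostonHousing2$medv)
r_rooms_price

# Straight-line fit: medv = intercept + slope * rm
boston_fit <- lm(medv ~ rm, data = BostonHousing2)
coef(boston_fit)
```

A positive correlation and a positive slope for rm would both agree with the upward trend in the plot.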

Next, let’s look at another example: the relationship between the price of a diamond and its weight in carats, shown with a hexbin plot.

Let’s look at the data set first. The diamonds data ships with ggplot2, which is already loaded.

head(diamonds)

  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

# geom_hex() needs the hexbin package to be installed
library(hexbin)
ggplot(diamonds, mapping = aes(x = carat, y = price)) +
  geom_hex(bins = 50)

Looking at the plot, the relationship doesn’t appear to be linear, but a small change makes it so: put both variables on a log scale.

ggplot(diamonds, mapping = aes(x = log10(carat), y = log10(price))) +
  geom_hex(bins = 50)

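We’ll now use the log of the carat to forecast the log of a diamond’s price. Note that the plot used log10 while the model below uses the natural log; the choice of base changes the intercept but not the slope.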

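# The fitted model is stored under the name lm, the same name as the function.
# R still finds the function when lm(...) is called, but a distinct name is safer.
lm <- lm(log(price) ~ log(carat), data = diamonds)
summary(lm)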

Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50833 -0.16951 -0.00591  0.16637  1.33793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
log(carat)  1.675817   0.001934   866.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared:  0.933,	Adjusted R-squared:  0.933 
F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
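The numbers in the summary can also be pulled out programmatically. The sketch below refits the same model under the illustrative name fit_loglog (an assumption of this example, chosen to avoid reusing the name lm) and extracts the coefficients, their confidence intervals, and R-squared.

```r
library(ggplot2)  # provides the diamonds data set

# Same model as above, stored under an illustrative name
fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)

# Point estimates and 95% confidence intervals for intercept and slope
coef(fit_loglog)
confint(fit_loglog, level = 0.95)

# Share of the variance in log(price) explained by log(carat)
r_squared <- summary(fit_loglog)$r.squared
r_squared

# In a log-log model the slope acts as an elasticity:
# the approximate % change in price for a 1% change in carat
slope <- coef(fit_loglog)[["log(carat)"]]
slope
```

With a slope of about 1.68, a 1% increase in carat goes with roughly a 1.68% increase in price.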

Assumptions

For the linear model we have the following assumptions:

Linearity (a straight-line relationship between log price and log carat)
Homoscedasticity (the noise terms have the same variance)
Normality (the noise terms are normally distributed)
Independence (the noise terms are independent of each other)

1. Linearity

Plot the residuals against the fitted values.

plot(lm, which = 1)

The red line is a smoother that helps reveal any pattern in the residuals. It is essentially straight in this example, which indicates no trend in the residuals, so the linearity assumption is satisfied.
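For contrast, it can help to see what a violated linearity assumption looks like. The sketch below uses simulated, purely hypothetical data (not the diamonds data) in which y follows a curve, then compares a straight-line fit with a fit that includes the squared term. All object names are illustrative.

```r
set.seed(1)

# Hypothetical data: y depends on x through a curve, not a line
x <- runif(300, min = 0, max = 10)
y <- 2 + 0.5 * x^2 + rnorm(300, sd = 2)

line_fit  <- lm(y ~ x)           # misspecified straight-line model
curve_fit <- lm(y ~ x + I(x^2))  # model that includes the curvature

# Residuals vs fitted: the smoother bends for line_fit, stays flat for curve_fit
par(mfrow = c(1, 2))
plot(line_fit, which = 1)
plot(curve_fit, which = 1)
par(mfrow = c(1, 1))

# Numeric check: how much of each model's residual variation is still
# explained by a quadratic in x (close to 0 means no leftover pattern)
leftover_line  <- summary(lm(resid(line_fit) ~ poly(x, 2)))$r.squared
leftover_curve <- summary(lm(resid(curve_fit) ~ poly(x, 2)))$r.squared
c(line = leftover_line, curve = leftover_curve)
```

A bent smoother in the left panel is the kind of pattern that signals a missing term or the need for a transformation.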

2. Homoscedasticity

Let’s look at the spread of the residuals with the scale-location plot.

plot(lm, which = 3)

Here we’d like to see an even spread of points as we move from left to right. There are no obvious trends here, so the constant-variance assumption looks reasonable.
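A visual check can be complemented by a formal test. The sketch below hand-computes a studentized Breusch-Pagan statistic on simulated, hypothetical data whose noise grows with x. It is an illustration of the idea under those assumptions rather than a step from the original analysis, and all names are illustrative.

```r
set.seed(42)

# Hypothetical data whose noise grows with x (heteroscedastic by construction)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 3 + 2 * x + rnorm(n, sd = 0.5 * x)

het_fit <- lm(y ~ x)

# Auxiliary regression: do the squared residuals depend on the predictor?
aux <- lm(resid(het_fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors = 1)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_pval)
```

A small p-value points to non-constant variance. With a data set as large as diamonds, such a test can flag very small departures, so the plot remains the more practical guide.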

3. Normality

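The normal Q-Q plot (which = 2) compares the standardized residuals with the quantiles of a normal distribution. Points that lie close to the reference line are consistent with normally distributed noise terms.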

plot(lm, which = 2)
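The same check can be reproduced by hand, which makes it easier to see what the plot contains. The sketch below refits the model under the illustrative name fit_loglog, draws the Q-Q plot with base R functions, and runs a Shapiro-Wilk test on a random subsample, since shapiro.test() accepts at most 5000 values. The subsample size and seed are arbitrary choices for this example.

```r
library(ggplot2)  # provides the diamonds data set

fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)

# Standardized residuals, the quantity on the Q-Q plot's vertical axis
std_res <- rstandard(fit_loglog)

# Manual Q-Q plot with a reference line through the quartiles
qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res, col = "red")

# shapiro.test() accepts at most 5000 values, so test a random subsample
set.seed(123)
sw <- shapiro.test(sample(std_res, 5000))
sw
```

With thousands of points, the test tends to reject even for mild departures from normality, so it is best read alongside the plot rather than in place of it.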

4. Independence

This assumption cannot be checked with a single plot. It requires some understanding of the data’s origins, meaning, and collection methods, so there are no shortcuts.
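When the rows do have a meaningful order, such as time, one common numeric aid is the Durbin-Watson statistic. The sketch below computes it by hand on simulated, hypothetical series. It only illustrates the idea and does not replace knowledge of how the data were collected. The helper durbin_watson() is an illustrative function, not part of base R.

```r
set.seed(7)
n <- 1000
t_index <- seq_len(n)

# Hypothetical series 1: independent errors
e_indep <- rnorm(n)
# Hypothetical series 2: AR(1) errors, each depending on the previous one
e_ar1 <- as.numeric(stats::filter(rnorm(n), filter = 0.7, method = "recursive"))

y_indep <- 1 + 0.05 * t_index + e_indep
y_ar1   <- 1 + 0.05 * t_index + e_ar1

# Durbin-Watson statistic: near 2 for uncorrelated residuals,
# well below 2 when consecutive residuals are positively correlated
durbin_watson <- function(fit) {
  e <- resid(fit)
  sum(diff(e)^2) / sum(e^2)
}

dw_indep <- durbin_watson(lm(y_indep ~ t_index))
dw_ar1   <- durbin_watson(lm(y_ar1 ~ t_index))
c(independent = dw_indep, ar1 = dw_ar1)
```

The statistic depends on the order of the rows, so it is only informative when that order reflects how the observations were generated.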

Conclusion

Treating the assumptions as met, we can now use the formula below to forecast log(price).

log(price) = 8.448661 + 1.675817 * log(carat)
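To get a forecast in price units rather than log units, the fitted equation can be exponentiated. The sketch below hard-codes the two coefficients from the model summary. The helper predict_price() is an illustrative name, not part of any package.

```r
# Coefficients copied from the model summary above
intercept <- 8.448661
slope     <- 1.675817

# Illustrative helper: predicted price for a given carat weight
predict_price <- function(carat) {
  exp(intercept + slope * log(carat))
}

# Predicted prices for a few example weights
predict_price(c(0.5, 1, 2))

# Doubling the carat multiplies the predicted price by 2^slope
price_ratio <- predict_price(2) / predict_price(1)
price_ratio
2^slope
```

Exponentiating a prediction made on the log scale is usually read as an estimate of the typical (median) price rather than the mean price.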
