Simple Linear Regression in R

With simple linear regression in R, we build models to investigate and forecast the relationship between two variables. The most basic relationship we can model is a straight line.

Let’s take a look at the first linear relationship that we are going to create.

Let’s load the Boston Housing data from the mlbench package.

library(mlbench)
data(BostonHousing2)

The data set contains 506 rows and 19 columns.

Now we can check the association between the average number of rooms in a house and the median house price from this data set.

Scatter Plot

Now we can use ggplot2 to make a scatterplot.

library(ggplot2)

ggplot(BostonHousing2, mapping = aes(y = medv, x = rm)) +
  geom_point() +
  xlab("Average number of Rooms") +
  ylab("Median House Price")

The median house price and the average number of rooms appear to have a positive, roughly linear relationship.
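To put a number on that visual impression, one option is to compute the correlation and fit the straight line directly. The sketch below assumes the mlbench package is installed; the object names rooms_price_cor and boston_fit are introduced here for illustration only.

```r
library(mlbench)
data(BostonHousing2)

# Pearson correlation between average rooms (rm) and median price (medv)
rooms_price_cor <- cor(BostonHousing2$rm, BostonHousing2$medv)
print(rooms_price_cor)

# Simple linear regression: medv modelled as a linear function of rm
# (boston_fit is an illustrative name, not part of the original walkthrough)
boston_fit <- lm(medv ~ rm, data = BostonHousing2)
print(coef(boston_fit))
```

A correlation well above zero and a positive slope for rm would agree with the upward trend in the scatterplot.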

Next, let’s look at another example: the relationship between the price of a diamond and its weight in carats, using a hexbin plot.

Let’s look at the first few rows of the diamonds dataset (from ggplot2):

  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

# geom_hex() needs the hexbin package to be installed
ggplot(diamonds, mapping = aes(x = carat, y = price)) +
  geom_hex(bins = 50)

Looking at the plot, the relationship doesn’t appear to be linear, but we can make it closer to linear with a small change: taking logs of both variables.

ggplot(diamonds, mapping = aes(x = log10(carat), y = log10(price))) +
  geom_hex(bins = 50)

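We’ll use the log of carat to forecast the log of a diamond’s price. Note that the model uses the natural log (log) rather than log10; the choice of base changes the intercept but not the slope.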

lm <- lm(log(price) ~ log(carat), data = diamonds)
summary(lm)

Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50833 -0.16951 -0.00591  0.16637  1.33793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.448661   0.001365  6190.9   <2e-16 ***
log(carat)  1.675817   0.001934   866.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2627 on 53938 degrees of freedom
Multiple R-squared:  0.933,	Adjusted R-squared:  0.933 
F-statistic: 7.51e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
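Because both variables are logged, the slope can be read as an elasticity: multiplying carat by some factor multiplies the predicted price by that factor raised to the power of the slope. Here is a small sketch of that arithmetic, using the slope reported in the output above (the variable names are illustrative):

```r
# Slope of log(carat), copied from the regression output
slope <- 1.675817

# Multiplier on predicted price when carat doubles: 2^slope
double_carat_multiplier <- 2^slope
print(double_carat_multiplier)

# Multiplier on predicted price for a 1% increase in carat: 1.01^slope
one_percent_multiplier <- 1.01^slope
print(one_percent_multiplier)
```

Under this model, doubling the carat weight roughly triples the predicted price (a multiplier of about 3.2), and a 1% increase in carat goes with roughly a 1.7% increase in predicted price.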


For the linear model we have the following assumptions:

  1. Linearity (a straight-line relationship between log price and log carat)
  2. Homoscedasticity (the noise terms have the same variance)
  3. Normality (the noise terms are normally distributed)
  4. Independence (the noise terms are independent)

1. Linearity

Plot the residuals against the fitted values.

plot(lm, which = 1)

The red line helps reveal any patterns in the residuals. Here it is essentially flat, which indicates no trend in the residuals, so the linearity assumption appears to be satisfied.

2. Homoscedasticity

Let’s look at the spread of the residuals using the scale-location plot.

plot(lm, which = 3)

In this plot, we’d like to see an even spread of points as we move from left to right. There are no obvious trends here.
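As a rough numeric complement to the plot, one could compare the spread of the residuals across different ranges of fitted values. The sketch below assumes ggplot2 is installed (for the diamonds data) and refits the same model under the illustrative name fit; fitted_bins and resid_sd_by_bin are also names introduced here. It is an informal check, not a formal test.

```r
library(ggplot2)  # provides the diamonds dataset

# Same model as in the walkthrough, stored under an illustrative name
fit <- lm(log(price) ~ log(carat), data = diamonds)

# Split the fitted values into four groups at their quartiles
breaks <- quantile(fitted(fit), probs = seq(0, 1, by = 0.25))
fitted_bins <- cut(fitted(fit), breaks = breaks, include.lowest = TRUE)

# Standard deviation of the residuals within each group
resid_sd_by_bin <- tapply(resid(fit), fitted_bins, sd)
print(resid_sd_by_bin)
```

Broadly similar standard deviations across the four groups would be consistent with the constant-variance assumption.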

3. Normality

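Here we use which = 2, which draws a normal Q-Q plot of the residuals. If the residuals are approximately normal, the points fall close to the reference line.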

plot(lm, which = 2)

4. Independence

The independence assumption cannot be checked with a plot alone. It requires some understanding of the data’s origins, meaning, and collection methods, so there are no shortcuts.


The diagnostic plots show no clear violations of the assumptions, so we can now use the formula below to forecast log(price).

log(price) = 8.448661 + 1.675817 × log(carat)
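To turn this formula into a price forecast, the log-scale prediction can be back-transformed with exp(). The following is a minimal sketch that hard-codes the two coefficients from the formula; predict_price is a hypothetical helper name introduced here, not a function from any package.

```r
# Coefficients copied from the fitted formula
intercept <- 8.448661
slope <- 1.675817

# Predict log(price) for the given carat weights, then back-transform with exp()
predict_price <- function(carat) {
  log_price <- intercept + slope * log(carat)
  exp(log_price)
}

# Forecast prices for diamonds of 0.5, 1 and 2 carats
print(predict_price(c(0.5, 1, 2)))
```

Keep in mind that exponentiating a log-scale prediction gives a typical (roughly median) price rather than the mean price, so these values are best treated as approximate.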
