Standardization in Statistics with R

by finnstats

When a dataset is standardized, all of its variables are rescaled so that each has a mean of 0 and a standard deviation of 1.

Standardization in Statistics

In a data frame, the values of one feature may range from 1 to 500 while the values of another range from 1 to 10,000,000.

In cases like this, the feature with the larger numeric range can have a disproportionate influence on the response variable compared with the feature with the smaller range, which may hurt prediction accuracy.

The goal is to improve predictive accuracy by preventing any single attribute from dominating the prediction simply because of its wide numeric range.

As a result, we may need to normalize or scale the values of the different attributes so that they all fall on a comparable scale.

If you’re working with cluster analysis, keep in mind that scaling is critical.
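To see why scaling matters for distance-based methods such as clustering, here is a minimal sketch with made-up values (the income and age columns are hypothetical and are not part of the data used later in this post). It compares Euclidean distances before and after z-score standardization.

```r
# Hypothetical toy data: two features on very different scales
toy <- data.frame(income = c(30000, 31000, 90000),
                  age    = c(25, 60, 26))

# Raw distances are dominated by income, the large-range feature:
# rows 1 and 2 look close even though their ages differ by 35 years
raw_dist <- as.matrix(dist(toy))

# After standardization both features contribute on a comparable scale
scaled_dist <- as.matrix(dist(scale(toy)))

raw_dist
scaled_dist
```

In the raw distances, the age difference between rows 1 and 2 barely registers next to the income difference between rows 1 and 3. After scaling, the two distances are of similar size.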

The most common method is z-score standardization, which rescales each value according to the formula:

(xi – xbar) / s

where:

xi: The ith value in the dataset.

xbar: The sample mean.

s: The sample standard deviation.
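The formula can be checked by hand in base R. The sketch below uses a small illustrative vector (the values are chosen only for this example) and compares the manual calculation with the result of scale(). Both sd() and scale() use the sample standard deviation, with n - 1 in the denominator.

```r
# Small illustrative vector (values chosen for this sketch only)
x <- c(2, 4, 6, 8, 10)

# Manual z-scores: subtract the sample mean, divide by the sample sd
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it for comparison
z_scale <- as.vector(scale(x))

# TRUE when the two approaches agree
all.equal(z_manual, z_scale)
```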

The following examples show how to apply z-score standardization to one or more variables in a data frame, using the scale() function together with the dplyr package in R.

Step 1: Standardize a Single Variable

The following code shows how to scale just one variable in a data frame with three variables.

Let’s load the dplyr library:

library(dplyr)

To make this example reproducible, set the random seed with the set.seed() function:

set.seed(1234)

Now we create the data frame:

originaldata <- data.frame(var1 = runif(10, 0, 50),
                           var2 = runif(10, 2, 30),
                           var3 = runif(10, 5, 60))

Let’s view the first six rows of originaldata:

head(originaldata)

    var1      var2      var3
1 17.67357  9.343906 32.817272
2 32.95256  2.677442 51.860993
3 19.89896  7.183681 37.270479
4 13.79581 25.001149 50.688037
5 20.40034 28.806178  5.723320
6 10.10240  9.812477  8.779676

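Now we can scale the variable “var1” to have mean = 0 and standard deviation = 1. The as.vector() call converts the one-column matrix returned by scale() back into a plain numeric column:

scaleddata <- originaldata %>% mutate_at(c('var1'), ~(scale(.) %>% as.vector))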
scaleddata

       var1      var2     var3
1  -1.0371043 28.393877 31.12995
2   2.1192352 15.906119 11.85231
3   0.6026644  2.914802 35.95064
4   0.8720766 18.403451 40.79981
5   0.2209620 20.700270 34.40191
6  -0.7089954 26.373200 30.98616
7   0.1906236  5.933846 56.28785
8  -0.6436983 14.245744 39.85674
9  -0.8395434 17.858952 57.31483
10 -0.7762205  6.227785 32.25440

It’s worth noting that only the first variable was scaled, while the other two stayed unchanged.

We can confirm that the new scaled variable has a mean of 0 and a standard deviation of 1 by computing both directly.

First, we calculate the mean of the scaled variable:

mean(scaleddata$var1)

[1] -1.113476e-17

This value is zero up to floating-point rounding error, which rounding makes explicit:

round(mean(scaleddata$var1), 0)

[1] 0

Next, let’s calculate the standard deviation of the scaled variable:

sd(scaleddata$var1)

[1] 1
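The as.vector() call used with scale() in this step is there because scale() returns a one-column matrix rather than a plain vector, and that matrix carries the centering and scaling values as attributes. A minimal sketch with illustrative values (not the data from this post) shows the difference:

```r
x <- c(10, 20, 30, 40)      # illustrative values only

s <- scale(x)
is.matrix(s)                # TRUE: scale() returns a matrix
attr(s, "scaled:center")    # the mean that was subtracted
attr(s, "scaled:scale")     # the standard deviation used as divisor

v <- as.vector(s)
is.matrix(v)                # FALSE: a plain numeric vector
```

Storing the plain vector keeps the data frame column a regular numeric column. If the original mean and standard deviation are needed later, for example to undo the transformation, they should be saved before the attributes are dropped.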

Step 2: Standardize Multiple Variables

The code below shows how to scale several variables in a data frame at the same time.

We can use the mutate_at() function to scale var1 and var2 to have mean = 0 and standard deviation = 1:

scaleddata <- originaldata %>% mutate_at(c('var1', 'var2'), ~(scale(.) %>% as.vector))
scaleddata

         var1        var2     var3
1  -1.0371043  1.47974343 31.12995
2   2.1192352  0.02450854 11.85231
3   0.6026644 -1.48940748 35.95064
4   0.8720766  0.31552992 40.79981
5   0.2209620  0.58318492 34.40191
6  -0.7089954  1.24426801 30.98616
7   0.1906236 -1.13758949 56.28785
8  -0.6436983 -0.16897978 39.85674
9  -0.8395434  0.25207785 57.31483
10 -0.7762205 -1.10333593 32.25440
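In recent dplyr releases the scoped verbs such as mutate_at() are superseded by across(). As a sketch, assuming dplyr 1.0.0 or later is installed, the same selective scaling can be written as follows (the name scaled_across is chosen here only for illustration):

```r
library(dplyr)

set.seed(1234)
originaldata <- data.frame(var1 = runif(10, 0, 50),
                           var2 = runif(10, 2, 30),
                           var3 = runif(10, 5, 60))

# Scale var1 and var2 only; var3 is left untouched
scaled_across <- originaldata %>%
  mutate(across(c(var1, var2), ~ as.vector(scale(.x))))

scaled_across
```

Replacing c(var1, var2) with everything() would scale every column instead.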

Step 3: Standardize All Variables

The following code uses the mutate_all() function to scale every variable in a data frame.

Let’s scale all variables to have mean = 0 and standard deviation = 1

scaleddata <- originaldata %>% mutate_all(~(scale(.) %>% as.vector))
scaleddata

       var1        var2        var3
1  -1.0371043  1.47974343 -0.45503317
2   2.1192352  0.02450854 -1.92844375
3   0.6026644 -1.48940748 -0.08658275
4   0.8720766  0.31552992  0.28404422
5   0.2209620  0.58318492 -0.20495350
6  -0.7089954  1.24426801 -0.46602298
7   0.1906236 -1.13758949  1.46781181
8  -0.6436983 -0.16897978  0.21196488
9  -0.8395434  0.25207785  1.54630538
10 -0.7762205 -1.10333593 -0.36909013
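When every column is numeric, the same result can also be obtained without dplyr, because scale() accepts a whole data frame. The sketch below is one possible base R approach (the name scaled_base is chosen here only for illustration) and checks the mean and standard deviation of every column at once:

```r
set.seed(1234)
originaldata <- data.frame(var1 = runif(10, 0, 50),
                           var2 = runif(10, 2, 30),
                           var3 = runif(10, 5, 60))

# scale() accepts a numeric data frame and returns a matrix,
# so convert the result back to a data frame
scaled_base <- as.data.frame(scale(originaldata))

# Column means should be 0 up to floating-point error
round(colMeans(scaled_base), 10)

# Column standard deviations should all be 1
sapply(scaled_base, sd)
```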
