Standardization in Statistics with R
In statistics, standardizing a dataset means rescaling every variable so that its mean is 0 and its standard deviation is 1.
Standardization in Statistics
In a data frame, the values of one feature may range from 1 to 500 while the values of another range from 1 to 10000000.
In cases like this, the feature with the larger numeric range can have a greater influence on the response variable than the feature with the smaller range, which may reduce prediction accuracy.
The goal is to improve predictive accuracy by preventing any single attribute from dominating the prediction simply because its values span a wide numeric range.
As a result, we may need to normalize or scale the attributes so that they are all on a comparable scale.
If you’re working with cluster analysis, keep in mind that scaling is critical, because distance-based methods are sensitive to the units of each variable.
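As a rough illustration of why scaling matters for distance-based methods, the sketch below compares Euclidean distances before and after scaling. The data frame toy and its column names are made up for this example; only dist() and scale() from base R are used.

```r
# Hypothetical toy data: two features on very different scales
toy <- data.frame(
  small = c(1, 250, 500),           # ranges over 1-500
  large = c(1, 10000000, 5000000)   # ranges over 1-10000000
)

# Euclidean distances on the raw values are driven almost entirely by `large`
raw_dist <- dist(toy)

# After z-score scaling, both features contribute on a comparable footing
scaled_dist <- dist(scale(toy))

print(raw_dist)
print(scaled_dist)
```

On the raw data the distances are in the millions and essentially ignore the small column; after scaling they are of order 1.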
The most common method is z-score standardization, which rescales each value according to the formula:
(xi - xbar) / s
where:
xi: the ith value in the dataset
xbar: the sample mean
s: the sample standard deviation
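The formula above can be applied by hand and compared with scale(). This is a minimal sketch; the vector x holds made-up values used only for illustration.

```r
# Hypothetical sample values, used only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Apply (xi - xbar) / s directly; sd() uses the sample (n - 1) denominator
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it before comparing
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)
```

Both approaches should agree, since scale() with its default arguments centers on the mean and divides by the sample standard deviation.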
The following examples demonstrate how to apply z-score standardization to one or more variables in a data frame using the scale() function together with the dplyr package in R.
Step 1: Standardize a Single Variable
The following code shows how to scale just one variable in a data frame with three variables.
Let’s load the dplyr library:
library(dplyr)
To make this example reproducible, set a seed with the set.seed() function:
set.seed(1234)
Now we create the data frame:
originaldata <- data.frame(var1 = runif(10, 0, 50), var2 = runif(10, 2, 30), var3 = runif(10, 5, 60))
Let’s view the first six rows of originaldata:
head(originaldata)
      var1      var2      var3
1 17.67357  9.343906 32.817272
2 32.95256  2.677442 51.860993
3 19.89896  7.183681 37.270479
4 13.79581 25.001149 50.688037
5 20.40034 28.806178  5.723320
6 10.10240  9.812477  8.779676
Now we can scale the variable “var1” to have mean = 0 and standard deviation = 1
scaleddata <- originaldata %>% mutate_at(c('var1'), ~(scale(.) %>% as.vector))
scaleddata
         var1      var2     var3
1  -1.0371043 28.393877 31.12995
2   2.1192352 15.906119 11.85231
3   0.6026644  2.914802 35.95064
4   0.8720766 18.403451 40.79981
5   0.2209620 20.700270 34.40191
6  -0.7089954 26.373200 30.98616
7   0.1906236  5.933846 56.28785
8  -0.6436983 14.245744 39.85674
9  -0.8395434 17.858952 57.31483
10 -0.7762205  6.227785 32.25440
It’s worth noting that only var1 was scaled, while var2 and var3 keep their original values.
We can quickly confirm that the new scaled variable has a mean of 0 and a standard deviation of 1.
First, calculate the mean of the scaled variable:
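The as.vector step in the scaling code is there because scale() returns a one-column matrix carrying extra attributes, not a plain vector. The sketch below shows the difference on a made-up vector x.

```r
x <- c(10, 20, 30, 40)  # hypothetical values

scaled_matrix <- scale(x)
is.matrix(scaled_matrix)           # scale() gives a 4 x 1 matrix
names(attributes(scaled_matrix))   # includes "scaled:center" and "scaled:scale"

# as.vector() drops the dimensions and attributes
scaled_vector <- as.vector(scaled_matrix)
is.matrix(scaled_vector)           # FALSE: a plain numeric vector
```

Storing a plain vector keeps the resulting column an ordinary numeric column in the data frame.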
mean(scaleddata$var1)
[1] -1.113476e-17
This value is zero up to floating-point error, which rounding makes explicit:
round(mean(scaleddata$var1),0)
[1] 0
Let’s calculate the standard deviation of the scaled variable
sd(scaleddata$var1)
[1] 1
Step 2: Standardize Multiple Variables
The code below demonstrates how to scale several variables in a data frame at the same time.
We can use the mutate_at() function to scale var1 and var2 to have mean = 0 and standard deviation = 1
scaleddata <- originaldata %>% mutate_at(c('var1', 'var2'), ~(scale(.) %>% as.vector))
scaleddata
         var1        var2     var3
1  -1.0371043  1.47974343 31.12995
2   2.1192352  0.02450854 11.85231
3   0.6026644 -1.48940748 35.95064
4   0.8720766  0.31552992 40.79981
5   0.2209620  0.58318492 34.40191
6  -0.7089954  1.24426801 30.98616
7   0.1906236 -1.13758949 56.28785
8  -0.6436983 -0.16897978 39.85674
9  -0.8395434  0.25207785 57.31483
10 -0.7762205 -1.10333593 32.25440
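Recent dplyr releases (1.0.0 and later) mark mutate_at() and mutate_all() as superseded in favor of across(). As a sketch, and assuming such a dplyr version is installed, the same multi-column scaling can be written as follows. The names scaled_across and scaled_at are introduced here only for the comparison.

```r
library(dplyr)

set.seed(1234)
originaldata <- data.frame(var1 = runif(10, 0, 50),
                           var2 = runif(10, 2, 30),
                           var3 = runif(10, 5, 60))

# across() selects the columns; the lambda standardizes each one in turn
scaled_across <- originaldata %>%
  mutate(across(c(var1, var2), ~ as.vector(scale(.x))))

# The same operation with the scoped verb used in this tutorial
scaled_at <- originaldata %>%
  mutate_at(c('var1', 'var2'), ~ (scale(.) %>% as.vector))

all.equal(scaled_across, scaled_at)
```

Both versions leave var3 untouched and should produce identical data frames.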
Step 3: Standardize All Variables
The following code shows how to scale all variables in a data frame using the mutate_all() function.
Let’s scale all variables to have mean = 0 and standard deviation = 1
scaleddata <- originaldata %>% mutate_all(~(scale(.) %>% as.vector))
scaleddata
         var1        var2        var3
1  -1.0371043  1.47974343 -0.45503317
2   2.1192352  0.02450854 -1.92844375
3   0.6026644 -1.48940748 -0.08658275
4   0.8720766  0.31552992  0.28404422
5   0.2209620  0.58318492 -0.20495350
6  -0.7089954  1.24426801 -0.46602298
7   0.1906236 -1.13758949  1.46781181
8  -0.6436983 -0.16897978  0.21196488
9  -0.8395434  0.25207785  1.54630538
10 -0.7762205 -1.10333593 -0.36909013
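When every column is numeric, base R alone can do the same job, and the result can be checked for all columns at once. The sketch below rebuilds the data frame so that it runs on its own; scaled_base is a name introduced here for illustration.

```r
set.seed(1234)
originaldata <- data.frame(var1 = runif(10, 0, 50),
                           var2 = runif(10, 2, 30),
                           var3 = runif(10, 5, 60))

# scale() accepts a numeric data frame and returns a matrix,
# so convert the result back to a data frame afterwards
scaled_base <- as.data.frame(scale(originaldata))

# Every column mean should be 0 (up to floating-point error) and every sd 1
round(colMeans(scaled_base), 10)
sapply(scaled_base, sd)
```

This avoids the dplyr dependency, at the cost of failing on data frames that contain non-numeric columns.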