How to Use the scale() Function in R
Scaling is a technique for comparing data that isn’t measured on the same scale. Standardizing a dataset using its mean and standard deviation is known as scaling.
When working with vectors or columns in a data frame, scaling is frequently employed.
In R, you can use the scale() function to scale the values in a vector, matrix, or data frame.
If you do not standardize vectors or columns that are measured on very different scales, methods that depend on distances or variances, such as clustering or principal component analysis, can be dominated by the variable with the largest values.
scale() is a built-in R function that, by default, centers and scales the columns of a numeric matrix.
If center = TRUE, scale() subtracts each column’s mean from its values; if center is a numeric vector, it subtracts the corresponding value from each column instead.
The following is the fundamental syntax for this function:
scale(x, center = TRUE, scale = TRUE)
where:
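x: The numeric object (vector, matrix, or data frame) to be scaled
center: Whether to subtract the column mean when scaling. TRUE is the default value; a numeric vector with one value per column is also accepted.
scale: Whether to divide by the standard deviation when scaling. TRUE is the default value; a numeric vector with one value per column is also accepted.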
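As a small sketch of these arguments, center and scale can be given numeric vectors with one value per column instead of TRUE or FALSE. The matrix m and the chosen centering and scaling values below are made up purely for illustration.

```r
# Hypothetical 3 x 2 matrix, used only for illustration
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# Subtract 2 from column 1 and 20 from column 2,
# then divide column 1 by 1 and column 2 by 10
m_custom <- scale(m, center = c(2, 20), scale = c(1, 10))

# Rebuild a plain matrix so only the values (not the attributes) are printed
custom_values <- matrix(as.vector(m_custom), ncol = 2)
custom_values
```

Both columns end up as -1, 0, 1 here, because each column was shifted and divided by its own supplied value.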
This function uses the following formula to calculate scaled values.
xscaled = (x – x̄) / s
where:
x: Original x-value
x̄: Sample mean
s: Sample SD
This is also known as data standardization, and it basically involves converting each original value into a z-score.
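To make the formula concrete, here is a minimal sketch that computes z-scores by hand and compares them with scale(). The vector v is a made-up example, not data from this article.

```r
# Hypothetical example vector
v <- c(2, 4, 6, 8, 10)

# Apply the formula directly: subtract the mean, divide by the sample SD
z_manual <- (v - mean(v)) / sd(v)

# scale() returns a one-column matrix; as.vector() drops the
# matrix shape and the attached attributes
z_scale <- as.vector(scale(v))

# Both approaches should agree up to floating-point tolerance
all.equal(z_manual, z_scale)
```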
If scale is a numeric vector, scale() divides the values of each column by the corresponding value in that vector.
If scale = TRUE, each column is divided by its standard deviation when the data are centered, or by its root mean square when center = FALSE.
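The root-mean-square case can be checked with a short sketch. It assumes the definition given in the scale() help page, where the sum of squares is divided by n - 1; the vector v is a made-up example.

```r
# Hypothetical example vector
v <- c(1, 2, 3, 4, 5)

# With center = FALSE and the default scale = TRUE,
# scale() divides by the root mean square instead of the SD
scaled_rms <- as.vector(scale(v, center = FALSE))

# Root mean square as described in ?scale: sqrt(sum(x^2) / (n - 1))
rms <- sqrt(sum(v^2) / (length(v) - 1))
manual_rms <- v / rms

# The two results should agree up to floating-point tolerance
all.equal(scaled_rms, manual_rms)
```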
The examples below demonstrate how to utilize this function in practice.
Example 1: Scale the Values in a Vector
Suppose we have the following vector of values in R.
x <- c(11, 12, 13, 24, 25, 16, 17, 18, 19)
# View the mean and standard deviation of the values
mean(x)
[1] 17.22222
sd(x)
[1] 4.944132
The scale() function is used to scale the values in the vector in the following code.
# Scale the values of x
x_scaled <- scale(x)
# View the scaled values
x_scaled
             [,1]
 [1,] -1.25850641
 [2,] -1.05624645
 [3,] -0.85398649
 [4,]  1.37087305
 [5,]  1.57313301
 [6,] -0.24720662
 [7,] -0.04494666
 [8,]  0.15731330
 [9,]  0.35957326
attr(,"scaled:center")
[1] 17.22222
attr(,"scaled:scale")
[1] 4.944132
Because centering subtracts the mean, every value below the mean becomes negative. Standardizing removes the effect of differing units and scales so that vectors can be compared directly, but it does not change the shape of the distribution: a skewed variable is still skewed after scaling.
This type of standardization is useful when comparing data from variables measured on different scales.
It’s worth noting that if we specify scale = FALSE, the function only centers the values and does not divide by the standard deviation:
# Center the values of x without dividing by the standard deviation
x_scaled <- scale(x, scale = FALSE)
x_scaled
            [,1]
 [1,] -6.2222222
 [2,] -5.2222222
 [3,] -4.2222222
 [4,]  6.7777778
 [5,]  7.7777778
 [6,] -1.2222222
 [7,] -0.2222222
 [8,]  0.7777778
 [9,]  1.7777778
attr(,"scaled:center")
[1] 17.22222
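The scaled:center and scaled:scale attributes shown in the output can also be used to get back to the original values. The following is a minimal sketch using the same vector x with the default settings; the names x_center, x_sd, and x_restored are only illustrative.

```r
x <- c(11, 12, 13, 24, 25, 16, 17, 18, 19)
x_scaled <- scale(x)

# Read back the mean and SD that scale() stored as attributes
x_center <- attr(x_scaled, "scaled:center")
x_sd <- attr(x_scaled, "scaled:scale")

# Reverse the standardization: multiply by the SD, then add the mean
x_restored <- as.vector(x_scaled) * x_sd + x_center

# The restored values should match the original vector
all.equal(x_restored, x)
```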
Example 2: Scale the Column Values in a Data Frame
When we want to scale the values in several columns of a data frame so that each column has a mean of 0 and a standard deviation of 1, we usually use the scale() function.
As an example, consider the following data frame in R:
data <- data.frame(x=c(11, 12, 23, 24, 25, 66, 77, 18, 9), y=c(60, 80, 90, 10, 5, 6, 700, 180, 190))
data
   x   y
1 11  60
2 12  80
3 23  90
4 24  10
5 25   5
6 66   6
7 77 700
8 18 180
9  9 190
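# Scale both columns of the data frame
df_scaled <- scale(data)
df_scaled
The result is a matrix with the following values (shown here rounded to three decimal places, with the scaled:center and scaled:scale attributes omitted):
           x      y
 [1,] -0.747 -0.397
 [2,] -0.706 -0.305
 [3,] -0.261 -0.260
 [4,] -0.220 -0.625
 [5,] -0.180 -0.648
 [6,]  1.480 -0.644
 [7,]  1.925  2.529
 [8,] -0.463  0.152
 [9,] -0.828  0.198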
In the original data frame, the y variable’s range of values is significantly larger than the x variable’s range.
Calling scale() on the data frame rescales both columns so that x and y each have a mean of 0 and a standard deviation of 1.
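One way to confirm this is to compute the column means and standard deviations of the scaled result. This is a small sketch that rebuilds the same data frame so it runs on its own; col_means and col_sds are illustrative names.

```r
data <- data.frame(x = c(11, 12, 23, 24, 25, 66, 77, 18, 9),
                   y = c(60, 80, 90, 10, 5, 6, 700, 180, 190))
df_scaled <- scale(data)

# Column means should be zero up to floating-point error
col_means <- colMeans(df_scaled)

# Column standard deviations should be 1
col_sds <- apply(df_scaled, 2, sd)

round(col_means, 10)
col_sds
```

Note that scale() returns a matrix rather than a data frame; as.data.frame(df_scaled) converts it back if a data frame is needed.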
Conclusion
With the default settings, the scale() function calculates the vector’s mean and standard deviation, then standardizes each element by subtracting the mean and dividing by the standard deviation.
The scale() function is most useful when you need to compare several variables measured on different scales, for example when one variable is of magnitude 100 and another is of magnitude 1000.
Its only purpose is to standardize the data. The values it produces go by several names, one of which is z-scores.