How to Use the scale() Function in R

Scaling is a technique for comparing data that are not measured on the same scale. Standardizing a dataset using its mean and standard deviation is known as scaling.

When working with vectors or columns in a data frame, scaling is frequently employed.

In R, you can use the scale() function to scale the values in a vector, matrix, or data frame.

If you do not standardize vectors or columns that are measured on very different scales, results that depend on magnitude can be dominated by the variable with the largest values.

scale() is a built-in R function that, by default, centers and scales the columns of a numeric matrix.

If center is a numeric vector with one value per column, scale() subtracts the corresponding center value from each column. If center is TRUE, it subtracts each column's mean.

The following is the fundamental syntax for this function:

scale(x, center = TRUE, scale = TRUE)

where:

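x: The object to scale (a numeric vector, matrix, or data frame)

center: Whether to subtract the column mean when scaling. The default is TRUE. A numeric vector of values to subtract can also be supplied.

scale: Whether to divide by the standard deviation when scaling. The default is TRUE. A numeric vector of divisors can also be supplied.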

This function uses the following formula to calculate scaled values.

x_scaled = (x - x̄) / s

where:

x: Original x-value

x̄: Sample mean

s: Sample SD

This is also known as data standardization, and it basically involves converting each original value into a z-score.
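As a quick check of the formula, the sketch below computes the z-scores by hand and compares them with the output of scale(). The vector v and the names z_manual and z_scale are made up for illustration.

```r
# Illustrative vector (not taken from the article's examples)
v <- c(2, 4, 6, 8, 10)

# Apply the formula directly: subtract the sample mean, divide by the sample SD
z_manual <- (v - mean(v)) / sd(v)

# scale() returns a one-column matrix, so drop it to a plain vector first
z_scale <- as.numeric(scale(v))

# Both approaches should agree up to floating-point tolerance
all.equal(z_manual, z_scale)
```

The as.numeric() call is only there to strip the matrix dimensions and attributes so the two results can be compared directly.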

If scale is a numeric vector with one value per column, the scale() function divides each column by the corresponding scale value.

If scale is TRUE, each column is divided by its standard deviation when the data have been centered, and by its root mean square otherwise.
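The sketch below illustrates both cases on a small made-up matrix m: numeric center and scale vectors, and the root-mean-square divisor used when center = FALSE. The data and object names are hypothetical; the root mean square follows the definition in the scale() documentation, sqrt(sum(x^2) / (n - 1)).

```r
# Illustrative two-column matrix (hypothetical data)
m <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# Numeric center and scale: one value per column
custom <- scale(m, center = c(1, 10), scale = c(2, 20))
custom[, "a"]   # computed as (a - 1) / 2
custom[, "b"]   # computed as (b - 10) / 20

# center = FALSE with scale = TRUE divides by the root mean square,
# which scale() defines as sqrt(sum(x^2) / (n - 1))
rms_scaled <- scale(m, center = FALSE, scale = TRUE)
rms_a <- sqrt(sum(m[, "a"]^2) / (nrow(m) - 1))
all.equal(as.numeric(rms_scaled[, "a"]), m[, "a"] / rms_a)
```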

The examples below demonstrate how to utilize this function in practice.

Example 1: Scale the Values in a Vector

Assume we have the following vector of values in R.

x <- c(11, 12, 13, 24, 25, 16, 17, 18, 19)

# Look at the mean and standard deviation of the data

mean(x)

[1] 17.22222

sd(x)

[1] 4.944132

The following code uses the scale() function to scale the values in the vector.

# Scale the x values

x_scaled <- scale(x)

# View the scaled values

x_scaled
             [,1]
 [1,] -1.25850641
 [2,] -1.05624645
 [3,] -0.85398649
 [4,]  1.37087305
 [5,]  1.57313301
 [6,] -0.24720662
 [7,] -0.04494666
 [8,]  0.15731330
 [9,]  0.35957326
attr(,"scaled:center")
[1] 17.22222
attr(,"scaled:scale")
[1] 4.944132

Because centering subtracts the mean, every value below the mean becomes negative. When comparing vectors, scaling removes the effect of different units and magnitudes by giving each vector a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution.

This type of standardization is useful when comparing data from multiple measures.
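To illustrate how scaling removes the effect of measurement units, the sketch below scales the same made-up heights recorded in centimetres and in metres. The vectors height_cm and height_m are hypothetical and only serve the comparison.

```r
# Hypothetical heights, recorded once in centimetres and once in metres
height_cm <- c(150, 160, 170, 180, 190)
height_m  <- height_cm / 100

# The raw values differ by a factor of 100
range(height_cm)
range(height_m)

# After scaling, both versions carry the same z-scores
all.equal(as.numeric(scale(height_cm)), as.numeric(scale(height_m)))

# Scaling is a linear transformation, so the ordering of the values
# (and the shape of the distribution) is unchanged
identical(order(height_cm), order(as.numeric(scale(height_cm))))
```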

It’s worth noting that if we supply scale = FALSE, the function does not divide by the standard deviation when scaling:

# Don't divide by the standard deviation when scaling the x values

x_scaled <- scale(x, scale = FALSE)
x_scaled
           [,1]
 [1,] -6.2222222
 [2,] -5.2222222
 [3,] -4.2222222
 [4,]  6.7777778
 [5,]  7.7777778
 [6,] -1.2222222
 [7,] -0.2222222
 [8,]  0.7777778
 [9,]  1.7777778
attr(,"scaled:center")
[1] 17.22222
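As a small check, the sketch below confirms that centering alone is the same as subtracting the mean, and that the stored "scaled:center" attribute is enough to recover the original values. The names centered and restored are illustrative.

```r
x <- c(11, 12, 13, 24, 25, 16, 17, 18, 19)

# Centering only: equivalent to subtracting the mean
centered <- scale(x, scale = FALSE)
all.equal(as.numeric(centered), x - mean(x))

# The mean that was subtracted is stored as an attribute,
# so the original values can be recovered
restored <- as.numeric(centered) + attr(centered, "scaled:center")
all.equal(restored, x)
```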

Example 2: Scale the Column Values in a Data Frame

When we want to scale the values in several columns of a data frame so that each column has a mean of 0 and a standard deviation of 1, we usually use the scale() function.

As an example, consider the following data frame in R:

data <- data.frame(x=c(11, 12, 23, 24, 25, 66, 77, 18, 9),
                 y=c(60, 80, 90, 10, 5, 6, 700, 180, 190))
data
  x   y
1 11  60
2 12  80
3 23  90
4 24  10
5 25   5
6 66   6
7 77 700
8 18 180
9  9 190
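
The y variable’s range of values is significantly larger than the x variable’s range of values.

The scale() function can be used to scale the values in both columns so that the scaled values of x and y have the same mean and standard deviation.

# Scale the values in each column of the data frame

df_scaled <- scale(data)

# View the scaled values

df_scaled
              x          y
[1,] -0.7466237 -0.3966699
[2,] -0.7061441 -0.3052479
[3,] -0.2608685 -0.2595369
[4,] -0.2203889 -0.6252249
[5,] -0.1799093 -0.6480804
[6,]  1.4797543 -0.6435093
[7,]  1.9250299  2.5288340
[8,] -0.4632665  0.1518621
[9,] -0.8275830  0.1975731
attr(,"scaled:center")
        x         y
 29.44444 146.77778
attr(,"scaled:scale")
       x        y
 24.7038 218.7657

The x and y columns now have the same mean of 0 and standard deviation of 1.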
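Because scale() returns a matrix rather than a data frame, it can help to convert the result back and confirm the means and standard deviations directly. The following is a minimal sketch using the same data; the name scaled_df is illustrative.

```r
data <- data.frame(x = c(11, 12, 23, 24, 25, 66, 77, 18, 9),
                   y = c(60, 80, 90, 10, 5, 6, 700, 180, 190))

# scale() returns a matrix; convert back if a data frame is needed
scaled_df <- as.data.frame(scale(data))

# Column means should be (numerically) zero ...
round(colMeans(scaled_df), 10)

# ... and column standard deviations should be one
sapply(scaled_df, sd)
```

The rounding is there because the column means come out as tiny floating-point values rather than exact zeros.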

Conclusion

With the default settings, the scale() function calculates the mean and standard deviation of the vector (or of each column), then scales each element by subtracting the mean and dividing by the standard deviation.

The scale() function is most useful when you need to examine several variables measured on different scales, for example when one variable has a magnitude of around 100 and another of around 1000.

The scale() function does nothing more than standardize the data. The values it generates are known by a variety of names, one of which is z-scores.
