Calculating Z-Scores in Python: A Step-by-Step Guide
Z-scores are a fundamental concept in statistics: a z-score measures how many standard deviations a value lies from the mean.
In this article, we’ll explore how to calculate z-scores in Python using SciPy with NumPy arrays and Pandas DataFrames.
Using SciPy’s zscore Function
The zscore function in SciPy’s stats module provides a convenient way to calculate z-scores for one-dimensional or multi-dimensional arrays. The function takes the following arguments:
a: an array-like object containing the data
axis: the axis along which to calculate the z-scores (default is 0)
ddof: degrees of freedom correction in the calculation of the standard deviation (default is 0)
nan_policy: how to handle NaN values (default is 'propagate', which returns NaN)
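The effect of ddof and nan_policy is easiest to see on a small array. The following is a minimal sketch, not part of the original examples; the array values and variable names are made up for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative data with one missing value (not from the article's examples).
values = np.array([2.0, 4.0, np.nan, 8.0, 10.0])

# Default nan_policy='propagate': the NaN contaminates the mean and the
# standard deviation, so every z-score comes back as NaN.
z_propagate = stats.zscore(values)

# nan_policy='omit': the NaN is ignored when computing the mean and the
# standard deviation, and it remains NaN at its own position in the output.
z_omit = stats.zscore(values, nan_policy='omit')

# ddof=0 (default) divides by n; ddof=1 divides by n - 1 (sample estimate).
clean = values[~np.isnan(values)]
z_population = stats.zscore(clean)
z_sample = stats.zscore(clean, ddof=1)

print(z_propagate)
print(z_omit)
print(z_population)
print(z_sample)
```

With ddof=1 the standard deviation is slightly larger, so the z-scores are slightly smaller in magnitude than with the default.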
Example 1: Calculating Z-Scores for a One-Dimensional NumPy Array
Let’s start with a simple example using a one-dimensional NumPy array.
import numpy as np
import scipy.stats as stats
data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])
z_scores = stats.zscore(data)
print(z_scores)
This will output:
[-1.394 -1.195 -1.195 -0.199 0. 0. 0.398 0.598 1.195 1.793]
Each z-score tells us how many standard deviations away an individual value is from the mean.
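To make the definition concrete, the same result can be reproduced by hand as z = (x - mean) / std. This is a small sketch for checking the idea, using the data from Example 1; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])

mean = data.mean()   # 13.0
std = data.std()     # population standard deviation (ddof=0), matching zscore's default

# Subtract the mean, then divide by the standard deviation.
z_manual = (data - mean) / std

# The manual result agrees with SciPy's zscore.
print(np.allclose(z_manual, stats.zscore(data)))  # True
```

Values equal to the mean (the two 13s here) get a z-score of exactly 0.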
Example 2: Calculating Z-Scores for a Multi-Dimensional NumPy Array
What if we have a multi-dimensional array? We can use the axis parameter to specify the axis along which the z-scores are calculated. For example, axis=1 standardizes each row separately:
data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])
z_scores = stats.zscore(data, axis=1)
print(z_scores)
This will output:
[[-1.569 -0.588 0.392 0.392 1.373]
[-0.816 -0.816 -0.816 1.225 1.225]
[-1.167 -1.167 0.5 0.5 1.333]]
With axis=1, each z-score is calculated relative to the mean and standard deviation of its own row.
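To see what the axis argument changes, it can help to standardize the same array both ways. The sketch below reuses the array from Example 2; the names z_rows and z_cols are illustrative.

```python
import numpy as np
from scipy import stats

data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])

# axis=1: each row is standardized using its own mean and standard deviation.
z_rows = stats.zscore(data, axis=1)

# axis=0 (the default): each column is standardized instead.
z_cols = stats.zscore(data, axis=0)

# Along the chosen axis, the z-scores average to zero.
print(np.allclose(z_rows.mean(axis=1), 0))  # True
print(np.allclose(z_cols.mean(axis=0), 0))  # True
```

Both results have the same shape as the input; only the grouping used for the mean and standard deviation differs.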
Example 3: Calculating Z-Scores for a Pandas DataFrame
Finally, let’s use the DataFrame apply method to calculate z-scores for each column of a Pandas DataFrame.
import pandas as pd
import numpy as np
import scipy.stats as stats
data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['A', 'B', 'C'])
z_scores = data.apply(stats.zscore)
print(z_scores)
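Because the data is random and no seed is set, the numbers differ on every run. The output has the following form (values are illustrative only; in an actual run, the z-scores in each column sum to zero):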
A B C
0 0.659380 -0.802955 0.836080
1 -0.659380 -0.802955 -0.139347
2 0.989071 -0.917663 -0.487713
3 -1.648451 -1.491202 -1.950852
4 0.659380 -0.802955 -0.487713
Each z-score is calculated relative to its own column.
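The same column-wise result can also be obtained with plain Pandas arithmetic. One detail to watch: Pandas' std defaults to ddof=1, while SciPy's zscore defaults to ddof=0. The sketch below uses a small fixed DataFrame (made-up values, chosen so the result is reproducible) rather than random data.

```python
import pandas as pd
import numpy as np
from scipy import stats

# Fixed illustrative data instead of random integers, so runs are repeatable.
df = pd.DataFrame({'A': [1, 4, 7, 2, 6],
                   'B': [3, 3, 8, 5, 1],
                   'C': [9, 0, 4, 4, 3]})

# SciPy applied column by column (ddof=0 by default).
z_scipy = df.apply(stats.zscore)

# Pandas equivalent: ddof=0 must be passed explicitly to match SciPy.
z_pandas = (df - df.mean()) / df.std(ddof=0)

# With Pandas' default (ddof=1) the values differ slightly.
z_sample = (df - df.mean()) / df.std()

print(np.allclose(z_scipy, z_pandas))  # True
print(np.allclose(z_scipy, z_sample))  # False
```

Which ddof to use depends on whether the data is treated as a full population or as a sample; the point here is only that the two libraries default differently.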
Conclusion
Calculating z-scores in Python is a straightforward process using SciPy’s zscore function, either directly on NumPy arrays or through the apply method on Pandas DataFrames.
By following these examples, you can calculate z-scores for your own data and see how far each value lies from its mean in units of standard deviation.