Linear Discriminant Analysis: A Step-by-Step Guide
When we have a set of predictor variables and want to classify a response variable into one of two classes, we typically use logistic regression.
For example, we might use logistic regression in the following scenario:
We want to use a customer’s credit score and bank balance to predict whether they will be eligible for a loan. (The response variable is “Yes” or “No.”)
Linear Discriminant Analysis
When there are more than two distinct classes for a response variable, however, we usually choose to employ a method called linear discriminant analysis, or LDA.
For instance, we might use LDA in the following scenario:
We want to use points per game to predict which of three divisions a high school cricket player will be accepted into: Division 1, Division 2, or Division 3.
Although both LDA and logistic regression are used for classification, LDA tends to give more stable estimates than logistic regression when predicting several classes, which is why it is usually preferred when the response variable has more than two classes.
LDA also tends to perform better than logistic regression when sample sizes are small and the predictors are roughly normal, making it a good choice when large samples are unavailable.
How to Construct LDA Models
LDA makes the following assumptions about a dataset:
(1) The values of each predictor variable are normally distributed. That is, a histogram of the values for a given predictor would have a rough “bell shape.”
(2) Each predictor variable has the same variance. Because this is almost never true of real-world data, we typically scale each variable to have the same mean and variance before fitting an LDA model.
LDA then estimates the following values once these assumptions are met:
μ_k: The mean of all training observations from the kth class.
σ²: The weighted average of the sample variances of the K classes, that is, a single pooled variance shared by all classes.
π_k: The proportion of the training observations that belong to the kth class.
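As a rough illustration of how these three quantities could be estimated for a single predictor, here is a minimal Python sketch. The function name estimate_lda_parameters and the points-per-game numbers are made up for this example, and the pooled variance divides by n - K (the number of observations minus the number of classes), which is one common convention.

```python
from collections import defaultdict


def estimate_lda_parameters(x, y):
    """Estimate class means, pooled variance, and class priors for one predictor."""
    # Group the predictor values by class label
    groups = defaultdict(list)
    for value, label in zip(x, y):
        groups[label].append(value)

    n = len(x)        # total number of training observations
    k = len(groups)   # number of classes

    # mu_k: mean of the observations in each class
    means = {c: sum(v) / len(v) for c, v in groups.items()}

    # pi_k: share of the training observations in each class
    priors = {c: len(v) / n for c, v in groups.items()}

    # sigma^2: within-class squared deviations, summed over all classes
    # and divided by n - K (assumed convention)
    ss_within = sum(
        (value - means[c]) ** 2 for c, v in groups.items() for value in v
    )
    pooled_var = ss_within / (n - k)

    return means, pooled_var, priors


# Hypothetical training data: points per game and division label
points = [22.0, 24.0, 26.0, 15.0, 17.0, 19.0, 8.0, 10.0, 12.0]
division = ["D1"] * 3 + ["D2"] * 3 + ["D3"] * 3

means, pooled_var, priors = estimate_lda_parameters(points, division)
print(means)       # class means
print(pooled_var)  # pooled variance
print(priors)      # class proportions
```

With equal-sized classes like these, every prior is the same, so the class means and the pooled variance do all the work of separating the classes.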
LDA then plugs these numbers into the formula below, assigning each observation X = x to the class with the highest value produced by the formula:
$$D_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$
The “linear” in LDA comes from the fact that the discriminant function above is a linear function of x.
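To make the classification rule concrete, the following Python sketch evaluates the discriminant function for each class and picks the largest score. The class means, pooled variance, and priors are hypothetical values chosen for illustration, and the function names discriminant and classify are not from any library.

```python
import math


def discriminant(x, mean, var, prior):
    """Discriminant score: x * (mu_k / sigma^2) - mu_k^2 / (2 * sigma^2) + log(pi_k)."""
    return x * (mean / var) - mean ** 2 / (2 * var) + math.log(prior)


def classify(x, means, var, priors):
    """Return the class with the highest discriminant score, plus all scores."""
    scores = {c: discriminant(x, means[c], var, priors[c]) for c in means}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical estimates for a one-predictor, three-class problem
means = {"D1": 24.0, "D2": 17.0, "D3": 10.0}
priors = {"D1": 1 / 3, "D2": 1 / 3, "D3": 1 / 3}
var = 4.0

label, scores = classify(18.0, means, var, priors)
print(label)   # class assigned to a player averaging 18 points per game
print(scores)  # one discriminant score per class
```

Because the priors are equal in this example, the decision boundary between two neighboring classes sits halfway between their means. Unequal priors would shift the boundary toward the less common class.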
How to Get Data Ready for LDA
Before using an LDA model on your data, make sure it fits the following criteria:
1. There is a categorical response variable. LDA models are intended for use in classification problems, in which the response variable can be classified into groups or categories.
2. The predictor variables are normally distributed. Check that each predictor variable is roughly normally distributed. If it is not, you may want to transform the data first to make the distribution more normal.
3. The variance of each predictor variable is the same. LDA presupposes that each predictor variable has the same variance, as previously stated.
Because this is uncommon in practice, it’s a good idea to scale each variable in the dataset to have a mean of 0 and a standard deviation of 1.
4. Account for extreme outliers. Before using LDA, check the dataset for extreme outliers. You can usually spot them visually with boxplots or scatterplots.
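Two of the checks above, scaling and outlier screening, can be sketched in a few lines of Python using only the standard library. The bank balance values are invented for illustration, the helper names standardize and iqr_outliers are hypothetical, and the 1.5 × IQR rule is just one common heuristic for what a boxplot would flag. It is not something LDA itself requires.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (sample SD)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical bank balances with one suspicious value
balances = [1200.0, 1500.0, 1350.0, 1600.0, 1450.0, 1300.0, 9800.0]

flagged = iqr_outliers(balances)
print(flagged)  # values a boxplot would likely show as outliers

scaled = standardize(balances)
print(scaled)   # balances rescaled to mean 0 and standard deviation 1
```

In practice you would probably screen for outliers before scaling, since a single extreme value inflates the standard deviation and compresses the rest of the scaled data.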