PCA for Categorical Variables in R
You may have considered using Principal Component Analysis (PCA) to reduce the dimensionality of your data frame.
However, can PCA be applied to a data set with categorical variables?
In this tutorial, you’ll see what happens when Principal Component Analysis (PCA) meets data frames that include categorical variables, and which alternative methods are better suited to such data.
You’ll also learn how to put these alternatives into practice with the R programming language.
Can a Data Frame with Categorical Variables be Used for PCA?
The answer is not straightforward: although it is technically possible to run a PCA on a data frame containing categorical variables, this is usually not the best course of action.
The main reason is that PCA works by decomposing the variance structure of the variables, so it is designed for numerical, ideally continuous, variables.
Categorical variables have no numeric scale, so variances and covariances are not defined for them, and PCA cannot summarise them in a meaningful way.
One way to run a PCA on a data set with categorical variables is to convert each categorical variable into a set of binary variables with values 0 and 1.
However, a PCA on a data set made up only of binary variables is hard to interpret, so we should look at other options if we want to study a data set that includes categorical data.
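The binary encoding described above can be sketched in base R. This is a minimal illustration on a made-up data frame (`toy`, `colour` and `size` are hypothetical names, not part of the tutorial's data), using `model.matrix()` to build the 0/1 columns and `prcomp()` to run the PCA:

```r
# Hypothetical toy data: one categorical and one numeric column
toy <- data.frame(
  colour = factor(c("red", "green", "blue", "green", "red", "blue")),
  size   = c(1.2, 3.4, 2.2, 3.9, 1.0, 2.5)
)

# One 0/1 column per level of 'colour' (the "- 1" drops the intercept)
dummies <- model.matrix(~ colour - 1, data = toy)

# Combine the indicator columns with the numeric column and run a PCA
toy_encoded <- cbind(dummies, size = toy$size)
toy_pca <- prcomp(toy_encoded, scale. = TRUE)
summary(toy_pca)
```

The indicator columns always sum to 1 in every row, so they are linearly dependent; this is one reason the resulting components can be awkward to interpret.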
We’ll need the FactoMineR, vcd, and factoextra packages for this tutorial. The following code can be used to install these packages if necessary:
install.packages("FactoMineR")
install.packages("vcd")
install.packages("factoextra")
Next, load the libraries:
library(FactoMineR)
library(vcd)
library(factoextra)
Factor Analysis of Mixed Data (FAMD): An Alternative to PCA for Categorical Variables
Factor Analysis of Mixed Data (FAMD) is a principal component method. By taking several types of variables into account, it lets us examine how similar the individuals (observations) in a data set are.
The technique consists of two steps: first, it encodes the data in a suitable way; second, it searches the encoded data set iteratively for the K principal components.
This search for principal components works in the same way as in PCA.
Both the quantitative and the qualitative variables are standardized during the analysis, which balances the influence of each group of variables.
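The two steps above can be sketched in base R. This is only an illustration of the encoding idea, not FactoMineR's implementation: it assumes the commonly described FAMD encoding, in which quantitative variables are scaled to unit variance and each 0/1 indicator column is divided by the square root of its category proportion. The data frame and its column names are made up for the example.

```r
# Hypothetical toy data with one qualitative and two quantitative variables
toy <- data.frame(
  soil    = factor(c("A", "B", "A", "C", "B", "C", "A", "B")),
  acidity = c(2.1, 2.4, 2.0, 3.1, 2.6, 3.0, 2.2, 2.5),
  alcohol = c(2.5, 2.7, 2.6, 2.4, 2.9, 2.3, 2.8, 2.6)
)

# Step 1a: centre and scale the quantitative variables to unit variance
quanti <- scale(toy[, c("acidity", "alcohol")])

# Step 1b: build 0/1 indicator columns and divide each one by the square
# root of the proportion of observations in that category (assumed encoding)
indicator <- model.matrix(~ soil - 1, data = toy)
props     <- colMeans(indicator)
quali     <- sweep(indicator, 2, sqrt(props), "/")

# Step 2: ordinary PCA on the combined, centred table
encoded  <- cbind(quanti, quali)
toy_famd <- prcomp(encoded, center = TRUE, scale. = FALSE)
round(toy_famd$sdev^2, 3)
```

Because `scale()` and `prcomp()` divide by n - 1 while other software may divide by n, the variances printed here should not be expected to match the eigenvalues reported by `FAMD()` exactly.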
Example of Factor Analysis of Mixed Data (FAMD) in R
Using the FAMD() function from the FactoMineR package, we can run this analysis in R and see how it works.
To illustrate, we will use a subset of the wine data set that ships with the FactoMineR package:
data(wine)
wine_data <- wine[, c(1, 2, 13, 22, 24, 28, 30)]
head(wine_data)

          Label      Soil Fruity Acidity Alcohol Intensity Overall.quality
2EL      Saumur      Env1  2.885   2.107   2.500     2.857           3.393
1CHA     Saumur      Env1  2.560   2.107   2.654     2.893           3.214
1FON Bourgueuil      Env1  2.769   2.179   2.643     3.074           3.536
1VAU     Chinon      Env2  2.391   3.179   2.500     2.462           2.464
1DAM     Saumur Reference  3.160   2.571   2.786     3.643           3.741
2BOU Bourgueuil Reference  2.800   2.393   2.857     3.464           3.643
Our data set consists of 21 rows and 7 columns. The first two columns (Label and Soil) are categorical variables, and the remaining five columns are numerical. Calling str(wine_data) shows this structure:
'data.frame':	21 obs. of  7 variables:
 $ Label          : Factor w/ 3 levels "Saumur","Bourgueuil",..: 1 1 2 3 1 2 2 1 3 1 ...
 $ Soil           : Factor w/ 4 levels "Reference","Env1",..: 2 2 2 3 1 1 1 2 2 3 ...
 $ Fruity         : num  2.88 2.56 2.77 2.39 3.16 ...
 $ Acidity        : num  2.11 2.11 2.18 3.18 2.57 ...
 $ Alcohol        : num  2.5 2.65 2.64 2.5 2.79 ...
 $ Intensity      : num  2.86 2.89 3.07 2.46 3.64 ...
 $ Overall.quality: num  3.39 3.21 3.54 2.46 3.74 ...
Now that we have our data frame, we can perform the Factorial Analysis of Mixed Data (FAMD):
wine_famd <- FAMD(wine_data, graph = FALSE)
wine_famd

*The results are available in the following objects:

  name          description
1 "$eig"        "eigenvalues and inertia"
2 "$var"        "Results for the variables"
3 "$ind"        "results for the individuals"
4 "$quali.var"  "Results for the qualitative variables"
5 "$quanti.var" "Results for the quantitative variables"
We set the graph argument to FALSE here. If we change it to TRUE, FAMD() also draws the factor map for the individuals, along with the graphs of the variables, the categories, and the quantitative variables.
Using the fviz_famd_ind() function from the factoextra package, we can draw the individuals’ factor map and colour each individual by its cos2 (squared cosine) value, which measures how well that individual is represented on the map.
fviz_famd_ind(wine_famd,
              col.ind = "cos2",
              gradient.cols = c("blue", "orange", "red"),
              repel = TRUE)
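Besides the plot, the components listed in the printed result can be inspected directly. The sketch below repeats the steps from this section so that it runs on its own, then looks at `$eig` and at the individuals' coordinates stored under `$ind`; it assumes the FactoMineR package is installed.

```r
library(FactoMineR)

# Rebuild the FAMD result from this section
data(wine)
wine_data <- wine[, c(1, 2, 13, 22, 24, 28, 30)]
wine_famd <- FAMD(wine_data, graph = FALSE)

# Eigenvalues and share of inertia for each dimension
wine_famd$eig

# Coordinates of the first few wines on the retained dimensions
head(wine_famd$ind$coord)
```

The cumulative percentage in `$eig` is a quick way to judge how many dimensions are worth interpreting.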
Multiple Correspondence Analysis (MCA): An Alternative to PCA for Categorical Variables
Another alternative to Principal Component Analysis for reducing the dimensions of a data set with categorical variables is Multiple Correspondence Analysis (MCA).
MCA is a well-known method for dimension reduction of categorical data.
It is especially practical when the data set consists of categorical variables. It helps to identify groups of individuals with similar profiles, as well as the relationships between the categories of the variables.
Example of Multiple Correspondence Analysis (MCA) in R
The MCA() function from the FactoMineR package can be used in R to implement the Multiple Correspondence Analysis.
We will use a subset of the Arthritis data set from the vcd package for this example:
data(Arthritis)
arthritis_data <- Arthritis[, c(2, 3, 5)]
head(arthritis_data)

  Treatment  Sex Improved
1   Treated Male     Some
2   Treated Male     None
3   Treated Male     None
4   Treated Male   Marked
5   Treated Male   Marked
6   Treated Male   Marked
We can now pass our data frame to the MCA() function:
arthritis_mca <- MCA(arthritis_data, ncp = 3, graph = FALSE)
arthritis_mca

**Results of the Multiple Correspondence Analysis (MCA)**
The analysis was performed on 84 individuals, described by 3 variables
*The results are available in the following objects:

   name              description
1  "$eig"            "eigenvalues"
2  "$var"            "results for the variables"
3  "$var$coord"      "coord. of the categories"
4  "$var$cos2"       "cos2 for the categories"
5  "$var$contrib"    "contributions of the categories"
6  "$var$v.test"     "v-test for the categories"
7  "$ind"            "results for the individuals"
8  "$ind$coord"      "coord. for the individuals"
9  "$ind$cos2"       "cos2 for the individuals"
10 "$ind$contrib"    "contributions of the individuals"
11 "$call"           "intermediate results"
12 "$call$marge.col" "weights of columns"
13 "$call$marge.li"  "weights of rows"
The ncp argument sets the number of dimensions we wish to keep in the results. In this instance, we have kept three dimensions.
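To see what the choice of ncp means in practice, the objects named in the printed result can be examined. The following sketch repeats the steps from this section so that it runs on its own, and assumes the FactoMineR and vcd packages are installed:

```r
library(FactoMineR)
library(vcd)

# Rebuild the MCA result from this section
data(Arthritis)
arthritis_data <- Arthritis[, c(2, 3, 5)]
arthritis_mca <- MCA(arthritis_data, ncp = 3, graph = FALSE)

# Eigenvalues and share of variance for each dimension
arthritis_mca$eig

# One row per individual, one column per retained dimension
dim(arthritis_mca$ind$coord)

# Contribution of each category to the retained dimensions
arthritis_mca$var$contrib
```

Comparing the cumulative percentage in `$eig` with the number of retained dimensions is one way to judge whether ncp was set sensibly.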
As with the FAMD() function, we can set the graph argument to TRUE if we want to see the MCA factor maps for the variables and the individuals, as well as the representation of the variables.
The biplot of individuals and variables, drawn with the fviz_mca_biplot() function from the factoextra package, is shown below.
fviz_mca_biplot(arthritis_mca, repel = TRUE, ggtheme = theme_minimal())
As demonstrated, we can use FAMD instead of PCA to reduce the dimensions of a data set that includes both categorical and numeric data.
If the data set contains only categorical variables, MCA is the better choice.