How to Label Outliers in Boxplots in ggplot2?
This article walks through a detailed example of how to label outliers in ggplot2 boxplots.
Step 1: Create the Data Frame
First, create a data frame that holds the points scored by 60 basketball players spread across three teams (20 players per team).
Set a seed so that the example is reproducible.
set.seed(123)
Now we can create a data frame
df <- data.frame(team=rep(c('A', 'B', 'C'), each=20), player=rep(LETTERS[1:20], times=3), points=round(rnorm(n=60, mean=30, sd=10), 2))
Let’s view the head of the data frame
head(df)
  team player points
1    A      A  37.84
2    A      B  42.60
3    A      C  40.96
4    A      D   5.78
5    A      E  37.65
6    A      F  24.98
Step 2: Define a Function to Identify Outliers
ggplot2 treats an observation as an outlier if it meets either of the following two criteria:
The observation lies more than 1.5 times the interquartile range (IQR) below the first quartile (Q1).
The observation lies more than 1.5 times the interquartile range (IQR) above the third quartile (Q3).
We can write the following function in R that flags an observation as an outlier if it meets either of these criteria.
findoutlier <- function(x) { return(x < quantile(x, .25) - 1.5*IQR(x) | x > quantile(x, .75) + 1.5*IQR(x)) }
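To see what the function returns, it can help to run it on a small vector first. The sketch below uses a made-up vector x (not part of the basketball data) and prints the two cut-offs alongside the result.

```r
# Same rule as above: flag values more than 1.5 * IQR beyond Q1 or Q3
findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# Hypothetical toy data with one obviously large value
x <- c(10, 12, 11, 13, 12, 40)

# Lower and upper cut-offs used by the rule
lower <- quantile(x, .25) - 1.5 * IQR(x)  # Q1 = 11.25, IQR = 1.5, so 9
upper <- quantile(x, .75) + 1.5 * IQR(x)  # Q3 = 12.75, so 15

print(c(lower = unname(lower), upper = unname(upper)))

# Logical vector: TRUE marks an outlier
print(findoutlier(x))

# Position of the flagged value (only the 40 lies outside 9..15)
print(which(findoutlier(x)))
```

The function returns one TRUE/FALSE per element, which is what makes it convenient to use inside mutate() later on.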
Step 3: Label Outliers in Boxplots in ggplot2
The next step is to use the code below to label outliers in ggplot2 boxplots:
library(ggplot2)
library(dplyr)
Add a new column to the data frame that holds the points value when an observation is an outlier within its team, and NA otherwise.
df <- df %>% group_by(team) %>% mutate(outlier = ifelse(findoutlier(points), points, NA))
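Note that group_by(team) means the quartiles are computed separately for each team, which matches how the boxplot draws one box per team. As a rough cross-check, the same flags can be computed in base R. This is only a sketch: the helper name is_out is made up for illustration, and the data frame is rebuilt here so the snippet runs on its own.

```r
# Rebuild the example data so this snippet is self-contained
set.seed(123)
df <- data.frame(team = rep(c("A", "B", "C"), each = 20),
                 player = rep(LETTERS[1:20], times = 3),
                 points = round(rnorm(n = 60, mean = 30, sd = 10), 2))

findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# ave() applies the function within each team; because df$points is numeric,
# the TRUE/FALSE results come back as 1/0, so convert them to logical
is_out <- as.logical(ave(df$points, df$team, FUN = findoutlier))

# Same idea as the mutate() call: keep the value for outliers, NA otherwise
df$outlier <- ifelse(is_out, df$points, NA)

# Show only the rows flagged as outliers (there may be none for some teams)
print(df[is_out, ])
```

If the rows printed here agree with the labels on the plot, the grouping is doing what we expect.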
Now we can create a box plot of points by team and label outliers
ggplot(df, aes(x=team, y=points)) + geom_boxplot() + geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)
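An alternative to storing NA for the non-outliers is to hand geom_text() a data frame that contains only the outlier rows. The sketch below assumes ggplot2 and dplyr are installed; the name outliers_only is made up for illustration, and the data are rebuilt so the snippet runs on its own.

```r
library(ggplot2)
library(dplyr)

# Rebuild the example data so this snippet is self-contained
set.seed(123)
df <- data.frame(team = rep(c("A", "B", "C"), each = 20),
                 player = rep(LETTERS[1:20], times = 3),
                 points = round(rnorm(n = 60, mean = 30, sd = 10), 2))

findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# Keep only the rows that are outliers within their own team
outliers_only <- df %>%
  group_by(team) %>%
  filter(findoutlier(points)) %>%
  ungroup()

# The boxplot uses the full data; the text layer uses only the outlier rows
p <- ggplot(df, aes(x = team, y = points)) +
  geom_boxplot() +
  geom_text(data = outliers_only, aes(label = points), hjust = -.5)

print(p)
```

With this approach na.rm=TRUE is no longer needed, because the text layer never sees a missing label.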
Note that we can also label these outliers using a different variable.
For example, to label the outliers with the player name instead, we can swap points for player in the ifelse() call inside mutate().
library(ggplot2)
library(dplyr)
df <- df %>% group_by(team) %>% mutate(outlier = ifelse(findoutlier(points), player, NA))
Then create the box plot of points by team again and label the outliers:
ggplot(df, aes(x=team, y=points)) + geom_boxplot() + geom_text(aes(label=outlier), na.rm=TRUE, hjust=-.5)
The outlier on team A is now labeled N and the outlier on team B is now labeled D, since these are the names of the players whose points values are outliers.
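The two kinds of label can also be combined, so that each outlier shows both the player name and the points value. This is just one possible variation, sketched under the assumption that ggplot2 and dplyr are installed; the label format "name (points)" is an arbitrary choice.

```r
library(ggplot2)
library(dplyr)

# Rebuild the example data so this snippet is self-contained
set.seed(123)
df <- data.frame(team = rep(c("A", "B", "C"), each = 20),
                 player = rep(LETTERS[1:20], times = 3),
                 points = round(rnorm(n = 60, mean = 30, sd = 10), 2))

findoutlier <- function(x) {
  x < quantile(x, .25) - 1.5 * IQR(x) | x > quantile(x, .75) + 1.5 * IQR(x)
}

# Build a text label such as "N (55.1)" for outliers, NA for everything else
df <- df %>%
  group_by(team) %>%
  mutate(outlier = if_else(findoutlier(points),
                           paste0(player, " (", points, ")"),
                           NA_character_)) %>%
  ungroup()

p <- ggplot(df, aes(x = team, y = points)) +
  geom_boxplot() +
  geom_text(aes(label = outlier), na.rm = TRUE, hjust = -.2)

print(p)
```

Longer labels take more room, so the hjust value may need some adjusting for your own plot size.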