Stratified Sampling in R With Examples
Researchers frequently take samples from a population and use the data from the sample to make generalizations about the entire population.
A common sampling approach is stratified random sampling, which divides a population into groups (strata) and then draws a random sample of members from each group.
This article shows you how to perform stratified random sampling in R.
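Before the dplyr walkthrough, the idea can be sketched in base R. This is a minimal illustration under stated assumptions: the toy data frame population, its columns id and group, and the sample size of 5 per group are all hypothetical and not part of the employee example.

```r
set.seed(42)

# Hypothetical toy population: 3 groups of 20 members each
population <- data.frame(
  id = 1:60,
  group = rep(c("A", "B", "C"), each = 20)
)

# Split the row indices by group, then draw 5 indices at random within each group
rows_by_group <- split(seq_len(nrow(population)), population$group)
sampled_rows <- unlist(lapply(rows_by_group, function(rows) sample(rows, size = 5)))

# Keep only the sampled rows
stratified_base <- population[sampled_rows, ]

# Each group contributes exactly 5 rows
table(stratified_base$group)
```

Sampling inside each group, rather than from the whole data frame at once, is what guarantees that every group is represented.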
Approach: Stratified Sampling in R
A corporation has 400 employees who are either freshers, juniors, mid-level employees, or senior employees.
Let’s say we want to obtain a stratified sample of 40 employees, with 10 employees from each level represented.
The following code creates an example data frame of 400 employees, with 100 employees at each level.
Calling set.seed() first makes the example reproducible.
set.seed(1)
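To see what the seed does, here is a small sketch: resetting the same seed before each draw gives identical random numbers. The names first_draw and second_draw are chosen only for this illustration.

```r
# Reset the seed before each draw so both draws start from the same state
set.seed(1)
first_draw <- rnorm(3)

set.seed(1)
second_draw <- rnorm(3)

# TRUE: the two draws are identical because the seed was reset in between
identical(first_draw, second_draw)
```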
Now let’s create a data frame
data <- data.frame(
  Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
  Score = rnorm(400, mean = 45, sd = 2.2)
)
View the first six rows of the data frame.
head(data)
     Level    Score
1 freshers 46.81129
2 freshers 45.61885
3 freshers 47.13777
4 freshers 45.54551
5 freshers 45.06891
6 freshers 45.68639
The following code uses the group_by() and sample_n() functions from the dplyr package to draw a stratified random sample of 40 employees, 10 from each Level.
library(dplyr)
Draw a stratified sample from the data frame.
stratified <- data %>% group_by(Level) %>% sample_n(size=10)
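In dplyr 1.0.0 and later, sample_n() is marked as superseded in favor of slice_sample(). The sketch below shows what should be an equivalent call, assuming a recent dplyr version is installed. It rebuilds the example data under the name employees so that it runs on its own; employees and stratified_alt are names introduced only for this illustration.

```r
library(dplyr)

set.seed(1)

# Rebuild the example data so this block is self-contained
employees <- data.frame(
  Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
  Score = rnorm(400, mean = 45, sd = 2.2)
)

# On a grouped data frame, slice_sample() draws n rows from each group
stratified_alt <- employees %>%
  group_by(Level) %>%
  slice_sample(n = 10) %>%
  ungroup()

# 4 levels x 10 rows each = 40 rows
nrow(stratified_alt)
```

Calling ungroup() at the end is optional, but it avoids surprises if later steps should operate on the whole sample rather than per Level.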
Count the number of sampled employees from each Level.
table(stratified$Level)
Each of the four levels appears exactly 10 times in the sample, for a total of 40 employees.
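As an alternative to table(), dplyr's count() returns the per-level frequencies as a data frame, which can be easier to reuse in later steps. This is a self-contained sketch; the names employees, sample_40 and level_counts are assumptions made for this illustration.

```r
library(dplyr)

set.seed(1)

# Rebuild the example data so this block is self-contained
employees <- data.frame(
  Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
  Score = rnorm(400, mean = 45, sd = 2.2)
)

# Draw 10 employees per Level, as in the walkthrough
sample_40 <- employees %>%
  group_by(Level) %>%
  sample_n(size = 10) %>%
  ungroup()

# count() returns one row per Level, with the frequency in column n
level_counts <- sample_40 %>% count(Level)
level_counts
```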
Conclusions
In this article, we covered stratified random sampling, a sampling technique every data scientist should know.
In machine learning, a well-drawn sample can make a real difference: it lets us work with less data while still representing every subgroup of the population.