Stratified Sampling in R With Examples

by finnstats

Researchers frequently take samples from a population and use the data from the sample to make generalizations about the entire population.

A common sampling approach is stratified random sampling, which divides a population into groups (strata) and randomly selects a set number of members from each group to include in the sample.

This article shows you how to use R to achieve stratified random sampling.

Approach: Stratified Sampling in R

A corporation has 400 employees who are either freshers, juniors, mid-level employees, or senior employees.

Let’s say we want to obtain a stratified sample of 40 employees, with 10 employees from each level represented.

The following code shows how to create a sample data frame of 400 employees.

Calling set.seed() first makes the example reproducible.

set.seed(1)

Now let’s create the data frame.

data <- data.frame(Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
                   Score = rnorm(400, mean = 45, sd = 2.2))

View the first six rows of the data frame.

head(data)

     Level    Score
1 freshers 46.81129
2 freshers 45.61885
3 freshers 47.13777
4 freshers 45.54551
5 freshers 45.06891
6 freshers 45.68639
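Before sampling, it can help to confirm that every stratum has the expected size. The short check below is a sketch that rebuilds the same data frame so it runs on its own; the name strata_sizes is only illustrative and does not appear in the original example.

```r
# Recreate the example data (same call as above) so this snippet is self-contained.
set.seed(1)
data <- data.frame(Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
                   Score = rnorm(400, mean = 45, sd = 2.2))

# Count the employees in each stratum; every level should contain 100 rows.
strata_sizes <- table(data$Level)
print(strata_sizes)
```

If any level had fewer rows than the number to be drawn from it, sampling without replacement from that level would fail, so this check is worth running on real data.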

The following code demonstrates how to use the dplyr package’s group_by() and sample_n() functions to draw a stratified random sample of 40 employees, with 10 employees from each Level.

library(dplyr)

Draw a stratified sample from the data frame.

stratified <- data %>%
  group_by(Level) %>%
  sample_n(size=10)
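In dplyr 1.0.0 and later, sample_n() is marked as superseded in favour of slice_sample(). The sketch below shows one way the same stratified draw could be written with the newer function, assuming dplyr 1.0.0 or newer is installed; the names stratified_alt and level_counts are illustrative only.

```r
library(dplyr)

# Recreate the example data so this snippet is self-contained.
set.seed(1)
data <- data.frame(Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
                   Score = rnorm(400, mean = 45, sd = 2.2))

# slice_sample() respects grouping, so n = 10 draws 10 rows from each Level.
stratified_alt <- data %>%
  group_by(Level) %>%
  slice_sample(n = 10) %>%
  ungroup()

# Count the sampled rows per level.
level_counts <- count(stratified_alt, Level)
print(level_counts)
```

Calling ungroup() at the end returns an ordinary, ungrouped tibble, which avoids surprises if the sample is summarised later.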

Check how many employees were sampled from each Level.

table(stratified$Level)

Each of the four levels should appear exactly 10 times, for a total of 40 employees in the sample.
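The same stratified draw can also be sketched in base R, without dplyr, which makes the mechanics explicit: split the row indices by group, then sample within each group. This is a minimal illustration under the same assumptions as the example above (four levels of 100 employees each); rows_by_level, sampled_rows, and stratified_base are hypothetical names.

```r
# Recreate the example data so this snippet is self-contained.
set.seed(1)
data <- data.frame(Level = rep(c("freshers", "juniors", "mid-level", "Senior"), each = 100),
                   Score = rnorm(400, mean = 45, sd = 2.2))

# Split the row indices into one vector per Level.
rows_by_level <- split(seq_len(nrow(data)), data$Level)

# Draw 10 row indices at random (without replacement) from each Level.
# Note: sample() treats a single number as the range 1:n, so this form
# assumes every level has more than one row, as is the case here.
sampled_rows <- unlist(lapply(rows_by_level, function(idx) sample(idx, size = 10)),
                       use.names = FALSE)

# Keep only the sampled rows.
stratified_base <- data[sampled_rows, ]
print(table(stratified_base$Level))
```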

Conclusions

In this article, we covered stratified random sampling, one of the most useful sampling techniques for a data scientist to know.

Remember that in machine learning, a well-constructed sample can make all the difference, because it lets us work with less data while still representing every group in the population.