pling! Sampling!

R
Author

Burak Demirtas

Published

June 1, 2023

Image Source : https://www.pexels.com/photo/colony-of-emperor-penguins-on-land-4147990/

A silly headline? Maybe! But it has some clues about the topic!

If the population is Sampling! one sample from this population would be pling! right?

This whole article is all about this and how to do the sampling in R.

What I will not explain: 😒

Needed Libraries

Hide / show the code
if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
Hide / show the code
p_load(pacman, 
       kernlab, 
       tidyverse)

Some optional settings:

Hide / show the code
options(scipen = 999, # Who likes scientifi notations anyway?!
        digits = 3)   # More than 3 digits will be needed when I'm going to MARS

Simple Random Sampling

To make a basic slicing, you just need a variation of function slice() as slice_sample().

Let’s use spam dataset of HP Labs1:

Hide / show the code
data(spam)
head(spam)
  make address  all num3d  our over remove internet order mail receive will
1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00 0.64
2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21 0.79
3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38 0.45
4 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31
5 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31
6 0.00    0.00 0.00     0 1.85 0.00   0.00     1.85  0.00 0.00    0.00 0.00
  people report addresses free business email  you credit your font num000
1   0.00   0.00      0.00 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00
2   0.65   0.21      0.14 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43
3   0.12   0.00      1.75 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16
4   0.31   0.00      0.00 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00
5   0.31   0.00      0.00 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00
6   0.00   0.00      0.00 0.00     0.00  0.00 0.00   0.00 0.00    0   0.00
  money hp hpl george num650 lab labs telnet num857 data num415 num85
1  0.00  0   0      0      0   0    0      0      0    0      0     0
2  0.43  0   0      0      0   0    0      0      0    0      0     0
3  0.06  0   0      0      0   0    0      0      0    0      0     0
4  0.00  0   0      0      0   0    0      0      0    0      0     0
5  0.00  0   0      0      0   0    0      0      0    0      0     0
6  0.00  0   0      0      0   0    0      0      0    0      0     0
  technology num1999 parts pm direct cs meeting original project   re  edu
1          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
4          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
5          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
6          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket
1     0          0          0.00            0.000                 0
2     0          0          0.00            0.132                 0
3     0          0          0.01            0.143                 0
4     0          0          0.00            0.137                 0
5     0          0          0.00            0.135                 0
6     0          0          0.00            0.223                 0
  charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
1           0.778      0.000    0.000       3.76          61          278 spam
2           0.372      0.180    0.048       5.11         101         1028 spam
3           0.276      0.184    0.010       9.82         485         2259 spam
4           0.137      0.000    0.000       3.54          40          191 spam
5           0.135      0.000    0.000       3.54          40          191 spam
6           0.000      0.000    0.000       3.00          15           54 spam

If we would want to take randomly just 200 emails (!) from this dataset, how many of them would be spam? Let’s randomly select 200 emails from spam dataset using slice_sample function:

Hide / show the code
set.seed(42) # let's set the random seed for a cool value

random200 <-
  spam %>%
    select(type) %>% # lets not bother with other variables for now
    slice_sample(n = 200) # gets random 400 records

glimpse(random200)
Rows: 200
Columns: 1
$ type <fct> nonspam, nonspam, nonspam, spam, spam, spam, nonspam, nonspam, sp…

How is the ratio of spam / non-spam ? Let’s have a look:

Hide / show the code
table_of_counts <- table(random200$type)
table_of_counts # shows the counts

nonspam    spam 
    128      72 
Hide / show the code
prop.table(table_of_counts) # shows the proportion

nonspam    spam 
   0.64    0.36 

%36 of the randomly taken emails are spams! huge amount! What if we would check the whole data?

Hide / show the code
table_of_counts_all <- table(spam$type)
table_of_counts_all # shows the counts

nonspam    spam 
   2788    1813 
Hide / show the code
prop.table(table_of_counts_all) # shows the proportion

nonspam    spam 
  0.606   0.394 

Surprisingly (maybe not surprisingly at all) ,our 200 observations has the similar spam/non-spam ratio with the all dataset!🥸

An alternative way also exists!

What if we don’t have the data frame but just only a vector! Then good old sample() comes to help!

Hide / show the code
spam_or_not_spam <- spam$type # Creating a vector of spam or not spam column

random200_of_vector <-
  sample(spam_or_not_spam, size = 200) # This time it's not "n" but "size"

table_of_counts <- table(random200_of_vector)
table_of_counts # shows the counts
random200_of_vector
nonspam    spam 
    111      89 
Hide / show the code
prop.table(table_of_counts) # shows the proportion
random200_of_vector
nonspam    spam 
  0.555   0.445 

As a result, just by writing couple of lines of code, we could have get random samples from our data and we saw that , at least in terms of spam/non spam ratio, our sample is nearly perfectly representing the all population data!

What we did is simple random sampling and works like a game of bingo. You start with your population of all numbers, and randomly pick them out one at a time until you have a winner or in our case, until you reach the defined sample size!

While using this technique, sometimes we can end up with 2 numbers in order like 6 and 7 being drawn in order, and sometimes we might end up with big gaps between numbers. If the game has numbers between 1 to 99, in first 15 draws, there could be no number over 50. We have no control on sample picking.

Systematic Sampling

Another alternative to what we did is systematic sampling. This samples the population at regular intervals. Which means, for example, let’s say we put all the numbers to a table. Instead of blindly drawing them from a bag, now we are picking every fifth number on the table. In this way, at least, we are being sure of each section of the table area is represented equally in the sample!

To demonstrate systematic sampling, we can add one column called rowid with rowid_to_column function:

Hide / show the code
# Set the seed
set.seed(42)

spam2 <- spam %>% 
  # Add a row ID column
  rowid_to_column 

# View
head(spam2)
  rowid make address  all num3d  our over remove internet order mail receive
1     1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00
2     2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21
3     3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38
4     4 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31
5     5 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31
6     6 0.00    0.00 0.00     0 1.85 0.00   0.00     1.85  0.00 0.00    0.00
  will people report addresses free business email  you credit your font num000
1 0.64   0.00   0.00      0.00 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00
2 0.79   0.65   0.21      0.14 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43
3 0.45   0.12   0.00      1.75 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16
4 0.31   0.31   0.00      0.00 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00
5 0.31   0.31   0.00      0.00 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00
6 0.00   0.00   0.00      0.00 0.00     0.00  0.00 0.00   0.00 0.00    0   0.00
  money hp hpl george num650 lab labs telnet num857 data num415 num85
1  0.00  0   0      0      0   0    0      0      0    0      0     0
2  0.43  0   0      0      0   0    0      0      0    0      0     0
3  0.06  0   0      0      0   0    0      0      0    0      0     0
4  0.00  0   0      0      0   0    0      0      0    0      0     0
5  0.00  0   0      0      0   0    0      0      0    0      0     0
6  0.00  0   0      0      0   0    0      0      0    0      0     0
  technology num1999 parts pm direct cs meeting original project   re  edu
1          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
4          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
5          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
6          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket
1     0          0          0.00            0.000                 0
2     0          0          0.00            0.132                 0
3     0          0          0.01            0.143                 0
4     0          0          0.00            0.137                 0
5     0          0          0.00            0.135                 0
6     0          0          0.00            0.223                 0
  charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
1           0.778      0.000    0.000       3.76          61          278 spam
2           0.372      0.180    0.048       5.11         101         1028 spam
3           0.276      0.184    0.010       9.82         485         2259 spam
4           0.137      0.000    0.000       3.54          40          191 spam
5           0.135      0.000    0.000       3.54          40          191 spam
6           0.000      0.000    0.000       3.00          15           54 spam
Hide / show the code
# Set the sample size to 200
sample_size <- 200

# Get the population size from spam
pop_size <- nrow(spam)

# Calculate the interval
interval <- pop_size %/% sample_size

interval
[1] 23

Using this interval in sampling:

Hide / show the code
# Get row indexes for the sample
row_indexes <- seq_len(sample_size) * interval

spam_samp_systemetic <- spam %>% 
  # Add a row ID column
  rowid_to_column() %>% 
  # Get 200 rows using systematic sampling
  slice(row_indexes)

# See the result
head(spam_samp_systemetic)
  rowid make address  all num3d  our over remove internet order mail receive
1    23 0.00    0.00 0.00     0 2.94 0.00   0.00     0.00  0.00 0.00    0.00
2    46 0.15    0.45 1.05     0 0.45 0.00   0.00     1.81  0.60 0.75    0.00
3    69 0.30    0.00 0.61     0 0.00 0.00   0.00     0.00  0.00 0.92    0.30
4    92 0.47    0.31 0.47     0 0.05 0.13   0.05     0.26  0.44 0.76    0.26
5   115 0.12    1.76 0.63     0 0.88 0.00   0.12     0.50  0.25 3.90    0.50
6   138 1.00    0.00 0.33     0 0.66 0.66   0.00     0.00  0.00 0.00    0.00
  will people report addresses free business email  you credit your font num000
1 0.00   0.00   0.00       0.0 2.94     0.00     0 0.00   0.00 0.00    0   0.00
2 0.90   0.30   0.00       0.3 0.00     0.00     0 4.07   0.00 1.51    0   0.00
3 0.92   0.30   0.30       0.0 2.15     0.61     0 5.53   0.00 1.23    0   0.00
4 0.97   0.58   1.26       0.0 0.26     0.44     0 3.25   0.00 1.50    0   1.05
5 0.88   0.12   0.00       0.0 0.25     0.12     0 2.90   0.25 1.38    0   1.13
6 0.33   0.66   0.00       0.0 0.66     0.66     0 2.33   0.00 0.33    0   1.66
  money hp hpl george num650 lab labs telnet num857 data num415 num85
1  0.00  0   0      0      0   0    0      0      0    0    0.0     0
2  0.30  0   0      0      0   0    0      0      0    0    0.0     0
3  0.30  0   0      0      0   0    0      0      0    0    0.3     0
4  0.34  0   0      0      0   0    0      0      0    0    0.0     0
5  0.12  0   0      0      0   0    0      0      0    0    0.0     0
6  0.33  0   0      0      0   0    0      0      0    0    0.0     0
  technology num1999 parts pm direct cs meeting original project   re edu table
1          0       0     0  0      0  0       0     0.00       0 0.00   0     0
2          0       0     0  0      0  0       0     0.00       0 0.15   0     0
3          0       0     0  0      0  0       0     0.00       0 0.30   0     0
4          0       0     0  0      0  0       0     0.00       0 0.02   0     0
5          0       0     0  0      0  0       0     0.12       0 0.00   0     0
6          0       0     0  0      0  0       0     0.00       0 0.00   0     0
  conference charSemicolon charRoundbracket charSquarebracket charExclamation
1          0         0.404            0.404             0.000           0.809
2          0         0.000            0.250             0.000           1.318
3          0         0.000            0.100             0.000           1.053
4          0         0.004            0.066             0.000           0.322
5          0         0.019            0.379             0.159           0.000
6          0         0.000            0.060             0.000           0.120
  charDollar charHash capitalAve capitalLong capitalTotal type
1      0.000    0.000       4.86          12           34 spam
2      0.068    0.000       5.30         130          774 spam
3      0.351    0.250       3.88          66          303 spam
4      0.764    0.159       6.10         193         3038 spam
5      0.119    0.000       4.16          38          507 spam
6      0.541    0.000       5.43          21          304 spam

Notice the rowid jumps with the interval we found!👍😎

To check if there’s a problem with systematic sampling, we need to consider whether the data is sorted or has a pattern in the row order. If it does, the sample may not represent the whole population well. But don’t worry, we can solve this problem by shuffling the rows. This basically makes systematic sampling the same as simple random sampling.

The trick in here is to use prop argument of the slice_sample() function. Normally, we use it to say like “take %10 of the data randomly” by assigning 0.1 to it. It will start randomly picking 1 by one until it reaches %10 of the population. What if we put 1 instead? It will mean “take %100 of the data randomly” which it will pick the data randomly until all the data is picked. Which basically means, data will be shuffled

This is the original data:

Hide / show the code
head(spam)
  make address  all num3d  our over remove internet order mail receive will
1 0.00    0.64 0.64     0 0.32 0.00   0.00     0.00  0.00 0.00    0.00 0.64
2 0.21    0.28 0.50     0 0.14 0.28   0.21     0.07  0.00 0.94    0.21 0.79
3 0.06    0.00 0.71     0 1.23 0.19   0.19     0.12  0.64 0.25    0.38 0.45
4 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31
5 0.00    0.00 0.00     0 0.63 0.00   0.31     0.63  0.31 0.63    0.31 0.31
6 0.00    0.00 0.00     0 1.85 0.00   0.00     1.85  0.00 0.00    0.00 0.00
  people report addresses free business email  you credit your font num000
1   0.00   0.00      0.00 0.32     0.00  1.29 1.93   0.00 0.96    0   0.00
2   0.65   0.21      0.14 0.14     0.07  0.28 3.47   0.00 1.59    0   0.43
3   0.12   0.00      1.75 0.06     0.06  1.03 1.36   0.32 0.51    0   1.16
4   0.31   0.00      0.00 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00
5   0.31   0.00      0.00 0.31     0.00  0.00 3.18   0.00 0.31    0   0.00
6   0.00   0.00      0.00 0.00     0.00  0.00 0.00   0.00 0.00    0   0.00
  money hp hpl george num650 lab labs telnet num857 data num415 num85
1  0.00  0   0      0      0   0    0      0      0    0      0     0
2  0.43  0   0      0      0   0    0      0      0    0      0     0
3  0.06  0   0      0      0   0    0      0      0    0      0     0
4  0.00  0   0      0      0   0    0      0      0    0      0     0
5  0.00  0   0      0      0   0    0      0      0    0      0     0
6  0.00  0   0      0      0   0    0      0      0    0      0     0
  technology num1999 parts pm direct cs meeting original project   re  edu
1          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
2          0    0.07     0  0   0.00  0       0     0.00       0 0.00 0.00
3          0    0.00     0  0   0.06  0       0     0.12       0 0.06 0.06
4          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
5          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
6          0    0.00     0  0   0.00  0       0     0.00       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket
1     0          0          0.00            0.000                 0
2     0          0          0.00            0.132                 0
3     0          0          0.01            0.143                 0
4     0          0          0.00            0.137                 0
5     0          0          0.00            0.135                 0
6     0          0          0.00            0.223                 0
  charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
1           0.778      0.000    0.000       3.76          61          278 spam
2           0.372      0.180    0.048       5.11         101         1028 spam
3           0.276      0.184    0.010       9.82         485         2259 spam
4           0.137      0.000    0.000       3.54          40          191 spam
5           0.135      0.000    0.000       3.54          40          191 spam
6           0.000      0.000    0.000       3.00          15           54 spam

Let’s shuffle the rows:

Hide / show the code
# Shuffle the rows of attrition_pop then add row IDs
spam_shuffled <- spam %>%
    slice_sample(prop = 1)

head(spam_shuffled)
     make address  all num3d  our over remove internet order mail receive will
2609    0       0 0.00     0 0.00    0   0.00        0   0.0 0.00       0 2.85
4069    0       0 0.00     0 0.00    0   0.00        0   0.0 0.00       0 0.00
2369    0       0 0.00     0 0.77    0   0.00        0   0.0 0.00       0 0.00
1098    0       0 0.40     0 0.00    0   0.20        0   0.8 0.20       0 0.40
1252    0       0 0.72     0 0.72    0   0.72        0   0.0 0.00       0 0.00
634     0       0 0.00     0 0.00    0   0.00        0   0.0 1.49       0 1.49
     people report addresses free business email  you credit your font num000
2609   0.00   0.00       0.0  0.0        0     0 0.00   0.00 0.00    0      0
4069   0.00   0.00       0.0  0.0        0     0 0.00   0.00 0.00    0      0
2369   0.00   0.00       0.0  0.0        0     0 1.55   0.00 0.77    0      0
1098   0.00   1.41       0.2  0.4        0     0 3.44   3.03 2.22    0      0
1252   0.72   0.00       0.0  0.0        0     0 1.45   0.00 1.45    0      0
634    0.00   0.00       0.0  0.0        0     0 7.46   0.00 1.49    0      0
     money hp hpl george num650 lab labs telnet num857 data num415 num85
2609  0.00  0   0   0.00      0   0    0      0      0    0      0     0
4069  0.00  0   0   0.00      0   0    0      0      0    0      0     0
2369  0.77  0   0   0.77      0   0    0      0      0    0      0     0
1098  0.40  0   0   0.00      0   0    0      0      0    0      0     0
1252  0.72  0   0   0.00      0   0    0      0      0    0      0     0
634   0.00  0   0   0.00      0   0    0      0      0    0      0     0
     technology num1999 parts pm direct cs meeting original project   re  edu
2609          0       0     0  0      0  0       0        0       0 0.00 0.00
4069          0       0     0  0      0  0       0        0       0 0.00 4.00
2369          0       0     0  0      0  0       0        0       0 1.55 0.77
1098          0       0     0  0      0  0       0        0       0 0.00 0.00
1252          0       0     0  0      0  0       0        0       0 0.00 0.00
634           0       0     0  0      0  0       0        0       0 0.00 0.00
     table conference charSemicolon charRoundbracket charSquarebracket
2609     0          0         0.000            0.000                 0
4069     0          0         0.000            0.684                 0
2369     0          0         0.000            0.490                 0
1098     0          0         0.033            0.066                 0
1252     0          0         0.000            0.364                 0
634      0          0         0.000            0.000                 0
     charExclamation charDollar charHash capitalAve capitalLong capitalTotal
2609           0.480      0.000        0       2.00           7           26
4069           0.000      0.000        0       1.12           2            9
2369           0.196      0.000        0       3.16          10           79
1098           0.133      0.066        0       2.70          30          192
1252           0.729      0.121        0       7.78          32          249
634            0.000      0.000        0       2.69          15           35
        type
2609 nonspam
4069 nonspam
2369 nonspam
1098    spam
1252    spam
634     spam

Now let’s overwrite the rowids starting one and we will have shuffled order:

Hide / show the code
spam_shuffled <- spam_shuffled %>% 
  rowid_to_column()

head(spam_shuffled)
  rowid make address  all num3d  our over remove internet order mail receive
1     1    0       0 0.00     0 0.00    0   0.00        0   0.0 0.00       0
2     2    0       0 0.00     0 0.00    0   0.00        0   0.0 0.00       0
3     3    0       0 0.00     0 0.77    0   0.00        0   0.0 0.00       0
4     4    0       0 0.40     0 0.00    0   0.20        0   0.8 0.20       0
5     5    0       0 0.72     0 0.72    0   0.72        0   0.0 0.00       0
6     6    0       0 0.00     0 0.00    0   0.00        0   0.0 1.49       0
  will people report addresses free business email  you credit your font num000
1 2.85   0.00   0.00       0.0  0.0        0     0 0.00   0.00 0.00    0      0
2 0.00   0.00   0.00       0.0  0.0        0     0 0.00   0.00 0.00    0      0
3 0.00   0.00   0.00       0.0  0.0        0     0 1.55   0.00 0.77    0      0
4 0.40   0.00   1.41       0.2  0.4        0     0 3.44   3.03 2.22    0      0
5 0.00   0.72   0.00       0.0  0.0        0     0 1.45   0.00 1.45    0      0
6 1.49   0.00   0.00       0.0  0.0        0     0 7.46   0.00 1.49    0      0
  money hp hpl george num650 lab labs telnet num857 data num415 num85
1  0.00  0   0   0.00      0   0    0      0      0    0      0     0
2  0.00  0   0   0.00      0   0    0      0      0    0      0     0
3  0.77  0   0   0.77      0   0    0      0      0    0      0     0
4  0.40  0   0   0.00      0   0    0      0      0    0      0     0
5  0.72  0   0   0.00      0   0    0      0      0    0      0     0
6  0.00  0   0   0.00      0   0    0      0      0    0      0     0
  technology num1999 parts pm direct cs meeting original project   re  edu
1          0       0     0  0      0  0       0        0       0 0.00 0.00
2          0       0     0  0      0  0       0        0       0 0.00 4.00
3          0       0     0  0      0  0       0        0       0 1.55 0.77
4          0       0     0  0      0  0       0        0       0 0.00 0.00
5          0       0     0  0      0  0       0        0       0 0.00 0.00
6          0       0     0  0      0  0       0        0       0 0.00 0.00
  table conference charSemicolon charRoundbracket charSquarebracket
1     0          0         0.000            0.000                 0
2     0          0         0.000            0.684                 0
3     0          0         0.000            0.490                 0
4     0          0         0.033            0.066                 0
5     0          0         0.000            0.364                 0
6     0          0         0.000            0.000                 0
  charExclamation charDollar charHash capitalAve capitalLong capitalTotal
1           0.480      0.000        0       2.00           7           26
2           0.000      0.000        0       1.12           2            9
3           0.196      0.000        0       3.16          10           79
4           0.133      0.066        0       2.70          30          192
5           0.729      0.121        0       7.78          32          249
6           0.000      0.000        0       2.69          15           35
     type
1 nonspam
2 nonspam
3 nonspam
4    spam
5    spam
6    spam

Proportional stratified sampling

Before we discussed stratified sampling in this article. So if you want to know more about it, please have a look to it first.

What we didn’t discuss in there was doing it by proportion or by sample size.

  • We will use proportional stratified sampling on mtcars , grouped by cylinders of the cars
  • Then we will ungroup the stratified sample.

Let’s first see the distribution of the groups:

Hide / show the code
# From previous step
mtcars %>% 
  count(cyl, sort = TRUE) %>% 
  mutate(percent = 100 * n / sum(n))
  cyl  n percent
1   8 14    43.8
2   4 11    34.4
3   6  7    21.9

So, our sample should include similar ratio from these groups. Let’s see what happens:

Hide / show the code
# Use proportional stratified sampling to get 30% of each cylinder group
mtcars_stratified <- mtcars %>%  
      group_by(cyl) %>%  
      slice_sample(prop =0.3) %>%  
      ungroup()

mtcars_stratified
# A tibble: 9 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
2  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
3  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
4  19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
5  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
6  10.4     8 472     205  2.93  5.25  18.0     0     0     3     4
7  17.3     8 276.    180  3.07  3.73  17.6     0     0     3     3
8  16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
9  14.7     8 440     230  3.23  5.34  17.4     0     0     3     4

Let’s compare the groups in terms of ratios:

Hide / show the code
mtcars %>% 
  count(cyl, sort = TRUE) %>% 
  mutate(percent = 100 * n / sum(n))
  cyl  n percent
1   8 14    43.8
2   4 11    34.4
3   6  7    21.9
Hide / show the code
mtcars_stratified %>% 
  count(cyl, sort = TRUE) %>% 
  mutate(percent = 100 * n / sum(n))
# A tibble: 3 × 3
    cyl     n percent
  <dbl> <int>   <dbl>
1     8     4    44.4
2     4     3    33.3
3     6     2    22.2

As we see, each group represented in the similar proportion between population and sample.

Equal counts stratified sampling

Instead of this, no matter the numbers in the population, you can say like, ” I just want to see 3 cars from each cylinder type”. Then, you can use equal counts as below:

Hide / show the code
# From previous step
mtcars_equal <- mtcars %>%
  group_by(cyl) %>% 
  slice_sample(n = 3) %>%
  ungroup()

mtcars_equal
# A tibble: 9 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
3  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
4  21.4     6 258     110  3.08  3.22  19.4     1     0     3     1
5  19.7     6 145     175  3.62  2.77  15.5     0     1     5     6
6  21       6 160     110  3.9   2.62  16.5     0     1     4     4
7  14.3     8 360     245  3.21  3.57  15.8     0     0     3     4
8  14.7     8 440     230  3.23  5.34  17.4     0     0     3     4
9  18.7     8 360     175  3.15  3.44  17.0     0     0     3     2

Let’s check the counts:

Hide / show the code
mtcars_equal %>%
    count(cyl, sort = TRUE) %>%
    mutate(percent = 100 * n / sum(n))
# A tibble: 3 × 3
    cyl     n percent
  <dbl> <int>   <dbl>
1     4     3    33.3
2     6     3    33.3
3     8     3    33.3

Weighted Sampling

Stratified sampling gives you guidelines on how likely it is to select rows from your dataset within specific subgroups. A broader approach to this is weighted sampling, where you can set rules for the probability of selecting individual rows. In weighted sampling, the likelihood of choosing a particular row is directly tied to its weight value. So, the higher the weight, the greater the chances of picking that row.

Hide / show the code
# this code weights by the physical weight of the cars, so heavy cars are more likely to get
# selected
mtcars %>% slice_sample(weight_by = wt, n = 5)
                   mpg cyl disp  hp drat   wt qsec vs am gear carb
Merc 450SE        16.4   8  276 180 3.07 4.07 17.4  0  0    3    3
Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
Toyota Corona     21.5   4  120  97 3.70 2.46 20.0  1  0    3    1
Merc 450SLC       15.2   8  276 180 3.07 3.78 18.0  0  0    3    3
Pontiac Firebird  19.2   8  400 175 3.08 3.85 17.1  0  0    3    2

Cluster Sampling

Stratified sampling has a drawback that requires data collection from all subgroups within the population. The challenge arises when the data collection process is costly, such as when physical travel is needed to obtain the required information. This expense can make the analysis financially burdensome.

However, there is a more cost-effective option called cluster sampling. With cluster sampling, you can limit the number of subgroups involved by randomly selecting a smaller subset through simple random sampling. Within each selected subgroup, you can then perform simple random sampling as before. This approach provides comparable results while reducing the expenses associated with collecting data from every single subgroup.

Let’s see the cut sizes in diamonds dataset:

Hide / show the code
diamonds %>%
  distinct(cut)
# A tibble: 5 × 1
  cut      
  <ord>    
1 Ideal    
2 Premium  
3 Good     
4 Very Good
5 Fair     

Now randomly pick some cut categories:

Hide / show the code
# Create clusters based on diamond cut
clusters <- diamonds %>%
  distinct(cut) %>%
  sample_n(size = 3) %>%
  pull(cut)

clusters
[1] Good  Fair  Ideal
Levels: Fair < Good < Very Good < Premium < Ideal

Now we can use only these randomly picked subgroups and perform simple random sampling:

Hide / show the code
# Perform cluster sampling
cluster_sample <- diamonds %>%
  filter(cut %in% clusters) %>%
  slice_sample(n = 100, replace = FALSE)

head(cluster_sample)
# A tibble: 6 × 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.02 Fair  I     I1       65.2    56 10577  9.11  9.02  5.91
2  0.34 Ideal I     SI1      60.2    57   589  4.54  4.5   2.72
3  1.27 Ideal F     VS2      61      59  9547  6.96  6.98  4.25
4  0.54 Ideal F     VVS2     61      54  2231  5.28  5.3   3.23
5  0.43 Ideal D     VS2      60.7    57  1129  4.91  4.88  2.97
6  1.01 Ideal F     SI2      61.3    57  3777  6.4   6.35  3.91
Hide / show the code
summary(cluster_sample$cut)
     Fair      Good Very Good   Premium     Ideal 
        6        17         0         0        77 

As we see, only 3 subgroups exists and they are randomly picked.

Conclusion

When it comes to real life data analysis, usually data is not big, it’s huge! So you will have to make some sampling one wa or another. As you can see, there are different methods you can use but they all depends on what type of data you have and how it’s collected. In the end, main objective is being away as much as possible from any kind of bias! (Which is , believe or not, one of the hardest things…)

I hope you enjoyed!😉