Hide / show the code
if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
Hide / show the code
p_load(pacman,
kernlab, tidyverse)
Burak Demirtas
June 1, 2023
A silly headline? Maybe! But it has some clues about the topic!
If the population is Sampling!
one sample from this population would be pling!
right?
This whole article is all about this and how to do the sampling in R.
What I will not explain: 😒
Loading required package: pacman
Some optional settings:
To make a basic slicing, you just need a variation of function slice()
as slice_sample()
.
Let’s use spam dataset of HP Labs1:
make address all num3d our over remove internet order mail receive will
1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64
2 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79
3 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45
4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31
5 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31
6 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00
people report addresses free business email you credit your font num000
1 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00
2 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43
3 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16
4 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00
5 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0.00 0 0 0 0 0 0 0 0 0 0 0
2 0.43 0 0 0 0 0 0 0 0 0 0 0
3 0.06 0 0 0 0 0 0 0 0 0 0 0
4 0.00 0 0 0 0 0 0 0 0 0 0 0
5 0.00 0 0 0 0 0 0 0 0 0 0 0
6 0.00 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu
1 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
2 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00
3 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06
4 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
5 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
6 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
table conference charSemicolon charRoundbracket charSquarebracket
1 0 0 0.00 0.000 0
2 0 0 0.00 0.132 0
3 0 0 0.01 0.143 0
4 0 0 0.00 0.137 0
5 0 0 0.00 0.135 0
6 0 0 0.00 0.223 0
charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
1 0.778 0.000 0.000 3.76 61 278 spam
2 0.372 0.180 0.048 5.11 101 1028 spam
3 0.276 0.184 0.010 9.82 485 2259 spam
4 0.137 0.000 0.000 3.54 40 191 spam
5 0.135 0.000 0.000 3.54 40 191 spam
6 0.000 0.000 0.000 3.00 15 54 spam
If we would want to take randomly just 200 emails (!) from this dataset, how many of them would be spam? Let’s randomly select 200 emails from spam dataset using slice_sample
function:
Rows: 200
Columns: 1
$ type <fct> nonspam, nonspam, nonspam, spam, spam, spam, nonspam, nonspam, sp…
How is the ratio of spam / non-spam ? Let’s have a look:
nonspam spam
128 72
nonspam spam
0.64 0.36
%36 of the randomly taken emails are spams! huge amount! What if we would check the whole data?
nonspam spam
2788 1813
nonspam spam
0.606 0.394
Surprisingly (maybe not surprisingly at all) ,our 200 observations has the similar spam/non-spam ratio with the all dataset!🥸
What if we don’t have the data frame but just only a vector! Then good old sample()
comes to help!
random200_of_vector
nonspam spam
111 89
random200_of_vector
nonspam spam
0.555 0.445
As a result, just by writing couple of lines of code, we could have get random samples from our data and we saw that , at least in terms of spam/non spam ratio, our sample is nearly perfectly representing the all population data!
What we did is simple random sampling and works like a game of bingo. You start with your population of all numbers, and randomly pick them out one at a time until you have a winner or in our case, until you reach the defined sample size!
While using this technique, sometimes we can end up with 2 numbers in order like 6 and 7 being drawn in order, and sometimes we might end up with big gaps between numbers. If the game has numbers between 1 to 99, in first 15 draws, there could be no number over 50. We have no control on sample picking.
Another alternative to what we did is systematic sampling. This samples the population at regular intervals. Which means, for example, let’s say we put all the numbers to a table. Instead of blindly drawing them from a bag, now we are picking every fifth number on the table. In this way, at least, we are being sure of each section of the table area is represented equally in the sample!
To demonstrate systematic sampling, we can add one column called rowid with rowid_to_column
function:
rowid make address all num3d our over remove internet order mail receive
1 1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00
2 2 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21
3 3 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38
4 4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31
5 5 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31
6 6 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00
will people report addresses free business email you credit your font num000
1 0.64 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00
2 0.79 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43
3 0.45 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16
4 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00
5 0.31 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0.00 0 0 0 0 0 0 0 0 0 0 0
2 0.43 0 0 0 0 0 0 0 0 0 0 0
3 0.06 0 0 0 0 0 0 0 0 0 0 0
4 0.00 0 0 0 0 0 0 0 0 0 0 0
5 0.00 0 0 0 0 0 0 0 0 0 0 0
6 0.00 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu
1 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
2 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00
3 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06
4 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
5 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
6 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
table conference charSemicolon charRoundbracket charSquarebracket
1 0 0 0.00 0.000 0
2 0 0 0.00 0.132 0
3 0 0 0.01 0.143 0
4 0 0 0.00 0.137 0
5 0 0 0.00 0.135 0
6 0 0 0.00 0.223 0
charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
1 0.778 0.000 0.000 3.76 61 278 spam
2 0.372 0.180 0.048 5.11 101 1028 spam
3 0.276 0.184 0.010 9.82 485 2259 spam
4 0.137 0.000 0.000 3.54 40 191 spam
5 0.135 0.000 0.000 3.54 40 191 spam
6 0.000 0.000 0.000 3.00 15 54 spam
[1] 23
Using this interval in sampling:
rowid make address all num3d our over remove internet order mail receive
1 23 0.00 0.00 0.00 0 2.94 0.00 0.00 0.00 0.00 0.00 0.00
2 46 0.15 0.45 1.05 0 0.45 0.00 0.00 1.81 0.60 0.75 0.00
3 69 0.30 0.00 0.61 0 0.00 0.00 0.00 0.00 0.00 0.92 0.30
4 92 0.47 0.31 0.47 0 0.05 0.13 0.05 0.26 0.44 0.76 0.26
5 115 0.12 1.76 0.63 0 0.88 0.00 0.12 0.50 0.25 3.90 0.50
6 138 1.00 0.00 0.33 0 0.66 0.66 0.00 0.00 0.00 0.00 0.00
will people report addresses free business email you credit your font num000
1 0.00 0.00 0.00 0.0 2.94 0.00 0 0.00 0.00 0.00 0 0.00
2 0.90 0.30 0.00 0.3 0.00 0.00 0 4.07 0.00 1.51 0 0.00
3 0.92 0.30 0.30 0.0 2.15 0.61 0 5.53 0.00 1.23 0 0.00
4 0.97 0.58 1.26 0.0 0.26 0.44 0 3.25 0.00 1.50 0 1.05
5 0.88 0.12 0.00 0.0 0.25 0.12 0 2.90 0.25 1.38 0 1.13
6 0.33 0.66 0.00 0.0 0.66 0.66 0 2.33 0.00 0.33 0 1.66
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0.00 0 0 0 0 0 0 0 0 0 0.0 0
2 0.30 0 0 0 0 0 0 0 0 0 0.0 0
3 0.30 0 0 0 0 0 0 0 0 0 0.3 0
4 0.34 0 0 0 0 0 0 0 0 0 0.0 0
5 0.12 0 0 0 0 0 0 0 0 0 0.0 0
6 0.33 0 0 0 0 0 0 0 0 0 0.0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0.00 0 0.00 0 0
2 0 0 0 0 0 0 0 0.00 0 0.15 0 0
3 0 0 0 0 0 0 0 0.00 0 0.30 0 0
4 0 0 0 0 0 0 0 0.00 0 0.02 0 0
5 0 0 0 0 0 0 0 0.12 0 0.00 0 0
6 0 0 0 0 0 0 0 0.00 0 0.00 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0.404 0.404 0.000 0.809
2 0 0.000 0.250 0.000 1.318
3 0 0.000 0.100 0.000 1.053
4 0 0.004 0.066 0.000 0.322
5 0 0.019 0.379 0.159 0.000
6 0 0.000 0.060 0.000 0.120
charDollar charHash capitalAve capitalLong capitalTotal type
1 0.000 0.000 4.86 12 34 spam
2 0.068 0.000 5.30 130 774 spam
3 0.351 0.250 3.88 66 303 spam
4 0.764 0.159 6.10 193 3038 spam
5 0.119 0.000 4.16 38 507 spam
6 0.541 0.000 5.43 21 304 spam
Notice the rowid
jumps with the interval
we found!👍😎
To check if there’s a problem with systematic sampling, we need to consider whether the data is sorted or has a pattern in the row order. If it does, the sample may not represent the whole population well. But don’t worry, we can solve this problem by shuffling the rows. This basically makes systematic sampling the same as simple random sampling.
This is the original data:
make address all num3d our over remove internet order mail receive will
1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64
2 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79
3 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45
4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31
5 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31
6 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00
people report addresses free business email you credit your font num000
1 0.00 0.00 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00
2 0.65 0.21 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43
3 0.12 0.00 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16
4 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00
5 0.31 0.00 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0.00 0 0 0 0 0 0 0 0 0 0 0
2 0.43 0 0 0 0 0 0 0 0 0 0 0
3 0.06 0 0 0 0 0 0 0 0 0 0 0
4 0.00 0 0 0 0 0 0 0 0 0 0 0
5 0.00 0 0 0 0 0 0 0 0 0 0 0
6 0.00 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu
1 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
2 0 0.07 0 0 0.00 0 0 0.00 0 0.00 0.00
3 0 0.00 0 0 0.06 0 0 0.12 0 0.06 0.06
4 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
5 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
6 0 0.00 0 0 0.00 0 0 0.00 0 0.00 0.00
table conference charSemicolon charRoundbracket charSquarebracket
1 0 0 0.00 0.000 0
2 0 0 0.00 0.132 0
3 0 0 0.01 0.143 0
4 0 0 0.00 0.137 0
5 0 0 0.00 0.135 0
6 0 0 0.00 0.223 0
charExclamation charDollar charHash capitalAve capitalLong capitalTotal type
1 0.778 0.000 0.000 3.76 61 278 spam
2 0.372 0.180 0.048 5.11 101 1028 spam
3 0.276 0.184 0.010 9.82 485 2259 spam
4 0.137 0.000 0.000 3.54 40 191 spam
5 0.135 0.000 0.000 3.54 40 191 spam
6 0.000 0.000 0.000 3.00 15 54 spam
Let’s shuffle the rows:
make address all num3d our over remove internet order mail receive will
2609 0 0 0.00 0 0.00 0 0.00 0 0.0 0.00 0 2.85
4069 0 0 0.00 0 0.00 0 0.00 0 0.0 0.00 0 0.00
2369 0 0 0.00 0 0.77 0 0.00 0 0.0 0.00 0 0.00
1098 0 0 0.40 0 0.00 0 0.20 0 0.8 0.20 0 0.40
1252 0 0 0.72 0 0.72 0 0.72 0 0.0 0.00 0 0.00
634 0 0 0.00 0 0.00 0 0.00 0 0.0 1.49 0 1.49
people report addresses free business email you credit your font num000
2609 0.00 0.00 0.0 0.0 0 0 0.00 0.00 0.00 0 0
4069 0.00 0.00 0.0 0.0 0 0 0.00 0.00 0.00 0 0
2369 0.00 0.00 0.0 0.0 0 0 1.55 0.00 0.77 0 0
1098 0.00 1.41 0.2 0.4 0 0 3.44 3.03 2.22 0 0
1252 0.72 0.00 0.0 0.0 0 0 1.45 0.00 1.45 0 0
634 0.00 0.00 0.0 0.0 0 0 7.46 0.00 1.49 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
2609 0.00 0 0 0.00 0 0 0 0 0 0 0 0
4069 0.00 0 0 0.00 0 0 0 0 0 0 0 0
2369 0.77 0 0 0.77 0 0 0 0 0 0 0 0
1098 0.40 0 0 0.00 0 0 0 0 0 0 0 0
1252 0.72 0 0 0.00 0 0 0 0 0 0 0 0
634 0.00 0 0 0.00 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu
2609 0 0 0 0 0 0 0 0 0 0.00 0.00
4069 0 0 0 0 0 0 0 0 0 0.00 4.00
2369 0 0 0 0 0 0 0 0 0 1.55 0.77
1098 0 0 0 0 0 0 0 0 0 0.00 0.00
1252 0 0 0 0 0 0 0 0 0 0.00 0.00
634 0 0 0 0 0 0 0 0 0 0.00 0.00
table conference charSemicolon charRoundbracket charSquarebracket
2609 0 0 0.000 0.000 0
4069 0 0 0.000 0.684 0
2369 0 0 0.000 0.490 0
1098 0 0 0.033 0.066 0
1252 0 0 0.000 0.364 0
634 0 0 0.000 0.000 0
charExclamation charDollar charHash capitalAve capitalLong capitalTotal
2609 0.480 0.000 0 2.00 7 26
4069 0.000 0.000 0 1.12 2 9
2369 0.196 0.000 0 3.16 10 79
1098 0.133 0.066 0 2.70 30 192
1252 0.729 0.121 0 7.78 32 249
634 0.000 0.000 0 2.69 15 35
type
2609 nonspam
4069 nonspam
2369 nonspam
1098 spam
1252 spam
634 spam
Now let’s overwrite the rowids starting one and we will have shuffled order:
rowid make address all num3d our over remove internet order mail receive
1 1 0 0 0.00 0 0.00 0 0.00 0 0.0 0.00 0
2 2 0 0 0.00 0 0.00 0 0.00 0 0.0 0.00 0
3 3 0 0 0.00 0 0.77 0 0.00 0 0.0 0.00 0
4 4 0 0 0.40 0 0.00 0 0.20 0 0.8 0.20 0
5 5 0 0 0.72 0 0.72 0 0.72 0 0.0 0.00 0
6 6 0 0 0.00 0 0.00 0 0.00 0 0.0 1.49 0
will people report addresses free business email you credit your font num000
1 2.85 0.00 0.00 0.0 0.0 0 0 0.00 0.00 0.00 0 0
2 0.00 0.00 0.00 0.0 0.0 0 0 0.00 0.00 0.00 0 0
3 0.00 0.00 0.00 0.0 0.0 0 0 1.55 0.00 0.77 0 0
4 0.40 0.00 1.41 0.2 0.4 0 0 3.44 3.03 2.22 0 0
5 0.00 0.72 0.00 0.0 0.0 0 0 1.45 0.00 1.45 0 0
6 1.49 0.00 0.00 0.0 0.0 0 0 7.46 0.00 1.49 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0.00 0 0 0.00 0 0 0 0 0 0 0 0
2 0.00 0 0 0.00 0 0 0 0 0 0 0 0
3 0.77 0 0 0.77 0 0 0 0 0 0 0 0
4 0.40 0 0 0.00 0 0 0 0 0 0 0 0
5 0.72 0 0 0.00 0 0 0 0 0 0 0 0
6 0.00 0 0 0.00 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu
1 0 0 0 0 0 0 0 0 0 0.00 0.00
2 0 0 0 0 0 0 0 0 0 0.00 4.00
3 0 0 0 0 0 0 0 0 0 1.55 0.77
4 0 0 0 0 0 0 0 0 0 0.00 0.00
5 0 0 0 0 0 0 0 0 0 0.00 0.00
6 0 0 0 0 0 0 0 0 0 0.00 0.00
table conference charSemicolon charRoundbracket charSquarebracket
1 0 0 0.000 0.000 0
2 0 0 0.000 0.684 0
3 0 0 0.000 0.490 0
4 0 0 0.033 0.066 0
5 0 0 0.000 0.364 0
6 0 0 0.000 0.000 0
charExclamation charDollar charHash capitalAve capitalLong capitalTotal
1 0.480 0.000 0 2.00 7 26
2 0.000 0.000 0 1.12 2 9
3 0.196 0.000 0 3.16 10 79
4 0.133 0.066 0 2.70 30 192
5 0.729 0.121 0 7.78 32 249
6 0.000 0.000 0 2.69 15 35
type
1 nonspam
2 nonspam
3 nonspam
4 spam
5 spam
6 spam
Before we discussed stratified sampling in this article. So if you want to know more about it, please have a look to it first.
What we didn’t discuss in there was doing it by proportion or by sample size.
mtcars
, grouped by cylinders
of the carsLet’s first see the distribution of the groups:
cyl n percent
1 8 14 43.8
2 4 11 34.4
3 6 7 21.9
So, our sample should include similar ratio from these groups. Let’s see what happens:
# A tibble: 9 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
2 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
3 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
4 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
5 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
9 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
Let’s compare the groups in terms of ratios:
cyl n percent
1 8 14 43.8
2 4 11 34.4
3 6 7 21.9
# A tibble: 3 × 3
cyl n percent
<dbl> <int> <dbl>
1 8 4 44.4
2 4 3 33.3
3 6 2 22.2
As we see, each group represented in the similar proportion between population and sample.
Instead of this, no matter the numbers in the population, you can say like, ” I just want to see 3 cars from each cylinder type”. Then, you can use equal counts as below:
# A tibble: 9 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
6 21 6 160 110 3.9 2.62 16.5 0 1 4 4
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Let’s check the counts:
Stratified sampling gives you guidelines on how likely it is to select rows from your dataset within specific subgroups. A broader approach to this is weighted sampling, where you can set rules for the probability of selecting individual rows. In weighted sampling, the likelihood of choosing a particular row is directly tied to its weight value. So, the higher the weight, the greater the chances of picking that row.
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 450SE 16.4 8 276 180 3.07 4.07 17.4 0 0 3 3
Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Toyota Corona 21.5 4 120 97 3.70 2.46 20.0 1 0 3 1
Merc 450SLC 15.2 8 276 180 3.07 3.78 18.0 0 0 3 3
Pontiac Firebird 19.2 8 400 175 3.08 3.85 17.1 0 0 3 2
Stratified sampling has a drawback that requires data collection from all subgroups within the population. The challenge arises when the data collection process is costly, such as when physical travel is needed to obtain the required information. This expense can make the analysis financially burdensome.
However, there is a more cost-effective option called cluster sampling. With cluster sampling, you can limit the number of subgroups involved by randomly selecting a smaller subset through simple random sampling. Within each selected subgroup, you can then perform simple random sampling as before. This approach provides comparable results while reducing the expenses associated with collecting data from every single subgroup.
Let’s see the cut sizes in diamonds dataset:
# A tibble: 5 × 1
cut
<ord>
1 Ideal
2 Premium
3 Good
4 Very Good
5 Fair
Now randomly pick some cut categories:
[1] Good Fair Ideal
Levels: Fair < Good < Very Good < Premium < Ideal
Now we can use only these randomly picked subgroups and perform simple random sampling:
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
2 0.34 Ideal I SI1 60.2 57 589 4.54 4.5 2.72
3 1.27 Ideal F VS2 61 59 9547 6.96 6.98 4.25
4 0.54 Ideal F VVS2 61 54 2231 5.28 5.3 3.23
5 0.43 Ideal D VS2 60.7 57 1129 4.91 4.88 2.97
6 1.01 Ideal F SI2 61.3 57 3777 6.4 6.35 3.91
As we see, only 3 subgroups exists and they are randomly picked.
When it comes to real life data analysis, usually data is not big, it’s huge! So you will have to make some sampling one wa or another. As you can see, there are different methods you can use but they all depends on what type of data you have and how it’s collected. In the end, main objective is being away as much as possible from any kind of bias! (Which is , believe or not, one of the hardest things…)
I hope you enjoyed!😉