I was working on a project and I needed a quick randomized version of an existing dataset. However, I didn’t need it to be just randomized: I also wanted it to be randomly correlated with a selected variable. Why? Because I needed a practice file for working on regression models, and there should be some correlation left in the data to make that practice realistic.
I searched a little and couldn’t find a “plug&play” method. (Maybe I wasn’t looking in the right place.)
So, I created this example code set to give an idea of how to do it quickly and in a relatively efficient way. I hope this helps somebody!
2 Libraries to Use
# Setting the Working directory---------------------------------------------
WorkingDirectory <- getwd() # IF you saved the file, this gets the file path
if (!is.null(WorkingDirectory)) setwd(WorkingDirectory)

# Loading the libraries-----------------------------------------------------
if (!require("pacman")) install.packages("pacman")
Loading required package: pacman
p_load(pacman,    # if you have a missing library, pacman will find and install!
       GGally,    # to create a correlation plot
       MASS,      # to get the example dataset called Boston
       openxlsx,  # creating the excel outputs
       tidyverse  # for just many things :)
)
Let’s have a look at the heat map of the current correlations between the variables:
# Load the Boston housing dataset
dataset.df <- Boston

# Creating a correlation model with categorical variables also taken into account
corr_model_matrix <- model.matrix(~., data = dataset.df)

# Creating a correlation plot
ggcorr(data = corr_model_matrix[, -1],
       label = TRUE, label_round = 2, hjust = 1, digits = 2,
       layout.exp = 3, label_size = 3,
       low = "#F21A00", high = "#3B9AB2")
4 Randomizing Process
When you need “repeatability” in your project, you can set the seed with the code below. I’m not going to use it here, because I just want randomness, but I’m including it in case you need it:
# Set seed for reproducibility (only if you want to replicate the operation)
# set.seed(42)
Our strategy will be as simple as possible: to randomize, we take the existing values of each variable and add or subtract some numbers from them. These numbers are scaled by the standard deviation of the variable being altered. In addition, we add a second add/subtract term with a random weight, and this time we use the standard deviation of our response (dependent) variable.
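The update rule described above can be sketched on a single toy variable before we apply it to the whole dataset. The numbers and names below (`x`, `w`, the toy `medv` values) are made up purely for illustration; the real loop later applies the same idea column by column on the Boston data:

```r
set.seed(1)                  # only for this sketch, so it runs standalone
x    <- c(10, 12, 9, 15, 11) # a toy predictor variable (made-up values)
medv <- c(24, 22, 27, 18, 23) # a toy response we want to stay correlated with
noise_level <- 0.3
w <- rnorm(1)                # one random weight for this variable

x_randomized <- x +
  rnorm(length(x), mean = 0, sd = noise_level * sd(x)) +      # noise scaled by the variable's own sd
  w * rnorm(length(x), mean = 0, sd = noise_level * sd(medv)) # weighted noise scaled by the response's sd

x_randomized # same length as x, each value shifted up or down
```

Because `w` can be negative, the second term can push a variable’s relation to the response in either direction, which is exactly what makes the resulting correlations random.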
Let’s see how this works in action.
1- We need some randomly created weights, one for each variable. We will then use these weights to shift the variables to the minus or plus side. The “-1” is there because we need one weight for each variable except the dependent variable, which we will not touch.
weights <- rnorm(ncol(Boston) - 1)
2- We also need some Gaussian (normally distributed) noise to add to each variable. Instead of using a fixed noise level, I create options between 0.1 and 0.6 and randomly select one of them on each run. Additionally, I create a copy of the original data as “randomized_dataset” to alter later, and store the dataset dimensions for later use.
randomized_noise_options <- seq(0.1, 0.6, by = 0.1)
noise_level <- sample(randomized_noise_options, 1, replace = TRUE)
randomized_dataset.df <- dataset.df
ncols <- ncol(dataset.df) # number of columns in the dataset
nrows <- nrow(dataset.df) # number of rows in the dataset
3- In a for loop, we run over each variable and replace it with an altered version.
for (i in 1:(ncols - 1)) { # Iterate over columns except the response (medv)
  if (names(dataset.df)[i] != "medv") { # if medv comes, the code skips it
    randomized_dataset.df[[i]] <- randomized_dataset.df[[i]] + # taking the current value, then adding noise
      # Create Gaussian noise with each variable's own std. dev.
      weights[i] * rnorm(nrows,    # number of random points needed
                         mean = 0, # zero mean for centered noise
                         sd = noise_level * sd(dataset.df[[i]])) +
      # And weighted randomness to add by using medv's sd
      weights[i] * rnorm(nrows, 0, noise_level * sd(dataset.df[["medv"]]))
  }
}
4- Now if we check the correlations, we will see a different result, but still some correlation with medv:
# Creating a correlation model with categorical variables also taken into account
corr_model_matrix <- model.matrix(~., data = randomized_dataset.df)

# Creating a correlation plot
ggcorr(data = corr_model_matrix[, -1],
       label = TRUE, label_round = 2, hjust = 1, digits = 2,
       layout.exp = 3, label_size = 3,
       low = "#F21A00", high = "#3B9AB2")
If you take all the code up to this point and run it, every run will not only alter the original dataset without destroying all the correlations, but will also randomize which variables are correlated and how strongly.
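To see that claim in numbers rather than in a heat map, here is a small self-contained spot-check. This is my own illustrative snippet, not part of the original pipeline: it repeats the randomizing idea with a fixed noise level and then compares one well-known correlation (lstat vs. medv, about -0.74 in the original Boston data) before and after:

```r
library(MASS) # for the Boston dataset

original.df   <- Boston
randomized.df <- Boston
noise_level   <- 0.3
weights       <- rnorm(ncol(Boston) - 1)

for (i in 1:(ncol(Boston) - 1)) { # perturb every predictor, leave medv untouched
  randomized.df[[i]] <- randomized.df[[i]] +
    rnorm(nrow(Boston), 0, noise_level * sd(original.df[[i]])) +
    weights[i] * rnorm(nrow(Boston), 0, noise_level * sd(original.df$medv))
}

cor(original.df$lstat, original.df$medv)     # roughly -0.74 in the original data
cor(randomized.df$lstat, randomized.df$medv) # different on every run, but usually not wiped out
```

Running this a few times shows the point: the randomized correlation jumps around from run to run, while the original stays fixed.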
5 Assemble the Pieces
Now the last step is to merge all the code and create the desired number of Excel outputs.
Let’s say we need 30 different randomized Boston datasets, exported as Excel files. So I created a variable to control how many outputs will be generated.
# Define the number of randomized datasets to create
num_datasets <- 30
Afterwards, we can assemble all the code together:
# Setting the Working directory---------------------------------------------
WorkingDirectory <- getwd() # IF you saved the file, this gets the file path
if (!is.null(WorkingDirectory)) setwd(WorkingDirectory)

# Loading the libraries-----------------------------------------------------
if (!require("pacman")) install.packages("pacman")
p_load(pacman,    # if you have a missing library, pacman will find and install!
       GGally,    # to create a correlation plot
       MASS,      # to get the example dataset called Boston
       openxlsx,  # creating the excel outputs
       tidyverse  # for just many things :)
)

# Load the Boston housing dataset
dataset.df <- Boston

# Set seed for reproducibility (only if you want to replicate the operation)
# set.seed(123) # I commented this out, because the idea is the randomizing.

# Define the number of randomized datasets to create
num_datasets <- 30

# Prepare the column and row number variables to use later
ncols <- ncol(dataset.df) - 1 # number of columns to change
nrows <- nrow(dataset.df)     # number of rows in each variable

# Loop the defined number of times and create randomized versions
for (i in 1:num_datasets) {

  # Create a copy of the Boston dataset
  randomized_dataset.df <- dataset.df

  # Randomizing the noise level and weights
  randomized_noise_options <- seq(0.1, 0.6, by = 0.1)
  noise_level <- sample(randomized_noise_options, 1, replace = TRUE)
  weights <- rnorm(ncols)

  for (j in 1:ncols) {
    if (names(randomized_dataset.df)[j] != "medv") {
      # taking the current value, then adding noise
      randomized_dataset.df[[j]] <- randomized_dataset.df[[j]] +
        # Create Gaussian noise with each variable's own std. dev.
        rnorm(nrows, mean = 0, sd = noise_level * sd(dataset.df[[j]])) +
        # And weighted randomness to add by using medv's sd
        weights[j] * rnorm(nrows, mean = 0, sd = noise_level * sd(dataset.df[["medv"]]))
    }
  }

  # Save the randomized dataset as an Excel file
  filename <- paste0("Boston_Randomized_", i, ".xlsx")
  write.xlsx(randomized_dataset.df, file = filename, rowNames = FALSE)
}
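As a closing sanity check, the Excel round trip itself can be verified. This is my own addition, using a throwaway file name (`Boston_roundtrip_check.xlsx`) rather than one of the thirty outputs above, so it runs on its own: write the dataset out, read it back with `openxlsx::read.xlsx`, and fit the kind of practice regression that motivated all this:

```r
library(MASS)     # Boston dataset
library(openxlsx) # Excel read/write

# Round trip: write one dataset to Excel, read it back, fit a regression on it
write.xlsx(Boston, file = "Boston_roundtrip_check.xlsx", rowNames = FALSE)
practice.df <- read.xlsx("Boston_roundtrip_check.xlsx")

fit <- lm(medv ~ ., data = practice.df)
summary(fit)$r.squared # the model should retain meaningful explanatory power
```

The same two lines (`read.xlsx` plus `lm`) work unchanged on any of the generated `Boston_Randomized_*.xlsx` files, which is the whole point of the exercise.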
Et voilà!
6 Conclusion
This is just one way to do this. There could be other ways, or maybe even dedicated packages, but I don’t know of them yet. For the moment, this solved my issue, and I hope it solves yours, too.