This document describes the workflow I use to run fake-data simulations for regression analysis in R. At the end of the document one can find a template to run a simple simulation from A to Z. It can be copied/pasted and easily adapted to address specific questions.

I developed this workflow with Léo Zabrocki. On the website of one of our projects, we describe how we ran more advanced fake data simulations for various causal inference methods (IV, event-study, RDD, matching). On the website of another project, we describe how we ran simulations based on real data, also for various causal inference methods.

The workflow described here builds on purrr and tidyverse principles. This framework enables to easily and explicitly run simulations for several sets of parameter values. It also offers the possibility to very easily parallelize the code using the package furrr.

Simulation workflow

Define a set of baseline parameters and store them in a data frame,
Generate data for a given set of parameters,
Run the estimation and extract the outcomes of interest,
Compute one simulation by combing the previous steps,
Replicate the process for a number of iterations and a set of parameters.

Detailed explanation

To illustrate this workflow, consider an example from an extremely simple setting. In this example, the goal is to study how the distribution of an estimator varies with the variance of the associated variable. Let’s consider the following model:

\[ y_i = \alpha + \beta x_i + u_i \] for individual \(i \in \{1, .., N\}\) where \(u_i \sim \mathcal{N}(0, \sigma_u^2)\) and \(x_i \sim \mathcal{N}(\mu_x, \sigma_x^2)\).

Packages

First, I load the packages necessary to run the simulation. Here I also load an additional and a non-necessary package to set my default ggplot theme for graphs, mediocrethemes.

#necessary packages
library(tidyverse)
library(broom)

#additional packages
library(mediocrethemes) #my ggplot theme package
mediocrethemes::set_mediocre_all(gradient = "right") #will create all the graphs
#using my theme

Define a set of baseline parameters

I store the values of the baseline parameters in a one-row tibble, baseline_param:

baseline_param <- tibble(
  N = 1000,
  mu_x = 1.2, 
  sigma_x = 0.9,
  sigma_u = 0.8,
  alpha = 0.5,
  beta = 0.9
)

Generate data

I then write a function that takes these parameters as inputs and returns a randomly generated data set:

generate_data <- function(N,
                          mu_x,
                          sigma_x,
                          sigma_u, 
                          alpha,
                          beta) {
  
  data <- tibble(id = 1:N) %>%
    mutate(
      x = rnorm(N, mu_x, sigma_x),
      u = rnorm(N, 0, sigma_u),
      y = alpha + beta*x + u
    )
}

One can pass the parameters to the function using the pmap_dfr function for instance:

example_data <- purrr::pmap_dfr(baseline_param, generate_data)

Run the estimation

I then write a function that takes the generated data as input, runs the estimation and returns the parameters of interest in a data frame. In this example, I am interested in estimates of \(\beta\) and therefore only consider the regression output for the \(x\) term.

run_estim <- function(data) {
  lm(data = data, y ~ x) %>%
    broom::tidy() %>%
    filter(term == "x") 
}

This function, applied to example_data, returns the following data set:

term	estimate	std.error	statistic	p.value
x	0.93	0.03	33.74	0

Compute simulation

I then combine generate_data and run_estim into a function that computes one iteration of the simulation. I also add a line of code to store the parameters used in this simulation.

compute_sim <- function(...) {
  generate_data(...) %>% 
    run_estim() %>% 
    cbind(as_tibble(list(...))) #add parameters used for generation
}

Replicate the process

To run several iterations of the simulation, I build a data frame of parameters such that each row contains the values of the parameters for one single iteration of the simulation.

If one wants to run n_iter iterations of the simulation for the baseline parameter values, they only have to replicate the one-row tibble baseline_param using crossing for instance.

n_iter <- 5

param <- baseline_param %>% 
  crossing(rep_id = 1:n_iter) %>% 
  select(-rep_id)

It produces a data frame with n_iter (here 5) identical rows:

N	mu_x	sigma_x	sigma_u	alpha	beta
1000	1.2	0.9	0.8	0.5	0.9
1000	1.2	0.9	0.8	0.5	0.9
1000	1.2	0.9	0.8	0.5	0.9
1000	1.2	0.9	0.8	0.5	0.9
1000	1.2	0.9	0.8	0.5	0.9

Then I cna pass the resulting set of parameter values to the function compute_sim using pmap_dfr:

result_sim <- pmap_dfr(param, compute_sim)

Using this method, one can also easily run a series of simulations for several parameter values. Here, I vary sigma_x and run n_iter iterations of the simulation for each value of the varying parameter sigma_x.

n_iter <- 5

vect_sigma_x <- c(0.2, 0.8)

param_sigma_x <- baseline_param %>% 
  crossing(rep_id = 1:n_iter)  %>% 
  select(-sigma_x) %>% 
  crossing(sigma_x = vect_sigma_x) %>% 
  select(-rep_id)

It produces a data frame with n_iter times the number of values for the varying parameter rows (here 5 times 2 simulations):

N	mu_x	sigma_u	alpha	beta	sigma_x
1000	1.2	0.8	0.5	0.9	0.2
1000	1.2	0.8	0.5	0.9	0.8
1000	1.2	0.8	0.5	0.9	0.2
1000	1.2	0.8	0.5	0.9	0.8
1000	1.2	0.8	0.5	0.9	0.2
1000	1.2	0.8	0.5	0.9	0.8
1000	1.2	0.8	0.5	0.9	0.2
1000	1.2	0.8	0.5	0.9	0.8
1000	1.2	0.8	0.5	0.9	0.2
1000	1.2	0.8	0.5	0.9	0.8

I can then pass this data frame to compute_sim to obtain my simulation results.

result_sim_sigma_x <- pmap_dfr(param_sigma_x, compute_sim)

Here are the results!

term	estimate	std.error	statistic	N	mu_x	sigma_u	alpha	beta	sigma_x
x	0.7919152	0.1262817	6.271021	1000	1.2	0.8	0.5	0.9	0.2
x	0.9077674	0.0309158	29.362533	1000	1.2	0.8	0.5	0.9	0.8
x	0.9630684	0.1279793	7.525188	1000	1.2	0.8	0.5	0.9	0.2
x	0.9009793	0.0315959	28.515721	1000	1.2	0.8	0.5	0.9	0.8
x	1.0628069	0.1300458	8.172556	1000	1.2	0.8	0.5	0.9	0.2
x	0.9083459	0.0314618	28.871383	1000	1.2	0.8	0.5	0.9	0.8
x	0.8016967	0.1304078	6.147612	1000	1.2	0.8	0.5	0.9	0.2
x	0.8811167	0.0323071	27.273188	1000	1.2	0.8	0.5	0.9	0.8
x	0.6900479	0.1242532	5.553564	1000	1.2	0.8	0.5	0.9	0.2
x	0.9149920	0.0315210	29.028001	1000	1.2	0.8	0.5	0.9	0.8

Parallel computing

The simulation process can easily be parallelized using furrr. To do so, I use future_pmap_dfr instead of pmap_dfr. To do that, I need to first define the type of session to use and potentially set the seed option to TRUE.

library(furrr)

future::plan(multisession, workers = availableCores() - 1)

result_sim <- future_pmap_dfr(param, compute_sim,
        .options = furrr_options(seed = TRUE))

Analysis of the results

I then briefly analyse the output of this simulation. In this example, I am interested in studying how the distribution of the estimates varies with the value of \(\sigma_x\). I therefore run a longer simulation with 500 iterations and 4 values for \(\sigma_x\).

I then make the graph of interest:

result_sim_long %>% 
  mutate(sigma_x = as.factor(sigma_x)) %>% 
  ggplot() +
  geom_density(aes(x = estimate, color = sigma_x), alpha = 0.03) +
  labs(
    title = "Distribution of estimates of beta",
    subtitle = "Comparison for different values of the standard deviation of x",
    color = "sd(x)",
    fill = "sd(x)",
    x = "Estimate of beta",
    y = "Density"
  )

Each distribution is built based on 500 points (estimates). This graph illustrates a well know result: the distribution of the estimates widens when \(\sigma_x\) decreases.

A simple template

The following code summarizes the previous chunks. One can thus easily copy/paste the entire simulation to adapt it to a specific context.

#packages and set cores
library(tidyverse)
library(broom)
library(furrr)

future::plan(multisession, workers = availableCores() - 1)

#set baseline parameters
baseline_param <- tibble(
  N = 1000,
  mu_x = 1.2, 
  sigma_x = 0.9,
  sigma_u = 0.8,
  alpha = 0.5,
  beta = 0.9
)

#function to generate data
generate_data <- function(N,
                          mu_x,
                          sigma_x,
                          sigma_u, 
                          alpha,
                          beta) {
  
  data <- tibble(id = 1:N) %>%
    mutate(
      x = rnorm(N, mu_x, sigma_x),
      u = rnorm(N, 0, sigma_u),
      y = alpha + beta*x + u
    )
}

#function to run the estimation
run_estim <- function(data) {
  lm(data = data, y ~ x) %>%
    broom::tidy() %>%
    filter(term == "x") 
}

#function to compute a simulation
compute_sim <- function(...) {
  generate_data(...) %>% 
    run_estim() %>% 
    cbind(as_tibble(list(...))) #add parameters used for generation
}

#replicate the process
#set the number of iterations and parameters to vary
n_iter <- 1000
vect_sigma_x <- seq(0.2, 2, 0.2)
#define the complete set of parameters
param <- baseline_param %>% 
  crossing(rep_id = 1:n_iter)  %>% 
  select(-sigma_x) %>% 
  crossing(sigma_x = vect_sigma_x) %>% 
  select(-rep_id) 

result_sim <- future_pmap_dfr(param, compute_sim,
        .options = furrr_options(seed = TRUE))

The thumbnail photo: Maxime Lebrun on Unsplash ↩︎

A workflow to run simulations