In this short document I present a simple workflow I use to run simulations for regression analysis in R.
This document describes the workflow I use to run fake-data simulations for regression analysis in R. At the end of the document one can find a template to run a simple simulation from A to Z. It can be copied/pasted and easily adapted to address specific questions.
I developed this workflow with Léo Zabrocki. On the website of one of our projects, we describe how we ran more advanced fake data simulations for various causal inference methods (IV, event-study, RDD, matching). On the website of another project, we describe how we ran simulations based on real data, also for various causal inference methods.
The workflow described here builds on purrr and tidyverse principles. This framework enables to easily and explicitly run simulations for several sets of parameter values. It also offers the possibility to very easily parallelize the code using the package furrr.
To illustrate this workflow, consider an example from an extremely simple setting. In this example, the goal is to study how the distribution of an estimator varies with the variance of the associated variable. Let’s consider the following model:
\[ y_i = \alpha + \beta x_i + u_i \] for individual \(i \in \{1, .., N\}\) where \(u_i \sim \mathcal{N}(0, \sigma_u^2)\) and \(x_i \sim \mathcal{N}(\mu_x, \sigma_x^2)\).
First, I load the packages necessary to run the simulation. Here I also load an additional and a non-necessary package to set my default ggplot theme for graphs, mediocrethemes.
#necessary packages
library(tidyverse)
library(broom)
#additional packages
library(mediocrethemes) #my ggplot theme package
mediocrethemes::set_mediocre_all(gradient = "right") #will create all the graphs
#using my theme
I store the values of the baseline parameters in a one-row tibble, baseline_param
:
baseline_param <- tibble(
N = 1000,
mu_x = 1.2,
sigma_x = 0.9,
sigma_u = 0.8,
alpha = 0.5,
beta = 0.9
)
I then write a function that takes these parameters as inputs and returns a randomly generated data set:
One can pass the parameters to the function using the pmap_dfr
function for instance:
example_data <- purrr::pmap_dfr(baseline_param, generate_data)
I then write a function that takes the generated data as input, runs the estimation and returns the parameters of interest in a data frame. In this example, I am interested in estimates of \(\beta\) and therefore only consider the regression output for the \(x\) term.
This function, applied to example_data
, returns the following data set:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
x | 0.93 | 0.03 | 33.74 | 0 |
I then combine generate_data
and run_estim
into a function that computes one iteration of the simulation. I also add a line of code to store the parameters used in this simulation.
To run several iterations of the simulation, I build a data frame of parameters such that each row contains the values of the parameters for one single iteration of the simulation.
If one wants to run n_iter
iterations of the simulation for the baseline parameter values, they only have to replicate the one-row tibble baseline_param
using crossing
for instance.
It produces a data frame with n_iter
(here 5) identical rows:
N | mu_x | sigma_x | sigma_u | alpha | beta |
---|---|---|---|---|---|
1000 | 1.2 | 0.9 | 0.8 | 0.5 | 0.9 |
1000 | 1.2 | 0.9 | 0.8 | 0.5 | 0.9 |
1000 | 1.2 | 0.9 | 0.8 | 0.5 | 0.9 |
1000 | 1.2 | 0.9 | 0.8 | 0.5 | 0.9 |
1000 | 1.2 | 0.9 | 0.8 | 0.5 | 0.9 |
Then I cna pass the resulting set of parameter values to the function compute_sim
using pmap_dfr
:
result_sim <- pmap_dfr(param, compute_sim)
Using this method, one can also easily run a series of simulations for several parameter values. Here, I vary sigma_x
and run n_iter
iterations of the simulation for each value of the varying parameter sigma_x
.
It produces a data frame with n_iter
times the number of values for the varying parameter rows (here 5 times 2 simulations):
N | mu_x | sigma_u | alpha | beta | sigma_x |
---|---|---|---|---|---|
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
I can then pass this data frame to compute_sim
to obtain my simulation results.
result_sim_sigma_x <- pmap_dfr(param_sigma_x, compute_sim)
Here are the results!
term | estimate | std.error | statistic | p.value | N | mu_x | sigma_u | alpha | beta | sigma_x |
---|---|---|---|---|---|---|---|---|---|---|
x | 0.7919152 | 0.1262817 | 6.271021 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
x | 0.9077674 | 0.0309158 | 29.362533 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
x | 0.9630684 | 0.1279793 | 7.525188 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
x | 0.9009793 | 0.0315959 | 28.515721 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
x | 1.0628069 | 0.1300458 | 8.172556 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
x | 0.9083459 | 0.0314618 | 28.871383 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
x | 0.8016967 | 0.1304078 | 6.147612 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
x | 0.8811167 | 0.0323071 | 27.273188 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
x | 0.6900479 | 0.1242532 | 5.553564 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.2 |
x | 0.9149920 | 0.0315210 | 29.028001 | 0 | 1000 | 1.2 | 0.8 | 0.5 | 0.9 | 0.8 |
The simulation process can easily be parallelized using furrr. To do so, I use future_pmap_dfr
instead of pmap_dfr
. To do that, I need to first define the type of session to use and potentially set the seed option to TRUE
.
library(furrr)
future::plan(multisession, workers = availableCores() - 1)
result_sim <- future_pmap_dfr(param, compute_sim,
.options = furrr_options(seed = TRUE))
I then briefly analyse the output of this simulation. In this example, I am interested in studying how the distribution of the estimates varies with the value of \(\sigma_x\). I therefore run a longer simulation with 500 iterations and 4 values for \(\sigma_x\).
I then make the graph of interest:
result_sim_long %>%
mutate(sigma_x = as.factor(sigma_x)) %>%
ggplot() +
geom_density(aes(x = estimate, color = sigma_x), alpha = 0.03) +
labs(
title = "Distribution of estimates of beta",
subtitle = "Comparison for different values of the standard deviation of x",
color = "sd(x)",
fill = "sd(x)",
x = "Estimate of beta",
y = "Density"
)
Each distribution is built based on 500 points (estimates). This graph illustrates a well know result: the distribution of the estimates widens when \(\sigma_x\) decreases.
The following code summarizes the previous chunks. One can thus easily copy/paste the entire simulation to adapt it to a specific context.
#packages and set cores
library(tidyverse)
library(broom)
library(furrr)
future::plan(multisession, workers = availableCores() - 1)
#set baseline parameters
baseline_param <- tibble(
N = 1000,
mu_x = 1.2,
sigma_x = 0.9,
sigma_u = 0.8,
alpha = 0.5,
beta = 0.9
)
#function to generate data
generate_data <- function(N,
mu_x,
sigma_x,
sigma_u,
alpha,
beta) {
data <- tibble(id = 1:N) %>%
mutate(
x = rnorm(N, mu_x, sigma_x),
u = rnorm(N, 0, sigma_u),
y = alpha + beta*x + u
)
}
#function to run the estimation
run_estim <- function(data) {
lm(data = data, y ~ x) %>%
broom::tidy() %>%
filter(term == "x")
}
#function to compute a simulation
compute_sim <- function(...) {
generate_data(...) %>%
run_estim() %>%
cbind(as_tibble(list(...))) #add parameters used for generation
}
#replicate the process
#set the number of iterations and parameters to vary
n_iter <- 1000
vect_sigma_x <- seq(0.2, 2, 0.2)
#define the complete set of parameters
param <- baseline_param %>%
crossing(rep_id = 1:n_iter) %>%
select(-sigma_x) %>%
crossing(sigma_x = vect_sigma_x) %>%
select(-rep_id)
result_sim <- future_pmap_dfr(param, compute_sim,
.options = furrr_options(seed = TRUE))
The thumbnail photo: Maxime Lebrun on Unsplash↩︎