Simulations RDD

In this document, we run a simulation exercise to illustrate how using a Regression Discontinuity Design (RDD) to avoid confounders may lead to a loss in power and inflated effect sizes. To make these simulations realistic, we emulate a typical study estimating the impact of additional lessons on students’ grades.

Intuition

In the case of the RDD, the trade-off between avoiding confounding and inflated effect sizes due to low power issues is mediated by the size of the bandwidth considered in the analysis. The underlying idea is that the smaller the bandwidth, the more comparable units are and therefore the smaller the risk of confounding is. Yet, with a smaller bandwidth, sample size and thus power decrease, increasing the exaggeration ratio.
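To fix orders of magnitude, a back-of-the-envelope approximation (ignoring boundary effects) is that with \(n\) observations inside the bandwidth, the standard error of the RDD estimate scales roughly as \(1/\sqrt{n}\): shrinking the bandwidth from 50% to 10% of the sample therefore inflates standard errors by about \(\sqrt{5} \approx 2.2\), mechanically reducing power.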

Simulation framework

Illustrative example

To illustrate this trade-off, we consider a standard application of the RD design in the economics of education, where a grant or additional lessons are assigned based on the score obtained by students on a standardized test. Students with test scores below a given threshold receive the treatment while those above do not. Yet, students far above and far below the threshold may differ along unobserved characteristics such as ability. To limit this bias, the effect of the treatment is estimated by comparing the outcomes of students just below and just above the threshold, which limits disparities in terms of unobserved characteristics.

Thistlethwaite and Campbell (1960) introduced the concept of RDD using this type of quasi-experiment. In their paper, they take advantage of a sharp discontinuity in the assignment of an award (a Certificate of Merit) based on qualifying scores on a test. This type of analysis is still used today and many papers leveraging similar methodologies have been published since this seminal work. For instance, Jacob and Lefgren (2004) exploit this type of discontinuity to study the impact of summer school and grade retention programs on test scores. Students who score below a given threshold are required to attend a summer school and to retake the test; students who do not pass the second test have to repeat the grade.

Modeling choices

In the present analysis, we build our simulations to replicate a similar type of quasi-experiment. In our fictional example, all students scoring below a cutoff \(C\) in a qualification test are required to take additional lessons. We want to estimate the effect of these additional lessons on scores on a final test taken by all students a year later.

We assume that the final score of student \(i\), \(Final_i\), is correlated with their qualification score \(Qual_i\) and their treatment status \(T_i\), i.e., whether student \(i\) received additional lessons or not. We further assume that both qualification and final test scores are affected by students’ unobserved ability \(w_i\) in a non-linear way.

The DGP can be represented using the following Directed Acyclic Graph (DAG):

Final test scores are thus defined as follows:

\[Final_{i} = \beta_0 + \beta_1 T_i + \eta Qual_{i} + \delta f(w_i) + u_{i}\]

Where \(\beta_0\) is a constant, \(f\) a non-linear function and \(u_i\) noise. The parameter of interest is \(\beta_1\). In the potential outcomes framework, we have:

\[Final_{i}(0) = \beta_0 + \eta Qual_{i} + \delta f(w_i) + u_{i} \qquad \text{and} \qquad Final_{i}(1) = Final_{i}(0) + \beta_1\]

Qualifying test scores are as follows: \(Qual_i = \mu_q + \gamma f(w_i) + \epsilon_i\)

To simplify, and in line with the generating function below, we make the following assumptions:

- the non-linear ability term is \(f(w_i) = w_i^3\), with \(w_i \sim \mathcal{U}(-1, 1)\);
- the error terms are independent draws \(\epsilon_i \sim \mathcal{N}(0, \sigma_{\epsilon}^2)\) and \(u_i \sim \mathcal{N}(0, \sigma_{u}^2)\);
- the weight on ability in the final score is \(\delta = \gamma(1 - \eta)\), so that the total effect of ability on the final score (direct plus through the qualification score) equals \(\gamma\), as for the qualification score;
- the treatment effect \(\beta_1\) is constant and homogeneous across students;
- the cutoff \(C\) is set at the \(q_c\)-th quantile of the qualification scores, so that a share \(q_c\) of the students is treated.

The precise parameter values are set in the calibration section below.
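Putting these pieces together, the data generating process, as implemented in the generating function below, can be summarized as:

\[Qual_i = \mu_q + \gamma w_i^3 + \epsilon_i, \qquad T_i = \mathbb{1}\{Qual_i < C\}, \qquad Final_i = \beta_0 + \beta_1 T_i + \eta \, Qual_{i} + \gamma(1-\eta) w_i^3 + u_i\]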

Implementation of the simulations

Data generation

Generating function

We write a simple function that generates the data. It takes as input the values of the different parameters and returns a data frame containing all the variables for this analysis.

Once the fake data is generated, to make things more realistic, we treat it as if it were actual data: we do not take advantage of our knowledge of the data generating process in the estimation procedure. We do, however, keep both potential outcomes and the unobserved ability in the data set in order to compute true effects later. Note that, in a real-world setting, one would generally know the value of the threshold (and thus of \(q_c\)). Based on that, and to simplify the computation of the bandwidth, we store \(q_c\) in the data set.

generate_data_RDD <- function(N,
                              mu_q,
                              sigma_epsilon,
                              beta_0,
                              sigma_u,
                              beta_1,
                              eta,
                              gamma,
                              q_c) {
  
  # Calibration notes:
  # mu_q <- mean_q
  # sigma_epsilon <- sqrt(sd_q^2 - gamma^2*(1/3)) #var(u) = 1/3
  # beta_0 <- mean_f - eta*mean_q - beta_1*q_c #q_c = mean(treated)
  # sigma_u <- sd_f

  data <- tibble(id = 1:N) |>
    mutate(
      w = runif(N, -1, 1),                                  # unobserved ability
      epsilon = rnorm(N, 0, sigma_epsilon),                 # noise in the qualification score
      qual = mu_q + gamma*w^3 + epsilon,                    # qualification test score
      u = rnorm(N, 0, sigma_u),                             # noise in the final score
      treated = qual < quantile(qual, q_c),                 # treated if below the q_c-th quantile
      final0 = beta_0 + eta*qual + gamma*(1-eta)*w^3 + u,   # potential outcome without treatment
      final1 = final0 + beta_1,                             # potential outcome with treatment
      final = final0 + beta_1*treated,                      # observed final score
      q_c = q_c                                             # store the threshold quantile
    )
  
  return(data)
}

Definition of the bandwidth

In a RDD, the model is estimated only for observations close enough to the threshold, i.e., within a given bandwidth. We therefore create a function to define this bandwidth by adding a variable, treated_bw, to the data set. It is equal to NA if the observation is outside of the bandwidth, TRUE if the observation falls in the bandwidth and the student is treated, and FALSE if the observation falls in the bandwidth and the student is not treated. The bandwidth parameter bw represents the proportion of units that are in the bandwidth: for instance, if bw = 0.1, 10% of the students are in the bandwidth.

define_bw <- function (data, bw) {
  data <- data |> 
    mutate(
      # treated_bw equals the treatment status inside the bandwidth and NA outside
      treated_bw = ifelse(
        dplyr::between(
          qual, 
          quantile(qual, unique(q_c) - bw/2),   # lower bound of the bandwidth
          quantile(qual, unique(q_c) + bw/2)    # upper bound of the bandwidth
        ), 
        treated, 
        NA
      )
    )
  
  return(data)
} 
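As a quick sanity check (a minimal sketch, using the example data set ex_data_RDD generated in the calibration section below), about 10% of the observations should have a non-missing treated_bw when bw = 0.1:

ex_data_RDD |>
  define_bw(bw = 0.1) |>
  count(treated_bw) # expect roughly 3,000 TRUE, 3,000 FALSE and 54,000 NA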

Calibration and baseline parameters’ values

We set baseline values for the parameters to emulate a somewhat realistic observational study in this field. We use the following values:

| N | mu_q | sigma_epsilon | beta_0 | sigma_u | beta_1 | eta | q_c | gamma |
|---|------|---------------|--------|---------|--------|-----|-----|-------|
| 60000 | 1060 | 145 | 850 | 140 | 6 | 0.2 | 0.5 | 360 |

Here is an example of data created with our data generating process:

Show the code used to generate the table
baseline_param_RDD |>
  mutate(N = 10) |>
  pmap(generate_data_RDD) |> #use pmap to pass the set of parameters
  list_rbind() |> 
  kable()
| id | w | epsilon | qual | u | treated | final0 | final1 | final | q_c |
|----|---|---------|------|---|---------|--------|--------|-------|-----|
| 1 | -0.8612782 | -39.482926 | 790.5144 | -98.48500 | TRUE | 725.6158 | 731.6158 | 731.6158 | 0.5 |
| 2 | 0.6355504 | -45.725563 | 1106.6916 | 166.44308 | FALSE | 1311.7151 | 1317.7151 | 1311.7151 | 0.5 |
| 3 | 0.8852435 | -91.097009 | 1218.6445 | 47.67172 | FALSE | 1341.1938 | 1347.1938 | 1341.1938 | 0.5 |
| 4 | -0.4612362 | -15.437263 | 1009.2385 | 70.97554 | FALSE | 1094.5639 | 1100.5639 | 1094.5639 | 0.5 |
| 5 | -0.6613038 | 62.062146 | 1017.9490 | -41.06272 | FALSE | 929.2366 | 935.2366 | 929.2366 | 0.5 |
| 6 | -0.9322088 | -112.769339 | 655.5941 | 31.30980 | TRUE | 779.1193 | 785.1193 | 785.1193 | 0.5 |
| 7 | -0.6424300 | -187.612933 | 776.9362 | 281.00820 | TRUE | 1210.0347 | 1216.0347 | 1216.0347 | 0.5 |
| 8 | 0.2833307 | -113.037144 | 955.1510 | 141.67708 | FALSE | 1189.2578 | 1195.2578 | 1189.2578 | 0.5 |
| 9 | -0.9542445 | 1.733005 | 748.9224 | -42.34429 | TRUE | 707.1917 | 713.1917 | 713.1917 | 0.5 |
| 10 | -0.9833503 | -22.100355 | 695.5835 | -143.53428 | TRUE | 571.7296 | 577.7296 | 577.7296 | 0.5 |

We check the standard deviation and means of the generated exam scores to make sure that they correspond to what we expect:

Show code
ex_data_RDD <- baseline_param_RDD |> 
  pmap(generate_data_RDD) |> 
  list_rbind()

ex_data_RDD_mean <- ex_data_RDD |> 
  summarise(across(.cols = c(qual, final0), .fns = mean)) |> 
  mutate(Statistic = "Mean") |> 
  select(Statistic, everything())

ex_data_RDD_sd <- ex_data_RDD |> 
  summarise(across(.cols = c(qual, final0), .fns = stats::sd)) |> 
  mutate(Statistic = "Standard Deviation") |> 
  select(Statistic, everything())

ex_data_RDD_mean |> 
  rbind(ex_data_RDD_sd) |> 
  kable()
| Statistic | qual | final0 |
|-----------|------|--------|
| Mean | 1060.4593 | 1062.4691 |
| Standard Deviation | 199.1573 | 198.5572 |
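These values line up with what the DGP implies: since \(w_i \sim \mathcal{U}(-1, 1)\), \(\operatorname{Var}(w_i^3) = 1/7\), so the standard deviation of the qualification score should be about \(\sqrt{\sigma_{\epsilon}^2 + \gamma^2/7} = \sqrt{145^2 + 360^2/7} \approx 199\) and its mean should be \(\mu_q = 1060\), both consistent with the table above.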

Exploration of the data generated

The following graph illustrates this process by plotting final test scores against qualification scores, with colors indicating the value of treated_bw.

Show the code used to generate the graph
ex_data_RDD |> 
  define_bw(0.5) |>
  ggplot(aes(x = qual, y = final, color = treated_bw)) +
  geom_point() +
  labs(
    title = "Final scores against qualification scores",
    subtitle = "Illustration of the definition of the bandwidth and the treatment",
    x = "Qualification score",
    y = "Final sccore",
    color = NULL
  ) +
  scale_color_discrete(
    labels = c(
      "Non treated,\nin bandwidth",
      "Treated,\nin bandwidth",
      "Outside\nbandwidth")
  ) +
  coord_fixed()

Then, we quickly look at the distributions of the different variables to check that they have a shape similar to what we expect.

Show code
ex_data_RDD |> 
  mutate(is_treated = ifelse(treated, "Treated", "Non treated")) |> 
  ggplot(aes(x = qual, fill = is_treated)) +
  geom_histogram() + 
  labs(
    title = "Distribution of qualifying scores depending on treatment status",
    x = "Qualifying Score",
    y = "Count",
    fill = NULL
  ) 
Show code
ex_data_RDD |> 
  mutate(is_treated = ifelse(treated, "Treated", "Non treated")) |> 
  ggplot(aes(x = final, fill = is_treated)) +
  geom_histogram(position = "identity", alpha = 0.8, bins = 50) + 
  labs(
    title = "Distribution of final scores depending on treatment status",
    x = "Final Score",
    y = "Count",
    fill = NULL
  ) 

We then look at the relation between the qualifying score and the unobserved ability. This enables us to understand where the bias comes from:

Show code
ex_data_RDD |> 
  define_bw(0.7) |>
  ggplot(aes(x = qual, y = w, color = treated_bw)) + 
  geom_point() + 
  labs(
    title = "Relationship between unobserved ability and qualifying score",
    x = "Qualifying score",
    y = "Unobserved ability",
    color = NULL
  ) +
  scale_color_discrete(
    labels = c(
      "Non treated,\nin bandwidth",
      "Treated,\nin bandwidth",
      "Outside\nbandwidth")
  ) 

Estimation

After generating the data, we can run an estimation.

Note that to run power calculations, we need access to the true effects. Therefore, before running the estimation, we write a short function to compute the average treatment effect on the treated (ATET), which here equals \(\beta_1\) by construction since the treatment effect is constant. We will add this information to the estimation results.

compute_true_effect_RDD <- function(data) {
  treated_data <- data |> 
    filter(treated) 
  return(round(mean(treated_data$final1 - treated_data$final0)))
}  

We then run the estimation. To do so, we only consider observations within the bandwidth and regress the final test scores on the treatment and the qualification score. One could also include their interaction term to allow more flexibility and mimic a more realistic estimation; since this interaction does not appear in the DGP, including it or not does not change the results. Also note that, of course, we do not include the unobserved ability in this model, thereby creating an omitted variable bias (OVB).

estimate_RDD <- function(data, bw) {
  # keep only the observations that fall within the bandwidth
  data_in_bw <- data |> 
    define_bw(bw = bw) |> 
    filter(!is.na(treated_bw))
  
  # regress the final score on the treatment and the qualification score
  # and extract the coefficient of interest
  reg <- lm(
    data = data_in_bw, 
    formula = final ~ treated + qual
  ) |> 
    broom::tidy() |>
    filter(term == "treatedTRUE") |>
    rename(p_value = p.value, se = std.error) |>
    select(estimate, p_value, se) |>
    mutate(
      true_effect = compute_true_effect_RDD(data),
      bw = bw
    )
  
  return(reg)
}

One simulation

We can now run a simulation by combining generate_data_RDD and estimate_RDD. To do so, we create the function compute_sim_RDD. This simple function takes as input the various parameters along with a vector of bandwidth sizes, vect_bw. Since several bandwidths can be evaluated on the same generated data set, we allow passing a vector of bandwidths rather than a single value. The function returns a table with the estimate of the treatment effect, its p-value and standard error, the true effect and the bandwidth considered. Note that, for now, we do not store the values of the other parameters, since they are held fixed throughout the study.

compute_sim_RDD <- function(vect_bw, ...) {
  # generate one data set and estimate the model for each bandwidth in vect_bw
  data <- generate_data_RDD(...) 
  
  map(vect_bw, estimate_RDD, data = data) |> 
    list_rbind()
} 
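For instance (a minimal sketch, reusing the baseline parameters defined above), a single simulation evaluated at two bandwidths can be run as follows:

baseline_param_RDD |>
  mutate(vect_bw = list(c(0.1, 0.2))) |>
  pmap(compute_sim_RDD) |>
  list_rbind()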

Here is an example of the resulting output:

| estimate | p_value | se | true_effect | bw |
|----------|---------|----|-------------|----|
| 8.915184 | 0.2619015 | 7.945699 | 6 | 0.1 |
| 2.910494 | 0.6058127 | 5.639690 | 6 | 0.2 |

All simulations

We run the simulations for different sets of parameters by mapping our compute_sim_RDD function over each set of parameters. We thus create a table, param_rdd, with all the values of the parameters we want to test. Note that each set of parameters appears n_iter times in this table, as we want to run the analysis \(n_{iter}\) times for each set of parameters.

fixed_param_RDD <- baseline_param_RDD |> #|> rbind(...)
  mutate(vect_bw = 0.1)
vect_bw <- c(seq(0.1, 0.9, 0.1)) # bandwidths to evaluate
n_iter <- 1000                   # number of iterations per set of parameters

param_rdd <- fixed_param_RDD |> 
  select(-vect_bw) |> #parameters we modify
  mutate(vect_bw = list(vect_bw)) |> 
  crossing(sim_id = 1:n_iter) |> # duplicate each set of parameters n_iter times
  select(-sim_id)

We then run the simulations by mapping our compute_sim_RDD function on param_rdd.

tic()
sim_rdd <- pmap(param_rdd, compute_sim_RDD, .progress = TRUE) |> 
  list_rbind(names_to = "sim_id")
beep()
toc()

# saveRDS(sim_rdd, here("Outputs/sim_rdd.RDS"))

Analysis of the results

Quick exploration

First, we quickly explore the results. In the following figure, we can see that for small bandwidths, estimates are unbiased but imprecise, while for large bandwidths, estimates are precise but biased.

Show code
sim_rdd <- readRDS(here("Outputs", "sim_rdd.RDS"))

sim_rdd |>
  filter(bw %in% seq(0.1, 0.9, 0.1)) |> 
  mutate(bw_name = str_c("Bandwidth: ", bw)) |> 
  ggplot(aes(x = estimate)) +
  geom_density() +
  # geom_vline(aes(xintercept = mean(estimate))) +
  geom_vline(aes(xintercept = true_effect), linetype = "dashed") +
  facet_wrap(~ bw_name) +
  labs(
    title = "Distribution of the estimates of the treatment effect",
    subtitle = "For different bandwidth sizes, as proportion of observations in the bandwidth",
    x = "Estimate of the treatment effect",
    y = "Density",
  )

When the bandwidth is relatively small, estimates are spread out and the mean of statistically significant estimates is larger than the true effect, while the average of all estimates, significant and non-significant, remains close to the true effect. Applying a statistical significance filter therefore leads to overestimating the true effect in this case.

Show code
sim_rdd |> 
  filter(bw == vect_bw[2]) |>
  mutate(significant = ifelse(p_value < 0.05, "Significant", "Non-significant")) |> 
  ggplot(aes(x = estimate, fill = significant)) + 
  geom_histogram(bins = 25) +
  geom_vline(
    aes(xintercept = mean(abs(estimate[significant == "Significant"]))),
    linetype = "solid"
  ) +
  geom_vline(aes(xintercept = true_effect), linetype = "dashed") +
  # facet_wrap(~ bw) +
  labs(
    title = "Distribution of the estimates of the treatment effect",
    subtitle = str_c("For a bandwidth containing ", vect_bw[2]*100, "% of the observations"),
    x = "Estimate of the treatment effect",
    y = "Count",
    fill = "",
    caption = "The solid line represents the average of significant 
    estimates (in absolute value), the dashed line represents the true effect"
  )

Computing the bias and exaggeration ratio

We want to compare \(\mathbb{E}\left[ \left| \frac{\widehat{\beta_{RDD}}}{\beta_1}\right|\right]\) and \(\mathbb{E}\left[ \left| \frac{\widehat{\beta_{RDD}}}{\beta_1} \right| \, \Big| \, \text{signif} \right]\). The first term measures the bias and the second the exaggeration ratio. Expressing both as ratios of the estimate to the true effect makes them independent of the effect size and therefore comparable across simulations: we compute the average of this ratio over all estimates, and its average conditional on statistical significance.

source(here("functions.R"))

summary_sim_rdd <- summarise_sim(
  data = sim_rdd, 
  varying_params = c(bw), 
  true_effect = true_effect
)

# saveRDS(summary_sim_rdd, here("Outputs/summary_sim_rdd.RDS"))
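The function summarise_sim() is defined in functions.R and is not reproduced here. Conceptually (a rough sketch of the computation we assume it performs), it boils down to averaging these ratios by bandwidth, producing, among others, the bias_all and type_m columns used in the graphs below:

# sketch of the quantities summarise_sim() is assumed to compute
sim_rdd |>
  group_by(bw) |>
  summarise(
    bias_all = mean(abs(estimate/true_effect)),                # all estimates
    type_m = mean(abs(estimate/true_effect)[p_value < 0.05])   # significant estimates only
  )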

Graphs

To analyze our results, we build a single, simple graph:

Show code
main_graph_RDD <- summary_sim_rdd |> 
  pivot_longer(cols = c(type_m, bias_all), names_to = "measure") |> 
  mutate(
    measure = ifelse(measure == "type_m", "Significant", "All")
  ) |> 
  ggplot(aes(x = bw, y = value, color = measure)) + 
  geom_line(linewidth = 1.2) +
  labs(
    x = "Bandwidth size", 
    y = expression(paste("Average  ", frac("|Estimate|", "|True Effect|" ))),
    color = "Estimates",
    title = "Evolution of the bias with bandwidth size for the RDD",
    subtitle = "Conditional on significativity",
    caption = "Bandwidth size as a proportion of the total number of observations"
  ) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5))

main_graph_RDD

We notice that the smaller the bandwidth size, the closer the average of all estimates is to the true effect. Yet, when the bandwidth gets small, significant estimates overestimate the true effect. This arises because of a loss of power, as shown in the graph below.

We can then look in more detail at the distribution of the estimates for different bandwidths. When power is low (i.e., when the bandwidth is small), statistically significant estimates are located in the tails of the distribution of all estimates, which is itself centered around the true effect.

Show code
graph_distrib_RDD <- sim_rdd |> 
  mutate(
    significant = ifelse(p_value < 0.05, "Significant", "Non-significant"),
    ratio = estimate/true_effect
  ) |> 
  group_by(bw) |> 
  mutate(
    mean_signif_bw = mean(ifelse(p_value < 0.05, ratio, NA), na.rm = TRUE),
    mean_all_bw = mean(ratio, na.rm = TRUE)
  ) |> 
  ungroup() |> 
  ggplot(aes(x = ratio, fill = "All estimates")) + 
  facet_grid(~ bw, switch = "x") +
  geom_vline(aes(xintercept = 1)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  coord_flip() +
  scale_y_continuous(breaks = NULL) + 
  labs(
    y = "Bandwidth size", 
    x = expression(paste(frac("Estimate", "True Effect" ))),
    fill = NULL,
    title = 
      paste("Distribution of RDD estimates for", n_iter, "simulated datasets"),
    subtitle = "Comparison across bandwidth sizes", 
    caption = 
      "Bandwidth size as a proportion of the total number of observations"
  )

graph_distrib_RDD +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits = c(-5, 8)) 
Show code
graph_distrib_RDD +
  geom_histogram(bins = 45, alpha = 0.85) 
Show code
graph_distrib_RDD_signif <- graph_distrib_RDD +
  geom_histogram(bins = 45, alpha = 0.85, aes(fill = significant)) +
  geom_vline(xintercept = 1) 

graph_distrib_RDD_signif
Show code
graph_distrib_RDD_complete <- graph_distrib_RDD_signif +
  geom_vline(
    aes(xintercept = mean_signif_bw),
    color = "#976B21",
    linetype = "solid",
    linewidth = 0.9
  ) 

graph_distrib_RDD_complete

Note that Jacob and Lefgren (2004) have about 5,000 observations inside their bandwidth. In our simulations, the 0.1 bandwidth contains even more observations (6,000) and still displays a particularly large exaggeration.

Further checks

Representativeness of the estimation

We calibrated our simulations to emulate a typical study from this literature. To further check that the results are realistic, we compare the average Signal-to-Noise Ratio (SNR) of our regressions to the range of SNR of an existing study. The closest study to our simulation is Jacob and Lefgren (2004). In their main result table, Table 3, the SNRs range from 0.3 to 8.

We find SNRs in a similar range, or even larger:
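Here, the SNR of a regression is taken to be the ratio of the estimate to its standard error (our working definition). As a sketch, the median SNRs reported below can be computed from the simulation results as:

sim_rdd |>
  mutate(snr = abs(estimate / se)) |>
  group_by(bw) |>
  summarise(median_snr = median(snr))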

| Bandwidth | Median SNR | Bias ratio |
|-----------|------------|------------|
| 0.1 | 0.85 | 3.21 |
| 0.2 | 1.15 | 2.31 |
| 0.3 | 1.43 | 1.94 |
| 0.4 | 1.70 | 1.73 |
| 0.5 | 2.11 | 1.63 |
| 0.6 | 2.80 | 1.67 |
| 0.7 | 3.86 | 1.91 |
| 0.8 | 5.70 | 2.55 |
| 0.9 | 8.95 | 3.65 |

The precision of our results is therefore in line with the precision found in the literature. It is important to acknowledge that our simulation by no means reproduces the study by Jacob and Lefgren (2004) and cannot be used to analyze their results directly. Rather, it combines insights from several studies to mimic a “typical” study from this literature, while remaining conservative with regard to many assumptions, in particular in terms of the data generating model (constant and homogeneous treatment effect, etc.). Parameter values, however, are such that they yield a design that can be underpowered, something that can happen in this literature but does not affect every single study.

Variation in effect size

We then explore where the variation in estimates across bandwidth sizes, for a given data set, comes from. To do so, we run a few simulations and plot the results.

few_sim <- baseline_param_RDD |> 
  crossing(sim_id = 1:4) |>
  select(-sim_id) |> 
  pmap(compute_sim_RDD, vect_bw = seq(0.1, 0.9, 0.1)) |> 
  list_rbind(names_to = "sim_id") |> 
  mutate(
    sim_id = paste("Simulation", sim_id),
    signif = ifelse(p_value < 0.05, "Significant", "Non-significant")
  )

few_sim |> 
  ggplot(aes(x = bw, y = estimate, color = signif)) +
  geom_point() +
  geom_pointrange(aes(ymin = estimate - 1.96*se, ymax = estimate + 1.96*se)) +
  geom_hline(aes(yintercept = true_effect), linetype = "dotted") +
  geom_hline(yintercept = 0, linetype = "solid") + 
  facet_wrap(~ sim_id) +
  labs(
    title = "Estimation results with fixed data sets",
    x = "Bandwidth size",
    y = "Estimate", 
    color = NULL,
    caption = "The dotted line represents the true effect,
      error bars represent the 95% CI and
      the bandwidth size is the proportion of the total number of observations"
  )