Retrospective Power Analysis of the Causal Inference Literature

In this document, we carry out a retrospective power analysis of the causal inference literature on the acute health effects of air pollution.

We used an extensive search strategy on Google Scholar, PubMed, and IDEAS to retrieve studies that (i) focus on the short-term health effects of air pollution on mortality and morbidity outcomes, and (ii) rely on a causal inference method. We excluded the very recent literature on the effects of air pollution on COVID-19 health outcomes. This search yielded a corpus of 36 relevant articles. For each study, we retrieved the method used by the authors, the health outcomes and air pollutants they consider, and the point estimates and standard errors of their results. We coded all main results as well as results on heterogeneous effects by age category and on control outcomes (“placebo tests”).

Our document is organized as follows:

Set-Up

Packages and Data

library("groundhog")
packages <- c(
  "here",
  "tidyverse", 
  "knitr",
  "retrodesign",
  "kableExtra",
  "patchwork",
  "DT",
  "mediocrethemes"
  # "vincentbagilet/mediocrethemes"
)

# groundhog.library(packages, "2022-11-28")
lapply(packages, library, character.only = TRUE)

set_mediocre_all(pal = "leo")

We load the literature review data:

# load literature review data
data_literature <-
  readRDS(here::here(
    "data",
    "literature_review_causal_inference",
    "data_literature_review.rds"
  ))

Description of Coded Variables

We retrieved 537 estimates from 36 articles. For each paper, we coded 25 variables:

| Empirical Strategy | Proportion (%) |
|---|---|
| instrumental variable | 44 |
| reduced-form | 17 |
| conventional time series | 14 |
| multi pollutant iv-lasso models | 6 |
| difference in differences | 5 |

The data set is as follows:

data_literature |> 
  DT::datatable(
    filter = 'top', 
    options = list(
      pageLength = 5,
      scrollY = TRUE,
      scrollX = TRUE,
      autoWidth = TRUE
    )
  )

Computing Standardized Effect Sizes

We standardize the effect sizes using the standard deviations of the independent and outcome variables:
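As a minimal sketch of this standardization, assuming a level-level linear model, an estimate can be multiplied by the standard deviation of the independent variable and divided by the standard deviation of the outcome. The helper name `standardize_effect` and the numbers below are hypothetical and are not taken from the literature review data.

```r
# Hypothetical helper: standardize an estimate from a level-level linear model.
# Assumption: `estimate` is the change in the outcome for a one-unit increase
# in the independent variable (no log transformation on either side).
standardize_effect <- function(estimate, sd_independent, sd_outcome) {
  estimate * sd_independent / sd_outcome
}

# Made-up example: a one-unit increase in a pollutant raises daily admissions
# by 0.5, with sd(pollutant) = 10 and sd(admissions) = 50.
standardized <- standardize_effect(
  estimate = 0.5,
  sd_independent = 10,
  sd_outcome = 50
)
standardized
```

Under the same assumption, the standard error of the estimate can be rescaled with the same factor, so that the t-statistic is unchanged by the standardization.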

In the case where authors used linear regression models with log-transformed variables, we rely on the formulas provided by Rodríguez-Barranco et al. (2017) to standardize the effect size.

We are able to standardize the effects of 63% of all estimates. We display below summary statistics on the distribution of the standardized effect sizes of causal inference methods:

| Min | First Quartile | Mean | Median | Third Quartile | Maximum |
|---|---|---|---|---|---|
| 0 | 0.0193351 | 0.9527143 | 0.0763023 | 0.2724913 | 34.99225 |

We see that half of the standardized effect sizes are below 0.08 standard deviations.

We plot below the ratio of 2SLS estimates over OLS estimates:

For half of the studies reporting both an IV estimate and an OLS estimate, the ratio of the IV estimate to the OLS estimate is larger than 3.8. The median ratio of the corresponding standard errors is also 3.8.

Evidence of Publication Bias

Distribution of t-statistics

We plot the distribution of weighted t-statistics, following Brodeur et al. (2020): the weight of each test is equal to the inverse of the number of tests presented in the same table multiplied by the inverse of the number of tables in the article.
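The weighting scheme described above can be sketched in base R as follows. The data frame `tests` and its columns (`article`, `table_id`, `t_stat`) are hypothetical and only serve to illustrate the computation; they are not the variables of the literature review data.

```r
# Toy data: one row per test, with hypothetical article and table identifiers
tests <- data.frame(
  article = c("A", "A", "A", "B", "B"),
  table_id = c("A1", "A1", "A2", "B1", "B1"),
  t_stat = c(2.1, 1.7, 3.0, 2.5, 0.8)
)

# Number of tests reported in the same table
n_tests_table <- ave(tests$t_stat, tests$table_id, FUN = length)

# Number of distinct tables in the same article
n_tables_article <- ave(
  as.integer(factor(tests$table_id)),
  tests$article,
  FUN = function(x) length(unique(x))
)

# Weight = (1 / tests in the table) * (1 / tables in the article)
tests$weight <- (1 / n_tests_table) * (1 / n_tables_article)

# Total weight carried by each article
weight_by_article <- tapply(tests$weight, tests$article, sum)
weight_by_article
```

With this scheme the weights of each article sum to one, so that an article reporting many tests does not dominate the distribution.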

We then restrict the sample to studies published in economics journals. The figure remains essentially the same.

Estimated Effect Sizes versus Precision

We plot below the relationship between the absolute values of standardized estimates and the inverse of their standard errors. We do not include control outcomes (“placebo” tests) and conventional time series estimates.

For economics journals, we then compare top 5 journals to other journals.

Computing Statistical Power, Exaggeration and Type S Errors

In this section, we compute the statistical power, the exaggeration ratio and the probability of making a type S error for each study. We rely on the retrodesign package.
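As a rough illustration of what these three metrics measure, the sketch below computes them in closed form with base R, assuming that an estimate is normally distributed around a hypothesized true effect with a known standard error. The helper `retro_metrics` is hypothetical: it is not the retrodesign implementation, and taking the absolute value of the true effect is a choice made here for illustration.

```r
# Hypothetical helper (base R only): closed-form retrospective design metrics
# under a normal approximation.
# true_effect: hypothesized true effect size (must be non-zero)
# se: standard error of the estimate
retro_metrics <- function(true_effect, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)        # critical value of the two-sided test
  lambda <- abs(true_effect) / se  # hypothesized signal-to-noise ratio
  p_right <- pnorm(lambda - z)     # significant with the correct sign
  p_wrong <- pnorm(-lambda - z)    # significant with the wrong sign
  power <- p_right + p_wrong
  type_s <- p_wrong / power
  # Expected absolute value of significant estimates over the true effect
  type_m <- (dnorm(lambda + z) + dnorm(lambda - z) +
    lambda * (pnorm(lambda + z) + pnorm(lambda - z) - 1)) / (lambda * power)
  list(power = power, type_s = type_s, type_m = type_m)
}

# Made-up example: true effect of 0.1 estimated with a standard error of 0.05
example_metrics <- retro_metrics(true_effect = 0.1, se = 0.05)
example_metrics
```

In this made-up example the power is close to 52% and significant estimates overstate the true effect by a factor of about 1.4 on average, while the probability of a sign error is negligible.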

To compute the three metrics, we need to make an assumption about the true effect size of each study:

Overview

We define the true effect sizes as a decreasing fraction of the estimates. We want to see how the overall distributions of the three metrics evolve as we decrease the hypothesized true effect size.

# select the estimates used in the retrospective power analysis
data_retrodesign <- data_literature %>%
  # drop control outcomes
  filter(control_outcome == 0) %>%
  # drop conventional time series models
  filter(!(
    empirical_strategy %in% c(
      "conventional time series",
      "conventional time series - suggestive evidence"
    )
  )) %>%
  select(paper_label, paper_estimate_id, estimate, standard_error) %>%
  # keep estimates that are statistically significant at the 5% level
  filter(abs(estimate / standard_error) >= 1.96)

For each study, we compute the statistical power, the exaggeration ratio and the probability of making a type S error by defining the true effect sizes as decreasing fractions of the estimates.

# run retrospective power analysis for decreasing effect sizes
data_retrodesign_fraction <- data_retrodesign |> 
  crossing(percentage = seq(from = 30, to = 100) / 100) |> 
  mutate(hypothetical_effect_size = percentage * estimate) |> 
  mutate(
    retro = map2(hypothetical_effect_size, standard_error, 
                 \(x, y) retro_design_closed_form(x, y))
    # retro_design_closed_form() returns a list with power, type_s and type_m
  ) |> 
  unnest_wider(retro) |> 
  # express power and type S error in percent, as in the metric labels below
  mutate(power = power * 100, type_s = type_s * 100) |> 
  pivot_longer(
    cols = c(power, type_m, type_s),
    names_to = "metric",
    values_to = "value"
  ) %>%
  mutate(
    metric = case_when(
      metric == "power" ~ "Statistical Power (%)",
      metric == "type_m" ~ "Exaggeration Ratio",
      metric == "type_s" ~ "Type S Error (%)"
    )
  )

# compute median values of metrics for the entire literature
data_retrodesign_fraction_mean <- data_retrodesign_fraction %>%
  group_by(metric, percentage) %>%
  summarise(median_value = median(value))

We then plot the power and the exaggeration ratio metrics for the different scenarios (we do not report Type S error as this issue is limited in this setting):

We display below summary statistics for the scenario where true effect sizes are equal to the observed estimates reduced by 25%:

| Metric | Min | First Quartile | Mean | Median | Third Quartile | Maximum |
|---|---|---|---|---|---|---|
| Exaggeration Ratio | 1.0 | 1.1 | 1.3 | 1.3 | 1.5 | 1.8 |
| Statistical Power (%) | 31.4 | 46.2 | 65.5 | 62.3 | 85.0 | 100.0 |
| Type S Error (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

And below for the scenario where true effect sizes are equal to the observed estimates divided by two:

| Metric | Min | First Quartile | Mean | Median | Third Quartile | Maximum |
|---|---|---|---|---|---|---|
| Exaggeration Ratio | 1.0 | 1.4 | 1.7 | 1.7 | 2.0 | 2.5 |
| Statistical Power (%) | 16.6 | 23.7 | 41.3 | 32.9 | 51.5 | 100.0 |
| Type S Error (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

OLS Estimates as True Effect Sizes

For statistically significant 2SLS estimates, we define the true effect sizes as the corresponding OLS estimates. We assume that (i) the causal estimand targeted by the naive and instrumental variable strategies is the same (i.e., we are in the case of homogeneous constant treatment effects), (ii) that there are no omitted variables and (iii) that there is no classical measurement error in the air pollution exposure.

We retrieve 98 2SLS estimates that were statistically significant at the 5% level and had

The median statistical power is 8.4% and the median exaggeration ratio is 4.5. Only 2.04% of IV designs have adequate power (i.e., greater than 80%).
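To illustrate how low power and large exaggeration arise when the standard error of a 2SLS estimate is large relative to the OLS estimate, the simulation below replicates a hypothetical design many times. It is only a sketch under a normal approximation: the values of `ols_estimate` and `se_2sls` are made up and are not taken from the literature review data, and the simulation stands in for the closed-form computation.

```r
set.seed(42)

# Made-up numbers, not taken from the literature review data
ols_estimate <- 0.1  # hypothesized true effect size
se_2sls <- 0.2       # standard error of the 2SLS estimate

# Hypothetical replications of the 2SLS estimate around the true effect
replications <- rnorm(100000, mean = ols_estimate, sd = se_2sls)
significant <- abs(replications / se_2sls) >= qnorm(0.975)

# Power: share of replications that are significant at the 5% level
power <- mean(significant)

# Exaggeration ratio: average absolute value of significant estimates
# divided by the hypothesized true effect
exaggeration <- mean(abs(replications[significant])) / abs(ols_estimate)

c(power = power, exaggeration = exaggeration)
```

With a true effect equal to half of the standard error, fewer than one replication in ten is statistically significant, and the significant ones overstate the true effect several times over on average.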