In this document, I carry out complementary analyses to the literature review I implemented in my companion paper on short term health effects of air pollution.
library(tidyverse)
library(knitr)
library(mediocrethemes)
library(here)
library(retrodesign)
library(haven)
library(DT)
library(ggbeeswarm)
set_mediocre_all(pal = "coty")
First, I load the data set containing results of the meta-analysis. I then print the whole table:
The goal is to compare the OVB in a non-causal case to exaggeration of a corresponding causal analysis. To do so, I focus on IVs because for the same design, they report both a “naive” OLS estimate and a causal one. To disentangle the biases of exaggeration and omitted variables, I make the following hypotheses:
To be able to compare this exaggeration and OVB, I need to have an estimate of the “true effect”. That way I can compare how far are the OLS and IV point estimates from this true effect and to compute the mean exaggeration.
First, I define the true effect based on a meta-analysis carried out in Shah et al. (2015). They find the following impact of standardized increments in pollutant concentration on relative risk for mortality and hospital admissions (CI = 95% CIs):
I then restrict the sample to studies that consider treatments that can be compared to these standardized increments (thus removing for instance treatments such as “1 additional fog day”).
I also restrict the sample to IV and their corresponding OLS (or conventional time series) in order to be able to compare a a “causal” method to a naive one.
method_interest <- c("conventional time series", "instrumental variable")
pollutant_interest <- c("co", "no2", "o3", "ozone", "pm10", "pm2.5")
#there might be an issue with the increase_independent_variable for CO
estimates_interest <- data_lit_review |>
filter(empirical_strategy %in% method_interest) |>
filter(independent_variable %in% pollutant_interest) |>
filter(str_detect(increase_independent_variable, "ppb|µg")) |>
mutate(
independent_variable =
ifelse(independent_variable == "ozone", "o3", independent_variable)
)
In the literature review the studies consider various increases in the independent variable. They are as follows (with the number of study for each pollutant):
increase_independent_variable | n |
---|---|
1 ppb increase in co | 40 |
1 ppb increase in no2 | 29 |
1 ppb increase in o3 | 2 |
1 µg/m³ increase in no2 | 22 |
1 µg/m³ increase in pm10 | 91 |
1 µg/m³ increase in pm2.5 | 63 |
1 µg/m³ reduction in pm2.5 | 3 |
They differ from those in Shah et al. I modify those in Shah, to make them comparable to mine. To do so I assume linearity and for instance divide the effects by 10 and the effects for \(CO\) by 1000:
#there is an issue here with the / 10 thing
effects_shah <-
tibble(
pollutant = c("co", "so2", "no2", "pm2.5", "pm10", "o3"),
rr_shah = c(1.015, 1.019, 1.014, 1.011, 1.003, 1.001)
) |>
mutate(effect_shah = (rr_shah - 1)*100) |>
mutate(
effect_shah = ifelse(
pollutant == "co",
effect_shah / 1000,
effect_shah / 10
)
) |>
select(-rr_shah)
I also check the distribution of the effect sizes found in each analysis of the restricted sample of the literature review, using a log 10 scale for the x-axis:
estimates_interest |>
filter(empirical_strategy %in% c("conventional time series", "instrumental variable")) |>
filter(control_outcome == 0) |> #remove placebos
ggplot(aes(x = estimate, y = independent_variable, color = empirical_strategy)) +
# geom_histogram(bins = 10) +
geom_beeswarm(cex = 1.5) +
# facet_wrap(~ independent_variable, scales = "free") +
scale_x_log10() +
labs(
title = "Distribution of effect sizes in the IV literature",
x = "Point estimate",
y = NULL,
color = NULL
)
It seems that the magnitude of the estimates vary a lot from study to study.
estimates_interest |>
filter(empirical_strategy %in% c("conventional time series", "instrumental variable")) |>
filter(control_outcome == 0) |> #remove placebos
ggplot(aes(x = estimate, y = independent_variable, color = empirical_strategy)) +
# geom_histogram(bins = 10) +
geom_beeswarm(cex = 1.5) +
# facet_wrap(~ independent_variable, scales = "free") +
scale_x_log10() +
labs(
title = "Distribution of effect sizes in the IV literature",
x = "Point estimate",
y = NULL,
color = NULL
)
I then study the case of He, Liu, and Zhou (2020). They find that ” a 10 μg/m3 increase in PM2.5 increases mortality by 3.25%” (se 1.43%). Their OLS results suggest a 0.32% increase (sd 0.23%). Deryugina et al (2019) find that “each 1 μg/m3 increase in daily PM 2.5 exposure causes […] a 0.18 percent increase relative to the average three-day mortality rate”. I explore whether the findings in He et al. IV could be exaggeration if the true effecct was equal to the one found in Deryugina.
true_deryugina <- 0.18
true_shah <- 0.11
estimate_iv_he <- 0.325
se_iv_he <- 0.143
estimate_ols_he <- 0.032
se_ols_he <- 0.023
N <- 100000
estimates_he <-
tibble(true_deryugina, true_shah, estimate_iv_he, estimate_ols_he) |>
mutate(
exagg_iv_he_deryugina =
true_deryugina * retrodesign(true_deryugina, se_iv_he)$type_m,
exagg_iv_he_shah =
true_shah * retrodesign(true_shah, se_iv_he)$type_m
)
#comparison exagg ovb
exagg_ovb_shah <-
abs(true_shah - estimates_he$exagg_iv_he_shah) /
abs(true_shah - estimate_ols_he)
exagg_ovb_deryugina <-
abs(true_deryugina - estimates_he$exagg_iv_he_deryugina) /
abs(true_deryugina - estimate_ols_he)
# estimates_he |>
# pivot_longer(cols = everything(), names_to = "name", values_to = "estimate") |>
# filter(!str_detect(name, "shah")) |>
# ggplot(aes(x = estimate, y = 0, color = name)) +
# geom_point(size = 3) +
# xlab(c(0, NA))
I then plot the distribution of the OLS estimator, assuming that the point estimate found in He, Liu, and Zhou (2020) is at the center of this distribution. I also plot the distribution of the 2SLS estimator of He, Liu, and Zhou (2020) that would be obtained if the IV was centered on the true effect (as defined in Shah et al. (2015)). The standard deviation of this distribution is equal to the standard error of the 2SLS estimator of He, Liu, and Zhou (2020).
distrib_he_shah <-
tibble(
OLS = rnorm(N, estimate_ols_he, se_ols_he),
IV = rnorm(N, true_shah, se_iv_he)
) |>
pivot_longer(cols = everything(), names_to = "model", values_to = "estimate") |>
mutate(
se = ifelse(model == "ols", se_ols_he, se_iv_he),
signif = (estimate > 1.96*se)
)
he_graph <- distrib_he_shah |>
mutate(
cat = ifelse(signif, "2SLS Significant", "2SLS Non-significant"),
cat = ifelse(model == "OLS", "OLS", cat)
) |>
ggplot(aes(x = estimate, fill = cat)) +
geom_histogram(bins = 190, position = "identity", alpha = 0.9) +
# geom_area(stat = "bin", bins = 100, position = "identity", alpha = 0.9) +
geom_vline(xintercept = true_shah, linetype = "solid") +
geom_vline(xintercept = estimate_iv_he, linetype = "longdash") +
geom_vline(
xintercept = estimate_ols_he,
linetype = "dotted",
linewidth = 0.8
# color = "#E08E53"
) +
# facet_wrap(~ model_desc, nrow = 2) +
labs(
x = "Estimate",
y = "Count",
fill = NULL
) +
xlim(c(-0.35, 0.6))
ggsave(here("images", "air_pollution", "graph_he.pdf"), he_graph, width = 8, height = 4)
he_graph +
labs(
title = "The OLS estimate can be closer to the 'true effect' than the 2SLS one",
subtitle = "In He et al (2020), exaggeration might explains why",
caption = "100000 draws from two normal distibutions
The true effect (solid line) is the meta-analysis one found in Shah et al (2016)
The dashed and dotted lines represent the 2SLS and OLS estimates in He et al., respectively
The distributions are 2SLS ~ N('true effect', var 2SLS), OLS ~ N(OLS estimate, var OLS)"
)
I then plot the same graph if the true effect was the one found in Deryugina et al. (2019):
distrib_he_deryugina <-
tibble(
OLS = rnorm(N, estimate_ols_he, se_ols_he),
IV = rnorm(N, true_deryugina, se_iv_he)
) |>
pivot_longer(cols = everything(), names_to = "model", values_to = "estimate") |>
mutate(
se = ifelse(model == "ols", se_ols_he, se_iv_he),
signif = (estimate > 1.96*se)
)
he_graph_deryugina <- distrib_he_deryugina |>
mutate(
signif = ifelse(signif, "IV significant", "IV non-significant"),
cat = ifelse(model == "OLS", "OLS", signif),
model_desc = ifelse(
model == "IV",
"IV ~ N('true effect', var IV)",
"OLS ~ N(OLS estimate, var OLS)")
) |>
ggplot(aes(x = estimate, fill = cat)) +
geom_vline(xintercept = 0, linetype = "solid", linewidth = 0.1) +
geom_histogram(bins = 200, position = "identity", alpha = 0.9) +
# geom_area(stat = "bin", bins = 183, position = "identity", alpha = 0.9) +
geom_vline(xintercept = true_deryugina, linetype = "solid") +
geom_vline(xintercept = estimate_iv_he, linetype = "longdash") +
geom_vline(
xintercept = estimate_ols_he,
linetype = "dotted",
linewidth = 0.8
# color = "white"
) +
# facet_wrap(~ model_desc, nrow = 2) +
labs(
x = "Estimate",
y = "Count",
fill = NULL,
color = NULL
) +
xlim(c(-0.25, 0.6))
ggsave(
here("images", "air_pollution", "he_graph_deryugina.pdf"),
he_graph_deryugina,
width = 8, height = 3.5
)
he_graph_deryugina +
labs(
title = "The IV estimate can be as far from the 'true effect' than the OLS one",
subtitle = "In He et al (2020), exaggeration might explains why",
caption = "100000 draws from two normal distibutions, filtering out extreme values
The true effect, represented by the dark solid line, is the estimate from Deryugina et al. (2019)
The dashed and dotted lines represent the IV and OLS estimates in He et al., respectively
The distributions are IV ~ N('true effect', var IV), OLS ~ N(OLS estimate, var OLS)"
)