In this document, I run a simulation exercise to illustrate how using an Instrumental Variable (IV) strategy to avoid confounders may lead to a loss in power and inflated effect sizes. To make these simulations realistic, I emulate a typical study estimating the impact of turnout on vote shares.
In the case of the IV, the confounding-fexaggeration trade-off is mediated by the ‘strength’ of the instrument considered. When the instrument only explains a limited portion of the variation in the explanatory variable, the IV can still be successful in avoiding confounders but power can low, potentially leading to exaggeration issues to arise.
To illustrate this loss in power, we could consider a large variety of settings, distribution of the parameters or parameter values. I narrow this down to an example setting, considering only one setting and one set of parameter values. I examine an analysis of the impact of voter turnout on election results, instrumenting voter turnout with rainfall on the day of the election. My point should stand in more general settings and the choice of values is mostly for illustration.
A threat of confounders often arises when analyzing the link between voter turnout and election results. To estimate such an effect causally, one can consider exogenous shocks to voter turnout such as rainfall. Some potential exclusion restriction problems have been highlighted in this instance Mellon (2021) but I abstract from them and simulate no exclusion restriction violations here.
For simplicity, I make several assumptions:
The DGP can be represented using the following Directed Acyclic Graph (DAG):
second_color <- str_sub(colors_mediocre[["four_colors"]], 10, 16)
dagify(S ~ T + w + u,
T ~ R + w + e,
exposure = c("S", "T", "R", "w"),
# outcome = "w",
latent = c("u", "e"),
coords = list(x = c(S = 3, T = 2, w = 2.5, R = 1, e = 1.6, u = 3),
y = c(S = 1, T = 1, w = 0, R = 1, e = 0.7, u = 1.5))
) |>
ggdag_status(text_size = 5) +
theme_dag_blank(legend.position = "none") +
scale_mediocre_d(pal = "coty") +
annotate(#parameters
"text",
x = c(2.5, 1.5, 2.8, 2.2),
y = c(1.1, 1.1, 0.45, 0.45),
label = c("beta", "pi", "delta", "gamma"),
parse = TRUE,
color = "black",
size = 5
)

The DGP for the vote share of let’s say the republican party in location \(i\), \(Share_i\), is defined as follows:
\[Share_{i} = \beta_{0} + \beta_{1} Turnout_{i} + \delta w_{i} + u_{i}\]
Where \(\beta_0\) is a constant, \(w\) represents an unobserved variable and \(u \sim \mathcal{N}(0, \sigma_{u}^{2})\) noise. \(\beta_1\) is the parameter of interest. Let’s call it ‘treatment effect’. Note that parameters names are consistent with the maths section and the other simulation exercises.
The DGP for the turnout data is as follows:
\[Turnout_{i} = \pi_0 + \pi_1 Rain_{i} + \gamma w_{i} + e_{i}\]
Where \(\pi_0\) is a constant, \(Rain\) is either a continuous variable (amount of rain in location \(i\) on the day of the election) or a dummy variable (whether it rained or not) and \(e \sim \mathcal{N}(0, \sigma_{e}^{2})\) noise. Let’s refer to \(\pi_1\) as “IV strength”.
The impact of voter turnout on election outcome (share of the republican party) is estimated using 2 Stages Least Squares.
More precisely, let’s set:
If one abstracts from the name of the variables, they can notice that this setting is actually very general.
Let’s first write a simple function that generates the data. It takes as input the values of the different parameters and returns a data frame containing all the variables for this analysis.
The parameter type_rain describes whether \(Rain\) is a random sample from a normal or Bernoulli distribution. The distributions of rainfall heights can be approximated with a gamma distribution. The Bernoulli distribution is used if one only consider the impact of rain or no rain on voter turnout. A normal distribution does not represent actual rainfall distributions but is added to run these simulations in other contexts than linking rainfall, voter turnout and election outcomes.
type_rain can take the values gamma, bernoulli or normal. param_rain represents either \(\sigma_R\) if \(Rain\) is normal, \(p_R\) if it is Bernoulli or a vector of shape and scale parameters for the gamma distribution.
generate_data_iv <- function(N,
type_rain, #"gamma", "normal" or "bernoulli"
param_rain,
sigma_w,
sigma_share,
sigma_turnout,
beta_0,
beta_1,
pi_0,
pi_1,
delta,
gamma) {
if (type_rain == "bernoulli") {
rain_gen <- rbinom(N, 1, param_rain[1])
sd_rain <- sqrt(param_rain[1]*(1-param_rain[1]))
} else if (type_rain == "normal") {
rain_gen <- rnorm(N, 0, param_rain[1])
} else if (type_rain == "gamma") {
rain_gen <- rgamma(N, shape = param_rain[1], scale = param_rain[2])
sd_rain <- sqrt(param_rain[1]*param_rain[2]^2)
} else {
stop("type_rain must be either 'bernoulli', 'gamma' or 'normal'")
}
data <- tibble(id = 1:N) |>
mutate(
rain = rain_gen,
w = rnorm(N, 0, sigma_w),
sigma_rain = sd_rain,
sigma_e = sqrt(sigma_turnout^2 - pi_1^2*sigma_rain^2 - gamma^2*sigma_w^2),
e = rnorm(N, 0, sigma_e),
turnout = pi_0 + pi_1*rain + gamma*w + e,
sigma_u = sqrt(
sigma_share^2
- beta_1^2*sigma_turnout^2
- delta^2*sigma_w^2
- 2*beta_1*delta*gamma*sigma_w^2
),
u = rnorm(N, 0, sigma_u),
share = beta_0 + beta_1*turnout + delta*w + u
)
return(data)
}
I now set baseline values for the parameters to emulate a somehow realistic observational study. I get “inspiration” for the values of parameters from Fujiwara, Meng, and Vogl (2016) and Cooperman (2017) who replicates a work by Gomez, Hansford, and Krause (2007).
I consider that:
# I retrieve data for Gomez from here:
#https://myweb.fsu.edu/bgomez/research.html
library(readr)
gomez <- read_csv(here("inputs", "IV_vote", "Weather_publicfile.csv")) |>
janitor::clean_names() |>
mutate(log_income = log(adj_income))
#delta
feols(gop_vote_share ~ log_income | year, data = gomez)
#gamma
feols(turnout ~ log_income | year, data = gomez)
#sigma_w
feols(log_income ~ 1 | year, data = gomez) |> resid() |> sd()
sd(gomez$log_income, na.rm = TRUE)
Let’s thus consider the following parameters:
baseline_param_iv <- tibble(
N = 30000,
# type_rain = "bernoulli",
# param_rain = 0.3,
type_rain = "gamma",
param_rain = list(c(0.13, 18)),
sigma_w = 0.28,
sigma_share = 14.2,
sigma_turnout = 14,
beta_0 = 60,
beta_1 = -0.2,
pi_0 = 59,
pi_1 = -0.05,
delta = -13.4,
gamma = 14.7
)
baseline_param_iv |> kable()
| N | type_rain | param_rain | sigma_w | sigma_share | sigma_turnout | beta_0 | beta_1 | pi_0 | pi_1 | delta | gamma |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 30000 | gamma | 0.13, 18.00 | 0.28 | 14.2 | 14 | 60 | -0.2 | 59 | -0.05 | -13.4 | 14.7 |
Here is an example of data created with this data generating process:
| id | rain | w | sigma_rain | sigma_e | e | turnout | sigma_u | u | share |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0000003 | -0.1643113 | 6.489992 | 13.37734 | -19.446284 | 37.13834 | 13.17366 | -9.6955121 | 45.07859 |
| 2 | 0.0043522 | -0.2211127 | 6.489992 | 13.37734 | -10.655240 | 45.09419 | 13.17366 | 4.5245275 | 58.46860 |
| 3 | 0.0000000 | 0.4500603 | 6.489992 | 13.37734 | -4.925928 | 60.68996 | 13.17366 | 7.3605116 | 49.19171 |
| 4 | 4.7328542 | -0.1974975 | 6.489992 | 13.37734 | 5.971075 | 61.83122 | 13.17366 | -4.7516850 | 45.52854 |
| 5 | 0.0000863 | -0.0082812 | 6.489992 | 13.37734 | 8.021376 | 66.89964 | 13.17366 | 3.2696067 | 50.00065 |
| 6 | 0.0015507 | -0.1085520 | 6.489992 | 13.37734 | -14.067633 | 43.33657 | 13.17366 | -16.1009148 | 36.68637 |
| 7 | 0.4799312 | -0.4280953 | 6.489992 | 13.37734 | 21.885920 | 74.56892 | 13.17366 | 0.9802753 | 51.80297 |
| 8 | 0.0502115 | -0.1862775 | 6.489992 | 13.37734 | 4.311583 | 60.57079 | 13.17366 | 5.7308737 | 56.11283 |
| 9 | 0.0790413 | -0.0414453 | 6.489992 | 13.37734 | 8.452267 | 66.83907 | 13.17366 | -10.1846131 | 37.00294 |
| 10 | 0.0009223 | 0.1919409 | 6.489992 | 13.37734 | 2.604034 | 64.42552 | 13.17366 | -14.5344265 | 30.00846 |
I just quickly explore the distribution of the data for a baseline set of parameters. For this, I consider a mid-range value for IV strength (-0.5).
ex_data_iv <- baseline_param_iv |>
pmap(generate_data_iv) |>
list_rbind()
ex_data_iv |>
ggplot(aes(x = turnout)) +
geom_density() +
labs(
title = "Distribution of turnout",
x = "Turnout (in %)",
y = "Density"
) +
xlim(c(0,100))

I also check the standard deviation and means of the variables and at the same time verify that they do not do not change when we vary \(\pi_1\). They are consistent with what we wanted:
vect_pi_1 <- c(seq(0.1, 1, 0.1))
param_iv <- baseline_param_iv |>
select(-pi_1) |>
crossing(vect_pi_1) |>
rename(pi_1 = vect_pi_1)
gen_with_param <- function(...) {
generate_data_iv(...) |>
mutate(pi_1 = list(...)$pi_1)
}
ex_data_iv <- pmap(param_iv, gen_with_param) |>
list_rbind()
ex_data_mean <- ex_data_iv |>
group_by(pi_1) |>
summarise(across(.cols = c(share, turnout, rain), mean)) |>
mutate(stat = "mean")
ex_data_sd <- ex_data_iv |>
group_by(pi_1) |>
summarise(across(.cols = c(share, turnout, rain), sd)) |>
mutate(stat = "sd")
ex_data_mean |>
rbind(ex_data_sd) |>
pivot_wider(names_from = stat, values_from = c(share, turnout, rain)) |>
kable()
| pi_1 | share_mean | share_sd | turnout_mean | turnout_sd | rain_mean | rain_sd |
|---|---|---|---|---|---|---|
| 0.1 | 48.04941 | 14.28623 | 59.27723 | 14.01282 | 2.320063 | 6.574221 |
| 0.2 | 48.13388 | 14.15855 | 59.48296 | 14.05485 | 2.367790 | 6.429612 |
| 0.3 | 48.09473 | 14.22863 | 59.73053 | 13.96563 | 2.281051 | 6.138917 |
| 0.4 | 47.88660 | 14.27701 | 59.91820 | 14.01559 | 2.320802 | 6.488280 |
| 0.5 | 47.98173 | 14.14524 | 60.14044 | 13.92176 | 2.304321 | 6.308877 |
| 0.6 | 47.71045 | 14.19988 | 60.49100 | 14.00489 | 2.375658 | 6.579245 |
| 0.7 | 48.01123 | 14.15247 | 60.56042 | 13.87831 | 2.304649 | 6.445715 |
| 0.8 | 47.81367 | 14.23189 | 60.84392 | 13.94361 | 2.322503 | 6.554110 |
| 0.9 | 47.62748 | 14.09742 | 61.25774 | 13.97077 | 2.309919 | 6.503217 |
| 1.0 | 47.62345 | 14.21288 | 61.50701 | 13.96111 | 2.384839 | 6.708181 |
After generating the data, we can run an estimation. The goal is to compare the IV and the OLS for different IV strength values. Hence, I estimate both an IV and an OLS and return both set of outcomes of interest (point estimate, p-value, standard error and f-stat of the first stage).
estimate_iv <- function(data) {
# reg_iv <- AER::ivreg(data = data, formula = share ~ turnout | rain)
reg_iv <- feols(data = data, fml = share ~ 1 | 0 | turnout ~ rain)
fstat_iv <- fitstat(reg_iv, "ivf")$`ivf1::turnout`$stat
reg_iv <- reg_iv |>
broom::tidy() |>
mutate(model = "IV", fstat = fstat_iv)
reg_OLS <-
feols(data = data, fml = share ~ turnout) |>
broom::tidy() |>
mutate(model = "OLS", fstat = NA)
reg <- reg_iv |>
rbind(reg_OLS) |>
filter(term %in% c("turnout", "fit_turnout")) |>
rename(p_value = p.value, se = std.error) |>
select(estimate, p_value, se, fstat, model)
return(reg)
}
We can now run a simulation, combining generate_data_iv and estimate_iv. To do so I create the function compute_sim_iv. This simple function takes as input the various parameters. It returns a table with the estimate of the treatment, its p-value and standard error, the F-statistic for the IV, the true effect, the IV strength and the intensity of the OVB considered (delta). Note that for now, we do not store the values of the other parameters since we consider them fixed over the study.
The output of one simulation, for baseline parameters values is:
pmap(baseline_param_iv, compute_sim_iv) |>
list_rbind() |>
kable()
| estimate | p_value | se | fstat | model | pi_1 | delta | param_rain | true_effect |
|---|---|---|---|---|---|---|---|---|
| 0.1048483 | 0.8588483 | 0.5895587 | 3.196156 | IV | -0.05 | -13.4 | 0.13, 18.00 | -0.2 |
| -0.2804462 | 0.0000000 | 0.0056640 | NA | OLS | -0.05 | -13.4 | 0.13, 18.00 | -0.2 |
I then run the simulations for different sets of parameters by mapping the compute_sim_iv function on each set of parameters. I thus create a table with all the values of the parameters we want to test, param_iv. Note that in this table each set of parameters appears n_iter times as we want to run the analysis \(n_{iter}\) times for each set of parameters.
Finally, I run all the simulations by looping the compute_sim_iv function on the set of parameters param_iv.
future::plan(multisession, workers = availableCores() - 1)
sim_iv <- future_pmap(param_iv, compute_sim_iv, .progress = TRUE) |>
list_rbind()
beepr::beep()
saveRDS(sim_iv, here("Outputs/sim_iv.RDS"))
First, I quickly explore the results.
sim_iv <- readRDS(here("Outputs/sim_iv.RDS"))
sim_iv |>
filter(between(estimate, -1, 0.5)) |>
# filter(delta == sample(vect_delta, 1)) |>
filter(pi_1 %in% c(0.04, 0.2, 0.14)) |>
mutate(iv_strength = str_c("IV strength: ", pi_1)) |>
ggplot(aes(x = estimate, fill = model, color = model)) +
geom_vline(xintercept = unique(sim_iv$true_effect)) +
geom_density() +
facet_wrap(~ iv_strength) +
labs(
title = "Distribution of the estimates of the treatment effect",
subtitle = "For different IV strengths and models",
color = "",
fill = "",
x = "Estimate of the treatment effect",
y = "Density",
caption = "The vertical line represents the true effect"
) +
scale_mediocre_d()
sim_iv |>
filter(between(estimate, -1.2, 1)) |>
filter(model == "IV") |>
ggplot() +
geom_density(aes(
x = estimate,
color = as.factor(pi_1)),
alpha = 0
) +
geom_vline(xintercept = unique(sim_iv$true_effect)) +
labs(
title = "Distribution of the estimates of the treatment effect",
subtitle = "Comparison across IV strengths",
color = "IV strength",
fill = "IV strength",
x = "Estimate of the treatment effect",
y = "Density",
caption = "For readibility, extreme estimates are filtered out
The vertical line represents the true effect"
)

data_one_sim_iv <- sim_iv |>
filter(between(estimate, -0.5, 0.2)) |>
filter(pi_1 == 0.1) |>
mutate(significant = ifelse(p_value < 0.05, "Significant", "Non significant"))
data_one_sim_iv |>
ggplot(aes(x = estimate, fill = significant)) +
geom_histogram(bins = 70) +
geom_vline(xintercept = unique(sim_iv$true_effect)) +
geom_vline(xintercept = 0, linetype = "solid", alpha = 0.3) +
facet_wrap(~ model, nrow = 3) +
theme_mediocre(pal = "coty") +
labs(
title = "Distribution of the estimates of the treatment effect conditional on significativity",
subtitle = paste(
"For different models (IV strength =",
unique(data_one_sim_iv$pi_1), ")"
),
x = "Estimate of the treatment effect",
y = "Count",
fill = "",
caption = "The sample is restricted to estimates relatively close to the true value
The vertical doted line represents the true effect"
)

The OLS is always biased and that the IV is never biased. However, for limited IV strengths, the distribution of the estimates flattens. The smaller the IV strength, the most like it is to get an estimate away from the true value, even though the expected value remains equal to the true effect size.
We want to compare \(\mathbb{E}\left[ \left| \frac{\widehat{\beta_{IV}}}{\beta_1}\right|\right]\) and \(\mathbb{E}\left[ \left| \frac{\widehat{\beta_{IV}}}{\beta_1} \right| | \text{signif} \right]\). The first term represents the bias and the second term represents the exaggeration ratio.
To do so, I use the function summmarise_sim defined in the functions.R file.
To analyze the results, I build a unique and simple graph:
main_graph_iv <- summary_sim_iv |>
ggplot(aes(x = abs(pi_1), y = type_m, color = model)) +
geom_line(size = 1.2, aes(linetype = model)) +
# geom_point(size = 3) +
labs(
x = "|IV strength|",
y = expression(paste("Average ", frac("|Estimate|", "|True Effect|"))),
color = "Model",
linetype = "Model",
title = "Evolution of bias with intensity of the IV",
subtitle = "For statistically significant estimates"
) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 4)) +
scale_mediocre_d()
main_graph_iv

If IV strength is low, on average, statistically significant estimates overestimate the true effect. If the IV strength is too low, it might even be the case that the benefit of the IV is overturned by the exaggeration issue. The IV yields an unbiased estimate and enables to get rid of the OVB but such statistically significant estimates fall, on average, even further away from the true effect.
Of course, if one considers all estimates, as the IV is unbiased, this issue does not arise.
I then graph the distribution of estimates conditional on significativity. It represents the same phenomenon but with additional information.
set_mediocre_all(pal = "coty")
sim_iv |>
filter(model %in% c("IV")) |>
mutate(
signif = ifelse(p_value < 0.05, "Significant", "Non-significant"),
ratio_exagg = estimate/true_effect
) |>
group_by(pi_1, model) |>
mutate(
mean_signif = mean(ifelse(p_value < 0.05, ratio_exagg, NA), na.rm = TRUE),
mean_all = mean(ratio_exagg, na.rm = TRUE)
) |>
ungroup() |>
filter(as_factor(pi_1) %in% seq(0.02, 0.16, 0.02)) |>
ggplot(aes(x = ratio_exagg, fill = "All Estimates")) +
facet_grid(~ pi_1, switch = "x") +
geom_vline(xintercept = 1) +
geom_vline(xintercept = 0, linetype = "solid", linewidth = 0.12) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 6)) +
coord_flip() +
scale_y_continuous(breaks = NULL) +
labs(
y = "|IV strength|",
x = expression(paste(frac("Estimate", "True Effect"))),
fill = "",
title = "Distribution of estimates from 1000 simulated datasets",
subtitle = "Conditional on significativity",
# caption =
# "The green dashed line represents the result aimed for
# "The brown solid line represents the average of significant estimates"
) +
geom_histogram(bins = 100, alpha = 0.85, aes(fill = signif)) +
geom_vline(xintercept = 1) +
geom_vline(
aes(xintercept = mean_signif),
color = "#976B21",
linetype = "solid",
size = 0.5
)

To be able to determine where an actual setting could fall on the main graph, let’s discuss again what are realistic values for the strength of the IV in this setting. The exaggeration ratios we found in my simulations are as follows:
| IV strength | Exaggeration Ratio | Median SNR |
|---|---|---|
| 0.03 | 3.83 | 0.69 |
| 0.04 | 3.18 | 0.77 |
| 0.05 | 2.86 | 0.91 |
| 0.06 | 2.49 | 1.05 |
| 0.07 | 2.14 | 1.14 |
| 0.08 | 1.86 | 1.29 |
| 0.09 | 1.74 | 1.55 |
| 0.10 | 1.57 | 1.66 |
| 0.11 | 1.48 | 1.83 |
| 0.12 | 1.39 | 1.96 |
| 0.16 | 1.16 | 2.64 |
| 0.20 | 1.06 | 3.24 |
| 0.24 | 1.01 | 3.89 |
| 0.28 | 1.00 | 4.60 |
As discussed above, a typical magnitude of the first stage effect (IV strength) is around -0.05. That would be associated with an exaggeration ratio of about 2.9.
I calibrated the simulations to emulate a typical study from this literature. To further check that the results are realistic, I compare the average Signal-to-Noise Ratio (SNR) of the simulations to the range of \(t\)-stats of an existing study. In Hansford and Gomez (2010) (Table 1, column 2), the \(t\)-stat is about 1.7. However, in this specification, there are interaction terms for turnout. I replicate the results dropping the estimation terms. In that instance, the \(t\)-stats is about 3.9. While they do not compute the same 2SLS, Fujiwara, Meng, and Vogl (2016) estimate a reduced form estimate of the impact of rain on republican vote share; they get a t-stat of 1.7. (table A3).
For SNRs of the order of magnitude of the \(t\)-stats observed in the literature, exaggeration can be substantial.
# I retrieve data for Hansford and Gomez (2010) from here:
#https://myweb.fsu.edu/bgomez/research.html
library(readr)
hansford <- read_csv(here("inputs", "IV_vote", "HansfordGomez_data.csv"))
#replication
feols(
DemVoteShare2 ~ DemVoteShare2_3MA + Yr52 + Yr56 + Yr60 + Yr64 + Yr68 +
Yr72 + Yr76 + Yr80 + Yr84 + Yr88 + Yr92 + Yr96 + Yr2000 |
FIPS_County |
Turnout + GOPIT + TO_DVS23MA ~ DNormPrcp_KRIG + RainGOPI + Rain_DVS23MA,
data = hansford,
vcov = "hetero"
)
#removing interactions
feols(
DemVoteShare2 ~ DemVoteShare2_3MA + Yr52 + Yr56 + Yr60 + Yr64 + Yr68 +
Yr72 + Yr76 + Yr80 + Yr84 + Yr88 + Yr92 + Yr96 + Yr2000 |
FIPS_County |
Turnout ~ DNormPrcp_KRIG,
data = hansford,
vcov = "hetero"
)
I then run some exploratory analysis to study the link between exaggeration and F-stat (under construction).
sim_iv |>
filter(model == "IV") |>
mutate(signif = (p_value <= 0.05)) |>
ggplot(aes(x = pi_1, y = fstat, color = signif)) +
geom_point(alpha = 0.5) +
geom_jitter(alpha = 0.5) +
labs(
title = "A correlation between IV strength and F-statistic",
subtitle = "By significance",
x = "IV strength",
y = "F-statistic",
color = "Significant"
)
# ylim(c(0, 40))
# lm(data = sim_iv, fstat ~ pi_1) |>
# summary() %>%
# .$adj.r.squared
# sim_iv |>
# mutate(significant = (p_value <= 0.05)) |>
# filter(model == "IV") |>
# ggplot() +
# geom_density_ridges(aes(x = fstat, y = factor(pi_1), fill = significant, color = significant), alpha = 0.6)+
# coord_flip() +
# xlim(c(0, 50)) +
# labs(
# title = "F-statistics larger than 10, even for small IV strength",
# subtitle = "Distribution of F-statistics by IV strength and significance",
# x = "F-statistic",
# y = "IV strength",
# fill = "Significant",
# color = "Significant"
# )
# sim_iv |>
# filter(model == "IV") |>
# mutate(
# significant = (p_value <= 0.05),
# bias = abs((estimate - true_effect)/true_effect)
# ) |>
# filter(fstat > 10) |>
# ggplot(aes(x = pi_1, y = fstat, color = bias)) +
# geom_point(alpha = 0.5) +
# geom_jitter(alpha = 0.5) +
# labs(
# title = "A correlation between IV strength and F-statistic",
# subtitle = "By significance",
# x = "IV strength",
# y = "F-statistic"
# # color = "Significant"
# )
# sim_iv |>
# filter(model == "IV") |>
# mutate(
# significant = (p_value <= 0.05)
# ) |>
# group_by(pi_1) |>
# summarise(
# mean_fstat = mean(fstat),
# type_m = median(ifelse(significant, abs(estimate/true_effect), NA), na.rm = TRUE)
# ) |>
# ggplot(aes(x = pi_1, y = mean_fstat, color = type_m)) +
# geom_point() +
# # geom_jitter(alpha = 0.5) +
# labs(
# title = "A correlation between IV strength and F-statistic",
# subtitle = "By significance",
# x = "IV strength",
# y = "F-statistic"
# # color = "Significant"
# ) +
# ylim(c(0,20))
Unsurprisingly, there is a clear positive correlation between what we call IV strength and the F-statistic. I then investigate the link between exaggeration ratios and F-statistics.
sim_iv |>
filter(model == "IV") |>
mutate(significant = (p_value <= 0.05)) |>
mutate(bias = estimate/true_effect) |>
filter(fstat > 10) |>
# pull(bias) |>mean(., na.rm = TRUE)
# filter(abs(bias) < 10) |>
ggplot(aes(x = fstat, y = bias, color = significant)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 1) +
labs(
title = "Bias as a function of the F-statistic in the simulations",
subtitle = "By significance and only for F-stat above 10",
x = "F-statistic",
y = expression(paste(frac("Estimate", "True Effect"))),
color = "Significant"
)

We notice that, even when the F-statistic is greater than the usual but arbitrary threshold of 10, statistically significant estimates may, on average overestimate the true effect.
We cannot compute directly the bias of interest against the F-statistic because the F-statistic is not a parameter of the simulations and we do not control them, only the IV strength. To overcome this, I compute the median power by binned F-statistic. However, this is not correct as we end up comparing and pulling together simulations with different parameter values. I still display the graph, keeping this limitation in mind:
sim_iv |>
filter(model == "IV") |>
mutate(
significant = (p_value <= 0.05),
bin_fstat = cut_number(fstat, n = 20) |>
paste() |>
str_extract("(?<=,)(\\d|\\.)+") |>
as.numeric()
) |>
group_by(delta, bin_fstat) |>
summarise(
power = mean(significant, na.rm = TRUE)*100,
type_m = mean(ifelse(significant, abs(estimate - true_effect), NA), na.rm = TRUE),
bias_signif = mean(ifelse(significant, estimate/true_effect, NA), na.rm = TRUE),
bias_all = mean(estimate/true_effect, na.rm = TRUE),
bias_all_median = median(estimate/true_effect, na.rm = TRUE),
median_fstat = mean(fstat, na.rm = TRUE),
.groups = "drop"
) |>
ungroup() |>
ggplot(aes(x = bin_fstat, y = bias_signif)) +
geom_line(linetype = "dashed") +
geom_vline(xintercept = 10, linetype ="solid") +
xlim(c(0, 80)) +
labs(
x = "Binned F-statistic",
y = expression(paste("Average ", frac("Estimate", "True Effect"))),
color = "Model",
title = "Evolution of bias with binned F-statistic",
subtitle = "For statistically significant estimates",
caption = "Note that this graph does not acurately represent
what we are interested in"
)
