Building a simulation to explore some drivers of omitted variable bias.
Code to load the R packages used in this doc
library(tidyverse)
library(broom)
library(knitr)

set.seed(35) # fix the starting point of the random number generator

# The code below is to use my own package for ggplot themes.
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code:
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") # set theme for all the subsequent plots
Objective
Simulate fake data to explore some of the drivers of omitted variable bias.
To do so, we consider a very simple data generating process and vary some of its parameters to understand how they affect the bias. We consider the following Data Generating Process (DGP):

$$y = \alpha + \beta x + \delta w + \varepsilon$$

where $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$, $w \sim \mathcal{N}(\mu_w, \sigma_w^2)$, and $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ are drawn independently.
We want to estimate the following regression models, in which we will be interested in estimates of $\beta$:

Long regression: $y = \alpha + \beta x + \delta w + \varepsilon$

Short regression: $y = \alpha + \beta x + \nu$, where the error term $\nu = \delta w + \varepsilon$ absorbs the omitted variable.
Since we generate the data ourselves, we know what the true effect $\beta$ is and can compare our estimate to this true effect. In the short regression, we omit $w$; the regression model does not represent the DGP. We study the impact of omitting this variable on the estimate of $\beta$, $\hat{\beta}$.
No correlation
We need to generate 3 variables. We define the parameters, draw observations from the given distributions, and store them in a tibble. Note that we need to define the parameter values beforehand.
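The chunk generating the data is not displayed here, so below is a minimal sketch consistent with the regression table further down ($\alpha = 10$, $\beta = 5$, $\delta = -3$ can be read off the table; the sample size and the distribution parameters are assumptions):

Code
n <- 10000     # number of observations (assumed value)
alpha <- 10    # intercept of the DGP
beta <- 5      # true effect of x on y
delta <- -3    # effect of the omitted variable w on y
mu_x <- 1      # assumed means and standard deviations
sigma_x <- 2
mu_w <- 0.5
sigma_w <- 0.8
sigma_e <- 3

data <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = rnorm(n, mu_w, sigma_w),
  e = rnorm(n, 0, sigma_e)
) |>
  mutate(y = alpha + beta*x + delta*w + e)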
We can quickly look at the distributions of the variables we generated and at the relationship between x and y.
Code
data |>
  pivot_longer(cols = c(x, y, w)) |>
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~ name, scales = "free") +
  labs(title = "Distribution of x, w, and y")
Code
data |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x") +
  labs(title = "Relationship between x and y")
We then estimate the long and short regressions via OLS and compute the bias.
reg_incl <- lm(data = data, y ~ x + w)
# summary(reg_incl)

reg_incl |>
  broom::tidy() |>
  knitr::kable()
| term        |  estimate | std.error | statistic | p.value |
|:------------|----------:|----------:|----------:|--------:|
| (Intercept) | 10.028072 | 0.0569289 | 176.15068 |       0 |
| x           |  4.987800 | 0.0162203 | 307.50421 |       0 |
| w           | -3.009797 | 0.0405145 | -74.28944 |       0 |
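The chunk estimating the short regression is not displayed either; a minimal sketch (the object name reg_ov matches its use in the bias computation below):

Code
# short regression, omitting the variable w
reg_ov <- lm(data = data, y ~ x)
# summary(reg_ov)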
bias <- reg_ov$coefficients[["x"]] - beta
The “bias” is about 0.036 for a true effect size of 5. I put bias in quotes because the bias is defined as an expectation, $\mathbb{E}[\hat{\beta}] - \beta$; here we only have one realization of $\hat{\beta}$. To compute a numeric approximation of this expectation, we need to generate many different data sets and run the analysis on each of them:
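The simulation chunk is not shown; here is a minimal sketch reusing the parameters defined in the data-generating sketch above (the number of iterations is an assumption):

Code
n_iter <- 1000 # number of simulated data sets (assumed value)
biases <- vector("numeric", n_iter)

for (i in 1:n_iter) {
  data_iter <- tibble(
    x = rnorm(n, mu_x, sigma_x),
    w = rnorm(n, mu_w, sigma_w),
    e = rnorm(n, 0, sigma_e)
  ) |>
    mutate(y = alpha + beta*x + delta*w + e)

  reg_iter <- lm(data = data_iter, y ~ x)
  biases[i] <- reg_iter$coefficients[["x"]] - beta
}

mean(biases)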
The bias is thus indeed basically 0 ($3.5 \times 10^{-4}$). This is not surprising: the omitted variable $w$ is correlated with $y$ but not with $x$.
Correlation between x and w
Let’s thus change the data generating process so that $\text{cov}(x, w) \neq 0$, setting $x = \mu_x + \gamma w + u$ with $u \sim \mathcal{N}(0, \sigma_u^2)$. Note that we first need to generate $w$ and $u$ before we are able to generate $x$. Let’s also remove the for loop for simplicity.
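The corresponding chunk is not displayed; a minimal sketch (the values of gamma and sigma_u are assumptions):

Code
gamma <- 2   # strength of the dependence of x on w (assumed value)
sigma_u <- 1 # standard deviation of u (assumed value)

data_corr <- tibble(
  w = rnorm(n, mu_w, sigma_w),
  u = rnorm(n, 0, sigma_u),
  e = rnorm(n, 0, sigma_e)
) |>
  mutate(
    x = mu_x + gamma*w + u,
    y = alpha + beta*x + delta*w + e
  )

reg_corr <- lm(data = data_corr, y ~ x)
reg_corr$coefficients[["x"]] - beta # the bias for this draw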
We can then play with the values of the parameters, as sketched after the list below, and notice some regularities:
If $\gamma$ and $\delta$ have the same sign, the bias is positive. If they have opposite signs, it is negative.
The magnitude of the bias seems to increase with the values of $\delta$, $\gamma$, and $\sigma_w$. It seems to decrease with $\sigma_u$.
These results actually make sense.
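To check these regularities, one can rerun the estimation for different parameter values. A minimal sketch, wrapping the steps above in a hypothetical helper function (the parameter values passed below are arbitrary):

Code
# hypothetical helper: regenerate the data and return the bias of the short regression
compute_bias <- function(gamma, delta) {
  data_g <- tibble(
    w = rnorm(n, mu_w, sigma_w),
    u = rnorm(n, 0, sigma_u),
    e = rnorm(n, 0, sigma_e)
  ) |>
    mutate(
      x = mu_x + gamma*w + u,
      y = alpha + beta*x + delta*w + e
    )
  reg <- lm(data = data_g, y ~ x)
  reg$coefficients[["x"]] - beta
}

compute_bias(gamma = 2, delta = -3)  # opposite signs: the bias is negative
compute_bias(gamma = -2, delta = -3) # same sign: the bias is positive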
Conclusion
Thanks to this very quick analysis, we were able to get an idea of how the bias behaves without deriving the maths, although they are relatively straightforward in this case ($\text{bias} = \delta \gamma \frac{\sigma_w^2}{\gamma^2 \sigma_w^2 + \sigma_u^2}$).
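For reference, a short sketch of where this formula comes from, under the DGP used here (with $x = \mu_x + \gamma w + u$):

$$\hat{\beta} \overset{p}{\longrightarrow} \beta + \delta \, \frac{\text{cov}(x, w)}{\text{var}(x)} = \beta + \delta \, \frac{\gamma \sigma_w^2}{\gamma^2 \sigma_w^2 + \sigma_u^2}$$

The sign of the bias is the sign of $\delta \gamma$, and its magnitude increases with $\sigma_w$ but decreases with $\sigma_u$, matching the regularities observed in the simulations.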
Exercise (optional)
Let’s keep on exploring this to better understand the role of the parameters.
Additional exercise
How do these parameters affect the variance of the estimator? Why? Think about the mathematical expression of the variance of an OLS estimator. Pay attention to the impact these changes have on the variance of $x$.
Play around with all the parameters and explore how they affect the graph describing the relationship between $x$ and $y$.
Wrap your code into functions: one function to generate the data, one to estimate the model, and one to run a one-iteration simulation.
Build a graph representing the bias as a function of various parameters (e.g. $\gamma$ or $\delta$).