Building a simulation to explore some drivers of omitted variable bias.
Code to load the R packages used in this doc
library(tidyverse)
library(broom)
library(knitr)

set.seed(35) # fix the starting point of the random number generator

# The code below is to use my own package for ggplot themes.
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code:
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") # set theme for all the subsequent plots
Objective
Simulate fake data to explore some of the drivers of omitted variable bias.
To do so, we consider a very simple data generating process and vary some of its parameters to understand how they affect the bias. We consider the following Data Generating Process (DGP):

$$y = \alpha + \beta x + \delta w + \varepsilon$$

where $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$, $w \sim \mathcal{N}(\mu_w, \sigma_w^2)$, and $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ are drawn independently.
We want to estimate the following regression models, in which we will be interested in estimates of $\beta$:

Long regression: $y = \alpha + \beta x + \delta w + \varepsilon$

Short regression: $y = \alpha + \beta x + \nu$, where the error term $\nu = \delta w + \varepsilon$ absorbs the omitted variable.
Since we generate the data ourselves, we know what the true effect $\beta$ is and can compare our estimate to this true effect. In the short regression, we omit $w$; the regression model does not represent the DGP. We study the impact of omitting this variable on the estimate of $\beta$, $\hat{\beta}$.
No correlation
We need to generate 3 variables. We define the parameters, draw observations from the given distributions, and store them in a tibble. Note that we need to define the parameter values beforehand.
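The chunk generating the data is not displayed here, so below is a minimal sketch consistent with the regression table further down ($\alpha = 10$, $\beta = 5$, $\delta = -3$ can be read off the table; the sample size and the distribution parameters are assumptions):

Code
n <- 10000     # number of observations (assumed value)
alpha <- 10    # intercept of the DGP
beta <- 5      # true effect of x on y
delta <- -3    # effect of the omitted variable w on y
mu_x <- 1      # assumed means and standard deviations
sigma_x <- 2
mu_w <- 0.5
sigma_w <- 0.8
sigma_e <- 3

data <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = rnorm(n, mu_w, sigma_w),
  e = rnorm(n, 0, sigma_e)
) |>
  mutate(y = alpha + beta*x + delta*w + e)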
We can quickly look at the distributions of the variables we generated and at the relationship between x and y.
Code
data |>
  pivot_longer(cols = c(x, y, w)) |>
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~ name, scales = "free") +
  labs(title = "Distribution of x, w, and y")
Code
data |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = "y ~ x") +
  labs(title = "Relationship between x and y")
We then estimate the long and short regressions via OLS and compute the bias.
reg_incl <- lm(data = data, y ~ x + w)
# summary(reg_incl)

reg_incl |>
  broom::tidy() |>
  knitr::kable()
| term        |  estimate | std.error | statistic | p.value |
|:------------|----------:|----------:|----------:|--------:|
| (Intercept) | 10.028072 | 0.0569289 | 176.15068 |       0 |
| x           |  4.987800 | 0.0162203 | 307.50421 |       0 |
| w           | -3.009797 | 0.0405145 | -74.28944 |       0 |
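The chunk estimating the short regression is not displayed either; a minimal sketch (the object name reg_ov matches its use in the bias computation below):

Code
# short regression, omitting the variable w
reg_ov <- lm(data = data, y ~ x)
# summary(reg_ov)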
bias <- reg_ov$coefficients[["x"]] - beta
The “bias” is about 0.036 for a true effect size of 5. I put bias in quotes because the bias is defined as an expectation, $\mathbb{E}[\hat{\beta}] - \beta$; here we only have one realization of $\hat{\beta}$. To compute a numeric approximation of this expectation, we need to generate many different data sets and run the analysis on each of them:
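The simulation chunk is not shown; here is a minimal sketch reusing the parameters defined in the data-generating sketch above (the number of iterations is an assumption):

Code
n_iter <- 1000 # number of simulated data sets (assumed value)
biases <- vector("numeric", n_iter)

for (i in 1:n_iter) {
  data_iter <- tibble(
    x = rnorm(n, mu_x, sigma_x),
    w = rnorm(n, mu_w, sigma_w),
    e = rnorm(n, 0, sigma_e)
  ) |>
    mutate(y = alpha + beta*x + delta*w + e)

  reg_iter <- lm(data = data_iter, y ~ x)
  biases[i] <- reg_iter$coefficients[["x"]] - beta
}

mean(biases)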
The bias is thus indeed basically 0 ($3.5 \times 10^{-4}$). This is not surprising: the omitted variable $w$ is correlated with $y$ but not with $x$.
Correlation between x and w
Let’s thus change the data generating process so that $\text{cov}(x, w) \neq 0$, setting $x = \mu_x + \gamma w + u$ with $u \sim \mathcal{N}(0, \sigma_u^2)$. Note that we first need to generate $w$ and $u$ before we are able to generate $x$. Let’s also remove the for loop for simplicity.
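The corresponding chunk is not displayed; a minimal sketch (the values of gamma and sigma_u are assumptions):

Code
gamma <- 2   # strength of the dependence of x on w (assumed value)
sigma_u <- 1 # standard deviation of u (assumed value)

data_corr <- tibble(
  w = rnorm(n, mu_w, sigma_w),
  u = rnorm(n, 0, sigma_u),
  e = rnorm(n, 0, sigma_e)
) |>
  mutate(
    x = mu_x + gamma*w + u,
    y = alpha + beta*x + delta*w + e
  )

reg_corr <- lm(data = data_corr, y ~ x)
reg_corr$coefficients[["x"]] - beta # the bias for this draw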
We can then play with the values of the parameters, as sketched after the list below, and notice some regularities:
If $\gamma$ and $\delta$ have the same sign, the bias is positive. If they have opposite signs, it is negative.
The magnitude of the bias seems to increase with the values of $\delta$, $\gamma$, and $\sigma_w$. It seems to decrease with $\sigma_u$.
These results actually make sense.
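To check these regularities, one can rerun the estimation for different parameter values. A minimal sketch, wrapping the steps above in a hypothetical helper function (the parameter values passed below are arbitrary):

Code
# hypothetical helper: regenerate the data and return the bias of the short regression
compute_bias <- function(gamma, delta) {
  data_g <- tibble(
    w = rnorm(n, mu_w, sigma_w),
    u = rnorm(n, 0, sigma_u),
    e = rnorm(n, 0, sigma_e)
  ) |>
    mutate(
      x = mu_x + gamma*w + u,
      y = alpha + beta*x + delta*w + e
    )
  reg <- lm(data = data_g, y ~ x)
  reg$coefficients[["x"]] - beta
}

compute_bias(gamma = 2, delta = -3)  # opposite signs: the bias is negative
compute_bias(gamma = -2, delta = -3) # same sign: the bias is positive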
Conclusion
Thanks to this very quick analysis, we were able to get an idea of how the bias behaves without deriving the maths, although they are relatively straightforward in this case ($\text{bias} = \delta \gamma \frac{\sigma_w^2}{\gamma^2 \sigma_w^2 + \sigma_u^2}$).
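For reference, a short sketch of where this formula comes from, under the DGP used here (with $x = \mu_x + \gamma w + u$):

$$\hat{\beta} \overset{p}{\longrightarrow} \beta + \delta \, \frac{\text{cov}(x, w)}{\text{var}(x)} = \beta + \delta \, \frac{\gamma \sigma_w^2}{\gamma^2 \sigma_w^2 + \sigma_u^2}$$

The sign of the bias is the sign of $\delta \gamma$, and its magnitude increases with $\sigma_w$ but decreases with $\sigma_u$, matching the regularities observed in the simulations.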
Exercise (optional)
Let’s keep on exploring this to better understand the role of the parameters.
Additional exercise
How do these parameters affect the variance of the estimator? Why? Think about the mathematical expression of the variance of an OLS estimator. Pay attention to the impact these changes have on the variance of $x$.
Play around with all the parameters and explore how they affect the graph describing the relationship between $x$ and $y$.
Wrap your code into functions: one function to generate the data, one to estimate the model, and one to run a one-iteration simulation.
Build a graph representing the bias as a function of various parameters (e.g. $\gamma$ or $\delta$).