Building a simulation to explore the impact of failure of modelling assumptions.
Code to load the R packages used in this doc
```r
library(tidyverse)
library(broom)
library(knitr)
library(modelsummary)

set.seed(24) # fix the starting point of the random number generator

# The code below is to use my own package for ggplot themes.
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code:
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") # set theme for all the subsequent plots
```
Objective
Simulate fake data to explore some of the drivers of omitted variable bias.
Important
Play around with the values of the parameters, run each simulation several times, etc.
Modelling
Let’s first explore what happens when some of the Gauss-Markov (and normality) assumptions are violated. In each specification we estimate the same simple model:

$$y_i = \alpha + \beta x_i + u_i$$
However, in each case, to emulate the failure of each assumption, we consider a true Data Generating Process (DGP) that differs from the econometric model we estimate. For simplicity, unless otherwise mentioned, we assume that variables are normally distributed.
Non-linearity
Let’s assume that the DGP is actually as follows, while still estimating the model specified above:

$$y_i = \alpha + \beta x_i^2 + u_i$$
```r
n <- 1000
alpha <- 10
beta <- 2
mu_x <- 2
sigma_x <- 1
sigma_u <- 2

data_non_linear <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x^2 + u
)

reg_non_linear <- lm(y ~ x, data = data_non_linear)

list("Non-linear" = reg_non_linear) |>
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
```
|             | Non-linear |
|-------------|------------|
| (Intercept) | 4.484      |
|             | (0.251)    |
| x           | 7.763      |
|             | (0.114)    |
| Num.Obs.    | 1000       |
| R2          | 0.822      |
While the true effect of interest, $\beta$, is 2, the estimate we get is 7.7626693. Let’s explore what is actually happening:
```r
data_non_linear |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x') +
  labs(title = "The functional form of the model does not represent the DGP")
```
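As a sanity check (this block is not in the original; it is a minimal base-R sketch reusing the same parameter values), estimating a specification that matches the true functional form should recover $\beta = 2$:

```r
set.seed(24)

# Same parameter values as above (base R only, so this runs without tidyverse)
n <- 1000
alpha <- 10
beta <- 2
x <- rnorm(n, mean = 2, sd = 1)
u <- rnorm(n, mean = 0, sd = 2)
y <- alpha + beta * x^2 + u

# Fitting the correct functional form recovers the true beta
reg_correct <- lm(y ~ I(x^2))
coef(reg_correct)
```

The coefficient on `I(x^2)` lands very close to 2, confirming that the bias above comes purely from the misspecified functional form.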
Collinearity
Perfect collinearity
To explore the impact of perfect collinearity, we need to assume that there is another explanatory variable (e.g. $x_2$). We then try to estimate a different model that actually represents the true DGP: $y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i$.
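A minimal sketch of what perfect collinearity does in practice (the construction $x_2 = 2 x_1$ and the coefficient values here are illustrative assumptions, not taken from the original): when one regressor is an exact linear function of another, `lm()` cannot separately identify the two coefficients and drops one regressor.

```r
set.seed(24)

n <- 1000
x1 <- rnorm(n, 2, 1)
x2 <- 2 * x1  # x2 is an exact linear function of x1: perfect collinearity
u <- rnorm(n, 0, 2)
y <- 10 + 1 * x1 + 1 * x2 + u  # illustrative coefficient values

# lm() detects that x2 is aliased with x1 and reports NA for its coefficient:
# beta_1 and beta_2 cannot be separately identified
reg_collinear <- lm(y ~ x1 + x2)
coef(reg_collinear)
```

Note that the reported coefficient on `x1` is then the combined effect ($\beta_1 + 2\beta_2$ here), not $\beta_1$ itself.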
Heteroskedasticity

Not accounting for heteroskedasticity leads to inaccurate standard errors for $\hat{\beta}$.
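The code generating `data_heterosked` does not appear in this excerpt; below is a minimal sketch of a heteroskedastic DGP, assuming the error’s standard deviation grows with $|x|$ (the scaling and parameter values are illustrative, not the author’s):

```r
set.seed(24)

# Illustrative parameter values; the error's standard deviation scales with |x|
n <- 1000
x <- rnorm(n, 2, 1)
u <- rnorm(n, mean = 0, sd = 0.5 * abs(x))  # variance increases with x
y <- 10 + 2 * x + u

data_heterosked <- data.frame(x = x, y = y, u = u)
```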
Let’s have a look at the raw data:
```r
data_heterosked |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x') +
  labs(title = "The variance of the residual increases with x")
```
Non-normal errors
Let’s finally consider non-normal errors and assume that they follow a Student-t distribution (i.e. one with fatter tails than the normal).
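A minimal sketch of such a DGP (the degrees of freedom and parameter values are my assumptions, not the author’s): the errors are drawn with `rt()` instead of `rnorm()`.

```r
set.seed(24)

# Illustrative parameter values; errors drawn from a Student-t instead of a normal
n <- 1000
x <- rnorm(n, 2, 1)
u <- rt(n, df = 3)  # 3 degrees of freedom: much fatter tails than the normal
y <- 10 + 2 * x + u

# OLS remains unbiased here; normality of errors is not needed for unbiasedness,
# only for exact finite-sample inference
reg_t <- lm(y ~ x)
coef(reg_t)
```

Even with fat-tailed errors, the slope estimate stays close to the true value of 2 at this sample size; the occasional extreme draws mostly affect the spread of the estimates across simulations.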