Grasping the impact of modelling assumptions

Building a simulation to explore the consequences of violations of modelling assumptions.

Code to load the R packages used in this doc
library(tidyverse)
library(broom)
library(knitr)
library(modelsummary)

set.seed(24) #fix the starting point of the random numbers generator

# The code below is to use my own package for ggplot themes
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") #set theme for all the subsequent plots
Objective

Simulate fake data to explore what happens when some standard modelling assumptions are violated.

Important

Play around with the values of the parameters, run each simulation several times, etc.

Modelling

Let’s first explore what happens when some of the Gauss-Markov (and normality) assumptions are violated. In each specification we estimate the same simple model:

$$y = \alpha + \beta x + e$$

However, in each case, to emulate the failure of each assumption, we consider a true Data Generating Process (DGP) that differs from the econometric model we estimate. For simplicity, unless otherwise mentioned, we assume that variables are normally distributed.1

Non-linearity

Let’s assume that the DGP is actually as follows but estimate the model specified above:

$$y = \alpha_0 + \beta_0 x^2 + u$$

n <- 1000
alpha <- 10
beta <- 2
mu_x <- 2
sigma_x <- 1
sigma_u <- 2

data_non_linear <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x^2 + u  
)

reg_non_linear <- lm(y ~ x, data = data_non_linear) 

list("Non-linear" = reg_non_linear) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Non-linear
(Intercept)   4.484
              (0.251)
x             7.763
              (0.114)
Num.Obs.      1000
R2            0.822

While the true effect of interest, $\beta$, is 2, the estimate we get is about 7.76. Let's explore what is actually happening:

Code
data_non_linear |> 
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = 'y ~ x') +
  labs(title = "The functional form of the model does not represent the DGP")

Collinearity

Perfect collinearity

To explore the impact of perfect collinearity, we need to assume that there is another explanatory variable (eg $w = 0.3 \times x$). We then try to estimate a different model that actually represents the true DGP: $y = \alpha + \beta x + \gamma w + u$.

gamma <- 0.2

data_collin <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = 0.3*x,
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x + gamma*w + u  
)

reg_collin <- lm(y ~ x + w, data = data_collin)

list("Perfect collin." = reg_collin) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Perfect collin.
(Intercept)   9.820
              (0.142)
x             2.138
              (0.062)
Num.Obs.      1000
R2            0.542

The parameter for $w$ is not estimable because of perfect collinearity (cf the message in the console): lm simply drops the variable from the regression.
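
To see this explicitly (a small check added here, not part of the original output), we can inspect the fitted coefficients: the one on w is returned as NA, and alias() reports the exact linear dependence.

coef(reg_collin) #the coefficient on w is NA: lm dropped it
alias(reg_collin) #reports that w is an exact linear combination of x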

Almost-perfect collinearity

Let’s then explore almost-perfect collinearity (where $w = 0.3 \times x + \epsilon$ and $\epsilon$ is a small error).

data_almost_collin <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = 0.3*x + rnorm(n, 0, 0.01),
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x + gamma*w + u  
)

reg_almost_collin <- lm(y ~ x + w, data = data_almost_collin)

list("Almost perfect collin." = reg_almost_collin) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Almost perfect collin.
(Intercept)   9.777
              (0.134)
x             3.453
              (1.798)
w             -4.346
              (6.006)
Num.Obs.      1000
R2            0.562

Almost-perfect collinearity does not bias the OLS estimate of $\beta$ but makes it very imprecise: OLS struggles to separate the effect of $x$ from that of $w$, so the standard errors on both coefficients blow up.
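
A small Monte Carlo sketch (an addition here, with an arbitrary number of replications) illustrates this: across repeated draws, the estimates of $\beta$ are centred close to 2 but vary widely.

#re-run the almost-perfect collinearity simulation many times
#and look at the distribution of the estimates of beta
estimate_beta_once <- function() {
  data_sim <- tibble(
    x = rnorm(n, mu_x, sigma_x),
    w = 0.3*x + rnorm(n, 0, 0.01),
    u = rnorm(n, 0, sigma_u),
    y = alpha + beta*x + gamma*w + u
  )
  coef(lm(y ~ x + w, data = data_sim))[["x"]]
}

estimates_beta <- replicate(200, estimate_beta_once())
mean(estimates_beta) #close to the true beta = 2 on average
sd(estimates_beta) #but with a very large spread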

Endogeneity

Let’s now assume $E[u_i \mid X_i] \neq 0$, ie that $u$ is correlated with $x$ (by adding a $0.5 \times x$ component to $u$):

data_endog <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = 0.5 * x + rnorm(n, 0, sigma_u),
  y = alpha + beta*x + u  
)

reg_endog <- lm(y ~ x, data = data_endog)

list("Endogeneity" = reg_endog) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Endogeneity
(Intercept)   10.153
              (0.138)
x             2.403
              (0.060)
Num.Obs.      1000
R2            0.614

While the true effect of interest, $\beta$, is 2, the estimate we get is about 2.40. There is bias: this is the omitted variable bias (OVB) we explored in the first assignment. Since $u$ contains a $0.5 \times x$ component, the OLS slope actually converges to $\beta + 0.5 = 2.5$.
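
Since we generated $u$ ourselves, we can compute this bias term directly (a small check added here, not part of the original output):

with(data_endog, cov(x, u) / var(x)) #should be close to 0.5, the coefficient on x in u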

Autocorrelation

Let’s assume that errors are not independent (eg that there is autocorrelation). Each error term depends on the previous error term: $u_i = 0.9 u_{i-1} + \tilde{u}$ where $\tilde{u} \sim \mathcal{N}(0, \sigma_u^2)$.

u_auto <- numeric(n) #vector of AR(1) errors; u_auto[1] starts at 0
for (i in 2:n) u_auto[i] <- 0.9*u_auto[i-1] + rnorm(1, 0, sigma_u)

data_auto <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = u_auto,
  y = alpha + beta*x + u
) 

reg_auto <- lm(y ~ x, data = data_auto) 

reg_auto |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "HAC"))
              (1)       (2)
(Intercept)   10.103    10.103
              (0.317)   (0.620)
x             2.004     2.004
              (0.140)   (0.152)
Num.Obs.      1000      1000
R2            0.170     0.170
Std.Errors    IID       HAC

Assuming iid errors leads to inaccurate (here, too small) standard errors for $\widehat{\beta}$.
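
An equivalent way to obtain HAC standard errors directly (one of several possible approaches, shown here as an illustration; the exact numbers may differ slightly from modelsummary's depending on kernel and bandwidth choices) is via the sandwich and lmtest packages:

library(sandwich)
library(lmtest)

#Newey-West (HAC) standard errors computed on the same regression
coeftest(reg_auto, vcov = NeweyWest(reg_auto))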

Heteroskedasticity

Let’s now assume that the variance of the residuals depends on $x$, ie that $u \sim \mathcal{N}(0, (\sigma_u + x^2)^2)$:

data_heterosked <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u + x^2),
  y = alpha + beta*x + u  
)

reg_heterosked <- lm(y ~ x, data = data_heterosked) 

reg_heterosked |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "robust"))
              (1)       (2)
(Intercept)   10.919    10.919
              (0.573)   (0.607)
x             1.411     1.411
              (0.259)   (0.390)
Num.Obs.      1000      1000
R2            0.029     0.029
Std.Errors    IID       HC3

Not accounting for heteroskedasticity leads to inaccurate (here, too small) standard errors for $\widehat{\beta}$.

Let’s have a look at the raw data:

Code
data_heterosked |> 
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = 'y ~ x') +
  labs(title = "The variance of the residual increases with x")

Non-normal errors

Let’s finally consider non-normal errors and assume that they follow a Student-t distribution (ie with fatter tails).

df <- 2

data_non_normal <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rt(n, df), #fatter tails
  y = alpha + beta*x + u  
)

reg_non_normal <- lm(y ~ x, data = data_non_normal)

reg_non_normal |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "bootstrap"))
              (1)       (2)
(Intercept)   10.030    10.030
              (0.190)   (0.180)
x             2.026     2.026
              (0.086)   (0.076)
Num.Obs.      1000      1000
R2            0.357     0.357
Std.Errors    IID       Bootstrap
Not accounting for non-normal errors leads to inaccurate standard errors for $\widehat{\beta}$ in small samples; with $n = 1000$ as here, the iid and bootstrap standard errors are already very close.
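
To see the small-sample point, one could rerun the same simulation with a much smaller sample (a sketch with an arbitrary sample size of 20) and compare the two sets of standard errors again; the discrepancy is typically more noticeable.

n_small <- 20 #arbitrary small sample size

data_non_normal_small <- tibble(
  x = rnorm(n_small, mu_x, sigma_x),
  u = rt(n_small, df), #fatter tails
  y = alpha + beta*x + u  
)

reg_non_normal_small <- lm(y ~ x, data = data_non_normal_small)

reg_non_normal_small |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "bootstrap"))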

Footnotes

  1. You are of course more than welcome to play around with this assumption.