Grasping the impact of modelling assumptions

Building a simulation to explore the consequences of violations of modelling assumptions.

Code to load the R packages used in this doc
library(tidyverse)
library(broom)
library(knitr)
library(modelsummary)

set.seed(24) #fix the starting point of the random numbers generator

# The code below is to use my own package for ggplot themes
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") #set theme for all the subsequent plots
Objective

Simulate fake data to explore what happens when some standard modelling assumptions are violated.

Important

Play around with the values of the parameters, run each simulation several times, etc.

Modelling

Let’s first explore what happens when some of the Gauss-Markov (and normality) assumptions are violated. In each specification we estimate the same simple model:

$$y = \alpha + \beta x + e$$

However, in each case, to emulate the failure of each assumption, we consider a true Data Generating Process (DGP) that differs from the econometric model we estimate. For simplicity, unless otherwise mentioned, we assume that variables are normally distributed.1

Non-linearity

Let’s assume that the DGP is actually as follows but estimate the model specified above:

$$y = \alpha_0 + \beta_0 x^2 + u$$

n <- 1000
alpha <- 10
beta <- 2
mu_x <- 2
sigma_x <- 1
sigma_u <- 2

data_non_linear <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x^2 + u  
)

reg_non_linear <- lm(y ~ x, data = data_non_linear) 

list("Non-linear" = reg_non_linear) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Non-linear
(Intercept)   4.484
              (0.251)
x             7.763
              (0.114)
Num.Obs.      1000
R2            0.822

While the true effect of interest, $\beta$, is 2, the estimate we get is about 7.76. Let's explore what is actually happening:

Code
data_non_linear |> 
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = 'y ~ x') +
  labs(title = "The functional form of the model does not represent the DGP")

Collinearity

Perfect collinearity

To explore the impact of perfect collinearity, we need to assume that there is another explanatory variable (eg $w = 0.3 \times x$). We then try to estimate a different model that actually represents the true DGP: $y = \alpha + \beta x + \gamma w + u$.

gamma <- 0.2

data_collin <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = 0.3*x,
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x + gamma*w + u  
)

reg_collin <- lm(y ~ x + w, data = data_collin)

list("Perfect collin." = reg_collin) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Perfect collin.
(Intercept)   9.820
              (0.142)
x             2.138
              (0.062)
Num.Obs.      1000
R2            0.542

The parameter for $w$ is not estimable because of perfect collinearity (cf the message in the console): lm simply drops the variable from the regression.
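
To see this explicitly (a small check added here, not part of the original output), we can inspect the fitted coefficients: the one on w is returned as NA, and alias() reports the exact linear dependence.

coef(reg_collin) #the coefficient on w is NA: lm dropped it
alias(reg_collin) #reports that w is an exact linear combination of x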

Almost-perfect collinearity

Let’s then explore almost-perfect collinearity (where $w = 0.3 \times x + \epsilon$ and $\epsilon$ is a small error).

data_almost_collin <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = 0.3*x + rnorm(n, 0, 0.01),
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x + gamma*w + u  
)

reg_almost_collin <- lm(y ~ x + w, data = data_almost_collin)

list("Almost perfect collin." = reg_almost_collin) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Almost perfect collin.
(Intercept)   9.777
              (0.134)
x             3.453
              (1.798)
w             -4.346
              (6.006)
Num.Obs.      1000
R2            0.562

Almost-perfect collinearity does not bias the OLS estimate of $\beta$ but makes it very imprecise: OLS struggles to separate the effect of $x$ from that of $w$, so the standard errors on both coefficients blow up.
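
A small Monte Carlo sketch (an addition here, with an arbitrary number of replications) illustrates this: across repeated draws, the estimates of $\beta$ are centred close to 2 but vary widely.

#re-run the almost-perfect collinearity simulation many times
#and look at the distribution of the estimates of beta
estimate_beta_once <- function() {
  data_sim <- tibble(
    x = rnorm(n, mu_x, sigma_x),
    w = 0.3*x + rnorm(n, 0, 0.01),
    u = rnorm(n, 0, sigma_u),
    y = alpha + beta*x + gamma*w + u
  )
  coef(lm(y ~ x + w, data = data_sim))[["x"]]
}

estimates_beta <- replicate(200, estimate_beta_once())
mean(estimates_beta) #close to the true beta = 2 on average
sd(estimates_beta) #but with a very large spread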

Endogeneity

Let’s now assume $E[u_i \mid X_i] \neq 0$, ie that $u$ is correlated with $x$ (by adding a $0.5 \times x$ component to $u$):

data_endog <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = 0.5 * x + rnorm(n, 0, sigma_u),
  y = alpha + beta*x + u  
)

reg_endog <- lm(y ~ x, data = data_endog)

list("Endogeneity" = reg_endog) |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
              Endogeneity
(Intercept)   10.153
              (0.138)
x             2.403
              (0.060)
Num.Obs.      1000
R2            0.614

While the true effect of interest, $\beta$, is 2, the estimate we get is about 2.40. There is bias: this is the omitted variable bias (OVB) we explored in the first assignment. Since $u$ contains a $0.5 \times x$ component, the OLS slope actually converges to $\beta + 0.5 = 2.5$.
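
Since we generated $u$ ourselves, we can compute this bias term directly (a small check added here, not part of the original output):

with(data_endog, cov(x, u) / var(x)) #should be close to 0.5, the coefficient on x in u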

Autocorrelation

Let’s assume that errors are not independent (eg that there is autocorrelation). Each error term depends on the previous error term: $u_i = 0.9 u_{i-1} + \tilde{u}$ where $\tilde{u} \sim \mathcal{N}(0, \sigma_u^2)$.

u_auto <- numeric(n) #vector of AR(1) errors; u_auto[1] starts at 0
for (i in 2:n) u_auto[i] <- 0.9*u_auto[i-1] + rnorm(1, 0, sigma_u)

data_auto <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = u_auto,
  y = alpha + beta*x + u
) 

reg_auto <- lm(y ~ x, data = data_auto) 

reg_auto |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "HAC"))
              (1)       (2)
(Intercept)   10.103    10.103
              (0.317)   (0.620)
x             2.004     2.004
              (0.140)   (0.152)
Num.Obs.      1000      1000
R2            0.170     0.170
Std.Errors    IID       HAC

Assuming iid errors leads to inaccurate (here, too small) standard errors for $\widehat{\beta}$.
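
An equivalent way to obtain HAC standard errors directly (one of several possible approaches, shown here as an illustration; the exact numbers may differ slightly from modelsummary's depending on kernel and bandwidth choices) is via the sandwich and lmtest packages:

library(sandwich)
library(lmtest)

#Newey-West (HAC) standard errors computed on the same regression
coeftest(reg_auto, vcov = NeweyWest(reg_auto))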

Heteroskedasticity

Let’s now assume that the variance of the residuals depends on $x$, ie that $u \sim \mathcal{N}(0, (\sigma_u + x^2)^2)$:

data_heterosked <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u + x^2),
  y = alpha + beta*x + u  
)

reg_heterosked <- lm(y ~ x, data = data_heterosked) 

reg_heterosked |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "robust"))
              (1)       (2)
(Intercept)   10.919    10.919
              (0.573)   (0.607)
x             1.411     1.411
              (0.259)   (0.390)
Num.Obs.      1000      1000
R2            0.029     0.029
Std.Errors    IID       HC3

Not accounting for heteroskedasticity leads to inaccurate (here, too small) standard errors for $\widehat{\beta}$.

Let’s have a look at the raw data:

Code
data_heterosked |> 
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = 'y ~ x') +
  labs(title = "The variance of the residual increases with x")

Non-normal errors

Let’s finally consider non-normal errors and assume that they follow a Student-t distribution (ie with fatter tails).

df <- 2

data_non_normal <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rt(n, df), #fatter tails
  y = alpha + beta*x + u  
)

reg_non_normal <- lm(y ~ x, data = data_non_normal)

reg_non_normal |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "bootstrap"))
              (1)       (2)
(Intercept)   10.030    10.030
              (0.190)   (0.180)
x             2.026     2.026
              (0.086)   (0.076)
Num.Obs.      1000      1000
R2            0.357     0.357
Std.Errors    IID       Bootstrap
Not accounting for non-normal errors leads to inaccurate standard errors for $\widehat{\beta}$ in small samples; with $n = 1000$ as here, the iid and bootstrap standard errors are already very close.
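
To see the small-sample point, one could rerun the same simulation with a much smaller sample (a sketch with an arbitrary sample size of 20) and compare the two sets of standard errors again; the discrepancy is typically more noticeable.

n_small <- 20 #arbitrary small sample size

data_non_normal_small <- tibble(
  x = rnorm(n_small, mu_x, sigma_x),
  u = rt(n_small, df), #fatter tails
  y = alpha + beta*x + u  
)

reg_non_normal_small <- lm(y ~ x, data = data_non_normal_small)

reg_non_normal_small |> 
  modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "bootstrap"))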

Footnotes

  1. You are of course more than welcome to play around with this assumption.