Lecture 7 - Modelling and Analysis


Topics in Econometrics - M2 ENS Lyon

Vincent Bagilet

2025-10-14

Introduction

Short feedback form

https://forms.gle/wxwZNFXyPxprLELT7

Steps of an Econometrics Analysis

  • Design: decisions of data collection and measurement

    • eg, decisions related to sample size and ensuring exogeneity of the treatment
  • Modelling: define statistical models

    • eg selecting variables, functional forms, etc
  • Analysis: estimation and questions of statistical inference

    • eg standard errors, hypothesis tests, and estimator properties

Goal of the session

  • Main focus in causal inference is often identification
  • So far, we have: a good question and a convincing quasi-random allocation of the treatment
  • How do we make inference for the population?
  • We make causal claims based on significance ⇒ need modelling assumptions to hold and reliable SEs
  • Aim to give intuition about some key points and provide you with resources to learn more about them

Modelling

Modelling assumptions matter

  • OLS valid under a set of assumptions: the Gauss–Markov conditions
  • If these assumptions, or any other modelling assumptions, do not hold, we cannot make reliable inference
  • Let’s see why!
  • Modelling matters: specification choices affect the results of our studies and our ability to make reliable inference

Gauss–Markov conditions


| Assumption | Idea | If violated |
|---|---|---|
| 1. Linearity | Model linear in its parameters | Misspecification ⇒ biased/inconsistent estimates |
| 2. No perfect collinearity | $(X'X)^{-1}$ exists | Coefficients not identifiable |
| 3. Exogeneity | $E[u_i \mid X_i] = 0$ | Estimator biased |
| 4.a. Independent errors | $Cov(u_i, u_j \mid X) = 0$ for $i \neq j$ (eg no autocorrelation) | Invalid inference |
| 4.b. Homoskedasticity | $Var(u_i \mid X_i) = \sigma^2 = \text{cst}$ | Inefficient |
  • 4.a + 4.b = spherical errors

Implications

  • Finite sample properties:
    • 1 → 3: $\widehat{\beta}_{OLS}$ unbiased
    • 1 → 4: $\widehat{\beta}_{OLS}$ efficient (Best Linear Unbiased Estimator)
  • Asymptotically:
    • Unbiased
    • Normally distributed
    • Efficient

Normally distributed estimator

  • Required for making inference (eg computing confidence intervals or p-values)
  • Additional assumption: normal errors ⇒ $\hat{\beta}_{OLS}$ normally distributed
  • If errors non-normal
    • Alternative way to compute SE (eg bootstrap)
    • If $n$ is large enough, Central Limit Theorem (CLT) + Weak Law of Large Numbers (WLLN) ⇒ $\hat{\beta}_{OLS}$ approximately normal

Exercise

  • Generate fake data and analyse the impact of violations of some of these assumptions
    1. Non-linearity
    2. Perfect collinearity
    3. Endogeneity
    4. Autocorrelation
    5. Heteroskedasticity
    6. Non-normal errors

Non-linear

library(tidyverse)     # for tibble()
library(modelsummary)  # for the regression tables

n <- 1000
alpha <- 10
beta <- 2
mu_x <- 2
sigma_x <- 1
sigma_u <- 2

data_non_linear <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x^2 + u   # true relationship quadratic in x, but we fit a linear model
)

reg_non_linear <- lm(y ~ x, data = data_non_linear) 

# list("Non-linear" = reg_non_linear) |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
|             | Non-linear |
|-------------|------------|
| (Intercept) | 4.129      |
|             | (0.232)    |
| x           | 7.898      |
|             | (0.106)    |
| Num.Obs.    | 1000       |
| R2          | 0.848      |

Collinearity

gamma <- 0.2

data_collin <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = 0.3*x,   # w perfectly collinear with x
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x + gamma*w + u  
)

reg_collin <- lm(y ~ x + w, data = data_collin)

# list("Perfect collin." = reg_collin) |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")

data_almost_collin <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  w = 0.3*x + rnorm(n, 0, 0.01),   # w almost perfectly collinear with x
  u = rnorm(n, 0, sigma_u),
  y = alpha + beta*x + gamma*w + u  
)

reg_almost_collin <- lm(y ~ x + w, data = data_almost_collin)

# list("Almost perfect collin." = reg_almost_collin) |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
|             | Perfect collin. |
|-------------|-----------------|
| (Intercept) | 9.936           |
|             | (0.140)         |
| x           | 2.089           |
|             | (0.062)         |
| Num.Obs.    | 1000            |
| R2          | 0.529           |

  • The parameter for $w$ is not estimable: lm removes the variable from the regression

|             | Almost perfect collin. |
|-------------|------------------------|
| (Intercept) | 9.954                  |
|             | (0.139)                |
| x           | 5.660                  |
|             | (1.896)                |
| w           | -11.844                |
|             | (6.319)                |
| Num.Obs.    | 1000                   |
| R2          | 0.537                  |

  • Cannot separate the effect of $x$ from that of $w$

Endogeneity

data_endog <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = 0.5 * x + rnorm(n, 0, sigma_u),   # error correlated with x: E[u | x] != 0
  y = alpha + beta*x + u  
)

reg_endog <- lm(y ~ x, data = data_endog)

# list("Endogeneity" = reg_endog) |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")
|             | Endogeneity |
|-------------|-------------|
| (Intercept) | 9.922       |
|             | (0.136)     |
| x           | 2.486       |
|             | (0.060)     |
| Num.Obs.    | 1000        |
| R2          | 0.631       |

  • The estimate for $\beta$ is biased; it is the OVB we explored in the first assignment

Autocorrelation

# AR(1) errors: each error depends on the previous one
u_auto <- numeric(n)
for (i in 2:n) u_auto[i] <- 0.9*u_auto[i-1] + rnorm(1, 0, sigma_u)

data_auto <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = u_auto,
  y = alpha + beta*x + u
) 

reg_auto <- lm(y ~ x, data = data_auto) 

# reg_auto |>
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "HAC"))
|             | (1)     | (2)     |
|-------------|---------|---------|
| (Intercept) | 9.343   | 9.343   |
|             | (0.352) | (0.581) |
| x           | 2.076   | 2.076   |
|             | (0.157) | (0.153) |
| Num.Obs.    | 1000    | 1000    |
| R2          | 0.149   | 0.149   |
| Std.Errors  | IID     | HAC     |

  • Assuming iid errors leads to inaccurate standard errors for $\widehat{\beta}$

Heteroskedasticity

data_heterosked <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u + x^2),   # error SD depends on x
  y = alpha + beta*x + u  
)

reg_heterosked <- lm(y ~ x, data = data_heterosked) 

# reg_heterosked |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "robust"))
|             | (1)     | (2)     |
|-------------|---------|---------|
| (Intercept) | 10.277  | 10.277  |
|             | (0.578) | (0.543) |
| x           | 1.935   | 1.935   |
|             | (0.261) | (0.363) |
| Num.Obs.    | 1000    | 1000    |
| R2          | 0.052   | 0.052   |
| Std.Errors  | IID     | HC3     |

  • Not accounting for heteroskedasticity leads to inaccurate standard errors for $\widehat{\beta}$

Non-normal errors

df <- 2

data_non_normal <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rt(n, df), #fatter tails
  y = alpha + beta*x + u  
)

reg_non_normal <- lm(y ~ x, data = data_non_normal)

# reg_non_normal |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = c("classical", "bootstrap"))
|             | (1)     | (2)       |
|-------------|---------|-----------|
| (Intercept) | 9.928   | 9.928     |
|             | (0.174) | (0.143)   |
| x           | 1.995   | 1.995     |
|             | (0.076) | (0.056)   |
| Num.Obs.    | 1000    | 1000      |
| R2          | 0.408   | 0.408     |
| Std.Errors  | IID     | Bootstrap |

  • Not accounting for non-normal errors leads to inaccurate standard errors for $\widehat{\beta}$ in small samples

Limited outcome models

  • Often, $y$ is limited: binary, categorical, censored, etc
  • Linear regression is not appropriate:
    • It does not take the constraints on $y$ into account
    • ie the linearity assumption does not hold and the errors are non-normal
  • Use Generalized Linear Models (GLMs):
    • Idea: use an invertible link function $g$ to map $\mathbb{E}[y|X]$ onto a continuous, unbounded scale (illustrated below)
    • $g(\mathbb{E}[y|X]) = \beta_0 + \beta_1 X_1 + ... + \beta_k X_k$
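As a small illustration, for a binary outcome with a logit link (detailed a few slides below), $g$ and $g^{-1}$ correspond to R's qlogis() and plogis(); this is only a sketch of the mapping, not part of the original slides:

# Logit link g(p) = log(p / (1 - p)) and its inverse g^{-1}(z) = 1 / (1 + exp(-z))
p <- 0.7
z <- qlogis(p)   # link: from the probability scale to the linear-predictor scale
plogis(z)        # inverse link: back to p = 0.7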

Limited outcome models






| Limited $y$ | Example | Regression model |
|-------------|---------|------------------|
| Binary      | $y \in \{0,1\}$ | Probit, logit, … |
| Count       | $y \in \{0, 1, 2, 3, ...\}$ | Poisson, negative binomial, … |
| Censored    | eg $y = \max(0, y^*)$ | Censored regression models (eg tobit) |

Binary outcome

Example

  • The outcome follows a Bernoulli distribution: $y \mid X \sim Ber(\pi)$
  • A regression model expresses the conditional probability $\pi = P[y = 1 \mid X]$ as a function of $X$ and $\beta$: $\pi = g^{-1}(X'\beta)$

Binary outcome

Models

  • Linear Probability Model
    • $g : x \mapsto x$
    • Almost always yields biased and inconsistent estimates
  • Logistic regression model
    • $g : x \mapsto \log\left(\dfrac{x}{1-x}\right)$, the logit function
  • Probit regression model
    • $g$: probit, ie the quantile function of the standard normal distribution (and $g^{-1}$ is the CDF of the normal)
    • Rule of thumb: coefs $\simeq$ logit coefs divided by 1.6
    • Very similar to logit; sometimes easier to implement (both are fitted in the sketch below)
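A minimal sketch of fitting these models in R on simulated data, reusing the parameters from the exercises; the data-generating values -1 and 0.8 are illustrative assumptions, not taken from the slides:

# Simulate a binary outcome through an inverse logit link (illustrative parameters)
data_binary <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  p = plogis(-1 + 0.8*x),            # P[y = 1 | x]
  y = rbinom(n, size = 1, prob = p)
)

reg_lpm    <- lm(y ~ x, data = data_binary)   # linear probability model
reg_logit  <- glm(y ~ x, data = data_binary, family = binomial(link = "logit"))
reg_probit <- glm(y ~ x, data = data_binary, family = binomial(link = "probit"))

# list("LPM" = reg_lpm, "Logit" = reg_logit, "Probit" = reg_probit) |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")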

Binary outcome models

Interpreting coefficients

  • $\beta$ gives the direction BUT not the magnitude
  • The marginal effect depends on $X$: effects are largest in the middle of the distribution (see the sketch below)
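A hand-rolled sketch of this point for the logit fit above, using only base R (packages such as marginaleffects automate and generalise these computations):

# For a logit, the marginal effect of x is beta * p * (1 - p):
# it shrinks as the predicted probability gets close to 0 or 1
beta_logit <- coef(reg_logit)["x"]
p_hat <- predict(reg_logit, type = "response")

mean(beta_logit * p_hat * (1 - p_hat))   # average marginal effect (AME)

# Marginal effect near the middle of the x distribution vs far in the tail
x_mid  <- mean(data_binary$x)
p_mid  <- plogis(coef(reg_logit)["(Intercept)"] + beta_logit * x_mid)
p_tail <- plogis(coef(reg_logit)["(Intercept)"] + beta_logit * (x_mid + 3))
beta_logit * p_mid * (1 - p_mid)     # larger effect
beta_logit * p_tail * (1 - p_tail)   # smaller effect: probability already close to 1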

Binary outcome models

Example fits

Count data models

  • Poisson regression model
    • The Poisson distribution $\text{Pois}(\lambda)$ models the number of events occurring in a fixed interval when events occur independently and at a constant mean rate
    • Link function: $\ln$
    • Imposes $\mathbb{V}[y|X] = \mathbb{E}[y|X]$
  • Negative binomial model
    • The negative binomial distribution $NB(p,r)$ models the number of successes in a sequence of iid $Ber(p)$ trials before $r$ failures occur
    • Allows $\mathbb{V}[y|X] \neq \mathbb{E}[y|X]$ (both models are fitted in the sketch below)
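A minimal sketch of both models on simulated count data; the data-generating parameters are illustrative assumptions, and MASS is assumed to be installed for the negative binomial fit:

# Simulated count outcome with a log link (illustrative parameters)
data_count <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  y = rpois(n, lambda = exp(0.5 + 0.3*x))   # E[y | x] = exp(0.5 + 0.3 x)
)

reg_poisson <- glm(y ~ x, data = data_count, family = poisson(link = "log"))
reg_negbin  <- MASS::glm.nb(y ~ x, data = data_count)   # relaxes Var[y|X] = E[y|X]

# list("Poisson" = reg_poisson, "Negative binomial" = reg_negbin) |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log")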

A note on controls

  • Overall two reasons to include controls:
    1. To ensure random allocation of the treatment
      • Necessary for identification
    2. To improve precision
      • To better explain $y$: ↗ $R^2$ ($= 1 - FUV$) and ↘ $\sigma_u^2$
  • Adjusting for pre-treatment covariates may
    • Reduce the variation in $y$, which increases precision
    • Reduce the variation in $x$, which decreases precision

Bad controls

Do not adjust for post-treatment variables that may be affected by the treatment

Simulating bad controls

#reuse the parameters from above
n <- 10000 #to limit sampling variation
mu_b <- 3
sigma_b <- 1
gamma <- 5
kappa <- 4
sigma_a <- 2
delta <- 0

data_bad_control <- tibble(
  x = rnorm(n, mu_x, sigma_x),
  u = rnorm(n, 0, sigma_u),
  b = rnorm(n, mu_b, sigma_b),            # pre-treatment covariate, independent of x
  a = kappa*x + rnorm(n, 0, sigma_a),     # post-treatment variable, affected by x
  y = alpha + beta*x + gamma*b + delta*a + u  
)

reg_short <-  lm(data = data_bad_control, y ~ x)
reg_pre <- lm(data = data_bad_control, y ~ x + b)
reg_post <- lm(data = data_bad_control, y ~ x + a)
reg_pre_post <- lm(data = data_bad_control, y ~ x + b + a)
|             | No controls | Pre-treatment control | Post-treatment control | Pre and post-treatment |
|-------------|-------------|-----------------------|------------------------|------------------------|
| (Intercept) | 24.826      | 9.929                 | 24.824                 | 9.929                  |
|             | (0.120)     | (0.074)               | (0.120)                | (0.074)                |
| x           | 2.085       | 2.035                 | 2.217                  | 2.022                  |
|             | (0.054)     | (0.020)               | (0.121)                | (0.045)                |
| b           |             | 5.001                 |                        | 5.001                  |
|             |             | (0.020)               |                        | (0.020)                |
| a           |             |                       | -0.033                 | 0.003                  |
|             |             |                       | (0.027)                | (0.010)                |
| Num.Obs.    | 10000       | 10000                 | 10000                  | 10000                  |
| R2          | 0.131       | 0.881                 | 0.131                  | 0.881                  |

Analysis

SEs: what are they?

  • Standard errors = estimate of sampling variability
  • Tell us how precise estimates are, how much $\widehat{\beta}$ would vary across repeated samples
  • They determine confidence intervals and p-values
  • An example below will illustrate this

Why care about them?

  • Violations of classical assumptions:
    • Heteroskedasticity: variance depends on $X$
    • Non-independence: errors correlated within groups
  • Implications:
    • t-tests and CIs misleading
    • In general, SEs too small ⇒ inference too optimistic

(Non-)spherical errors


  • Under assumptions 1 → 3 we saw earlier, the asymptotic distribution of $\widehat{\beta}_{OLS}$ is

$$\widehat{\beta}_{OLS} \overset{a}{\sim} \mathcal{N}\left(\beta_0, (X'X)^{-1} X'\Sigma X (X'X)^{-1}\right)$$

  • If spherical errors (ie $\Sigma = \sigma^2 I$), use the unbiased sample variance
  • If non-spherical errors, need a covariance matrix estimator that is consistent under this misspecification
  • ⇒ use sandwich estimators of the variance: $\underbrace{(X'X)^{-1}}_{\text{bread}} \ \underbrace{X'\widehat{\Sigma} X}_{\text{meat}} \ \underbrace{(X'X)^{-1}}_{\text{bread}}$
  • Heteroskedasticity: compute White SEs
  • Autocorrelation: compute Newey-West/Conley SEs if correlated in time/space (see the sketch below)
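As a sketch, these corrections can be computed with the sandwich and lmtest packages (the vcov argument of modelsummary, used in the commented calls above, draws on the same estimators), here applied to the simulated examples from the exercise:

library(sandwich)  # sandwich variance estimators
library(lmtest)    # coeftest() for inference with a user-supplied vcov

# White (heteroskedasticity-robust) SEs for the heteroskedastic example
coeftest(reg_heterosked, vcov = vcovHC(reg_heterosked, type = "HC3"))

# Newey-West (HAC) SEs for the autocorrelated example
coeftest(reg_auto, vcov = NeweyWest(reg_auto))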

Why clustering SEs

  • When errors are correlated within groups (eg individuals, firms, regions), clustering adjusts for this

$$\mathbb{E}[u_i u_j \mid X] \neq 0 \text{ for } i, j \text{ in the same cluster}$$

  • Examples
    • Panel data: repeated measures of same unit
    • Group-level treatment, eg policy at regional level with many individuals per region
  • Intuition
    • We do not have $N$ independent observations but $G$ clusters
    • eg 10 classrooms of 30 students do not correspond to 300 independent observations
    • The real sample size is closer to the number of clusters than to the number of observations (see the simulation sketch below)
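A minimal simulation sketch of this intuition, in the spirit of the exercises above; the number of clusters, the cluster size, and the size of the common shock are illustrative assumptions:

# 30 clusters (eg classrooms) sharing a common shock in the error term
n_clusters <- 30
cluster_size <- 50

data_cluster <- tibble(
  g = rep(1:n_clusters, each = cluster_size),                  # cluster id
  shock = rep(rnorm(n_clusters, 0, 2), each = cluster_size),   # cluster-level shock
  x = rnorm(n_clusters * cluster_size, mu_x, sigma_x),
  u = shock + rnorm(n_clusters * cluster_size, 0, sigma_u),
  y = alpha + beta*x + u
)

reg_cluster <- lm(y ~ x, data = data_cluster)

# reg_cluster |> 
#   modelsummary(gof_omit = "IC|Adj|F|RMSE|Log", vcov = list("classical", ~g))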

What clustering does

  • Does not affect point estimates
  • Allows for intra-cluster correlation and adjusts the effective number of independent observations
  • Increases SEs to reflect within-group dependence ⇒ wider CIs

$$\widehat{\text{Var}}_{CR}(\widehat{\beta}) = (X'X)^{-1} \left( \dfrac{1}{n-p} \sum_{g \in G} X_g' r_g r_g' X_g \right) (X'X)^{-1}$$

where $X_g$ and $r_g$ are the data and residuals for cluster $g$

  • Intuition: each cluster provides one independent piece of information about how residuals co-evolve with regressors (see the sketch below)
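A sketch of this computation with sandwich::vcovCL (loaded above), applied to the simulated clustered data; vcovCL implements the same sandwich form, up to the exact finite-sample adjustment:

# Cluster-robust SEs: same point estimates as summary(reg_cluster), different SEs
coeftest(reg_cluster, vcov = vcovCL(reg_cluster, cluster = ~g))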

Which level to cluster at?

  • Clustering accounts for correlation in residuals
  • ⇒ the level of clustering depends on where correlation comes from
  • Rule of thumb:
    • Cluster at the level of the shock or treatment variation
    • If unsure, cluster higher rather than lower
  • If the level of clustering is too low, SEs are too small ⇒ overconfidence in the results
  • If the level of clustering is too high, SEs are too large ⇒ under-rejection

When is clustering difficult to implement?

  • When few clusters (eg < 30) ⇒ asymptotic results unreliable

  • When complex dependence or unclear correlation structure

  • When model residuals violate independence in unknown ways

  • Solution

    • Resampling methods (eg bootstrap)
    • Hierarchical clustering

Bootstrap

  • Intuition: simulate the sampling variability by resampling from our data
  • Steps:
    1. Draw samples (with replacement)
    2. Estimate $\hat{\beta}$ on each sample
    3. Compute the SD of $\hat{\beta}$ across replications
    4. That SD = bootstrap SE
  • When resampling, respect the correlation structure: sample clusters (see the sketch below)
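A minimal cluster-bootstrap sketch, reusing the clustered data simulated earlier; the number of replications is an arbitrary choice:

n_boot <- 500
cluster_ids <- unique(data_cluster$g)

boot_betas <- replicate(n_boot, {
  # 1. resample whole clusters, with replacement, to respect the correlation structure
  sampled <- sample(cluster_ids, size = length(cluster_ids), replace = TRUE)
  data_boot <- bind_rows(lapply(sampled, function(id) filter(data_cluster, g == id)))
  # 2. re-estimate the coefficient on each bootstrap sample
  coef(lm(y ~ x, data = data_boot))["x"]
})

# 3.-4. the SD across replications is the bootstrap SE
sd(boot_betas)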

Summaries

Summary of today

  • Regression models and OLS estimation rest on a set of assumptions
  • These assumptions ensure that estimates are unbiased, efficient, and statistically valid
  • When they fail, estimates are biased and standard errors misleading
  • We reviewed these modelling assumptions and discussed what happens when they fail
  • Limited outcome models or clustered SEs help overcome some of these issues

Goal of the whole course




Give us a deeper understanding of:

  • How regression works “under the hood”: intuition
  • Causal identification strategies and their assumptions
  • How design, modelling, and analysis choices shape empirical results
  • Common pitfalls and challenges in empirical work
  • How to use simulations to explore estimator behavior and diagnose potential problems specific to your own cases
  • Existing references and where to find additional information on a specific topic

To mention when pitching your analysis

  1. Research question
    • What causal effect of interest are you trying to estimate?
  2. Ideal experiment
    • What ideal experiment would capture the causal effect?
  3. Identification strategy
    • How are the observational data used to make comparisons that approximate such an experiment?
  4. Estimation method (including assumptions made when constructing standard errors)
  5. Falsification tests that support the identifying assumptions

To also mention in your pitch

  • Motivation
    • Why is your research question important?
  • Contributions to the literature
  • Methodological contributions
  • Internal validity
    • Are the identifying assumptions plausible?
    • Are there unexplained results?
  • External validity
    • Gap between policy questions and the analyses performed?
    • Generalization to other populations and settings?

Summary of the entire course




Take away messages

  • Your research question and underlying theory are crucial
  • Think about what is the identifying variation in your model:
    • What are you estimating exactly with your model?
    • Which observations contribute to identification?
  • Always wonder what you are comparing
  • Use simulations to explore and understand points
