class: right, middle, inverse, title-slide

.title[
# Lecture 2 - Simulations for regression analysis
]
.subtitle[
## Topics in Econometrics
]
.author[
### Vincent Bagilet
]
.date[
### 2025-09-16
]

---
class: titled, middle

# Housekeeping

- Tomorrow's class moved to next week
- Graded assignment 1 due next week
- Reading: I will post the paper online. Due next Wednesday
- Replication exercise: format TBA

---
class: titled, middle

# Take-away points from last week

- Applied economics aims to produce **accurate causal estimates** (*eg* to inform public policy)
- *Objective of the class*: discuss **practical issues** that may prevent us from doing so
- These issues can arise at any step of research: **design**, **modeling** and **analysis**
- There are some fundamental **hurdles** to estimating causal effects
- **Simulations** can help spot and understand these hurdles

---
class: titled, middle

# Order of concern

- There are different types of hurdles
- Each type only matters to the extent that the previous ones are addressed
- We need to have, in that order of concern:

1. A good **research question**, grounded in theory
1. A good **identification strategy** to avoid some fundamental hurdles (reverse causality, confounders, etc)
1. A specification that allows us to estimate the quantity we want to estimate
1. An analysis that avoids econometric hurdles

---
class: right, middle, inverse
layout: false

# Simulations for regression analysis

## What, Why and How?

---
class: titled, middle

# Lessons from last week's simulation

- What was the **idea behind** the implementation of simulations last week?

--

- *Objective*: explore how several parameters affect the estimate of interest
- *Approach*: generate fake data (we thus know the whole DGP) and run an analysis
- Were they **useful**? If so, in what way?

--

- Understand how various parameters affect the estimate of interest, without deriving the maths
- Help shape intuition and understanding

---
class: titled, middle

# What is a simulation for regression analysis?

- A process in which we (sketched below):

1. Generate **artificial data**
1. Run an analysis on this data
1. Repeat the process many times

- We **know the data generating process**
- Can simulate data:
  - From scratch (**fake data simulation**) or
  - On top of an existing data set (**real data simulation**)
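A minimal sketch of these three steps, assuming a simple linear DGP (all names and parameter values here are illustrative):

```r
# Sketch only: a simple linear DGP where we know the true effect (beta = 2)
simulate_once <- function(n = 100, beta = 2) {
  x <- rnorm(n)                    # 1. generate artificial data
  y <- 1 + beta * x + rnorm(n)
  coef(lm(y ~ x))[["x"]]           # 2. run an analysis on this data
}

estimates <- replicate(1000, simulate_once())  # 3. repeat many times
mean(estimates)                                # close to the true beta
```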
---
class: titled, middle

# Overall principles

- The whole game in our metrics analyses: **approximate the DGP**
- With a simulation, we know the true DGP (at least to some extent)
- We can assess the performance of our analysis:
  - **Can we accurately estimate the true effect of interest?**
  - Are there hurdles to doing so and can we overcome them?

---
class: titled, middle

# Why do simulations?

- To understand econometric concepts
- To design a study, before having the data
- To design a study, once having the data
- Tests and checks, after running the analysis
- As a rhetorical tool

---

# Why do simulations?

## To understand econometric concepts

- **No maths** required and allows us to consider many general cases easily
- Useful to **get intuition** on how econometric methods work
- Understand **general** concepts:
  - *eg* what happens in general when we omit a variable, or highlight issues with TWFE
  - Can explore this with naive fake-data simulations (*eg* `\(x \sim \mathcal{N} (0, 1)\)`), as sketched below
- Understand conceptual hurdles **specific** to our context:
  - *eg* what happens if there is autocorrelation in *this* particular variable
  - Can explore this with calibrated fake-data or real-data simulations
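A minimal sketch of such a naive fake-data simulation, here for omitted variable bias (the correlation structure and values are illustrative):

```r
# Sketch only: bias from omitting x2 when it is correlated with x1
simulate_ovb <- function(n = 500, rho = 0.5) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # corr(x1, x2) = rho
  y  <- 1 + x1 + x2 + rnorm(n)                 # true effect of x1 is 1
  coef(lm(y ~ x1))[["x1"]]                     # x2 omitted from the model
}

mean(replicate(1000, simulate_ovb())) - 1      # bias, positive here since rho > 0
```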
---
class: titled, middle

# The example of leverage and influence

- **What did you learn** from the exercise you had to do?
  - What affects leverage?
  - How does it affect the parameter of interest?
- Present the **intuition** behind leverage and influence
- How did you implement your simulation?
- Any cool graphs/outputs?

???

On R/quarto

---

# Why do simulations?

## To design a study, before having the data

- Useful to get started on a **concrete reflection** about:
  - The setting
  - What we want to estimate, *exactly*
  - The data we need and its granularity
  - The identification strategy
- As a proof of concept (to apply for grants, data access, etc)

---

# Why do simulations?

## To design a study, once having the data

- Useful to think about:
  - Threats to identification and important assumptions
  - The statistical power of our study (difficult to assess without a simulation)
- Explore **where to best invest resources**:
  - Larger sample
  - Improved data precision (reduced measurement error)

---

# Why do simulations?

## Tests and checks, after running the analysis

<br>

- Does your analysis detect the effect you are interested in, in a **pristine setting**?
- If the analysis faces issues in simulations, it will probably also face them in an actual setting
- What happens to the product of our analysis if the setting is slightly more complex?
- What happens if some hypotheses do not hold?
- All this can be discussed **even after the analysis has been run**

---

# Why do simulations?

## As a rhetorical tool

- Simplify what we are working on to the bare minimum
- What is **the simplest way of pitching** the analysis and the identification strategy?
- Can help build simple visualizations
- Can be useful to **illustrate why a given approach does not work**:
  - To argue why we chose a certain approach
  - In a referee report

---
class: titled, middle
layout: false

# General approach to simulations

- **Start with a simple DGP**:
  - Simple correlation structure
  - Our model represents the actual DGP
  - Does our analysis recover the effect in a rather "pristine" setting?
- Then **complexify the DGP**

---

# Steps of the simulation approach

<br>

--

1. Define a DGP and the distribution of variables

--

1. Set parameter values

--

1. Generate a data set

--

1. Estimate the effect in the generated data set

--

1. Repeat many times

--

1. Compute the measure of interest

???

- For one set of parameters

---
class: titled, middle

# Next steps

- **Change parameter values**
  - Understand how the measure of interest is affected by a given parameter
  - *eg* how does bias evolve with the correlation between `\(x_1\)` and `\(x_2\)`?
- **Complexify the DGP**
  - Would our method still perform well if the DGP were more and more complex?
- Repeat

---
class: right, middle, inverse
layout: false

# Exercise

## Simulating an RCT

---
class: titled, middle

## Setting

- Impact of receiving extra lessons on students’ grades
- Simulate an experiment (RCT): `\(\forall i \in \{1, ..., n\}, \quad Grade_i = \alpha_0 + \beta_0 Treat_i + u_i\)`
- Which sample size and proportion of treated units give a high probability of detecting the effect?

## How?

- Simulate many experiments
- Compute the proportion of effects detected

---
class: right, middle

# Switch to Quarto document for coding

---
class: right, middle, inverse

# Thank you!
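---
class: titled, middle

# Appendix: a sketch of the power simulation

One possible way to code the RCT exercise (not the official solution; the sample size, treated proportion, effect size and noise level below are illustrative, not the values to use):

```r
# Sketch only: share of simulated RCTs in which the effect is detected
simulate_rct <- function(n = 200, p_treat = 0.5, beta_0 = 0.2) {
  treat <- rbinom(n, 1, p_treat)             # random treatment assignment
  grade <- 10 + beta_0 * treat + rnorm(n)    # Grade = alpha_0 + beta_0 Treat + u
  summary(lm(grade ~ treat))$coefficients["treat", "Pr(>|t|)"]
}

p_values <- replicate(1000, simulate_rct())
mean(p_values < 0.05)   # statistical power at the 5% level
```

Varying `n` and `p_treat` and recomputing this share shows which designs detect the effect with high probability.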