class: right, middle, inverse, title-slide .title[ # Lecture 3 - Design Beyond Identification ] .subtitle[ ##
Topics in Econometrics ] .author[ ### Vincent Bagilet ] .date[ ### 2025-09-23 ] --- class: titled, middle # Housekeeping - **Replication games**: - October 9, all day, mandatory - Make groups and register, quickly - Pre-game meeting at 1pm to explain how it will take place - I will grade your assignments - **Project proposal**: - Groups? - Thought about your subject? --- class: titled, middle # Summary from last week(s) - *Goal of the class*: develop a better understanding and **intuition** of how applied econometric analyses work **under the hood** - Last week, learned how to implement simulations: - To understand econometric concepts - To design a study - To run tests and checks - To use as a rhetorical tool --- # Steps of the simulation approach -- 1. Define a DGP and the distribution of variables -- 1. Set parameter values (`baseline_param`) -- 1. Generate a data set (`generate_data()`) -- 1. Estimate the effect in the generated data set (`run_estim()`) -- 1. Repeat many times (`compute_sim()` and `pmap()`) -- 1. Compute the measure of interest -- 1. Change parameter values (potentially) -- 1. Complexify the DGP -- 1. Repeat --- class: right, middle, inverse # Design Matters --- class: titled, middle # Steps of an Econometric Analysis - **Design**: decisions of data collection and measurement - *eg*, decisions related to sample size and ensuring exogeneity of the treatment - **Modeling**: define statistical models - In between design and analysis - **Analysis**: estimation and questions of statistical inference - *eg* standard errors, hypothesis tests, and estimator properties --- class: titled, middle # Design in Economics - In (non-experimental) economics, design is presented in this lexicographic order: 1. Identification 1. Unbiasedness 1. Minimum variance 1. Robustness to misspecification somewhere in the mix - Design includes **identification but not only** - These steps **interact** with one another ??? - Identification: ensure a quasi-random allocation of the treatment --- class: titled, middle # The importance of design - We want to **have an accurate measure of the quantity of interest** - For that, need to have a causal identification strategy - But useless if the design is poor in other dimensions and prevents us from even **detecting** the effect - Statistical power will be central here --- class: titled, middle # Statistical power - Power is a key **implication of design choices** - *Definition*: - Probability of rejecting the null (often of no effect) when it is false: `$$\text{Power} = 1 - \text{rate of Type II error}$$` - Roughly the probability of detecting an effect when there is one - Power is a function of design: poor designs can lead to low statistical power --- # Why is low power problematic? <br><br> - We want to be able to detect an effect if there is one (that is large enough to be relevant) - Because costly to run a study for "nothing" - In RCTs, typical threshold for power: **80%** - In observational settings, why not run a study with say 20% power? -- - Because low statistical power `\(\Rightarrow\)` exaggeration <!-- -------- --> <!-- - **Design matters even after a significant estimate has been obtained** -->
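---

# Power in a quick simulation

A minimal sketch of what high vs low power looks like, reusing last week's workflow (the helper names echo `generate_data()` and `run_estim()`, but the DGP and all parameter values here are hypothetical):

```r
set.seed(1)

generate_data <- function(n, beta) {
  x <- rnorm(n)
  data.frame(x = x, y = beta * x + rnorm(n))   # simple linear DGP
}

run_estim <- function(data) {
  summary(lm(y ~ x, data = data))$coefficients["x", "Pr(>|t|)"]   # p-value on x
}

compute_power <- function(n, beta, n_iter = 1000) {
  p_values <- replicate(n_iter, run_estim(generate_data(n, beta)))
  mean(p_values < 0.05)   # share of significant estimates, ie power
}

compute_power(n = 1000, beta = 0.1)   # larger sample: high power
compute_power(n = 50,   beta = 0.1)   # small sample: low power
```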
--- class: right, middle, inverse # Low power and exaggeration --- <img src="data:image/png;base64,#slides_03_design_files/figure-html/graph_camerer_initial_study-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#slides_03_design_files/figure-html/graph_camerer_replication_study-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#slides_03_design_files/figure-html/graph_camerer_replication_original-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#slides_03_design_files/figure-html/graph_camerer_iter-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#slides_03_design_files/figure-html/graph_camerer_signif-1.png" width="90%" style="display: block; margin: auto;" /> --- <img src="data:image/png;base64,#slides_03_design_files/figure-html/graph_camerer_zoom-1.png" width="90%" style="display: block; margin: auto;" /> --- class: titled, middle # Exaggeration: definition and main drivers - Definition: `$$E = \dfrac{\mathbb{E} [\, | \hat{\beta} | \mid \text{signif}\, ]}{| \beta_1 |} = \dfrac{\mathbb{E} [\, | \hat{\beta} | \mid \beta_{1}, \sigma, | \hat{\beta} | > z_{\alpha} \sigma\, ]}{| \beta_1 |}$$` - Exaggeration `\(\searrow\)` with statistical power and thus: - `\(\searrow\)` with precision - `\(\searrow\)` with effect size - **When power is low, significant estimates from an unbiased estimator ALWAYS exaggerate the true effect** - There are also less straightforward drivers (*we are going to discuss them later today*) --- class: titled, middle # Economics faces the two ingredients for exagg. 1. **Significant results are favored** - Evidence of a significance filter in economics - (Rosenthal 1979, Andrews and Kasy 2019, Abadie 2020, Brodeur et al. 2016, 2020) 1. **Low statistical power** - Median power in economics: 18% - (Ioannidis et al. 2017, Ferraro and Shukla 2020) --- class: titled, middle # Why are significant results favored? - Editorial process favors significant results for publication - In a way, that makes sense if a non-significant result reflects a poor research question `\(\Rightarrow\)` importance of theory - But, might also be that the effect is **difficult to capture** - **File drawer problem**: tend to give up projects more when results are non-significant (put them away in a drawer) - **Forking paths**: we make many choices when implementing a study and they may be more likely to lead to a significant outcome ??? - Forking paths (from Causal Exagg paper): Decision forks appear at various stages along the path of research, for instance in data preparation, regarding the inclusion of a given control variable in the model or of a given observation in the sample, or later, regarding whether to carry on with research that yields non-significant results. Due to the structural flaw that favors significance, the path followed may be more likely to lead to a statistically significant result. These choices are most often not the result of bad researcher practices but instead a product of a structure that portrays significant results as an end goal of research.
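---

# Exaggeration in a quick simulation

A minimal sketch of the definition from a few slides back (the true effect `beta` and standard error `se` are hypothetical, chosen so that power is low): draw estimates from the sampling distribution of an unbiased estimator, keep only the significant ones, and compare their average magnitude to the true effect.

```r
set.seed(1)

beta <- 0.5   # hypothetical true effect
se   <- 0.4   # hypothetical standard error, chosen so that power is low

estimates <- rnorm(10000, mean = beta, sd = se)   # unbiased estimator
signif    <- abs(estimates) > qnorm(0.975) * se   # the significance filter

mean(signif)                               # statistical power
mean(abs(estimates[signif])) / abs(beta)   # exaggeration ratio E
```

Even though every draw comes from an unbiased estimator, the significant draws exaggerate the true effect on average.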
--- # Exaggeration matters in actual settings .pull-left[ <br><br> - In economics, nearly **80% of estimates are exaggerated by a factor of 2** (Ioannidis et al. 2017) - Not all designs suffer from exaggeration - But exaggeration is likely substantial in many studies ] .pull-right[ <img src="data:image/png;base64,#images/spiderman.png" width="90%" style="display: block; margin: auto;" /> ] --- class: titled, middle # Design Beyond Identification, Straightforward? - Have a large enough **sample size** and we're good? - Not so simple! - Other aspects than sample size affect power: - Effect size - Proportion of treated - Number of shocks - Measurement error - Strength of the instrument - Whether the outcome is a count --- class: right, middle, inverse # Multiple Goals --- class: titled, middle # ATE But Not Only - Often, goal of an econometric study: estimate the ATE (*Does the treatment work?*) - But also, *where and when does it work?*: - Capture **heterogeneity**: treatment effect varies across time and individuals - Often consider effect on **multiple outcomes** - **Extrapolate** --- class: titled, middle # Implications of Multiple Goals - They have **intertwined implications** for how we approach design - Not possible to have high power for everything - Goals can be **competing** - Can take action at the design stage, acknowledging these multiple goals --- class: titled, middle # Heterogeneity - Treatment effect rarely homogeneous - The phrase "**Average** Treatment Effect" implicitly acknowledges this - Variation across individuals, time, space, etc - There are therefore potential confounders: - Need to adjust for such variables - Measure them --- class: titled, middle # Heterogeneity ## Interactions - A usual approach to account for heterogeneity is to use interactions - To measure interactions, we **need 16 times the sample size**: - The interaction estimate has twice the s.e. of the main effect - Reasonable to assume that the interaction has half the magnitude of the main effect - Thus the Signal to Noise Ratio `\(\left( SNR = \frac{\text{True effect}}{\text{s.e.}} \right)\)` is 4 times smaller for the interaction - Thus need `\(4^2 = 16\)` times the sample size --- class: titled, middle # Heterogeneity ### Two-Way Fixed Effects (TWFE) - Issues: - When treatment effect is heterogeneous (in time or across groups) - Treated units in the control group - Negative weights - The literature addressed it as an analysis problem: proposed alternative estimators - But can see it as **non-modeled heterogeneity** --- class: titled, middle # Multiple Outcomes - Rough approximation of the median number of estimates per paper: 19 - Bonferroni correction: - Change the significance level to `\(\frac{\alpha}{\text{Number of hypotheses tested}}\)` - Underlines that we need more power `\(\Rightarrow\)` need to take that into account --- class: titled, middle # Extrapolation - **External validity** - Increasing the sample size often **changes the underlying estimand** - *eg*, increasing sample size by increasing the time frame - or the spatial frame - Increasing sample size not always a silver bullet --- class: titled, middle # Modeling affects the effective design - Controlling and FEs partial out variation - OLS estimator can be seen as a weighted average of individual treatment effects with `\(w_i = (D_i - \mathbb{E}[D_i \mid X_i])^2\)` (*sketch on the next slide*) - Observations for which treatment is well explained by covariates do not contribute to the estimation - **Modifies the effective sample** `\(\Rightarrow\)` can be different from the nominal sample - Can create power and exaggeration issues
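---

# The effective sample in a quick sketch

A minimal sketch of these implicit weights on simulated data (the DGP and the 0.01 cutoff are purely illustrative): partial out the covariate, square the residualized treatment, and look at how the weights are distributed.

```r
set.seed(1)

n <- 1000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(2 * x))   # treatment strongly explained by x

d_resid <- resid(lm(d ~ x))   # D - E[D | X], estimated linearly
w       <- d_resid^2          # implicit weight of each observation

summary(w)        # many observations get a weight close to zero
mean(w > 0.01)    # rough share of the nominal sample that actually contributes
```

Observations whose treatment status is almost perfectly predicted by `x` barely contribute: the effective sample is smaller than the nominal one.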
--- class: right, middle, inverse # Improving and Assessing Design --- class: titled, middle # Structural solutions - Without publication bias, this issue disappears - Abandoning the 5% significance threshold - Interpretation of CIs' width to embrace uncertainty - Replication of studies with similar designs --- class: titled, middle # Improving design Approaches to improving design fall into four categories: .pull-left[ - **Increased sample size** - Both nominal and effective - **Increased effect size** - Focus on units with the largest effect - Increase take-up of the treatment ] .pull-right[ - **Decreased inferential uncertainty** - More pre-treatment information - Better measurement of outcomes - **Weave empirical models with substantive theory** - Adjust the research question - Measure intermediate outcomes ] --- class: titled, middle # Assessing design - Use simple **design calculations** - *Will my design allow me to detect an effect of magnitude `\(m\)`?* - **Simulations** (*hopefully you got that by now*) - Same + *what happens if some of my hypotheses do not hold?* - **Retrodesign calculations** - *Would my design allow me to detect a smaller effect than the one I got?* --- class: titled, middle # Design calculations - Goal: **choose a design that would yield adequate statistical power** - Compute the expected power, in this setting, as a function of design and in particular sample size - Find the necessary sample size - Before implementing the analysis - Common practice in experimental economics, much less in observational settings --- class: titled, middle # Necessary ingredients for design calculations - Statistical power is a function of the true effect size and the s.e. of the estimator - Strictly increasing with the true effect size - Strictly decreasing with the s.e. of the estimator - Slightly complex closed form - Need to **hypothesize an s.e. and a true effect size** --- class: titled, middle # Hypothesizing a standard error - Unknown before the analysis - Basically boil the analysis down to a difference of average outcomes between treatment and controls `$$se_{\bar{y_t} - \bar{y_c}} = \sqrt{\dfrac{\sigma_T^2}{n_T} + \dfrac{\sigma_C^2}{n_C}}$$` - `\(\sigma_T^2\)` and `\(\sigma_C^2\)` variances of the outcome for the treatment and control groups respectively (after partialing out controls) - Assuming `\(\sigma_T^2 = \sigma_C^2 = \sigma^2\)` and for `\(p_T = \frac{n_T}{n}\)`, this simplifies to `\(se_{\bar{y_t} - \bar{y_c}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{1}{p_T(1-p_T)}}\)` --- class: titled, middle # Hypothesizing effect sizes 1. Consider the proportion of affected individuals 1. Consider a **range of effects** (make several assumptions) - Derived from the literature - Based on theory - Consider what could be reasonable deviations from these effects 1. Multiply the fraction of non-zero effects by the hypothesized effects - Helps think about reasonable effect sizes and ways to focus on larger effects or reduce the s.e. --- class: titled, middle # Retrodesign calculations - Once an estimate has been obtained - Ask the question: **would my design allow me to detect a smaller effect** (of magnitude `\(m\)`)? - Need the standard error of your estimate and a hypothetical true effect size ( `\(m\)` ) - One line of R code: `retrodesign::retrodesign(m, se)` - Run it for a range of values
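---

# Design and retrodesign calculations in code

A minimal sketch combining the last three slides (`sigma`, `n`, `p_t`, and the candidate effect sizes are all hypothetical): back out the implied standard error, then check power and exaggeration for a range of hypothesized true effects.

```r
sigma <- 2     # hypothesized sd of the outcome, after partialling out controls
n     <- 500   # sample size considered
p_t   <- 0.5   # proportion of treated units

se <- sigma / sqrt(n) * sqrt(1 / (p_t * (1 - p_t)))   # hypothesized s.e.

# power and exaggeration for several hypothesized true effects
effects <- c(0.1, 0.2, 0.5)
lapply(effects, function(m) retrodesign::retrodesign(m, se))
```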
--- class: right, middle, inverse # Calibrating simulations --- class: titled, middle # Why calibrating? -- - So far, we considered very simple simulations, with "naive" distributions - Calibrating can help make simulations more realistic - But simulations will never be truly realistic - Yet they can still allow you to run some sort of **robustness check** on the ability of your design to retrieve the effects of interest - Also allows you to **think** about the DGP, your identification strategy, and so on --- layout: true # Fake data simulations --- ### Distributions of the variables - Emulate the distribution of variables in existing data sets <img src="data:image/png;base64,#images/distrib_logyield-1.png" width="600" style="display: block; margin: auto;" /> --- class: titled, middle ### Relationships between variables - Read the literature - Get a sense of **typical effect sizes and of relationships** between variables - Make assumptions about those relationships. Acknowledge them. - Complexify later if needed. You choose when you stop. - Varying parameter values might change the distribution of some "variables" - *eg* of the error term - Difficult to work *ceteris paribus*
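---

### A calibrated sketch

A minimal sketch of both ideas (the calibration targets and the effect size are hypothetical here; in practice they would come from existing data sets and from the literature):

```r
set.seed(1)

generate_data <- function(n, mu_log_yield = 1.5, sd_log_yield = 0.8, beta = 0.1) {
  treated <- rbinom(n, 1, 0.5)
  data.frame(
    treated   = treated,
    # distribution calibrated on existing data, effect size on the literature
    log_yield = rnorm(n, mean = mu_log_yield, sd = sd_log_yield) + beta * treated
  )
}

lm(log_yield ~ treated, data = generate_data(1000))
```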
--- layout: false class: titled, middle # Real data simulations ### General approach - Start from an existing data set - Not yours. At least not the subset you are interested in - Try to pick a subset where there is not already a treatment effect - Define a treatment allocation mechanism - **Add an artificial treatment effect** to the outcome variable in your initial data set, *eg* `$$Y_i(1) = Y_i(0) + \beta_i T_i$$` - Run your analysis and try to recover it (*a minimal sketch is in the appendix, after the last slide*) --- # Real data simulations ### Complexifying - There is only one artificial aspect in such simulations: the treatment - We can only play with 2 components: -- .pull-left[ - **Who is treated?** *Treatment allocation* - Everyone - Only a subset of the population ] .pull-right[ - **How?** *Treatment effect* - Homogeneous - Heterogeneous but random - Some specific correlation structure ] --- class: right, middle, inverse # Summary --- class: titled, middle # Take away messages .pull-left[ <br><br> - **Design matters**: - Beyond identification - Even after a significant estimate has been obtained - When power is low, significant estimates from an unbiased estimator are always far from the true effect - Might have important **implications for policy making** ] .pull-right[ <img src="data:image/png;base64,#images/padme.png" width="90%" style="display: block; margin: auto;" /> ] --- class: right, middle, inverse # Thank you!
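---

class: titled, middle

# Appendix: a real data simulation sketch

A minimal sketch of the real data approach, on a built-in data set (the data set, the allocation rule, and the effect size are all illustrative choices):

```r
set.seed(1)

data <- mtcars
beta <- 2   # artificial treatment effect

data$treated <- rbinom(nrow(data), 1, 0.5)        # artificial treatment allocation
data$mpg     <- data$mpg + beta * data$treated    # Y(1) = Y(0) + beta * T

summary(lm(mpg ~ treated, data = data))           # do we recover beta?
```

With only 32 observations, this design is itself badly underpowered for an effect of this size, which loops right back to the message of the lecture.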