class: right, middle, inverse, title-slide .title[ # Lecture 4 - Model Selection ] .subtitle[ ##
Econometrics 1 ] .author[ ### Vincent Bagilet ] .date[ ### 2024-10-08 ] --- class: right, middle, inverse # Quiz --- class: right, middle, inverse # End of last week's slides --- # What's Next? - We now know: - How to estimate a model with OLS - The necessary conditions for the estimator to have nice properties - How to consider various functional forms in our model -- - How do we choose **which variables to include** in our model? - What happens if we omit some variables? - What if we have irrelevant variables in our econometric model? --- class: right, middle, inverse # Omitted Variables --- class: titled, middle <img src="data:image/png;base64,#images/FE/DAG-1.png" width="70%" style="display: block; margin: auto;" /> --- # Simulation - Gender affects education and wage - Actual DGP: `\(Wage_i = \alpha + \beta Educ_i + \delta Gender_i + e_i\)` - Omitting Gender: `\(Wage_i = \alpha + \beta Educ_i + e_i\)` - True effect: `\(\beta = 2000\)` -- |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| --- <img src="data:image/png;base64,#images/FE/plot_sim-1.png" width="70%" style="display: block; margin: auto;" /> --- # On a Real Dataset ### Long regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|----------:|---------:| |(Intercept) | 0.6228168| 0.6725334| 0.9260756| 0.3548338| |educ | 0.5064521| 0.0503906| 10.0505201| 0.0000000| |female | -2.2733619| 0.2790444| -8.1469542| 0.0000000| ### Short regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|---------:|---------:| |(Intercept) | -0.9048516| 0.6849678| -1.321013| 0.1870735| |educ | 0.5413593| 0.0532480| 10.166746| 0.0000000| --- # Dealing with Variable Selection - **Why** would we omit variables?
Econometrics 1 ] .author[ ### Vincent Bagilet ] .date[ ### 2024-10-08 ] --- class: right, middle, inverse # Quizz --- class: right, middle, inverse # End of last week's slides --- # What's Next? - We now know: - How to estimate a model with OLS - The necessary conditions for the estimator to have nice properties - How to consider various functional forms in our model -- - How do we choose **which variables to include** in our model? - What happens if we omit some variables? - What if we have irrelevant variables in our econometrics model? --- class: right, middle, inverse # Omitted Variables --- class: titled, middle <img src="data:image/png;base64,#images/FE/DAG-1.png" width="70%" style="display: block; margin: auto;" /> --- # Simulation - Gender affects education and wage - Actual DGP: `\(Wage_i = \alpha + \beta Educ_i + \delta Gender_i + e_i\)` - Omitting Gender: `\(Wage_i = \alpha + \beta Educ_i + e_i\)` - True effect: `\(\beta = 2000\)` -- |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| --- <img src="data:image/png;base64,#images/FE/plot_sim-1.png" width="70%" style="display: block; margin: auto;" /> --- # On a Real Dataset ### Long regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|----------:|---------:| |(Intercept) | 0.6228168| 0.6725334| 0.9260756| 0.3548338| |educ | 0.5064521| 0.0503906| 10.0505201| 0.0000000| |female | -2.2733619| 0.2790444| -8.1469542| 0.0000000| ### Short regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|---------:|---------:| |(Intercept) | -0.9048516| 0.6849678| -1.321013| 0.1870735| |educ | 0.5413593| 0.0532480| 10.166746| 0.0000000| --- # Dealing with Variable Selection - **Why** would we omit variables? 
-- - Economic theory not perfectly defined `\(\Rightarrow\)` we do not know which variables to include - Unobserved variables - **How** can we choose? -- 1. Estimate multiple models with different functional forms 2. Compare model performance --- class: titled, middle # Under-specification - When relevant variables are omitted, their impact on the outcome ends up in the error term - Can create Omitted Variable Bias (OVB) --- <img src="data:image/png;base64,#images/FE/plot_sim_0-1.png" width="70%" style="display: block; margin: auto;" /> --- class: titled, middle # Regression Table Comparison ### Effect of Gender on Education (initial regression) |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| ### No effect of Gender on Education |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 1899.331| 278.9843| 6.808022| 0| |Actual DGP | 1877.030| 278.2012| 6.747024| 0| --- <img src="data:image/png;base64,#images/FE/plot_sim_02-1.png" width="70%" style="display: block; margin: auto;" /> --- class: titled, middle # Regression Table Comparison ### Effect of Gender on Wage (initial regression) |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| ### No effect of Gender on Wage |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 1872.222| 217.3425| 8.614156| 0| |Actual DGP | 1862.111| 246.9989| 7.538944| 0| --- class: right, middle # Maths on the board --- class: titled, middle # Summary for Under-specification - Problematic to ignore variables that are correlated with both `\(x\)` and `\(y\)` - Ok if only correlated with `\(x\)` or `\(y\)` - But,
controlling for variables correlated with `\(y\)` `\(\searrow\)` the variance of errors `\(\Rightarrow\)` `\(\searrow\)` the variance of the estimator - The omitted variable is unobserved `\(\Rightarrow\)` cannot really assess the sign or magnitude of the OVB --- class: right, middle, inverse # Over-Specification --- class: titled, middle # Over-Specification - Over-Specification = including irrelevant variables - Will not create bias - But will affect the **variance** of our estimator - Creates a **bias-variance trade-off** `$$\mathbb{V}[\hat{\beta}] = \dfrac{\sigma_u^2}{n \sigma_x^2}$$` --- class: titled, middle # When to Adjust or Not? - Including a variable in our model when we are not interested in its parameter is called **controlling** or **adjusting** - If including the variable `\(\searrow\)` the variance of errors, include it ... - **But** only if it does not decrease too much the residual variance of our explanatory variable of interest - If correlated with `\(x\)`, it will `\(\nearrow\)` the variance of our estimator - In practice, often prefer to include too many variables than too few: - Variance decreases with sample size but bias does not --- class: right, middle, inverse # Model Selection --- class: titled, middle # General Idea - Evaluate the capacity to fit the relationship between the explained and explanatory variables - Should **explain a large share of the variability** of the explained variable - For instance, explain why some individuals earn more than others - Variance of `\(y\)` can be decomposed into that of the estimated response and error `$$Var[y | X] = Var[\hat{y} | X] + Var[\hat{e} | X]$$` --- class: titled, middle # `\(R^2\)` - Proportion of the variance in `\(y\)` that is explained by the model - Measures how well the model explains the variability in `\(y\)` - Varies between 0 and 1 - Increases with the number of covariates `\(\Rightarrow\)` use the adjusted `\(R^2\)` --- class: right, middle # Maths on the board --- class: titled, middle # Sum(s) of Squares We define the following quantities: -
**TSS**: the Total Sum of Squares, `\(\quad TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2\)` - **ESS**: the Explained Sum of Squares, `\(\quad ESS = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2\)` - **RSS**: the Residual Sum of Squares, `\(\quad RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)` We can show that: `$$TSS = ESS + RSS$$` --- class: titled, middle # `\(R^2\)` and formulas - Proportion of the variance in `\(y\)` that is explained by the model, so: `$$R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$$` - It is also the square of the correlation coefficient between `\(y\)` and `\(\hat{y}\)` (hence its name) --- class: titled, middle # Information Criteria - To select between different models, we can also use information criteria - **AIC**: Akaike Information Criterion - **BIC**: Bayesian Information Criterion - *Approach*: 1. Compute the information criterion for each specification 2. Select the model that minimizes the information criterion - However, information criteria only compare models: they say nothing about the absolute quality of the selected model --- class: right, middle, inverse # Thanks!