class: right, middle, inverse, title-slide .title[ # Lecture 4 - Model Selection ] .subtitle[ ##
Econometrics 1 ] .author[ ### Vincent Bagilet ] .date[ ### 2024-10-08 ] --- class: right, middle, inverse # Quiz --- class: right, middle, inverse # End of last week's slides --- # What's Next? - We now know: - How to estimate a model with OLS - The necessary conditions for the estimator to have nice properties - How to consider various functional forms in our model -- - How do we choose **which variables to include** in our model? - What happens if we omit some variables? - What if we have irrelevant variables in our econometric model? --- class: right, middle, inverse # Omitted Variables --- class: titled, middle <img src="data:image/png;base64,#images/FE/DAG-1.png" width="70%" style="display: block; margin: auto;" /> --- # Simulation - Gender affects education and wage - Actual DGP: `\(Wage_i = \alpha + \beta Educ_i + \delta Gender_i + e_i\)` - Omitting Gender: `\(Wage_i = \alpha + \beta Educ_i + e_i\)` - True effect: `\(\beta = 2000\)` -- |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| --- <img src="data:image/png;base64,#images/FE/plot_sim-1.png" width="70%" style="display: block; margin: auto;" /> --- # On a Real Dataset ### Long regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|----------:|---------:| |(Intercept) | 0.6228168| 0.6725334| 0.9260756| 0.3548338| |educ | 0.5064521| 0.0503906| 10.0505201| 0.0000000| |female | -2.2733619| 0.2790444| -8.1469542| 0.0000000| ### Short regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|---------:|---------:| |(Intercept) | -0.9048516| 0.6849678| -1.321013| 0.1870735| |educ | 0.5413593| 0.0532480| 10.166746| 0.0000000| --- # Dealing with Variable Selection - **Why** would we omit variables?
Econometrics 1 ] .author[ ### Vincent Bagilet ] .date[ ### 2024-10-08 ] --- class: right, middle, inverse # Quizz --- class: right, middle, inverse # End of last week's slides --- # What's Next? - We now know: - How to estimate a model with OLS - The necessary conditions for the estimator to have nice properties - How to consider various functional forms in our model -- - How do we choose **which variables to include** in our model? - What happens if we omit some variables? - What if we have irrelevant variables in our econometrics model? --- class: right, middle, inverse # Omitted Variables --- class: titled, middle <img src="data:image/png;base64,#images/FE/DAG-1.png" width="70%" style="display: block; margin: auto;" /> --- # Simulation - Gender affects education and wage - Actual DGP: `\(Wage_i = \alpha + \beta Educ_i + \delta Gender_i + e_i\)` - Omitting Gender: `\(Wage_i = \alpha + \beta Educ_i + e_i\)` - True effect: `\(\beta = 2000\)` -- |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| --- <img src="data:image/png;base64,#images/FE/plot_sim-1.png" width="70%" style="display: block; margin: auto;" /> --- # On a Real Dataset ### Long regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|----------:|---------:| |(Intercept) | 0.6228168| 0.6725334| 0.9260756| 0.3548338| |educ | 0.5064521| 0.0503906| 10.0505201| 0.0000000| |female | -2.2733619| 0.2790444| -8.1469542| 0.0000000| ### Short regression |term | estimate| std.error| statistic| p.value| |:-----------|----------:|---------:|---------:|---------:| |(Intercept) | -0.9048516| 0.6849678| -1.321013| 0.1870735| |educ | 0.5413593| 0.0532480| 10.166746| 0.0000000| --- # Dealing with Variable Selection - **Why** would we omit variables? 
-- - Economic theory not perfectly defined `\(\Rightarrow\)` we do not know which variables to include - Unobserved variables - **How** can we choose? -- 1. Estimate multiple models with different functional forms 2. Compare model performance --- class: titled, middle # Under-specification - When relevant variables are omitted, their impact on the outcome ends up in the error term - Can create Omitted Variable Bias (OVB) --- <img src="data:image/png;base64,#images/FE/plot_sim_0-1.png" width="70%" style="display: block; margin: auto;" /> --- class: titled, middle # Regression Table Comparison ### Effect of Gender on Education (initial regression) |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| ### No effect of Gender on Education |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 1899.331| 278.9843| 6.808022| 0| |Actual DGP | 1877.030| 278.2012| 6.747024| 0| --- <img src="data:image/png;base64,#images/FE/plot_sim_02-1.png" width="70%" style="display: block; margin: auto;" /> --- class: titled, middle # Regression Table Comparison ### Effect of Gender on Wage (initial regression) |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 2501.06| 231.0856| 10.823091| 0| |Actual DGP | 1972.76| 250.6226| 7.871435| 0| ### No effect of Gender on Wage |model | estimate| std.error| statistic| p.value| |:----------------|--------:|---------:|---------:|-------:| |Omitted Variable | 1872.222| 217.3425| 8.614156| 0| |Actual DGP | 1862.111| 246.9989| 7.538944| 0| --- class: right, middle # Maths on the board --- class: titled, middle # Summary for Under-specification - Problematic to ignore variables that are correlated with both `\(x\)` and `\(y\)` - Ok if only correlated with `\(x\)` or `\(y\)` - But,
controlling for variables correlated with `\(y\)` `\(\searrow\)` the variance of errors `\(\Rightarrow\)` `\(\searrow\)` the variance of the estimator - The omitted variable is unobserved `\(\Rightarrow\)` cannot really assess the sign or magnitude of the OVB --- class: right, middle, inverse # Over-Specification --- class: titled, middle # Over-Specification - Over-Specification = including irrelevant variables - Will not create bias - But will affect the **variance** of our estimator - Creates a **bias-variance trade-off** `$$\mathbb{V}[\hat{\beta}] = \dfrac{\sigma_u^2}{n \sigma_x^2}$$` --- class: titled, middle # When to Adjust or Not? - Including a variable in our model when we are not interested in its parameter is called **controlling** or **adjusting** - If including the variable `\(\searrow\)` the variance of errors, include it ... - **But** only if it does not decrease too much the residual variance of our explanatory variable of interest - If correlated with `\(x\)`, it will `\(\nearrow\)` the variance of our estimator - In practice, often prefer to include too many variables than too few: - Variance decreases with sample size but bias does not --- class: right, middle, inverse # Model Selection --- class: titled, middle # General Idea - Evaluate the capacity to fit the relationship between the explained and explanatory variables - Should **explain a large share of the variability** of the explained variable - For instance, explain why some individuals earn more than others - Variance of `\(y\)` can be decomposed into that of the estimated response and error `$$Var[y | X] = Var[\hat{y} | X] + Var[\hat{e} | X]$$` --- class: titled, middle # `\(R^2\)` - Proportion of the variance in `\(y\)` that is explained by the model - Measures how well the model explains the variability in `\(y\)` - Varies between 0 and 1 - Increases with the number of covariates `\(\Rightarrow\)` use the adjusted `\(R^2\)` --- class: right, middle # Maths on the board --- class: titled, middle # Sum(s) of Squares We define the following quantities: -
**TSS**: the Total Sum of Squares, `\(\quad TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2\)` - **ESS**: the Explained Sum of Squares, `\(\quad ESS = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2\)` - **RSS**: the Residual Sum of Squares, `\(\quad RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)` We can show that: `$$TSS = ESS + RSS$$` --- class: titled, middle # `\(R^2\)` and formulas - Proportion of the variance in `\(y\)` that is explained by the model, so: `$$R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS}$$` - It is also the square of the correlation coefficient between `\(y\)` and `\(\hat{y}\)` (hence its name) --- class: titled, middle # Information Criteria - To select between different models, we can also use information criteria - **AIC**: Akaike Information Criterion - **BIC**: Bayesian Information Criterion - *Approach*: 1. Compute the information criterion for each specification 2. Select the model that minimizes the information criterion - However, information criteria only compare models: they say nothing about the absolute quality of the selected model --- class: right, middle, inverse # Thanks!