class: right, middle, inverse, title-slide

.title[
# Lecture 2 - Properties
]
.subtitle[
## Econometrics 1
]
.author[
### Vincent Bagilet
]
.date[
### 2024-09-24
]

---
class: right, middle, inverse

# Quiz

---
class: right, middle, inverse

# Research questions

---
class: titled, middle

# What is a good research question?

- It **can be answered**
  - There is some sort of objective answer
- It should **improve our understanding of the world**
  - Should inform theory in some way
- Takes us from theory to a hypothesis (a statement about what we will observe in the world)

---

# Start with a question

- Avoid data mining
- We are interested in *why* and not *what*
- Data mining can still help identify *questions* to test on *other* data sets

<br>

# Identifying a research question

- From theory
- Thanks to opportunities

---
class: titled, middle

# Is your research question good?

- **Potential results**: what would any result tell you about your theory?
- **Feasibility**: is the right data available?
- **Scale**: how many resources would you need?
- **Research design**: is there a good one that would allow you to answer your question?
- **Keep it simple**: avoid building several questions into one

---
class: right, middle, inverse

# Summary from last week

---
class: titled, middle

# Summary from last week

- Goal: answer **research questions**
- Evaluate **theory** (there is a *why* or *because*)
- Want to describe relationships between variables
- Build an econometric **model**
- Estimate the model
- Check and interpret the results

???
- Relationships: functional form, magnitude, sign

---
class: right, middle, inverse

# Going Further
## Repeated Regressions

---
class: titled, middle

<img src="data:image/png;base64,#slides_2_properties_files/figure-html/sim_data-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#slides_2_properties_files/figure-html/plot_sim_2-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#slides_2_properties_files/figure-html/plot_sim_3-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#slides_2_properties_files/figure-html/plot_sim_4-1.png" width="70%" style="display: block; margin: auto;" />

---

# Repeated regressions

- Different samples give different results
- Let's compute a lot of regressions and store the results in a data frame
- The first results look like this:

| sim_id| estimate| std.error|
|------:|--------:|---------:|
|      1| 2320.969|  1296.384|
|      2| 2048.209|  1341.469|
|      3| 1840.535|  1044.561|
|      4| 2864.967|  1377.671|

- Let's plot them!

---

<img src="data:image/png;base64,#slides_2_properties_files/figure-html/plot_estim-1.png" width="70%" style="display: block; margin: auto;" />

---

<img src="data:image/png;base64,#slides_2_properties_files/figure-html/plot_distrib_estim-1.png" width="70%" style="display: block; margin: auto;" />

---
class: titled, middle

# Properties of our estimator

- Are our estimates valid?
- Are they a good approximation of the population parameters?

| mu_educ| sigma_educ| sigma_u| alpha| beta|
|-------:|----------:|-------:|-----:|----:|
|       3|          1|    8000| 15000| 2000|

- We have many samples (unlike in actual settings)
- Do we, on average, retrieve the parameters of interest?

???
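
---
class: titled, middle

# Repeated regressions in code (illustration)

The repeated-sampling experiment above can be sketched in a few lines. This is an illustrative Python sketch (the slides' own figures are produced in R); the sample size `n`, the number of simulations, and the variable name `wage` are assumptions, not values stated on the slides. The population parameters are those from the table.

```python
import numpy as np

rng = np.random.default_rng(42)

# Population parameters from the slides
mu_educ, sigma_educ = 3, 1   # mean and sd of education
sigma_u = 8000               # sd of the error term
alpha, beta = 15000, 2000    # intercept and slope

n, n_sims = 40, 1000         # sample size and number of samples (assumptions)

estimates = np.empty(n_sims)
for s in range(n_sims):
    educ = rng.normal(mu_educ, sigma_educ, n)             # draw a new sample
    wage = alpha + beta * educ + rng.normal(0, sigma_u, n)
    # OLS slope with a single regressor: cov(x, y) / var(x)
    estimates[s] = np.cov(educ, wage)[0, 1] / np.var(educ, ddof=1)

print(estimates.mean())  # averages out close to beta = 2000
print(estimates.std())   # dispersion of estimates across samples
```

Each iteration is one "sample" from the same population; collecting the slopes reproduces the distribution plotted on the previous slides.

???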
- Exercise

---
class: titled, middle

# Estimators as random variables

- The estimator is a **random variable** (r.v.): a variable whose outcome is uncertain
- Estimate = realization of the estimator
- For the same estimator, different samples `\(\Rightarrow\)` different estimates
- We can however study the properties of an estimator based on one sample only

---

# Properties of our estimator

- Here, we were able to derive properties because we had many samples
- What if, like in actual settings,

--

- We have only one draw from the population
- Population parameters are unknown?

--

- We use **theoretical properties** of the estimator
- Derive conditions under which the OLS estimator produces valid estimates

---
class: right, middle, inverse

# Statistics review

---
class: titled, middle

# Random variables

- **Random variable** (r.v.): a variable whose outcome is uncertain
- **Support** of a random variable: set of values the r.v. can take
- Probabilities can be assigned to the set of values in the support
- *Examples*: roll of a die, coin flip, height of students in the class

---
class: titled, middle

# Probability function

- Probability that a random variable takes a given value
- Discrete variable: **probability mass function**: `$$f_X : x \mapsto Pr[X = x]$$`
- Continuous variable: **probability density function**. It is such that `$$Pr[a \leq Z \leq b] = \int_a^b f_Z(z) \text{d}z$$`

---
class: titled, middle

# Expected value

- First moment
- Measures the central tendency of the distribution
- Discrete: `\(\mathbb{E}[X] = \sum_{i = 1}^{s} p(X = x_i)x_i\)`
- Continuous: `\(\mathbb{E}[Z] = \int_{- \infty}^{+\infty} z f_Z(z) \text{d}z\)`

---
class: titled, middle

# Variance

- Second moment
- Measures the dispersion of the distribution `$$\text{Var} [X] = \mathbb{E}[( X - \mu_x )^2]$$`
- where `\(\mu_x = \mathbb{E}[X]\)`
- Illustrations [here](https://seeing-theory.brown.edu/basic-probability/index.html)

---
class: right, middle, inverse

# Estimator Properties

---
class: titled, middle

# Unbiasedness

- **Bias** of the estimator `\(\hat{\beta}\)`: Bias = `\(\mathbb{E}[\hat{\beta}|X] - \beta\)`
- **Unbiasedness**:
  - Bias = 0
  - Distribution of the estimator is centered around the true population parameter
- If Bias > 0, the estimator is positively biased (there is an upward bias)

---
class: titled, middle

# Efficiency

- An estimator is efficient if **its variance is smaller than that of other comparable estimators**
- We want estimates from any sample to be close to one another
- Efficiency is **relative**: it is used to compare estimators that use the same information

---
class: titled, middle

# Asymptotic Consistency

- An estimator is **consistent** if it converges to the true parameter value as the sample size increases; for an unbiased estimator, it is enough that its variance vanishes: `$$\lim_{n \to \infty} Var(\hat{\beta}| X) = 0$$`
- Variance is a decreasing function of the sample size

---
class: titled, middle

# Asymptotic Normality

- **Asymptotic Normality**: the error follows a normal distribution with mean zero and constant variance `$$e|X \sim \mathcal{N}(0, \sigma^2I)$$`
- Necessary for testing hypotheses on the parameters and assessing their generality
- The error term is the sum of all the variables that are not included in the model `\(\to\)` central limit theorem

---
class: right, middle, inverse

# Optimality

---
class: titled, middle

# Gauss-Markov theorem

- Gives the conditions under which the OLS estimator is **optimal**
- Optimal means: the unbiased linear estimator with the smallest possible variance (Best Linear Unbiased Estimator, BLUE)
- Ideal situation, often violated in practice
- Use corrections and alternative estimators to recover valid estimates

---
class: titled, middle

# Linearity

- There exists a linear relationship between the inputs and the response
- The model is correctly specified `$$y = X\beta + e$$`
- Misspecified model `\(\Rightarrow\)` bias and inconsistent standard errors

---
class: titled, middle

# Exogeneity

- There is no relationship between the inputs and the error term `$$\mathbb{E}[e | X] = 0$$`
- Also called the zero conditional mean of the error
- Violated under simultaneity, omitted variables, and measurement error

---
class: titled, middle

# No perfect collinearity

- `\(X\)` is a matrix of full rank: its `\(k\)` columns are linearly independent
- Under perfect collinearity, the OLS estimator cannot be computed
- Arises when an input is a linear function of other inputs

---
class: titled, middle

# Spherical errors

- Spherical errors are a combination of:
  - **Homoskedasticity**: the variance of the errors does not depend on `\(X\)` ( `\(\mathbb{V}[e_i|X] = \sigma^2\)` )
  - **No serial correlation** or **independent errors**: `\(e_i \perp e_j | X\)`
- Combined together, this gives `$$\mathbb{E}[ee' | X] = \sigma^2 I$$`

---
class: titled, middle

# OLS Properties and Conditions

- Assume linearity and no perfect collinearity
- If in addition we have
  - **Exogeneity**, the OLS estimator is **unbiased**
  - **Exogeneity** and **spherical errors**, the OLS estimator is **efficient** among *linear* estimators (BLUE)
  - That + **normally distributed errors**, the OLS estimator is **normally distributed**

---
class: right, middle, inverse

# A bit of maths
## On the board

---
class: titled, middle

# Derivations

- Bias of the estimator
- Variance of the estimator

---
class: right, middle, inverse

# Lecture summary

---
class: titled, middle

# This week

- Regression is a helpful tool to answer research questions
- The OLS estimator, under some conditions, has some neat properties
- We described these **properties** and some of the **necessary conditions** for these properties to hold

---
class: right, middle, inverse

# Thanks!
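
---
class: titled, middle

# Appendix: consistency by simulation

The consistency property from the slides (the dispersion of the estimator shrinks as the sample size grows) can be illustrated with the same simulated data-generating process. This is an illustrative Python sketch; the sample sizes and simulation count are assumptions, only the population parameters come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population parameters from the slides
mu_educ, sigma_educ, sigma_u = 3, 1, 8000
alpha, beta = 15000, 2000

def slope_sd(n, n_sims=500):
    """Dispersion of the OLS slope across repeated samples of size n."""
    slopes = np.empty(n_sims)
    for s in range(n_sims):
        educ = rng.normal(mu_educ, sigma_educ, n)
        y = alpha + beta * educ + rng.normal(0, sigma_u, n)
        slopes[s] = np.cov(educ, y)[0, 1] / np.var(educ, ddof=1)
    return slopes.std()

# Assumed sample sizes, chosen to show the trend
sds = {n: slope_sd(n) for n in (50, 200, 800)}
print(sds)  # dispersion shrinks as n grows, roughly like 1 / sqrt(n)
```

Larger samples give estimates that concentrate ever more tightly around `\(\beta\)`, matching `\(\lim_{n \to \infty} Var(\hat{\beta}| X) = 0\)`.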