class: title-slide <br> <br> <br> <br> # Econometrics 1 ## Lecture 7: Hypothesis Testing ### .smaller[Gaetan Bakalli] <br> <img src="data:image/png;base64,#pics/liscence.png" width="25%" style="display: block; margin: auto;" /> .center[.tiny[License: [CC BY NC SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)]] --- # Hypothesis testing - In general, (scientific) hypotheses can be translated into a set of (non-overlapping, idealized) statistical hypotheses: `$$H_0: \theta \color{#eb078e}{\in} \Theta_0 \ \text{ and } \ H_a: \theta \color{#eb078e}{\not\in} \Theta_0.$$` - In a hypothesis test, the statement being tested is called the .hi-purple[null hypothesis] `\(\color{#373895}{H_0}\)`. A hypothesis test is designed to assess the strength of the evidence against the null hypothesis. - The .hi-purple[alternative hypothesis] `\(\color{#373895}{H_a}\)` is the statement we hope or suspect to be true instead of `\(\color{#373895}{H_0}\)`. - The two hypotheses are mutually exclusive, so that the data can be used to reject one in favor of the other. - .pink[Example:] a drug represses the progression of cancer `$$H_0: \mu_{\text{Drug}} \color{#eb078e}{=} \mu_{\text{Control}} \ \text{ and } \ H_a: \mu_{\text{Drug}} \color{#eb078e}{<} \mu_{\text{Control}}.$$` --- # Hypothesis testing <img src="data:image/png;base64,#pics/great_pic.png" width="100%" style="display: block; margin: auto;" /> --- # Hypothesis testing .smallest[ | | `\(H_0\)` is true | `\(H_0\)` is false | | ------------------- |---------------------------------------------| ----------------------------------------| | Can't reject `\(H_0\)` | `\(\text{Correct decision (prob=}1-\alpha)\)` | `\(\text{Type II error (prob=}1-\beta)\)` | | Reject `\(H_0\)` | `\(\text{Type I error (prob}=\alpha)\)` | `\(\text{Correct decision (prob=}\beta)\)` | ] - The .pink[type I error] corresponds to the probability of rejecting `\(H_0\)` when `\(H_0\)` is true (also called .pink[false positive]).
The .purple[type II error] corresponds to the probability of not rejecting `\(H_0\)` when `\(H_a\)` is true (also called .purple[false negative]). - A test is of .pink[significance level] `\(\color{#e64173}{\alpha}\)` when the probability of making a type I error equals `\(\alpha\)`. Usually we consider `\(\alpha = 5\%\)`; however, this can vary depending on the context. - A test is of .purple[power] `\(\color{#6A5ACD}{\beta}\)` when the probability of making a type II error is `\(1-\beta.\)` In other words, the power of a test is its probability of rejecting `\(H_0\)` when `\(H_0\)` is false (or the probability of accepting `\(H_a\)` when `\(H_a\)` is true). --- # What are p-values? - .smallest[The .hi-purple[p-value] is defined as the probability of observing a test statistic that is at least as extreme as actually observed, assuming that] `\(\small H_0\)` .smallest[is true.] - .smallest[Informally, .pink[a p-value can be understood as a measure of plausibility of the null hypothesis given the data]. A small p-value indicates strong evidence against] `\(\small H_0\)`. - .smallest[When the p-value is small enough (i.e. smaller than the significance level] `\(\small \alpha\)`.smallest[), one says that the test based on the null and alternative hypotheses is .pink[significant] or that the null hypothesis is rejected in favor of the alternative. This is generally what we want because it "verifies" our (research) hypothesis.] - .smallest[When the p-value is not small enough, we cannot reject the null hypothesis with the available data, so nothing can be concluded.] 🤔 - .smallest[In a sense, the obtained p-value summarizes the .pink[incompatibility between the data and the model] constructed under the set of assumptions.] .center[ .smaller.purple["Absence of evidence is not evidence of absence."] <sup>.smallest[👋]</sup>] .footnote[.smallest[👋] From the British Medical Journal.] --- # How to understand p-values?
<img src="data:image/png;base64,#pics/p_value.png" width="45%" style="display: block; margin: auto;" /> 👋 .smallest[If you want to know more, have a look [here](https://xkcd.com/1478/).] --- # So how does it work? Suppose that we collect data on wages and gender (female or male) from `\(n_1\)` females and `\(n_2\)` males. We measure `\(X_i\)` as the difference in wages between the two genders. Our hope is to show that there is a significant difference in wages between males and females. Suppose that the (possibly dependent) data are such that `\(X_i \sim F_{\color{#eb078e}{i}}, \, i = 1,\ldots, n\)` and `\(\mathbb{E}[X_i] = \mu\)`. This is a rather plausible assumption (why? 🤔). To verify our hypothesis we consider: `$$H_0: \mu \color{#eb078e}{=} 0 \ \ \ \ \text{and} \ \ \ \ H_a: \mu \color{#eb078e}{>} 0.$$` This implies that we are considering the following model: `$$X_i = \mu + \color{#373895}{\varepsilon_i},$$` where `\(\color{#373895}{\varepsilon_i} = X_i - \mu\)` can be understood as ".purple[residuals]". --- # So how does it work? Then, by the CLT we have `$$T = \frac{\sqrt{n} \left(\bar{X}_n - \mu_{H_0}\right)}{\color{#373895}{S}} = \frac{\sqrt{n}\bar{X}_n}{\color{#373895}{S}} \color{#eb078e}{\underset{H_0}{\sim}} G \color{#eb078e}{\to} \mathcal{N}(0,1),$$` where `\(\color{#373895}{S} = \sqrt{ \frac{1}{n-1} \sum_{i = 1}^n (X_i - \bar{X}_n)^2}\)` and `\(\color{#eb078e}{\underset{H_0}{\sim}}\)` corresponds to ".pink[distributed under] `\(\color{#eb078e}{H_0}\)` .pink[as]". In the above formula `\(T\)` is a random variable but we can compute its .hi-purple[realization] based on our sample, i.e. `\(\color{#eb078e}{\sqrt{n}\bar{x}_n/s}\)`.
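For instance, the realization of `\(T\)` is easy to compute in `R`. A minimal sketch, using a simulated sample `x` in place of real data (illustrative only):

``` r
# Simulated sample standing in for the observed data (illustrative only)
set.seed(7)
n = 50
x = rnorm(n, mean = 0.5, sd = 1)

# Realization of the test statistic T under H0: mu = 0
t_stat = sqrt(n) * mean(x) / sd(x)
t_stat
```

This is exactly the statistic that `t.test(x, mu = 0)` reports (the two only differ in the reference distribution used to obtain the p-value).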
Let `\(Z \sim \mathcal{N}(0,1)\)`, then using the definition of the p-value<sup>.smallest[👋]</sup>, we have `$$\text{p-value} = \Pr \left(T > \frac{\sqrt{n}\bar{x}_n}{s} \right) \color{#eb078e}{\overset{\tiny CLT}{\approx}} \Pr \left(Z > \frac{\sqrt{n}\bar{x}_n}{s} \right) = 1 - \Phi\left(\frac{\sqrt{n}\bar{x}_n}{s}\right).$$` .footnote[.smallest[👋] Reminder: the .hi-purple[p-value] is defined as the probability of observing a test statistic that is at least as extreme as actually observed, assuming that the null hypothesis is true.] --- # So how does it work? <img src="data:image/png;base64,#pics/pvalue.png" width="80%" style="display: block; margin: auto;" /> --- # Wage example .panelset[ .panel[.panel-name[Import] ``` r # Import data library(wooldridge) data("wage1") wage_female = wage1$wage[wage1$female == 1] wage_male = wage1$wage[wage1$female == 0] # Calculate the standard errors se1 = sd(wage_female) / sqrt(length(wage_female)) se2 = sd(wage_male) / sqrt(length(wage_male)) head(wage_female) ``` ``` #> [1] 3.10 3.24 5.00 3.60 6.25 8.13 ``` ``` r head(wage_male) ``` ``` #> [1] 3.00 6.00 5.30 8.75 11.25 18.18 ``` ] .panel[.panel-name[P-value] ``` r # Calculate the mean of the difference diff_wage = mean(wage_male) - mean(wage_female) # Calculate the standard error of the difference se_diff = sqrt(se1^2 + se2^2) # Compute test statistic t = diff_wage/se_diff # Calculate the one-sided p-value (p_value = 1 - pnorm(t)) ``` ``` #> [1] 0 ``` .smallest[This result suggests that males have higher wages than females.] ] ] --- # One-sample Student's t-test So far, we have assumed that the (possibly dependent) data are such that `\(X_i \sim F_{\color{#eb078e}{i}}, \, i = 1,\ldots, n\)` and `\(\mathbb{E}[X_i] = \mu\)`.
However, in the .pink[very special] case where `$$X_i \color{#eb078e}{\overset{iid}{\sim}} \mathcal{N}(\mu, \sigma^2),$$` which corresponds to the following model: `$$X_i = \mu + \color{#373895}{\varepsilon_i},$$` with `\(\color{#373895}{\varepsilon_i} = X_i - \mu \color{#eb078e}{\overset{iid}{\sim}} \mathcal{N}(0,\sigma^2)\)`, we have `$$T = \frac{\sqrt{n} \left(\bar{X}_n - \mu_{H_0}\right)}{\color{#373895}{S}} = \frac{\sqrt{n}\bar{X}_n}{\color{#373895}{S}} \color{#eb078e}{\underset{H_0}{\overset{}{\sim}}} \text{Student}(n-1) \color{#eb078e}{\overset{}{\to}} \mathcal{N}(0,1).$$` Unlike our previous result, `\(T\)` .pink[follows exactly] a `\(\text{Student}(n-1)\)` distribution for all `\(n\)`. 📝 `\(\text{Student}(n) \color{#eb078e}{\overset{}{\to}} \mathcal{N}(0,1)\)` as `\(n \color{#eb078e}{\overset{}{\to}} \infty\)`. --- # Remarks - .smaller[.pink[Why is it called Student?] The Student's t distributions were discovered in 1908 by William S. Gosset, who was a statistician employed by the Guinness brewing company. Gosset devised the t-test as an economical way to monitor the quality of stout 🍻. The company forbade its scientists from publishing their findings, so Gosset published his statistical work under the pen name ".pink[Student]".] - .smaller[The t-test (similarly to the test previously discussed) can be used to test if the population mean is .pink[greater, smaller, or different than] a hypothesized value, i.e.]
`$$H_0: \mu \color{#eb078e}{=} \mu_0 \ \ \ \ \text{and} \ \ \ \ H_a: \mu \ \big[ \color{#eb078e}{>} \text{ or } \color{#eb078e}{<} \text{ or } \color{#eb078e}{\neq} \big] \ \mu_0.$$` - .smaller[The t-test accounts for the uncertainty in the sample variance, and we have:] `$$\text{p-values based on } \mathcal{N} (0, 1)\color{#eb078e}{<}\text{p-values based on } \text{Student}(n).$$` --- # Remarks <img src="data:image/png;base64,#pics/pvalue2.png" width="80%" style="display: block; margin: auto;" /> --- # R syntax for the t-test In `R`, we can use the function `t.test(...)` to compute the p-value for the one(two)-sample Student's t-test. For more information, have a look at `?t.test`. Here are some examples with different alternative hypotheses: `$$H_0: \mu \color{#eb078e}{=} 5 \ \ \ \ \text{and} \ \ \ \ H_a: \mu \color{#eb078e}{>} 5$$` ``` r t.test(data, alternative = "greater", mu = 5) ``` `$$H_0: \mu \color{#eb078e}{=} 0 \ \ \ \ \text{and} \ \ \ \ H_a: \mu \color{#eb078e}{<} 0$$` ``` r t.test(data, alternative = "less", mu = 0) t.test(data, alternative = "less") ``` `$$H_0: \mu \color{#eb078e}{=} 0 \ \ \ \ \text{and} \ \ \ \ H_a: \mu \color{#eb078e}{\neq} 0$$` ``` r t.test(data, alternative = "two.sided", mu = 0) ``` --- # Wage example (with t-test) .panelset[ .panel[.panel-name[Results] 1. .purple[Define hypotheses:] `\(H_0: \mu_m = \mu_f\)` and `\(H_a: \mu_m \color{#e64173}{>} \mu_f\)`. 2. .purple[Define] `\(\color{#373895}{\alpha}\)`: We consider `\(\alpha = 5\%\)`. 3. .purple[Compute p-value]: p-value = `\(4.243 \times 10^{-16}\)` (see R output tab for details). 4. .purple[Conclusion:] We have p-value < `\(\alpha\)` and so we can reject the null hypothesis at the significance level of 5% and conclude that males have significantly higher wages than females.
] .panel[.panel-name[`R` output] ``` r t.test(wage_male,wage_female, alternative = "greater") ``` ``` #> #> Welch Two Sample t-test #> #> data: wage_male and wage_female #> t = 8.44, df = 456.33, p-value < 2.2e-16 #> alternative hypothesis: true difference in means is greater than 0 #> 95 percent confidence interval: #> 2.021307 Inf #> sample estimates: #> mean of x mean of y #> 7.099489 4.587659 ``` .smallest[As expected, the p-value of this test is larger than that of the previous test (based on the CLT).] ]] --- # Limitations of the one-sample t-test The .pink[reliability] of the t-test strongly relies on: 1. The absence of .pink[outliers]; 2. For moderate and small samples, the sample distribution being at least .pink[approximately normal], with no strong skewness or heavy tails. <sup>.smallest[✋]</sup> ⚠️ Therefore, before proceeding to any inference, we should check the data preliminarily using a .purple[boxplot], .purple[histogram], or .purple[QQ plot] <sup>.smallest[👋]</sup> to see if a t-test can be used. .footnote[✋ If you want to know more, [here](https://www.annualreviews.org/doi/full/10.1146/annurev.publhealth.23.100901.140546) is an interesting reference. 👋 Check out [QQ plot](https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot).] --- # Is our analysis comprehensive? .panelset[ .panel[.panel-name[Code] Might there be other variables that explain .pink[a difference in wage]? Maybe the level of education? ``` r wageModel = lm(lwage ~ educ + female, data = wage1) summary(wageModel) ``` ] .panel[.panel-name[Result] ``` #> #> Call: #> lm(formula = lwage ~ educ + female, data = wage1) #> #> Residuals: #> Min 1Q Median 3Q Max #> -2.02672 -0.27470 -0.03731 0.26219 1.34738 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 0.826269 0.094054 8.785 <2e-16 *** #> educ 0.077203 0.007047 10.955 <2e-16 *** #> female -0.360865 0.039024 -9.247 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1 #> #> Residual standard error: 0.4455 on 523 degrees of freedom #> Multiple R-squared: 0.3002, Adjusted R-squared: 0.2975 #> F-statistic: 112.2 on 2 and 523 DF, p-value: < 2.2e-16 ``` ] ] --- # .smaller[Interpretation of coefficient p-values] - We notice that for each coefficient `\(\beta_j\)`, there is a corresponding p-value associated with the (Wald t-)test of `\(H_0: \beta_j = 0\)` and `\(H_a: \beta_j \neq 0\)`. - .pink[A covariate with a small p-value (typically smaller than 5%) is considered to be a significant (meaningful) addition to the model], as changes in the values of such a covariate can lead to changes in the response variable. - On the other hand, a large p-value (typically larger than 5%) suggests that the corresponding covariate is not (significantly) associated with changes in the response, or that we don't have enough evidence (data) to show its effect. --- # F-Test in Linear Regression In the context of multiple linear regression, we often use the F-test to test whether .hi-purple[a group of predictors (or all predictors) significantly contribute to explaining the variance] in the dependent variable. - .hi-purple[Null Hypothesis]: The coefficients of the selected predictors are all zero. $$ H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 $$ - .hi-purple[Alternative Hypothesis]: At least one coefficient is non-zero. $$ H_a: \exists \, j \, \text{such that} \, \beta_j \neq 0 $$ What could be the limitation of testing .pink[many] restrictions? --- # Constructing the F-Test Statistic .smallest[ 1. .purple[Fit the Full Model] (including the predictors we want to test) and obtain the residual sum of squares `\(\text{RSS}_{f}\)` 2. .pink[Fit the Reduced Model] (excluding these predictors) and obtain the residual sum of squares `\(\text{RSS}_{\text{r}}\)` 3.
.purple[Calculate the F-statistic]: `$$\hat{F}=\frac{\frac{\text{RSS}_{r}-\text{RSS}_{f}}{p}}{\frac{\text{RSS}_{f}}{n-k}}~;~\hat{F}\overset{\text{under }H_{0}}{\sim} F_{p;(n-k)}$$` - where `\(p\)` is the number of restrictions tested under `\(H_0\)` (i.e. the number of excluded predictors), `\(n\)` is the sample size, and `\(k\)` is the number of parameters in the full model (including the intercept). 4. .pink[Decision Rule]: - Compare the F-statistic to a critical value from the `\(F\)`-distribution with `\(p\)` and `\(n - k\)` degrees of freedom, or use the p-value. A small p-value suggests we reject `\(H_0\)`, indicating that the group of predictors significantly explains variation in the outcome. ]
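The steps above can be sketched in `R` with the `wage1` data used earlier; the choice of `exper` and `tenure` as the tested predictors is only for illustration, and any pair of nested models fitted on the same data works the same way:

``` r
# F-test of H0: the coefficients of exper and tenure are both zero
library(wooldridge)
data("wage1")

full = lm(lwage ~ educ + exper + tenure, data = wage1)  # full model
reduced = lm(lwage ~ educ, data = wage1)                # reduced model

RSS_f = sum(resid(full)^2)     # residual sum of squares, full model
RSS_r = sum(resid(reduced)^2)  # residual sum of squares, reduced model
p = 2                          # number of restrictions tested under H0
k = length(coef(full))         # parameters in the full model (incl. intercept)
n = nrow(wage1)

F_hat = ((RSS_r - RSS_f) / p) / (RSS_f / (n - k))
p_value = 1 - pf(F_hat, df1 = p, df2 = n - k)

# anova() carries out the same nested-model comparison
anova(reduced, full)
```

In practice one rarely computes `F_hat` by hand: `anova(reduced, full)` (or `car::linearHypothesis()`) reports the same statistic and p-value directly.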