Code
# install.packages("gapminder")
library(tidyverse)
library(gapminder)
Why use particular functional transformations for the variables in your model?
October 8, 2024
This page gives both instructions and some interactive snippets. You will have to run most of your code on RStudio
, on your own computer.
First, let’s load useful packages. We need the tidyverse
package but also to install the gapminder
package (it contains the dataset we are interested in).
Load the data by calling gapminder
(after loading the package using library(gapminder)
).
What does each variable represent? (use the help()
function). How many countries are there in this data set? .
Here, we will study the relationship between lifeExp
and gdpPercap
.
The goal is to study the relationship between GDP per capita and life expectancy at birth. We can ask questions like: do “richer” countries have a larger life expectancy on average? On average, how different is life expectancy between two countries that have a 1 billion difference in GDP per capita?
Think about other interesting questions you could answer with this data set (start simple).
Plot a scatter plot of lifeExp
as a function of gdpPercap
. To do so, use the same lines of code as in the previous exercise on airquality, changing the relevant variables names.
Interpret this graph. For which level of GDP/capita is the difference between two countries having relatively similar GDP per capita the largest?
Does the relationship described on the graph you plotted look linear?
What if you plot the log of gdpPercap
, replacing gdpPercap
by log(gdpPercap)
? .
Why does the log allow to have a more linear relationship here? Plot the distribution of the gdpPercap
, using the following lines of code, filling in the blanks. Which variable should you put on the x-axis?
What do you notice about this distribution? Is it surprising to you?
If instead you plot the distribution of log(gdpPercap)
(using the previous lines of code), this distribution seems roughly . Importantly, it is relatively centered around its mean.
Assume you want to write a model to estimate the relationship between gdpPercap
and lifeExp
(or transformed versions of these variables) with OLS. For the resulting model to accurately represent the data generating process, the relationship between the variables therefore has to be linear. You want to write a model of the form
\[y = \alpha + \beta x + e\] What should be \(y\) in your case? . And x? .
To estimate this model, we will use the lm
function. Which formula should you pass to the function? . Run the estimtion and display the output using the summary()
function.
Before doing the next questions, please read the slides, including the few that we did not have time to cover in class, they will help you answer the questions.
What is the point estimate for lifeExp
? . How would you interpret this coefficient?
Is this number exact or is it an approximation?
Does this number seem reasonable to you? Think about what it entails.
Let’s now look at the residuals of this regression in order to explore potential risks of heteroskedasticity. We can retrieve the residuals of a regression with the function resid <- residuals(reg)
or by extracting them from the output of the regression: resid <- reg$residuals
. Try these two approaches and look at the resulting vecotrs.
Remember what heteroskedasticity is. Which scatter plot should we plot to observe it, if there is any to observe? against .
To do that, we should store these variable into a dataframe, using the tibble function. Let’s call them x1
and x2
for the sake of the example. Replace x1
and x2
by the relevant variables in the following code. Note that to access variables in the gapminder
dataset, you need to use $
: for instance to access the variable year, write gapminder$year
.
Does there seem to be some heteroskedasticity here?
Why is it problematic?
There are actually formal tests for heteroskedasticity. They however go beyond what is covered in this section of the class.
We saw in class that when the residuals are normally distributed, our estimator has nice properties. Plot an histogram to describe the distribution of the residuals. Which function should you use?
Does that seem roughly normally distributed?
To actually test for normality, there is a bunch of tests one can implement but you will learn more about these later. For example, we can plot the residuals against draws from a normal distribution. If the residuals are normally distributed, the resulting points will be on the identity line. To plot this, we can use the geom_qq()
function (geom_qq_line()
plots the identity line):