Why select one model over another?
October 15, 2024
This page gives both instructions and some interactive snippets. You will have to run most of your code in RStudio, on your own computer.
First, let’s load useful packages. We need the tidyverse package, and we also need to install the wooldridge package (it contains the dataset we are interested in).
# install.packages("wooldridge")
library(tidyverse)
library(wooldridge)
Load the data by calling wage1 (after loading the package using library(wooldridge)).
Print the dataset. It prints in a different format from what we are used to, because it is stored as a data frame rather than as a tibble. Let’s create a tibble version of this data set using the function as_tibble(); run the following line of code:
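A minimal version of that line, assuming the tibble is named wage (the name that the later glimpse(wage) call relies on):

```r
library(tidyverse)
library(wooldridge)

# Assumed object name: `wage`, to match the later call glimpse(wage)
wage <- as_tibble(wage1)

# Printing a tibble shows only the first rows and the columns that fit on screen
wage
```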
What does each variable represent? (Use the help() function.) How many observations are there in this data set?
We will study how to better explain/predict hourly wage, based on the set of covariates available in the data set.
We want to build a model that accurately describes the variation in wage. Why can understanding how wage evolves with various covariates matter for economics research?
Before anything else, let’s quickly explore the data. To get an overview of the dataset, we can use the function glimpse(wage).
What type of variable does profocc seem to be? Confirm this by exploring the documentation for the data set (the help page).
The summary() function can also provide interesting information on the dataset. Call this function. What is the median number of years of education in the data set?
What is the proportion of married individuals in this data set? 1
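One way to obtain these summaries is sketched below. It assumes the tibble is named wage, and the helper names median_educ and share_married are only illustrative.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# Overview of every variable: min, quartiles, median, mean, max
summary(wage)

# Median years of education, computed directly
median_educ <- median(wage$educ)

# married is a 0/1 dummy, so its mean is the share of married individuals
share_married <- mean(wage$married)

median_educ
share_married
```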
We want to study the variation in wage. Before anything else, let’s plot its distribution, as we did in the previous exercise!
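A possible sketch of such a plot, assuming a histogram is the intended display and that the tibble is named wage; the number of bins is an arbitrary choice:

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# Histogram of hourly wage; inside aes(), `wage` refers to the column, not the tibble
p_hist <- ggplot(wage, aes(x = wage)) +
  geom_histogram(bins = 30) +
  labs(x = "Hourly wage", y = "Number of individuals")

p_hist
```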
Do you notice anything weird or erroneous?
We can then explore its correlation with education. Which ggplot function should we use?
What do we notice?
In addition, larger levels of education seem to be associated with ____ hourly wages.
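One reasonable choice for two numeric variables is a scatter plot with geom_point(). The sketch below assumes the tibble is named wage:

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# Scatter plot of hourly wage against years of education
p_scatter <- ggplot(wage, aes(x = educ, y = wage)) +
  geom_point() +
  labs(x = "Years of education", y = "Hourly wage")

p_scatter
```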
We may also be interested in disparities by gender. We can plot the same graph distinguishing by gender. To do so, add color = female within the aes() function: aes(x = ..., y = ..., color = female).
ggplot treats female as a continuous variable (an integer), whereas this variable is binary. To transform it into a categorical variable, we can use the as_factor() function: aes(x = ..., y = ..., color = as_factor(female)).
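Put together, the plot could look like the following sketch (the tibble name wage and the legend title are assumptions):

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# as_factor() turns the 0/1 integer into a factor, so ggplot uses a discrete colour scale
p_gender <- ggplot(wage, aes(x = educ, y = wage, color = as_factor(female))) +
  geom_point() +
  labs(color = "female")

p_gender
```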
For a given level of education, the wage of female individuals seems to be ____ than the wage of non-female ones.
Does the relationship between wage and education seem different for female and non-female individuals?
Go on with this data exploration: look at the correlation between wage and experience, tenure and their squares. Can you see a clear relationship between the square of experience and wage?
Does that mean that there is no statistical relationship between these two variables?
What would you expect this relationship to look like? Why?
You can do a lot more data exploration using graphs but also tables. We will however now turn to modeling.
Let’s first explore the relationship between hourly wage, level of education, experience and experience squared. Regress wage on these variables, as we did last week.
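A minimal sketch of this regression with lm(), assuming the tibble is named wage; the model name m_base is only illustrative:

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# OLS regression of hourly wage on education, experience and experience squared
m_base <- lm(wage ~ educ + exper + expersq, data = wage)

# Coefficient table with standard errors, t-statistics and p-values
summary(m_base)
```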
The variable expersq is already in the data set; you can call it directly. If you use exper^2 in your regression, what happens? To remedy this, you need to tell R that you are actually creating a new variable, using the function I(exper^2): formula = ... ~ ... + I(exper^2).
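The three specifications can be compared side by side. In this sketch the model names (m_caret, m_I, m_var) are illustrative. Inside a formula, ^ is a formula operator rather than arithmetic, so exper^2 collapses to exper.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# In a formula, exper^2 means "exper crossed with itself", which is just exper
m_caret <- lm(wage ~ educ + exper + exper^2, data = wage)

# I() protects the expression, so the square is computed arithmetically
m_I <- lm(wage ~ educ + exper + I(exper^2), data = wage)

# Same model using the pre-computed column
m_var <- lm(wage ~ educ + exper + expersq, data = wage)

names(coef(m_caret))  # no squared term appears
coef(m_I)
coef(m_var)
```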
What is the point estimate for educ? How would you interpret this coefficient?
What is the point estimate for exper? How would you interpret this coefficient?
For how many years of experience could you easily interpret this coefficient? The presence of which variable in the analysis can explain this difference?
Comparing two individuals with the same level of education, we expect one with 3 years of experience to earn on average ____ more per hour than one with 2 years of experience. What is the general formula to compute this quantity?
\[\frac{\partial \widehat{wage}}{\partial exper} = \hat{\beta}_2 + 2 \hat{\beta}_3 \, exper \]
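The formula can be evaluated from the fitted coefficients. The sketch below assumes the tibble is named wage. The helper marginal_exper() and the education level of 12 years are hypothetical choices made for illustration. It also computes the exact predicted difference between 3 and 2 years of experience, which the derivative only approximates.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

m_base <- lm(wage ~ educ + exper + expersq, data = wage)
b <- coef(m_base)

# Marginal effect of experience: beta_2 + 2 * beta_3 * exper
marginal_exper <- function(exper) b[["exper"]] + 2 * b[["expersq"]] * exper
marginal_exper(2)

# Exact predicted difference between 3 and 2 years of experience,
# holding education fixed (12 years is an arbitrary illustrative value)
new_obs <- tibble(educ = 12, exper = c(2, 3), expersq = exper^2)
exact_diff <- unname(diff(predict(m_base, newdata = new_obs)))
exact_diff
```

The exact difference equals \(\hat{\beta}_2 + 5\hat{\beta}_3\) (since \(3^2 - 2^2 = 5\)), while the derivative at 2 years gives \(\hat{\beta}_2 + 4\hat{\beta}_3\).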
How do you interpret the sign of the coefficient on expersq?
There are decreasing marginal returns to experience.
We will now explore how the relationship between level of education and wage varies by marital status. If we add a married dummy, what is the coefficient for educ?
How do we interpret it?
For a given level of education, on average, we expect a married individual to earn ____ dollars more/less per hour than a non-married one.
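A sketch of the regression with the married dummy added, assuming it extends the earlier specification with exper and expersq; the model name m_married is illustrative:

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# The dummy shifts the intercept for married individuals; the educ slope is common to both groups
m_married <- lm(wage ~ educ + exper + expersq + married, data = wage)

coef(m_married)[c("educ", "married")]
```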
Does the relationship between level of education and wage vary by marital status?
If you plot the regression lines for both groups, they will be ____.
To make this relationship vary by marital status, we can interact the variable of interest, educ, with married. To interact two variables, we can use the : operator, e.g. y ~ x:z.
Which variables do we need to include in the model?
educ, married, educ:married, exper and expersq
Try only having educ*married, exper and expersq in your formula. What happens to your results? What does the * do?
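The two ways of writing the formula can be checked against each other. In this sketch the model names m_colon and m_star are illustrative.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# Main effects and interaction written out explicitly
m_colon <- lm(wage ~ educ + married + educ:married + exper + expersq, data = wage)

# a*b is shorthand for a + b + a:b
m_star <- lm(wage ~ educ * married + exper + expersq, data = wage)

# Compare coefficients by name so that term ordering does not matter
all.equal(coef(m_colon)[names(coef(m_star))], coef(m_star))
```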
Comparing two married individuals with the same level of experience, we expect one with one more year of education to earn on average ____ more per hour than the other.
Comparing two non-married individuals with the same level of experience, we expect one with one more year of education to earn on average ____ more per hour than the other.
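These two quantities can be read off the interacted model. The sketch below assumes the specification with educ*married, exper and expersq. The names slope_non_married and slope_married are illustrative.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

m_int <- lm(wage ~ educ * married + exper + expersq, data = wage)
b <- coef(m_int)

# For non-married individuals (married = 0), the interaction term drops out
slope_non_married <- b[["educ"]]

# For married individuals (married = 1), the interaction coefficient is added
slope_married <- b[["educ"]] + b[["educ:married"]]

slope_non_married
slope_married
```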
Plot the relationship between wage and education. Add a regression line. To do so, we need to use the ____ function.
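Let’s now plot separate lines for married vs non-married individuals. To do so, we need to add the color = married aesthetic (aes(..., color = married)). As with female, married is stored as an integer, so wrap it in as_factor() for ggplot to draw one line per group.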
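One way to draw these lines is geom_smooth(method = "lm"), sketched below under the assumption that the tibble is named wage. The transparency and legend title are cosmetic choices.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# A discrete colour aesthetic defines the groups, so geom_smooth() fits one line per group
p_lines <- ggplot(wage, aes(x = educ, y = wage, color = as_factor(married))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "married")

p_lines
```

Note that each line here comes from a simple regression of wage on educ within the group, so it does not control for exper or expersq as the models above do.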
Let’s now dive into model selection.
What is the \(R^2\) of the regression of wage on the level of education, the level of experience and its square? 2
What is the adjusted \(R^2\)? Is the difference with the \(R^2\) large? Why?
How would you interpret the \(R^2\) (or the adjusted \(R^2\))?
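Both statistics are stored in the object returned by summary(). A short sketch, with the illustrative names m_base and s:

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

m_base <- lm(wage ~ educ + exper + expersq, data = wage)
s <- summary(m_base)

# Share of the variance of wage explained by the model
s$r.squared

# Same quantity, penalised for the number of regressors
s$adj.r.squared
```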
Add nonwhite, tenure, female, married and numdep to the model. Does this model explain a larger proportion of the variation in wage?
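A sketch of this comparison, using update() to extend the smaller model; the names m_small and m_large are illustrative:

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

m_small <- lm(wage ~ educ + exper + expersq, data = wage)

# Add the extra covariates to the existing formula
m_large <- update(m_small, . ~ . + nonwhite + tenure + female + married + numdep)

# Compare fit and the educ coefficient across the two specifications
fit_comparison <- tibble(
  model  = c("small", "large"),
  r2     = c(summary(m_small)$r.squared, summary(m_large)$r.squared),
  adj_r2 = c(summary(m_small)$adj.r.squared, summary(m_large)$adj.r.squared),
  educ   = c(coef(m_small)[["educ"]], coef(m_large)[["educ"]])
)

fit_comparison
```

Because the models are nested, the \(R^2\) cannot decrease when variables are added, so the adjusted \(R^2\) is the more informative column here.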
Does the coefficient for educ vary? Why?
Is the interpretation of this coefficient the same as before?
Let’s get back to our simple model of wage on the level of education, the level of experience and its square. Does including female affect the coefficient for educ? Why?
In the first specification, female is an ____. What other variables in the data set could have similar effects? Should we include them in the model?
Should we include any variables in the model? Why?
Variable selection involves arbitrating this bias-variance trade-off.
To select which variables to include, a basic and common approach is to use information criteria. We will choose the model that minimizes an information criterion, either the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). The associated functions are AIC() and BIC().
Write several models including some of the following variables, in addition to educ: exper, expersq, married, tenure, female, numdep and nonwhite.
Which model would you choose?
In particular, estimate the model including all the covariates and compute its AIC. Then do the same, excluding nonwhite. Which model would you choose?3
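This last comparison can be sketched as follows. AIC() and BIC() accept several fitted models at once and return one row per model. The model names m_full and m_no_nonwhite are illustrative.

```r
library(tidyverse)
library(wooldridge)

wage <- as_tibble(wage1)

# Model with all the covariates listed above
m_full <- lm(
  wage ~ educ + exper + expersq + married + tenure + female + numdep + nonwhite,
  data = wage
)

# Same model without nonwhite
m_no_nonwhite <- update(m_full, . ~ . - nonwhite)

# One row per model; the preferred model is the one with the lower criterion
aic_table <- AIC(m_full, m_no_nonwhite)
bic_table <- BIC(m_full, m_no_nonwhite)

aic_table
bic_table
```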