library(tidyverse)
A first exercise to get started with regression analysis in R
October 1, 2024
This page describes the instructions and provides some interactive snippets. You will have to run most of your code in RStudio, on your own computer.
First, let’s load useful packages. For now, we will only need the tidyverse package.
We will work with the airquality data set. This data set is automatically loaded into your R environment; you can call it directly. What are its dimensions? ___ rows and ___ columns.
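A minimal sketch of how the dimensions can be checked, using only base R functions. The answers themselves are left for you to read off the output.

```r
# airquality ships with R (datasets package), so nothing has to be imported
dim(airquality)    # number of rows, then number of columns
nrow(airquality)   # rows only
ncol(airquality)   # columns only
head(airquality)   # first six rows, to get a feel for the variables
```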
What does each variable represent? Use the help function to answer this question.
There is a shortcut for the help function: just put a ? before the function or data set you want help for, e.g., help(print) and ?print will have the same effect. Try it out!
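Applied to the data set used here, the two forms could look as follows. The call to str() is an addition that is not part of the exercise; it simply lists the variables and their types next to what the help page says.

```r
# Both calls open the same documentation page for the data set
help(airquality)
?airquality

# str() complements the help page: it lists each variable with its type
str(airquality)
```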
How are Ozone and Wind related? First, we will plot the raw data using a scatter plot and ggplot.
ggplot is a tool that allows you to make awesome plots in R. It is based on the Grammar of Graphics, a theoretical approach to making graphs that essentially builds them layer by layer.
Which variable should you put on the
Now, fill in the blanks in the following chunk of code to plot a graph describing the relationship between the two variables. To do so, replace x_variable and y_variable with the names of the variables you want to plot.
Do not forget to label the axes (indicating the units) and give a title to your graph. You can also play around with the various lines, for instance adding them one after the other, to understand their use. You will learn more about ggplot in TD.
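One possible way to complete such a plot is sketched here. It assumes that Wind goes on the horizontal axis and Ozone on the vertical axis, and it takes the units (miles per hour, parts per billion) from the help page of the data set. The title text and the choice of theme_minimal() are illustrative, not part of the exercise.

```r
library(ggplot2)  # part of the tidyverse

# Assumption: Wind is the explanatory variable (x) and Ozone the outcome (y)
p <- ggplot(airquality, aes(x = Wind, y = Ozone)) +
  geom_point(na.rm = TRUE) +   # silently drop the days with missing Ozone
  labs(
    x = "Average wind speed (mph)",
    y = "Mean ozone concentration (ppb)",
    title = "Ozone and wind speed in New York, May to September 1973"
  ) +
  theme_minimal()

p  # printing the object draws the plot
```

Each line after the first adds one layer (points, labels, theme), so commenting lines out one at a time shows what each contributes.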
Now that we have quickly plotted the data, let’s run our first regression. To do so, we will use the lm() function. What does lm stand for? ___
To regress y on x, we need to write the formula y ~ x. Write your formula of interest for the airquality dataset and run the function.
You can store the result of this regression, in the reg variable for instance. Use the assignment operator <- and this will create the reg variable.
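A minimal sketch of these two steps, assuming that Ozone is the outcome and Wind the single predictor:

```r
# Assumption: Ozone is the outcome (y) and Wind the predictor (x)
reg <- lm(Ozone ~ Wind, data = airquality)

# Typing the name of the object prints the call and the estimated coefficients
reg
```

Rows with a missing Ozone value are dropped by lm() by default, so the regression uses fewer observations than the data set has rows.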
The nice thing with this is that you can then call the summary() function on this object, i.e. summary(reg).
Run this function. There is a lot of information there. What is the value of the coefficient of interest? ___
Interpret this coefficient. What word should your sentence start with? ___
As always, there is some uncertainty associated with the value of this coefficient. What is the standard error of this coefficient? ___ Is the estimate statistically significant? ___
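The quantities asked for here can be read from the printed summary, or extracted from it directly. The sketch below assumes the regression of Ozone on Wind; the name coef_table is only illustrative, and the confidence interval is an extra that the exercise does not ask for.

```r
reg <- lm(Ozone ~ Wind, data = airquality)

# Full regression table: estimates, standard errors, t values, p-values
summary(reg)

# The coefficient table is a matrix, so single entries can be extracted
coef_table <- coef(summary(reg))
coef_table["Wind", "Estimate"]     # point estimate for Wind
coef_table["Wind", "Std. Error"]   # its standard error
coef_table["Wind", "Pr(>|t|)"]     # p-value of the test that it equals zero

# 95% confidence interval, another way to express the uncertainty
confint(reg, "Wind", level = 0.95)
```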
Let’s consider a different model, including temperature. Assume that
...
by .
What is the coefficient associated with Wind? ___
Is it close to the one we had before, including only Wind as a predictor? ___ Why is it different?
Make a sentence to interpret this coefficient, starting with the same word as usual (___).
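The exact model statement is not fully reproduced on this page, so the following sketch assumes the simplest reading: temperature enters as an additional linear predictor through the Temp variable. The object names reg_wind and reg_wind_temp are illustrative.

```r
# Simple model from before, and the (assumed) extended model with Temp
reg_wind      <- lm(Ozone ~ Wind, data = airquality)
reg_wind_temp <- lm(Ozone ~ Wind + Temp, data = airquality)

summary(reg_wind_temp)

# Put the two Wind coefficients side by side
c(
  wind_only     = unname(coef(reg_wind)["Wind"]),
  wind_and_temp = unname(coef(reg_wind_temp)["Wind"])
)

# Wind and Temp are correlated, which is one reason the coefficient moves
cor(airquality$Wind, airquality$Temp)
```

When a predictor that is correlated with both Wind and Ozone is left out, the Wind coefficient also picks up part of its effect, which is why the two estimates need not agree.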
What is the unit of Wind? ___ Is it easily interpretable for you? Not really. Let’s transform it into something we are more familiar with: ___
Which dplyr function would you use to create a new variable? ___
To get from Wind to your new variable, you need to ___ Wind by 1.609. Implement this and run your new regression.
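One way this could be implemented is sketched below, assuming the goal is to convert wind speed from miles per hour to kilometres per hour. The names Wind_kmh, airquality_kmh, reg_mph and reg_kmh are hypothetical, chosen for illustration.

```r
library(dplyr)  # part of the tidyverse

# Assumption: Wind is recorded in miles per hour; 1 mile is about 1.609 km
airquality_kmh <- airquality %>%
  mutate(Wind_kmh = Wind * 1.609)

reg_mph <- lm(Ozone ~ Wind, data = airquality)
reg_kmh <- lm(Ozone ~ Wind_kmh, data = airquality_kmh)

coef(reg_mph)["Wind"]
coef(reg_kmh)["Wind_kmh"]   # equals the mph coefficient divided by 1.609
```

Rescaling a predictor changes the size of its coefficient but not the fit of the model: the same relationship is simply expressed per km/h instead of per mph.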
How do you interpret the coefficient of interest?
You plotted the data earlier. Does the relationship between Ozone and Wind actually seem linear? ___ What physical process could explain this relationship?