Exercise - Hands-on linear regression

A first exercise to get started with regression analysis in R

Date

October 1, 2024

Instructions

This page describes the instructions and contains some interactive snippets. You will have to run most of your code in RStudio, on your own computer.

First, let’s load useful packages. For now, we will only need the tidyverse package.

library(tidyverse)

We will work with the airquality data set.

Exploring the data set

This data set is automatically loaded into your R environment; you can call it directly. What are its dimensions?

____ rows and ____ columns.
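One possible way to check the dimensions, sketched with base R functions only (the name `dims` is just an illustrative variable):

```r
# airquality ships with base R (package datasets), so nothing needs to be loaded
dims <- dim(airquality)   # integer vector: number of rows, then number of columns
dims

nrow(airquality)          # number of rows only
ncol(airquality)          # number of columns only

head(airquality)          # first six rows, to get a feel for the variables
```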

What does each variable represent? Use the help function to answer this question.

Tip

There is a shortcut for the help function: just put a ? before the function or data set you want help for, e.g., help(print) and ?print have the same effect. Try it out!
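Applied to the data set rather than to a function, the two forms could look like this. The call to str() is an optional extra, not part of the original instructions:

```r
# Both calls open the same documentation page for the data set
help(airquality)
?airquality

# str() complements the help page: it lists each column with its type
# and its first few values
str(airquality)
```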

Plotting the relationship

How are Ozone and Wind related? First, we will plot the raw data using a scatter plot and ggplot.

Important

ggplot is a tool that allows you to make awesome plots in R. It is based on the Grammar of Graphics, a theoretical approach to making graphs that essentially builds them layer by layer.

Which variable should you put on the y axis? Why?

Now, fill in the blanks in the following chunk of code to plot a graph describing the relationship between the two variables. To do so, replace x_variable and y_variable with the names of the variables you want to plot.

ggplot(data = airquality, aes(x = Wind, y = Ozone)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "",
    x = "Wind (in mph)",
    y = "Ozone (in ppb)"
  )

Do not forget to label the axes (indicating the units) and give a title to your graph. You can also play around with the various lines, for instance adding them one after the other, to understand their use. You will learn more about ggplot in TD.
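To see what each line contributes, the same plot can be built up one layer at a time. This is only a sketch; the object names (`p`, `p_points`, `p_line`, `p_full`) are made up for the illustration:

```r
library(ggplot2)  # part of the tidyverse

# Step 0: data and aesthetic mapping only; this draws empty axes
p <- ggplot(data = airquality, aes(x = Wind, y = Ozone))

# Step 1: one point per observation
# (rows with a missing Ozone value are dropped with a warning)
p_points <- p + geom_point()

# Step 2: fitted straight line with its confidence band
p_line <- p_points + geom_smooth(method = "lm")

# Step 3: axis labels with units; the title is left for you to choose
p_full <- p_line + labs(x = "Wind (in mph)", y = "Ozone (in ppb)")

# Print any of the intermediate objects to see what each layer adds
p_full
```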

A first regression

Now that we have quickly plotted the data, let’s run our first regression. To do so, we will use the lm() function. What does lm stand for?

To regress y on x, we need to write the formula y ~ x. Write your formula of interest for the airquality data set and run the function.

lm(formula = y_variable ~ x_variable, data = airquality)

You can store the result of this regression in a variable, for instance reg. Use the “assignment” sign <- to create the reg variable.

reg <- lm(formula = Ozone ~ Wind, data = airquality)

The nice thing with this is that you then can call the summary() function on this object:

summary(reg)

Run this function. There is a lot of information there. What is the value of the coefficient of interest?

Interpret this coefficient. What word should your sentence start with?

As always, there is some uncertainty associated with the value of this coefficient. What is the standard error of this coefficient? Is the estimate statistically significant?
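If the printed summary feels crowded, the coefficient table can also be pulled out as a matrix and read entry by entry. A minimal sketch (the name `coef_table` is illustrative, and the confidence interval is an optional extra beyond what the questions ask for):

```r
reg <- lm(formula = Ozone ~ Wind, data = airquality)

# Coefficient table as a matrix: one row per term,
# columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(reg))
coef_table

# Pick out individual entries by row and column name
coef_table["Wind", "Estimate"]     # slope estimate
coef_table["Wind", "Std. Error"]   # its standard error
coef_table["Wind", "Pr(>|t|)"]     # p-value of the test that the slope is zero

# 95% confidence intervals for the intercept and the slope
confint(reg, level = 0.95)
```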

Specification

Multiple regression

Let’s consider a different model, including temperature. Assume that for all $i \in \{1, \dots, n\}$, we have:

$$\text{Ozone}_i = \beta_0 + \beta_1 \text{Wind}_i + \beta_2 \text{Temp}_i + e_i$$

Run a longer regression by adding one predictor to the model. To do so, you need to replace the ... by ____.

reg_temp <- lm(formula = ..., data = airquality)

summary(reg_temp)
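The general pattern for adding a predictor can be illustrated on a different built-in data set, mtcars, so that the blank above stays yours to fill. The model and object names below are assumptions chosen purely for the illustration:

```r
# One predictor: fuel consumption (mpg) explained by weight (wt)
reg_one <- lm(formula = mpg ~ wt, data = mtcars)

# Two predictors: a second variable is added to the formula with +
reg_two <- lm(formula = mpg ~ wt + hp, data = mtcars)

# Coefficient on wt in each model, side by side; it generally changes
# when the added predictor is correlated with wt
c(one_predictor  = unname(coef(reg_one)["wt"]),
  two_predictors = unname(coef(reg_two)["wt"]))
```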

What is the coefficient associated with Wind?

Is it close to the one we had before, when including only Wind as a predictor? Why is it different?

Write a sentence to interpret this coefficient, starting with the same word as usual (____).

Scaling

What is the unit of Wind? Is it easily interpretable for you? Not really. Let’s transform it into something we are more familiar with: ____.

Which dplyr function would you use to create a new variable?

To get from Wind to your new variable, you need to ____ Wind by 1.609. Implement this and run your new regression.

airquality_scaled <- airquality |> 
  ...

reg_scaled <- lm(formula = ..., data = airquality_scaled)

summary(reg_scaled)
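What rescaling a predictor does to its coefficient can be previewed on the mtcars data set. The sketch below is deliberately written in base R, so the dplyr question above stays open; `wt_kg` is a hypothetical column name, and 453.592 converts the original unit of wt (1000 lbs) into kilograms:

```r
# wt is recorded in units of 1000 lbs; wt_kg expresses the same weight in kg
mtcars_scaled <- transform(mtcars, wt_kg = wt * 453.592)

reg_lbs <- lm(formula = mpg ~ wt, data = mtcars)
reg_kg  <- lm(formula = mpg ~ wt_kg, data = mtcars_scaled)

# Multiplying a predictor by a constant divides its coefficient by that
# constant; the two values below coincide
coef(reg_lbs)["wt"] / 453.592
coef(reg_kg)["wt_kg"]
```

The fitted values are identical in both models: only the unit in which the slope is expressed changes, which is what the interpretation sentence has to reflect.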

How do you interpret the coefficient of interest?

Linear relationship?

You plotted the data earlier. Does the relationship between Ozone and Wind actually seem linear? What physical process could explain this relationship?
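Besides looking at the scatter plot, one common visual check is to plot the residuals of the regression against its fitted values. The following is a sketch using base R graphics, not necessarily the approach intended by the exercise:

```r
reg <- lm(formula = Ozone ~ Wind, data = airquality)

# Residuals against fitted values: if the relationship is linear, the points
# scatter around zero without any visible curve
plot(fitted(reg), resid(reg),
     xlab = "Fitted Ozone (in ppb)", ylab = "Residual (in ppb)")
abline(h = 0, lty = 2)

# A smooth curve through the residuals makes a systematic pattern easier to see
lines(lowess(fitted(reg), resid(reg)), col = "red")
```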