In this document, we select NMMAPS cities with the least proportion of missing records to use them in our power simulations.
library(rmarkdown)
library(knitr)
library(here)
library(tidyverse)
library(missRanger)
library(mediocrethemes) #own ggplot theme
set_mediocre_all(pal = "leo")
First, we load the data. We then notice in the exploratory data analysis that average temperature readings were missing for all cities after 1998. We therefore select observations from 1987 to 1997:
We select cities with less than 5% of missing temperature readings:
We will only use the carbon monoxide in our power simulations as it is the pollutant with the smallest proportion of missing readings. We keep cities with less than 5% of missing CO readings:
We have 66 cities remaining. We plot the proportion of missing values for CO by city:
nmmaps_to_clean %>%
group_by(city) %>%
summarise(proportion_missing = sum(is.na(co))/n()*100) %>%
arrange(proportion_missing) %>%
mutate(city = factor(city, levels = city)) %>%
ggplot(., aes(x = proportion_missing, y = city)) +
geom_vline(xintercept = sum(is.na(nmmaps_to_clean$co))/nrow(nmmaps_to_clean)*100) +
geom_point(size = 3) +
xlab("Proportion of Missing Values (%)") +
ylab(NULL)
We also plot the time series of missing values for CO by city:
nmmaps_to_clean %>%
mutate(missing_co = ifelse(is.na(co), "Missing", "Not Missing")) %>%
ggplot(., aes(x = date, y = city, fill = fct_rev(missing_co))) +
geom_tile() +
scale_x_date(breaks = scales::pretty_breaks(n = 6)) +
labs(
x = "Date",
y = NULL,
fill = NULL
) +
theme(panel.grid.major.y = element_blank(),
legend.justification = "left",
axis.text.y = element_text(margin = margin(r = -0.5, unit = "cm")))
The overall distribution of CO in \(\mu g/m^{3}\) is as follows:
Mean | Standard Deviation | Minimum | Maximum |
---|---|---|---|
369.2361 | 398.3864 | 1.03e-05 | 8335.371 |
We impute missing observations for CO concentrations and the average temperature using the missRanger
package:
We scale (“standardize”) the CO concentrations:
We finally keep relevant variables and save the data:
We have 273224 observations, with 4018 daily observations by city.