Automated literature review: getting abstracts

In the present document, we retrieve abstracts for our automated review of the literature on short-term health effects of air pollution, focusing mainly on the epidemiology literature.

Packages used
library("groundhog")
packages <- c(
  "tidyverse", 
  "tidytext", 
  "wordcloud",
  "retrodesign", 
  "lubridate",
  # "fulltext", #does not run with the last R version
  "DT"
)

# groundhog.library(packages, "2022-11-28")
lapply(packages, library, character.only = TRUE)

set.seed(1)

Overall approach

We take advantage of a somewhat standardized reporting mechanism to retrieve point estimates and confidence intervals from the abstracts using regular expressions (regex).

The algorithm we wrote is not perfect: it probably does not detect all estimates and may pick up some incorrect point estimates and/or confidence intervals. However, based on quick non-automated checks we ran, these potential issues seem very limited. Retrieving all this information via careful reading of the abstracts would have been extremely cumbersome, so this automated analysis is an important time saver. Importantly, the analysis carried out for this literature could easily be replicated for another literature, based on the code used and described in this document.
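
To fix ideas, here is a minimal sketch of the kind of extraction we perform, using a deliberately simplified pattern (not the regex used below) on a typical abstract sentence:

#minimal illustration; stringr is loaded via the tidyverse
sentence <- "The excess risk was 1.3% (95% CI: 0.8-1.7) per 10 μg/m3 of PM10."

#point estimate: the number right before the opening parenthesis
str_extract(sentence, "[\\d\\.]+(?=%?\\s*\\()") #should return "1.3"

#confidence interval: the two numbers following "95% CI: "
str_extract(sentence, "(?<=95% CI: )[\\d\\.]+-[\\d\\.]+") #should return "0.8-1.7"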

Selecting articles and retrieving metadata

We use the fulltext package to get the abstract of each article corresponding to our search query.

We focus on articles indexed in Scopus and PubMed. To access the Scopus API, one needs to register to get an Elsevier API key (stored in the .Renviron) and a Crossref TDM API key. PubMed articles are accessed via Entrez. An API key increases the number of requests per second from 3 to 10. More information on authentication is available in the fulltext manual.
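
As a hedged sketch of the authentication setup (the exact environment-variable names are assumptions and should be checked against the fulltext manual; ENTREZ_KEY is the name used by the rentrez package):

#add entries such as these to your .Renviron (e.g. via usethis::edit_r_environ()),
#then restart R:
# ELSEVIER_SCOPUS_KEY=your-scopus-api-key
# CROSSREF_TDM=your-crossref-tdm-key
# ENTREZ_KEY=your-entrez-api-key #raises the Entrez limit from 3 to 10 requests per second
Sys.getenv("ENTREZ_KEY") != "" #quick check that the key is visible to the session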

Set of articles to consider

First of all, we need to clearly define the set of articles we want to consider in this analysis. Our search query is:

‘TITLE((“air pollution” OR “air quality” OR “particulate matter” OR ozone OR “nitrogen dioxide” OR “sulfur dioxide” OR “PM10” OR “PM2.5” OR “carbon dioxide” OR “carbon monoxide”) AND (“emergency” OR “mortality” OR “stroke” OR “cerebrovascular” OR “cardiovascular” OR “death” OR “hospitalization”) AND NOT (“long term” OR “long-term”)) AND “short term”’

query <- 
  paste('TITLE(("air pollution" OR "air quality" OR "particulate matter" OR "ozone"', 
        'OR "nitrogen dioxide" OR "sulfur dioxide" OR "PM10" OR "PM2.5" OR', 
        ' "carbon dioxide" OR "carbon monoxide")', 
        'AND ("emergency" OR "mortality" OR "stroke" OR "cerebrovascular" OR', 
        '"cardiovascular" OR "death" OR "hospitalization")' ,
        'AND NOT ("long term" OR "long-term")) AND "short term"'
  )

opts_entrez <- list(use_history = TRUE)

#Run a search
search <- ft_search(query, from = "scopus", limit = 2000)
search_entrez <- ft_search(
  str_replace(query, "AND NOT", "NOT"), 
  from = "entrez", 
  limit = 300, 
  entrezopts = opts_entrez
)

We then retrieve and wrangle the related metadata. Since the metadata from the two sources have different shapes, we only select a few relevant columns to build an overall metadata set.

metadata_scopus <- search$scopus$data %>% 
  as_tibble() %>% 
  rename_all(function(x) str_remove_all(x, "prism:|dc:")) %>% 
  rename_all(function(x) str_replace_all(x, "-", "_")) %>% 
  select(doi, title, creator, publicationName, pubmed_id, coverDate) %>% 
  rename(
    authors = creator,
    journal = publicationName
  ) %>% 
  mutate(
    pubmed_id = ifelse(!str_detect(pubmed_id, "[0-9]{7}"), NA, pubmed_id),
    pub_date = ymd(coverDate)
  ) %>% 
  select(-coverDate)

saveRDS(metadata_scopus, "data/literature_review_epi/outputs/metadata_scopus.RDS")

metadata_entrez <- search_entrez$entrez$data %>% 
  as_tibble() %>% 
  # rename(id = uid) %>% 
  select(doi, title, authors, fulljournalname, pmid, pubdate) %>% 
  rename(
    journal = fulljournalname,
    pubmed_id = pmid
  ) %>% 
  mutate(
    pubmed_id = ifelse(!str_detect(pubmed_id, "[0-9]{7}"), NA, pubmed_id),
    pub_date = ymd(pubdate)
  ) %>% 
  select(-pubdate)

saveRDS(metadata_entrez, "data/literature_review_epi/outputs/metadata_entrez.RDS")

metadata_lit_review <- metadata_scopus %>% 
  rbind(metadata_entrez) %>% 
  filter(!is.na(doi)) %>% 
  mutate(pb_doi = str_detect(doi, "[<>;]")) %>% #some dois are not valid
  filter(pb_doi == FALSE) %>% 
  select(-pb_doi) %>% 
  group_by(doi) %>% 
  filter(pub_date == max(pub_date, na.rm = TRUE)) %>% #some articles have been published twice
  mutate(n_with_doi = n()) %>% 
  filter(n_with_doi == 1 | (n_with_doi > 1 & pubmed_id == max(pubmed_id, na.rm = TRUE))) %>%
  #a couple of articles appear twice with differently formatted author names;
  #we arbitrarily keep the one with the largest pubmed_id
  select(-n_with_doi) %>% 
  ungroup() %>%
  distinct(title, .keep_all = TRUE)

# saveRDS(metadata_lit_review, "data/literature_review_epi/outputs/metadata_lit_review.RDS")

Retrieving abstracts

There is no fulltext function to access abstracts from Entrez. Therefore, using the DOIs, we get the abstracts from Semantic Scholar. We also access the Scopus abstracts from Semantic Scholar since, due to an IP address constraint, we cannot access the texts and abstracts from Scopus directly.

Semantic Scholar imposes a rate limit of 100 articles per 5 minutes, i.e., 20 articles per minute. We therefore need to pause between batches to be able to download everything. In addition, some DOIs are not valid, so we filtered them out in a previous step (pb_doi).1
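
For instance, with about 300 unique DOIs, the function below creates 15 batches of 20 DOIs and pauses roughly a minute between batches, so the whole download takes on the order of 15 minutes.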

get_abstracts <- function(doi) {
  vect_doi <- unique(doi)
  number_periods <- (length(vect_doi) - 1) %/% 20
  abs <- NULL

  message(str_c("Total downloading time: ", number_periods, "min"))
  
  for (i in 0:number_periods) {
    
    doi_period <- vect_doi[(20*i+1):(20*(i+1))]
    doi_period <- doi_period[!is.na(doi_period)]
    
    skip_to_next <- FALSE #to handle issues, using tryCatch
    
    possible_error <- tryCatch(
      abs_period <- doi_period %>%
        ft_abstract(from = "semanticscholar") %>%
        .$semanticscholar %>%
        as_tibble() %>%
        unnest(cols = everything()) %>%
        pivot_longer(everything(), names_to = "doi", values_to = "abstract") %>%
        filter(doi != abstract),
      error = function(e) e
    )
    
    if (inherits(possible_error, "error")) {
      warning(
        str_c("The abstracts for the following articles could not be downloaded: ",
              str_c(doi_period, collapse = ",")))
      next
    } else {
       abs <- abs %>%
        rbind(abs_period)
    }
    
    if (i < number_periods & number_periods != 0) {
      message(str_c("Remaining time: ", (number_periods - i), "min"))
      Sys.sleep(63)
    }
  } 
  
  return(abs)
}

#run this from the console to see the time remaining (copy/paste it)
abstracts <- metadata_lit_review %>%
  .$doi %>% 
  get_abstracts()  %>% 
  left_join(metadata_lit_review, by = "doi") 

# saveRDS(abstracts, "data/literature_review_epi/outputs/abstracts.RDS")

Retrieving effects and confidence intervals

Now that we have retrieved the abstracts, we want to extract the effects and associated confidence intervals. Part of the literature directly reports effects and 95% confidence intervals in the abstract.2 We identify effects and CIs as follows:

string_confint <- str_c(
  "((?<!(\\d\\.|\\d))95\\s?%|(?<!(\\d\\.|\\d))95\\s(per(\\s?)cent)|",
  "\\bC(\\.)?(I|l)(\\.)?(s?)\\b|\\bPI(s?)\\b|\\b(i|I)nterval|",
  "\\b(c|C)onfidence\\s(i|I)nterval|\\b(c|C)redible\\s(i|I)nterval|", 
  "\\b(p|P)osterior\\s(i|I)nterval)"
  )
num_confint <- 
  "(-\\s?|−\\s?)?[\\d\\.]{1,7}[–\\s:\\~;,%\\-to\\‐-]{1,5}(-\\s?|−\\s?)?[\\d\\.]{1,7}"
num_effect <- "(-\\s?|−\\s?)?[\\d\\.]{1,7}"

detected_effects <- abstracts %>%
  #some journals use a middle dot as the decimal separator; standardize it
  mutate(abstract = str_replace_all(abstract, "·", ".")) %>%
  select(doi, abstract) %>%
  unnest_tokens(
    sentence,
    abstract,
    token = "sentences",
    to_lower = FALSE,
    drop = FALSE
  ) %>%
  mutate(
    # contains_CI = str_detect(sentence, string_confint),
    #remove commas used as thousands separators (e.g. 12,000 -> 12000)
    sentence = str_replace_all(
      sentence,
      "(?<=(?<!\\.)(?<!\\d)\\d{1,4}),(?=(\\d{3}(?!\\.)))",
      ""
    )
  ) %>%
  # filter(contains_CI) %>%
  mutate(
    CI = str_extract_all(
      sentence,
      str_c(
        "((?<=", string_confint, "[^\\d]{0,4})", num_confint,")|",
        "(?<=", num_effect,"[^\\d\\.]{0,5})", 
        "(?<=(\\(|\\[))", num_confint,"(?=%?[\\)\\];])"
      )
    ),
    effect = str_extract_all(
      sentence,
      str_c(
        num_effect,
        "(?=[^\\d\\.]{0,30}([^\\.\\d]", string_confint,"))(?<![^\\.\\d]95)|",
        num_effect,
        "(?=[^\\d\\.]{0,5}(\\(|\\[)", num_confint, "(?=%?[\\)\\];]))(?<![^\\.\\d]95)"
      )
    )
  ) 

These lines of code return a set of confidence intervals and effects for each sentence containing a relevant phrase (“CI”, “confidence interval”, etc.). For now, if we do not detect the same number of effects and confidence intervals in a sentence, we drop the whole sentence, even if it contains, say, 5 effect-CI pairs and only one of them is badly detected.

Note that some problems might remain with the estimates and CIs we detect. Yet, the vast majority of estimates seem to be correctly detected. Here are examples of the confidence intervals and effects detected using our current method:

sentences_with_CI <- detected_effects %>% 
  filter(CI != "character(0)" & effect != "character(0)")
random_sentences <- sample(1:length(sentences_with_CI$sentence), 5)
str_view_all(
  sentences_with_CI$sentence[random_sentences],
  str_c(
    num_effect,
    "(?=[^\\d\\.]{0,17}([^\\.\\d]", string_confint, "))(?<![^\\.\\d]95)|",
    num_effect,
    "(?=[^\\d\\.]{0,5}(\\(|\\[)", num_confint, "(?=%?[\\)\\];]))(?<![^\\.\\d]95)",
    "|((?<=", string_confint, "[^\\d]{0,4})", num_confint,")|",
    "(?<=", num_effect,"[^\\d\\.]{0,5})(?<=(\\(|\\[))", num_confint, "(?=%?[\\)\\];])"
  )
)
[1] │ Results The excess risk for non-accidental mortality was <1.3>% [95% confidence interval (CI), <0.8–1.7>] per 10 μg/m3 of PM10, with higher excess risks for cardiovascular and above age 65 mortality of <1.9>% (95% CI, <0.8–3.0>) and <1.5>% (95% CI, <0.9–2.1>), respectively.
[2] │ After adjustment, top (hazard ratio [HR], <1.48>; 95% confidence interval [CI], <1.08-2.04>; P=.016) and bottom (HR, <1.54>; 95% CI, <1.08-2.18>; P=.017) quartiles of BE were associated with increased risk of readmission or death.
[3] │ On admission days with SO2 levels above the median, mortality was higher (OR <1.13>; 95% CI <1.10, 1.16>) at <12.2>% (95% CI <11.4, 13>) compared with <10.7>% (95% CI <10.3, 11.1>) on days when SO2 levels were below the median.
[4] │ Using these methods, we estimated <362000> (95% confidence interval, <173000–551000>) annual global premature cardiopulmonary deaths attributable to ozone, approximately 50% of the 700000 premature deaths we calculated in our original study (Anenberg et al.
[5] │ During the period of high PM10 concentration (>89.82 μg/m3), NO2 demonstrated its strongest effect for total non-accidental mortality (ERR%: <0.92>, 95% CI: <0.42–1.42>) and cardiovascular disease mortality (ERR%: <1.20>, 95% CI: <0.38–2.03>).

Once the effects and CIs are identified, some wrangling is necessary to get the data into a usable format. We also choose to drop effects which do not fall within their CI (62 estimates) in order to get rid of most of the poorly detected effect-CI pairs.

estimates_to_clean <- detected_effects %>% 
  filter(lengths(effect) == lengths(CI)) %>% #if number of effects != nb of CI for a sentence,
  #can't attribute effects to CI so, drop sentence
  unnest(c(effect, CI), keep_empty = TRUE) %>%
  mutate(CI = str_remove_all(CI, "\\s")) %>% 
  # separate(CI, into = c("low_CI", "up_CI"), "([\\s,]+)|(?<!^)[-–]") %>% 
  separate(CI, into = c("low_CI", "up_CI"), "(?<!^)[–:;,%\\-to\\‐]{1,5}") %>%
  mutate(across(c("effect", "low_CI", "up_CI"), .fns = as.numeric)) %>% 
  mutate(
    low_CI = ifelse(is.na(up_CI), NA, low_CI),
    up_CI = ifelse(is.na(low_CI), NA, up_CI),
    effect = ifelse(is.na(low_CI), NA, effect)
  ) %>% 
  filter(!is.na(effect)) %>%  
  filter(effect > low_CI & effect < up_CI)

Expression in terms of percentage change

Note that some effects are reported in terms of relative risks or odds ratios. We need to express all estimates in terms of percentage increases or raw increases in order to have \(H_0: \text{effect} = 0\). Some abstracts, while mentioning “Relative Risk”, still report their estimates in terms of percentages. Others report their estimates in terms of actual relative risks, i.e., in the form 1.024 for a 2.4% increase for instance. We want to convert the latter into percentage changes.
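
The conversion itself is a simple rescaling. As an illustration (rr_to_percent is a throwaway helper used only here, not in the pipeline below):

#a relative risk of 1.024 corresponds to a (1.024 - 1) * 100 = 2.4% increase;
#the same transformation is applied to the CI bounds
rr_to_percent <- function(rr) (rr - 1) * 100
rr_to_percent(c(1.024, 0.95)) #2.4 and -5 (a protective effect)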

We therefore detect abstracts mentioning “Relative Risk” or “Risk Ratio” (or “Odds Ratio”) and, among these effects, distinguish those expressed in actual RR terms from those expressed in percentages. We then convert the former to percentages. To limit further potential misdetection, we only transform effects that are between 0 and 2.

estimates_RR <- estimates_to_clean %>% 
  group_by(abstract) %>% 
  mutate(
    RR = str_detect(abstract, "((R|r)elative (R|r)isks?|(R|r)isk (R|r)atios?|\\WRR\\W)"),
    OR = str_detect(abstract, "(O|o)dds (R|r)atios?|\\WOR\\W")
  ) %>% 
  ungroup() %>% 
  mutate(
    effect_percent = str_detect(sentence, str_c("\\D", effect, "%")),
    #flag the estimates to convert before transforming `effect`, so that the CI
    #bounds are converted under the same condition as the point estimate
    to_convert = (OR | RR) & !effect_percent & between(effect, 0, 2),
    effect = ifelse(to_convert, (effect - 1)*100, effect),
    low_CI = ifelse(to_convert, (low_CI - 1)*100, low_CI),
    up_CI = ifelse(to_convert, (up_CI - 1)*100, up_CI)
  ) %>%
  select(doi, effect, low_CI, up_CI)

Filtering out invalid articles

Reading manually through the abstracts, we notice that some of the articles returned by the query do not correspond to the type of articles we want to study. For instance, some articles look at the impact of air pollution on animal health, and others actually study long-term effects. We therefore looked at all the abstracts and created a dummy variable indicating whether each of them should be included in the analysis. We filtered at this stage to minimize the number of abstracts to read.

valid_articles <- read_csv("data/literature_review_epi/inputs/valid_articles.csv",
                           col_types = cols(valid = col_logical(), title = col_skip()))

estimates <- estimates_RR %>% 
  left_join(valid_articles, by = "doi") 

# saveRDS(estimates, "data/literature_review_epi/outputs/estimates.RDS")

The estimates data frame displays, in each row, a point estimate and the lower and upper bounds of the CI, along with the DOI of the article from which this estimate is extracted. We retrieved 2666 valid point estimates and associated confidence intervals. We analyze them further in another document.
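
For instance, the first excess-risk estimate in example [1] above would yield a row with effect = 1.3, low_CI = 0.8 and up_CI = 1.7, together with the DOI of that article.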

Retrieving additional information

It might also be interesting to have information about the type of pollutant considered in each study, the study period, the number of observations, or the type of outcome studied (mortality, emergency admissions, stroke, cerebrovascular or cardiovascular diseases). We thus use regex to recover this information. Of course, this does not enable us to retrieve the data for all the abstracts considered, but it provides a useful source of information: retrieving it “by hand” would be extremely cumbersome. Regex are thus very helpful here, and they also make this analysis reproducible.

Number of observations

In most studies in this field, observations are daily and at the city level. To compute the number of observations, we thus retrieve the length of the study period from the abstract when possible, along with the number of cities considered in the study.
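
For instance, a study following 10 cities over two years corresponds to roughly 730 × 10 = 7,300 city-day observations; this is the product we compute at the end of this section.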

Length of the study period

To compute the length of the study period, we look for beginning and end dates of the study period in the abstract. Some abstracts contain phrases such as “from January 2002 to March 2011” to indicate the study period, and we take advantage of such mentions. We first define a helper function, text_to_date, to transform the text data into a date format. Then, we detect the study periods and wrangle them into a usable format. If we retrieve several lengths for a given article, we take a conservative approach and only keep the longest one.

abstracts_only <- abstracts %>% 
  mutate(abstract = str_replace_all(abstract, "·", ".")) %>% 
  select(doi, abstract)

month_regex <- str_c(
  "\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|", 
  "Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)")
date_regex <- str_c("(", month_regex, "\\s){0,1}(19|20)\\d{2}")

text_to_date <- function(date_text) {
  year <- str_remove_all(date_text, "[^\\d]")
  #match both full and abbreviated month names through their first three letters
  month <- match(str_sub(str_remove_all(date_text, "[\\d\\s]"), 1, 3), month.abb)
  month <- ifelse(is.na(month), "01", month) #default to January when no month is given
  date <- dmy(str_c("01-", str_pad(month, width = 2, pad = "0"), "-", year))
  return(date)
}
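
A quick sanity check of this helper on the kind of strings we extract from the abstracts (expected values shown as comments):

#both full and abbreviated month names should be handled
text_to_date("January 2002") #should return 2002-01-01
text_to_date("Mar 2011")     #should return 2011-03-01
text_to_date("2011")         #no month mentioned: defaults to 2011-01-01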

articles_length_study <- abstracts_only %>%
  mutate(
    dates_obs = str_extract_all(
      abstract, 
      str_c(
        "(", date_regex, "|", month_regex, ")", 
        "( to |\\s?—\\s?|\\s?-\\s?| and )", 
        date_regex
      )
    )
  ) %>% 
  unnest(dates_obs, keep_empty = TRUE) %>%
  separate(
    dates_obs, 
    into = c("begin_obs", "end_obs"), 
    "( to |\\s?—\\s?|\\s?-\\s?| and )"
  ) %>% 
  mutate(
    #when the study starts and ends in the same year (e.g. "March to June 2011"),
    #the regex only captures the month for begin_obs: we append the end year to it
    begin_obs = ifelse(
      !str_detect(begin_obs, "\\d"), 
      paste(begin_obs, str_remove_all(end_obs, "[^\\d]")),
      begin_obs)
  ) %>%
  mutate(
    begin_obs = text_to_date(begin_obs),
    end_obs = text_to_date(end_obs),
    length_study = time_length(end_obs - begin_obs, unit = "days"),
    length_study = ifelse(length_study < 0 | end_obs > today(), NA, length_study)
  ) %>% 
  select(doi, length_study) %>% 
  group_by(doi) %>% 
  mutate(
    length_study = max(length_study, na.rm = TRUE),
    length_study = ifelse(length_study < 0, NA, length_study)
  ) %>% 
  ungroup() %>% 
  distinct()

We retrieve the length of the study period for about 34.4% of the articles. Note that with this method, we may miss studies that span exactly one year, e.g., when only “2011” is mentioned in the abstract.

Number of cities considered

We then try to detect the number of cities considered in each abstract. To do so, we combine two different techniques: we look for explicit counts such as “10 cities” (after converting written-out numbers into digits), and we count the distinct names of large cities mentioned in the abstract, using a list of world cities. We then keep the larger of the two counts:

number_word <- tibble(
  number = 1:5000, 
  word = english::words(1:5000)
)

abstracts_in_numbers <- abstracts_only %>% 
  unnest_tokens(word, abstract) %>%
  left_join(number_word, by = "word") %>%
  mutate(word = ifelse(!is.na(number), number, word)) %>% 
  select(-number) %>% 
  group_by(doi) %>% 
  summarize(abstract = str_c(word, collapse = " ")) %>%
  ungroup() %>% 
  mutate(abstract = str_remove_all(abstract, "-"))

#Source: https://simplemaps.com/data/world-cities
worldcities <- read_csv("data/literature_review_epi/inputs/worldcities.csv")

worldcities_large <- worldcities %>%
  mutate(
    city_regex = str_c("\\b", city_ascii, "\\b"),
    city_regex = str_to_lower(city_regex),
    city_regex = str_remove_all(city_regex, "-")
  ) %>%
  filter(population > 500000)

articles_number_cities <- abstracts_in_numbers %>% 
   mutate(
    n_many_cities = str_extract_all(
      abstract, 
      # "(?<!more\\sthan)\\d+(?=\\s?(c|C)it(y|ies))"
      "\\d+(?=\\s?(c|C)it(y|ies))"
    ),
    abstract_ascii = stringi::stri_trans_general(abstract,"Latin-ASCII") %>% str_to_lower(),
    abstract_ascii = str_remove_all(abstract_ascii, "-")
  ) %>%
  unnest(n_many_cities, keep_empty = TRUE) %>%
  mutate(n_many_cities = as.numeric(n_many_cities)) %>% 
  group_by(doi) %>%
  mutate(n_many_cities = max(n_many_cities)) %>%
  ungroup() %>%
  distinct() %>% 
  group_by(doi) %>%
  mutate(
    n_names_cities = str_extract_all(abstract_ascii, worldcities_large[["city_regex"]]) %>% 
      as_vector() %>% 
      unique() %>% 
      length()
  ) %>%
  ungroup() %>% 
  group_by(doi) %>% 
  mutate(
    n_cities = max(n_many_cities, n_names_cities, na.rm = TRUE),
    n_cities = ifelse(n_cities == 0, NA, n_cities)
  ) %>% 
  ungroup() %>% 
  select(doi, n_cities) %>% 
  distinct()

We retrieve the number of cities considered in the study for about 45.4% of the articles.

Number of observations

Finally, we combine these two pieces of information to compute the number of observations.

articles_n_obs <- articles_length_study %>% 
  full_join(articles_number_cities, by = "doi") %>% 
  mutate(n_obs = n_cities*length_study)

We retrieve a number of observations for about 20.4% of the articles.

Pollutant studied

We then recover, when possible, the pollutant(s) considered in the study. We assume that a pollutant mentioned in the abstract is analyzed in the study. This might be slightly inaccurate, since a pollutant can be mentioned in an abstract even though the corresponding study does not analyze it, but it seems a reasonable first-order approximation: it is rather unlikely that a study on particulate matter pollution, for instance, also talks about ozone in its abstract. Note that an abstract sometimes mentions, and a study analyzes, several pollutants.

abstracts_with_titles <- abstracts %>% 
  mutate(
    abstract = str_replace_all(abstract, "·", "."),
    abstract_title = str_c(title, abstract, sep = ". ")
  ) %>% 
  select(doi, abstract_title)

articles_pollutant <- abstracts_with_titles %>% 
  mutate(
    pollutant = str_extract_all(
      abstract_title,
      str_c("(\\bPM\\s?2(\\.|,)5|\\bPM\\s?10|\\bO\\s?3\\b|\\b(o|O)zone\\b|",
            "\\b(P|p)articulate(\\s(M|m)atter\\b)?|\\bNO\\s?2|",
            "\\b(n|N)itrogen\\s?(d|D)ioxide\\b|\\bNO\\b|",
            "\\b(n|N)itrogen\\s?(o|O)xide\\b|\\bNO\\s?(x|X)\\b|\\bSO\\s?2|",
            "\\bCO\\b|\\bBC\\b|\\b(A|a)ir\\s(Q|q)uality\\s(I|i)ndex\\b)")
    )
  ) %>% 
  unnest(pollutant, keep_empty = TRUE) %>% 
  group_by(doi) %>% 
  mutate(
    pollutant = tolower(pollutant), 
    pollutant = str_replace_all(pollutant, "\\s", ""),
    pollutant = str_replace_all(pollutant, ",", "\\."),
    pollutant = case_when(
      pollutant == "nitrogendioxide" ~ "no2",
      pollutant == "nitrogenoxide" ~ "no", 
      pollutant == "ozone" ~ "o3",
      pollutant == "particulate" ~ "particulatematter",
      TRUE ~ pollutant
    ),
    pollutant = str_to_upper(pollutant),
    pollutant = ifelse(pollutant == "PARTICULATEMATTER", "Particulate matter", pollutant),
    pollutant = ifelse(pollutant == "AIRQUALITYINDEX", "Air Quality Index", pollutant)
  ) %>% 
  distinct(pollutant, .keep_all = TRUE) %>% 
  ungroup() %>% 
  select(-abstract_title) %>% 
  nest(pollutant = pollutant)

We identify the pollutant(s) considered in the study for about 82.4% of the articles.

Outcome considered

Following a similar methodology as for pollutants, we retrieve information about the outcomes considered.

articles_outcome <- abstracts_with_titles %>% 
  mutate(
    outcome = str_extract_all(
      abstract_title, 
      "(\\b(M|m)ortality\\b|\\b(D|d)eath(s)?\\b|\\b(H|h)ospitalization|\\b(E|e)mergenc)"
    )
  ) %>% 
  unnest(outcome, keep_empty = TRUE) %>% 
  group_by(doi) %>%
   mutate(
    outcome = tolower(outcome),
    outcome = ifelse(str_starts(outcome, "emergenc|hospitalization"), "Emergency",
                     ifelse(str_starts(outcome, "death|mortalit"), "Mortality", NA))
  ) %>%
  distinct(outcome, .keep_all = TRUE) %>%
  ungroup() %>%
  nest(outcome = outcome) %>%
  select(-abstract_title)

We identify the outcome(s) considered in the study for about 81.4% of the articles.

Sub-population considered

Using a similar methodology, we try to identify the sub-population considered (infants or the elderly). Note that when the whole population is studied, we do not recover any information: it would be a bit far-fetched to assume that, whenever no sub-population is mentioned, the whole population is studied, so we abstain from doing so.

articles_subpop <- abstracts_with_titles %>% 
  mutate(
    subpop = str_extract_all(
      abstract_title, 
      "(\\b(I|i)nfant|\\b(E|e)lder)"
    )
  ) %>% 
  unnest(subpop, keep_empty = TRUE) %>% 
  group_by(doi) %>%
   mutate(
    subpop = tolower(subpop),
    subpop = ifelse(str_starts(subpop, "infant"), "Infants",
                     ifelse(str_starts(subpop, "elder"), "Elders", NA))
  ) %>%
  distinct(subpop, .keep_all = TRUE) %>%
  ungroup() %>%
  nest(subpop = subpop) %>%
  select(-abstract_title)

We identify the sub-population considered in the study for about 13% of the articles.

Additional information on journals

It is also interesting to have access to journal fields. This will enable us to see whether some academic research fields are more subject to certain types of issues than others.

We retrieve information on journal fields from Scopus. In their source list, they classify all journals into approximately 330 sub-subject areas, which we match with the journal names in our database. Scopus also provides coarser subject-area categorizations, for instance one with 5 fields: Multidisciplinary, Physical Sciences, Health Sciences, Social Sciences and Life Sciences. They provide a correspondence table between those two classifications.

Note that some journals are assigned several of these fields. We choose to classify those as multidisciplinary journals.

subject_subsubject_corres <-
  read_csv("data/literature_review_epi/inputs/scopus_subject_corres.csv") %>%
  rename(
    area_code = Code,
    subsubject_area = Field, 
    subject_area = `Subject area`
  ) %>%
  drop_na()

journal_subsubject_corres <- read_csv("data/literature_review_epi/inputs/scopus_journal_subsubject.csv") %>% 
  rename(
    journal = Title,
    subsubject_area = `Scopus Sub-Subject Area`,
    area_code = `Scopus ASJC Code (Sub-subject Area)`
  ) %>% 
  select(journal, subsubject_area, area_code)

#To classify journals as multidisciplinary
journal_subject_corres <- journal_subsubject_corres %>% 
  left_join(subject_subsubject_corres, by = c("area_code", "subsubject_area")) %>% 
  select(journal, subject_area) %>% 
  distinct() %>% 
  group_by(journal) %>% 
  mutate(
    n_subject_area = n(),
    subject_area = ifelse(n_subject_area > 1, "Multidisciplinary", subject_area)
  ) %>% 
  ungroup() %>% 
  select(-n_subject_area) %>% 
  distinct()

journal_subject <- journal_subsubject_corres %>% 
  left_join(journal_subject_corres, by = c("journal")) %>% 
  mutate(
    journal_merge = str_to_lower(journal),
    journal_merge = str_remove_all(journal_merge, "[^\\w\\s]")
  )

articles_journal_subject <- abstracts %>% 
  mutate(
    journal_merge = str_to_lower(journal),
    journal_merge = str_remove_all(journal_merge, "[^\\w\\s]")
  ) %>% 
  left_join(journal_subject, by = "journal_merge") %>% 
  select(doi, subject_area, subsubject_area) %>% 
  distinct() %>% 
  nest(subsubject_area = c(subsubject_area)) 

We retrieve information about the subject area for about 87.7% of the articles and about the sub-subject area for about 94.2% of the articles.

Aggregating the information

Finally, we build the overall metadata set by combining all the previous information.

abstracts_and_metadata <- abstracts %>% 
  full_join(articles_n_obs, by = "doi") %>% 
  full_join(articles_pollutant, by = "doi") %>% 
  full_join(articles_outcome, by = "doi") %>% 
  full_join(articles_subpop, by = "doi") %>% 
  full_join(articles_journal_subject, by = "doi") %>% 
  group_by(doi) %>% 
  filter(
    pub_date == max(pub_date, na.rm = TRUE) | 
      is.na(pub_date)
  ) %>% #some articles have been published twice
  ungroup() 

# saveRDS(abstracts_and_metadata, "data/literature_review_epi/outputs/abstracts_and_metadata.RDS")

  1. In case any problem remains, we use tryCatch to record the DOIs corresponding to errors in order to be able to handle them later.↩︎

  2. We analyze the characteristics of articles doing so in another document.↩︎