
This vignette introduces the main functions in the package and provides examples of simple analyses that can be carried out with these functions.

More information is available in the documentation of each function. For the gallicagram function, for instance, it can be accessed by typing help("gallicagram") in the console.

Main functions

Occurrences

The main function in this package, gallicagram, builds a data frame with the yearly, monthly or daily proportion of mentions of a term in one of the corpora.

For the Le Monde corpus, this function can also compute the number of newspaper articles that mention the keyword (parameter n_of = "articles").

ex_occur <- gallicagram(
  keyword = "président",
  corpus = "lemonde",
  from = 1960,
  to = 1970,
  resolution = "monthly"
)

Here is an example of the first rows of the output:

date keyword n_occur n_total prop_occur year month corpus resolution n_of
1960-01-01 président 1338 872943 0.0015327 1960 1 lemonde monthly 1-grams
1960-02-01 président 1360 915672 0.0014852 1960 2 lemonde monthly 1-grams
1960-03-01 président 1461 928764 0.0015731 1960 3 lemonde monthly 1-grams
1960-04-01 président 1239 772707 0.0016035 1960 4 lemonde monthly 1-grams
1960-05-01 président 1355 835612 0.0016216 1960 5 lemonde monthly 1-grams
1960-06-01 président 1314 850245 0.0015454 1960 6 lemonde monthly 1-grams
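
As a quick check on how the columns relate, the sketch below retypes the first three rows shown above by hand (no API call) and recomputes the proportion. It assumes, based on the displayed values, that prop_occur is n_occur divided by n_total; the names first_rows and prop_check are only illustrative.

```r
# First rows of the output shown above, typed in by hand (no API call)
first_rows <- data.frame(
  date = as.Date(c("1960-01-01", "1960-02-01", "1960-03-01")),
  n_occur = c(1338, 1360, 1461),
  n_total = c(872943, 915672, 928764),
  prop_occur = c(0.0015327, 0.0014852, 0.0015731)
)

# Assumption: the proportion is the count divided by the corpus total
first_rows$prop_check <- first_rows$n_occur / first_rows$n_total

# Largest gap between the recomputed and the printed proportions;
# it should stay below the rounding precision of the printed table
max(abs(first_rows$prop_check - first_rows$prop_occur))
```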

Co-occurrences

gallicagram_cooccur builds a data frame with the yearly, monthly or daily proportion of mentions of close co-occurrences of two keywords in one of the three main corpora (lemonde, livres and presse) between two specified dates.

Close co-occurrences correspond to words that are less than 3 words apart (4 in the Le Monde corpus) in the original text. For the Le Monde corpus, this function can also count co-occurrences within an entire article (parameter cooccur_level = "articles").

ex_cooccur <- gallicagram_cooccur(
  keyword_1 = "président",
  keyword_2 = "république",
  corpus = "lemonde",
  from = 1960,
  to = 1970,
  resolution = "monthly"
)

Here is an example of part of the output:

date keyword_1 keyword_2 n_cooccur n_total prop_cooccur year month corpus resolution cooccur_level
1960-01-01 président république 232 575465 0.0004032 1960 1 lemonde monthly 4-grams
1960-02-01 président république 295 606116 0.0004867 1960 2 lemonde monthly 4-grams
1960-03-01 président république 287 616477 0.0004655 1960 3 lemonde monthly 4-grams
1960-04-01 président république 212 513816 0.0004126 1960 4 lemonde monthly 4-grams
1960-05-01 président république 182 552610 0.0003293 1960 5 lemonde monthly 4-grams
1960-06-01 président république 196 563800 0.0003476 1960 6 lemonde monthly 4-grams
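
To make the idea of a close co-occurrence more concrete, here is a toy illustration in base R. It assumes that two words co-occur "at the 4-gram level" when both appear in the same window of 4 consecutive words, which is one plausible reading of the cooccur_level column above. The function count_cooccur_windows and the example sentence are made up for this sketch; it is not the code used by the API, whose exact counting rule is not described here.

```r
# Count the n-word windows of a token vector that contain both keywords.
# Illustration only: this is not the code used by the Gallicagram API.
count_cooccur_windows <- function(tokens, word_1, word_2, n) {
  if (length(tokens) < n) {
    return(0L)
  }
  starts <- seq_len(length(tokens) - n + 1)
  has_both <- vapply(
    starts,
    function(i) {
      window <- tokens[i:(i + n - 1)]
      word_1 %in% window && word_2 %in% window
    },
    logical(1)
  )
  sum(has_both)
}

# Made-up sentence, already lower-cased and split into words
tokens <- c("le", "président", "de", "la", "république", "a", "reçu",
            "le", "président", "du", "conseil")

# Number of 4-word windows containing both keywords
count_cooccur_windows(tokens, "président", "république", n = 4)
```

Widening the window (a larger n) can only keep or increase the number of windows in which a given pair of occurrences falls together, which is why the window size matters when comparing corpora.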

Associations

gallicagram_associated builds a data frame with the words most frequently used close to a given keyword over the period. For the Le Monde corpus, this function can also count co-occurrences within an entire article (parameter distance = "articles").

ex_associated <- gallicagram_associated(
  keyword = "camarade",
  corpus = "lemonde",
  from = 1960,
  to = 1970,
  n_results = 10,
  distance = 3
)

The terms most often associated with “camarade” are:

associated_word n_cooccur keyword corpus from to cooccur_level
khrouchtchev 76 camarade lemonde 1960 1970 3-grams
mao 47 camarade lemonde 1960 1970 3-grams
toung 40 camarade lemonde 1960 1970 3-grams
ancien 38 camarade lemonde 1960 1970 3-grams
tse 28 camarade lemonde 1960 1970 3-grams
dubcek 22 camarade lemonde 1960 1970 3-grams
promotion 18 camarade lemonde 1960 1970 3-grams
ami 16 camarade lemonde 1960 1970 3-grams
compagnie 16 camarade lemonde 1960 1970 3-grams
thorez 16 camarade lemonde 1960 1970 3-grams

Graphs

One of the main uses of Gallicagram is to plot time series of occurrences in a corpus. The function gallicagraph does this with a single additional line of code. It takes as input a data frame produced by one of the functions in the package and automatically produces the corresponding graph:

# my own theme for ggplot graphs
mediocrethemes::set_mediocre_all()

gallicagram(keyword = "obama", corpus = "lemonde", from = 2005) |> 
  gallicagraph()

Note that I use my own ggplot theme, mediocrethemes. This theme is of course not mandatory, and one can remove the line mediocrethemes::set_mediocre_all().

One can also use the same function to plot words associated with a given keyword:

gallicagram_associated(keyword = "obama", corpus = "lemonde", from = 2005) |> 
  gallicagraph()
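
For more control over the figure, a similar time series can be drawn directly with ggplot2 from the date and prop_occur columns shown in the Occurrences section. The following is a minimal sketch using hand-typed values from that table. It does not reproduce what gallicagraph does internally, and the labels are arbitrary choices.

```r
library(ggplot2)

# Values retyped from the Occurrences output above (no API call)
occur <- data.frame(
  date = as.Date(c("1960-01-01", "1960-02-01", "1960-03-01",
                   "1960-04-01", "1960-05-01", "1960-06-01")),
  prop_occur = c(0.0015327, 0.0014852, 0.0015731,
                 0.0016035, 0.0016216, 0.0015454)
)

# One line per keyword would need a colour aesthetic; here there is one series
p <- ggplot(occur, aes(x = date, y = prop_occur)) +
  geom_line() +
  labs(
    x = NULL,
    y = "Proportion of 1-grams",
    title = "Mentions of \"président\" in Le Monde"
  )

p
```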

More than one keyword

Occurrences of a lexicon

Each of the main functions described above has a lexicon counterpart, named by adding the suffix _lexicon (gallicagram_lexicon, gallicagram_cooccur_lexicon and gallicagram_associated_lexicon). Each of these functions simply loops the initial function over each word in the lexicon(s) and sums the results. It thus computes the sum of the (co-)occurrences of each word in the lexicon:

ex_lexicon <- gallicagram_lexicon(
  lexicon = c("président", "présidente"),
  corpus = "lemonde",
  from = 1960,
  to = 1970,
  resolution = "monthly"
)

ex_cooccur_lexicon <- gallicagram_cooccur_lexicon(
  lexicon_1 = c("président", "présidente"),
  lexicon_2 = c("mauvais", "nul"),
  corpus = "lemonde",
  from = 1960,
  to = 1970,
  resolution = "monthly"
)

Here is an example of part of the output of gallicagram_lexicon:

date keyword n_occur n_total prop_occur year month corpus resolution n_of lexicon
1960-01-01 président 1341 872943 0.0015362 1960 1 lemonde monthly 1-grams président+présidente
1960-02-01 président 1362 915672 0.0014874 1960 2 lemonde monthly 1-grams président+présidente
1960-03-01 président 1465 928764 0.0015774 1960 3 lemonde monthly 1-grams président+présidente
1960-04-01 président 1241 772707 0.0016060 1960 4 lemonde monthly 1-grams président+présidente
1960-05-01 président 1358 835612 0.0016252 1960 5 lemonde monthly 1-grams président+présidente
1960-06-01 président 1315 850245 0.0015466 1960 6 lemonde monthly 1-grams président+présidente

These functions are particularly helpful to compute the number of occurrences of words from a whole lexical field (or the plural or feminine form of a keyword, etc).
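
The "loop and sum" logic can be sketched by hand with base R. The counts for "président" are retyped from the Occurrences section. The counts for "présidente" are not shown anywhere in this vignette: they are inferred here as the difference between the lexicon output and the single-keyword output, on the assumption that the lexicon total is a plain sum. The object names are illustrative only.

```r
# Per-keyword monthly counts (no API call).
# "présidente" counts are inferred: lexicon total minus "président" count.
per_word <- data.frame(
  date = as.Date(rep(c("1960-01-01", "1960-02-01", "1960-03-01"), times = 2)),
  keyword = rep(c("président", "présidente"), each = 3),
  n_occur = c(1338, 1360, 1461, 3, 2, 4),
  n_total = rep(c(872943, 915672, 928764), times = 2)
)

# Sum occurrences over the words of the lexicon, month by month
summed <- aggregate(n_occur ~ date + n_total, data = per_word, FUN = sum)
summed <- summed[order(summed$date), ]

# Recompute the proportion on the summed counts
summed$prop_occur <- summed$n_occur / summed$n_total

summed
```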

Words with the same stem

The function get_same_stem() retrieves some of these forms: those that share the same stem as the keyword of interest. For instance, for “écologie”:

get_same_stem("écologie")
#> [1] "écologie"       "écologique"     "écologiquement" "écologiques"   
#> [5] "écologiste"     "écologiste"     "écologistes"    "écologistes"

This function is particularly useful when its output is passed to one of the _lexicon functions:

ex_stem_lexicon <- get_same_stem("écologie") |> 
  gallicagram_lexicon(
    corpus = "lemonde",
    from = 1960,
    to = 1970,
    resolution = "yearly"
  )

Note that due to the number of requests sent to the API, it may take some time to run.

date keyword n_occur n_total prop_occur year corpus resolution n_of lexicon
1960-01-01 écologie 4 10378608 4e-07 1960 lemonde yearly 1-grams écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes
1961-01-01 écologie 1 10397003 1e-07 1961 lemonde yearly 1-grams écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes
1962-01-01 écologie 1 11987113 1e-07 1962 lemonde yearly 1-grams écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes
1963-01-01 écologie 0 12223185 0e+00 1963 lemonde yearly 1-grams écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes
1964-01-01 écologie 12 12823655 9e-07 1964 lemonde yearly 1-grams écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes
1965-01-01 écologie 6 12860422 5e-07 1965 lemonde yearly 1-grams écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes
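
The output of get_same_stem("écologie") shown above contains repeated entries ("écologiste" and "écologistes" each appear twice), and these repetitions also show up in the lexicon column. Whether the _lexicon functions drop duplicates internally is not stated here, so it may be worth checking that repeated forms are not counted twice. A cautious option, sketched below on a hand-typed copy of that output, is to deduplicate the vector with unique() before passing it on.

```r
# Hand-typed copy of the get_same_stem("écologie") output shown above
stems <- c("écologie", "écologique", "écologiquement", "écologiques",
           "écologiste", "écologiste", "écologistes", "écologistes")

# Keep each form once, preserving the original order
stems_unique <- unique(stems)

length(stems)
length(stems_unique)
```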

Searching several keywords

One can also run the functions for several keywords separately, without summing the results. To do so, one can use the function purrr::map_dfr. It takes as parameters the vector of keywords to search, followed by the function (gallicagram) and additional arguments to pass to that function. It returns a single data frame with the results of the search for each keyword, essentially binding together the data frames produced by each keyword search.

library(purrr)

keywords <- c("république", "france")

purrr::map_dfr(keywords, gallicagram,
               corpus = "lemonde", from = 1960, to = 1970)
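
Since purrr 1.0.0, map_dfr is superseded in favour of map() followed by list_rbind(). The sketch below shows that pattern, assuming purrr 1.0.0 or later is installed. To keep it runnable without querying the API, it uses a hypothetical stand-in function, stand_in_search, that only mimics the shape of a result; in practice one would call gallicagram in its place.

```r
library(purrr)

# Hypothetical stand-in for gallicagram(): returns one row per year.
# It does not query any API and its columns are simplified.
stand_in_search <- function(keyword, corpus, from, to) {
  data.frame(
    keyword = keyword,
    corpus = corpus,
    year = from:to
  )
}

keywords <- c("république", "france")

# One data frame per keyword, then bound into a single data frame
results <- keywords |>
  map(\(k) stand_in_search(k, corpus = "lemonde", from = 1960, to = 1962)) |>
  list_rbind()

results
```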

Based on that same principle, we can run several searches at once, varying all parameters, not only the keyword searched. To do that, we can specify the sets of parameters in a data frame, each row corresponding to one set of parameters to run the gallicagram function with. We then pass this data frame to purrr::pmap_dfr.

To specify the sets of parameters, we can build the parameters data frame by hand. It is also often helpful to use tidyr::crossing to create all combinations of possible searches.

params_pmap <-
  tibble::tibble(
    from = 1850,
    to = 1870,
    resolution = "yearly"
  ) |>
  tidyr::crossing(corpus = c("presse", "livres")) |>
  tidyr::crossing(keyword = c("république", "france"))

The corresponding set of parameters to search looks like this:

from to resolution corpus keyword
1850 1870 yearly livres france
1850 1870 yearly livres république
1850 1870 yearly presse france
1850 1870 yearly presse république

We can then pass it to purrr::pmap_dfr, which will call the function gallicagram once for each of the 4 sets of parameters defined in the rows of params_pmap:

purrr::pmap_dfr(params_pmap, gallicagram)

This method also applies to the other functions in the rallicagram package.
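
One point worth keeping in mind with this approach is that pmap matches the columns of the parameters data frame to the arguments of the function by name, so the column names must be exactly the argument names. The sketch below illustrates this with a hypothetical stand-in function, echo_search, that simply echoes its arguments instead of querying the API. It assumes purrr 1.0.0 or later and builds the grid with base R's expand.grid rather than tidyr::crossing, only to limit dependencies.

```r
library(purrr)

# Hypothetical stand-in for gallicagram(): echoes its arguments as one row.
# It does not query any API.
echo_search <- function(keyword, corpus, from, to, resolution) {
  data.frame(
    keyword = keyword,
    corpus = corpus,
    from = from,
    to = to,
    resolution = resolution
  )
}

# All combinations of the varying parameters; column names match
# the argument names of echo_search()
params <- expand.grid(
  keyword = c("république", "france"),
  corpus = c("corpus_a", "corpus_b"),  # placeholder corpus names
  from = 1850,
  to = 1870,
  resolution = "yearly",
  stringsAsFactors = FALSE
)

# One call per row, each column passed to the argument of the same name
echoed <- pmap(params, echo_search) |>
  list_rbind()

echoed
```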