rallicagram
rallicagram.Rmd
This vignette introduces the main functions in the package and provides examples of simple analyses that can be carried out with these functions.
More information is available in the documentation of each function,
accessible by typing help("gallicagram")
in the console for
instance, for the gallicagram
function.
Main functions
Occurrences
The main function in this package, gallicagram
, builds a
data frame with the yearly, monthly or daily proportion of mentions of a
term in one of the corpora.
For the Le Monde corpus, this function allows to compute the
number of newspaper articles that mention the keywords (parameter
n_of = "articles"
).
ex_occur <- gallicagram(
keyword = "président",
corpus = "lemonde",
from = 1960,
to = 1970,
resolution = "monthly"
)
Here is an example of the first rows of the output:
date | keyword | n_occur | n_total | prop_occur | year | month | corpus | resolution | n_of |
---|---|---|---|---|---|---|---|---|---|
1960-01-01 | président | 1338 | 872943 | 0.0015327 | 1960 | 1 | lemonde | monthly | 1-grams |
1960-02-01 | président | 1360 | 915672 | 0.0014852 | 1960 | 2 | lemonde | monthly | 1-grams |
1960-03-01 | président | 1461 | 928764 | 0.0015731 | 1960 | 3 | lemonde | monthly | 1-grams |
1960-04-01 | président | 1239 | 772707 | 0.0016035 | 1960 | 4 | lemonde | monthly | 1-grams |
1960-05-01 | président | 1355 | 835612 | 0.0016216 | 1960 | 5 | lemonde | monthly | 1-grams |
1960-06-01 | président | 1314 | 850245 | 0.0015454 | 1960 | 6 | lemonde | monthly | 1-grams |
Co-occurrences
gallicagram_cooccur
builds a data frame with the yearly,
monthly or daily proportion of mentions of close co-occurrences of two
keywords in one of the three main corpora (lemonde
,
livres
and presse
) between two specified
dates.
Close co-occurrences correspond to words that are less than 3 (4 in
the Le Monde corpus) words away in the initial text. For the
Le Monde corpus, this function allows to compute the number of
co-occurrences within an entire article (parameter
cooccur_level = "articles"
).
ex_cooccur <- gallicagram_cooccur(
keyword_1 = "président",
keyword_2 = "république",
corpus = "lemonde",
from = 1960,
to = 1970,
resolution = "monthly"
)
Here is an example of part of the output:
date | keyword_1 | keyword_2 | n_cooccur | n_total | prop_cooccur | year | month | corpus | resolution | cooccur_level |
---|---|---|---|---|---|---|---|---|---|---|
1960-01-01 | président | république | 232 | 575465 | 0.0004032 | 1960 | 1 | lemonde | monthly | 4-grams |
1960-02-01 | président | république | 295 | 606116 | 0.0004867 | 1960 | 2 | lemonde | monthly | 4-grams |
1960-03-01 | président | république | 287 | 616477 | 0.0004655 | 1960 | 3 | lemonde | monthly | 4-grams |
1960-04-01 | président | république | 212 | 513816 | 0.0004126 | 1960 | 4 | lemonde | monthly | 4-grams |
1960-05-01 | président | république | 182 | 552610 | 0.0003293 | 1960 | 5 | lemonde | monthly | 4-grams |
1960-06-01 | président | république | 196 | 563800 | 0.0003476 | 1960 | 6 | lemonde | monthly | 4-grams |
Associations
gallicagram_associated
builds a data frame with the
words most frequently used close to a given keyword over the period. For
the Le Monde corpus, this function allows to compute the number
of co-occurences within an entire article (parameter
distance = "articles"
).
ex_associated <- gallicagram_associated(
keyword = "camarade",
corpus = "lemonde",
from = 1960,
to = 1970,
n_results = 10,
distance = 3
)
The terms most often associated with “camarade” are:
associated_word | n_cooccur | keyword | corpus | from | to | cooccur_level |
---|---|---|---|---|---|---|
khrouchtchev | 76 | camarade | lemonde | 1960 | 1970 | 3-grams |
mao | 47 | camarade | lemonde | 1960 | 1970 | 3-grams |
toung | 40 | camarade | lemonde | 1960 | 1970 | 3-grams |
ancien | 38 | camarade | lemonde | 1960 | 1970 | 3-grams |
tse | 28 | camarade | lemonde | 1960 | 1970 | 3-grams |
dubcek | 22 | camarade | lemonde | 1960 | 1970 | 3-grams |
promotion | 18 | camarade | lemonde | 1960 | 1970 | 3-grams |
ami | 16 | camarade | lemonde | 1960 | 1970 | 3-grams |
compagnie | 16 | camarade | lemonde | 1960 | 1970 | 3-grams |
thorez | 16 | camarade | lemonde | 1960 | 1970 | 3-grams |
Graphs
One of the main usage of Gallicagram
is to plot time
series of occurrences in a corpus. The function
gallicagraph
enables to do that in only one additional line
of code. It take as input a dataframe produced by one of the functions
from the package and automatically produces the corresponding graph:
# my own theme for ggplot graphs
mediocrethemes::set_mediocre_all()
gallicagram(keyword = "obama", corpus = "lemonde", from = 2005) |>
gallicagraph()
Note that I use my own ggplot theme mediocrethemes
.
The use of this theme is of course not mandatory and one can remove the
line mediocrethemes::set_mediocre_all()
).
One can also use the same function to plot words associated with a given keyword:
gallicagram_associated(keyword = "obama", corpus = "lemonde", from = 2005) |>
gallicagraph()
More than one keyword
Occurrences of a lexicon
Each of the main function described above has a lexicon
counterpart, adding a suffix _lexicon
(gallicagram_lexicon
,
gallicagram_cooccur_lexicon
and
gallicagram_associated_lexicon
). Each function simply loop
the intial function over each word the lexicon(s) and sums the results.
It thus compute the sum of (co-)occurrences of each word in the
lexicon:
ex_lexicon <- gallicagram_lexicon(
lexicon = c("président", "présidente"),
corpus = "lemonde",
from = 1960,
to = 1970,
resolution = "monthly"
)
ex_cooccur_lexicon <- gallicagram_cooccur_lexicon(
lexicon_1 = c("président", "présidente"),
lexicon_2 = c("mauvais", "nul"),
corpus = "lemonde",
from = 1960,
to = 1970,
resolution = "monthly"
)
Here is an example of part of the output of
gallicagram_lexicon
:
date | keyword | n_occur | n_total | prop_occur | year | month | corpus | resolution | n_of | lexicon |
---|---|---|---|---|---|---|---|---|---|---|
1960-01-01 | président | 1341 | 872943 | 0.0015362 | 1960 | 1 | lemonde | monthly | 1-grams | président+présidente |
1960-02-01 | président | 1362 | 915672 | 0.0014874 | 1960 | 2 | lemonde | monthly | 1-grams | président+présidente |
1960-03-01 | président | 1465 | 928764 | 0.0015774 | 1960 | 3 | lemonde | monthly | 1-grams | président+présidente |
1960-04-01 | président | 1241 | 772707 | 0.0016060 | 1960 | 4 | lemonde | monthly | 1-grams | président+présidente |
1960-05-01 | président | 1358 | 835612 | 0.0016252 | 1960 | 5 | lemonde | monthly | 1-grams | président+présidente |
1960-06-01 | président | 1315 | 850245 | 0.0015466 | 1960 | 6 | lemonde | monthly | 1-grams | président+présidente |
These functions are particularly helpful to compute the number of occurrences of words from a whole lexical field (or the plural or feminine form of a keyword, etc).
Words with the same stem
The function get_same_stem()
enables to retrieve part of
these forms: those that share the same stem as the keyword of interest.
For instance, for “écologie”:
get_same_stem("écologie")
#> [1] "écologie" "écologique" "écologiquement" "écologiques"
#> [5] "écologiste" "écologiste" "écologistes" "écologistes"
This function is particularly useful when its output is passed to one
of the _lexicon
functions:
ex_stem_lexicon <- get_same_stem("écologie") |>
gallicagram_lexicon(
corpus = "lemonde",
from = 1960,
to = 1970,
resolution = "yearly"
)
Note that due to the number of requests sent to the API, it may take some time to run.
date | keyword | n_occur | n_total | prop_occur | year | corpus | resolution | n_of | lexicon |
---|---|---|---|---|---|---|---|---|---|
1960-01-01 | écologie | 4 | 10378608 | 4e-07 | 1960 | lemonde | yearly | 1-grams | écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes |
1961-01-01 | écologie | 1 | 10397003 | 1e-07 | 1961 | lemonde | yearly | 1-grams | écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes |
1962-01-01 | écologie | 1 | 11987113 | 1e-07 | 1962 | lemonde | yearly | 1-grams | écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes |
1963-01-01 | écologie | 0 | 12223185 | 0e+00 | 1963 | lemonde | yearly | 1-grams | écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes |
1964-01-01 | écologie | 12 | 12823655 | 9e-07 | 1964 | lemonde | yearly | 1-grams | écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes |
1965-01-01 | écologie | 6 | 12860422 | 5e-07 | 1965 | lemonde | yearly | 1-grams | écologie+écologique+écologiquement+écologiques+écologiste+écologiste+écologistes+écologistes |
Searching several keywords
One can also run the functions for several keywords but separately,
without suming the result. To do so, one can use the function
purrr::map_dfr
. It takes as parameters the vector of
keywords to search, followed by the function (gallicagram
)
and additional arguments to pass to the function. It returns a unique
data frame with the results for searches corresponding to each keyword,
basically binding the dataframes produced by each keyword search.
library(purrr)
keywords <- c("république", "france")
purrr::map_dfr(keywords, gallicagram,
corpus = "lemonde", from = "1960", to = "1970")
Based on that same principle, we can run several searches at once,
varying all parameters, not only the keyword searched. To do that, we
can specify the series of parameters in a data frame, each row
corresponding to a set of parameters to run the gallicagram
function for. We then pass this data frame to
purrr::pmap_dfr
.
To specify the set of parameters, we can either build the parameters
data frame by hand. It is also often helpful to use
tidyr::crossing
to create all combination of possible
searches.
params_pmap <-
tibble::tibble(
from = 1850,
to = 1870,
resolution = "yearly"
) |>
tidyr::crossing(corpus = c("press", "books")) |>
tidyr::crossing(keyword = c("république", "france"))
The corresponding set of parameters to search looks like this:
from | to | resolution | corpus | keyword |
---|---|---|---|---|
1850 | 1870 | yearly | books | france |
1850 | 1870 | yearly | books | république |
1850 | 1870 | yearly | press | france |
1850 | 1870 | yearly | press | république |
We can then pass it to purrr::pmap_dfr
that will call
the function gallicagram
for each of the 4 sets of
parameters defined in the rows of params_pmap
:
purrr::pmap_dfr(params_pmap, gallicagram)
This method also applies to the other functions in the
rallicagram
package.