Words most often associated with a vector of keywords in a Gallicagram corpus — gallicagram_associated_lexicon

Returns the words most frequently found within a given number of words (distance) of any of the keywords in the vector.

Usage

gallicagram_associated_lexicon(
  lexicon,
  corpus = "lemonde",
  from = "earliest",
  to = "latest",
  n_results = 20,
  distance = "max",
  stopwords = rallicagram::stopwords_gallica[1:500]
)

Arguments

lexicon

A character vector. Keywords to search.

corpus

A character string. The corpus to search. The list of available corpora can be found in the list_corpora dataset.

from

An integer or "earliest". The starting year. If set to "earliest", it uses the earliest date at which the data is reliable for this corpus, as described in list_corpora.

to

An integer or "latest". The end year. If set to "latest", it uses the latest date at which the data is reliable for this corpus, as described in list_corpora.

n_results

An integer. The number of most frequently associated words to return. n_results can also be set to "all" to return all the available results.

distance

An integer, "max" or "articles". The maximum distance, in number of words, at which to look for words associated with the keyword. The max length for each corpus (distance + number of words in the keyword) is described in the max_length column of the list_corpora dataset.

When set to "max", automatically considers the longest ngram possible for this corpus.

When set to "articles", looks for associated words within the whole article. Only available for the "lemonde" corpus and for unigrams (i.e., keywords made of a single word).
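The length constraint above (distance + number of words in the keyword must not exceed max_length) can be illustrated with a small base R sketch. The helper max_distance() and the max_length value of 4 are hypothetical, chosen only for illustration; the actual limits are those listed in list_corpora.

```r
# Hypothetical helper: largest distance allowed for a keyword,
# given the corpus max_length (distance + words in the keyword).
max_distance <- function(keyword, max_length) {
  # Count the words in the keyword by splitting on whitespace
  n_words <- length(strsplit(trimws(keyword), "\\s+")[[1]])
  max_length - n_words
}

# Assumed max_length of 4, for illustration only
d_unigram <- max_distance("ministre", max_length = 4)
d_bigram <- max_distance("premier ministre", max_length = 4)
```

Under this assumption, a two-word keyword leaves less room for the distance than a one-word keyword.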

stopwords

A character vector of stopwords to remove. The default is the vector of the 500 most frequent words in the Gallica books dataset. This number can be changed by passing, for instance, stopwords_gallica[1:300] (the 300 most frequent words) to the stopwords argument. Can also be lsa::stopwords_fr. If NULL, no stopwords are removed.
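To make the effect of this argument concrete, here is a minimal base R sketch on made-up data. It assumes that removing stopwords amounts to dropping the rows whose associated word appears in the vector; the helper drop_stopwords(), the toy counts and the column name n_cooccur are hypothetical and are not part of the package.

```r
# Hypothetical helper mimicking the stopwords argument:
# NULL keeps everything, otherwise matching rows are dropped.
drop_stopwords <- function(results, stopwords) {
  if (is.null(stopwords)) {
    return(results)
  }
  results[!results$associated_word %in% stopwords, , drop = FALSE]
}

# Made-up results, for illustration only
toy_results <- data.frame(
  associated_word = c("de", "ministre", "la", "palais"),
  n_cooccur = c(900, 120, 850, 80),
  stringsAsFactors = FALSE
)
toy_stopwords <- c("de", "la", "le")

filtered <- drop_stopwords(toy_results, toy_stopwords)
unfiltered <- drop_stopwords(toy_results, NULL)
```

With a longer stopword vector, more high-frequency function words are dropped before the top n_results are kept.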

Value

A tibble with the words most frequently associated with any of the keywords in the lexicon (associated_word), the first keyword in the lexicon (typically the main one), and the number of co-occurrences between the keywords and the associated word over the period (n_co-occur). It also returns the input parameters keyword, corpus, from and to.

Details

This function sums the outputs of the gallicagram_associated calls made for each keyword in the vector.
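The summing step can be pictured with the following base R sketch. It does not call the package: the two data frames stand in for per-keyword results, their counts are made up, and the column name n_cooccur is a hypothetical stand-in for the co-occurrence count. The actual implementation may differ.

```r
# Made-up per-keyword results, standing in for two
# gallicagram_associated outputs (illustration only)
res_keyword_1 <- data.frame(
  associated_word = c("ministre", "palais", "campagne"),
  n_cooccur = c(120, 80, 40),
  stringsAsFactors = FALSE
)
res_keyword_2 <- data.frame(
  associated_word = c("campagne", "scrutin", "ministre"),
  n_cooccur = c(90, 70, 15),
  stringsAsFactors = FALSE
)

# Stack the per-keyword results, then sum counts per associated word
combined <- rbind(res_keyword_1, res_keyword_2)
summed <- aggregate(n_cooccur ~ associated_word, data = combined, FUN = sum)

# Sort by decreasing count and keep the top n_results
n_results <- 3
summed <- summed[order(-summed$n_cooccur), ]
top_words <- head(summed, n_results)
```

A word associated with several keywords of the lexicon therefore accumulates the counts from each of them.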

It is typically used together with the function get_same_stem().

This function is only available for the three main corpora (historical press, Gallica books, Le Monde newspaper).

Examples

if (FALSE) {
  gallicagram_associated_lexicon(c("président", "présidentiel"))
}