Words most often associated with a vector of keywords in a Gallicagram corpus
gallicagram_associated_lexicon.Rd
Returns the words most frequently found within a given number of words (distance) of any of the keywords in the vector.
Usage
gallicagram_associated_lexicon(
  lexicon,
  corpus = "lemonde",
  from = "earliest",
  to = "latest",
  n_results = 20,
  distance = "max",
  stopwords = rallicagram::stopwords_gallica[1:500]
)
Arguments
- lexicon
A character vector. Keywords to search.
- corpus
A character string. The corpus to search. The list of available corpora can be found in the list_corpora dataset.
- from
An integer or "earliest". The starting year. If set to "earliest", it uses the earliest date at which the data is reliable for this corpus, as described in list_corpora.
- to
An integer or "latest". The end year. If set to "latest", it uses the latest date at which the data is reliable for this corpus, as described in list_corpora.
- n_results
An integer. The number of most frequently associated words to return. n_results can also be set to "all" to return all the available results.
- distance
An integer, "max" or "articles". The maximum distance, in number of words, at which to look for words associated with the keyword. The maximum length for each corpus (distance + number of words in the keyword) is given in the max_length column of the list_corpora dataset.
When set to "max", automatically considers the longest n-gram possible for this corpus.
When set to "articles", looks for associated words within the whole article. Only available for the "lemonde" corpus and for unigrams (i.e. keywords made of a single word).
- stopwords
A character vector of stopwords to remove. The default is the vector of the 500 most frequent words in the Gallica books dataset. This number can be changed by passing stopwords_gallica[1:300] (for instance, for the 300 most frequent words) to the stopwords argument. Can also be lsa::stopwords_fr. If NULL, does not remove any stopwords.
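As an illustration of how the arguments above fit together, here is a minimal sketch of a call with non-default values. The years, n_results and distance values are arbitrary choices made for this example and are assumed to be valid for the "lemonde" corpus (check list_corpora before relying on them). The run_query switch is a hypothetical helper for this sketch, not part of the package; it keeps the block from contacting the API unless explicitly enabled.

```r
# Arguments for the call. The specific values are illustrative assumptions.
args <- list(
  lexicon = c("président", "présidentiel"),
  corpus = "lemonde",
  from = 2000,       # assumed to lie within the reliable range for this corpus
  to = 2010,
  n_results = 10,
  distance = 2,      # assumed to fit within max_length for this corpus
  stopwords = NULL   # NULL: do not remove any stopwords
)

# Hypothetical switch: set to TRUE to query the API
# (requires the rallicagram package and network access).
run_query <- FALSE

if (run_query) {
  res <- do.call(rallicagram::gallicagram_associated_lexicon, args)
  print(res)
}
```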
Value
A tibble with the words most frequently associated with any of the keywords in the lexicon (associated_word), the first keyword in this lexicon, typically the main one, and the number of co-occurrences between the keywords and the associated word over the period (n_co-occur).
It also returns the input parameters keyword, corpus, from and to.
Details
This function sums the outputs of the calls to gallicagram_associated made for each keyword in the vector.
It can typically be used with the function get_same_stem().
This function is only available for the three main corpora (historical press, Gallica books, Le Monde newspaper).
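The summing step described above can be pictured with a small base R sketch. It does not reproduce the package's internal code: the two data frames stand in for per-keyword results, the words and counts are invented for illustration, and the column name n_cooccur is a simplified, hypothetical stand-in for the count column.

```r
# Toy per-keyword results (invented values, for illustration only).
res_keyword_1 <- data.frame(
  associated_word = c("ministre", "vote", "conseil"),
  n_cooccur = c(120, 80, 40),
  stringsAsFactors = FALSE
)
res_keyword_2 <- data.frame(
  associated_word = c("vote", "campagne"),
  n_cooccur = c(60, 30),
  stringsAsFactors = FALSE
)

# Stack the per-keyword tables, then sum the counts for each associated word.
stacked <- rbind(res_keyword_1, res_keyword_2)
summed <- aggregate(n_cooccur ~ associated_word, data = stacked, FUN = sum)

# Sort by decreasing count and keep the n_results most frequent words.
n_results <- 3
summed <- summed[order(-summed$n_cooccur), ]
top <- head(summed, n_results)
print(top)
```

A word associated with several keywords ("vote" here) ends up with the sum of its per-keyword counts, which is why it can outrank words that are frequent for only one keyword.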
Examples
if (FALSE) {
  # Words most often associated with either keyword, using the defaults
  gallicagram_associated_lexicon(c("président", "présidentiel"))
}