Words most often associated with a keyword in a Gallicagram corpus
gallicagram_associated.Rd
Returns the words that occur most frequently within a given number of words (distance) of the keyword.
Usage
gallicagram_associated(
  keyword,
  corpus = "lemonde",
  from = "earliest",
  to = "latest",
  n_results = 20,
  distance = "max",
  stopwords = rallicagram::stopwords_gallica[1:500],
  remove_numbers = TRUE
)
Arguments
- keyword
A character string. The keyword to search for. The string cannot contain more words than the max_length for this corpus, as indicated in the list_corpora dataset.
- corpus
A character string. The corpus to search. The list of available corpora can be found in the list_corpora dataset.
- from
An integer or "earliest". The starting year. If set to "earliest", it uses the earliest date at which the data is reliable for this corpus, as described in list_corpora.
- to
An integer or "latest". The end year. If set to "latest", it uses the latest date at which the data is reliable for this corpus, as described in list_corpora.
- n_results
An integer. The number of most frequently associated words to return. n_results can also be set to "all" to return all the available results.
- distance
An integer, "max" or "articles". The maximum distance, in number of words, at which to look for words associated with the keyword. The maximum length for each corpus (distance + number of words in the keyword) is given in the max_length column of the list_corpora dataset.
When set to "max", automatically considers the longest n-gram possible for this corpus.
When set to "articles", looks for associated words within the whole article. Only available for the "lemonde" corpus and for unigrams (i.e. keywords made of a single word).
- stopwords
A character vector of stopwords to remove. The default is the vector of the 500 most frequent words in the Gallica books dataset. This number can be changed by passing stopwords_gallica[1:300] (for instance, for the 300 most frequent words) to the stopwords argument. Can also be lsa::stopwords_fr. If NULL, does not remove any stopwords.
- remove_numbers
If TRUE, removes numbers from the list of associated words.
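The relationship between distance, the number of words in the keyword and max_length can be sketched in a few lines of base R. The helper max_distance() below is hypothetical (it is not part of rallicagram), and the max_length value of 3 is an assumption chosen for illustration, not a value read from list_corpora.

```r
# Hypothetical helper, not part of rallicagram: largest usable `distance`
# for a keyword, given a corpus max_length
# (max_length = distance + number of words in the keyword).
max_distance <- function(keyword, max_length) {
  # Count whitespace-separated words in the keyword
  n_words <- length(strsplit(trimws(keyword), "\\s+")[[1]])
  if (n_words > max_length) {
    stop("keyword has more words than max_length allows")
  }
  max_length - n_words
}

# max_length = 3 is an assumed value, for illustration only
max_distance("camarade", max_length = 3)
max_distance("son camarade", max_length = 3)
```

Under that assumption, a one-word keyword leaves room for a distance of 2 and a two-word keyword for a distance of 1.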
Value
A tibble containing the words most frequently associated with the keyword (associated_word), the number of co-occurrences between the keyword and each associated word (n_cooccur), and the level at which the co-occurrences are computed (n-grams or articles, reported in cooccur_level).
It also returns the input parameters keyword, corpus, from and to.
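Because the result is an ordinary tibble, it can be filtered and sorted like any data frame. The sketch below rebuilds the first three rows of the example output shown on this page by hand, so that it runs without calling the API; the threshold of 45 co-occurrences is an arbitrary choice for illustration.

```r
# First three rows of the example output for "camarade", rebuilt by hand
# as a plain data frame so the snippet runs without calling the API.
res <- data.frame(
  associated_word = c("khrouchtchev", "mao", "toung"),
  n_cooccur = c(76L, 47L, 40L),
  keyword = "camarade",
  corpus = "lemonde",
  from = 1960,
  to = 1970,
  cooccur_level = "3-grams",
  stringsAsFactors = FALSE
)

# Keep associated words with at least 45 co-occurrences (arbitrary cut-off),
# most frequent first
frequent <- res[res$n_cooccur >= 45, c("associated_word", "n_cooccur")]
frequent <- frequent[order(-frequent$n_cooccur), ]
frequent$associated_word
```

With these three rows, only "khrouchtchev" and "mao" pass the cut-off.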
Details
This function calls the 'associated' route of the API.
This function is only available for the three main corpora (historical press, Gallica books, Le Monde newspaper). Searching the "press" corpus can take a long time and may return an error. If it does, run the same function again; it will resume where it stopped.
Note that the API route does not allow searching for associated words after a punctuation mark. For instance, in "son camarade, le chasseur, a", the function will not count "chasseur" as a word associated with "camarade", because the n-gram "son camarade le chasseur" is not in the database. As a result, there may be fewer associated words at longer distances, since a longer distance increases the probability that an n-gram is excluded from the database.
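The effect of punctuation can be illustrated with a small base R sketch. It is an approximation only: the function ngrams_within_punctuation() is hypothetical, and the tokenisation rule (split on any punctuation mark, then on whitespace) is an assumption, not the procedure actually used to build the database.

```r
# Illustrative sketch only: the exact tokenisation rules of the real
# database are an assumption here.
ngrams_within_punctuation <- function(text, n) {
  # Split the text into segments at punctuation marks
  segments <- strsplit(tolower(text), "[[:punct:]]+")[[1]]
  out <- character(0)
  for (seg in segments) {
    words <- strsplit(trimws(seg), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) next
    # Slide a window of n words inside the segment only
    for (i in seq_len(length(words) - n + 1)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

ngrams_within_punctuation("son camarade, le chasseur, a", n = 2)
```

Here the only 2-grams produced are "son camarade" and "le chasseur"; no n-gram spans a comma, so "camarade" and "chasseur" never appear together.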
Apostrophes and the letters preceding them are removed from the dataset (except for n', since it carries meaning).
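To check how a keyword would look after this kind of cleaning, the rule can be approximated with a regular expression. The function strip_elisions() is hypothetical and only mimics the rule as described above; it is not the project's preprocessing code and handles only the straight apostrophe.

```r
# Illustrative approximation of the rule described above, not the
# project's actual preprocessing code.
strip_elisions <- function(words) {
  # Remove a run of letters followed by an apostrophe, unless it is "n'"
  gsub("\\b(?!n')[[:alpha:]]+'", "", words, perl = TRUE)
}

strip_elisions(c("l'ami", "d'un", "n'est", "camarade"))
```

In this sketch "l'ami" and "d'un" become "ami" and "un", while "n'est" and "camarade" are left unchanged.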
Examples
gallicagram_associated("camarade", from = 1960, to = 1970)
#> # A tibble: 20 × 7
#>    associated_word n_cooccur keyword  corpus   from    to cooccur_level
#>    <chr>               <int> <chr>    <chr>   <dbl> <dbl> <chr>
#>  1 khrouchtchev           76 camarade lemonde  1960  1970 3-grams
#>  2 mao                    47 camarade lemonde  1960  1970 3-grams
#>  3 toung                  40 camarade lemonde  1960  1970 3-grams
#>  4 ancien                 38 camarade lemonde  1960  1970 3-grams
#>  5 tse                    28 camarade lemonde  1960  1970 3-grams
#>  6 dubcek                 22 camarade lemonde  1960  1970 3-grams
#>  7 promotion              18 camarade lemonde  1960  1970 3-grams
#>  8 ami                    16 camarade lemonde  1960  1970 3-grams
#>  9 compagnie              16 camarade lemonde  1960  1970 3-grams
#> 10 thorez                 16 camarade lemonde  1960  1970 3-grams
#> 11 togliatti              16 camarade lemonde  1960  1970 3-grams
#> 12 direction              15 camarade lemonde  1960  1970 3-grams
#> 13 rochet                 14 camarade lemonde  1960  1970 3-grams
#> 14 club                   13 camarade lemonde  1960  1970 3-grams
#> 15 vieux                  13 camarade lemonde  1960  1970 3-grams
#> 16 waldeck                13 camarade lemonde  1960  1970 3-grams
#> 17 staline                12 camarade lemonde  1960  1970 3-grams
#> 18 tsé                    12 camarade lemonde  1960  1970 3-grams
#> 19 combat                 11 camarade lemonde  1960  1970 3-grams
#> 20 tué                    11 camarade lemonde  1960  1970 3-grams