Words most often associated with a keyword in a Gallicagram corpus

Returns the words that occur most frequently within a given number of words (distance) of the keyword.

Usage

gallicagram_associated(
  keyword,
  corpus = "lemonde",
  from = "earliest",
  to = "latest",
  n_results = 20,
  distance = "max",
  stopwords = rallicagram::stopwords_gallica[1:500],
  remove_numbers = TRUE
)

Arguments

keyword

A character string. Keyword to search. The string cannot contain more words than the max_length for this corpus, as indicated in the list_corpora dataset.

corpus

A character string. The corpus to search. The list of available corpora can be found in the list_corpora dataset.

from

An integer or "earliest". The starting year. If set to "earliest", it uses the earliest date at which the data is reliable for this corpus, as described in list_corpora.

to

An integer or "latest". The end year. If set to "latest", it uses the latest date at which the data is reliable for this corpus, as described in list_corpora.

n_results

An integer. The number of most frequently associated words to return. n_results can also be set to "all" to return all the available results.

distance

An integer, "max" or "articles". The maximum distance, in number of words, at which to look for words associated with the keyword. The max length for each corpus (distance + number of words in the keyword) is described in the max_length column of the list_corpora dataset.

When set to "max", automatically considers the longest ngram possible for this corpus.

When set to "articles", looks for associated words within the whole article. Only available for the "lemonde" corpus and for unigrams (i.e. keywords made of a single word).

stopwords

A character vector of stopwords to remove. The default is the vector of the 500 most frequent words in the Gallica books dataset. To change this number, pass for instance stopwords_gallica[1:300] (the 300 most frequent words) to the stopwords argument. lsa::stopwords_fr can also be used. If NULL, no stopwords are removed.

remove_numbers

A logical. If TRUE, removes numbers from the list of associated words.
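
The relation between distance, the keyword and max_length described for the distance argument can be sketched in base R. The helper max_distance() below is hypothetical (it is not part of rallicagram), and max_length = 3 is an illustrative value, not one read from list_corpora. Stopping when the keyword already fills the whole n-gram is a choice made for this sketch.

```r
# Hypothetical helper: largest usable distance for a keyword, given the
# corpus' max_length (distance + number of words in the keyword).
max_distance <- function(keyword, max_length) {
  # Count whitespace-separated words in the keyword
  n_words <- length(strsplit(trimws(keyword), "\\s+")[[1]])
  if (n_words >= max_length) {
    stop("The keyword leaves no room for an associated word.")
  }
  max_length - n_words
}

# max_length = 3 is an assumed value used for illustration only
max_distance("camarade", max_length = 3)
max_distance("parti communiste", max_length = 3)
```

Under this assumption, a longer keyword leaves a shorter window in which associated words can be found.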

Value

A tibble containing the words most frequently associated with the keyword (associated_word), the number of co-occurrences between the keyword and the associated word (n_cooccur), and the level at which the co-occurrences are computed (n-grams or articles, reported in cooccur_level). It also returns the input parameters keyword, corpus, from and to.

Details

This function calls the 'associated' route of the API.

This function is only available for the three main corpora (historical press, Gallica books, Le Monde newspaper). Searching the "press" corpus can take a long time and may return an error. If it does, run the same function again; it will resume where it stopped.
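
Re-running the call after an error can be automated with a small wrapper. This is a minimal sketch: retry_call() is a hypothetical helper, not a function of rallicagram, and the number of attempts and the waiting time are arbitrary choices.

```r
# Hypothetical helper: call `fun` again on error, up to `max_tries` times.
retry_call <- function(fun, max_tries = 5, wait = 0) {
  for (i in seq_len(max_tries)) {
    # Capture the error instead of stopping, so the call can be repeated
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result))
    Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed.")
}

# Usage (not run here, since it queries the API):
# res <- retry_call(
#   function() gallicagram_associated("camarade", corpus = "press"),
#   wait = 5
# )
```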

Note that the API route does not allow searching for associated words across a punctuation mark. For instance, in "son camarade, le chasseur, a", the function will not count "chasseur" as a word associated with "camarade", because the ngram "son camarade le chasseur" is not in the database. Thus, there might be fewer associated words at longer distances, since a longer ngram is more likely to contain a punctuation mark and be excluded from the database.

Apostrophes and the letters preceding them are removed from the dataset (except for n', since it carries meaning).
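
The effect of punctuation can be illustrated with a few lines of base R. This is only a sketch of the idea, assuming that n-grams are built inside punctuation-delimited segments; it is not how the Gallicagram database is actually constructed, and ngrams_within_punctuation() is a hypothetical helper.

```r
# Illustration only: build n-grams that never span a punctuation mark.
ngrams_within_punctuation <- function(text, n) {
  # Split the text into segments at punctuation marks
  segments <- strsplit(text, "[[:punct:]]+")[[1]]
  out <- character(0)
  for (segment in segments) {
    words <- strsplit(trimws(segment), "\\s+")[[1]]
    words <- words[nzchar(words)]
    # A segment shorter than n words yields no n-gram
    if (length(words) < n) next
    for (i in seq_len(length(words) - n + 1)) {
      out <- c(out, paste(words[i:(i + n - 1)], collapse = " "))
    }
  }
  out
}

text <- "son camarade, le chasseur, a"
ngrams_within_punctuation(text, 2)  # "son camarade" "le chasseur"
ngrams_within_punctuation(text, 4)  # character(0)
```

No 4-gram survives in this sentence, so "chasseur" can never be counted as associated with "camarade", whatever the distance.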

Examples

gallicagram_associated("camarade", from = 1960, to = 1970)
#> # A tibble: 20 × 7
#>    associated_word n_cooccur keyword  corpus   from    to cooccur_level
#>    <chr>               <int> <chr>    <chr>   <dbl> <dbl> <chr>        
#>  1 khrouchtchev           76 camarade lemonde  1960  1970 3-grams      
#>  2 mao                    47 camarade lemonde  1960  1970 3-grams      
#>  3 toung                  40 camarade lemonde  1960  1970 3-grams      
#>  4 ancien                 38 camarade lemonde  1960  1970 3-grams      
#>  5 tse                    28 camarade lemonde  1960  1970 3-grams      
#>  6 dubcek                 22 camarade lemonde  1960  1970 3-grams      
#>  7 promotion              18 camarade lemonde  1960  1970 3-grams      
#>  8 ami                    16 camarade lemonde  1960  1970 3-grams      
#>  9 compagnie              16 camarade lemonde  1960  1970 3-grams      
#> 10 thorez                 16 camarade lemonde  1960  1970 3-grams      
#> 11 togliatti              16 camarade lemonde  1960  1970 3-grams      
#> 12 direction              15 camarade lemonde  1960  1970 3-grams      
#> 13 rochet                 14 camarade lemonde  1960  1970 3-grams      
#> 14 club                   13 camarade lemonde  1960  1970 3-grams      
#> 15 vieux                  13 camarade lemonde  1960  1970 3-grams      
#> 16 waldeck                13 camarade lemonde  1960  1970 3-grams      
#> 17 staline                12 camarade lemonde  1960  1970 3-grams      
#> 18 tsé                    12 camarade lemonde  1960  1970 3-grams      
#> 19 combat                 11 camarade lemonde  1960  1970 3-grams      
#> 20 tué                    11 camarade lemonde  1960  1970 3-grams