Use in economics and pre-processing
Machine Learning and Big Data
2025-11-26
A plethora of text data
Nowadays, we have both the technology and the methods to study it at scale
Text data is unstructured:
Not organized in a “table-like” format
Info we want mixed with info we do not want
Need to throw away some info and select what to keep
We often want to relate text data to metadata
What type of text data? What source?
How would you get the data?
Which research question?
How would you go about studying this?
Exercise to complete by the end of next week
You can start today, even if we will not have covered everything
Available on Portail des études
Why is text data useful in economics?
What is a typical empirical workflow?
How to concretely implement these analyses in Python?
What are recent developments in the field?
Link this section to the rest of the class
More on the data analysis and big data side than on the ML one
Here, focus on pre-processing steps, ie preparing the data so it can be used in:
Can use the tools and algorithms you saw in the rest of the class on text data
spaCy, a Python library for NLP (see also the nltk library)
What?
Make pairwise document comparisons
Form clusters of related documents
Examples
How?
Represent documents in some sort of vector form and then compute cosine similarity
There are different ways of defining vectors
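A minimal sketch of this idea with scikit-learn, assuming tf-idf vectors (the corpus and the choice of vectorizer are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: each string is one document
docs = ["the central bank raised rates",
        "the central bank cut rates",
        "the football match ended in a draw"]

# Represent each document as a tf-idf vector
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities between documents (3 x 3 matrix)
print(cosine_similarity(X))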
What?
Example
How?
Dictionary methods / pattern matching
Topic models to identify latent concepts (see the sketch after this list)
Distance between documents and dictionaries
Supervised learning problem (trained on human-classified documents)
BERT/GPT type of models
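As a sketch of the topic-model option mentioned above, Latent Dirichlet Allocation can be fit on word counts with scikit-learn (toy corpus, two latent topics; all choices are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["taxes and public spending", "spending on public schools",
        "football match results", "the match ended in a draw"]

# Word counts, then a topic model with 2 latent topics
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per document
print(doc_topics.round(2))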
What?
How?
Co-occurrence of dictionaries
Word Embedding Association Test (WEAT):
Examples
Impute an outcome of interest to other documents
Supervised learning
eg have some data on party label but not for everyone
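A minimal sketch of this imputation idea (toy data; the vectorizer and classifier are illustrative choices):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["cut taxes and deregulate", "expand public healthcare"]
labels = ["right", "left"]                    # known party labels
unlabeled_texts = ["lower corporate taxes"]   # label to impute

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(labeled_texts)
clf = LogisticRegression().fit(X_train, labels)

# Predict (impute) the missing labels from the text
print(clf.predict(vectorizer.transform(unlabeled_texts)))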
Example
Text data consists of sequences of characters called documents
A corpus is a collection of documents
The unit of analysis (the “document”) depends on the question:
Fine enough to fit relevant metadata variation
Not unnecessarily fine to reduce dimensionality
Text data is very high-dimensional
Need to transform the data into a useful representation for modeling
Encode text data into numeric arrays:
Tokenization and word counts
Document-Term Matrix
Latent Semantic Analysis
Word embeddings
Both for computational reasons and to have objects we can manipulate and do computations on
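For instance, a document-term matrix of word counts can be built with scikit-learn (illustrative two-document corpus):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Ne fumez pas la moquette", "Ne fumez pas ici"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)       # sparse documents x terms matrix

print(vectorizer.get_feature_names_out())  # the vocabulary (columns)
print(dtm.toarray())                       # word counts per document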
Gentzkow, Kelly, and Taddy (2019) summarize a text analysis in three steps:
Represent raw text as a numerical array
Map to predicted values of unknown outcomes
Use in subsequent descriptive or causal analysis
Dictionary Methods: rely on pre-defined lexicons and info associated with specific keywords
Rule-Based Methods: use predefined rules and patterns (eg RegEx; see the sketch after this list)
Machine Learning Methods: methods leveraging lightweight ML models
Deep Learning Methods: methods leveraging neural networks
There is an important interpretability-flexibility trade-off
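As an illustration of the rule-based approach, a pattern match with Python's re module (the text and the pattern are toy examples):

import re

text = "Inflation reached 5.2% in 2023, up from 1.6% in 2021."

# Toy rule: extract every percentage figure mentioned in the text
print(re.findall(r"\d+(?:\.\d+)?%", text))  # ['5.2%', '1.6%']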
Political economy: news articles, parliamentary debates, speeches, social network posts, party manifestos, press releases, etc
Economic history: correspondence, institutional documents, books, etc
Labor and Industrial Organisation (IO): job ads, product descriptions, etc
Finance and macro: earnings conference calls, central bank speeches, etc
Digital archives of news (Factiva, ProQuest, Europress, etc)
Company APIs (eg Twitter, New York Times)
Data sets of occurrence of keywords (Google Trends, Google Ngram, Gallicagram)
Online scraping
Printed texts: digitize and OCR
The overall idea is to write algorithms that:
Browse a website and download relevant html pages
Transform the html pages into a data frame containing text data
Not limited to text data: you can retrieve anything that is online
HTML: HyperText Markup Language
Structure of an example html page
There are libraries for web scraping with useful documentation:
Basic pages are static and easy to scrape
Dynamic pages are more complex to scrape
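A minimal static-page sketch with requests and BeautifulSoup (the URL and the tags to extract are placeholders):

import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with a page you are allowed to scrape
url = "https://example.com"
html = requests.get(url).text

# Parse the html and keep the text of every paragraph tag
soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)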
Tips
Before scraping, check whether someone has already scraped what you want (eg for IMDb)
To identify relevant pages use the site sitemap (website.com/sitemap.xml)
Use ChatGPT or Claude for scraping
Use SelectorGadget or tools from the developer panel of your browser
Save html pages before processing them
Read the robots.txt page to learn what you can and cannot do
Going from raw text to something usable in analysis
Reduce the dimensionality by reducing the size of the vocabulary
Identify some words as the same
Some information is useful, some less so:
What to keep?
What format?
Token: a meaningful unit of text, such as a word or a character
In NLP, we often split the data into tokens
Example from the French parliament: “Ne fumez pas la moquette” (literally, “Don't smoke the carpet”)
Words: {Ne, fumez, pas, la, moquette}
n-grams (here 2-grams): {Ne fumez, fumez pas, pas la, la moquette}
Sentences
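The same example with nltk (one tokenizer option among others; the name of the data package to download may differ across nltk versions):

import nltk
nltk.download("punkt")  # tokenizer data, one-time download

sentence = "Ne fumez pas la moquette"
tokens = nltk.word_tokenize(sentence, language="french")

print(tokens)                        # word tokens
print(list(nltk.ngrams(tokens, 2)))  # 2-grams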
Not all data is useful
Sometimes, this may remove useful information: eg “happy” vs “not happy”
Remove capitalization?
Punctuation?
Numbers?
Stopwords?
Words that carry little information
How do you define this set?
Often uninformative
But sometimes important:
Sentence splitting
Part-of-speech tagging
Named entity recognition
Text generation
Option: keep capital letters that are not at the beginning of a sentence
We have seen that this might create noise but may also carry meaning
How to define stopwords?
Can use existing lists
{'ayantes', 'ils', 'sera', 'eut', 'eussiez', 'est', 'sont', 'ta', 'moi', 'auront', 'eusse', 'la', 'ton', 'eûmes', 'avons', 'êtes', 'eus', 'eussent', 'n', 'te', 'il', 'ses', 'eurent', 'avait', 'étées', 'fûtes', 'son', 'étés', 'vos', 'étant', 'était', 'aurais', 'tes', 'étants', 'd', 'elle', 'y', 'ma', 'eût', 'aurons', 'auras', 'seraient', 'suis', 'eussions', 'pas', 'ayante', 'ont', 'auriez', 'ce', 'qu', 'fût', 'sa', 'aie', 'avais', 'ait', 'les', 'fussiez', 'l', 'j', 'serez', 'es', 'aies', 'sois', 'sur', 'sommes', 'de', 'soit', 'eusses', 'fus', 'avions', 'tu', 'eue', 'même', 'toi', 'un', 'aviez', 's', 'nous', 'ayant', 'eux', 'avec', 'mon', 'ayants', 'ai', 'me', 't', 'serai', 'aurai', 'eûtes', 'et', 'on', 'avaient', 'fut', 'dans', 'ayez', 'à', 'mes', 'notre', 'qui', 'ou', 'vous', 'aurez', 'mais', 'étais', 'fussions', 'fûmes', 'nos', 'étions', 'furent', 'seriez', 'serais', 'serons', 'je', 'par', 'ayons', 'étante', 'soient', 'avez', 'du', 'seras', 'aurait', 'ne', 'étiez', 'eues', 'que', 'soyez', 'eu', 'seront', 'fussent', 'étaient', 'aient', 'aura', 'votre', 'soyons', 'auraient', 'au', 'aux', 'une', 'ces', 'étée', 'fusse', 'leur', 'm', 'as', 'en', 'c', 'fusses', 'aurions', 'serait', 'des', 'étantes', 'pour', 'lui', 'serions', 'se', 'été', 'le'}
['fumez', 'moquette']
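A list like the one above can be loaded from nltk; a minimal sketch of the filtering step, assuming lower-cased tokens from the earlier example:

import nltk
nltk.download("stopwords")  # one-time download

stopwords_fr = set(nltk.corpus.stopwords.words("french"))
tokens = ["ne", "fumez", "pas", "la", "moquette"]

# Keep only tokens that are not stopwords
print([t for t in tokens if t not in stopwords_fr])  # ['fumez', 'moquette']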
Build one
By hand, starting from existing ones
Remove words that appear the most frequently, eg by Inverse Document Frequency (idf):
idf decreases the weight of commonly used words and increases that of words that are rarely used in the corpus
tf-idf: multiply a term's frequency (tf) by its idf
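In its textbook form, idf(t) = log(N / number of documents containing t), with N the number of documents in the corpus. A minimal sketch with scikit-learn, which uses a smoothed variant of this formula (illustrative corpus):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["ne fumez pas la moquette", "ne fumez pas ici", "la moquette est rouge"]

vectorizer = TfidfVectorizer()  # smoothed idf by default
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary
print(vectorizer.idf_.round(2))            # low idf = frequent, less informative word
print(X.toarray().round(2))                # tf-idf weight of each term in each document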
Find the stem of the word:
Rule-based
Produces stems that are not actual words
import nltk

tokens = ["Ne", "fumez", "pas", "la", "moquette"]  # tokens from the example sentence
stemmer = nltk.stem.snowball.SnowballStemmer(language="french")
tokens_stemmed = [stemmer.stem(word) for word in tokens]
print(tokens_stemmed)  # ['ne', 'fum', 'pas', 'la', 'moquet']
Lemmatization:
Similar but with semantic rules
Produces actual words, but sentences that no longer read as natural language
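A sketch with spaCy, assuming the French model fr_core_news_sm has been downloaded (exact lemmas depend on the model version):

import spacy

# One-time model download: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Ne fumez pas la moquette")
print([token.lemma_ for token in doc])  # eg "fumez" -> "fumer"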