Text As Data - Lecture 1

Use in economics and pre-processing

Machine Learning and Big Data

Vincent Bagilet

2025-11-26

Introduction

Text as Data

  • A plethora of text data

  • Nowadays, we have both the technology and the methods to study it at scale

  • Text data is unstructured:

    • Not organized in a “table-like” format

    • Info we want mixed with info we do not want

    • Need to throw away some info ⇒ select what to keep

  • Text data is very high-dimensional $\left( n_{\text{vocab}}^{n_{\text{words in doc}}} \right)$
  • We often want to relate text data to metadata

    • eg who, when, on what topic?

Think About Your Own Question

  • What type of text data? What source?

  • How would you get the data?

  • Which research question?

  • How would you go about studying this?

Outline and Resources

Housekeeping




Assignment

  • Exercise due by the end of next week

  • You can start today, even if we will not have covered everything

  • Available on Portail des études


Material

Objectives

  • Why is text data useful in economics?

  • What is a typical empirical workflow?

  • How to concretely implement these analyses in Python?

  • What are recent developments in the field?

  • Link this section to the rest of the class

Relation to the rest of the class

  • More on the data analysis and big data side than on the ML one

  • Here, focus on pre-processing steps, ie preparing the data for use in:

    • A ML algorithm
    • An econometric analysis
  • Can use the tools and algorithms you saw in the rest of the class on text data

    • Examples?

Outline

  1. Introduction
  2. Applications in economics
  3. Workflow for analysis
  4. Pre-processing
  5. Representation
  6. Dictionary based methods
  7. Machine learning methods
  8. Introduction to deep learning methods
  9. Validation
  10. Use in econometric analyses

Economics Resources

Elliott Ash’s NLP class

Python Resources

R Resources

Applications in Economics

Measuring Document Similarity

  • What?

    • Make pairwise document comparisons

    • Form clusters of related documents

  • Examples

    • Cagé, Hervé, and Viaud (2020) use the distance between online news articles and social media posts to group items into common stories
    • Biasi and Ma (2022) build a measure of syllabi’s distance from frontier knowledge (academic articles) and then relate this metric to socio-economic variables
  • How?

    • Represent documents in some sort of vector form and then compute cosine similarity

    • There are different ways of defining vectors
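
A minimal sketch of this approach, assuming scikit-learn is available and using a made-up toy corpus (not data from the papers above): represent each document as a tf-idf vector, then compute pairwise cosine similarities.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of three short "documents"
docs = [
    "central bank raises interest rates",
    "the central bank increases rates again",
    "the football team wins the championship",
]

# One tf-idf vector per document
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# 3 x 3 matrix of pairwise cosine similarities
print(cosine_similarity(X).round(2))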

Concept Detection

  • What?

    • Detect the presence of a concept
  • Example

    • Djourelova, Durante, and Martin (2024) identify interpretable news topics using CorEx
  • How?

    • Dictionary methods / pattern matching

    • Topic models to identify latent concepts

    • Distance between documents and dictionaries

    • Supervised (human classification) learning problem

    • BERT/GPT type of models
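
As a minimal illustration of the dictionary / pattern-matching route (the lexicon and sentences below are invented, not taken from Djourelova, Durante, and Martin (2024)), one can flag documents that match any keyword from a small list:

import re

# Hypothetical mini-dictionary for a "cost of living" concept
lexicon = ["inflation", "price increase", "cost of living"]
pattern = re.compile("|".join(re.escape(term) for term in lexicon), re.IGNORECASE)

docs = [
    "The cost of living keeps rising in large cities.",
    "The match ended in a draw.",
]

# 1 if the document mentions the concept, 0 otherwise
flags = [int(bool(pattern.search(doc))) for doc in docs]
print(flags)

[1, 0]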

Relation Between Concepts

  • What?

    • How are concepts related?
  • How?

    • Co-occurrence of dictionaries

    • Word Embedding Association Test (WEAT):

      • Pick two terms at each end of a spectrum (eg rich and poor) and compute the cosine similarity of each term of interest with these “extreme” terms
  • Examples

    • Ash, Chen, and Ornaghi (2024) on gender attitudes of individual US judges

    • Kozlowski, Taddy, and Evans (2019) locate words on meaningful dimensions, eg, rich/poor (this has a feel of Bourdieu’s social space)
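
A minimal WEAT-style sketch with numpy; the 3-dimensional vectors below are made up for illustration (in practice they would come from trained word embeddings):

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings for two "pole" words and two terms of interest
vectors = {
    "rich":      np.array([0.9, 0.1, 0.0]),
    "poor":      np.array([-0.8, 0.2, 0.1]),
    "education": np.array([0.5, 0.4, 0.2]),
    "precarity": np.array([-0.6, 0.3, 0.3]),
}

# Positive score: closer to "rich"; negative score: closer to "poor"
for term in ["education", "precarity"]:
    score = cosine(vectors[term], vectors["rich"]) - cosine(vectors[term], vectors["poor"])
    print(term, round(score, 2))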

Associating Text with Metadata

  • Impute an outcome of interest to other documents

  • Supervised learning

  • eg have party labels for some documents but not all

  • Example

    • Gentzkow and Shapiro (2010) build a model for party label using US Congressional speeches. Then use it to predict political bias of newspapers
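
A toy sketch of this logic with scikit-learn (the speeches and labels are invented, and this is not Gentzkow and Shapiro’s actual procedure): fit a classifier on labeled documents, then impute the label for an unlabeled one.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labeled speeches
speeches = [
    "we must cut taxes and reduce regulation",
    "free markets create jobs and growth",
    "we need to expand public healthcare",
    "raise the minimum wage to protect workers",
]
labels = ["R", "R", "D", "D"]

# Simple text classifier: tf-idf features + logistic regression
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(speeches, labels)

# Predict the label of an unlabeled document
print(model.predict(["lower taxes encourage investment"]))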

Workflow for Analysis

How are text data structured?

  • Text data comes as sequences of characters called documents

  • A corpus is a collection of documents

  • The unit of analysis (the “document”) depends on the question:

    • Fine enough to fit relevant metadata variation

    • Not unnecessarily fine, to limit dimensionality

From text to a usable format

  • Text data is very high-dimensional

  • Need to transform the data into a useful representation for modeling

  • Encode text data into numeric arrays:

    • Tokenization and word counts

    • Document-Term Matrix

    • Latent Semantic Analysis

    • Word embeddings

  • Both for computational reasons and to have objects we can manipulate and do computations on
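
For instance, a Document-Term Matrix can be built with scikit-learn’s CountVectorizer (a minimal sketch on a toy corpus):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "ne fumez pas la moquette",
    "la moquette est rouge",
]

# Rows = documents, columns = vocabulary terms, entries = word counts
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())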

Steps of Text Analysis

  • Gentzkow, Kelly, and Taddy (2019) summarize a text analysis in three steps:

    1. Represent raw text $D$ as a numerical array $C$

    2. Map $C$ to predicted values $\hat{V}$ of unknown outcomes $V$

    3. Use $\hat{V}$ in subsequent descriptive or causal analysis

Workflow, rephrased

  1. Collecting data
  2. Pre-processing (stemming, lemmatization, etc)
  3. Data transformation (from pre-processed text to numeric arrays)
  4. Analysis and modeling
  5. Validation
  6. Use in econometric analysis

Type of Analyses


  • Dictionary Methods: rely on pre-defined lexicons and info associated with specific keywords

  • Rule-Based Methods: use predefined rules and patterns (eg RegEx)

  • Machine Learning Methods: methods leveraging lightweight ML models

    • Supervised ML
    • Unsupervised ML
  • Deep Learning Methods: methods leveraging neural networks


There is an important interpretability-flexibility trade-off

Gathering data

Common Text Data Sources

  • Political economy: news articles, parliamentary debates, speeches, social network posts, party manifestos, press releases, etc

  • Economic history: correspondence, institutional documents, books, etc

  • Labor and Industrial Organisation (IO): job ads, product descriptions, etc

  • Finance and macro: earnings conference calls, central bank speeches, etc

Where to find data

Overview of web scraping

  • The overall idea is to write algorithms that:

    1. Browse a website and download relevant html pages

      • Main challenge: identify the relevant pages
    2. Transform the html pages into a data frame containing text data

      • Typical structure: date column, title column, author column, text column, etc
      • Main challenge: identify the right tags
  • Not limited to text data: you can retrieve anything that is online

HTML pages




  • HTML: HyperText Markup Language

  • Structure of an example html page

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
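
Once such a page has been downloaded, it can be parsed by tag or attribute; a minimal sketch assuming the BeautifulSoup library (here applied to the example page above, stored as a string):

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
</body>
</html>
"""

# Extract elements by tag name or by id
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())
print(soup.find("h1", id="first").get_text())
print(soup.find("p").get_text())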

Webscraping concretely

  • There are libraries for web scraping with useful documentation:

  • Basic pages are static and easy to scrape

  • More complex to scrape dynamic pages

    • Use Selenium
    • It opens a browser and can be used to automate tasks
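
A minimal Selenium sketch (assumes a recent Selenium release and an installed Chrome browser; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Opens a real browser window controlled from Python
driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Once the dynamic content has loaded, locate elements or grab driver.page_source
print(driver.find_element(By.TAG_NAME, "h1").text)

driver.quit()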

A few webscraping tips


Tips

  • Before scraping, check whether someone has already scraped what you want (eg for IMDb)

  • To identify relevant pages, use the site’s sitemap (website.com/sitemap.xml)

  • Use ChatGPT or Claude for scraping

  • Use SelectorGadget or tools from the developer panel of your browser

  • Save html pages before processing them

  • Read the robots.txt page to learn what you can and cannot do

Optical Character Recognition (OCR)

  • Some data is either not digitized, or digitized only as images or scanned PDFs

  • Use software to transform it to text data

  • Tesseract and its Python and R wrappers

  • When PDFs are digitally made, you can use text-extraction software to retrieve the embedded text directly
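
A minimal OCR sketch assuming pytesseract and Pillow are installed (the Tesseract binary and its French language data are also required; "scan.png" is a placeholder file name):

import pytesseract
from PIL import Image

# OCR a scanned page; lang="fra" uses the French language model
text = pytesseract.image_to_string(Image.open("scan.png"), lang="fra")
print(text)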

Pre-processing

Goal

  • Going from raw text to something usable in analysis

  • Reduce the dimensionality by reducing the size of the vocabulary

  • Identify some words as the same

  • Some information is useful, other less:

    • What to keep?

    • What format?

Tokens

  • Tokens: meaningful units of text, such as words or characters

  • In NLP, we often split data into tokens

  • Example from the French parliament: “Ne fumez pas la moquette”

  • Words: {Ne, fumez, pas, la, moquette}

  • n-grams (here 2-grams): {Ne fumez, fumez pas, pas la, la moquette}

  • Sentences

import nltk

# Run only once
nltk.download('punkt') 
nltk.download('punkt_tab')

sentence = "Ne fumez pas la moquette"
tokens = nltk.word_tokenize(sentence)

tokens


['Ne', 'fumez', 'pas', 'la', 'moquette']
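
The 2-grams shown above can be built from these tokens, eg with nltk:

from nltk.util import ngrams

# Bigrams from the word tokens
bigrams = list(ngrams(tokens, 2))
print(bigrams)

[('Ne', 'fumez'), ('fumez', 'pas'), ('pas', 'la'), ('la', 'moquette')]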

Pre-processing Choices

  • Pre-processing choices often affect the results

What To Keep?

  • Not all data is useful

  • But sometimes we may remove useful information: eg “happy” vs “not happy”

  • Remove capitalization?

  • Punctuation?

  • Numbers?

  • Stopwords?

    • Words that carry little information

    • How do you define this set?

Capitalization and punctuation


  • Often uninformative

  • But sometimes important:

    • Sentence splitting

    • Part-of-speech tagging

    • Named entity recognition

    • Text generation

  • Option: keep capitalization except at the beginning of sentences


tokens_lower = [word.lower() for word in tokens]
print(tokens_lower)
['ne', 'fumez', 'pas', 'la', 'moquette']

Stop words

  • We have seen that they might create noise but may also carry meaning

  • How to define stopwords?

  • Can use existing lists

nltk.download('stopwords')  # run only once
stopwords = set(nltk.corpus.stopwords.words('french'))

print(stopwords)
{'ayantes', 'ils', 'sera', 'eut', 'eussiez', 'est', 'sont', 'ta', 'moi', 'auront', 'eusse', 'la', 'ton', 'eûmes', 'avons', 'êtes', 'eus', 'eussent', 'n', 'te', 'il', 'ses', 'eurent', 'avait', 'étées', 'fûtes', 'son', 'étés', 'vos', 'étant', 'était', 'aurais', 'tes', 'étants', 'd', 'elle', 'y', 'ma', 'eût', 'aurons', 'auras', 'seraient', 'suis', 'eussions', 'pas', 'ayante', 'ont', 'auriez', 'ce', 'qu', 'fût', 'sa', 'aie', 'avais', 'ait', 'les', 'fussiez', 'l', 'j', 'serez', 'es', 'aies', 'sois', 'sur', 'sommes', 'de', 'soit', 'eusses', 'fus', 'avions', 'tu', 'eue', 'même', 'toi', 'un', 'aviez', 's', 'nous', 'ayant', 'eux', 'avec', 'mon', 'ayants', 'ai', 'me', 't', 'serai', 'aurai', 'eûtes', 'et', 'on', 'avaient', 'fut', 'dans', 'ayez', 'à', 'mes', 'notre', 'qui', 'ou', 'vous', 'aurez', 'mais', 'étais', 'fussions', 'fûmes', 'nos', 'étions', 'furent', 'seriez', 'serais', 'serons', 'je', 'par', 'ayons', 'étante', 'soient', 'avez', 'du', 'seras', 'aurait', 'ne', 'étiez', 'eues', 'que', 'soyez', 'eu', 'seront', 'fussent', 'étaient', 'aient', 'aura', 'votre', 'soyons', 'auraient', 'au', 'aux', 'une', 'ces', 'étée', 'fusse', 'leur', 'm', 'as', 'en', 'c', 'fusses', 'aurions', 'serait', 'des', 'étantes', 'pour', 'lui', 'serions', 'se', 'été', 'le'}
tokens_filtered = [word for word in tokens_lower if word not in stopwords]
print(tokens_filtered)
['fumez', 'moquette']
  • Which list? How to define it?
  • Build one

    • By hand, starting from existing ones

    • Remove words that appear the most frequently, eg by Inverse Document Frequency (idf):

$\text{idf}(\text{term}) = \ln \left(\dfrac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)$

  • idf decreases the weight of commonly used words and increases that of words that are rarely used in the corpus

  • tf-idf: the product of a term’s frequency (tf) and its idf
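
A minimal tf-idf sketch with scikit-learn on a toy corpus (note that scikit-learn’s default idf adds smoothing terms, so the weights differ slightly from the formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "ne fumez pas la moquette",
    "la moquette est rouge",
    "ne fumez pas ici",
]

# Common terms get low idf weights, rare terms high ones
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))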

Stemming/Lemmatization

  • Find the stem of the word:

    • Rule-based

    • Produces stems that are not actual words

stemmer = nltk.stem.snowball.SnowballStemmer(language="french")
tokens_stemmed = [stemmer.stem(word) for word in tokens]
print(tokens_stemmed)
['ne', 'fum', 'pas', 'la', 'moquet']
  • Lemmatization:

    • Similar but with semantic rules

    • Produces actual words but sentences that do not mean anything

import spacy
# Requires the model: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")
doc = nlp(sentence)
lemmas = [token.lemma_ for token in doc]

print(lemmas)
['ne', 'fumer', 'pas', 'le', 'moquette']

Summary

NLP in Economics

  • Measuring document similarity
  • Concept detection
  • Relation between concepts
  • Associating text with metadata

Overall Approach

  • Get text data (ready-made, web scraping, OCR, etc)
  • Pre-process the data
  • Transform data into a useful format (a numeric array)
  • Run analysis. Several methods:
    • Dictionary-based
    • Rule-based
    • Machine Learning
    • Deep Learning
  • Use output in an econometric analysis

References

Ash, Elliott, Daniel L. Chen, and Arianna Ornaghi. 2024. “Gender Attitudes in the Judiciary: Evidence from US Circuit Courts.” American Economic Journal: Applied Economics 16 (1): 314–50. https://doi.org/10.1257/app.20210435.
Ash, Elliott, and Stephen Hansen. 2023. “Text Algorithms in Economics.” Annual Review of Economics 15 (1): 659–88. https://doi.org/10.1146/annurev-economics-082222-074352.
Biasi, Barbara, and Song Ma. 2022. “The Education-Innovation Gap.” Working Paper. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w29853.
Cagé, Julia, Nicolas Hervé, and Marie-Luce Viaud. 2020. “The Production of Information in an Online World.” The Review of Economic Studies 87 (5): 2126–64. https://doi.org/10.1093/restud/rdz061.
Djourelova, Milena, Ruben Durante, and Gregory J Martin. 2024. “The Impact of Online Competition on Local Newspapers: Evidence from the Introduction of Craigslist.” The Review of Economic Studies, May, rdae049. https://doi.org/10.1093/restud/rdae049.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74. https://doi.org/10.1257/jel.20181020.
Gentzkow, Matthew, and Jesse M Shapiro. 2010. “What Drives Media Slant? Evidence From U.S. Daily Newspapers.” Econometrica 78 (1): 35–71. https://doi.org/10.3982/ECTA7195.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings.” American Sociological Review, September. https://doi.org/10.1177/0003122419877135.