Use in economics and pre-processing

Date

November 26, 2025

Objective

Discuss some of the main uses of text as data in economics and describe the necessary steps to prepare your text data for analysis.

Summary

Text data is ubiquitous, and developments in text analysis allow us to study economic questions that could not be explored before. This lecture describes some of the types of analysis that text data makes possible. Before any analysis, however, one needs to gather and pre-process the data; the second part of the lecture covers these steps.

Session Outline

Introduction
- Specificity of text data
- Why use text data in economics?
Outline of the course and resources
Applications in economics
- Measuring document similarity
- Concept detection
- Relation between concepts
- Associating text with metadata
Workflow for analysis
Gathering data
- Common data sources and ways to gather data
- Introduction to web scraping
- Optical Character Recognition
Pre-processing
- Tokenization
- Capitalization and punctuation
- Stemming/lemmatization
- Stopwords
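
To make the "Measuring document similarity" item in the outline concrete, here is a minimal sketch of one common approach: represent each document as a bag of words and compare the count vectors with cosine similarity. The three example sentences and the function names are assumptions made for illustration; they are not taken from the lecture slides.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    # A Counter returns 0 for terms it has not seen, so missing terms add nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, chosen only to illustrate the measure.
doc_1 = "The central bank raised interest rates."
doc_2 = "Interest rates were raised by the central bank."
doc_3 = "The harvest was poor this year."

sim_12 = cosine_similarity(bag_of_words(doc_1), bag_of_words(doc_2))
sim_13 = cosine_similarity(bag_of_words(doc_1), bag_of_words(doc_3))
print(f"doc_1 vs doc_2: {sim_12:.3f}")
print(f"doc_1 vs doc_3: {sim_13:.3f}")
```

The first pair shares most of its vocabulary and scores much higher than the second pair, which only shares the word "the". That shared stop word is one reason the pre-processing steps listed above matter before computing similarities.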

Materials

Open slides in HTML

Open slides in PDF

Exercise

The assignment is available on the Portail des Etudes.

Specific resources for this lecture

If you should read only one thing

Ash and Hansen (2023) gives a great overview of the use of text analysis in economics and is the reference on which the first part of this lecture is built.

Gathering data

Curated lists of text data: here, here, here, and there
Web scraping: read the documentation and tutorials of the main web scraping packages, Beautiful Soup in Python and rvest in R
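
As a rough illustration of what web scraping involves, the sketch below extracts the paragraph text from an HTML page. To stay self-contained it uses Python's standard-library `html.parser` rather than Beautiful Soup, which offers a higher-level interface for the same task. The HTML snippet and the class name `ParagraphExtractor` are hypothetical, and the page is hard-coded so that nothing is downloaded.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Hypothetical page content, hard-coded so the example runs offline.
html = """
<html><body>
  <h1>Press release</h1>
  <p>The committee decided to keep rates <b>unchanged</b>.</p>
  <p>Inflation is expected to ease.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(html)
parser.close()
paragraphs = [p.strip() for p in parser.paragraphs]
print(paragraphs)
```

In a real project the `html` string would come from an HTTP request, and it is worth checking a site's terms of use before scraping it.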

Pre-processing

Chapters 2 to 4 of Hvitfeldt and Silge (2022), a very clear overview of supervised ML (abstracting from the R coding)
Lecture 3 of Elliott Ash’s NLP class
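
The pre-processing steps covered in this lecture (tokenization, capitalization and punctuation, stopwords, stemming) can be sketched in a few lines of plain Python. This is a toy pipeline under stated assumptions: the stop word list is a tiny illustrative set, and `naive_stem` is a hypothetical suffix-stripping rule, not the Porter stemmer or a lemmatizer. In practice one would rely on an NLP library for these steps.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "were"}


def tokenize(text: str) -> list[str]:
    """Lowercase, then keep runs of letters; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes. A toy rule, not the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, remove stop words, then stem what remains."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


tokens = preprocess("The banks were raising interest rates in 2023.")
print(tokens)  # ['bank', 'rais', 'interest', 'rate']
```

Note that "raising" becomes "rais", which is not a word: stemming only truncates, whereas lemmatization would map it to the dictionary form "raise".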

References

Ash, Elliott, and Stephen Hansen. 2023. “Text Algorithms in Economics.” Annual Review of Economics 15 (1): 659–88. https://doi.org/10.1146/annurev-economics-082222-074352.

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R.