Use in economics and pre-processing

Date

November 26, 2025

Objective

Discuss some of the main uses of text as data in economics and describe the necessary steps to prepare your text data for analysis.

Summary

Text data is ubiquitous, and developments in text analysis allow us to study economic questions that could not be explored before. The first part of this lecture describes some of the types of analysis that text data make possible. Before any analysis, however, one needs to gather and pre-process the data. The second part of the lecture covers these steps.

Session Outline

  1. Introduction
    • Specificity of text data
    • Why use text data in economics?
  2. Outline of the course and resources
  3. Applications in economics
    • Measuring document similarity
    • Concept detection
    • Relation between concepts
    • Associating text with metadata
  4. Workflow for analysis
  5. Gathering data
    • Common data sources and ways to gather data
    • Introduction to web scraping
    • Optical Character Recognition
  6. Pre-processing
    • Tokenization
    • Capitalization and punctuation
    • Stemming/lemmatization
    • Stopwords

Materials

Open slides in html

Open slides in pdf

Exercise

The assignment is available on the Portail des Etudes.

Specific resources for this lecture

If you should read only one thing

Ash and Hansen (2023) give a great overview of the use of text analysis in economics and are the reference on which the first part of this lecture is built.

Gathering data

  • Curated lists of text data: here, here, here, and there
  • Web scraping: read the documentation and tutorials of the main web scraping packages, Beautiful Soup in Python and rvest in R

Pre-processing

  • Chapters 2 to 4 of Hvitfeldt and Silge (2022), a very clear overview of supervised ML (abstracting from the R coding)
  • Lecture 3 of Elliott Ash’s NLP class
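
To make the pre-processing steps listed in the session outline concrete, here is a minimal Python sketch that chains tokenization, lowercasing, punctuation removal, stopword filtering and stemming. It is an illustration only and not taken from the lecture materials: the stopword list, the suffix list and the function names (tokenize, toy_stem, preprocess) are all assumptions, and the stemmer is a crude suffix stripper rather than an established algorithm such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}

# Suffixes removed by the toy stemmer, tried in this order (also an assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and keep runs of letters; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Strip at most one suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem the remaining tokens."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


print(preprocess("The central banks raised interest rates, citing rising prices."))
# ['central', 'bank', 'rais', 'interest', 'rat', 'cit', 'ris', 'pric']
```

The output shows why stemming is a trade-off: "rates" and "prices" are reduced to "rat" and "pric", which merge inflected forms but are no longer dictionary words. Lemmatization, which maps each word to its dictionary form, avoids this at the cost of needing a vocabulary or a trained model.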

References

Ash, Elliott, and Stephen Hansen. 2023. “Text Algorithms in Economics.” Annual Review of Economics 15 (1): 659–88. https://doi.org/10.1146/annurev-economics-082222-074352.
Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R.