Text As Data - Lecture 1

Use in economics and pre-processing

Machine Learning and Big Data

Vincent Bagilet

2025-11-26

Introduction

Text as Data

  • A plethora of text data

  • Nowadays, we have both the technology and the methods to study it at scale

  • Text data is unstructured:

    • Not organized in a “table-like” format

    • Info we want mixed with info we do not want

    • Need to throw away some info ⇒ select what to keep

  • Text data is very high-dimensional $\left( n_{\text{vocab}}^{n_{\text{words in doc}}} \right)$
  • We often want to relate text data to metadata

    • eg who, when, on what topic?

Think About Your Own Question

  • What type of text data? What source?

  • How would you get the data?

  • Which research question?

  • How would you go about studying this?

Outline and Resources

Housekeeping




Assignment

  • Exercise due by the end of next week

  • You can start today, even if we will not have covered everything

  • Available on Portail des études


Material

Objectives

  • Why is text data useful in economics?

  • What is a typical empirical workflow?

  • How to concretely implement these analyses in Python?

  • What are recent developments in the field?

  • Link this section to the rest of the class

Relation to the rest of the class

  • More on the data analysis and big data side than on the ML one

  • Here, focus on pre-processing steps, ie preparing the data for use in:

    • A ML algorithm
    • An econometric analysis
  • Can use the tools and algorithms you saw in the rest of the class on text data

    • Examples?

Outline

  1. Introduction
  2. Applications in economics
  3. Workflow for analysis
  4. Pre-processing
  5. Representation
  6. Dictionary based methods
  7. Machine learning methods
  8. Introduction to deep learning methods
  9. Validation
  10. Use in econometric analyses

Economics Resources

Elliott Ash’s NLP class

Python Resources

R Resources

Applications in Economics

Measuring Document Similarity

  • What?

    • Make pairwise document comparisons

    • Form clusters of related documents

  • Examples

    • Cagé, Hervé, and Viaud (2020) use the distance between online news articles and social media posts to group items into common stories
    • Biasi and Ma (2022) build a measure of syllabi’s distance from frontier knowledge (academic articles) and then relate this metric to socio-economic variables
  • How?

    • Represent documents in some sort of vector form and then compute cosine similarity

    • There are different ways of defining vectors
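
A minimal sketch of this approach, assuming scikit-learn is available and using a made-up toy corpus (not data from the papers above): represent each document as a tf-idf vector, then compute pairwise cosine similarities.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of three short "documents"
docs = [
    "central bank raises interest rates",
    "the central bank increases rates again",
    "the football team wins the championship",
]

# One tf-idf vector per document
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# 3 x 3 matrix of pairwise cosine similarities
print(cosine_similarity(X).round(2))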

Concept Detection

  • What?

    • Detect the presence of a concept
  • Example

    • Djourelova, Durante, and Martin (2024) identify interpretable news topics using CorEx
  • How?

    • Dictionary methods / pattern matching

    • Topic models to identify latent concepts

    • Distance between documents and dictionaries

    • Supervised (human classification) learning problem

    • BERT/GPT type of models
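
As a minimal illustration of the dictionary / pattern-matching route (the lexicon and sentences below are invented, not taken from Djourelova, Durante, and Martin (2024)), one can flag documents that match any keyword from a small list:

import re

# Hypothetical mini-dictionary for a "cost of living" concept
lexicon = ["inflation", "price increase", "cost of living"]
pattern = re.compile("|".join(re.escape(term) for term in lexicon), re.IGNORECASE)

docs = [
    "The cost of living keeps rising in large cities.",
    "The match ended in a draw.",
]

# 1 if the document mentions the concept, 0 otherwise
flags = [int(bool(pattern.search(doc))) for doc in docs]
print(flags)

[1, 0]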

Relation Between Concepts

  • What?

    • How are concepts related?
  • How?

    • Co-occurrence of dictionaries

    • Word Embedding Association Test (WEAT):

      • Pick two terms at each end of a spectrum (eg rich and poor) and compute the cosine similarity of each term of interest with these “extreme” terms
  • Examples

    • Ash, Chen, and Ornaghi (2024) on gender attitudes of individual US judges

    • Kozlowski, Taddy, and Evans (2019) locate words on meaningful dimensions, eg, rich/poor (this has a feel of Bourdieu’s social space)
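
A minimal WEAT-style sketch with numpy; the 3-dimensional vectors below are made up for illustration (in practice they would come from trained word embeddings):

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings for two "pole" words and two terms of interest
vectors = {
    "rich":      np.array([0.9, 0.1, 0.0]),
    "poor":      np.array([-0.8, 0.2, 0.1]),
    "education": np.array([0.5, 0.4, 0.2]),
    "precarity": np.array([-0.6, 0.3, 0.3]),
}

# Positive score: closer to "rich"; negative score: closer to "poor"
for term in ["education", "precarity"]:
    score = cosine(vectors[term], vectors["rich"]) - cosine(vectors[term], vectors["poor"])
    print(term, round(score, 2))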

Associating Text with Metadata

  • Impute an outcome of interest to other documents

  • Supervised learning

  • eg have party labels for some documents but not all

  • Example

    • Gentzkow and Shapiro (2010) build a model for party label using US Congressional speeches. Then use it to predict political bias of newspapers
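
A toy sketch of this logic with scikit-learn (the speeches and labels are invented, and this is not Gentzkow and Shapiro’s actual procedure): fit a classifier on labeled documents, then impute the label for an unlabeled one.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented labeled speeches
speeches = [
    "we must cut taxes and reduce regulation",
    "free markets create jobs and growth",
    "we need to expand public healthcare",
    "raise the minimum wage to protect workers",
]
labels = ["R", "R", "D", "D"]

# Simple text classifier: tf-idf features + logistic regression
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(speeches, labels)

# Predict the label of an unlabeled document
print(model.predict(["lower taxes encourage investment"]))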

Workflow for Analysis

How are text data structured?

  • Text data comes as sequences of characters called documents

  • A corpus is a collection of documents

  • The unit of analysis (the “document”) depends on the question:

    • Fine enough to fit relevant metadata variation

    • Not unnecessarily fine, to limit dimensionality

From text to a usable format

  • Text data is very high-dimensional

  • Need to transform the data into a useful representation for modeling

  • Encode text data into numeric arrays:

    • Tokenization and word counts

    • Document-Term Matrix

    • Latent Semantic Analysis

    • Word embeddings

  • Both for computational reasons and to have objects we can manipulate and do computations on
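
For instance, a Document-Term Matrix can be built with scikit-learn’s CountVectorizer (a minimal sketch on a toy corpus):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "ne fumez pas la moquette",
    "la moquette est rouge",
]

# Rows = documents, columns = vocabulary terms, entries = word counts
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())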

Steps of Text Analysis

  • Gentzkow, Kelly, and Taddy (2019) summarize a text analysis in three steps:

    1. Represent raw text $D$ as a numerical array $C$

    2. Map $C$ to predicted values $\hat{V}$ of unknown outcomes $V$

    3. Use $\hat{V}$ in subsequent descriptive or causal analysis

Workflow, rephrased

  1. Collecting data
  2. Pre-processing (stemming, lemmatization, etc)
  3. Data transformation (from pre-processed text to numeric arrays)
  4. Analysis and modeling
  5. Validation
  6. Use in econometric analysis

Type of Analyses


  • Dictionary Methods: rely on pre-defined lexicons and info associated with specific keywords

  • Rule-Based Methods: use predefined rules and patterns (eg RegEx)

  • Machine Learning Methods: methods leveraging lightweight ML models

    • Supervised ML
    • Unsupervised ML
  • Deep Learning Methods: methods leveraging neural networks


There is an important interpretability-flexibility trade-off

Gathering data

Common Text Data Sources

  • Political economy: news articles, parliamentary debates, speeches, social network posts, party manifestos, press releases, etc

  • Economic history: correspondence, institutional documents, books, etc

  • Labor and Industrial Organisation (IO): job ads, product descriptions, etc

  • Finance and macro: earnings conference calls, central bank speeches, etc

Where to find data

Overview of web scraping

  • The overall idea is to write algorithms that:

    1. Browse a website and download relevant html pages

      • Main challenge: identify the relevant pages
    2. Transform the html pages into a data frame containing text data

      • Typical structure: date column, title column, author column, text column, etc
      • Main challenge: identify the right tags
  • Not limited to text data: you can retrieve anything that is online

HTML pages




  • HTML: HyperText Markup Language

  • Structure of an example html page

<html>
<head>
  <title>Page title</title>
</head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100'>
</body>
</html>
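
Once such a page has been downloaded, it can be parsed by tag or attribute; a minimal sketch assuming the BeautifulSoup library (here applied to the example page above, stored as a string):

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page title</title></head>
<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
</body>
</html>
"""

# Extract elements by tag name or by id
soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())
print(soup.find("h1", id="first").get_text())
print(soup.find("p").get_text())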

Webscraping concretely

  • There are libraries for web scraping with useful documentation:

  • Basic pages are static and easy to scrape

  • More complex to scrape dynamic pages

    • Use Selenium
    • It opens a browser and can be used to automate tasks
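
A minimal Selenium sketch (assumes a recent Selenium release and an installed Chrome browser; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Opens a real browser window controlled from Python
driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Once the dynamic content has loaded, locate elements or grab driver.page_source
print(driver.find_element(By.TAG_NAME, "h1").text)

driver.quit()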

A few webscraping tips


Tips

  • Before scraping, check whether someone has already scraped what you want (eg for IMDb)

  • To identify relevant pages, use the site’s sitemap (website.com/sitemap.xml)

  • Use ChatGPT or Claude for scraping

  • Use SelectorGadget or tools from the developer panel of your browser

  • Save html pages before processing them

  • Read the robots.txt page to learn what you can and cannot do

Optical Character Recognition (OCR)

  • Some data is either not digitized, or digitized only as images or scanned PDFs

  • Use software to transform it to text data

  • Tesseract and its Python and R wrappers

  • When PDFs are digitally made, you can use text-extraction software to retrieve the embedded text directly
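
A minimal OCR sketch assuming pytesseract and Pillow are installed (the Tesseract binary and its French language data are also required; "scan.png" is a placeholder file name):

import pytesseract
from PIL import Image

# OCR a scanned page; lang="fra" uses the French language model
text = pytesseract.image_to_string(Image.open("scan.png"), lang="fra")
print(text)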

Pre-processing

Goal

  • Going from raw text to something usable in analysis

  • Reduce the dimensionality by reducing the size of the vocabulary

  • Identify some words as the same

  • Some information is useful, other less:

    • What to keep?

    • What format?

Tokens

  • Tokens: meaningful units of text, such as words or characters

  • In NLP, we often split data into tokens

  • Example from the French parliament: “Ne fumez pas la moquette”

  • Words: {Ne, fumez, pas, la, moquette}

  • n-grams (here 2-grams): {Ne fumez, fumez pas, pas la, la moquette}

  • Sentences

import nltk

# Run only once
nltk.download('punkt') 
nltk.download('punkt_tab')

sentence = "Ne fumez pas la moquette"
tokens = nltk.word_tokenize(sentence)

tokens


['Ne', 'fumez', 'pas', 'la', 'moquette']
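
The 2-grams shown above can be built from these tokens, eg with nltk:

from nltk.util import ngrams

# Bigrams from the word tokens
bigrams = list(ngrams(tokens, 2))
print(bigrams)

[('Ne', 'fumez'), ('fumez', 'pas'), ('pas', 'la'), ('la', 'moquette')]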

Pre-processing Choices

  • Pre-processing choices often affect the results

What To Keep?

  • Not all data is useful

  • But sometimes we may remove useful information: eg “happy” vs “not happy”

  • Remove capitalization?

  • Punctuation?

  • Numbers?

  • Stopwords?

    • Words that carry little information

    • How do you define this set?

Capitalization and punctuation


  • Often uninformative

  • But sometimes important:

    • Sentence splitting

    • Part-of-speech tagging

    • Named entity recognition

    • Text generation

  • Option: keep capitalization except at the beginning of sentences


tokens_lower = [word.lower() for word in tokens]
print(tokens_lower)
['ne', 'fumez', 'pas', 'la', 'moquette']

Stop words

  • We have seen that they might create noise but may also carry meaning

  • How to define stopwords?

  • Can use existing lists

nltk.download('stopwords')  # run only once
stopwords = set(nltk.corpus.stopwords.words('french'))

print(stopwords)
{'ayantes', 'ils', 'sera', 'eut', 'eussiez', 'est', 'sont', 'ta', 'moi', 'auront', 'eusse', 'la', 'ton', 'eûmes', 'avons', 'êtes', 'eus', 'eussent', 'n', 'te', 'il', 'ses', 'eurent', 'avait', 'étées', 'fûtes', 'son', 'étés', 'vos', 'étant', 'était', 'aurais', 'tes', 'étants', 'd', 'elle', 'y', 'ma', 'eût', 'aurons', 'auras', 'seraient', 'suis', 'eussions', 'pas', 'ayante', 'ont', 'auriez', 'ce', 'qu', 'fût', 'sa', 'aie', 'avais', 'ait', 'les', 'fussiez', 'l', 'j', 'serez', 'es', 'aies', 'sois', 'sur', 'sommes', 'de', 'soit', 'eusses', 'fus', 'avions', 'tu', 'eue', 'même', 'toi', 'un', 'aviez', 's', 'nous', 'ayant', 'eux', 'avec', 'mon', 'ayants', 'ai', 'me', 't', 'serai', 'aurai', 'eûtes', 'et', 'on', 'avaient', 'fut', 'dans', 'ayez', 'à', 'mes', 'notre', 'qui', 'ou', 'vous', 'aurez', 'mais', 'étais', 'fussions', 'fûmes', 'nos', 'étions', 'furent', 'seriez', 'serais', 'serons', 'je', 'par', 'ayons', 'étante', 'soient', 'avez', 'du', 'seras', 'aurait', 'ne', 'étiez', 'eues', 'que', 'soyez', 'eu', 'seront', 'fussent', 'étaient', 'aient', 'aura', 'votre', 'soyons', 'auraient', 'au', 'aux', 'une', 'ces', 'étée', 'fusse', 'leur', 'm', 'as', 'en', 'c', 'fusses', 'aurions', 'serait', 'des', 'étantes', 'pour', 'lui', 'serions', 'se', 'été', 'le'}
tokens_filtered = [word for word in tokens_lower if word not in stopwords]
print(tokens_filtered)
['fumez', 'moquette']
  • Which list? How to define it?
  • Build one

    • By hand, starting from existing ones

    • Remove words that appear the most frequently, eg by Inverse Document Frequency (idf):

$\text{idf}(\text{term}) = \ln \left(\dfrac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)$

  • idf decreases the weight of commonly used words and increases that of words that are rarely used in the corpus

  • tf-idf: the product of a term’s frequency (tf) and its idf
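
A minimal tf-idf sketch with scikit-learn on a toy corpus (note that scikit-learn’s default idf adds smoothing terms, so the weights differ slightly from the formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "ne fumez pas la moquette",
    "la moquette est rouge",
    "ne fumez pas ici",
]

# Common terms get low idf weights, rare terms high ones
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))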

Stemming/Lemmatization

  • Find the stem of the word:

    • Rule-based

    • Produces stems that are not actual words

stemmer = nltk.stem.snowball.SnowballStemmer(language="french")
tokens_stemmed = [stemmer.stem(word) for word in tokens]
print(tokens_stemmed)
['ne', 'fum', 'pas', 'la', 'moquet']
  • Lemmatization:

    • Similar but with semantic rules

    • Produces actual words but sentences that do not mean anything

import spacy
# Requires the model: python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")
doc = nlp(sentence)
lemmas = [token.lemma_ for token in doc]

print(lemmas)
['ne', 'fumer', 'pas', 'le', 'moquette']

Summary

NLP in Economics

  • Measuring document similarity
  • Concept detection
  • Relation between concepts
  • Associating text with metadata

Overall Approach

  • Get text data (ready-made, web scraping, OCR, etc)
  • Pre-process the data
  • Transform data into a useful format (a numeric array)
  • Run analysis. Several methods:
    • Dictionary-based
    • Rule-based
    • Machine Learning
    • Deep Learning
  • Use output in an econometric analysis

References

Ash, Elliott, Daniel L. Chen, and Arianna Ornaghi. 2024. “Gender Attitudes in the Judiciary: Evidence from US Circuit Courts.” American Economic Journal: Applied Economics 16 (1): 314–50. https://doi.org/10.1257/app.20210435.
Ash, Elliott, and Stephen Hansen. 2023. “Text Algorithms in Economics.” Annual Review of Economics 15 (1): 659–88. https://doi.org/10.1146/annurev-economics-082222-074352.
Biasi, Barbara, and Song Ma. 2022. “The Education-Innovation Gap.” Working Paper. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w29853.
Cagé, Julia, Nicolas Hervé, and Marie-Luce Viaud. 2020. “The Production of Information in an Online World.” The Review of Economic Studies 87 (5): 2126–64. https://doi.org/10.1093/restud/rdz061.
Djourelova, Milena, Ruben Durante, and Gregory J Martin. 2024. “The Impact of Online Competition on Local Newspapers: Evidence from the Introduction of Craigslist.” The Review of Economic Studies, May, rdae049. https://doi.org/10.1093/restud/rdae049.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74. https://doi.org/10.1257/jel.20181020.
Gentzkow, Matthew, and Jesse M Shapiro. 2010. “What Drives Media Slant? Evidence From U.S. Daily Newspapers.” Econometrica 78 (1): 35–71. https://doi.org/10.3982/ECTA7195.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings.” American Sociological Review, September. https://doi.org/10.1177/0003122419877135.