Assignment - Replication and Calibration

Replication of a paper on impacts of high temperature on agricultural yields.

Instructions

Due date: Friday October 3rd, 8:00pm

Submission: On “Portail des Etudes”

Submission type: please submit your .html document (not a .qmd document), generated with Quarto, implementing the analysis required and answering the questions listed below.

Notes: The assignment is quite long. It is fine if you do not do everything in the simulation section.

Objective

Replicate and simulate a paper to allow us to evaluate the performance and robustness of the analysis and understand the role of fixed effects.

We first load the packages used in this document.

Code to load the R packages used in this doc

library(tidyverse)
library(knitr) 
library(kableExtra)
library(broom) 
library(here)
library(fixest) #for fixed effect regression
library(haven) #to read Stata .dta files

set.seed(35) #fix the starting point of the random numbers generator

# The code below is to use my own package for ggplot themes
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") #set theme for all the subsequent plots

# For exercises
library(webexercises)

Setting

In this example, we explore the impact of climate change (or high temperatures) on agricultural yields. Usual approaches to explore this question rely on variation over time to compare hot and cold conditions for a given area. This approach has been implemented in fear of omitted variable bias in traditional cross-sectional approaches (comparing hot areas to cold areas), climate being potentially correlated with other time invariant unobserved variables. A large set of papers leverage this approach to address this question, e.g. Deschênes and Greenstone (2007), Schlenker and Roberts (2009) or Burke and Emerick (2016).

In the present analysis, we will build simulations to emulate a typical study from this literature, measuring the impact of high temperatures on agricultural yields. A typical approach consists in estimating the link between high Growing Degree Days (GDD, roughly the number of days a crop is exposed to a certain range of temperatures) and yields, controlling for characteristics that are invariant in time and across locations. As in Burke and Emerick (2016) and some specifications from Schlenker and Roberts (2009), we will model log corn yield as a piecewise linear function of temperature, considering high and low GDDs. Most of the analyses in the literature being at the county-year level, we will simulate data with a similar structure.

To explore the data generating process, we will leverage data from Burke and Emerick (2016), henceforth BE. This will allow us to make a few simplifying assumptions regarding the data generating process. By replicating BE, we will try to confirm that these assumptions do not significantly alter the results of the estimation in an actual study and that simulations under these assumptions remain close to an actual study from this literature.

Let’s thus first replicate the analysis in BE.

Replication

Replication setting and set up

Burke and Emerick (2016), henceforth BE, aims to evaluate the impact of high temperature days on corn production, estimating the following equation:

$log(y_{ct}) = \beta_0 + \beta_1 H_{ct} + \beta_2 L_{ct} + \beta_3 H^{prec}_{ct} + \beta_4 L^{prec}_{ct} + \lambda_c + \kappa_t + u_{ct} \qquad(1)$

where $y_{ct}$ is the average crop yield in county $c$ in year $t$ , $L_{ct}$ the number of Growing Degree Days (GDDs)¹ below a given threshold (28 and 29°C for the panel and long-differences models respectively), $H$ the number of GDDs above this threshold. $H^{prec}_{ct}$ and $L^{prec}_{ct}$ represent equivalent measures for precipitations (threshold = 42 cm/year for the panel model, 50 for the long-differences one). $\lambda_c$ and $\kappa_t$ are county and time fixed effects respectively and $u_{ct}$ an error term.

The results we want to replicate are displayed in the paper, in Table 1 column 3:

Here we are going to focus on the results from table 1 in BE, for the panel model. Their code is written in Stata; we can translate it to R. For simplicity and since it is not the purpose of this exercise, I already did that and cleaned the data.

You however need to download their replication package on the paper’s page. Then store the data set us_panel.dta in your working directory (or a sub directory).

Warning

You will need to change the path in read_dta() and adapt it to your case.

BE_data <- haven::read_dta(
  here::here("content", "sim", "data", "replication_BE", "us_panel.dta"))

BE_wrangled <- BE_data |>
  filter(year >= 1978 & year <= 2002) |> 
  filter(longitude > -100) |> 
  transmute(
    year,
    fips = str_pad(fips, 5, "left", "0"),
    stfips = str_sub(fips, 1, 2),
    cornyield,
    logcornyield = log(cornyield),
    lower = dday0C - dday29C,
    higher = dday29C,
    prec_lo = ifelse(prec <= 42, prec - 42, 0),
    prec_hi = ifelse(prec > 42, prec - 42, 0),
    corn_area
  ) 

BE_clean <- BE_wrangled |> 
  group_by(fips) |> 
  mutate(
    #/sum to compute the equivalent of analytic weights in stata
    corn_area_78_02 = mean(corn_area, na.rm = TRUE)/sum(!is.na(corn_area))
  ) |> 
  ungroup()

Data structure

Before anything, let’s look at this data. It is a panel with a time dimension, and an individual dimension, .

How many time periods are there in this panel?

And how many individuals (counties)?

Is this panel balanced? .

The standard errors are clustered at the state level, we first have a quick look at this additional layer in the panel. How many states are there in this panel? . Is it equal to the number of states in the US? Why is that?

How many observations are there on average by state? (county/years) . Is the number of observations per state uniformely distributed across states?

Actual replication

Before anything, let’s try to replicate the main regression, as specified in the paper. The model is the one described in Equation 1.

Since the regression model specified in the paper is a fixed effect one, we are going to estimate it using the feols() function from the fixest package. Look at the documentation for this function. To access it, you can use the function.

Warning

To be able to replicate the results in the analysis, we need to include weights and cluster the standard errors at the state level.

reg_fe <- feols(
  ...,
  weights = ~ corn_area_78_02, 
  cluster = ~ stfips
)

Fixed-effects

Change the levels of fixed effects and explore how it affects:

The estimand
The estimate of the parameter of interest

Comment and discuss.

A simpler model?

We will generate data assuming that the econometric model we want to estimate accurately represents the DGP. However, this econometrics model is rather complex and it might be challenging to directly simulate a DGP based on it. Let’s try to see if a simpler model would yield very different results.

Remove the controls (ie only keeping one explanatory variable) and time fixed effects from your econometric model. Estimate this new econometric model. The explained variable is . The explanatory variable is and the fixed effects are . Let’s also remove the weights for simplicity.

reg_fe_no_ctrl <- feols(
  ...,
  cluster = ~stfips
)

Are the results approximately comparable to the full model?

We will therefore simulate this simplified model. How many variables do we need to generate?

Distribution of the variables

To be able to generate a fake data set, we first need to explore the distribution of the variables.

Distribution the logcornyield

We first focus on the log of corn yields. Since the dependent variable in the analysis is the transformed version of corn yields, we will directly generate it, instead of generating cornyield and then taking its log. We start by plotting the distribution of log corn yields and fitting a normal to it, manually selecting the parameters of the normal distribution.

Let’s first plot the distribution of logcornyield.

# plot

What could be a decent approximation of this distribution ? Using geom_function(), let’s overlay the cdf of this theoretical distribution on top of the distribution of logcornyield. Manually adjust the parameters of the theoretical distribution to get a good fit.

BE_clean |> 
  ggplot(aes(x = logcornyield)) + 
  geom_density() + 
  geom_function(
    fun = \(x) dnorm(x, mean  = 4.55, sd = 0.33),
      linewidth = 1
    ) +
  labs(
    title = "Distribution of log corn yields",
    subtitle = "Tentative normal fit",
    x = "Log Corn Yield",
    y = "Denisty"
  )

Decomposition

We actually want to take county fixed effects into account. We therefore want to decompose this distribution into a between-counties component and a within-county one.

We first assume that de data generating process is as follows:

For each county, draw observations for all the years from a distribution shared across counties ( $WC$ )
Add county fixed-effects that shift the mean of this distribution differently for each county. These fixed-effects therefore represent the variation in means that are county specific.

In other words, we assume that the DGP is as follows for $logyield$ in county $c$ in year $t$ :

$logyield_{ct} = \alpha + WC_t + BC_c$ Where $WC_t \sim f_t$ is a distribution shared across counties and $BC_c \sim g_c$ is a distribution shared across years. Both distribution have mean 0 ; the overall mean is captured by $\alpha$ .

The goal is to retrieve the two distributions and the constant. The DAG for our assumed DGP here is:

Distribution of within-county log corn yields

Let’s therefore first demean the (log)yields by removing the county average (log)yields. This gives a distribution of (log)yields centered on 0. Plot this distribution and overlay a (calibrated) theoretical distribution on it.

Note

We need to filter out a few observations for which logcornyield is equal to -Infinity (use filter(logcornyield != -Inf))

#plot

Distribution of between-counties log corn yields

Let’s then explore how the “fixed-effects” are distributed. Plot the distribution of the fixed effects (ie of the county means) and overlay a theoretical distribution on it.

#plot

Distribution of `higher`

We then turn to the explanatory variable of interest, higher. This variable represents the number of GDDs above 28°C.

We start by plotting this distribution and compute its mean and standard deviation.

# plot

We then reproduce the same analysis as for log yields. We assume that “higher GDD” in each counties are drawn from the same underlying distribution but shifted by a fixed effect. We therefore first want to compute the shape of the distribution of the demeaned higher variables.

First, we plot the distribution of the demeaned version of the variable and overlay a theoretical distribution.

#plot

We then explore the distribution of the fixed-effects. Plot the distribution of the “fixed-effects”.

#plot

Which type of distribution could roughly fit this shape? .

We will want to overlay a theoretical distribution to this graph.

We derive rough parameters values from properties of the gamma distribution and then adjust them manually. Denoting respectively $k$ and $\theta$ the shape and scale parameters, the mean of the distribution is given by $k \theta$ and the variance by $k \theta^2$ . We thus solved the system such that mean $\simeq 68 \ (= k \theta)$ and var $\simeq 59 \ (= k \theta^2)$ .

We therefore assume that the fixed-effects are drawn from a gamma distribution with shape 1.3, scale 50.

Note that, we want a distribution centered on 0 so we need to shift the distribution by its mean. The theoretical function to plot is therefore:

shape_g <- 1.3
scale_g <- 50

fun_gamma <-  \(x) dgamma(x + shape_g*scale_g, shape = shape_g, scale = scale_g)

Simulation

Objective of the Simulation

Through this simulation, we will explore whether the model we write actually retrieves the effects we are interested in, in a pristine setting.

Modeling choices

Based on the exploration of the data, we assume that de data generating process for both yield and high degree days is as follows:

Observations are drawn for all the years from a distribution shared across counties
These distributions are “shifted” county by county by adding county fixed effects—drawn from a distribution with mean 0, leaving the average unchanged—.

This leads to the following equation:

$logy_{ct} = \beta_0 + \beta_1 H_{ct} + \lambda_c + u_{ct}$

where $logy_{ct}$ is the log of the average crop yield in county $c$ in year $t$ , $H_{ct}$ the number of GDDs above a given threshold (28°C), $\lambda_c$ county fixed effects and $u_{ct}$ and error term. (log-)yields can be interpreted as the sum of:

A draw from a shared distribution (with variance $\sigma_u^2$ ) whose mean is county dependent ( $\beta_0 + \lambda_c$ )
An effect of high degree days ( $H_{ct}$ ), affecting the variance and mean the of log-yields.

Further, we assume that high degree days are defined as follows:

$H_{ct} = \mu + \eta_c + \epsilon_{ct}$

Where $\eta_c$ are county fixed effects, $\epsilon_{ct}$ an error term and $\mu$ an intercept. As for yields, $H$ in a given year are drawn from a shared distribution (represented by $\epsilon_{ct}$ ) whose mean is county dependent ( $\mu + \eta_c$ ). For simplicity in the simulations, we distinguish the county fixed effects for $H$ and $yield$ since the invariant component for each variable does not have same distribution. We will therefore generate two distinct fixed effect variables. Regardless, they are partialled out with the introduction of fixed effects.

More precisely, let’s set, mostly based on our previous analysis exploring the shape of the distributions:

$N_c$ the number of counties
$N_t$ the number of time periods
$\eta_c \sim \text{Gamma}(k, \theta) - k\theta$ county fixed effects for $H$ . Withdrawing $k\theta$ , the mean of the gamma distribution, ensures that in expectation these fixed effects are equal to 0. For readability, we call $k$ shape_fe_h and $\theta$ scale_fe_h.
$\epsilon_{ct} \sim \mathcal{N}(0, \sigma_{\epsilon}^{2})$ the shared distribution component of $H_{ct}$ .
$H_{ct} = \mu_{H} + \eta_c + \epsilon_{ct}$
$\lambda_c \sim \mathcal{N}(0, \sigma_{\lambda}^2)$ county fixed effects for $logy$ .
$u_{ct} \sim \mathcal{N}(0, \sigma_{u}^{2})$
$logy_{ct} = \beta_0 + \beta_1 H_{ct} + \lambda_c + u_{ct}$

Generate data

Write a generate_data function to generate data according to this DGP. Note that you will need to generate a panel. To do that, you can use the crossing function.

Calibration and baseline parameters’ values

Based on the data exploration, explain which baseline parameters you would choose for the following variables:

Number of observations:
Distribution of the variables:
Treatment effect:

Create a baseline_param dataset and a generate_data() function.

Data exploration

We can now quickly explore the data we generated. Verify that the distribution are correct, ie that the calibration has been done properly. Plot the distributions of the variables and compute their mean and standard deviation.

Model estimation

Which model would you estimate?

Write a function to run the estimation.

Then, write a function that generates a dataset and runs the estimation on this data set. Do you recover the effect of interest?

All simulations

Repeat the analysis several times, without varying the parameters. Do you retrieve the true effect you put in the data?

If not, explore why and fix the issue.

Complexify the DGP

Then, complexify the DGP by adding some correlation but do not change the econometric model. The goal is to explore what happens if the econometric model does not represent the DGP.

References

Burke, Marshall, and Kyle Emerick. 2016. “Adaptation to Climate Change: Evidence from US Agriculture.” American Economic Journal: Economic Policy 8 (3): 106–40. https://doi.org/10.1257/pol.20130025.

Deschênes, Olivier, and Michael Greenstone. 2007. “The Economic Impacts of Climate Change: Evidence from Agricultural Output and Random Fluctuations in Weather.” American Economic Review 97 (1): 354–85. https://doi.org/10.1257/aer.97.1.354.

Schlenker, Wolfram, and Michael J. Roberts. 2009. “Nonlinear Temperature Effects Indicate Severe Damages to U.S. Crop Yields Under Climate Change.” Proceedings of the National Academy of Sciences 106 (37): 15594–98. https://doi.org/10.1073/pnas.0906865106.

Footnotes

BE presents it as “the amount of time a crop is exposed to temperatures between a given lower and upper bound”. Schlenker and Roberts (2009) gives an example: for “bounds of 8°C and 32°C […] a day of 9°C contributes 1 degree day, a day of 10°C contributes 2 degree days, up to a temperature of 32°C, which contributes 24 degree days. All temperatures above 32°C also contribute 24 degree days. Degree days are then summed over the entire season.”↩︎

Setting

Replication

Replication setting and set up

Data structure

Actual replication

Fixed-effects

A simpler model?

Distribution of the variables

Distribution the logcornyield

Decomposition

Distribution of within-county log corn yields

Distribution of between-counties log corn yields

Distribution of higher

Simulation

Modeling choices

Generate data

Calibration and baseline parameters’ values

Data exploration

Model estimation

All simulations

Complexify the DGP

References

Footnotes

Distribution of `higher`