Replication of a paper on impacts of high temperature on agricultural yields.
Date
October 9, 2024
Objective
Replicate and simulate a paper to allow us to evaluate the performance and robustness of the analysis.
We first load the packages used in this document.
Code to load the R packages used in this doc
library(tidyverse)
library(knitr)
library(kableExtra)
library(broom)
library(here)
library(fixest) #for fixed effect regression
library(haven) #to read Stata .dta files
library(modelsummary)

set.seed(35) #fix the starting point of the random numbers generator

# The code below is to use my own package for ggplot themes
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)

set_mediocre_all(pal = "portal") #set theme for all the subsequent plots

# For exercises
library(webexercises)
Setting
In this example, we explore the impact of climate change (or high temperatures) on agricultural yields. Usual approaches to this question rely on variation over time, comparing hot and cold conditions within a given area. This approach was adopted out of concern about omitted variable bias in traditional cross-sectional approaches (comparing hot areas to cold areas), climate being potentially correlated with other time-invariant unobserved variables. A large set of papers leverage this approach to address this question, e.g. Deschênes and Greenstone (2007), Schlenker and Roberts (2009) or Burke and Emerick (2016).
In the present analysis, we will build simulations to emulate a typical study from this literature, measuring the impact of high temperatures on agricultural yields. A typical approach consists in estimating the link between high Growing Degree Days (GDD, roughly the number of days a crop is exposed to a certain range of temperatures) and yields, controlling for characteristics that are invariant in time and across locations. As in Burke and Emerick (2016) and in some specifications from Schlenker and Roberts (2009), we will model log corn yield as a piecewise linear function of temperature, considering high and low GDDs. Since most analyses in the literature are at the county-year level, we will simulate data with a similar structure.
To explore the data generating process, we will leverage data from Burke and Emerick (2016), henceforth BE. This will allow us to make a few simplifying assumptions regarding the data generating process. By replicating BE, we will try to confirm that these assumptions do not significantly alter the results of the estimation in an actual study and that simulations under these assumptions remain close to an actual study from this literature.
Let’s thus first replicate the analysis in BE.
Replication
Replication setting and set up
Exploring the results
Burke and Emerick (2016), henceforth BE, aims to evaluate the impact of high temperature days on corn production, estimating the following equation:
$$\log(y_{it}) = \beta_1 GDD^{low}_{it} + \beta_2 GDD^{high}_{it} + \gamma_1 Prec^{low}_{it} + \gamma_2 Prec^{high}_{it} + \alpha_i + \delta_t + \epsilon_{it}$$

where $y_{it}$ is the average crop yield in county $i$ in year $t$, $GDD^{low}_{it}$ the number of Growing Degree Days (GDDs)1 below a given threshold (28 and 29°C for the panel and long-differences models respectively) and $GDD^{high}_{it}$ the number of GDDs above this threshold. $Prec^{low}_{it}$ and $Prec^{high}_{it}$ represent equivalent measures for precipitations (threshold = 42 cm/year for the panel model, 50 for the long-differences one). $\alpha_i$ and $\delta_t$ are county and time fixed effects respectively and $\epsilon_{it}$ an error term.
What is the parameter of interest here?
The results we want to replicate are displayed in the paper, in Table 1 column 3:
What is their estimated value of the parameter of interest?
Is their design good enough to have a high probability (let's say more than 80%) of detecting the true effect if it were equal to half the size of the estimated effect?
To explore this we can use the function .
This function takes two parameters: the hypothetical true effect size (here, half of the estimated effect) and the standard error of the obtained estimate. Run this function. What would be the power of this analysis (in %)? Does this analysis seem powered enough to detect an effect of this magnitude?
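If the helper function is not at hand, the underlying calculation can be sketched by hand, assuming the estimator is approximately normal (in the spirit of Gelman and Carlin's retrospective design analysis). The function below and the illustrative values in the usage comment are assumptions, not the course's helper function:

# Minimal hand-rolled power calculation: probability that an estimate is
# statistically significant given a hypothetical true effect and the standard
# error of the estimate (normal approximation)
compute_power <- function(true_effect, se, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  lambda <- true_effect / se #standardized hypothetical true effect
  pnorm(lambda - z_crit) + pnorm(-lambda - z_crit)
}

# Hypothetical usage (the standard error value is purely illustrative; replace
# it with the standard error reported in Table 1, column 3):
# compute_power(true_effect = -0.0025, se = 0.001)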
Note
Ideally, we would also test whether the design is good enough to detect effects that are consistent with typical effect sizes from the literature. For instance if a typical effect size in the literature was -0.001, would this design be good enough to detect it?
Try several hypothetical effect sizes. Note that you can pass a vector of hypothetical true effect sizes to the function. You can then plot a graph of the statistical power as a function of the hypothetical effect size for instance.
For which hypothetical true effect size does the statistical power become smaller than 80%?
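A possible sketch for this, reusing the compute_power() function from above (the standard error value is an illustrative placeholder, not the paper's):

se_hat <- 0.001 #purely illustrative; replace with the estimate's standard error

tibble(effect = seq(-0.005, 0, by = 0.0001)) |>
  mutate(power = compute_power(effect, se_hat)) |>
  ggplot(aes(x = effect, y = power)) +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = "dashed") +
  labs(x = "Hypothetical true effect size", y = "Statistical power")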
Objective of the analysis
The goal of the present analysis will be to assess whether the design in BE is good enough to detect an effect size equal to half the effect size of the estimate found in the analysis.
Note that here we are replicating an analysis that has already been implemented. We can therefore run a retro design calculation before doing anything else, and in particular before implementing the simulation. If we were implementing our own analysis, we could not do that: we would only be able to run this retro design analysis after completing the whole analysis, hence the use of the simulation. In the present setting the simulation may seem useless since we know what result it should yield. We could however use it to study the drivers of statistical power, for instance, or to test the effect of potentially erroneous hypotheses (here we estimate an econometric model that exactly fits the DGP; what would happen if that were not the case?).
Data wrangling and cleaning
Their code is written in Stata; we can translate it to R. For simplicity and since it is not the purpose of this exercise, I already did that and cleaned the data.
You however need to download their replication package from the paper's page. Then store the data set us_panel.dta in your working directory (or a subdirectory).
Warning
You will need to change the path in read_dta() and adapt it to your case.
Before anything, let's look at this data. It is a panel with a time dimension (the year) and an individual dimension (the county, identified by fips).
How many time periods are there in this panel?
And how many individuals (counties)?
Is this panel balanced?
The standard errors are clustered at the state level, so let's first have a quick look at this additional layer of the panel. How many states are there in this panel? Is it equal to the number of states in the US? Why is that?
How many observations (county-years) are there on average by state? Is the number of observations per state uniformly distributed across states?
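One possible way to answer these questions, assuming the cleaned data set is called BE_clean and that the county, year and state identifiers are named fips, year and state (adapt these names to the actual data):

# Number of years, counties and states in the panel
BE_clean |>
  summarise(
    n_years = n_distinct(year),
    n_counties = n_distinct(fips),
    n_states = n_distinct(state)
  )

# Balance check: do all counties appear the same number of times?
BE_clean |>
  count(fips) |>
  distinct(n)

# Average number of county-year observations per state
BE_clean |>
  count(state) |>
  summarise(mean_obs_per_state = mean(n))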
Actual replication
Before anything, let’s try to replicate the main regression, as specified in the paper. The model is the one described in Equation 1.
Since the regression model specified in the paper is a fixed effects one, we are going to estimate it using the feols() function from the fixest package. Look at the documentation for this function. To access it, you can use ?feols or the help() function.
Warning
To be able to replicate the results in the analysis, we need to include weights and cluster the standard errors at the state level.
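A sketch of this regression follows. The names of the low-GDD, precipitation, weight and state variables are assumptions (check the replication package for the actual names); only logcornyield, higher and fips are confirmed in the rest of this document:

reg_full <- feols(
  logcornyield ~ lower + higher + prec_lower + prec_higher | fips + year,
  data = BE_clean,
  weights = ~weight, #assumed name of the weighting variable
  cluster = ~state   #cluster standard errors at the state level
)

summary(reg_full)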
We will generate data assuming that the econometric model we want to estimate accurately represents the DGP. However, this econometric model is rather complex and it might be challenging to directly simulate a DGP based on it. Let's try to see whether a simpler model would yield very different results.
Remove the controls (i.e., only keep one explanatory variable) and the time fixed effects from your econometric model. Estimate this new econometric model. The explained variable is . The explanatory variable is and the fixed effects are . Let's also remove the weights for simplicity.
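A possible sketch of this simplified specification (one explanatory variable, county fixed effects only, no weights), still clustering at the state level; the state identifier name is an assumption:

reg_simple <- feols(
  logcornyield ~ higher | fips,
  data = BE_clean,
  cluster = ~state #assumed name of the state identifier
)

summary(reg_simple)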
Are the results approximately comparable to the full model?
We will therefore simulate this simplified model. How many variables do we need to generate?
Distribution of the variables
To be able to generate a fake data set, we first need to explore the distribution of the variables.
Distribution of logcornyield
We first focus on the log of corn yields. Since the dependent variable in the analysis is the transformed version of corn yields, we will directly generate it, instead of generating cornyield and then taking its log. We start by plotting the distribution of log corn yields and fitting a normal to it, manually selecting the parameters of the normal distribution.
Let’s first plot the distribution of logcornyield.
Code
graph_check_distrib <- function(data, var, mean_norm, sd_norm) {
  data |>
    ggplot(aes(x = {{ var }})) +
    geom_density() +
    geom_function(
      fun = \(x) dnorm(x, mean = mean_norm, sd = sd_norm),
      linewidth = 1
    ) + #manually tune the params
    labs(
      y = "Density",
      caption = str_c("Mean normal: ", mean_norm, ", sd: ", sd_norm)
    )
}

BE_clean |>
  graph_check_distrib(logcornyield, 4.55, 0.33) +
  labs(
    title = "Distribution of corn yields",
    subtitle = "Tentative normal fit",
    x = "Corn yield"
  )
What could be a decent approximation of this distribution? Using geom_function(), let's overlay the density of this theoretical distribution on top of the distribution of logcornyield. Manually adjust the parameters of the theoretical distribution to get a good fit.
BE_clean |>
  ggplot(aes(x = logcornyield)) +
  geom_density() +
  geom_function(
    fun = \(x) dnorm(x, mean = 4.55, sd = 0.33),
    linewidth = 1
  ) +
  labs(
    title = "Distribution of log corn yields",
    subtitle = "Tentative normal fit",
    x = "Log Corn Yield",
    y = "Density"
  )
Decomposition
We actually want to take county fixed effects into account. We therefore want to decompose this distribution into a between-counties component and a within-county one.
We first assume that the data generating process is as follows:
For each county, draw observations for all the years from a distribution shared across counties
Add county fixed effects that shift the mean of this distribution differently for each county. These fixed effects therefore represent the variation in means that is county specific.
In other words, we assume that the DGP is as follows for the log yield $y_{ct}$ in county $c$ in year $t$:

$$y_{ct} = \mu + \delta_c + \epsilon_{ct}$$

where $\epsilon_{ct}$ is drawn from a distribution shared across counties and $\delta_c$ from a distribution shared across years. Both distributions have mean 0; the overall mean is captured by $\mu$.
The goal is to retrieve the two distributions and the constant. The DAG for our assumed DGP here is:
To further explain the between/within county decomposition, let’s plot a graph:
Code
sample_counties <- sample(unique(BE_clean$fips), 15)

BE_clean |>
  filter(cornyield > 0) |>
  filter(fips %in% sample_counties) |>
  ggplot(aes(x = logcornyield, y = fips)) +
  ggridges::geom_density_ridges(
    fill = colors_mediocre[["base"]],
    color = colors_mediocre[["base"]],
    alpha = 0.5
  ) +
  labs(
    title = "Distribution of Log Corn Yield by County",
    subtitle = "For a random subset of counties",
    # subtitle = paste("In the", fips, "county"),
    x = "Log cornyield",
    y = "County identifier"
  )
This graph represents, for a subset of counties, the distribution, by county, of log corn yields for every year in the sample. The idea is that logcornyield comes from an identical distribution for each county, the only difference being that it is shifted by a certain quantity (the fixed effect) for each county. The fixed effect for a county therefore represents the mean of logcornyield for this county (more precisely, the distance of this mean to the overall mean of logcornyield).
Note
You might (rightly) think that the distribution is absolutely not the same for each county. However, remember that we observe only 25 years of data. Each distribution is thus only composed of 25 points. Even if the underlying distribution was indeed the same, repeated density plots of 25 draws from this distribution would look very different. To convince yourself of that, you can for instance run the following lines of code several times:
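A minimal sketch of such a check (the exact code is not shown here): 25 draws from a single normal distribution, with parameters close to those fitted above. Run it several times and compare the shapes of the resulting density plots:

tibble(draw = rnorm(25, mean = 4.55, sd = 0.2)) |>
  ggplot(aes(x = draw)) +
  geom_density()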
This however does not mean that the hypothesis of an identical within-county distribution holds. But let's assume it does, for the sake of the exercise. To double check it, we would need many more observations per group. To do so, instead of plotting distributions by county, we could for instance plot them by groups of counties.
Distribution of within-county log corn yields
Let’s therefore first demean the (log)yields by removing the county average (log)yields. This gives a distribution of (log)yields centered on 0. Plot this distribution and overlay a (calibrated) theoretical distribution on it.
Note
We need to filter out a few observations for which logcornyield is equal to -Infinity (use filter(logcornyield != -Inf))
Code
BE_demean_logyield <- BE_clean |>
  filter(logcornyield != -Inf) |> #100 obs. To avoid pbs when computing mean
  group_by(fips) |>
  mutate(
    county_mean_logyield = mean(logcornyield, na.rm = TRUE)
  ) |>
  ungroup() %>%
  mutate(
    logyield_demean = (logcornyield - county_mean_logyield) +
      mean(.$county_mean_logyield, na.rm = TRUE)
  )

BE_demean_logyield |>
  graph_check_distrib(logyield_demean, 4.55, 0.2) + #manually tune the params
  labs(
    title = "Distribution of log of corn yields",
    subtitle = "Removing between counties mean variation\nand tentative normal fit",
    x = "Demeaned logcornyield"
  )
Distribution of between-counties log corn yields
Let’s then explore how the “fixed-effects” are distributed. Plot the distribution of the fixed effects (ie of the county means) and overlay a theoretical distribution on it.
Code
BE_demean_logyield %>%
  mutate(
    county_fe = county_mean_logyield - mean(.$county_mean_logyield, na.rm = TRUE)
  ) |>
  graph_check_distrib(county_fe, 0, 0.26) +
  labs(
    title = 'Distribution of county "fixed-effects"',
    subtitle = "Tentative normal fit",
    x = "Distance of county mean to the overall mean"
  )
Distribution of higher
We then turn to the explanatory variable of interest, higher. This variable represents the number of GDDs above 28°C.
We start by plotting this distribution and compute its mean and standard deviation.
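A quick sketch of this first look at the raw higher variable:

BE_clean |>
  summarise(
    mean_higher = mean(higher, na.rm = TRUE),
    sd_higher = sd(higher, na.rm = TRUE)
  )

BE_clean |>
  ggplot(aes(x = higher)) +
  geom_density() +
  labs(x = "Number of GDDs above 28°C", y = "Density")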
We then reproduce the same analysis as for log yields. We assume that "higher GDDs" in each county are drawn from the same underlying distribution but shifted by a fixed effect. We therefore first want to compute the shape of the distribution of the demeaned higher variable.
First, we plot the distribution of the demeaned version of the variable and overlay a theoretical distribution.
Code
demean_higher <- BE_clean |>
  group_by(fips) |>
  mutate(
    county_mean_higher = mean(higher, na.rm = TRUE),
    sd_higher = sd(higher, na.rm = TRUE)
  ) |>
  ungroup() %>%
  mutate(
    higher_demean = (higher - county_mean_higher) +
      mean(.$county_mean_higher, na.rm = TRUE)
  )

demean_higher |>
  graph_check_distrib(higher_demean, 65, 18) +
  labs(
    title = "Distribution of the 'higher', centered by county",
    subtitle = "Very tentative normal fit",
    x = "County mean of 'higher'"
  )
We then explore the distribution of the fixed-effects. Plot the distribution of the “fixed-effects”.
Code
# shift_g <- 68
# shape_g <- 1.3
# scale_g <- 50

demean_higher %>%
  mutate(
    county_fe = county_mean_higher - mean(.$county_mean_higher, na.rm = TRUE)
  ) |>
  ggplot(aes(x = county_fe)) +
  geom_density() +
  labs(
    title = "Distribution of the county means of 'higher'",
    x = "Distance of county mean of `higher` to the overall mean",
    y = "Density"
  )
Which type of distribution could roughly fit this shape? .
We will want to overlay a theoretical distribution on this graph.
We derive rough parameter values from the properties of the gamma distribution and then adjust them manually. Denoting $k$ and $\theta$ the shape and scale parameters respectively, the mean of the distribution is given by $k\theta$ and the variance by $k\theta^2$. We thus solve this system so that the mean and variance match those of the observed distribution of county means.
We therefore assume that the fixed-effects are drawn from a gamma distribution with shape 1.3, scale 50.
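A quick check of the moments implied by these parameter values:

shape_g <- 1.3
scale_g <- 50
shape_g * scale_g       #mean of the gamma distribution: 65
sqrt(shape_g) * scale_g #standard deviation: about 57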
Note that we want a distribution centered on 0, so we need to shift the distribution by its mean. The theoretical function to plot is therefore:
shift_g <- 68
shape_g <- 1.3
scale_g <- 50

demean_higher %>%
  mutate(
    county_fe = county_mean_higher - mean(.$county_mean_higher, na.rm = TRUE)
  ) |>
  ggplot(aes(x = county_fe)) +
  geom_density() +
  geom_function(
    fun = \(x) dgamma(x + shape_g*scale_g, shape = shape_g, scale = scale_g),
    linewidth = 1
  ) + #manually tune the param based on the values above
  labs(
    title = "Distribution of the county means of 'higher'",
    subtitle = "Tentative gamma fit",
    x = "Distance of county mean of `higher` to the overall mean",
    y = "Density",
    caption = str_c(
      "Param of the gamma distrib: shape = ", shape_g,
      ", scale = ", scale_g, ". Shifted by ", shift_g
    )
  )
Simulation
Objective of the Simulation
Through this simulation, we will explore whether the model we write actually retrieves the effects we are interested in, in a pristine setting.
Modeling choices
Based on the exploration of the data, we assume that the data generating process for both yield and high degree days is as follows:
Observations are drawn for all the years from a distribution shared across counties
These distributions are "shifted" county by county by adding county fixed effects, drawn from a distribution with mean 0 so that the overall average is left unchanged.
This leads to the following equation:
$$y_{ct} = \beta_0 + \beta_1 HGDD_{ct} + \delta_c + \epsilon_{ct}$$

where $y_{ct}$ is the log of the average crop yield in county $c$ in year $t$, $HGDD_{ct}$ the number of GDDs above a given threshold (28°C), $\delta_c$ county fixed effects, $\epsilon_{ct}$ an error term and $\beta_0$ an intercept. (Log-)yields can be interpreted as the sum of:
A draw from a shared distribution (with variance $\sigma_{logyield}^2$) whose mean is county dependent ($\beta_0 + \delta_c$)
An effect of high degree days ($\beta_1 HGDD_{ct}$), affecting the variance and the mean of log-yields.
Further, we assume that high degree days are defined as follows:
$$HGDD_{ct} = \mu_h + \gamma_c + u_{ct}$$

where $\gamma_c$ are county fixed effects, $u_{ct}$ an error term and $\mu_h$ an intercept. As for yields, high degree days in a given year are drawn from a shared distribution (represented by $u_{ct}$) whose mean is county dependent ($\mu_h + \gamma_c$). For simplicity in the simulations, we distinguish the county fixed effects for yields ($\delta_c$) and high degree days ($\gamma_c$), since the time-invariant component of each variable does not have the same distribution. We will therefore generate two distinct fixed effect variables. Regardless, they are partialled out by the introduction of fixed effects.
More precisely, let’s set, mostly based on our previous analysis exploring the shape of the distributions:
$N_c$ the number of counties
$N_t$ the number of time periods
$\gamma_c \sim \Gamma(k, \theta) - k\theta$ the county fixed effects for high degree days, where $\Gamma(k, \theta)$ denotes a gamma distribution with shape $k$ and scale $\theta$. Subtracting $k\theta$, the mean of the gamma distribution, ensures that in expectation these fixed effects are equal to 0. For readability, we call $k$ and $\theta$ shape_fe_h and scale_fe_h.
$u_{ct} \sim \mathcal{N}(0, \sigma_{hgdd})$ the shared distribution component of high degree days.
$\delta_c \sim \mathcal{N}(0, \sigma_{fe\_yield})$ the county fixed effects for log yields.
Generate data
Write a generate_data function to generate data according to this DGP. Note that you will need to generate a panel. To do that, you can use the crossing function.
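As an illustration, here is a minimal sketch of such a function under the assumptions above. The parameter names (N_c, N_t, mu_logyield, sigma_logyield, sigma_fe_yield, mu_hgdd, sigma_hgdd, shape_fe_h, scale_fe_h, beta_1) anticipate the calibration discussed below, and some details (e.g. whether higher should be truncated at 0) are simplified away:

generate_data <- function(N_c, N_t,
                          mu_logyield, sigma_logyield, sigma_fe_yield,
                          mu_hgdd, sigma_hgdd, shape_fe_h, scale_fe_h,
                          beta_1) {
  # One fixed effect of each type per county
  counties <- tibble(
    fips = 1:N_c,
    fe_yield = rnorm(N_c, 0, sigma_fe_yield),
    fe_higher = rgamma(N_c, shape = shape_fe_h, scale = scale_fe_h) -
      shape_fe_h * scale_fe_h #center the gamma fixed effects on 0
  )

  # Build the county x year panel and draw the time-varying components
  crossing(fips = 1:N_c, year = 1:N_t) |>
    left_join(counties, by = "fips") |>
    mutate(
      higher = mu_hgdd + fe_higher + rnorm(n(), 0, sigma_hgdd),
      logcornyield = mu_logyield + fe_yield + beta_1 * higher +
        rnorm(n(), 0, sigma_logyield)
    )
}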
Calibration and baseline parameters’ values
Based on the data exploration, explain which baseline parameters you would choose for the following variables:
Number of observations: Deschênes and Greenstone (2007), Schlenker and Roberts (2009) and Burke and Emerick (2016) (henceforth DG, SR and BE respectively) all use data for counties to the east of the 100th meridian, summing up to about 2500 counties. Let's therefore set N_c = 2500. They however consider quite different numbers of years: 4, 55 and 24 respectively. Let's choose an intermediate value of N_t = 25.
Distribution of the variables: From the analysis above, we set mu_hgdd, sigma_hgdd, mu_logyield, sigma_logyield, shape_fe_h, scale_fe_h and sigma_fe_yield. Note that I will vary sigma_fe_yield and scale_fe_h around these values to vary the share of the variance of logcornyield and higher that is location dependent. The variance of the gamma fixed effects is given by shape_fe_h*scale_fe_h^2.
Treatment effect: here, we want to evaluate whether the design of the study would be good enough to detect an effect that is half the size of the obtained estimate. Based on the previous analysis and replication, we can thus set beta_1 to -0.005/2 = -0.0025. Note that we can also consider other hypothetical true effect sizes (that are consistent with what is found in the literature).
We can now quickly explore the data we generated. Verify that the distributions are correct, i.e., that the calibration has been done properly. Plot the distributions of the variables and compute their means and standard deviations.
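A possible sketch, reusing the generate_data() function drafted above and baseline values read off the distribution fits in the previous sections:

baseline_params <- list(
  N_c = 2500, N_t = 25,
  mu_logyield = 4.55, sigma_logyield = 0.2, sigma_fe_yield = 0.26,
  mu_hgdd = 65, sigma_hgdd = 18,
  shape_fe_h = 1.3, scale_fe_h = 50,
  beta_1 = -0.0025
)

sim_data <- do.call(generate_data, baseline_params)

# Note: with this DGP the mean of logcornyield is shifted by beta_1 * mean(higher)
sim_data |>
  summarise(across(c(logcornyield, higher), list(mean = mean, sd = sd)))

sim_data |>
  ggplot(aes(x = logcornyield)) +
  geom_density()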
Model estimation
Which model would you estimate?
Write a function to run the estimation.
Then, write a function that generates a dataset and runs the estimation on this data set. Do you recover the effect of interest?
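A possible sketch for these two functions, assuming the generate_data() function drafted above (broom is loaded at the top of this document and has a tidy method for fixest objects):

run_estim <- function(data) {
  feols(logcornyield ~ higher | fips, data = data) |>
    broom::tidy() |>
    filter(term == "higher")
}

run_sim <- function(...) {
  generate_data(...) |>
    run_estim()
}

# For instance, with the baseline parameters defined above:
# do.call(run_sim, baseline_params)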
All simulations
Repeat the analysis several times, without varying the parameters. Do you retrieve the true effect you put in the data?
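One way to do so, reusing the baseline_params list and run_sim() sketch from above (the number of iterations is arbitrary):

n_iter <- 100

sims <- map_dfr(
  1:n_iter,
  \(i) do.call(run_sim, baseline_params) |> mutate(sim_id = i)
)

sims |>
  summarise(
    mean_estimate = mean(estimate),
    true_effect = baseline_params$beta_1,
    power = mean(p.value < 0.05)
  )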
If not, explore why and fix the issue.
Complexify the DGP
Then, complexify the DGP by adding some correlation but do not change the econometric model. The goal is to explore what happens if the econometric model does not represent the DGP.
Footnotes
BE presents it as “the amount of time a crop is exposed to temperatures between a given lower and upper bound”. Schlenker and Roberts (2009) gives an example: for “bounds of 8°C and 32°C […] a day of 9°C contributes 1 degree day, a day of 10°C contributes 2 degree days, up to a temperature of 32°C, which contributes 24 degree days. All temperatures above 32°C also contribute 24 degree days. Degree days are then summed over the entire season.”↩︎