The impact of extra lessons on students’ grades: a statistical power analysis
September 11, 2024
Imagine we want to measure the impact of receiving extra lessons on students’ grades. We will implement a simple randomized experiment in a set of classes and schools. We want to make sure that our design will enable us to detect the effect if there is one: how should we design our study to do so?
Setting
Let’s assume that we implement a very simple RCT:
We consider $n$ students indexed by $i \in \{1, \dots, n\}$,
We randomly allocate a proportion $p_{treat}$ of the students to the treatment. These students will receive additional lessons,
We give a test to all the students and estimate the effect of the treatment on grades.
Note
There are actually two approaches to estimate the effect of the treatment:
By computing the difference between the average of the grades of the treatment group (students who took additional classes) and of the grades of the control group (students who did not), or
With a simple linear regression of the grades on the treatment status.
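These two approaches are actually equivalent here: with a single binary regressor, the OLS slope is exactly the difference in group means. This is not spelled out in the text, but the standard identity is:

$$\hat{\beta} = \frac{\widehat{\text{Cov}}(grade_i, treat_i)}{\widehat{\text{Var}}(treat_i)} = \overline{grade}_{treat=1} - \overline{grade}_{treat=0}$$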
Before implementing the experiment, we want to make sure that we will have a high probability of actually detecting an effect if there is one. That is to say, we want to be able to reject the null of no effect when it is actually false: we want adequate statistical power. A conventional (but somewhat arbitrary) target threshold for adequate statistical power is 80%¹.
We, as researchers, can choose the value of two parameters: the sample size and the proportion of treated observations. To pick them such that our design has enough power, we could do a simple power calculation by hand. However, we will see that simulations can actually be much simpler when the data generating process is complex. This analysis will also help us identify the parameters driving power and easily play around with the values given to the parameters.
Objective of the simulation
Choose how many students to include in the sample and the proportion of treated students such that we have adequate statistical power.
We will first generate one data set, estimate the parameter of interest (the average treatment effect, or ATE) and then repeat this analysis. This will give us a set of estimates, and we will in particular be able to compute the proportion of them that are statistically significant. Then, we will be able to vary the values of our design parameters of interest (sample size and proportion of treated observations) and determine which values to pick to detect the effect.
Generating a Data Set
Code to load the R packages used in this doc
```r
library(tidyverse)
library(knitr)
library(kableExtra)
library(broom)
library(beepr)
library(here)

set.seed(35) # fix the starting point of the random numbers generator

# The code below is to use my own package for ggplot themes
# You can remove it. Otherwise you will need to install the package via GitHub
# using the following line of code
# devtools::install_github("vincentbagilet/mediocrethemes")
library(mediocrethemes)
set_mediocre_all(pal = "portal") # set theme for all the subsequent plots

# For exercises
library(webexercises)
```
We first assume that the Data Generating Process (DGP) for student $i$ is very simple and as follows:

$$grade_i = \alpha_0 + \beta_0 \, treat_i + u_i$$
Warning
There are some implicit assumptions behind this DGP, for instance linearity and homogeneity of the treatment effect.
Distribution of the variables
The goal is to generate a data set. To do so, we need to generate $treat$, $u$ and $grade$ (vectors of length $n$). $grade$ will be given by the DGP equation above. For the other two variables, we need to assume their distribution.
Since the treatment (0 or 1) is allocated at random, we can assume that $treat_i \sim \text{Bernoulli}(p_{treat})$.
In the real world, grades are often approximately normally distributed. We can thus assume that $u_i \sim \mathcal{N}(0, \sigma_u^2)$. $\sigma_u$ will represent the standard deviation of the grades (absent the treatment); $\alpha_0$ will represent their mean.
Setting the parameters
To be able to generate the data, we now need to set values for the parameters ($n$, $p_{treat}$, $\alpha_0$, $\beta_0$, and $\sigma_u$). We can distinguish between parameters on which the researcher has a hand and those on which they do not. They can choose how many participants to include in their sample ($n$) and the proportion of treated observations ($p_{treat}$) but cannot affect the rest.
These other parameters are generally fixed for the researcher. Here, since we are generating the data, we have a hand on them. We will therefore pick them such that the data generated is realistic and the distribution of grades resembles an actual distribution of grades.
```r
n <- 100
alpha_0 <- 10
beta_0 <- 1
sigma_u <- 2
p_treat <- 0.5
```
Generating the data
The data generating process we defined enables us to generate the variables.
Code
```r
u <- rnorm(n, 0, sigma_u)
treat <- rbernoulli(n, p = p_treat)
grade <- alpha_0 + beta_0*treat + u
```
Note that, since we know the true data generating process, we can compute the potential outcomes for each individual: the grade they would have gotten if they had taken the treatment and the one they would have gotten if they had not. This is of course impossible to observe in an actual study; for each student, we only observe one of the two potential outcomes.
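The rendered document then displays gen_data in a table. The chunk assembling it is not shown in this rendering; a minimal sketch of what it could look like (the potential-outcome column names grade_0 and grade_1 are assumptions):

```r
# Assemble the generated variables into a data frame.
# grade_0 and grade_1 are the potential outcomes without and with treatment;
# grade is the outcome we actually observe.
gen_data <- tibble(
  treat = treat,
  grade = grade,
  grade_0 = alpha_0 + u,
  grade_1 = alpha_0 + beta_0 + u
)
```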
kable and scroll_box come from other packages. For simplicity and clarity, we loaded those packages at the beginning of the document.
Important
In an actual study, we would only observe the first two columns: the treatment status of each individual and their actual grade.
Plotting the data
We can then plot the data to check that the distribution of grades is similar to what we wanted.
Code
```r
gen_data |>
  ggplot(aes(x = grade)) +
  geom_density() +
  geom_vline(xintercept = mean(gen_data$grade), size = 1) +
  labs(
    title = "Distribution of the observed grades in the sample",
    x = "Grade",
    y = "Density"
  )
```
Similarly, we compute the mean and standard deviation of our variable.
This distribution seems realistic: we chose realistic parameter values.
Warning
Note that the mean of our grade distribution depends on the treatment effect size and the proportion of treated: $\alpha_0$ sets the mean of $grade_i(0)$, the grades in the absence of treatment, not of the observed grades (which are a mixture of $grade_i(0)$ and $grade_i(1)$ and thus have mean $\alpha_0 + p_{treat}\beta_0$).
Now that we have generated a data set, we can estimate the effect of the treatment in this data set.
Estimating the effect of the treatment
We now want to regress grade on treat, estimating the regression model $grade_i = \alpha + \beta \, treat_i + \epsilon_i$. Note that our regression model represents the actual DGP here. In reality, in economics, the model basically never represents the DGP, which adds a layer of complexity.
```r
reg <- lm(grade ~ treat, data = gen_data)
```
We then print the output of the regression, focusing only on the values of interest: the point estimate of $\beta$ and its standard error.
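The chunk extracting these values is not shown in this rendering; reg_interest, used below, could be built along these lines (a sketch using broom; the pattern matched in filter() assumes the coefficient name contains "treat"):

```r
# Tidy the regression output and keep only the treatment coefficient
reg_interest <- tidy(reg, conf.int = TRUE) |>
  filter(str_detect(term, "treat"))
```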
The estimate is not significant: we cannot reject the null of no effect (even though we know that there is indeed an effect: we put it in the data ourselves).
Comparison to the true effect
The whole game here is to check whether the estimate we obtained is close to the true effect of the treatment and whether it is significant.
What is the true effect of the treatment?
At the population level, we set this true effect to be equal to $\beta_0 = 1$. However, the true effect in the sample may differ from the true effect in the population. We can compute the true effect in the sample by averaging the individual treatment effects, ie the differences between the potential outcomes. Here, since the treatment effect is homogeneous, all individual treatment effects are identical and equal to the population average treatment effect. We can therefore compute the “bias” of our estimate: the difference between the estimate and the true treatment effect.
```r
bias <- reg_interest$estimate - beta_0
```
The bias is 0.29, ie 29% of the true effect. This is quite a large bias.
So far, we have generated one data set and estimated the average treatment effect in this data set. Yet, there is some uncertainty associated with this estimate: the associated standard error is 0.41. This means that, with a different data set, we might have obtained a different point estimate. We will now explore this question: what would have happened if we had another sample?
Repeat
For loop
To know what would have happened if we had another sample, we can simply rerun the analysis, recompiling our code several times. This will generate a new data set and re-estimate the ATE on this different data set. However, this approach of recompiling everything by hand is quite cumbersome. Instead, we will write a for loop to automatically repeat the analysis (and store the results).
Let’s start again and act as if we had not written anything yet.
Tip
To write a for loop, first write what is inside the loop, make sure it runs and then put it into a loop.
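The loop itself does not appear in this rendering. A minimal sketch of what it could look like, assuming res stacks the tidied treatment coefficients together with a simulation id (as in the table below):

```r
n_iter <- 100
res_list <- vector("list", n_iter) # pre-allocate storage

for (i in 1:n_iter) {
  # generate a new data set from the same DGP
  u <- rnorm(n, 0, sigma_u)
  treat <- rbernoulli(n, p = p_treat)
  gen_data <- tibble(treat = treat, grade = alpha_0 + beta_0*treat + u)

  # estimate the ATE and store the treatment coefficient
  res_list[[i]] <- lm(grade ~ treat, data = gen_data) |>
    tidy(conf.int = TRUE) |>
    filter(str_detect(term, "treat")) |>
    select(-term) |>
    mutate(sim_id = i)
}

res <- list_rbind(res_list)
```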
```r
res |>
  kable() |>
  scroll_box(width = "100%", height = "200px")
```
| estimate | std.error | statistic | p.value | conf.low | conf.high | sim_id |
|----------|-----------|-----------|---------|----------|-----------|--------|
| 0.8725538 | 0.4083589 | 2.1367326 | 0.0351095 | 0.0621788 | 1.6829288 | 100 |
| 1.0451750 | 0.4080325 | 2.5614994 | 0.0119463 | 0.2354478 | 1.8549022 | 99 |
| 1.1312827 | 0.4537682 | 2.4930852 | 0.0143398 | 0.2307945 | 2.0317709 | 98 |
| 1.1421168 | 0.3995833 | 2.8582695 | 0.0052035 | 0.3491567 | 1.9350769 | 97 |
| 0.6471430 | 0.4526530 | 1.4296670 | 0.1559929 | -0.2511321 | 1.5454180 | 96 |
| 0.2971704 | 0.3841368 | 0.7736058 | 0.4410261 | -0.4651365 | 1.0594774 | 95 |
| … | … | … | … | … | … | … |

(100 rows in total, displayed in a scrollable box in the original document.)
We get n_iter estimates of our effect of interest, estimated on n_iter different data sets (generated from the same DGP). Let’s plot some of these estimates:
Code
```r
res |>
  arrange(sim_id) |>
  slice(1:30) |>
  ggplot(aes(x = sim_id, y = estimate)) +
  geom_point() +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  geom_hline(yintercept = beta_0) +
  geom_hline(yintercept = 0, linetype = "solid") +
  labs(
    title = "Estimates of the ATE",
    subtitle = paste("Computed on 30 different data sets"),
    x = "Simulation id",
    y = "Estimate"
  )
```
Some are close to the true effect, some are further away. For a clearer summary of the distribution of estimates, we can plot it directly:
Code
```r
res |>
  ggplot(aes(x = estimate)) +
  geom_dotplot() +
  geom_vline(aes(xintercept = mean(res$estimate)), size = 1.2) +
  labs(
    title = paste("Distribution of", n_iter, "estimates of the ATE"),
    subtitle = paste("Computed on", n_iter, "different data sets"),
    x = "Estimate",
    y = NULL,
    caption = "Each dot represents one estimate. The dotted line represents the mean of the estimates"
  ) +
  theme(panel.grid.major.y = element_blank(), axis.text.y = element_blank())
```
We also notice that the average of the estimates is rather close to the true effect: our estimator seems to be unbiased (in the sense that $\mathbb{E}[\hat{\beta}] = \beta_0$).
However, if we implement our experiment with these chosen parameters ($n = 100$ and $p_{treat} = 0.5$), assuming that the DGP is as described, it is quite likely that the estimate we would get would be rather far from the true effect.
Note
We know that this estimator should be normally distributed. This is not clear from the graph since we only have a limited number of iterations. We will, however, double-check this later in this document.
Statistical power of the analysis
Let’s now explore our initial question: would we have enough statistical power to detect the true effect?
Which simulations would yield a significant estimate?
Code
```r
res |>
  kable() |>
  column_spec(
    column = 2,
    color = "white",
    background = spec_color(
      res$p.value < 0.05,
      pal = c("#C51E3A", "#009E60")
    )
  ) |>
  scroll_box(height = "250px")
```
(Same 100-row table as above, with cells shaded green when p.value < 0.05 and red otherwise.)
Let’s compute the proportion of significant results at the 5% level:
```r
power <- mean(res$p.value < 0.05)
```
The statistical power of this design would be 64%. In 36% of cases, we would not be able to reject the null of no effect.
Why is that problematic? Since we wrote the DGP, we know the true effect of the treatment on grades: the treatment increases grades by 1 point. If we invest resources into actually implementing this experiment, and if the DGP is close to what we simulated, in 36% of cases we would not be able to state that the average treatment effect is non-null, even though it is actually equal to 1.
Functions
Now that we have computed the statistical power for this set of parameters, we may wonder how it varies with the values of the parameters. We therefore want to re-run the analysis for another set of parameters. We could rerun all this code, only changing the parameter values, but this approach would not be very reproducible. Instead, we will write a function that takes as inputs the parameters we want to vary and runs one simulation. We will then be able to call this function as many times as we want, varying the values of the parameters.
A good practice in R programming is “one action, one function” and vice versa. We will therefore create several functions and then wrap it all within one overall function (compute_sim()): we first generate_data() and then run_estim(). I chose short names that clearly describe what the functions do.
Important
The simulation workflow we will use throughout the class is as follows:
Define a set of baseline parameters and store them in a data frame,
Generate data for a given set of parameters,
Run the estimation and extract the outcomes of interest,
Compute one simulation by combining the previous steps,
Replicate the process for a number of iterations and a set of parameters.
generate_data()
Let’s first write a function to generate data and run it on our set of parameters:
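The function definition is not included in this rendering; a minimal sketch consistent with the DGP defined above:

```r
# Generate one data set from the DGP: treat ~ Bernoulli(p_treat),
# u ~ Normal(0, sigma_u), grade = alpha_0 + beta_0*treat + u
generate_data <- function(n, p_treat, alpha_0, beta_0, sigma_u) {
  tibble(
    treat = rbernoulli(n, p = p_treat),
    u = rnorm(n, 0, sigma_u)
  ) |>
    mutate(grade = alpha_0 + beta_0*treat + u)
}
```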
We will only be able to easily vary the parameters that are inputs of the function. If we had wanted to change other aspects, such as the functional form of the DGP (introducing heterogeneous effects for instance), we would have needed to add parameters to generate_data() enabling us to do so.
To pass the values of the parameters to the function, we can store them in a data frame and use the function purrr::pmap(). You will see that this approach comes in handy. Let’s therefore create a tibble to store our baseline parameters. Note that pmap() returns a list; to turn it into a tibble, we need to use the function list_rbind().
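The chunk creating this tibble is not shown in this rendering; based on the parameter values used so far and the parameter table displayed further below, it is presumably:

```r
baseline_param <- tibble(
  n = 100,
  p_treat = 0.5,
  alpha_0 = 10,
  beta_0 = 1,
  sigma_u = 2
)
```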
We can then write a function that takes as input a tibble generated by generate_data(), runs the estimation and selects the information of interest. Here, since our model is very simple, the estimation is also very simple.
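Again, the definition of run_estim() does not appear in this rendering; a sketch of what it could look like (the filtering assumes the coefficient name contains "treat"):

```r
# Regress grade on treat and keep only the treatment coefficient
run_estim <- function(data) {
  lm(grade ~ treat, data = data) |>
    tidy(conf.int = TRUE) |>
    filter(str_detect(term, "treat")) |>
    select(-term)
}
```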
```r
baseline_param |>
  purrr::pmap(.f = generate_data) |>
  list_rbind() |> # pmap returns a list, turn it into a data frame
  run_estim() |>
  kable()
```
| estimate | std.error | statistic | p.value | conf.low | conf.high |
|----------|-----------|-----------|---------|----------|-----------|
| 0.0987152 | 0.3982716 | 0.247859 | 0.8047621 | -0.6916417 | 0.8890722 |
compute_sim()
We then combine generate_data() and run_estim() into a function that computes one iteration of the simulation. We can also add a line of code to store the parameters used in this simulation.
```r
compute_sim <- function(...) {
  generate_data(...) %>%
    run_estim() %>%
    cbind(as_tibble(list(...))) |> # add parameters used for generation
    as_tibble()
}
```
Taking the baseline parameters as inputs, it returns this data frame:
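The call producing that data frame is not shown; it is presumably something like:

```r
baseline_param |>
  pmap(compute_sim) |>
  list_rbind() |>
  kable()
```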
We now have a function that runs one simulation for a given set of parameters.
Replication of the process
To run several iterations of the simulation, we can build a data frame of parameters such that each row contains the values of the parameters for one single iteration of the simulation.
If we want to run n_iter iterations of the simulation for the baseline parameter values, we only have to replicate the one-row tibble baseline_param using the function uncount() for instance.
```r
n_iter <- 100

param <- baseline_param |>
  uncount(n_iter) # replicates the rows in the dataset

param |>
  slice(1:10) |>
  kable()
```
| n | p_treat | alpha_0 | beta_0 | sigma_u |
|-----|---------|---------|--------|---------|
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
| 100 | 0.5 | 10 | 1 | 2 |
Then, we can pass the resulting set of parameter values to compute_sim():
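The corresponding chunk is not shown in this rendering; it would look something like this (the result name sim_res is an assumption):

```r
sim_res <- param |>
  pmap(compute_sim) |>
  list_rbind()
```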
Using this method, we can also easily run a series of simulations for several parameter values.
Varying parameters
Remember, the initial goal was to find the right sample size for our RCT: which sample size should we choose to have enough statistical power? To answer this, we can vary the sample size in our simulation and see at which sample size the power becomes larger than a given value.
Varying the sample size
To run simulations for several values of $n$, we can modify the param dataframe such that, when passed to compute_sim(), it will implement the analysis for several values of $n$. To do so, we use the function crossing():
```r
n_iter <- 500
vect_n <- seq(from = 50, to = 300, by = 50)

param_var_n <- baseline_param |>
  select(-n) |>
  crossing(n = vect_n) |>
  uncount(n_iter)
```
This new data set contains 6 (length of vect_n) times n_iter rows. For each value of vect_n, the parameters are repeated n_iter times:
To choose the range of values for the sample size, we first proceed somewhat arbitrarily, specifying a wide but coarse grid of values (for instance vect_n <- seq(10, 1000, by = 100)), and then narrow it down and refine it depending on the simulation results.
Tip
When exploring which parameters values you want to consider, use a small number of iterations. The results will be imprecise and vary a lot but will be quick to compute.
We can now compute the simulation for all parameter values and store the results in a data frame.
When the simulations take a long time to run, we use the .progress parameter of the pmap() function. In addition, we use the beepr::beep() function to play a sound informing us that the simulation is done.
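The chunk running these simulations is not included in this rendering; a sketch of what it could be (the result name sim_res_var_n is an assumption):

```r
sim_res_var_n <- param_var_n |>
  pmap(compute_sim, .progress = TRUE) |>
  list_rbind()

beep() # play a sound once the simulation is done
```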
We can now compute the statistical power, ie the proportion of significant estimates. We also compute the standard error of this proportion; simulated power estimates are themselves noisy and should not be taken at face value.
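A sketch of this computation (the name sim_power_var_n matches the plotting code below; the standard error formula is the usual one for a proportion):

```r
sim_power_var_n <- sim_res_var_n |>
  group_by(n) |>
  summarise(
    power = mean(p.value < 0.05),
    se_power = sqrt(power*(1 - power)/n()) # standard error of a proportion
  ) |>
  ungroup()
```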
```r
sim_power_var_n |>
  ggplot(aes(x = n, y = power)) +
  geom_point() +
  geom_line() +
  labs(
    title = "Statistical Power as Function of Sample Size",
    subtitle = paste("Computed with", n_iter, "simulations"),
    x = "Number of students",
    y = "Statistical Power"
  )
```
We notice that power is an increasing function of the number of observations. This result is well known and thus not surprising. Power indeed depends on the precision (ie the variance) of the estimator: the more precise the estimate, the larger the power. Now, the variance of this univariate OLS estimator is given by $V(\hat{\beta}) = \frac{\sigma_u^2}{n \, p_{treat}(1 - p_{treat})}$ and is therefore a strictly decreasing function of the number of observations.
To reach the usual (but arbitrary) 80% power threshold, the graph indicates that we would need about 150 observations. We have answered part of the question of interest.
Warning
This result only holds conditional on the true DGP being the one we simulated. This is very unlikely; the true DGP is certainly more complex.
Varying the proportion of treated
We could also vary other parameters, such as the proportion of treated. There are two approaches to do that: modify it ceteris paribus, holding the sample size constant, or vary both at the same time. We will start with the latter; to get the former, simply keep the results for a single sample size.
We define a new set of parameters and run the simulations for this set of parameters.
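This chunk is not displayed in the rendering; the parameter set could be built along these lines (the grid of p_treat values is an assumption):

```r
param_var_n_ptreat <- baseline_param |>
  select(-n, -p_treat) |>
  crossing(n = vect_n, p_treat = seq(0.1, 0.9, by = 0.1)) |>
  uncount(n_iter)
```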
Warning
If we consider a proportion of treated that is too small, chances are that in some of the data sets, the number of treated (or non-treated) students will be 0, preventing the estimation of the effect of interest.
This simulation takes a little while to run (about 40s). To avoid running it over and over, in particular when compiling the Quarto document, we save the results of the simulation and do not evaluate the code of this chunk.
Code
```r
sim_res_var_n_ptreat <-
  readRDS(here("content", "sim", "data", "sim_res_var_n_ptreat.RDS"))

sim_res_var_n_ptreat |>
  group_by(n, p_treat) |>
  summarise(power = mean(p.value < 0.05)) |>
  ungroup() |>
  ggplot(aes(x = n, y = p_treat, fill = power)) +
  geom_tile(color = "white", linewidth = 1) +
  scale_mediocre_c(pal = "portal", gradient = "left") +
  labs(
    title = "Given a sample size power is max when half of the sample is treated",
    subtitle = "Power, number of observations and proportion of treated",
    x = "Number of students",
    y = "Proportion of treated students",
    fill = "Statistical Power"
  )
```
Statistical power is (unsurprisingly) maximized for a proportion of treated of 0.5. This might however be clearer with another graph:
Code
```r
sim_res_var_n_ptreat |>
  group_by(n, p_treat) |>
  summarise(power = mean(p.value < 0.05)) |>
  ungroup() |>
  ggplot(aes(x = p_treat, y = power, color = factor(n))) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Given a sample size power is max when half of the sample is treated",
    subtitle = "Power, number of observations and proportion of treated",
    x = "Proportion of treated students",
    y = "Statistical Power",
    color = "Number of students"
  )
```
Power and effect size
The statistical power we computed relied on a very important hypothesis: the effect size. If the effect is in fact smaller than what we hypothesized, we would expect statistical power to be smaller. Simulations confirm this:
```r
sim_res_var_n_beta_0 <-
  readRDS(here("content", "sim", "data", "sim_res_var_n_beta_0.RDS"))

sim_res_var_n_beta_0 |>
  group_by(n, beta_0) |>
  summarise(power = mean(p.value < 0.05)) |>
  ungroup() |>
  ggplot(aes(x = n, y = beta_0, fill = power)) +
  geom_tile(color = "white", linewidth = 1) +
  scale_mediocre_c(pal = "portal", gradient = "left") +
  labs(
    title = "Power increases with the hypothesized effect size",
    subtitle = "Power, number of observations and true effect size",
    x = "Number of students",
    y = "True effect size (beta_0)",
    fill = "Statistical Power"
  )
```
Hypotheses regarding the true effect size thus matter a great deal for the resulting statistical power.
Complexifying the DGP
As discussed above, our whole analysis relied on accurately representing the true DGP. However, it is very likely that the actual DGP differs from the one we modeled. For instance, effects are very likely heterogeneous across individuals.
Note
We are often interested in estimating a version of an ATE (Average Treatment Effect). The wording itself (“average”) implies that effects are heterogeneous across individuals.
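For instance, we could complexify generate_data() by drawing an individual effect for each student. A minimal sketch, where sigma_beta, the spread of the individual effects, is an assumed new parameter:

```r
# DGP with heterogeneous treatment effects: each student i has their own
# effect beta_i, drawn around the average effect beta_0
generate_data_hetero <- function(n, p_treat, alpha_0, beta_0, sigma_u, sigma_beta) {
  tibble(
    treat = rbernoulli(n, p = p_treat),
    u = rnorm(n, 0, sigma_u),
    beta_i = rnorm(n, beta_0, sigma_beta) # individual treatment effects
  ) |>
    mutate(grade = alpha_0 + beta_i*treat + u)
}
```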
Footnotes
The empirical design implies an 80% probability of getting a significant result.↩︎