Heaping-induced bias in regression-discontinuity designs.
Barreca, Alan I.; Lindo, Jason M.; Waddell, Glen R.
I. INTRODUCTION
Empirical researchers have witnessed a resurgence in the use of
regression-discontinuity (RD) designs since the late 1990s. This
approach to estimating causal effects is often characterized as superior
to all other non-experimental identification strategies (Cook 2008; Lee
and Lemieux 2010) as RD designs usually entail perfect knowledge of the
selection process and require comparatively weak assumptions (Hahn,
Todd, and van der Klaauw 2001; Imbens and Lemieux 2008; Lee 2008). This
view is supported by several studies that have shown that RD designs and
experimental studies produce similar estimates. (1) RD designs also
offer appealing intuition--so long as characteristics related to
outcomes are smooth around the treatment threshold, we can reasonably
attribute differences in outcomes across the threshold to the treatment.
In this paper, we discuss the appropriateness of this "smoothness
assumption" in the presence of heaping.
For a wide variety of reasons, heaping is common in many types of
data. For example, we often observe heaping when data are self-reported
(e.g., income, age, and height), when tools with limited precision are
used for measurement (e.g., birth weight and pollution), and when
continuous data are rounded or otherwise discretized (e.g., letter
grades and grade point averages). Heaping also occurs as a matter of
practice, such as with work hours (e.g., 8 hours per day, 40 hours per
week) and retirement ages (e.g., 62 and 65). In this paper, we show how
ignoring heaping can have serious consequences. In particular, in RD
designs, estimates are likely to be biased if attributes related to the
outcomes of interest predict heaping in the running variable.
While our earlier work (Barreca et al. 2011) identified one case in
which heaping led to biased estimates in a RD design, it left several
important gaps in the literature. Most crucially, it did not discuss the
different ways in which non-random heaping can lead to bias, approaches
to diagnosing non-random heaping, or how to correct for non-random
heaping. In this paper, we address the following four questions. First,
how well does the usual battery of diagnostic tests perform in
identifying non-random heaping? (It depends.) Second, are there
supplementary diagnostic tests that might be better suited to
identifying this type of problem? (We offer two.) Third, do we need to
worry about data heaps that are far away from the treatment threshold
but within the bandwidth? (Yes, we do.) Fourth, once the problem has
been diagnosed, what should a practitioner do? (Although our earlier
work may have left the impression that a researcher should drop
observations at data heaps from the analysis, this is not necessarily
the best solution--we consider alternative approaches that may be more
appropriate depending on the circumstances.)
We illustrate the general issue with a series of simulation
exercises that consider estimating the most common of sharp-RD models,
(1) \(Y_i = \alpha + \beta\,\mathbf{1}(R_i \ge c) + \theta R_i + \psi R_i \mathbf{1}(R_i \ge c) + \varepsilon_i,\)
where \(R_i\) is the running variable, observations with \(R_i \ge c\) are treated, and \(\varepsilon_i\) is a random error term. As usual, this model measures the local average treatment effect by considering the difference in the estimated conditional expectations of \(Y_i\) on each side of the treatment threshold,
(2) \(\lim_{r \downarrow c} E[Y_i \mid R_i = r] - \lim_{r \uparrow c} E[Y_i \mid R_i = r].\)
Our primary simulation exercise supposes that the treatment cutoff
occurs at zero, that there is no treatment effect, and that the running
variable is unrelated to the outcome. We introduce non-random heaping by
having the expected value of [Y.sub.i] vary across two data types,
heaped and non-heaped, where heaped types have [R.sub.i] randomly drawn
from {-100, -90, ..., 90, 100} and non-heaped types have [R.sub.i]
randomly drawn from {-100, -99, ..., 99, 100}. As such, we have a simple
data-generating process (DGP) in which an attribute that predicts
heaping in the running variable (type) is also related to outcomes. In
this stripped-down example, we show that estimating Equation (1) will
arrive at biased estimates and that the usual diagnostics may fail to
identify this type of problem. Furthermore, we show that non-random
heaping introduces bias even if a data heap does not fall near the
treatment threshold. On a brighter note, we offer alternative approaches
to considering the underlying data and to accommodating non-random
heaping.
To explore how non-random heaping can impair estimation in settings
beyond our simple DGP, we examine several alternative DGPs which allow
the heaped and non-heaped types to have different means, different
slopes, and different treatment effects. We consider several approaches
to addressing the bias in each DGP and come to the following
conclusions:
1. Omitting observations at data heaps from the analysis leads to
unbiased estimates of the treatment effect for non-heaped types.
2. Keeping only those observations at data heaps leads to unbiased
estimates of the average treatment effect across the two types weighted
by the share of non-heaped data that are observed at data heaps.
However, this approach: (a) limits the extent to which one can shrink
the bandwidth; (b) cannot be implemented when there are relatively few
heaps within reasonable bandwidths; and (c) may lead to problems of
inference associated with having too few clusters. When it can be
reasonably implemented, the resulting estimate can be combined with the
estimate for non-heaped types to produce an unbiased estimate of the
average treatment effect for the population.
3. Approaches to estimating the unconditional average treatment
effect that pool the data and control flexibly for data observed at heap
points reduce but do not eliminate bias.
We consider the lessons learned from our simulation exercise in the
context of three non-simulated environments. First, we consider the use
of birth weights as a running variable to highlight the efficacy of our
proposed tests for non-random heaping. We also use this example to
demonstrate the merits of alternative approaches to overcoming the bias
that non-random heaping introduces into RD designs. Second, we show that
mother's reported day of birth, previously used to estimate the
effect of maternal education on fertility and infant health, also
exhibits non-random heaping. Third, in order to further demonstrate the
pervasiveness of heaping in commonly used data, we document non-random
heaping in both income and hours worked in the Panel Study of Income
Dynamics (PSID). While clearly not an exhaustive consideration of the
existing RD literature or of data where heaping is evident, this
analysis suggests that empirical researchers should, as a matter of
practice, consider heaping as a potential threat to internal validity in
most any exercise.
II. A SIMPLE FRAMEWORK FOR HEAPING-INDUCED BIAS
Consider a situation in which neither the program administrator (who is responsible for treatment) nor the researcher can observe the true value of the running variable, \(R^*_i\). Instead, they observe:
(3) \(R_i = R^*_i(1 - K^*_i) + \Gamma(R^*_i)K^*_i,\)
where \(K^*_i\) is an unobservable variable drawn from a Bernoulli distribution with mean \(\lambda\) that indicates whether the data are heaped. The heaping function \(\Gamma\) can take the form of a rounding function (e.g., the nearest-integer, floor, or ceiling function) but may also reflect other imputation rules. For example, missing birthdays may be recorded as happening on the 15th of the month, in which case \(\Gamma\) would be a constant function. As another example, topcoding would imply that \(\Gamma(R^*_i) = R^*_i\) for \(R^*_i < T\) and \(\Gamma(R^*_i) = T\) for \(R^*_i \ge T\).
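To make Equation (3) concrete, here is a minimal sketch (our own code and function names, not from the paper) of the three heaping functions just described and of the observed running variable they generate:

```python
import numpy as np

def gamma_round(r_star, unit=10):
    # Rounding heap: record the nearest multiple of `unit`
    return unit * np.round(np.asarray(r_star, dtype=float) / unit)

def gamma_constant(r_star, value=15):
    # Imputation heap: e.g., missing birthdays recorded as the 15th
    return np.full_like(np.asarray(r_star, dtype=float), value)

def gamma_topcode(r_star, T=100):
    # Topcoding: values at or above T are recorded as T
    return np.minimum(np.asarray(r_star, dtype=float), T)

def observed_running_variable(r_star, gamma, lam=0.2, seed=0):
    # Equation (3): R_i = R*_i (1 - K*_i) + Gamma(R*_i) K*_i,
    # with K*_i ~ Bernoulli(lam) indicating heaped observations
    r_star = np.asarray(r_star, dtype=float)
    k = np.random.default_rng(seed).random(r_star.shape) < lam
    return np.where(k, gamma(r_star), r_star)
```

Any of the three heaping-function variants can be passed to `observed_running_variable`; with `gamma_round`, roughly a share `lam` of observations heap at multiples of `unit`.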
Furthermore, suppose that such heaping is related to outcomes in a linear RD setting such that
(4) \(Y_i = \alpha_0 + \beta_0\mathbf{1}(R_i \ge c) + \theta_0 R_i + \psi_0 R_i\mathbf{1}(R_i \ge c) + K^*_i[\alpha_1 + \beta_1\mathbf{1}(R_i \ge c) + \theta_1 R_i + \psi_1 R_i\mathbf{1}(R_i \ge c)] + e_i,\)
which allows for heterogeneity in outcomes (\(Y_i\)) across data types \(K^*\). Given this knowledge, a researcher would naturally estimate Equation (4) directly and recover estimates of both \(\beta_0\) and \(\beta_1\). In practice, however, a researcher who is not aware of this heterogeneity is likely to instead estimate
(5) \(Y_i = \alpha + \beta\mathbf{1}(R_i \ge c) + \theta R_i + \psi R_i\mathbf{1}(R_i \ge c) + u_i,\)
which is like Equation (1), but for the error term, \(u_i = K^*_i(\alpha_1 + \beta_1\mathbf{1}(R_i \ge c) + \theta_1 R_i + \psi_1 R_i\mathbf{1}(R_i \ge c)) + e_i,\) where Equation (3) implies \(K^*_i = (R_i - R^*_i)(\Gamma(R^*_i) - R^*_i)^{-1}\).
(2) As such, the degree to which there is bias depends on the degree to
which there is heterogeneity across data types in addition to how the
heaping function operates on the unobserved running variable \(R^*\).
(3,4)
III. SIMULATION EXERCISE
A. Baseline DGP and Results
All of our simulation exercises are based on samples of 10,000
observations, 80% of which have [R.sub.i] randomly drawn from a discrete
uniform distribution on the integers {-100, -99, ..., 99, 100} (i.e.,
non-heaped types, or those with [K.sup.*] = 0), while the remainder have
[R.sub.i] drawn from a discrete uniform distribution on the integers
{-100, -90, ..., 90, 100} (i.e., heaped types, or those with [K.sup.*] =
1). Our baseline data-generating process (DGP-1) considers the case in
which the treatment cutoff occurs at zero (c = 0), there is no treatment
effect ([[beta].sub.0] = [[beta].sub.1] = 0), the running variable is
not related to the outcome ([[theta].sub.0] = [[theta].sub.1] =
[[psi].sub.0] = [[psi].sub.1] = 0), and [[epsilon].sub.i] is drawn from
N(0,1). The only difference between heaped types and non-heaped types
(besides having [R.sub.i] drawn from different distributions) is their
mean: non-heaped types have a mean of zero and heaped types have a mean
of 0.5.
The main features of these data are depicted in Panels A and B of
Figure 1, which are based on a single simulation. Panel A, in which we
plot the distribution of the data using one-unit bins, shows the data
heaps at multiples of 10. Panel B, in which we plot mean outcomes within
the same one-unit bins with separate symbols for bins that correspond to
multiples of 10, makes it clear that the means are systematically higher
at data heaps. (5)
In Panel C of Figure 1, we plot the estimated treatment effect that
is expected from this data-generating process (derived from 1,000
simulations) using the standard RD approach (Equation 1) and bandwidths
ranging from 5 to 100. There are two main features of this figure.
First, despite the fact that there is no treatment effect, the estimated
treatment effect is always positive in expectation. Second, the bias
increases as we shrink the bandwidth. (6)
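This exercise can be replicated in miniature with the following sketch (our own code, using 100 rather than 1,000 simulations; parameter values follow DGP-1), which estimates Equation (1) at a small and a large bandwidth:

```python
import numpy as np

rng = np.random.default_rng(0)

def rd_estimate(y, r, bw, c=0):
    # Sharp-RD regression of Equation (1) within a given bandwidth
    keep = np.abs(r - c) <= bw
    y, r = y[keep], r[keep]
    d = (r >= c).astype(float)
    X = np.column_stack([np.ones_like(y), d, r, r * d])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]  # coefficient on 1(R >= c)

def simulate_dgp1(n=10_000):
    # 20% heaped types (R on multiples of 10, outcome mean 0.5);
    # 80% non-heaped types (R on all integers in [-100, 100], mean 0)
    heaped = rng.random(n) < 0.2
    r = np.where(heaped, 10 * rng.integers(-10, 11, n),
                 rng.integers(-100, 101, n))
    y = 0.5 * heaped + rng.standard_normal(n)
    return y, r

n_sims = 100
est = {bw: np.mean([rd_estimate(*simulate_dgp1(), bw) for _ in range(n_sims)])
       for bw in (10, 100)}
```

Although the true effect is zero, the average estimate is positive at both bandwidths and noticeably larger at the small one, matching the pattern described for Panel C.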
In Panel D, we plot rejection rates at the 5% level based on three
different approaches to obtaining standard error estimates--assuming
homoskedasticity (i.i.d.), allowing for heteroskedasticity, and
clustering on the running variable, as recommended by Lee and Card
(2008) to address potential model misspecification. Notably, the
rejection rates based on standard error estimates that assume
homoskedasticity and those allowing for heteroskedasticity are nearly
identical to one another, while the rejection rates based on standard
error estimates clustered on the running variable tend to be far lower.
This result is consistent with the clustering approach
"inflating" standard error estimates when the model has been
misspecified. However, the fact that this approach produces rejection
rates at the 5% level that are routinely below 5% indicates that the
standard error estimates are too large. (7)
Inference issues aside, the fact that nonrandom heaping can bias
estimated treatment effects raises several important questions that we
will address in turn. How can we identify this type of problem? Is the
problem specific to circumstances in which a non-random data heap falls
immediately to one side of a treatment threshold? What if the heaped
data have a different slope or a non-zero treatment effect? Finally, how
can we address the problem once it has been diagnosed?
B. Diagnosing the Problem
Standard RD-Validation Checks. It is well established that
practitioners should check that observable characteristics and the
distribution of the running variable are smooth around the threshold.
A lack of smoothness in either is usually taken as evidence that there
is manipulation of the running variable (McCrary 2008). Specifically, if
there is excess mass on one side of the
threshold, it suggests that individuals may be engaging in strategic
behavior in order to gain favorable treatment. In addition, if certain
"types" are disproportionately observed on one side of the
threshold, it suggests that there may be systematic differences in how
successfully different types of individuals can manipulate the running
variable in order to gain access to treatment. Any evidence of this kind
is cause for concern because composition changes across the treatment
threshold could threaten identification. Given that the problem shown in
the previous section is one of composition bias arising from points in
the distribution with excess mass, one might anticipate that existing
tools would be well suited to raising a red flag when this issue exists.
As it turns out, this may not be the case.
In Panel A of Figure 2, we show the expected outcome of tests for a
discontinuity in the distribution across the treatment threshold
following McCrary (2008). The set of discontinuity estimates, based on
bandwidths ranging from 5 to 100, shows what one would probably expect.
The estimated discontinuity is greatest when the bandwidth is small as a
small bandwidth gives greater proportional weight to the data heap that
falls immediately to the right of the cutoff. Panel B, which shows
rejection rates, indicates that this approach reliably rejects zero at
all of the considered bandwidths. That said, it is important to note
that the usefulness of this diagnostic depends on how the heaps are
distributed relative to the threshold and what fraction of the data are
heaped, in addition to the set of bandwidths one can reasonably
consider. (8) In particular, this test yields lower rejection rates when
there is not a heap immediately to one side of the threshold and when a
smaller share of the data are heaped.
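For a quick numerical check, the sketch below is a deliberately crude stand-in for McCrary's local-linear density estimator (our own simplification and function names): it fits separate straight lines to the per-integer frequencies on each side of the cutoff and compares them at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(1)

def density_jump(r, bw=20, c=0):
    # Crude stand-in for McCrary's (2008) test: compare two linear fits
    # to per-integer frequencies, evaluated at the cutoff
    vals = np.arange(c - bw, c + bw + 1)
    counts = np.array([(r == v).sum() for v in vals])
    left, right = vals < c, vals >= c
    fit_l = np.polyfit(vals[left], counts[left], 1)   # [slope, intercept]
    fit_r = np.polyfit(vals[right], counts[right], 1)
    return np.polyval(fit_r, c) - np.polyval(fit_l, c)

def draw_running_variable(n=10_000):
    # Baseline DGP: 20% heaped types at multiples of 10
    heaped = rng.random(n) < 0.2
    return np.where(heaped, 10 * rng.integers(-10, 11, n),
                    rng.integers(-100, 101, n))

jump = np.mean([density_jump(draw_running_variable()) for _ in range(20)])
```

Under the baseline DGP, the heap at zero sits just inside the treated side, so the estimated density jump is positive, in line with the McCrary-style results in Panel A of Figure 2.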
In Panels C and D of Figure 2, we investigate the extent to which
we might be able to diagnose the problem by testing whether observable
characteristics are smooth through the threshold. In particular, we
consider discontinuities in a proxy variable that is equal to one for
heaped types and zero for non-heaped types plus a standard normal error
term. (9) As in the simulations testing for discontinuities in the
distribution, these results show that the estimated discontinuities in
this proxy variable are largest for small bandwidths. Moreover, with
standard error estimates that assume i.i.d. errors or allow for
heteroskedasticity, the rejection rates tend to be high. As such, like
the test for a discontinuity in the distribution, a test for covariate
balance certainly could fail as a result of non-random heaping. However,
the usefulness of this diagnostic in identifying problems that result
from non-random heaping will again depend on how the heaps are
distributed relative to the threshold, what fraction of the data are
heaped, and the set of bandwidths one can reasonably consider in
addition to the strength of the proxy variable available to the
researcher.
Supplementary Approaches to Identifying Nonrandom Heaping. Although
the conventional specification checks are well suited to diagnosing
strategic manipulation of the running variable, the results above
demonstrate that additional diagnostics may be required to identify
nonrandom heaping. In this section, we introduce two such diagnostics.
First, while researchers are inclined to produce mean plots as
standard practice, the level of aggregation is typically such that
non-random heaping can be hidden. With aggregation, researchers may
mistake heap points for noise rather than systematic outliers. As a
simple remedy, however, one can show disaggregated mean plots that
clearly distinguish heaped data from non-heaped data, as in Panel B of
Figure 1. (10) Of course, a necessary prerequisite to this approach is
knowledge of where the heaps fall, which can be revealed through a
disaggregated histogram, as in Panel A of Figure 1.
Although this approach is useful for visual inspection of the data,
a second, more-rigorous approach is warranted when systematic
differences between heaped data and non-heaped data are less obvious. To
test whether some covariate, X, systematically deviates from its
surrounding non-heaped data at a given data heap Z, one can estimate
(8) \(X_i = \gamma_0 + \gamma_1\mathbf{1}(R_i = Z) + \gamma_2(R_i - Z) + u_i,\)
using the data at heap Z itself in addition to the non-heaped data
within some bandwidth around Z. Essentially, this regression equation
estimates the extent to which characteristics "jump" off of
the regression line predicted by the non-heaped data. If \(\hat{\gamma}_1\) is
significant, then one has reason to conclude that the composition
changes abruptly at heap Z.
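Equation (8) amounts to a single OLS regression per heap. A minimal sketch (our own code; the covariate and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def heap_jump(x, r, Z, bw=9):
    # Equation (8): does covariate x at heap Z "jump" off the line
    # fit through the surrounding non-heaped observations?
    # (bw < 10 here, so the window contains no other multiple-of-10 heap)
    keep = np.abs(r - Z) <= bw
    x, r = x[keep], r[keep]
    at_heap = (r == Z).astype(float)
    X = np.column_stack([np.ones_like(x), at_heap, r - Z])
    return np.linalg.lstsq(X, x, rcond=None)[0][1]  # gamma_1 hat

# Illustrative data: heaped types (20%) differ in a covariate by 1.0
n = 50_000
heaped = rng.random(n) < 0.2
r = np.where(heaped, 10 * rng.integers(-10, 11, n), rng.integers(-100, 101, n))
x = heaped + rng.standard_normal(n)
g1 = heap_jump(x, r, Z=50)
```

Here `g1` is well above zero because observations at the heap Z = 50 are a mix of heaped and non-heaped types, so their average covariate value sits off the line traced by the surrounding non-heaped data.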
An approach along these lines can also be used as an initial step
to considering whether there is heaping, be it random or non-random. In
particular, a disaggregated histogram is likely to be more useful than
an aggregated histogram. Moreover, one could estimate an equation like
(8) applied to the distribution of the data to test the degree to which
any potential heap points are statistically significant. That said, we
would caution against ignoring heaps that are economically but not
statistically significant.
C. What if Heaping Occurs Away from the Threshold?
In this section, we examine the extent to which non-random heaping
leads to bias when a heap does not fall immediately to one side of the
threshold. In Panel A of Figure 3, we show the estimated treatment
effect as we change the location of the heap relative to the cutoff,
while rejection rates are shown in Panel B. In order to focus on the
influence of a single heap, we adopt a bandwidth of 5 throughout this
exercise.
As before, when the data heap is adjacent to the cutoff on the
right, the estimated treatment effect is biased upwards. Conversely,
when the data heap is adjacent to the cutoff on the left, the estimated
treatment effect is biased downwards. While this is rather obvious, what
happens as we move the heap away from the cutoff is perhaps surprising.
Specifically, the bias does not converge to zero as we move the heap
away from the cutoff. Instead, the bias goes to zero and then changes
sign. This results from the influence of the heaped types on the slope
terms. For example, consider the case in which the data heap (with mean
0.5 whereas all other data are mean zero) is placed 5 units to the right
of the treatment threshold and the bandwidth is 5. In such a case, the
best-fitting regression line through the data on the right (treatment)
side of the cutoff will have a positive slope and a negative intercept.
In contrast, the best-fitting regression line through the data on the
left (control) side of the cutoff will have a slope and intercept of
zero. Thus, we arrive at an estimated effect that is negative. The
reverse holds when we consider the case in which the data heap is 5
units to the left of the treatment threshold, which yields an estimated
effect that is positive. Recall that, in all cases, the true treatment
effect is zero. Our analysis highlights that "donut-RD
approaches," which estimate the treatment effect after dropping
observations in the immediate vicinity of the treatment threshold, as in
Barreca et al. (2011), should not be thought of as a general solution to
non-random heaping, because data heaps away from the threshold may also
introduce bias. Instead, dropping observations in the immediate vicinity
of the treatment threshold should be thought of as a useful robustness
check that has the potential to highlight misspecification in any RD
design. (11)
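The worked example above is easy to verify numerically. In this sketch (our own code; sample sizes are illustrative), all heaped observations sit 5 units to the right of the cutoff, the bandwidth is 5, and the true effect is zero:

```python
import numpy as np

rng = np.random.default_rng(3)

def rd_estimate(y, r, c=0):
    # Sharp-RD regression of Equation (1)
    d = (r >= c).astype(float)
    X = np.column_stack([np.ones_like(y), d, r, r * d])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def heap_at_5(n_non=20_000, n_heap=4_000):
    # Non-heaped data uniform on {-5, ..., 5} with mean zero;
    # heaped data (mean 0.5) all sit at r = 5, inside the bandwidth
    r = np.concatenate([rng.integers(-5, 6, n_non), np.full(n_heap, 5)])
    y = np.concatenate([rng.standard_normal(n_non),
                        0.5 + rng.standard_normal(n_heap)])
    return rd_estimate(y, r)

est = float(np.mean([heap_at_5() for _ in range(100)]))
```

The average estimate is negative: the heap tilts the treated-side regression line upward and pushes its intercept below zero, exactly the sign reversal discussed above.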
D. Alternative DGPs
In this section, we demonstrate other issues related to non-random
heaping by exploring alternative DGPs that introduce differences in
slopes and treatment effects across the two types. We summarize each DGP
in Figure 4 and present the estimated effects using different approaches
to estimation. Rejection rates based on i.i.d., heteroskedastic robust,
and clustered standard errors can be found in Figures A2, A3, and A4 of
the Appendix, respectively.
DGP-2 is the same as the baseline DGP-1 except that the non-heaped
types and heaped types have different slopes instead of different means.
In particular, in DGP-2 the mean is zero for both groups, the slope is
zero for non-heaped types, and the slope is 0.01 for heaped types. These
parameters are summarized in the second column of Figure 4, in Panel A,
while the means are shown in Panel B. In Panel C, we show the estimated
treatment effects for varying bandwidths. Again, although there is no
treatment effect for non-heaped or heaped types, the estimates tend to
suggest that the average treatment effect is not zero. The estimates
display a sawtooth pattern, dropping to less than zero each time an
additional pair of data heaps is added to the analysis sample and then
climbing above zero again as additional non-heaped data are added. (12)
As in DGP-1, the bias becomes less severe at larger bandwidths, although
this may not be obvious to the eye.
While the results are not shown in Figure 4, we have also
considered a DGP that combines the salient elements of DGP-1 and DGP-2
by having the parameters for the non-heaped types remain the same (set
at zero), while having the heaped types have a higher mean (0.5) and a
higher slope (0.01). As one would expect, the estimated local average
treatment effects exhibit a positive bias that grows exponentially as
the bandwidth is reduced (as in DGP-1) and a sawtooth pattern (as in
DGP-2).
We now turn to two DGPs in which there is a treatment effect for
heaped types while maintaining the same parameters for the non-heaped
types. Specifically, for heaped types in DGP-3, we set the control mean
equal to zero and the treatment effect to 0.5, which implies an
unconditional average treatment effect of 0.1. In DGP-4, we also
introduce a slope (0.01) for the heaped types. The set of estimates in
Panel C shows that--as with the DGPs in which the average
treatment effect is zero--the standard approach to estimation yields
positively biased estimates of the average treatment effect for each of
these DGPs.
E. Addressing Non-Random Heaping
In this section, we explore several approaches to addressing the
bias induced by non-random heaping. To begin, we consider an estimation
strategy in which we simply drop data at heap points from the analysis.
Estimates based on this approach are shown in Panel D of Figure 4. For
all DGPs and all bandwidths, this approach leads to an unbiased estimate
of the treatment effect for non-heaped types, which is zero.
While the approach described above would seem to be quite useful in
most scenarios, it is important to consider whether the data can be used
more fully, either to improve precision or because we are interested in
estimates that capture the treatment effects for those observed at
heaps. In Panel E, we show that RD estimates based solely on data at
heap points provide unbiased estimates of the average treatment effect
across the two types, weighted by the share of non-heaped data that are
observed at data heaps. (13) Furthermore, Panel F shows that we can
recover an unbiased estimate of the unconditional average treatment
effect by taking a population-weighted average of the estimates based on
these two approaches to restricting the sample, that is, by combining
the estimates from Panels D and E. (14)
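A sketch of this split-and-recombine strategy under a DGP-3-style setup (our own code: no effect for non-heaped types, an effect of 0.5 for the 20% of heaped types, so the unconditional average treatment effect is 0.1):

```python
import numpy as np

rng = np.random.default_rng(4)

def rd_estimate(y, r, c=0):
    # Sharp-RD regression of Equation (1)
    d = (r >= c).astype(float)
    X = np.column_stack([np.ones_like(y), d, r, r * d])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def combined_estimate(n=10_000):
    heaped = rng.random(n) < 0.2
    r = np.where(heaped, 10 * rng.integers(-10, 11, n),
                 rng.integers(-100, 101, n))
    tau = np.where(heaped, 0.5, 0.0)   # treatment effect only for heaped types
    y = tau * (r >= 0) + rng.standard_normal(n)
    at_heap = (r % 10 == 0)            # heap points (incl. coincidental non-heaped obs)
    est_non = rd_estimate(y[~at_heap], r[~at_heap])   # as in Panel D: drop heaps
    est_heap = rd_estimate(y[at_heap], r[at_heap])    # as in Panel E: heaps only
    w = at_heap.mean()                 # population share observed at heap points
    return (1 - w) * est_non + w * est_heap           # as in Panel F

est = float(np.mean([combined_estimate() for _ in range(100)]))
```

Dropping heaps targets the (zero) effect for non-heaped types, the heaps-only estimate targets the mixture of types at heap points, and their population-weighted average recovers the unconditional average treatment effect of 0.1.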
In Panels G and H, we show that the same results cannot be achieved
by pooled approaches that control for data heaps. In particular, in
Panel G we show estimates that are based on the standard approach while
also including an indicator variable that equals 1 for observations at
data heaps and zero otherwise. These estimates clearly show that adding
this control variable is no panacea. While it does fully remove the bias
from estimates in DGP-1, in which the only difference between the
non-heaped and heaped data is their mean, it does not fully remove the
bias from estimates based on other DGPs. Panel H takes a more flexible
approach to estimation, allowing separate intercepts and trends for
observations at data heaps. This approach does better than simply
allowing a different intercept. It removes the bias when the non-heaped
and heaped data have different means or slopes while having the same
treatment effect (DGP-1 and DGP-2). However, when non-heaped types and
heaped types have different treatment effects (DGP-3 and DGP-4), this
does not recover unbiased estimates of the average treatment effect.
The main takeaway from these results is that it is possible to
obtain unbiased estimates by separately investigating non-heaped and
heaped data. At the same time, it is important to recognize that it may
not always be feasible or convincing to investigate data heaps in
isolation. Doing so requires that there be "enough" heaps
within a reasonable bandwidth of the treatment threshold. (15) It also
limits how close one can get to the treatment threshold to obtain
estimates, which increases the risk that the estimates will be biased by
a misspecified functional form. Given the advantages offered by the
approach that focuses on non-heaped data--unbiased estimates and the
ability to shrink the bandwidth considerably--an approach that restricts
the sample in this fashion would seem to be an integral part of any
RD-based investigation in the presence of heaping. The usefulness of
investigating the effects for observations at data heaps is likely to
depend a great deal on the specific context.
IV. NON-SIMULATED EXAMPLES
In this section, we present three non-simulated examples where
non-random heaping has the potential to bias RD-based estimates. First,
we examine birth weights, which have previously been used as a running
variable to identify the effects of hospital care on infant health. We
show that U.S. birth weight data exhibit nonrandom heaping that will
bias RD estimates, that our proposed diagnostics detect this problem
whereas the usual diagnostics do not, and that the approaches that were
effective at reducing the bias in the simulation are also effective in
this context. Second, we turn our attention to dates of birth, which
have previously been used as a running variable to identify the effects
of maternal education on fertility and infant health. We show that these
data also exhibit non-random heaping that could lead to bias. Third,
using the PSID, we show that non-random heaping is also present in work
hours and in earnings.
A. Birth Weight as a Running Variable
Background. Because some hospitals use birth weight cutoffs as part
of their criteria for determining medical care, a natural way of
measuring the returns to such care is to use a RD design with birth
weight as the running variable. Almond et al. (2010), hereafter ADKW,
take this approach in order to consider the effect of
very-low-birth-weight classification, that is, having a measured birth
weight strictly less than 1,500 g, on infant mortality. (16) While
Barreca et al. (2011) and Almond et al. (2011) explore the sensitivity
of the estimated effects to the treatment of observations around the
1,500-g threshold, here we use this empirical setting to illustrate the
broader econometric issues associated with heaping. (17)
To begin, consider that birth weights can be measured using a
hanging scale, a balance scale, or a digital scale, each of them rated
in terms of their resolution. Modern digital scales marketed as
"neonatal scales" tend to have resolutions of 1, 2, or 5 g.
Products marketed as "digital baby scales" tend to have
resolutions of 5, 10, or 20 g. Mechanical baby scales tend to have
resolutions between 10 and 200 g. Birth weights are also frequently
measured in ounces, with ounce scales varying in resolution from 0.1 to
4 ounces. Because not all hospitals have high-performance neonatal
scales, especially going back in time, a certain amount of heaping at
round numbers is to be expected.
In the discussion of the simulation exercise above, we recommended
that researchers produce disaggregated histograms for the running
variable. We do so for birth weights in Panel A of Figure 5. (18) As
also noted in ADKW, this figure clearly reveals heaping at 100-g and
ounce multiples, with the latter being most dramatic. Although we focus
on these heaps throughout the remainder of this section to elucidate
conceptual issues involving non-random heaping, a more-complete analysis
of the use of birth weights as a running variable would need to consider
heaps at even smaller intervals (e.g., 50-g and half-ounces). In any
case, to the extent to which any of the observed heaping can be
predicted by attributes related to mortality, our simulations imply that
standard RD estimates are likely to be biased.
In considering the potential for heaping to be systematic in a way
that is relevant to the research question, we first note that scale
prices are strongly related to scale resolutions. Today, the
least-expensive scales cost less than $100, whereas the most expensive
cost approximately $2,000. For this reason, it is reasonable to expect
more precise birth weight measurements at hospitals with greater
resources, or at hospitals that tend to serve more-affluent patients.
(19) That is, one might anticipate that the heaping is systematic in a
non-trivial way.
Diagnostics. ADKW noted that there was significant heaping at
round-gram numbers and at gram equivalents of ounce multiples. However,
they did not test whether the heaping was random. They did, of course,
perform the usual specification checks to test for non-random sorting
across the treatment threshold. These tests do not reveal statistically
significant discontinuities in characteristics across the treatment
threshold. Moreover, despite there being two obvious heaps immediately
to the right of the treatment threshold at 1,500 and 1,503 g (i.e., 53
ounces), they find that the estimated discontinuity in the distribution
is not statistically significant.
In addition, ADKW make the rhetorical argument that there are not
irregular heaps around the 1,500-g threshold of interest as the heaps
are similar around 1,400 and 1,600 g. With respect to the usual concerns
about non-random sorting, this argument is compelling. In particular,
the usual concern is that agents might engage in strategic behavior so
that they are on the side of the threshold that gives them access to
favorable treatment. While this is a potential issue for the 1,500-g
threshold, it is not an issue around 1,400 and 1,600 g. As we also see
heaping at the 1,400- and 1,600-g thresholds, it makes sense to conclude
that the heaping observed at the 1,500-g threshold is
"normal." The problem with this line of reasoning, however, is
that all of the data heaps may be systematic outliers in their
composition.
Following our recommended procedure, in the second graph in Panel A
of Figure 5, we plot the fraction of children that are White against
recorded birth weights, visually differentiating heaps at 100-g and
ounce intervals. This is our first strong evidence that the heaping
apparent in Panel A is non-random--children at the 100-g heaps are
disproportionately likely to be non-White. This pattern, which suggests
that those at 100-g heaps are relatively disadvantaged, is echoed in
Figure A6 in the Appendix, which presents similar results for
mother's education and Apgar scores. In contrast, children at ounce
heaps are disproportionately likely to be White, to have mothers with at
least a high-school education, and to have high Apgar scores. However,
compared with those at 100-g heaps, it is far less clear that those at
ounce heaps are outliers in their underlying characteristics,
highlighting the usefulness of a more-formal approach. Our more-formal
approach to exploring the extent to which the composition of children
changes abruptly at reporting heaps is based on Equation (8), where
$X_i$ is a characteristic of individual i with birth weight
$Z_i$ and we separately consider Z = {1,000, 1,100, ..., 3,000} and
gram equivalents of ounce multiples. As discussed in the simulation
exercise, this diagnostic is not intended to detect a mean shift across
Z but, rather, the extent to which characteristics at heap Z differ from
what would be expected based on surrounding (non-heaped) observations.
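As a concrete sketch of this diagnostic (using simulated data and hypothetical parameter values, since Equation (8) is not reproduced here), one can regress a characteristic on a local trend in the running variable plus an indicator for the heap point, reporting the estimated jump relative to the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: the share White trends smoothly in
# recorded birth weight, but children recorded exactly at the heap
# point z0 = 1,500 g are drawn from a different (disadvantaged) mix.
z = rng.integers(1450, 1551, size=20000).astype(float)
z0 = 1500.0
at_heap = (z == z0)
p_white = 0.6 + 0.001 * (z - z0) - 0.2 * at_heap
white = (rng.random(z.size) < p_white).astype(float)

# Regress the characteristic on an intercept, a centered trend, and a
# heap indicator; the heap coefficient measures how far observations
# at z0 sit off the trend implied by the surrounding data.
X = np.column_stack([np.ones_like(z), z - z0, at_heap.astype(float)])
beta, *_ = np.linalg.lstsq(X, white, rcond=None)
pct_jump = 100 * beta[2] / beta[0]
print(f"estimated deviation at the heap: {pct_jump:.1f}%")
```

In practice one would run this regression heap by heap and bootstrap the standard errors, as in the text, rather than for a single simulated heap.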
The results from this regression analysis confirm that child
characteristics change abruptly at data heaps. Focusing on the 100-g
heaps, in the first graph in Panel B of Figure 5 we show estimated
percent changes, i.e., the estimated jump at the heap divided by the
estimated intercept from Equation (8), for the probability that a
mother is White. (20) For nearly every estimate, bootstrapped standard
error estimates are small enough to reject that the characteristics of
children at Z are on the trend line. Similarly, in the same panel, we
demonstrate that those at ounce heaps also tend to systematically
deviate from the trend based on surrounding observations, except that
more-affluent types are disproportionately likely to have birth
weights recorded in ounces. (21) In addition, it is clear that the data
at ounce heaps have a different slope from the non-heaped data, which
our simulation exercises revealed to be problematic. In the end, we note
that whereas the standard validation exercises fail to detect these
important sources of potential bias, this simple procedure proves
effective.
Non-Random Heaping, Bias, and Corrections. Given these
relationships between characteristics that predict both infant mortality
and heaping in the running variable, our simulation exercise suggests
standard RD estimates will be biased. To illustrate this issue, we
replicate ADKW's analysis of the 1,500-g cutoff while also
considering placebo cutoffs of 1,000, 1,100, ..., 3,000 g. (22) In
particular, we use the regression described in Equation (1) and an 85-g
bandwidth to consider the extent to which there are mean shifts in
mortality rates across the considered thresholds. We plot percent
changes, 100 multiplied by the estimated treatment effect divided by the
intercept, for greater comparability across the differing cutoffs. (23)
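A minimal sketch of this percent-change statistic (with hypothetical simulated data; note that, as in ADKW's setting, "treatment" falls on observations below the cutoff):

```python
import numpy as np

def rd_percent_change(z, y, cutoff, h):
    """Local linear RD with a uniform kernel and bandwidth h: regress y
    on an intercept, a centered trend, an indicator for being below the
    cutoff, and the trend interacted with that indicator. Returns 100
    times the estimated jump divided by the intercept (the predicted
    mean just at or above the cutoff)."""
    keep = np.abs(z - cutoff) < h
    zc, yc = z[keep] - cutoff, y[keep]
    below = (zc < 0).astype(float)
    X = np.column_stack([np.ones_like(zc), zc, below, zc * below])
    b, *_ = np.linalg.lstsq(X, yc, rcond=None)
    return 100 * b[2] / b[0]

# Hypothetical check: a DGP with a known 20% drop in mortality below
# the cutoff should yield an estimate close to -20.
rng = np.random.default_rng(1)
z = rng.integers(1000, 3001, size=20000).astype(float)
y = 0.10 - 0.02 * (z < 1500) + rng.normal(0, 0.01, size=z.size)
print(round(rd_percent_change(z, y, 1500.0, 85.0), 1))  # close to -20
```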
In Panel A of Figure 6, we present the estimated percent impacts on
1-year mortality, which suggest that near any of the considered cutoffs
c, children with birth weights less than c routinely have better
outcomes than those with birth weights at or above c. Given that these
results are largely driven by a systematically different composition of
children at the 100-g heaps that coincide with the cutoffs, the
estimated effects are much larger in magnitude when one uses a narrower
bandwidth. For example, with a bandwidth of 30 g, 42 of 42 point
estimates fall below zero. (24)
Before turning to the approaches to estimation suggested by our
simulation exercises, we first consider the extent to which the bias
evident in Panel A may be addressed by the inclusion of an extensive
battery of control variables. In particular, Panel B of Figure 6 shows
estimated effects from specifications that add fixed effects for state,
year, birth order, the number of prenatal care visits, gestational
length in weeks, and mothers' and fathers' ages in 5-year bins (with
<15 and >40 as additional categories), along with indicator variables
for multiple births, male, Black, other race, Hispanic, the mother
having less than a high-school education, the mother having a
high-school education, the mother having a college education or more,
and whether the mother lives in her state of birth. Although the
estimates shrink toward zero, they are qualitatively similar to those
reported in Panel A--they imply that having a birth weight less than the
considered cutoff reduces infant mortality in almost all cases. We
interpret this set of estimates as evidence that this approach has not
effectively dealt with the composition bias.
As illustrated in the simulation exercise, an effective approach to
dealing with non-random heaping is to estimate the effect after dropping
observations at data heaps. While a drawback of this method is that it
cannot tell us about the treatment effect for the types who tend to be
observed at data heaps, it is consistent with the usual motivation for
RD designs. Specifically, researchers will focus on what might be
considered a relatively narrow sample in order to be more confident that
they can identify unbiased estimates.
In Panel C of Figure 6, we show the estimated effects on infant
mortality based on this approach, omitting those at 100-g and ounce
heaps from the analysis. While the earlier estimates (in Panels A and B)
were negative for most of the placebo cutoffs, these estimates resemble
the white-noise process we would anticipate in the absence of treatment
effects. Thus, these results indicate that the sample restrictions we
employ reduce the bias produced by the non-random heaping described
above. These results also suggest that the estimated effect of
very-low-birth-weight classification is zero for children born at
hospitals where birth weights are not recorded in hundreds of grams or
in ounces.
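A minimal sketch of this sample restriction, assuming heaps at 100-g multiples and at gram equivalents of whole ounces (the conversion factor and ounce range below are illustrative):

```python
import numpy as np

# Illustrative heap points: 100-g multiples and gram equivalents of
# whole ounces (1 oz is approximately 28.35 g, recorded as rounded grams).
gram_heaps = np.arange(1000, 3001, 100)
ounce_heaps = np.round(np.arange(36, 106) * 28.35)
heaps = np.union1d(gram_heaps, ounce_heaps)

def off_heap_mask(z, heap_points=heaps):
    """True for observations whose recorded birth weight is NOT at a
    heap point; estimation then proceeds on z[mask], y[mask]."""
    return ~np.isin(z, heap_points)

z = np.array([1497.0, 1500.0, 1503.0, 1512.0])
print(off_heap_mask(z))  # drops 1500 (100-g heap) and 1503 (53 oz)
```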
Our simulation exercise also showed that an approach that focuses
solely on heaped types can yield unbiased estimates where the data allow
for such estimates. In this setting, the ounce heaps can be considered
in this manner. (25) We show these results in Panel D of Figure 6. The
estimates do indicate a significant effect of very-low-birth-weight
classification for children born at hospitals where birth weights are
recorded in ounces, which is consistent with Almond et al.'s (2011)
evidence that very-low-birth-weight classification is particularly
relevant for those born at low-quality hospitals (where birth weights
are more likely to be recorded in ounces). That said, the magnitude and
statistical significance of the estimated effects at the placebo cutoffs
suggest that the estimates should be interpreted with caution. (26)
B. Dates of Birth and Ages as Running Variables
A common approach to estimating the effects of education on
outcomes is to use variation driven by small differences in birth timing
that straddle school-entry-age cutoffs. For example, "5 years old
on December 1" is a common school-entry requirement. As such, the
causal effect of education on outcomes can be measured by comparing the
outcomes of individuals born just before December 2 to the outcomes of
those born shortly thereafter, who begin school later and thereby tend
to obtain fewer years of education.
Dobkin and Ferreira (2010) use this approach to investigate the
effects of education on job market outcomes, whereas McCrary and Royer
(2011) use this approach to identify the causal effect of maternal
education on fertility and infant health using restricted-use birth
records from California and Texas. In the first graph of Figure 7, Panel
A, we use the same California birth records used by McCrary and Royer
(2011) and show the distribution of mothers' reported birth dates
across days of the month. (27) Although less striking than in the birth
weight example, this figure shows that there are data heaps at the
beginning of each month and at multiples of 5. The second graph in Panel
A shows one of many indications that those at data heaps are
outliers--that the mothers at these data heaps are disproportionately
less likely to have used tobacco during their pregnancies. This
phenomenon is not specific to tobacco use, however. Similar patterns are
equally evident in mother's race, father's race, mother's
education, father's education, the fraction having father's
information missing, or the fraction having pregnancy complications,
along with a wide array of child outcomes. (28)
It turns out that this non-random heaping is unlikely to be a
serious issue for the main results presented by McCrary and Royer (2011)
because their preferred bandwidth of 50 leaves their estimates
relatively insensitive to the high-frequency composition shifts
described above. (29) At the same time, it is important to keep in mind
that, were more data available, conventional practice would have them
choose a smaller bandwidth. Our simulation exercise demonstrates that
this practice of shrinking the bandwidth with more data would make the
bias associated with non-random heaping more severe.
This issue of heaping in dates of birth is also present in Shigeoka
(2014) who considers a discontinuity in patient cost sharing at age 70
in Japan. He circumvents any issues associated with the data heaps (at
the day level) by collapsing ages to the monthly level. This sort of
approach can only be used when the heaping function is
known so that the data that are not at heap points can be imputed
to the appropriate heap point (left or right) and when doing so would
not change the implied treatment status. In Shigeoka's context,
these conditions are met because the heaping is in the day of birth and
treatment coincides with the beginning of the month after age 70.
A similar issue also appears in Edmonds, Mammen, and Miller (2005)
who consider a discontinuity in women's pension receipt at age 60
in South Africa. They note that their data exhibit heaping at ages in
round decades and highlight that this heaping is non-random as
"women at age 60 generally look different than would be predicted
by the trend prior to age 60 and the trend after 60." In line with
the solutions we describe above, they exclude women at age 60 from
estimation and explain that this alters the population to which the
results are applicable. That said, the results of our simulation would
support excluding women with ages at any multiple of 10 because
non-random heaps that are far from the threshold have the potential to
introduce bias.
C. Income as a Running Variable
Given how many policies are income-based, there are several
examples where treatment effects might be identified using an RD design
with income as the running variable. For example, one might consider
this strategy to identify the effects of various tax incentives,
financial aid offers, the Special Supplemental Nutrition Program for
Women, Infants, and Children (WIC), or the Children's Health
Insurance Program (CHIP), which subsidizes health insurance for families
with incomes that are marginally too high to qualify for Medicaid.
In light of our results above, Saez's (2010) analysis of the
distribution of income tax data highlights a potential problem for any
such study. In particular, the fact that self-employed taxpayers bunch
at tax kink points but others do not indicates non-random heaping.
Taking a closer look at income data based on the PSID shows even more
systematic heaping. In particular, Panel B of Figure 7 shows that there
is significant heaping at $1,000 multiples and that individuals at these
data heaps are substantially less likely to be White than those with
similar incomes who are not at these data heaps. (30,31)
V. DISCUSSION AND CONCLUSION
In this paper, we have demonstrated that the RD design's
smoothness assumption is inappropriate when there is non-random heaping.
In particular, we have shown that RD-estimated effects are afflicted by
composition bias when attributes related to the outcomes of interest
predict heaping in the running variable. Furthermore, the estimates will
be biased regardless of whether the heaps are close to the treatment
threshold or far away (but within the bandwidth).
While composition bias is not a new concern for RD designs, the
type of composition bias that researchers tend to test for is of a very
special type. In particular, the convention is to test for mean shifts
in characteristics taking place at the treatment threshold. This
diagnostic is often motivated as a test for whether or not certain types
are given special treatment or better able to manipulate the system in
order to obtain favorable treatment. In this paper, we suggest that
researchers also need to be concerned with abrupt compositional changes
that may occur at heap points.
We propose two supplementary approaches to establishing the
validity of RD designs when the distribution of the running variable has
heaps. While the importance of showing disaggregated mean plots is well
established as a way to visually confirm that estimates are not driven
by misspecification (Cook and Campbell 1979), our examples demonstrate
that researchers should highlight data at reporting heaps in such plots
in order to visually inspect whether there is non-random heaping. As a
more-formal diagnostic to be used when the problem is not obvious, we
suggest that researchers estimate the extent to which characteristics at
heap points "jump" off of the trend predicted by non-heaped
data.
We consider several different approaches to addressing the bias
that non-random heaping introduces into standard RD estimates.
Approaches that control flexibly for data heaps reduce but do not remove
the bias. In contrast, approaches that stratify the data do provide
unbiased estimates. In particular, an analysis that simply drops the
data at reporting heaps yields an unbiased estimate of the treatment
effect for non-heaped types. Moreover, if there are a sufficient number
of heaps within a reasonable bandwidth of the threshold, a researcher
can separately analyze these data to obtain an unbiased estimate that
captures a weighted average of the treatment effects for heaped and
non-heaped types (as both are present at data heaps). Where this is
feasible, the two unbiased estimates can be combined to provide an
estimate of the unconditional average treatment effect.
doi: 10.1111/ecin.12225
ABBREVIATIONS
CHIP: Children's Health Insurance Program
DGP: Data-Generating Process
PSID: Panel Study of Income Dynamics
RD: Regression Discontinuity
WIC: Women, Infants, and Children
APPENDIX
Understanding the Sawtooth Pattern in Estimated Treatment Effects
To understand the sawtooth pattern first exhibited in DGP2, Figure
A1 plots the regression lines using selected bandwidths. The
short-dashed lines are based on a bandwidth of 10, where the data
include three heap points, R = {-10, 0, 10}. These lines show that the
non-random heaping captured in DGP2 leads to an estimate that is
negatively biased. In particular, the heap at R = -10 has two effects on
the regression line on the left side of the threshold. First, this heap
causes the regression line to shift down because it pulls down the
center of mass. (32) Second, it induces a positive slope in order to
bring the regression line closer to the heaped data at the edge of the
bandwidth. As it turns out, the slope is large enough that the
regression line crosses zero from below, which results in a positive
expected value approaching the treatment threshold from the left. The
heap at R = 10 has similar effects on the regression line on the right
side of the threshold--shifting the regression line up, inducing a
positive slope such that the expected value is negative approaching the
treatment threshold from the right. As such, approaching the threshold
from each side, we arrive at a negative difference in expected value.
The dash-and-dot line in Figure A1 uses a bandwidth of 18 to
demonstrate how the same DGP can arrive at positive estimates. Again, on
both the left and right sides of the treatment threshold, the sum of
squared errors is minimized by a positively sloped regression line.
However, with more non-heaped data, including a sizable share to the
left of the data heap at R = -10 and to the right of the data heap at R
= 10, the magnitude of the slope is much smaller. As a result, neither
regression line, on the left side or the right side of the threshold,
crosses zero. Thus, we have a negative expected value approaching the
treatment threshold from the left and a positive expected value
approaching the treatment threshold from the right, that is, a positive
estimate of the treatment effect.
Last, the solid line in Figure A1 plots the regression lines using
a bandwidth of 20. Here, it is important to keep in mind that the
increase in the bandwidth has introduced data heaps at R = -20 and R =
20 to the analysis. Not surprisingly, the bandwidth of 20 shares a lot
in common with the bandwidth of 10. In particular, the heaps at the
boundary of the bandwidth influence the slope parameters such that the
regression lines cross zero. As such, we again find a negative estimate
of the treatment effect when the true effect is zero.
As shown in the second column of Panel B in Figure 4, these
phenomena occur in a systematic fashion as we change the bandwidth. Each
time a new set of heaps is introduced, the slope estimate becomes
sharply positive, which leads the regression lines on each side of the
cutoff to pass through zero, leading to negative estimates of the
treatment effect. As we increase the bandwidth beyond a set of heaps,
however, the slope terms shrink in magnitude, the regression lines no
longer pass through zero, and we arrive at positive estimates of the
treatment effect. The process repeats again when the increase in
bandwidth introduces a new set of heaps.
REFERENCES
Aiken, L. S., S. G. West, D. E. Schwalm, J. Carroll, and S. Hsuing.
"Comparison of a Randomized and Two Quasi-Experimental Designs in a
Single Outcome Evaluation: Efficacy of a University-Level Remedial
Writing Program." Evaluation Review, 22(4), 1998, 207-44.
Almond, D., J. J. Doyle Jr., A. E. Kowalski, and H. Williams.
"Estimating Marginal Returns to Medical Care: Evidence from At-risk
Newborns." Quarterly Journal of Economics, 125(2), 2010, 591-634.
--. "The Role of Hospital Heterogeneity in Measuring Marginal
Returns to Medical Care: A Reply to Barreca, Guldi, Lindo, and
Waddell." Quarterly Journal of Economics, 126(4), 2011, 2125-31.
Barreca, A. I., M. Guldi, J. M. Lindo, and G. R. Waddell.
"Saving Babies? Revisiting the Effect of Very Low Birth Weight
Classification." Quarterly Journal of Economics, 126(4), 2011,
2117-23.
Berk, R., G. Barnes, L. Ahlman, and E. Kurtz. "When Second
Best Is Good Enough: A Comparison Between a True Experiment and a
Regression Discontinuity Quasi-Experiment." Journal of Experimental
Criminology, 6, 2010, 191-208.
Black, D., J. Galdo, and J. Smith. "Evaluating the Regression
Discontinuity Design Using Experimental Data." Mimeo, University of
Chicago, 2005.
Buddelmeyer, H., and E. Skoufias. "An Evaluation of the
Performance of Regression Discontinuity Design on PROGRESA." World
Bank Policy Research Working Paper No. 3386, 2004.
Cho, J. S., and H. White. "Testing for Regime Switching."
Econometrica, 75(6), 2007, 1671-720.
Cook, T. D. "'Waiting for Life to Arrive': A History
of the Regression-Discontinuity Design in Psychology, Statistics and
Economics." Journal of Econometrics, 142(2), 2008, 636-54.
Cook, T. D., and D. T. Campbell. Quasi-Experimentation: Design and
Analysis Issues for Field Settings. Chicago: Rand McNally, 1979.
Cook, T. D., and V. C. Wong. "Empirical Tests of the Validity
of the Regression Discontinuity Design." Annals of Economics and
Statistics, 91/92, 2008, 127-50.
Dickert-Conlin, S., and T. Elder. "Suburban Legend: School
Cutoff Dates and the Timing of Births." Economics of Education
Review, 29(5), 2010, 826-41.
Dobkin, C., and F. Ferreira. "Do School Entry Laws Affect
Educational Attainment and Labor Market Outcomes?" Economics of
Education Review, 29(1), 2010, 40-54.
Dong, Y. "Regression Discontinuity Applications with Rounding
Errors in the Running Variable." Journal of Applied Econometrics,
30(3), 2015, 422-46.
Edmonds, E., K. Mammen, and D. L. Miller. "Rearranging the
Family? Income Support and Elderly Living Arrangements in a Low-Income
Country." Journal of Human Resources, 40(1), 2005, 186-207.
Hahn, J., P. Todd, and W. van der Klaauw. "Identification and
Estimation of Treatment Effects with a Regression-Discontinuity
Design." Econometrica, 69(1), 2001, 201-9.
Imbens, G. W., and T. Lemieux. "Regression Discontinuity
Designs: A Guide to Practice." Journal of Econometrics, 142(2),
2008, 615-35.
LaLonde, R. "Evaluating the Econometric Evaluations of
Training with Experimental Data." The American Economic Review,
76(4), 1986, 604-20.
Lee, D. S. "Randomized Experiments from Non-random Selection
in U.S. House Elections." Journal of Econometrics, 142(2), 2008,
675-97.
Lee, D. S., and D. Card. "Regression Discontinuity Inference
with Specification Error." Journal of Econometrics, 142(2), 2008,
655-74.
Lee, D. S., and T. Lemieux. "Regression Discontinuity Designs
in Economics." Journal of Economic Literature, 48(2), 2010,
281-355.
McCrary, J. "Manipulation of the Running Variable in the
Regression Discontinuity Design: A Density Test." Journal of
Econometrics, 142(2), 2008, 698-714.
McCrary, J., and H. Royer. "The Effect of Female Education on
Fertility and Infant Health: Evidence from School Entry Policies Using
Exact Date of Birth." American Economic Review, 101(1), 2011,
158-95.
Saez, E. "Do Taxpayers Bunch at Kink Points?" American
Economic Journal: Economic Policy, 2(3), 2010, 180-212.
Shadish, W., R. Galindo, V. Wong, P. Steiner, and T. Cook. "A
Randomized Experiment Comparing Random to Cutoff-Based Assignment."
Psychological Methods, 16(2), 2011, 179-219.
Shigeoka, H. "The Effect of Patient Cost Sharing on
Utilization, Health, and Risk Protection." American Economic
Review, 104(7), 2014, 2152-84.
Van der Klaauw, W. "Regression-Discontinuity Analysis: A
Survey of Recent Developments in Economics." Labour: Review of
Labour Economics and Industrial Relations, 22(2), 2008, 219-45.
(1.) See Aiken et al. (1998), Buddelmeyer and Skoufias (2004),
Black, Galdo, and Smith (2005), Cook and Wong
(2008), Berk et al. (2010), and Shadish et al. (2011) who describe
within-study comparisons similar to LaLonde (1986).
(2.) This framework corresponds to a two-component mixture model
that could naturally be expanded to allow for more types. A huge number
of papers across statistics and economics have wrestled with how to
identify such models. See Cho and White (2007) for a recent treatment of
the general problem.
(3.) As a simple case, consider a scenario in which there is no
treatment effect and no slope for $K^* = 0, 1$. As such, the true
model simplifies to:

(6) $Y_i = \alpha_0 + \alpha_1 K^*_i + e_i$.

While it may not be immediately obvious that estimating Equation
(5) would yield a biased estimate, note that Equation (3) implies
$K^*_i = (R_i - R^*_i)(\Gamma(R^*_i) - R^*_i)^{-1}$ and, thus, the
true model in this case could be rewritten

(7) $Y_i = \alpha_0 + \alpha_1 R_i (\Gamma(R^*_i) - R^*_i)^{-1} -
\alpha_1 R^*_i (\Gamma(R^*_i) - R^*_i)^{-1} + e_i$.

As such, the estimates based on the usual RD model (Equation 5) may
be biased because $u_i = \alpha_1 R_i (\Gamma(R^*_i) - R^*_i)^{-1} -
\alpha_1 R^*_i (\Gamma(R^*_i) - R^*_i)^{-1} + e_i$.
(4.) Fundamental to the challenge to identification we consider is
that only some (non-random) observations are heaped. See Dong (2015) for
a consideration of random rounding in the running variable.
(5.) To make this example concrete, one can think of estimating the
effect of free school lunches--typically offered to children in
households with income below some set percentage of the poverty line--on
the number of absences per week. The running variable could then be
thought of as the difference between the poverty line and family income,
with treatment provided when the poverty line (weakly) exceeds reported
income ($R_i \geq 0$). In this example, there
may be heterogeneity in how individuals report their incomes--some
individuals may report in dollars (non-heaped types), whereas others may
report their incomes in tens of thousands of dollars (heaped types).
Furthermore, supposing that non-heaped types are expected to be absent
zero days per week regardless of whether they are given free lunch and
heaped types are expected to be absent 0.5 days per week regardless of
whether they are given free lunch, then we would expect to see a mean
plot similar to that of Panel B of Figure 1. That is, we have a setting
in which treatment (free school lunch) has no impact on the outcome
(absences). However, as we show below, the non-random nature of the
heaping will cause the standard RD estimated effects to go awry.
Motivating this thought experiment, in Section IV.C we demonstrate that
there is systematic heterogeneity in how individuals report income
levels, with White individuals being less likely to report incomes in
thousands of dollars.
(6.) This evidence highlights the usefulness of comparing estimates
at various bandwidth levels, as proposed by van der Klaauw (2008).
(7.) This issue does not appear to be specific to heaping-induced
model misspecification. As one simple but illustrative example, we have
investigated a DGP in which [y.sub.i] = [r.sup.2.sub.i] + [e.sub.i] with
[e.sub.i] drawn from a standard normal distribution and [r.sub.i] drawn
from a discrete uniform distribution on {-20, -19, ..., 20) with r = 0
omitted for symmetry. A linear (and thus misspecified) RD model produces
discontinuity estimates centered on zero, which implies we should reject
the null hypothesis of no discontinuity at the 5% level, 5% of the time
if we are using the correct standard-error estimates. This is the case
when inference is based on heteroskedasticity-consistent standard-error
estimates. However, inference based on clustered standard-error
estimates leads to rejection rates of zero, owing to standard-error
estimates that are 5-6 times too large in this instance. Results are
qualitatively similar with alternative nonlinear DGPs involving
higher-order polynomials and/or trigonometric functions.
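This simulation is straightforward to reproduce. The following sketch checks only that the linear RD estimates are centered on zero; it does not implement the clustered-standard-error comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
support = np.array([r for r in range(-20, 21) if r != 0], dtype=float)

def linear_rd(r, y):
    """Misspecified linear RD: separate intercepts and slopes on each
    side of r = 0; returns the estimated discontinuity."""
    right = (r > 0).astype(float)
    X = np.column_stack([np.ones_like(r), r, right, r * right])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[2]

estimates = []
for _ in range(200):
    r = rng.choice(support, size=2000)
    y = r ** 2 + rng.normal(size=r.size)  # quadratic DGP, standard normal errors
    estimates.append(linear_rd(r, y))
print(round(float(np.mean(estimates)), 2))  # centered on zero by symmetry
```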
(8.) We should also emphasize that this test, as described by
McCrary (2008), is not meant to identify data heaps, but to identify
circumstances in which there is manipulation of the running variable.
This type of behavior in which individuals close to the threshold exert
effort to move to the "preferred" side of the cutoff will
produce a distribution that is qualitatively different from simply
having a data heap on one side of the threshold. In particular, this
behavior will produce more of a shift in the distribution at the
treatment threshold, whereas heaping produces blips in the distribution
that may or may not coincide with the treatment threshold.
(9.) Recall our earlier reference to $K^*_i$ as an unobservable
indicator that i is a heaped type. In considering a proxy variable
indicating heaped types, we surmise that such a proxy may be arrived
at in practice through institutional knowledge or common practice
(as might be the case when rounding leads to heaping, for example). We
also envision some experimenting with methods for the identification of
heaps, with more systematic heaping patterns across the distribution of
the running variable better facilitating their discovery. This was the
case in our initial consideration of heaping in birth weight (Barreca et
al. 2011), for example.
(10.) We recommend this type of plot as a complement to
more-aggregated mean plots rather than a substitute. More-aggregated
mean plots may be more useful when trying to discern what functional
form should be used in estimation and whether or not there is a
treatment effect.
(11.) The relationship highlighted in Panel A of Figure 3--that the
sign of the bias depends on the location of the heap relative to the
cutoff--also reveals a potential special case in which the heaping is
such that equal and opposing biases of the estimates of the conditional
expectation function on each side of the threshold results in an
unbiased (though imprecise) estimate of the true treatment effect. While
such a data-generating process can be imagined, it is rather particular
and we thus imagine that there is room to consider the implications of
heaping even in such an environment.
(12.) For a detailed explanation of this sawtooth pattern, see the
Appendix.
(13.) In particular, both non-heaped and heaped types contribute to
the average treatment effect across observations at heap points.
Non-heaped types account for 80% of the full sample, but only 10% of
these fall at heap points, so they represent 0.1 x 80/(0.1 x 80 + 20)
of the sample at heaps. Heaped types--20% of the full sample--account
for 20/(20 + 8) of the sample at heaps. As such, the average treatment
effect at heaps is (8/28) x 0 + (20/28) x 0.5 = 0.357.
(14.) Combining the heaped and non-heaped analyses yields an
average treatment effect of 72/100 x 0 + 28/100 x 0.357 = 0.1.
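The arithmetic in this and the preceding footnote can be checked directly:

```python
# Verifying the arithmetic in footnotes 13 and 14.
non_heaped_share, heaped_share = 0.8, 0.2
p_non_heaped_at_heap = 0.1  # 10% of non-heaped types fall at heap points
te_non_heaped, te_heaped = 0.0, 0.5

# Composition of the sample observed at heap points: 8 parts non-heaped
# (0.1 x 80) to 20 parts heaped.
nh_at_heaps = non_heaped_share * p_non_heaped_at_heap  # 0.08
h_at_heaps = heaped_share                              # 0.20
ate_at_heaps = (nh_at_heaps * te_non_heaped
                + h_at_heaps * te_heaped) / (nh_at_heaps + h_at_heaps)
print(round(ate_at_heaps, 3))  # 0.357

# Footnote 14: combining with the off-heap analysis (72% of the data).
ate_overall = ((1 - nh_at_heaps - h_at_heaps) * te_non_heaped
               + (nh_at_heaps + h_at_heaps) * ate_at_heaps)
print(round(ate_overall, 3))  # 0.1
```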
(15.) For example, when there are data heaps at multiples of 10, as
in the DGPs we consider, 20 is the smallest bandwidth one could use to
estimate Equation (1), as it requires at least two observations on each
side of the threshold.
(16.) While our simulation exercise follows the convention that
treatment falls on those to the right of the threshold, note that
treatment falls on the left in this setting.
(17.) In so doing, we also shed light on why estimated effects of
very-low-birth-weight classification are sensitive to the treatment of
observations bunched around the 1,500-g threshold.
(18.) We use the same data as ADKW throughout this section: Vital
Statistics Linked Birth and Infant Death Data from 1983-1991 and
1995-2002; linked files are not available for 1992-1994. These data
combine information available on an infant's birth certificate with
information on the death certificate for individuals less than 1 year
old at the time of death. As such, the data provide information on the
infant, the infant's health at birth, the infant's death
(where applicable), the family background of the infant, the geographic
location of birth, and maternal health and behavior during pregnancy. We
do not, however, have access to the treatment data that ADKW use to
estimate a first stage, which in turn allows them to construct
two-sample IV estimates of the effect of treatment on mortality. For
information on our sample construction, see Almond et al. (2010).
(19.) With general improvement in technology, one would anticipate
that measurement would appear more precise in the aggregate over time.
We show that this is indeed the case in Figure A5 in the Appendix, which
also foreshadows the systematic relationship between heaping and
measures of socioeconomic status. Note that a major reason the
figure does not show smooth trends is that data are not consistently
available for all states.
(20.) Results focusing on other child characteristics are shown in
Figure A6.
(21.) Although the estimates for each ounce heap are rarely
statistically significant, it is obvious that the set of estimates is
jointly significant and that the individual estimates would usually be
significant with a bandwidth larger than 85 g.
(22.) We note that not all of these are "true placebo
cutoffs," as 1,000 g corresponds to the extremely-low-birth-weight
cutoff and 2,500 g corresponds to the low-birth-weight cutoff.
(23.) Confidence intervals are based on a bootstrap with 500
replications in which observations are drawn at random. Specifically,
the confidence intervals shown reflect the 2.5th and 97.5th percentiles
of the 500 estimates produced from this procedure.
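A percentile bootstrap of this kind can be sketched as follows. This is a minimal illustration of the resampling-and-percentile procedure the note describes, not the authors' code; the estimator here is a stand-in (a sample mean) rather than an RD estimator.

```python
import numpy as np

def percentile_bootstrap_ci(data, estimator, reps=500, alpha=0.05, seed=0):
    """Resample observations at random, re-estimate on each draw, and
    take the 2.5th and 97.5th percentiles of the resulting estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([
        estimator(data[rng.integers(0, n, size=n)]) for _ in range(reps)
    ])
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Toy example: a 95% CI for a sample mean (stand-in for an RD estimate).
data = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=400)
lo, hi = percentile_bootstrap_ci(data, np.mean)
```

With 400 observations centered at 2.0, the interval should be a narrow band around that value.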
(24.) Results are similar if one uses triangular kernel weights
that also place greater emphasis on observations at 100-g heaps. ADKW
mention having considered the effects at these same placebo cutoffs,
motivating the analysis as follows:
[A]t points in the distribution where we do not anticipate
treatment differences, economically and statistically
significant jumps of magnitudes similar to our
VLBW treatment effects could suggest that the discontinuity
we observe at 1,500 grams may be due to natural
variation in treatment and mortality in our data.
They do not present these results but instead report:
In summary, we find striking discontinuities in treatment
and mortality at the VLBW threshold, but less
convincing differences at other points of the distribution.
These results support the validity of our main
findings.
We disagree with this interpretation of the results.
(25.) In principle, the observations at 100-g heaps could also be
analyzed in this manner; however, the analysis would require extremely
large bandwidths.
(26.) One a priori reason for caution is that the 85-g bandwidth
effectively means that the estimates are identified using
observations at no more than six data heaps. Estimates using a larger
bandwidth of 150 g are shown in Figure A7 in the appendix. It is also
possible that the estimates could be confounded by systematic deviations
at ounce multiples that correspond to pounds and fractions thereof.
(27.) The California Vital Statistics Data span 1989 through 2004.
These data, obtained from the California
Department of Public Health, contain information on the universe of
births that occurred in California during this time frame. Mother's
date of birth is not available in the public use version of the National
Vital Statistics Natality Data. We use the same sample restrictions as
McCrary and Royer (2011), limiting the sample to mothers who: were born
in California between 1969 and 1987, were 23 years of age or younger at
the time of birth, gave birth to their first child between 1989 and 2002,
and whose education level and date of birth are reported in the data.
(28.) For related reasons, the empirical findings in Dickert-Conlin
and Elder (2010) should also be considered in future papers that use day
of birth as their running variable. In particular, they show that there
are relatively few children born on weekends relative to weekdays
because hospitals usually do not schedule induced labor and cesarean
sections on weekends. As such, children born without medical
intervention, who tend to be of relatively low socioeconomic status,
are disproportionately observed on weekends.
(29.) With that said, this phenomenon may explain why their
estimates vary a great deal when their bandwidth is less than 20 but are
relatively stable at higher bandwidths. See McCrary and Royer (2011),
Web Appendix figure 3.
(30.) These results are based on reported incomes among PSID heads
of household, 1968-2007. For visual clarity, the graphs focus on
individuals with positive incomes less than $40,000, which is
approximately equal to the 75th percentile. In addition, the histogram
uses $100 bins and the mean plot uses $100 bins for the data that are
not found at $1,000 multiples.
(31.) Interestingly, the PSID also reveals systematic heaping in
annual hours of work, which could also be used as a running variable in
an RD design. For example, many employers provide health insurance
and other benefits only to employees who work some predetermined number
of hours. In these data, heaping is evident at 40-hour multiples, and
those at these heaps have less education, on average, than those who
work a similar number of hours but are not at data heaps.
(32.) Recall that a regression line always runs through (x̄, ȳ).
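This property of OLS with an intercept, that the fitted line passes through the point of sample means, is easy to verify numerically. The snippet below is an illustrative check on simulated data, not material from the paper; the data-generating values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.5 * x + rng.normal(size=200)

# OLS slope and intercept via least squares (highest degree first).
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted line evaluated at x-bar recovers y-bar (up to rounding error).
fitted_at_xbar = intercept + slope * x.mean()
```

The residual between `fitted_at_xbar` and `y.mean()` is zero up to floating-point error, regardless of the simulated sample.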
ALAN I. BARRECA, JASON M. LINDO and GLEN R. WADDELL *
* The authors thank the editor, Lars Lefgren, and two anonymous
referees for their comments and suggestions, along with Josh Angrist,
Bob Breunig, Patrick Button, David Card, Janet Currie, Yingying Dong,
Todd Elder, Bill Evans, David Figlio, Melanie Guldi, Hilary Hoynes,
Wilbert van der Klaauw, Thomas Lemieux, Justin McCrary, Doug Miller,
Marianne Page, Heather Royer, Larry Singell, Ann Huff Stevens, Ke-Li Xu,
Jim Ziliak, seminar participants at the University of Kentucky, and
conference participants at the 2011 Public Policy and Economics of the
Family Conference at Mount Holyoke College, the 2011 SOLE Meetings, the
2011 NBER's Children's Program Meetings, and the 2011 Labour
Econometrics Workshop at the University of Sydney. Barreca: Associate
Professor, Department of Economics,
Tulane University, New Orleans, LA 70115; NBER and IZA. Phone
504-865-5321, Fax 504-865-5869, E-mail
[email protected]
Lindo: Associate Professor, Department of Economics, Texas A&M
University, College Station, TX 77845; NBER and IZA. Phone 979-845-1363,
Fax 979-847-8757, E-mail
[email protected]
Waddell: Professor, Department of Economics, University of Oregon,
Eugene, OR 97403-1285; IZA. Phone 541-346-1259, Fax 541-346-1243, E-mail
[email protected]
COPYRIGHT 2016 Western Economic Association International