Heaping-induced bias in regression-discontinuity designs.
Barreca, Alan I.; Lindo, Jason M.; Waddell, Glen R.
I. INTRODUCTION
Empirical researchers have witnessed a resurgence in the use of
regression-discontinuity (RD) designs since the late 1990s. This
approach to estimating causal effects is often characterized as superior
to all other non-experimental identification strategies (Cook 2008; Lee
and Lemieux 2010) as RD designs usually entail perfect knowledge of the
selection process and require comparatively weak assumptions (Hahn,
Todd, and van der Klaauw 2001; Imbens and Lemieux 2008; Lee 2008). This
view is supported by several studies that have shown that RD designs and
experimental studies produce similar estimates. (1) RD designs also
offer appealing intuition--so long as characteristics related to
outcomes are smooth around the treatment threshold, we can reasonably
attribute differences in outcomes across the threshold to the treatment.
In this paper, we discuss the appropriateness of this "smoothness
assumption" in the presence of heaping.
For a wide variety of reasons, heaping is common in many types of
data. For example, we often observe heaping when data are self-reported
(e.g., income, age, and height), when tools with limited precision are
used for measurement (e.g., birth weight and pollution), and when
continuous data are rounded or otherwise discretized (e.g., letter
grades and grade point averages). Heaping also occurs as a matter of
practice, such as with work hours (e.g., 8 hours per day, 40 hours per
week) and retirement ages (e.g., 62 and 65). In this paper, we show how
ignoring heaping can have serious consequences. In particular, in RD
designs, estimates are likely to be biased if attributes related to the
outcomes of interest predict heaping in the running variable.
While our earlier work (Barreca et al. 2011) identified one case in
which heaping led to biased estimates in a RD design, it left several
important gaps in the literature. Most crucially, it did not discuss the
different ways in which non-random heaping can lead to bias, approaches
to diagnosing non-random heaping, or how to correct for non-random
heaping. In this paper, we address the following four questions. First,
how well does the usual battery of diagnostic tests perform in
identifying non-random heaping? (It depends.) Second, are there
supplementary diagnostic tests that might be better suited to
identifying this type of problem? (We offer two.) Third, do we need to
worry about data heaps that are far away from the treatment threshold
but within the bandwidth? (Yes, we do.) Fourth, once the problem has
been diagnosed, what should a practitioner do? (Although our earlier
work may have left the impression that a researcher should drop
observations at data heaps from the analysis, this is not necessarily
the best solution--we consider alternative approaches that may be more
appropriate depending on the circumstances.)
We illustrate the general issue with a series of simulation
exercises that consider estimating the most common of sharp-RD models,
(1) \(Y_i = \alpha + \beta\,\mathbf{1}(R_i \ge c) + \theta R_i + \psi R_i \mathbf{1}(R_i \ge c) + \varepsilon_i,\)
where \(R_i\) is the running variable, observations with \(R_i \ge c\) are treated, and \(\varepsilon_i\) is a random error term. As usual, this model measures the local average treatment effect by considering the difference in the estimated conditional expectations of \(Y_i\) on each side of the treatment threshold,
(2) \(\lim_{r \downarrow c} E[Y_i \mid R_i = r] - \lim_{r \uparrow c} E[Y_i \mid R_i = r].\)
Our primary simulation exercise supposes that the treatment cutoff
occurs at zero, that there is no treatment effect, and that the running
variable is unrelated to the outcome. We introduce non-random heaping by
having the expected value of [Y.sub.i] vary across two data types,
heaped and non-heaped, where heaped types have [R.sub.i] randomly drawn
from {-100, -90, ..., 90, 100} and non-heaped types have [R.sub.i]
randomly drawn from {-100, -99, ..., 99, 100}. As such, we have a simple
data-generating process (DGP) in which an attribute that predicts
heaping in the running variable (type) is also related to outcomes. In
this stripped-down example, we show that estimating Equation (1) will
arrive at biased estimates and that the usual diagnostics may fail to
identify this type of problem. Furthermore, we show that non-random
heaping introduces bias even if a data heap does not fall near the
treatment threshold. On a brighter note, we offer alternative approaches
to considering the underlying data and to accommodating non-random
heaping.
To explore how non-random heaping can impair estimation in settings
beyond our simple DGP, we examine several alternative DGPs which allow
the heaped and non-heaped types to have different means, different
slopes, and different treatment effects. We consider several approaches
to addressing the bias in each DGP and come to the following
conclusions:
1. Omitting observations at data heaps from the analysis leads to
unbiased estimates of the treatment effect for non-heaped types.
2. Keeping only those observations at data heaps leads to unbiased
estimates of the average treatment effect across the two types weighted
by the share of non-heaped data that are observed at data heaps.
However, this approach: (a) limits the extent to which one can shrink
the bandwidth; (b) cannot be implemented when there are relatively few
heaps within reasonable bandwidths; and (c) may lead to problems of
inference associated with having too few clusters. When it can be
reasonably implemented, the resulting estimate can be combined with the
estimate for non-heaped types to produce an unbiased estimate of the
average treatment effect for the population.
3. Approaches to estimating the unconditional average treatment
effect that pool the data and control flexibly for data observed at heap
points reduce but do not eliminate bias.
We consider the lessons learned from our simulation exercise in the
context of three non-simulated environments. First, we consider the use
of birth weights as a running variable to highlight the efficacy of our
proposed tests for non-random heaping. We also use this example to
demonstrate the merits of alternative approaches to overcoming the bias
that non-random heaping introduces into RD designs. Second, we show that
mother's reported day of birth, previously used to estimate the
effect of maternal education on fertility and infant health, also
exhibits non-random heaping. Third, in order to further demonstrate the
pervasiveness of heaping in commonly used data, we document non-random
heaping in both income and hours worked in the Panel Study of Income
Dynamics (PSID). While clearly not an exhaustive consideration of the
existing RD literature or of data where heaping is evident, this
analysis suggests that empirical researchers should, as a matter of
practice, consider heaping as a potential threat to internal validity in
most any exercise.
II. A SIMPLE FRAMEWORK FOR HEAPING-INDUCED BIAS
Consider a situation in which neither the program administrator (who is responsible for treatment) nor the researcher can observe the true value of the running variable, \(R^*_i\). Instead, they observe:
(3) \(R_i = R^*_i(1 - K^*_i) + \Gamma(R^*_i)K^*_i,\)
where \(K^*_i\) is an unobservable variable drawn from a Bernoulli distribution with mean \(\lambda\) that indicates whether the data are heaped. The heaping function \(\Gamma\) can take the form of a rounding function (e.g., the nearest-integer, floor, or ceiling function) but may also reflect other imputation rules. For example, missing birthdays may be recorded as happening on the 15th of the month, in which case \(\Gamma\) would be a constant function. As another example, topcoding would imply that \(\Gamma(R^*_i) = R^*_i\) for \(R^*_i < T\) and \(\Gamma(R^*_i) = T\) for \(R^*_i \ge T\).
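To make Equation (3) concrete, here is a minimal sketch (our own code and function names, not from the paper) of the three heaping functions just described and of the observed running variable they generate:

```python
import numpy as np

def gamma_round(r_star, unit=10):
    # Rounding heap: record the nearest multiple of `unit`
    return unit * np.round(np.asarray(r_star, dtype=float) / unit)

def gamma_constant(r_star, value=15):
    # Imputation heap: e.g., missing birthdays recorded as the 15th
    return np.full_like(np.asarray(r_star, dtype=float), value)

def gamma_topcode(r_star, T=100):
    # Topcoding: values at or above T are recorded as T
    return np.minimum(np.asarray(r_star, dtype=float), T)

def observed_running_variable(r_star, gamma, lam=0.2, seed=0):
    # Equation (3): R_i = R*_i (1 - K*_i) + Gamma(R*_i) K*_i,
    # with K*_i ~ Bernoulli(lam) indicating heaped observations
    r_star = np.asarray(r_star, dtype=float)
    k = np.random.default_rng(seed).random(r_star.shape) < lam
    return np.where(k, gamma(r_star), r_star)
```

Any of the three heaping-function variants can be passed to `observed_running_variable`; with `gamma_round`, roughly a share `lam` of observations heap at multiples of `unit`.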
Furthermore, suppose that such heaping is related to outcomes in a linear RD setting such that
(4) \(Y_i = \alpha_0 + \beta_0\mathbf{1}(R_i \ge c) + \theta_0 R_i + \psi_0 R_i\mathbf{1}(R_i \ge c) + K^*_i[\alpha_1 + \beta_1\mathbf{1}(R_i \ge c) + \theta_1 R_i + \psi_1 R_i\mathbf{1}(R_i \ge c)] + e_i,\)
which allows for heterogeneity in outcomes (\(Y_i\)) across data types \(K^*\). Given this knowledge, a researcher would naturally estimate Equation (4) directly and recover estimates of both \(\beta_0\) and \(\beta_1\). In practice, however, a researcher who is not aware of this heterogeneity is likely to instead estimate
(5) \(Y_i = \alpha + \beta\mathbf{1}(R_i \ge c) + \theta R_i + \psi R_i\mathbf{1}(R_i \ge c) + u_i,\)
which is like Equation (1), but for the error term, \(u_i = K^*_i(\alpha_1 + \beta_1\mathbf{1}(R_i \ge c) + \theta_1 R_i + \psi_1 R_i\mathbf{1}(R_i \ge c)) + e_i,\) where Equation (3) implies \(K^*_i = (R_i - R^*_i)(\Gamma(R^*_i) - R^*_i)^{-1}\).
(2) As such, the degree to which there is bias depends on the degree to
which there is heterogeneity across data types in addition to how the
heaping function operates on the unobserved running variable \(R^*\).
(3,4)
III. SIMULATION EXERCISE
A. Baseline DGP and Results
All of our simulation exercises are based on samples of 10,000
observations, 80% of which have [R.sub.i] randomly drawn from a discrete
uniform distribution on the integers {-100, -99, ..., 99, 100} (i.e.,
non-heaped types, or those with [K.sup.*] = 0), while the remainder have
[R.sub.i] drawn from a discrete uniform distribution on the integers
{-100, -90, ..., 90, 100} (i.e., heaped types, or those with [K.sup.*] =
1). Our baseline data-generating process (DGP-1) considers the case in
which the treatment cutoff occurs at zero (c = 0), there is no treatment
effect ([[beta].sub.0] = [[beta].sub.1] = 0), the running variable is
not related to the outcome ([[theta].sub.0] = [[theta].sub.1] =
[[psi].sub.0] = [[psi].sub.1] = 0), and [[epsilon].sub.i] is drawn from
N(0,1). The only difference between heaped types and non-heaped types
(besides having [R.sub.i] drawn from different distributions) is their
mean: non-heaped types have a mean of zero and heaped types have a mean
of 0.5.
The main features of these data are depicted in Panels A and B of
Figure 1, which are based on a single simulation. Panel A, in which we
plot the distribution of the data using one-unit bins, shows the data
heaps at multiples of 10. Panel B, in which we plot mean outcomes within
the same one-unit bins with separate symbols for bins that correspond to
multiples of 10, makes it clear that the means are systematically higher
at data heaps. (5)
In Panel C of Figure 1, we plot the estimated treatment effect that
is expected from this data-generating process (derived from 1,000
simulations) using the standard RD approach (Equation 1) and bandwidths
ranging from 5 to 100. There are two main features of this figure.
First, despite the fact that there is no treatment effect, the estimated
treatment effect is always positive in expectation. Second, the bias
increases as we shrink the bandwidth. (6)
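This exercise can be replicated in miniature with the following sketch (our own code, using 100 rather than 1,000 simulations; parameter values follow DGP-1), which estimates Equation (1) at a small and a large bandwidth:

```python
import numpy as np

rng = np.random.default_rng(0)

def rd_estimate(y, r, bw, c=0):
    # Sharp-RD regression of Equation (1) within a given bandwidth
    keep = np.abs(r - c) <= bw
    y, r = y[keep], r[keep]
    d = (r >= c).astype(float)
    X = np.column_stack([np.ones_like(y), d, r, r * d])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]  # coefficient on 1(R >= c)

def simulate_dgp1(n=10_000):
    # 20% heaped types (R on multiples of 10, outcome mean 0.5);
    # 80% non-heaped types (R on all integers in [-100, 100], mean 0)
    heaped = rng.random(n) < 0.2
    r = np.where(heaped, 10 * rng.integers(-10, 11, n),
                 rng.integers(-100, 101, n))
    y = 0.5 * heaped + rng.standard_normal(n)
    return y, r

n_sims = 100
est = {bw: np.mean([rd_estimate(*simulate_dgp1(), bw) for _ in range(n_sims)])
       for bw in (10, 100)}
```

Although the true effect is zero, the average estimate is positive at both bandwidths and noticeably larger at the small one, matching the pattern described for Panel C.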
In Panel D, we plot rejection rates at the 5% level based on three
different approaches to obtaining standard error estimates--assuming
homoskedasticity (i.i.d.), allowing for heteroskedasticity, and
clustering on the running variable, as recommended by Lee and Card
(2008) to address potential model misspecification. Notably, the
rejection rates based on standard error estimates that assume
homoskedasticity and those allowing for heteroskedasticity are nearly
identical to one another, while the rejection rates based on standard
error estimates clustered on the running variable tend to be far lower.
This result is consistent with the clustering approach
"inflating" standard error estimates when the model has been
misspecified. However, the fact that this approach produces rejection
rates at the 5% level that are routinely below 5% indicates that the
standard error estimates are too large. (7)
Inference issues aside, the fact that nonrandom heaping can bias
estimated treatment effects raises several important questions that we
will address in turn. How can we identify this type of problem? Is the
problem specific to circumstances in which a non-random data heap falls
immediately to one side of a treatment threshold? What if the heaped
data have a different slope or a non-zero treatment effect? Finally, how
can we address the problem once it has been diagnosed?
B. Diagnosing the Problem
Standard RD-Validation Checks. It is well established that
practitioners should check that observable characteristics and the
distribution of the running variable are smooth around the threshold.
A lack of smoothness in either is usually taken as evidence that there
is manipulation of the running variable (McCrary 2008). Specifically, if
there is excess mass on one side of the
threshold, it suggests that individuals may be engaging in strategic
behavior in order to gain favorable treatment. In addition, if certain
"types" are disproportionately observed on one side of the
threshold, it suggests that there may be systematic differences in how
successfully different types of individuals can manipulate the running
variable in order to gain access to treatment. Any evidence of this kind
is cause for concern because composition changes across the treatment
threshold could threaten identification. Given that the problem shown in
the previous section is one of composition bias arising from points in
the distribution with excess mass, one might anticipate that existing
tools would be well suited to raising a red flag when this issue exists.
As it turns out, this may not be the case.
In Panel A of Figure 2, we show the expected outcome of tests for a
discontinuity in the distribution across the treatment threshold
following McCrary (2008). The set of discontinuity estimates, based on
bandwidths ranging from 5 to 100, shows what one would probably expect.
The estimated discontinuity is greatest when the bandwidth is small as a
small bandwidth gives greater proportional weight to the data heap that
falls immediately to the right of the cutoff. Panel B, which shows
rejection rates, indicates that this approach reliably rejects zero at
all of the considered bandwidths. That said, it is important to note
that the usefulness of this diagnostic depends on how the heaps are
distributed relative to the threshold and what fraction of the data are
heaped, in addition to the set of bandwidths one can reasonably
consider. (8) In particular, this test yields lower rejection rates when
there is not a heap immediately to one side of the threshold and when a
smaller share of the data are heaped.
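For a quick numerical check, the sketch below is a deliberately crude stand-in for McCrary's local-linear density estimator (our own simplification and function names): it fits separate straight lines to the per-integer frequencies on each side of the cutoff and compares them at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(1)

def density_jump(r, bw=20, c=0):
    # Crude stand-in for McCrary's (2008) test: compare two linear fits
    # to per-integer frequencies, evaluated at the cutoff
    vals = np.arange(c - bw, c + bw + 1)
    counts = np.array([(r == v).sum() for v in vals])
    left, right = vals < c, vals >= c
    fit_l = np.polyfit(vals[left], counts[left], 1)   # [slope, intercept]
    fit_r = np.polyfit(vals[right], counts[right], 1)
    return np.polyval(fit_r, c) - np.polyval(fit_l, c)

def draw_running_variable(n=10_000):
    # Baseline DGP: 20% heaped types at multiples of 10
    heaped = rng.random(n) < 0.2
    return np.where(heaped, 10 * rng.integers(-10, 11, n),
                    rng.integers(-100, 101, n))

jump = np.mean([density_jump(draw_running_variable()) for _ in range(20)])
```

Under the baseline DGP, the heap at zero sits just inside the treated side, so the estimated density jump is positive, in line with the McCrary-style results in Panel A of Figure 2.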
In Panels C and D of Figure 2, we investigate the extent to which
we might be able to diagnose the problem by testing whether observable
characteristics are smooth through the threshold. In particular, we
consider discontinuities in a proxy variable that is equal to one for
heaped types and zero for non-heaped types plus a standard normal error
term. (9) As in the simulations testing for discontinuities in the
distribution, these results show that the estimated discontinuities in
this proxy variable are largest for small bandwidths. Moreover, with
standard error estimates that assume i.i.d. errors or allow for
heteroskedasticity, the rejection rates tend to be high. As such, like
the test for a discontinuity in the distribution, a test for covariate
balance certainly could fail as a result of non-random heaping. However,
the usefulness of this diagnostic in identifying problems that result
from non-random heaping will again depend on how the heaps are
distributed relative to the threshold, what fraction of the data are
heaped, and the set of bandwidths one can reasonably consider in
addition to the strength of the proxy variable available to the
researcher.
Supplementary Approaches to Identifying Nonrandom Heaping. Although
the conventional specification checks are well suited to diagnosing
strategic manipulation of the running variable, the results above
demonstrate that additional diagnostics may be required to identify
nonrandom heaping. In this section, we introduce two such diagnostics.
First, while researchers are inclined to produce mean plots as
standard practice, the level of aggregation is typically such that
non-random heaping can be hidden. With aggregation, researchers may
mistake heap points for noise rather than systematic outliers. As a
simple remedy, however, one can show disaggregated mean plots that
clearly distinguish heaped data from non-heaped data, as in Panel B of
Figure 1. (10) Of course, a necessary prerequisite to this approach is
knowledge of where the heaps fall, which can be revealed through a
disaggregated histogram, as in Panel A of Figure 1.
Although this approach is useful for visual inspection of the data,
a second, more-rigorous approach is warranted when systematic
differences between heaped data and non-heaped data are less obvious. To
test whether some covariate, X, systematically deviates from its
surrounding non-heaped data at a given data heap Z, one can estimate
(8) \(X_i = \gamma_0 + \gamma_1\mathbf{1}(R_i = Z) + \gamma_2(R_i - Z) + u_i,\)
using the data at heap Z itself in addition to the non-heaped data
within some bandwidth around Z. Essentially, this regression equation
estimates the extent to which characteristics "jump" off of
the regression line predicted by the non-heaped data. If \(\hat{\gamma}_1\) is
significant, then one has reason to conclude that the composition
changes abruptly at heap Z.
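Equation (8) amounts to a single OLS regression per heap. A minimal sketch (our own code; the covariate and parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def heap_jump(x, r, Z, bw=9):
    # Equation (8): does covariate x at heap Z "jump" off the line
    # fit through the surrounding non-heaped observations?
    # (bw < 10 here, so the window contains no other multiple-of-10 heap)
    keep = np.abs(r - Z) <= bw
    x, r = x[keep], r[keep]
    at_heap = (r == Z).astype(float)
    X = np.column_stack([np.ones_like(x), at_heap, r - Z])
    return np.linalg.lstsq(X, x, rcond=None)[0][1]  # gamma_1 hat

# Illustrative data: heaped types (20%) differ in a covariate by 1.0
n = 50_000
heaped = rng.random(n) < 0.2
r = np.where(heaped, 10 * rng.integers(-10, 11, n), rng.integers(-100, 101, n))
x = heaped + rng.standard_normal(n)
g1 = heap_jump(x, r, Z=50)
```

Here `g1` is well above zero because observations at the heap Z = 50 are a mix of heaped and non-heaped types, so their average covariate value sits off the line traced by the surrounding non-heaped data.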
An approach along these lines can also be used as an initial step
to considering whether there is heaping, be it random or non-random. In
particular, a disaggregated histogram is likely to be more useful than
an aggregated histogram. Moreover, one could estimate an equation like
(8) applied to the distribution of the data to test the degree to which
any potential heap points are statistically significant. That said, we
would caution against ignoring heaps that are economically but not
statistically significant.
C. What if Heaping Occurs Away from the Threshold?
In this section, we examine the extent to which non-random heaping
leads to bias when a heap does not fall immediately to one side of the
threshold. In Panel A of Figure 3, we show the estimated treatment
effect as we change the location of the heap relative to the cutoff,
while rejection rates are shown in Panel B. In order to focus on the
influence of a single heap, we adopt a bandwidth of 5 throughout this
exercise.
As before, when the data heap is adjacent to the cutoff on the
right, the estimated treatment effect is biased upwards. Conversely,
when the data heap is adjacent to the cutoff on the left, the estimated
treatment effect is biased downwards. While this is rather obvious, what
happens as we move the heap away from the cutoff is perhaps surprising.
Specifically, the bias does not converge to zero as we move the heap
away from the cutoff. Instead, the bias goes to zero and then changes
sign. This results from the influence of the heaped types on the slope
terms. For example, consider the case in which the data heap (with mean
0.5 whereas all other data are mean zero) is placed 5 units to the right
of the treatment threshold and the bandwidth is 5. In such a case, the
best-fitting regression line through the data on the right (treatment)
side of the cutoff will have a positive slope and a negative intercept.
In contrast, the best-fitting regression line through the data on the
left (control) side of the cutoff will have a slope and intercept of
zero. Thus, we arrive at an estimated effect that is negative. The
reverse holds when we consider the case in which the data heap is 5
units to the left of the treatment threshold, which yields an estimated
effect that is positive. Recall that, in all cases, the true treatment
effect is zero. Our analysis highlights that "donut-RD
approaches," which estimate the treatment effect after dropping
observations in the immediate vicinity of the treatment threshold, as in
Barreca et al. (2011), should not be thought of as a general solution to
non-random heaping, because data heaps away from the threshold may also
introduce bias. Instead, dropping observations in the immediate vicinity
of the treatment threshold should be thought of as a useful robustness
check that has the potential to highlight misspecification in any RD
design. (11)
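The worked example above is easy to verify numerically. In this sketch (our own code; sample sizes are illustrative), all heaped observations sit 5 units to the right of the cutoff, the bandwidth is 5, and the true effect is zero:

```python
import numpy as np

rng = np.random.default_rng(3)

def rd_estimate(y, r, c=0):
    # Sharp-RD regression of Equation (1)
    d = (r >= c).astype(float)
    X = np.column_stack([np.ones_like(y), d, r, r * d])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def heap_at_5(n_non=20_000, n_heap=4_000):
    # Non-heaped data uniform on {-5, ..., 5} with mean zero;
    # heaped data (mean 0.5) all sit at r = 5, inside the bandwidth
    r = np.concatenate([rng.integers(-5, 6, n_non), np.full(n_heap, 5)])
    y = np.concatenate([rng.standard_normal(n_non),
                        0.5 + rng.standard_normal(n_heap)])
    return rd_estimate(y, r)

est = float(np.mean([heap_at_5() for _ in range(100)]))
```

The average estimate is negative: the heap tilts the treated-side regression line upward and pushes its intercept below zero, exactly the sign reversal discussed above.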
D. Alternative DGPs
In this section, we demonstrate other issues related to non-random
heaping by exploring alternative DGPs that introduce differences in
slopes and treatment effects across the two types. We summarize each DGP
in Figure 4 and present the estimated effects using different approaches
to estimation. Rejection rates based on i.i.d., heteroskedastic robust,
and clustered standard errors can be found in Figures A2, A3, and A4 of
the Appendix, respectively.
DGP-2 is the same as the baseline DGP-1 except that the non-heaped
types and heaped types have different slopes instead of different means.
In particular, in DGP-2 the mean is zero for both groups, the slope is
zero for non-heaped types, and the slope is 0.01 for heaped types. These
parameters are summarized in the second column of Figure 4, in Panel A,
while the means are shown in Panel B. In Panel C, we show the estimated
treatment effects for varying bandwidths. Again, although there is no
treatment effect for non-heaped or heaped types, the estimates tend to
suggest that the average treatment effect is not zero. The estimates
display a sawtooth pattern, dropping to less than zero each time an
additional pair of data heaps is added to the analysis sample and then
climbing above zero again as additional non-heaped data are added. (12)
As in DGP-1, the bias becomes less severe at larger bandwidths, although
this may not be obvious to the eye.
While the results are not shown in Figure 4, we have also
considered a DGP that combines the salient elements of DGP-1 and DGP-2
by having the parameters for the non-heaped types remain the same (set
at zero), while having the heaped types have a higher mean (0.5) and a
higher slope (0.01). As one would expect, the estimated local average
treatment effects exhibit a positive bias that grows exponentially as
the bandwidth is reduced (as in DGP-1) and a sawtooth pattern (as in
DGP-2).
We now turn to two DGPs in which there is a treatment effect for
heaped types while maintaining the same parameters for the non-heaped
types. Specifically, for heaped types in DGP-3, we set the control mean
equal to zero and the treatment effect to 0.5, which implies an
unconditional average treatment effect of 0.1. In DGP-4, we also
introduce a slope (0.01) for the heaped types. The set of estimates in
Panel C shows that--as with the DGPs in which the average
treatment effect is zero--the standard approach to estimation yields
positively biased estimates of the average treatment effect for each of
these DGPs.
E. Addressing Non-Random Heaping
In this section, we explore several approaches to addressing the
bias induced by non-random heaping. To begin, we consider an estimation
strategy in which we simply drop data at heap points from the analysis.
Estimates based on this approach are shown in Panel D of Figure 4. For
all DGPs and all bandwidths, this approach leads to an unbiased estimate
of the treatment effect for non-heaped types, which is zero.
While the approach described above would seem to be quite useful in
most scenarios, it is important to consider whether the data can be used
more fully, either to improve precision or because we are interested in
estimates that capture the treatment effects for those observed at
heaps. In Panel E, we show that RD estimates based solely on data at
heap points provide unbiased estimates of the average treatment effect
across the two types, weighted by the share of non-heaped data that are
observed at data heaps. (13) Furthermore, Panel F shows that we can
recover an unbiased estimate of the unconditional average treatment
effect by taking a population-weighted average of the estimates based on
these two approaches to restricting the sample, that is, by combining
the estimates from Panels D and E. (14)
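A sketch of this split-and-recombine strategy under a DGP-3-style setup (our own code: no effect for non-heaped types, an effect of 0.5 for the 20% of heaped types, so the unconditional average treatment effect is 0.1):

```python
import numpy as np

rng = np.random.default_rng(4)

def rd_estimate(y, r, c=0):
    # Sharp-RD regression of Equation (1)
    d = (r >= c).astype(float)
    X = np.column_stack([np.ones_like(y), d, r, r * d])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def combined_estimate(n=10_000):
    heaped = rng.random(n) < 0.2
    r = np.where(heaped, 10 * rng.integers(-10, 11, n),
                 rng.integers(-100, 101, n))
    tau = np.where(heaped, 0.5, 0.0)   # treatment effect only for heaped types
    y = tau * (r >= 0) + rng.standard_normal(n)
    at_heap = (r % 10 == 0)            # heap points (incl. coincidental non-heaped obs)
    est_non = rd_estimate(y[~at_heap], r[~at_heap])   # as in Panel D: drop heaps
    est_heap = rd_estimate(y[at_heap], r[at_heap])    # as in Panel E: heaps only
    w = at_heap.mean()                 # population share observed at heap points
    return (1 - w) * est_non + w * est_heap           # as in Panel F

est = float(np.mean([combined_estimate() for _ in range(100)]))
```

Dropping heaps targets the (zero) effect for non-heaped types, the heaps-only estimate targets the mixture of types at heap points, and their population-weighted average recovers the unconditional average treatment effect of 0.1.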
In Panels G and H, we show that the same results cannot be achieved
by pooled approaches that control for data heaps. In particular, in
Panel G we show estimates that are based on the standard approach while
also including an indicator variable that equals 1 for observations at
data heaps and zero otherwise. These estimates clearly show that adding
this control variable is no panacea. While it does fully remove the bias
from estimates in DGP-1, in which the only difference between the
non-heaped and heaped data is their mean, it does not fully remove the
bias from estimates based on other DGPs. Panel H takes a more flexible
approach to estimation, allowing separate intercepts and trends for
observations at data heaps. This approach does better than simply
allowing a different intercept. It removes the bias when the non-heaped
and heaped data have different means or slopes while having the same
treatment effect (DGP-1 and DGP-2). However, when non-heaped types and
heaped types have different treatment effects (DGP-3 and DGP-4), this
does not recover unbiased estimates of the average treatment effect.
The main takeaway from these results is that it is possible to
obtain unbiased estimates by separately investigating non-heaped and
heaped data. At the same time, it is important to recognize that it may
not always be feasible or convincing to investigate data heaps in
isolation. Doing so requires that there be "enough" heaps
within a reasonable bandwidth of the treatment threshold. (15) It also
limits how close one can get to the treatment threshold to obtain
estimates, which increases the risk that the estimates will be biased by
a misspecified functional form. Given the advantages offered by the
approach that focuses on non-heaped data--unbiased estimates and the
ability to shrink the bandwidth considerably--an approach that restricts
the sample in this fashion would seem to be an integral part of any
RD-based investigation in the presence of heaping. The usefulness of
investigating the effects for observations at data heaps is likely to
depend a great deal on the specific context.
IV. NON-SIMULATED EXAMPLES
In this section, we present three non-simulated examples where
non-random heaping has the potential to bias RD-based estimates. First,
we examine birth weights, which have previously been used as a running
variable to identify the effects of hospital care on infant health. We
show that U.S. birth weight data exhibit nonrandom heaping that will
bias RD estimates, that our proposed diagnostics detect this problem
whereas the usual diagnostics do not, and that the approaches that were
effective at reducing the bias in the simulation are also effective in
this context. Second, we turn our attention to dates of birth, which
have previously been used as a running variable to identify the effects
of maternal education on fertility and infant health. We show that these
data also exhibit non-random heaping that could lead to bias. Third,
using the PSID, we show that non-random heaping is also present in work
hours and in earnings.
A. Birth Weight as a Running Variable
Background. Because some hospitals use birth weight cutoffs as part
of their criteria for determining medical care, a natural way of
measuring the returns to such care is to use a RD design with birth
weight as the running variable. Almond et al. (2010), hereafter ADKW,
take this approach in order to consider the effect of
very-low-birth-weight classification, that is, having a measured birth
weight strictly less than 1,500 g, on infant mortality. (16) While
Barreca et al. (2011) and Almond et al. (2011) explore the sensitivity
of the estimated effects to the treatment of observations around the
1,500-g threshold, here we use this empirical setting to illustrate the
broader econometric issues associated with heaping. (17)
To begin, consider that birth weights can be measured using a
hanging scale, a balance scale, or a digital scale, each of them rated
in terms of their resolution. Modern digital scales marketed as
"neonatal scales" tend to have resolutions of 1, 2, or 5 g.
Products marketed as "digital baby scales" tend to have
resolutions of 5, 10, or 20 g. Mechanical baby scales tend to have
resolutions between 10 and 200 g. Birth weights are also frequently
measured in ounces, with ounce scales varying in resolution from 0.1 to
4 ounces. Because not all hospitals have high-performance neonatal
scales, especially going back in time, a certain amount of heaping at
round numbers is to be expected.
In the discussion of the simulation exercise above, we recommended
that researchers produce disaggregated histograms for the running
variable. We do so for birth weights in Panel A of Figure 5. (18) As
also noted in ADKW, this figure clearly reveals heaping at 100-g and
ounce multiples, with the latter being most dramatic. Although we focus
on these heaps throughout the remainder of this section to elucidate
conceptual issues involving non-random heaping, a more-complete analysis
of the use of birth weights as a running variable would need to consider
heaps at even smaller intervals (e.g., 50-g and half-ounces). In any
case, to the extent to which any of the observed heaping can be
predicted by attributes related to mortality, our simulations imply that
standard RD estimates are likely to be biased.
In considering the potential for heaping to be systematic in a way
that is relevant to the research question, we first note that scale
prices are strongly related to scale resolutions. Today, the
least-expensive scales cost less than $100, whereas the most expensive
cost approximately $2,000. For this reason, it is reasonable to expect
more precise birth weight measurements at hospitals with greater
resources, or at hospitals that tend to serve more-affluent patients.
(19) That is, one might anticipate that the heaping is systematic in a
non-trivial way.
Diagnostics. ADKW noted that there was significant heaping at
round-gram numbers and at gram equivalents of ounce multiples. However,
they did not test whether the heaping was random. They did, of course,
perform the usual specification checks to test for non-random sorting
across the treatment threshold. These tests do not reveal statistically
significant discontinuities in characteristics across the treatment
threshold. Moreover, despite there being two obvious heaps immediately
to the right of the treatment threshold at 1,500 and 1,503 g (i.e., 53
ounces), they find that the estimated discontinuity in the distribution
is not statistically significant.
In addition, ADKW make the rhetorical argument that there are not
irregular heaps around the 1,500-g threshold of interest as the heaps
are similar around 1,400 and 1,600 g. With respect to the usual concerns
about non-random sorting, this argument is compelling. In particular,
the usual concern is that agents might engage in strategic behavior so
that they are on the side of the threshold that gives them access to
favorable treatment. While this is a potential issue for the 1,500-g
threshold, it is not an issue around 1,400 and 1,600 g. As we also see
heaping at the 1,400- and 1,600-g thresholds, it makes sense to conclude
that the heaping observed at the 1,500-g threshold is
"normal." The problem with this line of reasoning, however, is
that all of the data heaps may be systematic outliers in their
composition.
Following our recommended procedure, in the second graph in Panel A
of Figure 5, we plot the fraction of children that are White against
recorded birth weights, visually differentiating heaps at 100-g and
ounce intervals. This is our first strong evidence that the heaping
apparent in Panel A is non-random--children at the 100-g heaps are
disproportionately likely to be non-White. This pattern, which suggests
that those at 100-g heaps are relatively disadvantaged, is echoed in
Figure A6 in the Appendix, which presents similar results for
mother's education and Apgar scores. In contrast, children at ounce
heaps are disproportionately likely to be White, to have mothers with at
least a high-school education, and to have high Apgar scores. However,
compared with those at 100-g heaps, it is far less clear that those at
ounce heaps are outliers in their underlying characteristics,
highlighting the usefulness of a more-formal approach. Our more-formal
approach to exploring the extent to which the composition of children
changes abruptly at reporting heaps is based on Equation (8), where
$X_i$ is a characteristic of individual i with birth weight
$Z_i$ and we separately consider Z = {1,000, 1,100, ..., 3,000} and
gram equivalents of ounce multiples. As discussed in the simulation
exercise, this diagnostic is not intended to detect a mean shift across
Z but, rather, the extent to which characteristics at heap Z differ from
what would be expected based on surrounding (non-heaped) observations.
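As a concrete sketch of this diagnostic (using simulated data and hypothetical parameter values, since Equation (8) is not reproduced here), one can regress a characteristic on a local trend in the running variable plus an indicator for the heap point, reporting the estimated jump relative to the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: the share White trends smoothly in
# recorded birth weight, but children recorded exactly at the heap
# point z0 = 1,500 g are drawn from a different (disadvantaged) mix.
z = rng.integers(1450, 1551, size=20000).astype(float)
z0 = 1500.0
at_heap = (z == z0)
p_white = 0.6 + 0.001 * (z - z0) - 0.2 * at_heap
white = (rng.random(z.size) < p_white).astype(float)

# Regress the characteristic on an intercept, a centered trend, and a
# heap indicator; the heap coefficient measures how far observations
# at z0 sit off the trend implied by the surrounding data.
X = np.column_stack([np.ones_like(z), z - z0, at_heap.astype(float)])
beta, *_ = np.linalg.lstsq(X, white, rcond=None)
pct_jump = 100 * beta[2] / beta[0]
print(f"estimated deviation at the heap: {pct_jump:.1f}%")
```

In practice one would run this regression heap by heap and bootstrap the standard errors, as in the text, rather than for a single simulated heap.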
The results from this regression analysis confirm that child
characteristics change abruptly at data heaps. Focusing on the 100-g
heaps, in the first graph in Panel B of Figure 5 we show estimated
percent changes, i.e., the estimated jump at the heap divided by the
estimated intercept from Equation (8), for the probability that a
mother is White. (20) For nearly every estimate, bootstrapped standard
error estimates are small enough to reject that the characteristics of
children at Z are on the trend line. Similarly, in the same panel, we
demonstrate that those at ounce heaps also tend to systematically
deviate from the trend based on surrounding observations, except that
more-affluent types are disproportionately likely to have birth
weights recorded in ounces. (21) In addition, it is clear that the data
at ounce heaps have a different slope from the non-heaped data, which
our simulation exercises revealed to be problematic. In the end, we note
that whereas the standard validation exercises fail to detect these
important sources of potential bias, this simple procedure proves
effective.
Non-Random Heaping, Bias, and Corrections. Given these
relationships between characteristics that predict both infant mortality
and heaping in the running variable, our simulation exercise suggests
standard RD estimates will be biased. To illustrate this issue, we
replicate ADKW's analysis of the 1,500-g cutoff while also
considering placebo cutoffs of 1,000, 1,100, ..., 3,000 g. (22) In
particular, we use the regression described in Equation (1) and an 85-g
bandwidth to consider the extent to which there are mean shifts in
mortality rates across the considered thresholds. We plot percent
changes, 100 multiplied by the estimated treatment effect divided by the
intercept, for greater comparability across the differing cutoffs. (23)
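A minimal sketch of this percent-change statistic (with hypothetical simulated data; note that, as in ADKW's setting, "treatment" falls on observations below the cutoff):

```python
import numpy as np

def rd_percent_change(z, y, cutoff, h):
    """Local linear RD with a uniform kernel and bandwidth h: regress y
    on an intercept, a centered trend, an indicator for being below the
    cutoff, and the trend interacted with that indicator. Returns 100
    times the estimated jump divided by the intercept (the predicted
    mean just at or above the cutoff)."""
    keep = np.abs(z - cutoff) < h
    zc, yc = z[keep] - cutoff, y[keep]
    below = (zc < 0).astype(float)
    X = np.column_stack([np.ones_like(zc), zc, below, zc * below])
    b, *_ = np.linalg.lstsq(X, yc, rcond=None)
    return 100 * b[2] / b[0]

# Hypothetical check: a DGP with a known 20% drop in mortality below
# the cutoff should yield an estimate close to -20.
rng = np.random.default_rng(1)
z = rng.integers(1000, 3001, size=20000).astype(float)
y = 0.10 - 0.02 * (z < 1500) + rng.normal(0, 0.01, size=z.size)
print(round(rd_percent_change(z, y, 1500.0, 85.0), 1))  # close to -20
```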
In Panel A of Figure 6, we present the estimated percent impacts on
1-year mortality, which suggest that near any of the considered cutoffs
c, children with birth weights less than c routinely have better
outcomes than those with birth weights at or above c. Given that these
results are largely driven by a systematically different composition of
children at the 100-g heaps that coincide with the cutoffs, the
estimated effects are much larger in magnitude when one uses a narrower
bandwidth. For example, with a bandwidth of 30 g, 42 of 42 point
estimates fall below zero. (24)
Before turning to the approaches to estimation suggested by our
simulation exercises, we first consider the extent to which the bias
evident in Panel A may be addressed by the inclusion of an extensive
battery of control variables. In particular, Panel B of Figure 6 shows
estimated effects from specifications that add fixed effects for state,
year, birth order, the number of prenatal care visits, gestational
length in weeks, and mothers' and fathers' ages in 5-year bins (with
<15 and >40 as additional categories), along with indicator variables
for multiple births, male, Black, other race, Hispanic, the mother
having less than a high-school education, the mother having a
high-school education, the mother having a college education or more,
and whether the mother lives in her state of birth. Although the
estimates shrink toward zero, they are qualitatively similar to those
reported in Panel A--they imply that having a birth weight less than the
considered cutoff reduces infant mortality in almost all cases. We
interpret this set of estimates as evidence that this approach has not
effectively dealt with the composition bias.
As illustrated in the simulation exercise, an effective approach to
dealing with non-random heaping is to estimate the effect after dropping
observations at data heaps. While a drawback of this method is that it
cannot tell us about the treatment effect for the types who tend to be
observed at data heaps, it is consistent with the usual motivation for
RD designs. Specifically, researchers will focus on what might be
considered a relatively narrow sample in order to be more confident that
they can identify unbiased estimates.
In Panel C of Figure 6, we show the estimated effects on infant
mortality based on this approach, omitting those at 100-g and ounce
heaps from the analysis. While the earlier estimates (in Panels A and B)
were negative for most of the placebo cutoffs, these estimates resemble
the white-noise process we would anticipate in the absence of treatment
effects. Thus, these results indicate that the sample restrictions we
employ reduce the bias produced by the non-random heaping described
above. These results also suggest that the estimated effect of
very-low-birth-weight classification is zero for children born at
hospitals where birth weights are not recorded in hundreds of grams or
in ounces.
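A minimal sketch of this sample restriction, assuming heaps at 100-g multiples and at gram equivalents of whole ounces (the conversion factor and ounce range below are illustrative):

```python
import numpy as np

# Illustrative heap points: 100-g multiples and gram equivalents of
# whole ounces (1 oz is approximately 28.35 g, recorded as rounded grams).
gram_heaps = np.arange(1000, 3001, 100)
ounce_heaps = np.round(np.arange(36, 106) * 28.35)
heaps = np.union1d(gram_heaps, ounce_heaps)

def off_heap_mask(z, heap_points=heaps):
    """True for observations whose recorded birth weight is NOT at a
    heap point; estimation then proceeds on z[mask], y[mask]."""
    return ~np.isin(z, heap_points)

z = np.array([1497.0, 1500.0, 1503.0, 1512.0])
print(off_heap_mask(z))  # drops 1500 (100-g heap) and 1503 (53 oz)
```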
Our simulation exercise also showed that an approach that focuses
solely on heaped types can yield unbiased estimates where the data allow
for such estimates. In this setting, the ounce heaps can be considered
in this manner. (25) We show these results in Panel D of Figure 6. The
estimates do indicate a significant effect of very-low-birth-weight
classification for children born at hospitals where birth weights are
recorded in ounces, which is consistent with Almond et al.'s (2011)
evidence that very-low-birth-weight classification is particularly
relevant for those born at low-quality hospitals (where birth weights
are more likely to be recorded in ounces). That said, the magnitude and
statistical significance of the estimated effects at the placebo cutoffs
suggest that the estimates should be interpreted with caution. (26)
B. Dates of Birth and Ages as Running Variables
A common approach to estimating the effects of education on
outcomes is to use variation driven by small differences in birth timing
that straddle school-entry-age cutoffs. For example, "5 years old
on December 1" is a common school-entry requirement. As such, the
causal effect of education on outcomes can be measured by comparing the
outcomes of individuals born just before December 2 to the outcomes of
those born shortly thereafter, who begin school later and thereby tend
to obtain fewer years of education.
Dobkin and Ferreira (2010) use this approach to investigate the
effects of education on job market outcomes, whereas McCrary and Royer
(2011) use this approach to identify the causal effect of maternal
education on fertility and infant health using restricted-use birth
records from California and Texas. In the first graph of Figure 7, Panel
A, we use the same California birth records used by McCrary and Royer
(2011) and show the distribution of mothers' reported birth dates
across days of the month. (27) Although less striking than in the birth
weight example, this figure shows that there are data heaps at the
beginning of each month and at multiples of 5. The second graph in Panel
A shows one of many indications that those at data heaps are
outliers--that the mothers at these data heaps are disproportionately
less likely to have used tobacco during their pregnancies. This
phenomenon is not specific to tobacco use, however. Similar patterns are
equally evident in mother's race, father's race, mother's
education, father's education, the fraction having father's
information missing, or the fraction having pregnancy complications,
along with a wide array of child outcomes. (28)
It turns out that this non-random heaping is unlikely to be a
serious issue for the main results presented by McCrary and Royer (2011)
because their preferred bandwidth of 50 leaves their estimates
relatively insensitive to the high-frequency composition shifts
described above. (29) At the same time, it is important to keep in mind
that, were more data available, conventional practice would have them
choose a smaller bandwidth. Our simulation exercise demonstrates that
this practice of shrinking the bandwidth with more data would make the
bias associated with non-random heaping more severe.
This issue of heaping in dates of birth is also present in Shigeoka
(2014) who considers a discontinuity in patient cost sharing at age 70
in Japan. He circumvents any issues associated with the data heaps (at
the day level) by collapsing ages to the monthly level. This sort of
approach can only be used when the heaping function is
known so that the data that are not at heap points can be imputed
to the appropriate heap point (left or right) and when doing so would
not change the implied treatment status. In Shigeoka's context,
these conditions are met because the heaping is in the day of birth and
treatment coincides with the beginning of the month after age 70.
A similar issue also appears in Edmonds, Mammen, and Miller (2005)
who consider a discontinuity in women's pension receipt at age 60
in South Africa. They note that their data exhibit heaping at ages in
round decades and highlight that this heaping is non-random as
"women at age 60 generally look different than would be predicted
by the trend prior to age 60 and the trend after 60." In line with
the solutions we describe above, they exclude women at age 60 from
estimation and explain that this alters the population to which the
results are applicable. That said, the results of our simulation would
support excluding women with ages at any multiple of 10 because
non-random heaps that are far from the threshold have the potential to
introduce bias.
C. Income as a Running Variable
Given how many policies are income-based, there are several
examples where treatment effects might be identified using an RD design
with income as the running variable. For example, one might consider
this strategy to identify the effects of various tax incentives,
financial aid offers, the Special Supplemental Nutrition Program for
Women, Infants, and Children (WIC), or the Children's Health
Insurance Program (CHIP), which subsidizes health insurance for families
with incomes that are marginally too high to qualify for Medicaid.
In light of our results above, Saez's (2010) analysis of the
distribution of income tax data highlights a potential problem for any
such study. In particular, the fact that self-employed taxpayers bunch
at tax kink points but others do not indicates non-random heaping.
Taking a closer look at income data based on the PSID shows even more
systematic heaping. In particular, Panel B of Figure 7 shows that there
is significant heaping at $1,000 multiples and that individuals at these
data heaps are substantially less likely to be White than those with
similar incomes who are not at these data heaps. (30,31)
V. DISCUSSION AND CONCLUSION
In this paper, we have demonstrated that the RD design's
smoothness assumption is inappropriate when there is non-random heaping.
In particular, we have shown that RD-estimated effects are afflicted by
composition bias when attributes related to the outcomes of interest
predict heaping in the running variable. Furthermore, the estimates will
be biased regardless of whether the heaps are close to the treatment
threshold or far away (but within the bandwidth).
While composition bias is not a new concern for RD designs, the
type of composition bias that researchers tend to test for is of a very
special type. In particular, the convention is to test for mean shifts
in characteristics taking place at the treatment threshold. This
diagnostic is often motivated as a test for whether or not certain types
are given special treatment or better able to manipulate the system in
order to obtain favorable treatment. In this paper, we suggest that
researchers also need to be concerned with abrupt compositional changes
that may occur at heap points.
We propose two supplementary approaches to establishing the
validity of RD designs when the distribution of the running variable has
heaps. While the importance of showing disaggregated mean plots is well
established as a way to visually confirm that estimates are not driven
by misspecification (Cook and Campbell 1979), our examples demonstrate
that researchers should highlight data at reporting heaps in such plots
in order to visually inspect whether there is non-random heaping. As a
more-formal diagnostic to be used when the problem is not obvious, we
suggest that researchers estimate the extent to which characteristics at
heap points "jump" off of the trend predicted by non-heaped
data.
We consider several different approaches to addressing the bias
that non-random heaping introduces into standard RD estimates.
Approaches that control flexibly for data heaps reduce but do not remove
the bias. In contrast, approaches that stratify the data do provide
unbiased estimates. In particular, an analysis that simply drops the
data at reporting heaps yields an unbiased estimate of the treatment
effect for non-heaped types. Moreover, if there are a sufficient number
of heaps within a reasonable bandwidth of the threshold, a researcher
can separately analyze these data to obtain an unbiased estimate that
captures a weighted average of the treatment effects for heaped and
non-heaped types (as both are present at data heaps). Where this is
feasible, the two unbiased estimates can be combined to provide an
estimate of the unconditional average treatment effect.
doi: 10.1111/ecin.12225
ABBREVIATIONS
CHIP: Children's Health Insurance Program
DGP: Data-Generating Process
PSID: Panel Study of Income Dynamics
RD: Regression Discontinuity
WIC: Women, Infants, and Children
APPENDIX
Understanding the Sawtooth Pattern in Estimated Treatment Effects
To understand the sawtooth pattern first exhibited in DGP2, Figure
A1 plots the regression lines using selected bandwidths. The
short-dashed lines are based on a bandwidth of 10, where the data
include three heap points, R = {-10, 0, 10}. These lines show that the
non-random heaping captured in DGP2 leads to an estimate that is
negatively biased. In particular, the heap at R = -10 has two effects on
the regression line on the left side of the threshold. First, this heap
causes the regression line to shift down because it pulls down the
center of mass. (32) Second, it induces a positive slope in order to
bring the regression line closer to the heaped data at the edge of the
bandwidth. As it turns out, the slope is large enough that the
regression line crosses zero from below, which results in a positive
expected value approaching the treatment threshold from the left. The
heap at R = 10 has similar effects on the regression line on the right
side of the threshold--shifting the regression line up, inducing a
positive slope such that the expected value is negative approaching the
treatment threshold from the right. As such, approaching the threshold
from each side, we arrive at a negative difference in expected value.
The dash-and-dot line in Figure A1 uses a bandwidth of 18 to
demonstrate how the same DGP can arrive at positive estimates. Again, on
both the left and right sides of the treatment threshold, the sum of
squared errors is minimized by a positively sloped regression line.
However, with more non-heaped data, including a sizable share to the
left of the data heap at R = -10 and to the right of the data heap at R
= 10, the magnitude of the slope is much smaller. As a result, neither
regression line, on the left side or the right side of the threshold,
crosses zero. Thus, we have a negative expected value approaching the
treatment threshold from the left and a positive expected value
approaching the treatment threshold from the right, that is, a positive
estimate of the treatment effect.
Last, the solid line in Figure A1 plots the regression lines using
a bandwidth of 20. Here, it is important to keep in mind that the
increase in the bandwidth has introduced data heaps at R = -20 and R =
20 to the analysis. Not surprisingly, the bandwidth of 20 shares a lot
in common with the bandwidth of 10. In particular, the heaps at the
boundary of the bandwidth influence the slope parameters such that the
regression lines cross zero. As such, we again find a negative estimate
of the treatment effect when the true effect is zero.
As shown in the second column of Panel B in Figure 4, these
phenomena occur in a systematic fashion as we change the bandwidth. Each
time a new set of heaps is introduced, the slope estimate becomes
sharply positive, which leads the regression lines on each side of the
cutoff to pass through zero, leading to negative estimates of the
treatment effect. As we increase the bandwidth beyond a set of heaps,
however, the slope terms shrink in magnitude, the regression lines no
longer pass through zero, and we arrive at positive estimates of the
treatment effect. The process repeats again when the increase in
bandwidth introduces a new set of heaps.
REFERENCES
Aiken, L. S., S. G. West, D. E. Schwalm, J. Carroll, and S. Hsuing.
"Comparison of a Randomized and Two Quasi-Experimental Designs in a
Single Outcome Evaluation: Efficacy of a University-Level Remedial
Writing Program." Evaluation Review, 22(4), 1998, 207-44.
Almond, D., J. J. Doyle Jr., A. E. Kowalski, and H. Williams.
"Estimating Marginal Returns to Medical Care: Evidence from At-risk
Newborns." Quarterly Journal of Economics, 125(2), 2010, 591-634.
--. "The Role of Hospital Heterogeneity in Measuring Marginal
Returns to Medical Care: A Reply to Barreca, Guldi, Lindo, and
Waddell." Quarterly Journal of Economics, 126(4), 2011, 2125-31.
Barreca, A. I., M. Guldi, J. M. Lindo, and G. R. Waddell.
"Saving Babies? Revisiting the Effect of Very Low Birth Weight
Classification." Quarterly Journal of Economics, 126(4), 2011,
2117-23.
Berk, R., G. Barnes, L. Ahlman, and E. Kurtz. "When Second
Best Is Good Enough: A Comparison Between a True Experiment and a
Regression Discontinuity Quasi-Experiment." Journal of Experimental
Criminology, 6, 2010, 191-208.
Black, D., J. Galdo, and J. Smith. "Evaluating the Regression
Discontinuity Design Using Experimental Data." Mimeo, University of
Chicago, 2005.
Buddelmeyer, H., and E. Skoufias. "An Evaluation of the
Performance of Regression Discontinuity Design on PROGRESA." World
Bank Policy Research Working Paper No. 3386, 2004.
Cho, J. S., and H. White. "Testing for Regime Switching."
Econometrica, 75(6), 2007, 1671-720.
Cook, T. D. "'Waiting for Life to Arrive': A History
of the Regression-Discontinuity Design in Psychology, Statistics and
Economics." Journal of Econometrics, 142(2), 2008, 636-54.
Cook, T. D., and D. T. Campbell. Quasi-Experimentation: Design and
Analysis Issues for Field Settings. Chicago: Rand McNally, 1979.
Cook, T. D., and V. C. Wong. "Empirical Tests of the Validity
of the Regression Discontinuity Design." Annals of Economics and
Statistics, 91/92, 2008, 127-50.
Dickert-Conlin, S., and T. Elder. "Suburban Legend: School
Cutoff Dates and the Timing of Births." Economics of Education
Review, 29(5), 2010, 826-41.
Dobkin, C., and F. Ferreira. "Do School Entry Laws Affect
Educational Attainment and Labor Market Outcomes?" Economics of
Education Review, 29(1), 2010, 40-54.
Dong, Y. "Regression Discontinuity Applications with Rounding
Errors in the Running Variable." Journal of Applied Econometrics,
30(3), 2015, 422-46.
Edmonds, E., K. Mammen, and D. L. Miller. "Rearranging the
Family? Income Support and Elderly Living Arrangements in a Low-Income
Country." Journal of Human Resources, 40(1), 2005, 186-207.
Hahn, J., P. Todd, and W. van der Klaauw. "Identification and
Estimation of Treatment Effects with a Regression-Discontinuity
Design." Econometrica, 69(1), 2001, 201-9.
Imbens, G. W., and T. Lemieux. "Regression Discontinuity
Designs: A Guide to Practice." Journal of Econometrics, 142(2),
2008, 615-35.
LaLonde, R. "Evaluating the Econometric Evaluations of
Training with Experimental Data." The American Economic Review,
76(4), 1986, 604-20.
Lee, D. S. "Randomized Experiments from Non-random Selection
in U.S. House Elections." Journal of Econometrics, 142(2), 2008,
675-97.
Lee, D. S., and D. Card. "Regression Discontinuity Inference
with Specification Error." Journal of Econometrics, 142(2), 2008,
655-74.
Lee, D. S., and T. Lemieux. "Regression Discontinuity Designs
in Economics." Journal of Economic Literature, 48(2), 2010,
281-355.
McCrary, J. "Manipulation of the Running Variable in the
Regression Discontinuity Design: A Density Test." Journal of
Econometrics, 142(2), 2008, 698-714.
McCrary, J., and H. Royer. "The Effect of Female Education on
Fertility and Infant Health: Evidence from School Entry Policies Using
Exact Date of Birth." American Economic Review, 101(1), 2011,
158-95.
Saez, E. "Do Taxpayers Bunch at Kink Points?" American
Economic Journal: Economic Policy, 2(3), 2010, 180-212.
Shadish, W., R. Galindo, V. Wong, P. Steiner, and T. Cook. "A
Randomized Experiment Comparing Random to Cutoff-Based Assignment."
Psychological Methods, 16(2), 2011, 179-219.
Shigeoka, H. "The Effect of Patient Cost Sharing on
Utilization, Health, and Risk Protection." American Economic
Review, 104(7), 2014, 2152-84.
Van der Klaauw, W. "Regression-Discontinuity Analysis: A
Survey of Recent Developments in Economics." Labour: Review of
Labour Economics and Industrial Relations, 22(2), 2008, 219-45.
(1.) See Aiken et al. (1998), Buddelmeyer and Skoufias (2004),
Black, Galdo, and Smith (2005), Cook and Wong
(2008), Berk et al. (2010), and Shadish et al. (2011) who describe
within-study comparisons similar to LaLonde (1986).
(2.) This framework corresponds to a two-component mixture model
that could naturally be expanded to allow for more types. A huge number
of papers across statistics and economics have wrestled with how to
identify such models. See Cho and White (2007) for a recent treatment of
the general problem.
(3.) As a simple case, consider a scenario in which there is no
treatment effect and no slope for $K^* = 0, 1$. As such, the true
model simplifies to:

(6) $Y_i = \alpha_0 + \alpha_1 K^*_i + e_i$.

While it may not be immediately obvious that estimating Equation
(5) would yield a biased estimate, note that Equation (3) implies
$K^*_i = (R_i - R^*_i)(\Gamma(R^*_i) - R^*_i)^{-1}$ and, thus, the
true model in this case could be rewritten

(7) $Y_i = \alpha_0 + \alpha_1 R_i (\Gamma(R^*_i) - R^*_i)^{-1} -
\alpha_1 R^*_i (\Gamma(R^*_i) - R^*_i)^{-1} + e_i$.

As such, the estimates based on the usual RD model (Equation 5) may
be biased because $u_i = \alpha_1 R_i (\Gamma(R^*_i) - R^*_i)^{-1} -
\alpha_1 R^*_i (\Gamma(R^*_i) - R^*_i)^{-1} + e_i$.
(4.) Fundamental to the challenge to identification we consider is
that only some (non-random) observations are heaped. See Dong (2015) for
a consideration of random rounding in the running variable.
(5.) To make this example concrete, one can think of estimating the
effect of free school lunches--typically offered to children in
households with income below some set percentage of the poverty line--on
the number of absences per week. The running variable could then be
thought of as the difference between the poverty line and family income,
with treatment provided when the poverty line (weakly) exceeds reported
income ($R_i \geq 0$). In this example, there
may be heterogeneity in how individuals report their incomes--some
individuals may report in dollars (non-heaped types), whereas others may
report their incomes in tens of thousands of dollars (heaped types).
Furthermore, supposing that non-heaped types are expected to be absent
zero days per week regardless of whether they are given free lunch and
heaped types are expected to be absent 0.5 days per week regardless of
whether they are given free lunch, then we would expect to see a mean
plot similar to that of Panel B of Figure 1. That is, we have a setting
in which treatment (free school lunch) has no impact on the outcome
(absences). However, as we show below, the non-random nature of the
heaping will cause the standard RD estimated effects to go awry.
Motivating this thought experiment, in Section IV.C we demonstrate that
there is systematic heterogeneity in how individuals report income
levels, with White individuals being less likely to report incomes in
thousands of dollars.
(6.) This evidence highlights the usefulness of comparing estimates
at various bandwidth levels, as proposed by van der Klaauw (2008).
(7.) This issue does not appear to be specific to heaping-induced
model misspecification. As one simple but illustrative example, we have
investigated a DGP in which [y.sub.i] = [r.sup.2.sub.i] + [e.sub.i] with
[e.sub.i] drawn from a standard normal distribution and [r.sub.i] drawn
from a discrete uniform distribution on {-20, -19, ..., 20) with r = 0
omitted for symmetry. A linear (and thus misspecified) RD model produces
discontinuity estimates centered on zero, which implies we should reject
the null hypothesis of no discontinuity at the 5% level, 5% of the time
if we are using the correct standard-error estimates. This is the case
when inference is based on heteroskedasticity-consistent standard-error
estimates. However, inference based on clustered standard-error
estimates leads to rejection rates of zero, owing to standard-error
estimates that are 5-6 times too large in this instance. Results are
qualitatively similar with alternative nonlinear DGPs involving
higher-order polynomials and/or trigonometric functions.
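This simulation is straightforward to reproduce. The following sketch checks only that the linear RD estimates are centered on zero; it does not implement the clustered-standard-error comparison:

```python
import numpy as np

rng = np.random.default_rng(2)
support = np.array([r for r in range(-20, 21) if r != 0], dtype=float)

def linear_rd(r, y):
    """Misspecified linear RD: separate intercepts and slopes on each
    side of r = 0; returns the estimated discontinuity."""
    right = (r > 0).astype(float)
    X = np.column_stack([np.ones_like(r), r, right, r * right])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[2]

estimates = []
for _ in range(200):
    r = rng.choice(support, size=2000)
    y = r ** 2 + rng.normal(size=r.size)  # quadratic DGP, standard normal errors
    estimates.append(linear_rd(r, y))
print(round(float(np.mean(estimates)), 2))  # centered on zero by symmetry
```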
(8.) We should also emphasize that this test, as described by
McCrary (2008), is not meant to identify data heaps, but to identify
circumstances in which there is manipulation of the running variable.
This type of behavior in which individuals close to the threshold exert
effort to move to the "preferred" side of the cutoff will
produce a distribution that is qualitatively different from simply
having a data heap on one side of the threshold. In particular, this
behavior will produce more of a shift in the distribution at the
treatment threshold, whereas heaping produces blips in the distribution
that may or may not coincide with the treatment threshold.
(9.) Recall our earlier reference to $K^*_i$ as an unobservable
indicator that i is a heaped type. In considering a proxy variable
indicating heaped types, we surmise that such a proxy may be arrived
at in practice through institutional knowledge or common practice
(as might be the case when rounding leads to heaping, for example). We
also envision some experimenting with methods for the identification of
heaps, with more systematic heaping patterns across the distribution of
the running variable better facilitating their discovery. This was the
case in our initial consideration of heaping in birth weight (Barreca et
al. 2011), for example.
(10.) We recommend this type of plot as a complement to
more-aggregated mean plots rather than a substitute. More-aggregated
mean plots may be more useful when trying to discern what functional
form should be used in estimation and whether or not there is a
treatment effect.
(11.) The relationship highlighted in Panel A of Figure 3--that the
sign of the bias depends on the location of the heap relative to the
cutoff--also reveals a potential special case in which the heaping is
such that equal and opposing biases of the estimates of the conditional
expectation function on each side of the threshold results in an
unbiased (though imprecise) estimate of the true treatment effect. While
such a data-generating process can be imagined, it is rather particular
and we thus imagine that there is room to consider the implications of
heaping even in such an environment.
(12.) For a detailed explanation of this sawtooth pattern, see the
Appendix.
(13.) In particular, both non-heaped and heaped types contribute to
the average treatment effect across observations at heap points.
Non-heaped types account for 80% of the full sample, but only 10% of
these fall at heap points, so they represent 0.1 x 80/(0.1 x 80 + 20)
of the sample at heaps. Heaped types--20% of the full sample--account
for 20/(20 + 8) of the sample at heaps. As such, the average treatment
effect at heaps is (8/28) x 0 + (20/28) x 0.5 = 0.357.
(14.) Combining the heaped and non-heaped analyses yields an
average treatment effect of 72/100 x 0 + 28/100 x 0.357 = 0.1.
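The arithmetic in this and the preceding footnote can be checked directly:

```python
# Verifying the arithmetic in footnotes 13 and 14.
non_heaped_share, heaped_share = 0.8, 0.2
p_non_heaped_at_heap = 0.1  # 10% of non-heaped types fall at heap points
te_non_heaped, te_heaped = 0.0, 0.5

# Composition of the sample observed at heap points: 8 parts non-heaped
# (0.1 x 80) to 20 parts heaped.
nh_at_heaps = non_heaped_share * p_non_heaped_at_heap  # 0.08
h_at_heaps = heaped_share                              # 0.20
ate_at_heaps = (nh_at_heaps * te_non_heaped
                + h_at_heaps * te_heaped) / (nh_at_heaps + h_at_heaps)
print(round(ate_at_heaps, 3))  # 0.357

# Footnote 14: combining with the off-heap analysis (72% of the data).
ate_overall = ((1 - nh_at_heaps - h_at_heaps) * te_non_heaped
               + (nh_at_heaps + h_at_heaps) * ate_at_heaps)
print(round(ate_overall, 3))  # 0.1
```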
(15.) For example, when there are data heaps at multiples of 10, as
in the DGPs we consider, 20 is the smallest bandwidth one could use to
estimate Equation (1), as it requires at least two observations on each
side of the threshold.
(16.) While our simulation exercise follows the convention that
treatment falls on those to the right of the threshold, note that
treatment falls on the left in this setting.
(17.) In so doing, we also shed light on why estimated effects of
very-low-birth-weight classification are sensitive to the treatment of
observations bunched around the 1,500-g threshold.
(18.) We use the same data as ADKW throughout this section: Vital
Statistics Linked Birth and Infant Death Data from 1983-1991 and
1995-2002; linked files are not available for 1992-1994. These data
combine information available on an infant's birth certificate with
information on the death certificate for individuals less than 1 year
old at the time of death. As such, the data provide information on the
infant, the infant's health at birth, the infant's death
(where applicable), the family background of the infant, the geographic
location of birth, and maternal health and behavior during pregnancy. We
do not, however, have access to the treatment data that ADKW use to
estimate a first stage, which in turn allows them to construct
two-sample IV estimates of the effect of treatment on mortality. For
information on our sample construction, see Almond et al. (2010).
(19.) With general improvement in technology, one would anticipate
that measurement would appear more precise in the aggregate over time.
We show that this is indeed the case in Figure A5 in the Appendix, which
also foreshadows the systematic relationship between heaping and
measures of socioeconomic status. Note that a major reason the
figure does not show smooth trends is that data are not consistently
available for all states.
(20.) Results focusing on other child characteristics are shown in
Figure A6.
(21.) Although the estimates for each ounce heap are rarely
statistically significant, it is obvious that the set of estimates is
jointly significant and that the individual estimates would usually be
significant with a bandwidth larger than 85 g.
(22.) We note that not all of these are "true placebo
cutoffs," as 1,000 g corresponds to the extremely-low-birth-weight
cutoff and 2,500 g corresponds to the low-birth-weight cutoff.
(23.) Confidence intervals are based on a bootstrap with 500
replications in which observations are drawn at random. Specifically,
the confidence intervals shown reflect the 2.5th and 97.5th percentiles
of the 500 estimates produced from this procedure.
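A percentile bootstrap of this kind can be sketched as follows. This is a minimal illustration of the resampling-and-percentile procedure the note describes, not the authors' code; the estimator here is a stand-in (a sample mean) rather than an RD estimator.

```python
import numpy as np

def percentile_bootstrap_ci(data, estimator, reps=500, alpha=0.05, seed=0):
    """Resample observations at random, re-estimate on each draw, and
    take the 2.5th and 97.5th percentiles of the resulting estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([
        estimator(data[rng.integers(0, n, size=n)]) for _ in range(reps)
    ])
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Toy example: a 95% CI for a sample mean (stand-in for an RD estimate).
data = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=400)
lo, hi = percentile_bootstrap_ci(data, np.mean)
```

With 400 observations centered at 2.0, the interval should be a narrow band around that value.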
(24.) Results are similar if one uses triangular kernel weights
that also place greater emphasis on observations at 100-g heaps. ADKW
mention having considered the effects at these same placebo cutoffs,
motivating the analysis as follows:
[A]t points in the distribution where we do not anticipate
treatment differences, economically and statistically
significant jumps of magnitudes similar to our
VLBW treatment effects could suggest that the discontinuity
we observe at 1,500 grams may be due to natural
variation in treatment and mortality in our data.
They do not present these results but instead report:
In summary, we find striking discontinuities in treatment
and mortality at the VLBW threshold, but less
convincing differences at other points of the distribution.
These results support the validity of our main
findings.
We disagree with this interpretation of the results.
(25.) In principle, the observations at 100-g heaps could also be
analyzed in this manner; however, the analysis would require extremely
large bandwidths.
(26.) One a priori reason for caution is that the 85-g bandwidth
effectively means that the estimates are identified using
observations at no more than six data heaps. Estimates using a larger
bandwidth of 150 g are shown in Figure A7 in the appendix. It is also
possible that the estimates could be confounded by systematic deviations
at ounce multiples that correspond to pounds and fractions thereof.
(27.) The California Vital Statistics Data span 1989 through 2004.
These data, obtained from the California
Department of Public Health, contain information on the universe of
births that occurred in California during this time frame. Mother's
date of birth is not available in the public use version of the National
Vital Statistics Natality Data. We use the same sample restrictions as
McCrary and Royer (2011), limiting the sample to mothers who: were born
in California between 1969 and 1987, were 23 years of age or younger at
the time of birth, gave birth to their first child between 1989 and 2002,
and whose education level and date of birth are reported in the data.
(28.) For related reasons, the empirical findings in Dickert-Conlin
and Elder (2010) should also be considered in future papers that use day
of birth as their running variable. In particular, they show that there
are relatively few children born on weekends relative to weekdays
because hospitals usually do not schedule induced labor and cesarean
sections on weekends. As such, children born without medical
intervention, who tend to be of relatively low socioeconomic status,
are disproportionately observed on weekends.
(29.) With that said, this phenomenon may explain why their
estimates vary a great deal when their bandwidth is less than 20 but are
relatively stable at higher bandwidths. See McCrary and Royer (2011),
Web Appendix figure 3.
(30.) These results are based on reported incomes among PSID heads
of household, 1968-2007. For visual clarity, the graphs focus on
individuals with positive incomes less than $40,000, which is
approximately equal to the 75th percentile. In addition, the histogram
uses $100 bins and the mean plot uses $100 bins for the data that are
not found at $1,000 multiples.
(31.) Interestingly, the PSID also reveals systematic heaping in
annual hours of work, which could also be used as a running variable in
an RD design. For example, many employers provide health insurance
and other benefits only to employees who work some predetermined number
of hours. In these data, heaping is evident at 40-hour multiples, and
those at these heaps have less education, on average, than those who
work a similar number of hours but are not at data heaps.
(32.) Recall that a regression line always runs through (x̄, ȳ).
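This property of OLS with an intercept, that the fitted line passes through the point of sample means, is easy to verify numerically. The snippet below is an illustrative check on simulated data, not material from the paper; the data-generating values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 0.5 * x + rng.normal(size=200)

# OLS slope and intercept via least squares (highest degree first).
slope, intercept = np.polyfit(x, y, deg=1)

# The fitted line evaluated at x-bar recovers y-bar (up to rounding error).
fitted_at_xbar = intercept + slope * x.mean()
```

The residual between `fitted_at_xbar` and `y.mean()` is zero up to floating-point error, regardless of the simulated sample.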
ALAN I. BARRECA, JASON M. LINDO and GLEN R. WADDELL *
* The authors thank the editor, Lars Lefgren, and two anonymous
referees for their comments and suggestions, along with Josh Angrist,
Bob Breunig, Patrick Button, David Card, Janet Currie, Yingying Dong,
Todd Elder, Bill Evans, David Figlio, Melanie Guldi, Hilary Hoynes,
Wilbert van der Klaauw, Thomas Lemieux, Justin McCrary, Doug Miller,
Marianne Page, Heather Royer, Larry Singell, Ann Huff Stevens, Ke-Li Xu,
Jim Ziliak, seminar participants at the University of Kentucky, and
conference participants at the 2011 Public Policy and Economics of the
Family Conference at Mount Holyoke College, the 2011 SOLE Meetings, the
2011 NBER's Children's Program Meetings, and the 2011 Labour
Econometrics Workshop at the University of Sydney. Barreca: Associate
Professor, Department of Economics,
Tulane University, New Orleans, LA 70115; NBER and IZA. Phone
504-865-5321, Fax 504-865-5869, E-mail
[email protected]
Lindo: Associate Professor, Department of Economics, Texas A&M
University, College Station, TX 77845; NBER and IZA. Phone 979-845-1363,
Fax 979-847-8757, E-mail
[email protected]
Waddell: Professor, Department of Economics, University of Oregon,
Eugene, OR 97403-1285; IZA. Phone 541-346-1259, Fax 541-346-1243, E-mail
[email protected]
COPYRIGHT 2016 Western Economic Association International