Measuring the mortality impact of breast cancer screening.
Hanley, James A.; McGregor, Maurice; Liu, Zhihui, et al.
Deciding whether to implement a screening program for breast cancer
requires weighing the health benefits (cancer deaths averted) against
the harms (overdiagnosis) and the costs. Essential to such a decision is
an accurate estimation of the extent of the health benefits and harms in
question. We avoid the larger debate, to screen or not to screen, and
focus instead on how the benefit is typically calculated in reports. We
show that this method contains conceptual errors and leads to serious
underestimates. Although other reports (1,2) are also based on analyses
that contain these same errors, and use the same trials, we will for
simplicity focus on the recent report of the Canadian Task Force on
screening for breast cancer in average-risk women aged 50-69 years. (3)
Before we address this report, we first briefly consider some important
characteristics of screening for cancers.
Unlike most medical interventions, which produce rapid effects,
cancer screening, by its very nature, generates mortality reductions
that only manifest several years after the onset of screening. (4,5)
Illustrated in Figure 1 are hypothetical examples of the yearly
percentage mortality reductions that might be expected from screening
for cancer every year for a) just three years (as some trials did) or b)
twenty years (as a screening program might do). Screening leads to
earlier treatment of otherwise fatal cancers, but can only save lives
(produce a mortality "deficit" or "reduction") at
the time when the deaths averted as a result of screening would have
(otherwise) occurred. Thus, in the trial, illustrated in scenario a),
the mortality of the screened population, relative to that of the
unscreened, only starts to fall perceptibly by the third year, when the
earliest effect of the first screen is expressed; it continues to fall
for three more years, with the greatest reduction (35%) attained in the
sixth year; mortality then rises again and returns to the level in the
unscreened population after year nine when the last effect of the third
and final screen is expressed. In contrast, in the 20-year screening
program, illustrated in scenario b), the (relative) mortality in the
screened population would again start to decrease by the third year, and
the reductions would reach an asymptote (largest possible magnitude of
benefit) of 46% in the seventh year; mortality would only rise again and
return to that in the unscreened population after year twenty-six, the
year when the last effect of the twentieth screen in the program is
expressed.
Thus, when our objective is to deduce the size of the reduction in
breast cancer mortality that would result from instituting a program of
regular screening, we must identify the "asymptote": the
annual mortality reduction that would be achieved each year after an
adequate period of regular screening. One could not determine this value
if screening were discontinued "prematurely" (i.e., before the
maximum annual mortality reduction of 46% was achieved), and any
estimate from a trial with a limited number of rounds of screening will
be an underestimate of what the program could achieve.
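The timing described above can be mimicked with a minimal additive sketch. It is not the model behind Figure 1: the per-round contributions in `KERNEL` are hypothetical values, chosen only so that the totals match the 35% and 46% quoted in the text, and the assumption that each round acts 2 to 6 years after it takes place is inferred from the years named above.

```python
# Hypothetical sketch: each annual screening round is assumed to avert a
# fixed share of deaths 2 to 6 years after the round. The values (in
# percentage points) are invented so that the totals match the 35% and 46%
# quoted in the text; they are not estimates from any trial.
KERNEL = {2: 5.5, 3: 11.0, 4: 13.0, 5: 11.0, 6: 5.5}


def yearly_reduction(n_rounds, horizon=30):
    """Percentage mortality reduction in each follow-up year.

    Screens are assumed to take place in years 1..n_rounds, and the
    contributions of successive rounds are simply added together.
    """
    return {
        year: sum(KERNEL.get(year - screen_year, 0.0)
                  for screen_year in range(1, n_rounds + 1))
        for year in range(1, horizon + 1)
    }


trial = yearly_reduction(3)      # scenario a): three annual rounds
program = yearly_reduction(20)   # scenario b): twenty annual rounds

for label, series in (("trial", trial), ("program", program)):
    peak = max(series.values())
    first_peak_year = min(y for y, r in series.items() if r == peak)
    last_effect_year = max(y for y, r in series.items() if r > 0)
    print(label, peak, first_peak_year, last_effect_year)
```

Under these assumptions the three-round trial peaks at 35% in year 6 and shows no effect after year 9, whereas the 20-round program first reaches 46% in year 7 and shows no effect after year 26.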
[FIGURE 1 OMITTED]
However, many of the screening trials on which the Canadian and
other reports are based were terminated prematurely (either by ending
screening of the intervention population, or by initiating screening of
the control population). Furthermore, most of these studies do not
report the mortality deficits observed in each year of the trial, but
give their results as a single rate ratio, and thus a single mortality
deficit, calculated from the cumulative numbers of deaths. This metric
includes all deaths from the very onset of screening to the end of the
follow-up, however long or short, or arbitrary, that duration may be.
This overall duration includes the early years in which little or no
reduction in mortality can be expected, and sometimes also the late
years in which the effects of screening are diminishing as a result of
its discontinuation. By relying on this overall measure, task forces
inevitably arrive at estimates that are smaller than the reduction
achievable by a program (46% in our hypothetical example). The size of
the shortfall depends on how many of the years included in the average
are years in which the mortality reduction was zero or less than maximal.
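The dilution can be illustrated with a short calculation. The yearly reductions below are hypothetical (a three-round trial peaking at 35%, followed for 14 years), and the sketch assumes equal person-years in the two arms and a constant number of control-arm deaths each year, so that the ratio of death counts stands in for the rate ratio.

```python
# Hypothetical yearly mortality reductions (%) for a trial with three
# annual rounds and 14 years of follow-up; values are illustrative only.
yearly_reduction_pct = [0, 0, 5.5, 16.5, 29.5, 35, 29.5, 16.5, 5.5,
                        0, 0, 0, 0, 0]

# Assumption: the control arm has the same number of deaths every year,
# and both arms contribute the same person-years.
control_deaths_per_year = 100.0

control_total = control_deaths_per_year * len(yearly_reduction_pct)
screened_total = sum(control_deaths_per_year * (1 - r / 100)
                     for r in yearly_reduction_pct)

# Single "cumulative" deficit, as derived from one overall rate ratio.
cumulative_deficit_pct = 100 * (1 - screened_total / control_total)

# Largest year-specific deficit within the same follow-up.
peak_deficit_pct = max(yearly_reduction_pct)

print(round(cumulative_deficit_pct, 1), peak_deficit_pct)
```

With these inputs the single cumulative figure is a little under 10%, although the largest yearly deficit in the same data is 35%.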
Although these features of screening have long been recognized,
(4-13) they are still frequently overlooked, as they were in the recent
report of the Canadian Task Force on Preventive Health Care. Its
guidelines are primarily based on a meta-analysis of six breast cancer
screening trials, (14-19) which found that the expected mortality
reduction that would result from breast screening was 21%. Our primary
objective in this paper is to display the yearly mortality data in each
trial and deduce the reduction expected from a screening program, using
an approach that respects the features referred to above.
METHODS
Five of the six trials subjected to meta-analysis by the Task Force
are briefly summarized below. It was necessary to exclude the Canadian
Trial (19) (1980b in the Canadian meta-analysis (20)) because the
year-specific mortality data are neither available from the reports nor
obtainable from the authors. The remaining five trials differ so greatly
in the screening regimens and other important elements that we do not
find it justifiable to combine the year-specific numbers of deaths.
Instead, we examined the year-by-year pattern of mortality deficits in
each trial separately. Thus, for each trial, we attempted to identify
the "trough" or "nadir" achieved following the onset
of screening.
[FIGURE 2 OMITTED]
Two authors (JH, ZL) independently extracted the year-specific
numbers of breast cancer deaths in the experimental and control arms
from the published articles. From the cumulative numbers of deaths
reported in Table 7 in the HIP trial and Table X in the Malmo trial, we
calculated the yearly numbers of deaths by successive subtractions. The
reports of the other three trials contained plots of cumulative numbers
of deaths over time (Figure 2, Two-County; Figure 2, Stockholm; Figure
1, Gothenburg). For each of these, we used a graph digitizer to extract
the cumulative values, and then converted them into year-specific
numbers of deaths, and checked the totals against the total numbers
reported in the text. Disagreements between extractors were resolved by
further review. In reports that did not provide sufficiently
age-specific data, we used slightly wider or narrower age-at-entry
bands.
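The successive-subtraction step can be sketched as follows. The cumulative counts are hypothetical, not data from any of the trials, and the function name is ours.

```python
def yearly_from_cumulative(cumulative):
    """Convert cumulative death counts (end of year 1, 2, ...) to yearly counts."""
    yearly = []
    previous = 0
    for total in cumulative:
        yearly.append(total - previous)
        previous = total
    return yearly


# Hypothetical cumulative counts, not data from any of the trials.
cumulative_deaths = [2, 5, 9, 16, 22]
yearly_deaths = yearly_from_cumulative(cumulative_deaths)

# Consistency check, analogous to comparing against a reported total.
assert sum(yearly_deaths) == cumulative_deaths[-1]

print(yearly_deaths)  # [2, 3, 4, 7, 6]
```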
There was substantial variation in the screening regimens, and the
year-specific death counts in most trials were in the single digits. To
reduce the statistical noise, and to avoid artifacts in estimating
nadirs, we used three-year moving averages to calculate the
year-specific mortality rate ratios, and their complements, the
year-specific mortality deficits. Given the general lack of sufficiently
sustained screening in these trials, our aim was to use the maximum
annual mortality deficit in each trial to gain some idea of the
sustained mortality reduction that would result if women were regularly
screened (annually or biennially), from age 50 until 69, at the same
participation rates as pertained in the trials.
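A minimal sketch of the smoothing step follows. It assumes a centred three-year window in which deaths and person-years are summed within each arm before the rate ratio is formed; the text does not say how the window was aligned, and the counts and person-years below are hypothetical.

```python
def moving_deficits(deaths_screened, deaths_control,
                    py_screened, py_control, window=3):
    """Year-specific mortality deficits (%) from moving-window rates.

    For each run of `window` consecutive years, deaths and person-years
    are summed within each arm, the rate ratio is formed, and the deficit
    is its complement. Results are keyed by the central year (1-based).
    A window with no control-arm deaths would raise ZeroDivisionError.
    """
    half = window // 2
    n_years = len(deaths_control)
    deficits = {}
    for start in range(n_years - window + 1):
        stop = start + window
        rate_s = sum(deaths_screened[start:stop]) / sum(py_screened[start:stop])
        rate_c = sum(deaths_control[start:stop]) / sum(py_control[start:stop])
        deficits[start + half + 1] = 100 * (1 - rate_s / rate_c)
    return deficits


# Hypothetical single-digit yearly counts and equal person-years per arm.
screened = [4, 6, 7, 5, 5, 5, 5, 7]
control = [4, 6, 8, 8, 10, 10, 8, 8]
person_years = [20000] * 8

deficits = moving_deficits(screened, control, person_years, person_years)
for year, deficit in deficits.items():
    print(year, round(deficit, 1))
```

Because each value pools three years of deaths, a single unusual year shifts the rate ratio far less than it would in an unsmoothed yearly series.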
We investigated, by simulations, whether this amount of smoothing
(each deficit based on three-year moving rates) was sufficient to keep
the probability of overestimating the true nadir at around 50% (i.e.,
whether the estimator of the nadir was median unbiased). We found that
indeed, if one relied on the largest deficit in a series of moving
deficits, one would tend to slightly overestimate the true nadir. But we
also found that the most conservative of three adjacent such moving
deficits was as likely to overestimate the true nadir as it was to
underestimate it. When visually extracting a sensible nadir from Figure
2, we informally looked for an estimate of the percentage deficit that
would be surpassed or equaled by the displayed moving deficit for at
least three successive years. For example, the HIP study has three
consecutive years with deficits of more than 40%, while the Malmo study
has three with deficits of more than 45%.
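The "three successive years" rule can be written as a small function: take the most conservative of each run of three adjacent moving deficits, then keep the largest of those. This is a sketch of the rule as described, not the authors' code. The function name is ours, and the longer series in the last example is hypothetical.

```python
def sustained_nadir(deficits, run=3):
    """Largest deficit equalled or surpassed in `run` successive years."""
    if len(deficits) < run:
        raise ValueError("need at least `run` yearly deficits")
    return max(min(deficits[i:i + run])
               for i in range(len(deficits) - run + 1))


# Three-year runs reported for the HIP and Malmo trials (percent).
print(sustained_nadir([43, 47, 43]))   # 43, i.e. more than 40%
print(sustained_nadir([48, 58, 52]))   # 48, i.e. more than 45%

# Hypothetical longer series: the single largest value (47) is not used.
print(sustained_nadir([0, 5, 20, 43, 47, 43, 30, 10]))  # 43
```

Taking the minimum within each run is what guards against the tendency of a simple maximum to overestimate the true nadir.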
RESULTS
The five trials in question are included in Figure 2 and are
summarized below.
The HIP trial (14) employed 4 annual rounds of screening, using
mammography and physical examination, with a participation rate of 65%
at the initial round. The breast cancer mortality deficits begin to
manifest in year 3, reaching values of 43%, 47% and 43% for the next
three years, after which the effect of screening (already discontinued)
again diminishes. Thus, screening is associated with a sustained deficit
in annual mortality of over 40%.
Comment: The Task Force meta-analysis (20) used a 22% deficit,
calculated over 14 years, including the first 2 years in which the
effect of screening had not yet commenced, and the years 10-14 in which
its effects had ended. Thus it clearly underestimates what a sustained
program could achieve.
The Malmo trial (15) had the longest duration of screening: 6
rounds over 9 years, with a participation rate of more than 70%. The
Task Force used the data for women aged 55 years and over. Probably
because of its limited size (virtually all of the yearly numbers of
deaths are in the single digits), breast cancer mortality deficits only
begin to be expressed in year 7, reaching values of 48%, 58%, and 52% in
years 8, 9 and 10, respectively, when the trial was terminated. Thus the
sustained deficit in annual mortality was of the order of 50%.
Comment: The deficit in mortality used by the Task Force is an
average over 18 years. Since in years 12-18 (yearly data not available),
women in the control arm were invited to screening, the 18% deficit
calculated by the Task Force would be expected to underestimate the
uncontaminated impact of 6 rounds of screening. Indeed, the authors of
this study recognized that "intervention at the noninvasive or
early invasive stage would not influence the death rate until several
years later". They estimated that after a 6-year delay and with the
inclusion of preliminary data from 1987, the deficit in mortality is
42%. (15)
In the Two-County trial, (16) the experimental arm involved 3
rounds of screening over a span of 5 years. Women in the control arm
were invited to screening from about year 8 onwards. The mortality
deficits in the last three years (56%, 62%, and 58%, with an average of
59%) reflect the deficits in mortality resulting from screening in this
study.
Comment: The substantial mortality deficit in this trial presumably
reflects both the high participation rate (89% at the initial
examination) in the experimental arm and the greater stability of the
derived statistics: this trial was the largest of the five in terms of
yearly numbers of deaths. Based on the average mortality over the
lengths of the follow-up in the 1995 and 2002 separate-county (East and
West) reports, the Task Force analysis used deficits of 19% and 47%,
respectively, or 33% if one were to combine them.
The Stockholm trial (17) involved 2 rounds of screening over a span
of 2 years. Women in the control arm were invited to screening after
about year 5, thus limiting the time during which the uncontaminated
effect of screening could be observed. In years 5, 6 and 7, deficits of
45%, 40% and 46%, respectively (average 44%) were observed. Over years
3-9, there is a sustained mortality deficit of approximately 40%.
Comment: In contrast, the Task Force calculated an average deficit
over all 12 years of 32%.
In the Gothenburg trial, (18) the experimental arm involved 4
rounds of screening over a span of 6 years. Women in the control arm
were invited to screening as soon as the cumulative number of breast
cancer deaths in the experimental arm was statistically significantly
lower than that in the control arm (thereby preventing the full
expression of the effect of screening). The 3 rounds of screening appear
to have resulted in mortality deficits of 45% and 29% in the two years
before the trial was effectively terminated by introducing screening to
the control group. Thereafter the time-pattern of the mortality deficits
becomes erratic. A very approximate estimate of the effect of screening
would be the average of the two years in which it was observed, i.e.,
38%.
Comment: Not surprisingly, given the similarity of the intervention
in the two arms from year 5 onwards, there is no evidence of the impact
of screening beyond year 13. The 21% average over all 14 years used by
the Task Force reflects both this attenuation and the inclusion of the
initial years in which no effect could have been seen.
Estimated mortality reduction of a program that screens regularly
for a 20-year age span
From observation of the deficits in mortality associated with
screening in each trial (Figure 2), it is apparent that (except for the
Malmo trial) screening was not maintained sufficiently long to achieve
its full effect. However, some idea of the magnitude of the reduction in
mortality that would have been achieved if screening were continued for
20 years can be estimated from the pattern of deficits. Despite the
variability, expected with such small numbers, the trials consistently
suggest that 20 years of offering screening to women from age 50 to 69
would be followed by 20 years (approximately ages 55-74) in which the
breast cancer mortality reductions would be at least 40%. Moreover,
since the maximal deficits were achieved with participation rates that
were well below 100%, they in turn underestimate the probability of
benefit for women who would participate more fully than the
"average" in the trials.
DISCUSSION
The decision to initiate and/or sustain a program of breast cancer
screening will always require up-to-date and accurate estimates of the
harms and benefits that it will cause. Since the time when the studies
cited above were carried out, screening techniques have become more
sensitive (and less specific) and cancer therapies have become more
effective. However, if these trial results are to be used for the formulation of
policy, they must be correctly interpreted. Without engaging in the
debate on the overall value of screening, we believe that the reduction
in mortality estimated by the Task Force on the basis of these studies
is a considerable underestimation.
What we need to know for such a decision is the yearly reduction in
mortality that will result from screening (say annually or biennially)
of women of a given age at entry (say 50 years) over a prolonged (say 20
years) time, compared with the mortality in women who do not take part
in screening. This we must attempt to derive from data reflecting much
shorter periods of screening (usually terminated before the full effect
can be seen) of women invited to screening, compared to control groups
in which substantial proportions undergo "external" screening.
Furthermore, we need to know the reduction in annual mortality rate
produced by the screening rather than the reduction over the overall
length of the follow-up, a figure that will be unduly low due to
inclusion of mortality data at times when the intervention can only have
zero or reduced effects. Even without correction for rates of external
screening, the deficits shown in Figure 2 indicate that, in contrast
with the 21% calculated by the Canadian Task Force, the estimated
reduction lies closer to 40%. The mortality reduction in women screened,
as distinct from invited, would be greater and would be further
increased when compared to women who are not screened.
To appreciate the numbers involved, one might wish to apply these
different percentage reductions, and the amount of screening that would
be involved, to the current population of Canadian women. At present,
approximately 4 million Canadian women are between the ages of 50 and
69. Each year, more or less uniformly distributed over the age range 50
to 85, there are approximately 5,000 breast cancer deaths. If screening
from age 50 to 69 resulted in a 20% reduction in the breast cancer
mortality rates in the age ranges 55-75, with smaller reductions in
younger and older ages, approximately 650 breast cancer deaths would be
averted each year; if it resulted in a 40% reduction, 1,300 would be.
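The order of magnitude of these figures can be checked with a rough calculation. The sketch below covers only the central age band (55-75) and takes the deaths to be exactly uniform over ages 50 to 85. The figures of 650 and 1,300 in the text also include the smaller reductions at younger and older ages, whose size the text does not specify, so the numbers computed here are expected to be somewhat lower.

```python
annual_deaths = 5000        # approximate yearly breast cancer deaths (from the text)
age_span_all = 85 - 50      # deaths assumed uniform over ages 50 to 85
age_span_core = 75 - 55     # ages in which the full reduction is assumed to apply

# Share of the yearly deaths that falls in the central age band.
deaths_in_core = annual_deaths * age_span_core / age_span_all

for reduction in (0.20, 0.40):
    averted_core = deaths_in_core * reduction
    print(f"{reduction:.0%} reduction: about {averted_core:.0f} deaths "
          "averted per year in ages 55-75 alone")
```

Doubling the assumed percentage reduction doubles the number of deaths averted, which is the relation between the two figures in the text.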
We did not attempt to calculate what the reductions would be with
other or full participation rates. We merely show that despite
participation rates that are well below those seen in therapeutic
trials, and despite the fact that the regimens used in the trials were
much shorter than those that would be used in a screening program, the
deficits achieved were still considerably larger than the reductions
estimated in the Task Force report.
An implicit but clearly inappropriate assumption in the
meta-analysis underpinning the Task Force report is statistical
exchangeability of deaths in different person years, no matter whether
they occur in year 1, 11 or 24. Unlike the practice in other
"latency" contexts, (21) most data analysts ignore the
non-proportional hazards (5,22) that characterize mortality patterns in
cancer screening trials. We suggest they adopt a time-specific approach
such as that in Figures 1 and 2, and dispense with single (aggregated
over all follow-up time) numbers.
Ideally, i.e., if they were sufficiently numerous, the data in each
separate trial we examined would coherently "speak for
themselves" as to the time windows in which one should and should
not expect mortality deficits. However, in many of the trials, and
despite our attempts to reduce the noise, the numbers of screenings and
the numbers of breast cancer deaths were almost too low to interpret.
The Malmo trial is the only one with a sufficiently sustained screening
regimen to generate a genuine asymptote. And indeed, when the
time-specific data from this trial were reconsidered in detail, (4) and
allowance was made for the expected lag, they suggested that large
mortality reductions (>50%) are possible with sustained screening.
Likewise, the long-term (25-30 year) follow-up of cancer screening
trials with limited screening, and the use of (one-number) reduction
measures based on all deaths in the follow-up window, in subjects whose
last screening examination was carried out decades earlier, (19,23) will
not be informative. In such analyses, the inclusion of the time window
before any deficits would be expected will already dilute the effect;
but the inclusion of the very long post-last-screen time window--when
deficits will long since have disappeared--will dilute it even more,
(4,5,22) and make the resulting number meaningless as a measure of what
a screening program that involves 20 years of screening would
accomplish.
The duration of screening in a trial is typically shorter than that
in a program and the deficits last for fewer years. The Canadian Task
Force failed to distinguish trials from programs, as is evident in their
statement "Screening women aged 50-69 years ... for about 11
years" and in their calculations based on this arbitrary
time-horizon. If numbers needed to screen are to be meaningful, they
should refer to the full length of a program, in which women would
undergo 20 years of screening (10-20 examinations say), starting at age
50, rather than the limited number (typically 3-4) of examinations and
an average of 11 years of follow-up in the trials the Task Force used.
Likewise, mortality deficits should be tallied in a 30-year follow-up
window extending from 50 to 80 years of age.
Finally, it should be noted that the full effect of an earlier
detection program will always be underestimated by the focus on
statistical hypothesis-testing and the practice of announcing results
when the accumulated deficits first become "statistically"
significantly different from zero. When used in the context of
policymaking, the "key question" targeted by the Canadian Task
Force "Does screening ... decrease breast cancer mortality for
women of all ages?" is seriously incomplete. Decision makers need
to know how great the benefits might be.
SUMMARY
To estimate the magnitude of the impact on breast cancer mortality
in a screening program using data from trials, one must recognize the
critical roles of the screening regimen, and the time-window in which
the delayed deficits are seen. These issues were ignored in the recent
Canadian, US, and UK Task Force reports. Reanalysis of data from the
same trials, paying attention to the timing of the deaths in relation to
the timing of the screening, indicates that yearly breast cancer
mortality reductions under a screening program would be at least
40%--double the Task Force's estimate.
Acknowledgements: This work was supported by the Canadian
Institutes of Health Research.
Conflict of Interest: None to declare.
The translation of the abstract appears at the end of the article.
Can J Public Health 2013;104(7):e437-e442.
REFERENCES
(1.) US Preventive Services Task Force. Screening for Breast
Cancer: U.S. Preventive Services Task Force Recommendation Statement.
Ann Intern Med 2009;151:716-26.
(2.) Independent UK Panel on Breast Cancer Screening. The benefits
and harms of breast cancer screening: An independent review. Lancet
2012;380(9855):1778-86.
(3.) Canadian Task Force on Preventive Health Care, Tonelli M,
Connor Gorber S, Joffres M, Dickinson J, Singh H, et al. Recommendations
on screening for breast cancer in average-risk women aged 40-74 years.
CMAJ 2011;183(17):1991-2001.
(4.) Miettinen OS, Henschke CI, Pasmantier MW, Smith JP, Libby DM,
Yankelevitz DF. Mammographic screening: No reliable supporting evidence?
Lancet 2002;359(9304):404-5.
(5.) Miettinen OS, Karp I. Epidemiological Research: An
Introduction. New York, NY: Springer, 2012; 81.
(6.) Morrison AS. Screening in Chronic Disease, First Edition. New
York: Oxford University Press, 1985.
(7.) Caro J. Screening for breast cancer in Quebec: Estimates of
health effects and of costs. Montreal: CETS, 1990;24. Available at:
http://www.aetmis.gouv.qc.ca/site/en_publications_liste.phtml (Accessed
January 7, 2012).
(8.) Hu P, Zelen M. Planning clinical trials to evaluate early
detection programs. Biometrika 1997;84:817-29.
(9.) Hu P, Zelen M. Planning of randomized early detection trials.
Stat Methods Med Res 2004;13(6):491-506.
(10.) Hanley JA. Analysis of mortality data from cancer screening
studies: Looking in the right window. Epidemiology 2005;16:786-90.
(11.) Baker SG, Kramer BS, Prorok PC. Early reporting for cancer
screening trials. J Med Screen 2008;15:122-29.
(12.) Hanley JA. Mortality reductions produced by sustained
prostate cancer screening have been underestimated. J Med Screen
2010;17(3):147-51.
(13.) Hanley JA. Measuring mortality reductions in cancer screening
trials. Epidemiol Rev 2011;33(1):36-45.
(14.) Shapiro S. Evidence on screening for breast cancer from a
randomized trial. Cancer 1977;39(6 Suppl):2772-82.
(15.) Andersson I, Aspegren K, Janzon L, Landberg T, Lindholm K,
Linell F, et al. Mammographic screening and mortality from breast
cancer: The Malmo mammographic screening trial. BMJ
1988;297(6654):943-48.
(16.) Tabar L, Fagerberg CJ, Gad A, Baldetorp L, Holmberg LH,
Grontoft O, et al. Reduction in mortality from breast cancer after mass
screening with mammography. Randomised trial from the Breast Cancer
Screening Working Group of the Swedish National Board of Health and
Welfare. Lancet 1985;1(8433):829-32.
(17.) Frisell J, Lidbrink E, Hellstrom L, Rutqvist LE. Followup
after 11 years--Update of mortality results in the Stockholm
mammographic screening trial. Breast Cancer Res Treat 1997;45(3):263-70.
(18.) Bjurstam N, Bjorneld L, Warwick J, Sala E, Duffy SW, Nystrom
L, et al. The Gothenburg Breast Screening Trial. Cancer 2003;97:2387-96.
(19.) Miller AB, To T, Baines CJ, Wall C. Canadian National Breast
Screening Study-2: 13-year results of a randomized trial in women aged
50-59 years. J Natl Cancer Inst 2000;92(18):1490-99.
(20.) Fitzpatrick-Lewis D, Hodgson N, Ciliska D, Peirson L, Gauld
M, Yun Liu Y. Breast cancer screening. Available at:
http://www.ephpp.ca/pdf/breast_cancer_2011_systematic_review_ENG.pdf
(Accessed July 26, 2012).
(21.) Breslow NE, Day NE. Statistical Methods in Cancer Research.
Volume II--The Design and Analysis of Cohort Studies. Lyons, France:
IARC Scientific Publications No. 82., 1987.
(22.) Liu Z, Hanley JA, Strumpf EC. Projecting the yearly mortality
reductions due to a cancer screening programme. J Med Screen [2013 Sep
18. Epub ahead of print].
(23.) Marcus PM, Bergstralh EJ, Fagerstrom RM, Williams DE, Fontana
R, Taylor WF, Prorok PC. Lung cancer mortality in the Mayo Lung Project:
Impact of extended follow-up. J Natl Cancer Inst 2000;92(16):1308-16.
Received: June 19, 2013
Accepted: September 19, 2013
James A. Hanley, PhD, [1,2] Maurice McGregor, MD, [2] Zhihui Liu,
MSc, [1] Erin C. Strumpf, PhD, [1,3] Nandini Dendukuri, PhD, [1,2]
Author Affiliations
McGill University, Montreal, QC
[1.] Department of Epidemiology, Biostatistics and Occupational
Health
[2.] Department of Medicine
[3.] Department of Economics
Correspondence: James Hanley, Dept. of Epidemiology, Biostatistics
and Occupational Health, McGill University, 1020 Pine Avenue West,
Montreal, QC H3A 1A2