Article information

Title: Measuring the mortality impact of breast cancer screening.
Authors: Hanley, James A.; McGregor, Maurice; Liu, Zhihui et al.
Journal: Canadian Journal of Public Health
Print ISSN: 0008-4263
Year of publication: 2013
Issue: November
Language: English
Publisher: Canadian Public Health Association
Keywords: Breast cancer; Cancer; Cancer screening; Mortality; Task forces


Whether or not to implement a screening program for breast cancer requires weighing the health benefits (cancer deaths averted) against the harms (overdiagnosis) and the costs. Essential to such a decision is an accurate estimate of the size of these benefits and harms. We set aside the larger debate over whether to screen, and focus instead on how the benefit is typically calculated in reports. We show that the usual method contains conceptual errors and leads to serious underestimates. Other reports (1,2) use the same trials and rest on analyses containing the same errors. For simplicity, however, we focus on the recent report of the Canadian Task Force on Preventive Health Care, specifically its recommendations for average-risk women aged 50-69 years. (3) Before addressing this report, we briefly consider some important characteristics of cancer screening.

Unlike most medical interventions, which produce rapid effects, cancer screening by its very nature generates mortality reductions that become apparent only several years after screening begins. (4,5) Figure 1 illustrates hypothetical examples of the yearly percentage mortality reductions that might be expected from screening for cancer every year for a) just three years (as some trials did) or b) twenty years (as a screening program might do). Screening leads to earlier treatment of otherwise fatal cancers. However, it can save lives (produce a mortality "deficit" or "reduction") only at the time when the deaths it averts would otherwise have occurred. Thus, in the trial illustrated in scenario a), the mortality of the screened population, relative to that of the unscreened population, starts to fall perceptibly only by the third year, when the earliest effect of the first screen is expressed. It continues to fall for three more years, reaching the greatest reduction (35%) in the sixth year. Mortality then rises again and returns to the level of the unscreened population after year nine, when the last effect of the third and final screen has been expressed. In contrast, in the 20-year screening program illustrated in scenario b), the relative mortality in the screened population would again start to decrease by the third year. The reductions would reach an asymptote (the largest possible magnitude of benefit) of 46% in the seventh year. Mortality would rise again and return to that of the unscreened population only after year twenty-six, the year in which the last effect of the twentieth screen in the program is expressed.

Thus, when the objective is to estimate the reduction in breast cancer mortality that would result from instituting a program of regular screening, we must identify the "asymptote". This is the annual mortality reduction achieved each year after an adequate period of regular screening. The value cannot be determined if screening is discontinued "prematurely", that is, before the maximum annual mortality reduction (46% in the example) is reached. Any estimate from a trial with a limited number of screening rounds will therefore underestimate what a program could achieve.
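The shape of the curves in Figure 1 follows from superposition. Each screen contributes a delayed, temporary reduction, and the yearly deficit is the sum of the contributions of all screens performed so far. The sketch below illustrates this under stated assumptions. The lag profile `LAG_PROFILE` is hypothetical and not taken from the paper. It was chosen only so that the toy model reproduces the figures quoted above: a first effect in year 3, a 35% peak in year 6 after three annual screens, and a 46% asymptote from year 7 under twenty annual screens. Contributions are also assumed to add on the percentage scale, which is a simplification.

```python
# Toy superposition model of yearly mortality deficits (illustrative only).
# ASSUMPTION: LAG_PROFILE is hypothetical, not from the paper. It gives the
# percentage-point reduction that one screen contributes k years after it is
# performed, and was picked only to match the numbers quoted for Figure 1.
LAG_PROFILE = {2: 3.0, 3: 10.0, 4: 15.0, 5: 10.0, 6: 8.0}


def yearly_deficits(n_screens, horizon, lag_profile=LAG_PROFILE):
    """Percentage mortality reduction in each year 1..horizon when one screen
    is performed in each of years 1..n_screens. Contributions of different
    screens are ASSUMED to add on the percentage scale."""
    deficits = []
    for year in range(1, horizon + 1):
        total = 0.0
        for screen_year in range(1, n_screens + 1):
            # A screen done in screen_year acts (year - screen_year) years later.
            total += lag_profile.get(year - screen_year, 0.0)
        deficits.append(total)
    return deficits


trial = yearly_deficits(n_screens=3, horizon=12)     # scenario a)
program = yearly_deficits(n_screens=20, horizon=30)  # scenario b)
print("trial:  ", trial)
print("program:", program)
print("asymptote:", max(program), "= sum of lag profile:", sum(LAG_PROFILE.values()))
```

With three screens, this toy series peaks at 35 in year 6 and is back to zero after year 9. With twenty screens, it holds at 46 from year 7 through year 22 and reaches zero after year 26. The asymptote always equals the sum of the per-screen lag profile, so a trial that stops after a few rounds never displays it directly.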

[FIGURE 1 OMITTED]

However, many of the screening trials on which the Canadian and other reports are based were terminated prematurely, either by ending screening of the intervention population or by starting to screen the control population. Furthermore, most of these studies do not report the mortality deficit observed in each year of the trial. Instead, they give a single rate ratio, and thus a single mortality deficit, calculated from the cumulative numbers of deaths. This metric includes all deaths from the very onset of screening to the end of follow-up, however long, short, or arbitrary that period may be. It therefore includes the early years, in which little or no reduction in mortality can be expected. Sometimes it also includes the late years, in which the effect of screening is fading because screening has stopped. By relying on this overall measure, task forces inevitably arrive at results smaller than the reduction achievable by a program (46% in our hypothetical example). The shortfall depends on how many of the averaged years had a mortality reduction of zero or less than the maximum.
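The dilution can be made concrete with a simplifying assumption: every year of follow-up contributes the same expected number of control-arm deaths. Under that assumption, the single cumulative rate ratio reduces to one minus the average of the yearly deficits, so every year with little or no effect pulls the summary down. The function name `overall_deficit` is hypothetical, and the deficit series is illustrative rather than data from any trial.

```python
def overall_deficit(yearly_deficits_pct, control_deaths=None):
    """Single-number deficit (%) from cumulative deaths over the whole follow-up.

    yearly_deficits_pct[i] is the true percentage reduction in year i+1;
    control_deaths[i] is the expected number of control-arm deaths that year
    (ASSUMED equal in every year when omitted). Equal arm sizes are assumed."""
    if control_deaths is None:
        control_deaths = [1.0] * len(yearly_deficits_pct)
    screened = sum(c * (1 - d / 100) for c, d in zip(control_deaths, yearly_deficits_pct))
    return 100 * (1 - screened / sum(control_deaths))


# Hypothetical yearly deficits (%) after three annual screens: peak 35% in year 6.
trial = [0, 0, 3, 13, 28, 35, 33, 18, 8]
for years in (9, 12, 20):
    padded = trial + [0] * (years - len(trial))  # later years: no remaining effect
    print(years, "years of follow-up ->", round(overall_deficit(padded), 1), "%")
```

The peak yearly reduction in this toy series is 35%. Even so, the one-number summary is about 15.3% over 9 years, 11.5% over 12, and 6.9% over 20. The longer the follow-up runs past the last screen, the smaller the summary becomes.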

Although these features of screening have long been recognized, (4-13) they are still frequently overlooked, as they were in the recent report of the Canadian Task Force on Preventive Health Care. Its guidelines are primarily based on a meta-analysis of six breast cancer screening trials, (14-19) which found that the expected mortality reduction that would result from breast screening was 21%. Our primary objective in this paper is to display the yearly mortality data in each trial and deduce the reduction expected from a screening program, using an approach that respects the features referred to above.

METHODS

Five of the six trials included in the Task Force meta-analysis are briefly summarized below. We had to exclude the Canadian trial (19) (1980b in the Canadian meta-analysis (20)) because its year-specific mortality data are neither available in the published reports nor obtainable from the authors. The remaining five trials differ so greatly in their screening regimens and other important respects that we do not consider it justifiable to combine their year-specific numbers of deaths. Instead, we examined the year-by-year pattern of mortality deficits in each trial separately. For each trial, we attempted to identify the "trough" or "nadir" reached after the onset of screening.

[FIGURE 2 OMITTED]

Two authors (JH, ZL) independently extracted the year-specific numbers of breast cancer deaths in the experimental and control arms from the published articles. For the HIP and Malmo trials, we calculated the yearly numbers of deaths by successive subtraction from the cumulative numbers reported in Table 7 of the HIP report and Table X of the Malmo report. The reports of the other three trials contained plots of the cumulative numbers of deaths over time: Figure 2 of the Two-County report, Figure 2 of the Stockholm report, and Figure 1 of the Gothenburg report. For each of these, we used a graph digitizer to extract the cumulative values and converted them into year-specific numbers of deaths. We then checked the totals against those reported in the text. Disagreements between extractors were resolved by further review. Where a report did not provide sufficiently age-specific data, we used slightly wider or narrower age-at-entry bands.
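The conversion from cumulative to year-specific counts is a first difference, followed by a consistency check against the reported total. The sketch below is minimal and assumes the digitized cumulative counts are read off at the end of each year. The function name and the numbers in the example are hypothetical, not values from any of the trials.

```python
def yearly_from_cumulative(cumulative, reported_total=None):
    """Year-specific death counts from cumulative counts at the end of years 1..n.

    Digitized values are rounded to whole deaths before differencing (Python's
    round() sends exact .5 values to the nearest even number)."""
    counts = [round(c) for c in cumulative]
    yearly = [counts[0]] + [later - earlier for earlier, later in zip(counts, counts[1:])]
    if any(n < 0 for n in yearly):
        raise ValueError("cumulative series decreases; re-check the extraction")
    if reported_total is not None and counts[-1] != reported_total:
        raise ValueError(f"extracted total {counts[-1]} differs from reported total {reported_total}")
    return yearly


# Hypothetical digitized curve for one arm, checked against a total of 15 deaths.
print(yearly_from_cumulative([2.1, 4.8, 10.2, 15.0], reported_total=15))
```

Raising an error rather than silently clamping negative differences mirrors the idea of resolving disagreements by further review. A decreasing digitized curve signals a reading error that a person should look at.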

There was substantial variation in the screening regimens, and the year-specific death counts in most trials were in the single digits. To reduce statistical noise and to avoid artifacts in estimating nadirs, we used three-year moving averages to calculate the year-specific mortality rate ratios and their complements, the year-specific mortality deficits. Screening in these trials was generally not sustained long enough. Our aim was therefore to use the maximum annual mortality deficit in each trial to gain some idea of the sustained mortality reduction that regular screening would produce. This assumes women screened annually or biennially from age 50 until 69, at the same participation rates as in the trials.
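The three-year moving rate ratios and their complements can be sketched as follows. The text does not say whether the window is centred or trailing, nor exactly how person-time was handled. This sketch therefore assumes a centred window (years t-1, t, t+1) and takes the person-years for each arm as inputs. The function name and the example counts are hypothetical.

```python
def moving_deficits(deaths_exp, deaths_ctl, py_exp, py_ctl):
    """Centred three-year moving rate ratios and deficits (%), year by year.

    All four arguments are per-year lists of equal length (year 1 first).
    Returns (central_year, rate_ratio, deficit_pct) for years 2..n-1.
    ASSUMPTION: centred window; the alignment is not specified in the text."""
    results = []
    for t in range(1, len(deaths_exp) - 1):
        window = slice(t - 1, t + 2)
        rate_exp = sum(deaths_exp[window]) / sum(py_exp[window])
        rate_ctl = sum(deaths_ctl[window]) / sum(py_ctl[window])
        rr = rate_exp / rate_ctl  # raises ZeroDivisionError if the control window has no deaths
        results.append((t + 1, rr, 100 * (1 - rr)))
    return results


# Hypothetical single-digit yearly counts, 10,000 person-years per arm per year.
exp_deaths = [5, 4, 3, 2, 3]
ctl_deaths = [5, 6, 6, 6, 5]
py = [10_000] * 5
for year, rr, deficit in moving_deficits(exp_deaths, ctl_deaths, py, py):
    print(f"year {year}: RR = {rr:.2f}, deficit = {deficit:.0f}%")
```

Pooling the deaths over three years before taking the ratio, rather than averaging three noisy yearly ratios, keeps any single year with one or two deaths from dominating the estimate.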

We investigated, by simulations, whether this amount of smoothing (each deficit based on three-year moving rates) was sufficient to keep the probability of overestimating the true nadir at around 50% (i.e., whether the estimator of the nadir was median unbiased). We found that indeed, if one relied on the largest deficit in a series of moving deficits, one would tend to slightly overestimate the true nadir. But we also found that the most conservative of three adjacent such moving deficits was as likely to overestimate the true nadir as it was to underestimate it. When visually extracting a sensible nadir from Figure 2, we informally looked for an estimate of the percentage deficit that would be surpassed or equaled by the displayed moving deficit for at least three successive years. For example, the HIP study has three consecutive years with deficits of more than 40%, while the Malmo study has three with deficits of more than 45%.
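The simulation check and the "at least three successive years" rule can be sketched as below. Everything here is an assumption for illustration: the true deficit curve, the expected 8 control deaths per year, equal arm sizes, Poisson-distributed counts, and the centred window. Two estimators are compared. The first is the largest moving deficit. The second is the largest value that the moving deficit equals or exceeds for three successive years, that is, the maximum over windows of the minimum of three adjacent moving deficits. All function names are hypothetical. This is not the authors' simulation code, and its output depends on the assumed inputs.

```python
import math
import random


def poisson(rng, mean):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def moving_deficits_equal_arms(d_exp, d_ctl):
    """Centred three-year moving deficits (fractions), equal arm sizes assumed."""
    out = []
    for t in range(1, len(d_exp) - 1):
        ctl = sum(d_ctl[t - 1:t + 2])
        out.append(1 - sum(d_exp[t - 1:t + 2]) / ctl if ctl else 0.0)
    return out


def nadir_max(md):
    """Largest single moving deficit (reported in the text to overestimate)."""
    return max(md)


def nadir_three_years(md):
    """Largest value equalled or surpassed by the moving deficit for three
    successive years, i.e. the most conservative of three adjacent values."""
    return max(min(md[i:i + 3]) for i in range(len(md) - 2))


def prob_overestimate(true_deficits, control_mean, n_reps=2000, seed=1):
    """Fraction of simulated trials in which each estimator exceeds the true nadir."""
    rng = random.Random(seed)
    true_nadir = max(true_deficits)
    over_max = over_three = 0
    for _ in range(n_reps):
        d_ctl = [poisson(rng, control_mean) for _ in true_deficits]
        d_exp = [poisson(rng, control_mean * (1 - d)) for d in true_deficits]
        md = moving_deficits_equal_arms(d_exp, d_ctl)
        over_max += nadir_max(md) > true_nadir
        over_three += nadir_three_years(md) > true_nadir
    return over_max / n_reps, over_three / n_reps


# Hypothetical true yearly deficits (fractions) with a 40% plateau.
TRUE = [0.0, 0.0, 0.1, 0.3, 0.4, 0.4, 0.4, 0.4, 0.4, 0.2, 0.0, 0.0]
print(prob_overestimate(TRUE, control_mean=8.0))
```

By construction, the three-year estimator can never exceed the largest single moving deficit. Its probability of overestimation is therefore never higher. Whether it lands near the 50% target depends on the noise level and the width of the plateau, which is what the authors' simulations examined.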

RESULTS

The five trials in question are included in Figure 2 and are summarized below.

The HIP trial (14) employed 4 annual rounds of screening, using mammography and physical examination, with a participation rate of 65% at the initial round. The breast cancer mortality deficits begin to manifest in year 3, reaching values of 43%, 47% and 43% for the next three years, after which the effect of screening (already discontinued) again diminishes. Thus, screening is associated with a sustained deficit in annual mortality of over 40%.

Comment: The Task Force meta-analysis (20) used a 22% deficit, calculated over 14 years, including the first 2 years in which the effect of screening had not yet commenced, and the years 10-14 in which its effects had ended. Thus it clearly underestimates what a sustained program could achieve.

The Malmo trial (15) had the longest duration of screening: 6 rounds over 9 years, with a participation rate of more than 70%. The task force used the data for women aged 55 years and over. Probably because of its limited size (virtually all of the yearly numbers of deaths are in the single digits), breast cancer mortality deficits only begin to be expressed in year 7, reaching values of 48%, 58%, and 52% in years 8, 9 and 10, respectively, when the trial was terminated. Thus the sustained deficit in annual mortality was of the order of 50%.

Comment: The deficit in mortality used by the Task Force is an average over 18 years. Women in the control arm were invited to screening in years 12-18 (for which yearly data are not available). The 18% deficit calculated by the Task Force would therefore be expected to underestimate the uncontaminated impact of 6 rounds of screening. Indeed, the authors of this study recognized that "intervention at the noninvasive or early invasive stage would not influence the death rate until several years later". They estimated that, allowing for a 6-year delay and including preliminary data from 1987, the deficit in mortality is 42%. (15)

In the Two-County trial, (16) the experimental arm involved 3 rounds of screening over a span of 5 years. Women in the control arm were invited to screening from about year 8 onwards. The mortality deficits in the last three years (56%, 62%, and 58%, with an average of 59%) reflect the deficits in mortality resulting from screening in this study.

Comment: The substantial mortality deficit in this trial presumably reflects both the high participation rate (89% at the initial examination) in the experimental arm and the greater stability of the derived statistics: this trial was the largest of the five in terms of yearly numbers of deaths. Based on the average mortality over the lengths of the follow-up in the 1995 and 2002 separate-county (East and West) reports, the Task Force analysis used deficits of 19% and 47%, respectively, or 33% if one were to combine them.

The Stockholm trial (17) involved 2 rounds of screening over a span of 2 years. Women in the control arm were invited to screening after about year 5, thus limiting the time during which the uncontaminated effect of screening could be observed. In years 5, 6 and 7, deficits of 45%, 40% and 46%, respectively (average 44%) were observed. Over years 3-9, there is a sustained mortality deficit of approximately 40%.

Comment: In contrast, the Task Force calculated an average deficit over all 12 years of 32%.

In the Gothenburg trial, (18) the experimental arm involved 4 rounds of screening over a span of 6 years. Women in the control arm were invited to screening as soon as the cumulative number of breast cancer deaths in the experimental arm was statistically significantly lower than that in the control arm (thereby preventing the full expression of the effect of screening). The 3 rounds of screening appear to have resulted in mortality deficits of 45% and 29% in the two years before the trial was effectively terminated by introducing screening to the control group. Thereafter the time-pattern of the mortality deficits becomes erratic. A very approximate estimate of the effect of screening would be the average of the two years in which it was observed, i.e., 38%.

Comment: Not surprisingly, given the similarity of the intervention in the two arms from year 5 onwards, there is no evidence of the impact of screening beyond year 13. The 21% average over all 14 years used by the Task Force reflects both this attenuation and the inclusion of the initial years in which no effect could have been seen.

Estimated mortality reduction of a program that screens regularly for a 20-year age span

From observation of the deficits in mortality associated with screening in each trial (Figure 2), it is apparent that (except for the Malmo trial) screening was not maintained sufficiently long to achieve its full effect. However, some idea of the magnitude of the reduction in mortality that would have been achieved if screening were continued for 20 years can be estimated from the pattern of deficits. Despite the variability, expected with such small numbers, the trials consistently suggest that 20 years of offering screening to women from age 50 to 69 would be followed by 20 years (approximately ages 55-74) in which the breast cancer mortality reductions would be at least 40%. Moreover, since the maximal deficits were achieved with participation rates that were well below 100%, they in turn underestimate the probability of benefit for women who would participate more fully than the "average" in the trials.

DISCUSSION

The decision to initiate and/or sustain a program of breast cancer screening will always require up-to-date and accurate estimates of the harms and benefits it will cause. Since the studies cited above were carried out, screening techniques have become more sensitive (and less specific) and cancer therapies have become more effective. Nevertheless, if the results of these studies are to be used to formulate policy, they must be correctly interpreted. Without engaging in the debate on the overall value of screening, we believe that the reduction in mortality estimated by the Task Force on the basis of these studies is a considerable underestimate.

What we need to know for such a decision is the yearly reduction in mortality that will result from screening (say annually or biennially) of women of a given age at entry (say 50 years) over a prolonged (say 20 years) time, compared with the mortality in women who do not take part in screening. This we must attempt to derive from data reflecting much shorter periods of screening (usually terminated before the full effect can be seen) of women invited to screening, compared to control groups in which substantial proportions undergo "external" screening. Furthermore, we need to know the reduction in annual mortality rate produced by the screening rather than the reduction over the overall length of the follow-up, a figure that will be unduly low due to inclusion of mortality data at times when the intervention can only have zero or reduced effects. Even without correction for rates of external screening, the deficits shown in Figure 2 indicate that, in contrast with the 21% calculated by the Canadian Task Force, the estimated reduction lies closer to 40%. The mortality reduction in women screened, as distinct from invited, would be greater and would be further increased when compared to women who are not screened.

To appreciate the numbers involved, one might apply these different percentage reductions, and the amount of screening they would involve, to the current population of Canadian women. At present, approximately 4 million Canadian women are between the ages of 50 and 69. Each year there are approximately 5,000 breast cancer deaths, distributed more or less uniformly over ages 50 to 85. Suppose screening from age 50 to 69 produced a 20% reduction in breast cancer mortality rates in the age range 55-75, with smaller reductions at younger and older ages. Approximately 650 breast cancer deaths would then be averted each year; with a 40% reduction, about 1,300 would be.
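The arithmetic can be made explicit for the part of the calculation that the text specifies. Assume, as stated, about 5,000 deaths per year spread evenly over ages 50 to 85. Assume also a reduction r that applies across the 20 years of age from 55 to 75. The core term is then:

```latex
\text{deaths averted per year (ages 55--75 only)}
  \approx \frac{5000}{85-50} \times (75-55) \times r
  = \frac{20}{35} \times 5000 \times r
  \approx 2857\,r
```

This gives about 571 deaths averted for r = 0.20 and about 1,143 for r = 0.40. The figures of 650 and 1,300 quoted above are somewhat larger because they also include the smaller reductions at younger and older ages, which the text does not quantify. Because the total is proportional to r, doubling the percentage reduction doubles the number of deaths averted.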

We did not attempt to calculate what the reductions would be with other or full participation rates. We merely show that despite participation rates that are well below those seen in therapeutic trials, and despite the fact that the regimens used in the trials were much shorter than those that would be used in a screening program, the deficits achieved were still considerably larger than the reductions estimated in the Task Force report.

An implicit but clearly inappropriate assumption in the meta-analysis underpinning the Task Force report is that deaths in different person-years are statistically exchangeable, whether they occur in year 1, 11 or 24. Analysts in other "latency" contexts account for such timing. (21) In contrast, most analysts of cancer screening trials ignore the non-proportional hazards (5,22) that characterize their mortality patterns. We suggest that they adopt a time-specific approach such as that in Figures 1 and 2, and dispense with single numbers aggregated over all follow-up time.

Ideally, i.e., if they were sufficiently numerous, the data in each separate trial we examined would coherently "speak for themselves" as to the time windows in which one should and should not expect mortality deficits. However, in many of the trials, and despite our attempts to reduce the noise, the numbers of screenings and the numbers of breast cancer deaths were almost too low to interpret. The Malmo trial is the only one with a sufficiently sustained screening regimen to generate a genuine asymptote. And indeed, when the time-specific data from this trial were reconsidered in detail, (4) and allowance was made for the expected lag, they suggested that large mortality reductions (>50%) are possible with sustained screening.

Likewise, long-term (25-30 year) follow-up of cancer screening trials with limited screening will not be informative if it relies on one-number reduction measures. Such measures are based on all deaths in the follow-up window, in subjects whose last screening examination took place decades earlier. (19,23) In such analyses, including the time window before any deficits would be expected already dilutes the effect. Including the very long window after the last screen, when deficits have long since disappeared, dilutes it even more. (4,5,22) The resulting number is then meaningless as a measure of what a program involving 20 years of screening would accomplish.

The duration of screening in a trial is typically shorter than that in a program, and the deficits last for fewer years. The Canadian Task Force failed to distinguish trials from programs. This is evident in their statement "Screening women aged 50-69 years ... for about 11 years" and in their calculations based on this arbitrary time horizon. If numbers needed to screen are to be meaningful, they should refer to the full length of a program: 20 years of screening (say, 10-20 examinations) starting at age 50. They should not refer to the limited number of examinations (typically 3-4) and the average of 11 years of follow-up in the trials the Task Force used. Likewise, mortality deficits should be tallied in a 30-year follow-up window extending from 50 to 80 years of age.

Finally, the full effect of an early detection program will always be underestimated when the focus is on statistical hypothesis testing. The same applies when results are announced as soon as the accumulated deficits first become "statistically" significantly different from zero. In the context of policymaking, the "key question" targeted by the Canadian Task Force, "Does screening ... decrease breast cancer mortality for women of all ages?", is seriously incomplete. Decision makers need to know how great the benefits might be.

SUMMARY

To estimate the magnitude of the impact on breast cancer mortality in a screening program using data from trials, one must recognize the critical roles of the screening regimen, and the time-window in which the delayed deficits are seen. These issues were ignored in the recent Canadian, US, and UK Task Force reports. Reanalysis of data from the same trials, paying attention to the timing of the deaths in relation to the timing of the screening, indicates that yearly breast cancer mortality reductions under a screening program would be at least 40%--double the Task Force's estimate.

Acknowledgements: This work was supported by the Canadian Institutes of Health Research.

Conflict of Interest: None to declare.

The translation of the abstract appears at the end of the article.

Can J Public Health 2013;104(7):e437-e442.

REFERENCES

(1.) US Preventive Services Task Force. Screening for Breast Cancer: U.S. Preventive Services Task Force Recommendation Statement. Ann Intern Med 2009;151:716-26.

(2.) Independent UK Panel on Breast Cancer Screening. The benefits and harms of breast cancer screening: An independent review. Lancet 2012;380(9855):1778-86.

(3.) Canadian Task Force on Preventive Health Care, Tonelli M, Connor Gorber S, Joffres M, Dickinson J, Singh H, et al. Recommendations on screening for breast cancer in average-risk women aged 40-74 years. CMAJ 2011;183(17):1991-2001.

(4.) Miettinen OS, Henschke CI, Pasmantier MW, Smith JP, Libby DM, Yankelevitz DF. Mammographic screening: No reliable supporting evidence? Lancet 2002;359(9304):404-5.

(5.) Miettinen OS, Karp I. Epidemiological Research: An Introduction. New York, NY: Springer, 2012; 81.

(6.) Morrison AS. Screening in Chronic Disease, First Edition. New York: Oxford University Press, 1985.

(7.) Caro J. Screening for breast cancer in Quebec: Estimates of health effects and of costs. Montreal: CETS, 1990;24. Available at: http://www.aetmis.gouv.qc.ca/site/en_publications_liste.phtml (Accessed January 7, 2012).

(8.) Hu P, Zelen M. Planning clinical trials to evaluate early detection programs. Biometrika 1997;84:817-29.

(9.) Hu P, Zelen M. Planning of randomized early detection trials. Stat Methods Med Res 2004;13(6):491-506.

(10.) Hanley JA. Analysis of mortality data from cancer screening studies: Looking in the right window. Epidemiology 2005;16:786-90.

(11.) Baker SG, Kramer BS, Prorok PC. Early reporting for cancer screening trials. J Med Screen 2008;15:122-29.

(12.) Hanley JA. Mortality reductions produced by sustained prostate cancer screening have been underestimated. J Med Screen 2010;17(3):147-51.

(13.) Hanley JA. Measuring mortality reductions in cancer screening trials. Epidemiol Rev 2011;33(1):36-45.

(14.) Shapiro S. Evidence on screening for breast cancer from a randomized trial. Cancer 1977;39(6 Suppl):2772-82.

(15.) Andersson I, Aspegren K, Janzon L, Landberg T, Lindholm K, Linell F, et al. Mammographic screening and mortality from breast cancer: The Malmo mammographic screening trial. BMJ 1988;297(6654):943-48.

(16.) Tabar L, Fagerberg CJ, Gad A, Baldetorp L, Holmberg LH, Grontoft O, et al. Reduction in mortality from breast cancer after mass screening with mammography. Randomised trial from the Breast Cancer Screening Working Group of the Swedish National Board of Health and Welfare. Lancet 1985;1(8433):829-32.

(17.) Frisell J, Lidbrink E, Hellstrom L, Rutqvist LE. Followup after 11 years--Update of mortality results in the Stockholm mammographic screening trial. Breast Cancer Res Treat 1997;45(3):263-70.

(18.) Bjurstam N, Bjorneld L, Warwick J, Sala E, Duffy SW, Nystrom L, et al. The Gothenburg Breast Screening Trial. Cancer 2003;97:2387-96.

(19.) Miller AB, To T, Baines CJ, Wall C. Canadian National Breast Screening Study-2: 13-year results of a randomized trial in women aged 50-59 years. J Natl Cancer Inst 2000;92(18):1490-99.

(20.) Fitzpatrick-Lewis D, Hodgson N, Ciliska D, Peirson L, Gauld M, Yun Liu Y. Breast cancer screening. Available at: http://www.ephpp.ca/pdf/breast_cancer_2011_systematic_review_ENG.pdf (Accessed July 26, 2012).

(21.) Breslow NE, Day NE. Statistical Methods in Cancer Research. Volume II--The Design and Analysis of Cohort Studies. Lyon, France: IARC Scientific Publications No. 82, 1987.

(22.) Liu Z, Hanley JA, Strumpf EC. Projecting the yearly mortality reductions due to a cancer screening programme. J Med Screen [2013 Sep 18. Epub ahead of print].

(23.) Marcus PM, Bergstralh EJ, Fagerstrom RM, Williams DE, Fontana R, Taylor WF, Prorok PC. Lung cancer mortality in the Mayo Lung Project: Impact of extended follow-up. J Natl Cancer Inst 2000;92(16):1308-16.

Received: June 19, 2013

Accepted: September 19, 2013

James A. Hanley, PhD, [1,2] Maurice McGregor, MD, [2] Zhihui Liu, MSc, [1] Erin C. Strumpf, PhD, [1,3] Nandini Dendukuri, PhD, [1,2]

Author Affiliations

McGill University, Montreal, QC

[1.] Department of Epidemiology, Biostatistics and Occupational Health

[2.] Department of Medicine

[3.] Department of Economics

Correspondence: James Hanley, Dept. of Epidemiology, Biostatistics and Occupational Health, McGill University, 1020 Pine Avenue West, Montreal, QC H3A 1A2