Journal: Proceedings of the National Academy of Sciences
Print ISSN: 0027-8424
Electronic ISSN: 1091-6490
Publication year: 2016
Volume: 113
Issue: 47
Pages: 13283-13288
DOI: 10.1073/pnas.1607774113
Language: English
Publisher: The National Academy of Sciences of the United States of America
Significance: Many scientific applications ranging from ecology to genetics use a small sample to estimate the number of distinct elements, known as "species," in a population. Classical results have shown that n samples can be used to estimate the number of species that would be observed if the sample size were doubled to 2n. We obtain a class of simple algorithms that extend the estimate all the way to n log n samples, and we show that this is also the largest possible estimation range. Therefore, statistically speaking, the proverbial bird in the hand is worth log n in the bush. The proposed estimators outperform existing ones on several synthetic and real datasets collected in various disciplines.
Abstract: Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42-58], uses n samples to predict the number U of hitherto unseen species that would be observed if t·n new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In seminal works, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(102):45-63] constructed an intriguing estimator that predicts U for all t ≤ 1. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435-447] proposed a modification that empirically predicts U even for some t > 1, but without provable guarantees. We derive a class of estimators that provably predict U all the way up to t ∝ log n. We also show that this range is the best possible and that the estimators' mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron-Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
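Illustrative note: the abstract does not spell out any estimator. As a rough sketch, the classical Good-Toulmin estimator it refers to is usually written as U = -Σ_{i≥1} (-t)^i Φ_i, where Φ_i is the number of species observed exactly i times among the n samples. The Python below assumes that textbook form; the function names `prevalences` and `good_toulmin` are hypothetical and are not taken from the paper, and the smoothed estimators the paper proposes are not reproduced here.

```python
from collections import Counter


def prevalences(samples):
    """Map i -> number of species that appear exactly i times in samples."""
    species_counts = Counter(samples)        # species -> times observed
    return Counter(species_counts.values())  # i -> number of species seen i times


def good_toulmin(samples, t):
    """Textbook Good-Toulmin estimate of the number of new species expected
    in t * n further samples, where n = len(samples).

    Assumed form: U = -sum_i (-t)**i * Phi_i. The alternating series grows
    quickly once t > 1, which is why the plain estimator is used for t <= 1.
    """
    phi = prevalences(samples)
    return -sum((-t) ** i * count for i, count in phi.items())


# Toy data: species a seen twice, b once, c three times, d once.
toy_sample = ["a", "a", "b", "c", "c", "c", "d"]
estimate_for_doubling = good_toulmin(toy_sample, 1)
```

Because the estimate is a linear function of the prevalences Φ_i, it can be computed in a single pass over the data, in line with the abstract's description of the estimators as simple, linear, and computationally efficient.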
Keywords: species estimation; extrapolation model; nonparametric statistics