摘要:Recent advance in technology enables researchers to gather and store enormous data sets with ultra high dimensionality. In bioinformatics, microarray and next generation sequencing technologies can produce data with tens of thousands of predictors of biomarkers. On the other hand, the corresponding sample sizes are often limited. For classification problems, to predict new observations with high accuracy, and to better understand the effect of predictors on classification, it is desirable, and often necessary, to train the classifier with variable selection. In the literature, sparse regularized classification techniques have been popular due to the ability of simultaneous classification and variable selection. Despite its success, such a sparse penalized method may have low computational speed, when the dimension of the problem is ultra high. To overcome this challenge, we propose a new sparse REgression based multicategory Classifier (REC). Our method uses a simplex to represent different categories of the classification problem. A major advantage of REC is that the optimization can be decoupled into smaller independent sparse penalized regression problems, and hence solved by using parallel computing. Consequently, REC enjoys an extraordinarily fast computational speed. Moreover, REC is able to provide class conditional probability estimation. Simulated examples and applications on microarray and next generation sequencing data suggest that REC is very competitive when compared to several existing methods.
关键词:LASSO; parallel computing; probability estimation; simplex; variable selection