Journal title: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Publication year: 2015
Volume: 72
Issue: 1
Publisher: Journal of Theoretical and Applied
Abstract: This paper presents a new ensemble classifier for the class imbalance problem, with an emphasis on two-class (binary) classification. The method combines the SMOTE (Synthetic Minority Over-sampling Technique), Rotation Forest, and AdaBoostM1 algorithms. SMOTE was used to over-sample the minority class at 100%, 200%, 300%, 400%, and 500% of its initial size, and attribute selection was performed to prevent over-fitting. The ensemble classifier was proposed to address the classification of imbalanced biological datasets by achieving a low prediction error and improving prediction performance. The Rotation Forest algorithm was used to produce an ensemble classifier with a lower prediction error, while the AdaBoostM1 algorithm was used to enhance the classifier's performance. All tests were carried out using the Java-based WEKA (Waikato Environment for Knowledge Analysis) and Orange Canvas data mining systems on the training datasets. The performance of three types of classifiers on imbalanced biomedical datasets was assessed. This paper explores how effectively the new method produces an accurate overall classifier and lowers the overall error rate. Tests were carried out on three real imbalanced biomedical datasets obtained from the KEEL dataset repository. These imbalanced datasets were divided into ten categories according to their imbalance ratios (IR), which ranged from 1.86 to 41.40. The results indicated that the proposed method, which combined three methods and was assessed with various evaluation metrics, was effective. In practical terms, using SMOTE-RotBoost to classify biological datasets yields a low mean absolute error as well as high accuracy and precision.
The Kappa coefficient values were close to 1, indicating strong agreement between the predicted and actual classes across the classifications, while the false negative rates, which were close to 0, showed the reliability of the measurements. SMOTE-RotBoost yields AUC-ROC values that show a larger area under the curve than the other classifiers, and it is a vital method for the assessment of diagnostic tests.
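The evaluation quantities named in the abstract (the Kappa coefficient, the false negative rate, and the imbalance ratio) can be illustrated with a short sketch. This is not the authors' WEKA or Orange workflow; the function names `binary_metrics` and `imbalance_ratio` are hypothetical, the confusion-matrix counts in the usage lines are made up for illustration, and the IR is assumed to follow the usual definition of majority count divided by minority count.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, Cohen's kappa and false negative rate from a 2x2 confusion matrix.

    Hypothetical helper for illustration; the positive class is assumed
    to be the minority class.
    """
    total = tp + fn + fp + tn
    observed = (tp + tn) / total  # observed agreement (accuracy)
    # Agreement expected by chance, computed from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    # Kappa is undefined when chance agreement is already 1; report 1.0 here.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    false_negative_rate = fn / (tp + fn)
    return {
        "accuracy": observed,
        "kappa": kappa,
        "false_negative_rate": false_negative_rate,
    }


def imbalance_ratio(n_majority, n_minority):
    """Imbalance ratio (IR), assumed here to be majority size / minority size."""
    return n_majority / n_minority


# Illustrative counts only, not results from the paper.
print(binary_metrics(tp=48, fn=2, fp=3, tn=147))
print(imbalance_ratio(n_majority=150, n_minority=50))
```

A kappa near 1 together with a false negative rate near 0 corresponds to the pattern the abstract reports: almost all minority-class cases are recovered, and agreement is well above what the class proportions alone would produce.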
Keywords: SMOTE; Rotation Forest; Random Subspace; Bagging; Boosting
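The over-sampling step described in the abstract, where the minority class is grown by 100% to 500% of its initial size, can be sketched as follows. This is a minimal pure-Python illustration of SMOTE-style interpolation, not the WEKA filter the authors used; the function name `smote_sketch`, the default `k=5`, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random


def smote_sketch(minority, percent, k=5, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    minority: list of numeric feature vectors (lists of floats)
    percent:  over-sampling amount, a positive multiple of 100
              (100 creates one synthetic sample per original, 200 two, ...)
    k:        number of nearest minority neighbours to interpolate towards
    """
    if percent <= 0 or percent % 100 != 0:
        raise ValueError("percent must be a positive multiple of 100")
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")

    rng = random.Random(seed)
    per_sample = percent // 100
    k = min(k, len(minority) - 1)
    synthetic = []

    for i, x in enumerate(minority):
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, minority[j]))
        )
        neighbours = others[:k]

        for _ in range(per_sample):
            neighbour = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from x to neighbour
            synthetic.append(
                [a + gap * (b - a) for a, b in zip(x, neighbour)]
            )
    return synthetic


# Toy minority class, purely illustrative.
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
for pct in (100, 200, 300, 400, 500):
    new_points = smote_sketch(toy_minority, pct, k=3)
    print(pct, len(new_points))
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours, so the new samples stay inside the region already occupied by the minority class instead of duplicating existing rows.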