首页    期刊浏览 2024年12月03日 星期二
登录注册

文章基本信息

  • 标题:Learning from a Class Imbalanced Public Health Dataset: a Cost-based Comparison of Classifier Performance
  • 其他标题:Learning from a Class Imbalanced Public Health Dataset: a Cost-based Comparison of Classifier Performance
  • 本地全文:下载
  • 作者:Rohini R Rao ; Krishnamoorthi Makkithaya
  • 期刊名称:International Journal of Electrical and Computer Engineering
  • 电子版ISSN:2088-8708
  • 出版年度:2017
  • 卷号:7
  • 期号:4
  • 页码:2215-2222
  • DOI:10.11591/ijece.v7i4.pp2215-2222
  • 语种:English
  • 出版社:Institute of Advanced Engineering and Science (IAES)
  • 摘要:Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns, which could help formulate effective public health policies and interventions. The occurrence of chronic illness is rare in the population and the effect of this class imbalance, on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class imbalanced health datasets through a cost-based comparison of classifier performance. The popular, open-source data mining tool WEKA, was used to build a variety of core classifiers as well as classifier ensembles, to evaluate the classifiers’ performance. The unequal misclassification costs were represented in a cost matrix, and cost-benefit analysis was also performed. In another experiment, various sampling methods such as under-sampling, over-sampling, and SMOTE was performed to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well with a high recall, low number of false negatives and were not affected by the class imbalance. Results confirm that total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the random under-sampled dataset showed a dramatic drop in costs and high classification accuracy.
  • 其他摘要:Public health care systems routinely collect health-related data from the population. This data can be analyzed using data mining techniques to find novel, interesting patterns, which could help formulate effective public health policies and interventions. The occurrence of chronic illness is rare in the population and the effect of this class imbalance, on the performance of various classifiers was studied. The objective of this work is to identify the best classifiers for class imbalanced health datasets through a cost-based comparison of classifier performance. The popular, open-source data mining tool WEKA, was used to build a variety of core classifiers as well as classifier ensembles, to evaluate the classifiers’ performance. The unequal misclassification costs were represented in a cost matrix, and cost-benefit analysis was also performed. In another experiment, various sampling methods such as under-sampling, over-sampling, and SMOTE was performed to balance the class distribution in the dataset, and the costs were compared. The Bayesian classifiers performed well with a high recall, low number of false negatives and were not affected by the class imbalance. Results confirm that total cost of Bayesian classifiers can be further reduced using cost-sensitive learning methods. Classifiers built using the random under-sampled dataset showed a dramatic drop in costs and high classification accuracy.
  • 关键词:Computer and Informatics;class imbalance; classifier accuracy; cost benefit analysis; data mining; healthcare
国家哲学社会科学文献中心版权所有