Article Information

  • Title: Enriched Format Text Categorization Using A Component Similarity Approach
  • Authors: Zhu, Fei ; Yang, Jiong ; Zhou, Yong
  • Journal: Journal of Software
  • Print ISSN: 1796-217X
  • Year of publication: 2011
  • Volume: 6
  • Issue: 9
  • Pages: 1713-1720
  • DOI: 10.4304/jsw.6.9.1713-1720
  • Language: English
  • Publisher: Academy Publisher
  • Abstract: Text categorization has been widely studied for years. However, conventional categorization approaches that work well on plain text perform poorly when they are applied directly to enriched format texts. A categorization approach that is applicable to enriched format text is proposed. During feature selection, we obtain a feature structure distribution weight by using an extended structure model, so that the effects of structure on categorization are fully considered. Text formats are also taken into account in feature weighting. The combined feature weighting approach strengthens important parts and weakens less important ones. Text categorization is performed by document component similarity, which first decomposes the document, gathers features by components and other user-defined rules, completes the document component tree, and then categorizes with it. We implement a CSBC-based Naïve Bayes classifier in which the final result is the combination of all classifiers of the component tree. Finally, we parse OpenOffice.org documents, extract the components that are most related to classification, and then use the classifier to categorize the documents. The experimental results show that the classifier can automatically classify OpenOffice.org documents and works quite well.
  • Keywords: text classification; enriched format text classification; OpenDocument; OpenOffice.org; Naïve Bayes
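
The abstract describes combining the per-component classifiers of a document component tree into one final result. The record does not give the paper's actual CSBC formulation, so the following is only a minimal sketch of the general idea, under stated assumptions: a document is a flat mapping from component name to tokens (not a tree), each component gets its own multinomial Naïve Bayes model with Laplace smoothing, and the per-component log-likelihoods are combined as a weighted sum. The class name `ComponentNaiveBayes`, the component names `title` and `body`, and the weights are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class ComponentNaiveBayes:
    """Illustrative per-component multinomial Naive Bayes (not the paper's CSBC)."""

    def __init__(self, component_weights, alpha=1.0):
        # Hypothetical weights standing in for format/structure importance.
        self.component_weights = component_weights
        self.alpha = alpha  # Laplace smoothing constant
        self.term_counts = defaultdict(Counter)  # (component, label) -> term counts
        self.label_counts = Counter()            # label -> number of training documents
        self.vocab = defaultdict(set)            # component -> terms seen in training

    def fit(self, documents, labels):
        for doc, label in zip(documents, labels):
            self.label_counts[label] += 1
            for component, tokens in doc.items():
                self.term_counts[(component, label)].update(tokens)
                self.vocab[component].update(tokens)
        return self

    def _log_likelihood(self, component, tokens, label):
        counts = self.term_counts[(component, label)]
        total = sum(counts.values())
        vocab_size = len(self.vocab[component])
        return sum(
            math.log((counts[t] + self.alpha) / (total + self.alpha * vocab_size))
            for t in tokens
        )

    def predict(self, doc):
        n_docs = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / n_docs)  # class prior
            for component, tokens in doc.items():
                if component not in self.vocab:
                    continue  # component never seen in training: no evidence
                weight = self.component_weights.get(component, 1.0)
                score += weight * self._log_likelihood(component, tokens, label)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy data, invented for illustration only.
train_docs = [
    {"title": ["football", "match"], "body": ["team", "goal", "score"]},
    {"title": ["stock", "market"], "body": ["price", "share", "trade"]},
]
train_labels = ["sports", "finance"]

model = ComponentNaiveBayes({"title": 2.0, "body": 1.0}).fit(train_docs, train_labels)
prediction = model.predict({"title": ["football"], "body": ["goal", "team"]})
print(prediction)
```

Weighting the title above the body is one simple way to let formatted or structurally prominent parts count for more; the weights used in the paper, and how it derives them from the extended structure model, are not stated in this record.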