
Article information

  • Title: COMMON SENSE BASED TEXT DOCUMENT CLUSTERING ALGORITHM BY COARSE AND FINE GRAINED CLUSTERING TECHNIQUES
  • Authors: G. LOSHMA ; DR. NAGARATNA P HEDGE
  • Journal: Journal of Theoretical and Applied Information Technology
  • Print ISSN: 1992-8645
  • Electronic ISSN: 1817-3195
  • Publication year: 2017
  • Volume: 95
  • Issue: 10
  • Publisher: Journal of Theoretical and Applied
  • Abstract: Text documents are a major source of data, so it is important to keep them organized. Clustering is one way to organize data: it groups similar documents together. Although many clustering algorithms exist, there is still a need for more accurate ones. Moreover, most clustering algorithms rely on distance-based measures, which limits their accuracy. To overcome these issues, this work presents a double-layered text document clustering algorithm. The system is divided into four phases: document pre-processing, representation, clustering, and cluster labelling. The pre-processing phase prepares each document for the subsequent phases. The representation phase standardizes the structure of the document using the Document Index Graph (DIG) model. The documents are then clustered by cosine similarity to form a rough set of clusters. A second level of cluster refinement is achieved with ConceptNet, which works on the basis of common sense reasoning. Finally, each cluster is labelled with its top-ranked key-phrase. The approach is tested on the BBCSport and 20 Newsgroups datasets and achieves better results in terms of F-measure, purity, and entropy.
  • Keywords: Document clustering; DIG model; Sense based clustering; Distance based clustering
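
The coarse clustering step named in the abstract (grouping documents by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes plain bag-of-words term counts for the DIG model, omits the ConceptNet refinement and labelling phases entirely, and the stop-word list, the 0.3 threshold, the single-pass assignment rule, and all function names are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "after"}


def term_vector(text):
    """Pre-process: lower-case, tokenize, drop stop words, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coarse_clusters(docs, threshold=0.3):
    """Single pass: add each document to the first cluster whose seed
    (first member) is at least `threshold` similar, else open a new cluster."""
    vectors = [term_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


docs = [
    "the striker scored a late goal in the football match",
    "the football match ended after a late goal",
    "the bowler took five wickets in the cricket test",
]
print(coarse_clusters(docs))
```

With these three sample sentences, the two football documents share four content terms and fall into one cluster, while the cricket document shares none and starts its own. A purely lexical measure like this cannot tell that "striker" and "goal" are related concepts, which is the kind of gap the abstract's second, common-sense-based refinement layer is meant to address.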