Article Information

  • Title: A Near-Duplicate Detection Algorithm to Facilitate Document Clustering
  • Authors: Lavanya Pamulaparty; Dr. C.V Guru Rao; Dr. M. Sreenivasa Rao
  • Journal: International Journal of Data Mining & Knowledge Management Process
  • Print ISSN: 2231-007X
  • Electronic ISSN: 2230-9608
  • Publication year: 2014
  • Volume: 4
  • Issue: 6
  • Page: 39
  • DOI: 10.5121/ijdkp.2014.4604
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data such as the Internet. The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources. These pages increase either the index storage space or the serving costs. Detecting these pages has many potential applications; for example, it may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in order to facilitate document clustering. We demonstrate our approach in the web news articles domain. The experimental results show that our algorithm performs better in terms of similarity measures. Identifying near-duplicate and duplicate documents reduced the memory required in repositories.
  • Keywords: Web Content Mining; Information Retrieval; document clustering; duplicate; near-duplicate detection; similarity; web documents