
Article Information

  • Title: Query Based Duplicate Data Detection on WWW
  • Authors: Ranjna Gupta; Neelam Duhan; A.K. Sharma
  • Journal: International Journal on Computer Science and Engineering
  • Print ISSN: 2229-5631
  • Electronic ISSN: 0975-3397
  • Publication year: 2010
  • Volume: 2
  • Issue: 4
  • Pages: 1395-1400
  • Publisher: Engg Journals Publications
  • Abstract: The problem of finding relevant documents has become much more prominent due to the presence of duplicate data on the WWW. This redundancy increases the time users spend seeking the desired information within the search results, while in general most users just want to cull through tens of result pages to find new or different results. The identification of similar or near-duplicate pairs in a large collection is a significant problem with widespread applications. A contemporary instance of the problem is the efficient identification of near-duplicate Web pages, which is particularly challenging at web scale because of the volume of data. Therefore, a mechanism is needed for detecting duplicate data so that relevant search results can be provided to the user. This paper proposes an architecture whose methods run online as well as offline, on the basis of favored and disfavored user queries, to detect duplicates and near duplicates.
  • Keywords: WWW; Query log; Cluster; Search Engine; Ranking Algorithm