
Article Information

  • Title: Detection and Removal of DUST: Duplicate URLs with Similar Text Using DUSTER
  • Authors: Jyoti G. Langhi; Prof. Shailaja Jadhav
  • Journal: International Journal of Innovative Research in Computer and Communication Engineering
  • Print ISSN: 2320-9798
  • Electronic ISSN: 2320-9801
  • Year: 2017
  • Volume: 5
  • Issue: 1
  • Pages: 556
  • DOI: 10.15680/IJIRCCE.2017.0501112
  • Publisher: S&S Publications
  • Abstract: The World Wide Web is commonly searched using Web crawlers, and some of the pages they collect contain duplicate content. Different URLs with Similar Text are generally known as DUST. The proposed method detects and removes duplicate documents without fetching their contents: normalization rules transform all duplicate URLs into the same canonical form. To improve the performance of search engines, a method called DUSTER is used. DUSTER aligns the URLs via multiple sequence alignment, generates candidate rules, and validates them; candidate rules are filtered according to their performance on a validation set, and the duplicate URLs are then removed. This method achieves a large reduction in the number of duplicate URLs. Our contribution is to improve the scalability and precision of the method and to evaluate it on other datasets. For scalability, we intend to provide a comprehensive comparison among strategies for coping with very large dup-clusters, which includes (a) better understanding the impact of using split dup-clusters instead of the original ones, (b) proposing distributed algorithms for the task, and (c) using more efficient multiple sequence alignment algorithms. Distributed processing is an effective way to improve the scalability, reliability, and performance of a database system; a distributed database is to be used.
  • Keywords: Crawling; Dup-Cluster; DUSTER; Distributed Database; URL Normalization; Web Technology
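The core idea the abstract describes — rewriting duplicate URLs into one canonical form and keeping a single representative per canonical URL, without fetching page content — can be illustrated with a minimal sketch. The two normalization rules below are hypothetical examples invented for illustration; in DUSTER such rules would be learned via multiple sequence alignment over dup-clusters and filtered on a validation set.

```python
import re

# Hypothetical normalization rules (illustrative only): each pattern, when
# matched in a URL, is replaced to move the URL toward its canonical form.
RULES = [
    # Drop a session-id query parameter (an assumed tracking artifact).
    (re.compile(r"[?&]sessionid=[^&]*"), ""),
    # A trailing "index.html" is assumed equivalent to the bare directory.
    (re.compile(r"index\.html$"), ""),
]

def canonicalize(url: str) -> str:
    """Apply every normalization rule, producing one canonical URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url

def deduplicate(urls):
    """Keep the first URL seen for each canonical form; no content fetched."""
    seen, kept = set(), []
    for url in urls:
        canon = canonicalize(url)
        if canon not in seen:
            seen.add(canon)
            kept.append(url)
    return kept

urls = [
    "http://example.com/a/index.html",
    "http://example.com/a/",
    "http://example.com/a/?sessionid=123",
]
# All three URLs normalize to "http://example.com/a/", so only one is kept.
print(deduplicate(urls))
```

The sketch shows why rule-based normalization scales: deciding whether a URL is a duplicate costs only a few regex applications, rather than a page download and a text comparison.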