
Basic Article Information

  • Title: A Graph Based Approach for Eliminating DUST Using Normalization Rules
  • Authors: Jayashri Waman; Pankaj Agarkar
  • Journal: International Journal of Innovative Research in Computer and Communication Engineering
  • Print ISSN: 2320-9798
  • Electronic ISSN: 2320-9801
  • Year: 2016
  • Volume: 4
  • Issue: 4
  • Pages: 7259
  • DOI: 10.15680/IJIRCCE.2016.0404181
  • Publisher: S&S Publications
  • Abstract: Duplicate content means search engines must waste time crawling every duplicate version of a page, relying on them to do so in the way you intend. Duplicate content generally refers to substantive blocks of content, within or across domains, that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Such duplicated data wastes resources and results in poor user experiences. We focus on removing links to duplicate content by the website's address, i.e. the URL. We convert URLs into a graphical format of sequences and perform operations on them; a URL tokenizer is also used to recognize the web protocol and top-level domain (a sketch of this kind of normalization follows this record). This approach offers a sound way to remove identical content from a set of web pages, so web crawlers can readily adopt it and enable better indexing. The proposed method achieved larger reductions in the number of duplicate URLs than the existing approach.
  • Keywords: URL normalization; De-duping; Consensus sequences; Canonical Form
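
The abstract describes normalizing URLs to a canonical form and tokenizing them by protocol and top-level domain, but this record does not give the paper's exact normalization rules or its graph construction. The following is a minimal Python sketch under assumed rules that are common in URL normalization (lowercasing the scheme and host, dropping fragments and default ports, sorting query parameters, trimming trailing slashes); the function names tokenize, normalize, and dedup, and the rule set itself, are illustrative rather than taken from the paper.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def tokenize(url):
    """Split a URL into protocol, host labels, and path segments.
    The host's last label approximates the top-level domain."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".") if parts.hostname else []
    return {
        "protocol": parts.scheme,
        "host": host_labels,
        "tld": host_labels[-1] if host_labels else "",
        "path": [seg for seg in parts.path.split("/") if seg],
    }

def normalize(url):
    """Reduce a URL to a canonical form under a few assumed rules:
    lowercase scheme/host, drop the fragment, drop default ports,
    sort query parameters, and trim a trailing slash.
    User info in the netloc, if any, is dropped for simplicity."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 collapse together.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((scheme, host, path, query, ""))

def dedup(urls):
    """Keep one URL per canonical form (a simple de-duping pass)."""
    seen, kept = set(), []
    for url in urls:
        canonical = normalize(url)
        if canonical not in seen:
            seen.add(canonical)
            kept.append(url)
    return kept

if __name__ == "__main__":
    urls = [
        "http://Example.com:80/a/b/?x=1&y=2",
        "http://example.com/a/b?y=2&x=1#top",  # same page, reordered query
        "https://example.com/a/b",             # different scheme, kept
    ]
    print(dedup(urls))  # the first two collapse to one canonical form
```

De-duplication here is order-preserving: the first URL seen for each canonical form is kept and later duplicates are discarded, mirroring how a crawler frontier could apply such rules before fetching.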