Journal: International Journal of Innovative Research in Computer and Communication Engineering
Print ISSN: 2320-9798
Online ISSN: 2320-9801
Year: 2016
Volume: 4
Issue: 4
Pages: 7259
DOI:10.15680/IJIRCCE.2016.0404181
Publisher: S&S Publications
Abstract: Duplicate content forces search engines to waste time crawling every duplicate version of a page, leaving site owners to rely on the engines to handle those versions as intended. Duplicate content generally refers to substantive blocks of content, within or across domains, that either completely match other content or are appreciably similar. In most cases this is not deceptive in origin, but such duplicated data wastes resources and results in a poor user experience. We focus on removing links to duplicate content based on the address of the web page, i.e., the URL. We convert URLs into a graphical sequence representation and perform operations on it; a URL tokenizer is also used to identify the web protocol and the top-level domain. This approach helps remove identical content from a set of web pages, so web crawlers can readily adopt it and achieve better indexing. The proposed method achieved larger reductions in the number of duplicate URLs than the existing approach. (A minimal illustrative sketch of such URL canonicalization is given after the keywords.)
Keywords: URL normalization; De-duping; Consensus sequences; Canonical Form
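The paper itself gives no implementation, so the following is a minimal sketch, in Python with the standard urllib.parse module, of the kind of URL canonicalization, tokenization, and de-duping the abstract describes. The function names canonicalize, tokenize, and dedupe, and the specific normalization rules (lowercasing, default-port removal, query-parameter sorting, trailing-slash trimming, fragment removal) are illustrative assumptions, not the paper's algorithm; in particular, the paper's graphical sequence representation and consensus-sequence step are not reproduced here.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url: str) -> str:
    """Reduce a URL to an assumed canonical form so duplicate pages
    map to the same key. Rules here are common normalizations, not
    the paper's exact method."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop default ports (80 for http, 443 for https).
    port = parts.port
    if port and not ((scheme == "http" and port == 80) or
                     (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    # Trim a trailing slash; keep "/" for the root path.
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so their order never creates a "new" URL.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Fragments never reach the server, so they are discarded.
    return urlunsplit((scheme, host, path, query, ""))

def tokenize(url: str) -> dict:
    """Split a URL into protocol, top-level domain, host, and
    path tokens, as the abstract's URL tokenizer is described."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    return {
        "protocol": parts.scheme.lower(),
        "tld": labels[-1],
        "host": ".".join(labels),
        "path_tokens": [seg for seg in parts.path.split("/") if seg],
    }

def dedupe(urls):
    """Keep one representative URL per canonical form."""
    seen = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen.values())

if __name__ == "__main__":
    urls = [
        "http://Example.com:80/a/b/?y=2&x=1#frag",
        "http://example.com/a/b?x=1&y=2",
        "https://example.com/a/b?x=1&y=2",
    ]
    print(dedupe(urls))       # first two collapse; the https URL differs
    print(tokenize(urls[0]))  # protocol "http", tld "com", two path tokens
```

Sorting query parameters before comparing is a typical design choice in such pipelines: parameter order rarely changes page content, so `?y=2&x=1` and `?x=1&y=2` collapse to a single canonical key, which is exactly the reduction in duplicate URLs the abstract targets.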