Journal: International Journal of Electronics and Computer Science Engineering
Electronic ISSN: 2277-1956
Publication year: 2012
Volume: 1
Issue: 4
Pages: 2084-2094
Publisher: Buldanshahr : IJECSE
Abstract: Web crawling has gained considerable importance in recent years owing to the rapid growth of the World Wide Web. The very large number of documents on the web makes search engines less useful to their users. Many of these documents are duplicates or near duplicates, i.e. variants derived from the same original web document, which create additional overhead for search engines and significantly degrade their performance and quality. The web crawling research community has widely recognized the need to detect duplicate and near-duplicate web pages. Giving users relevant results for their queries on the first page, free of duplicate and redundant entries, is a vital requirement, and avoiding duplication also saves storage and improves search quality. Near-duplicate web pages are detected first, and the crawled web pages are then stored in repositories. Detecting near duplicates conserves network bandwidth, reduces storage cost, and improves the quality of search engines. In this paper, we discuss a feasible method for detecting near-duplicate web documents based on the titles of the documents, which helps reduce the overhead of search engines and improve their performance.