Article information

  • Title: Finding and Classifying Near-Duplicate Pages based on Identical Sentences Detection
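  • Authors: Tomohide Shibata ; Naun Kang ; Sadao Kurohashi
  • Journal: 人工知能学会論文誌 (Transactions of the Japanese Society for Artificial Intelligence)
  • Print ISSN: 1346-0714
  • Online ISSN: 1346-8030
  • Publication year: 2010
  • Volume: 25
  • Issue: 1
  • Pages: 224-232
  • DOI: 10.1527/tjsai.25.224
  • Publisher: The Japanese Society for Artificial Intelligence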
  • Abstract: The recent explosive growth in the number of Web pages has made it possible to obtain a wide variety of information with a search engine. However, by some estimates, as many as 40% of the pages on the Web are duplicates of other pages, so some search results contain duplicate pages. This paper proposes a method for finding similar pages in a very large collection of one hundred million Japanese Web pages. Similar pages are defined as two pages that share some sentences, and they are classified into types such as mirror pages, citation pages, and plagiaristic pages. First, the content region of each page is extracted, because sentences in non-content regions tend not to be useful for similar page detection. Relatively long sentences are then extracted from the content region, because two pages tend to be related when they share relatively long sentences. A pair of pages that contain identical sentences is regarded as similar. Next, similar pages are classified using several kinds of information, such as the overlap ratio, the number of inlinks and outlinks, and URL similarity. We ran similar page detection and classification on a large-scale Japanese Web page collection and found mirror pages, citation pages, and plagiaristic pages.
  • Keywords: Web ; near-duplicate page ; contents extraction
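
The detection step described in the abstract (keep relatively long sentences, then pair pages that contain an identical sentence) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 20-character length threshold, the punctuation-based sentence splitter, and the definition of the overlap ratio (shared sentences divided by the sentence count of the smaller page) are all assumptions made for illustration, and content-region extraction is assumed to have been done already.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical threshold: the abstract only says "relatively long" sentences.
MIN_SENTENCE_CHARS = 20


def long_sentences(text, min_chars=MIN_SENTENCE_CHARS):
    """Split text on sentence-ending punctuation and keep the long sentences.

    The splitter is a simplification (it would also split on decimal points).
    """
    parts = re.split(r"[。．.!?！？\n]+", text)
    return {p.strip() for p in parts if len(p.strip()) >= min_chars}


def find_similar_pages(pages, min_chars=MIN_SENTENCE_CHARS):
    """Return {(id_a, id_b): overlap_ratio} for pages sharing a long sentence.

    `pages` maps a page id to the text of its (already extracted) content region.
    """
    sentences = {pid: long_sentences(text, min_chars) for pid, text in pages.items()}

    # Inverted index: sentence -> ids of the pages that contain it.
    index = defaultdict(set)
    for pid, sents in sentences.items():
        for s in sents:
            index[s].add(pid)

    # Collect the sentences shared by each pair of pages.
    shared = defaultdict(set)
    for s, pids in index.items():
        for a, b in combinations(sorted(pids), 2):
            shared[(a, b)].add(s)

    # Assumed overlap ratio: shared sentences / sentences in the smaller page.
    return {
        (a, b): len(common) / min(len(sentences[a]), len(sentences[b]))
        for (a, b), common in shared.items()
    }
```

The inverted index avoids comparing every pair of pages directly, which matters at the scale the abstract mentions; only pages that share at least one long sentence are ever paired. The resulting ratio corresponds to one of the classification signals named in the abstract, alongside link counts and URL similarity.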