Article Information

  • Title: RoadRunner for Heterogeneous Web Pages Using Extended MinHash
  • Authors: A Suresh Babu; P. Premchand; A. Govardhan
  • Journal: International Journal of Database Management Systems
  • Print ISSN: 0975-5985
  • Electronic ISSN: 0975-5705
  • Year of publication: 2012
  • Volume: 4
  • Issue: 1
  • Publisher: Academy & Industry Research Collaboration Center (AIRCC)
  • Abstract: The Internet presents a large amount of useful information that is usually formatted for human readers, which makes it hard to extract relevant data from diverse sources. There is therefore a significant need for robust, flexible Information Extraction (IE) systems that transform web pages into program-friendly structures such as a relational database. IE produces structured data ready for post-processing. RoadRunner is used to extract information from template-generated web pages. In this paper, we present a novel algorithm for extracting templates from a large number of web documents that are generated from heterogeneous templates. The proposed system focuses on information extraction from heterogeneous web pages. We cluster the web documents based on their common template structures so that the template for each cluster is extracted simultaneously. The resulting clusters are given as input to the RoadRunner system.
  • Keywords: Information Extraction; Clustering; Minimum Description Length Principle; MinHash