Article Information

  • Title: A New Approach for Web Information Extraction
  • Authors: R. Gunasundari; Dr. S. Karthikeyan
  • Journal: International Journal of Computer Technology and Applications
  • Online ISSN: 2229-6093
  • Year: 2012
  • Volume: 3
  • Issue: 1
  • Pages: 211-215
  • Publisher: Technopark Publications
  • Abstract: With the amount of information available on the Internet growing exponentially, users urgently need an effective technique for separating useful information from unnecessary information. Cleaning web pages before web data extraction is critical for improving the performance of information retrieval and information extraction. We therefore investigate removing the various noise patterns found in web pages, rather than directly extracting the relevant content, in order to obtain the main content. To this end, we put forward a main content extraction method that first removes the usual noise and the candidate nodes that contain no main content information, and then uses the relation between content text length, anchor text length, and the number of punctuation marks to extract the main content. This paper focuses on noise removal and on the use of these content characteristics. Experiments show that the approach can enhance the universality and accuracy of extracting the body text of web pages.
  • Keywords: information extraction; web page content extraction; removing noise content
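
The two-stage idea in the abstract (strip obvious noise, then judge the remaining candidates by text length, anchor text length, and punctuation count) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag lists, the scoring formula, the punctuation weight of 10, and the `min_score` threshold are all assumptions made for illustration, and the names `BlockCollector`, `score`, and `extract_main_content` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed noise and block-level tags; the paper's actual lists are not given here.
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside", "form"}
BLOCK_TAGS = {"div", "p", "td", "article", "section", "li"}
PUNCTUATION = set(".,;:!?")


class BlockCollector(HTMLParser):
    """Splits a page into text blocks, skipping text inside noise tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks: {"text": ..., "anchor_len": ...}
        self.noise_depth = 0    # > 0 while inside a noise tag
        self.anchor_depth = 0   # > 0 while inside an <a> tag
        self.text_parts = []
        self.anchor_len = 0

    def flush(self):
        # Close the current block, collapsing runs of whitespace.
        text = " ".join("".join(self.text_parts).split())
        if text:
            self.blocks.append({"text": text, "anchor_len": self.anchor_len})
        self.text_parts = []
        self.anchor_len = 0

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a":
            self.anchor_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag == "a":
            self.anchor_depth = max(0, self.anchor_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.noise_depth:
            return  # stage 1: drop text that sits inside noise tags
        self.text_parts.append(data)
        if self.anchor_depth:
            self.anchor_len += len(data.strip())


def score(block):
    # Stage 2 (assumed formula): reward long, punctuated text with little link text.
    text_len = len(block["text"])
    link_density = min(1.0, block["anchor_len"] / text_len)
    punct = sum(ch in PUNCTUATION for ch in block["text"])
    return text_len * (1 - link_density) + 10 * punct


def extract_main_content(html, min_score=50):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing block
    return [b["text"] for b in parser.blocks if score(b) >= min_score]
```

Calling `extract_main_content(page_html)` returns the text of the blocks that pass the threshold. Link-heavy blocks such as menus score low because most of their text is anchor text and they contain little punctuation, while body paragraphs score high on both length and punctuation.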