
Basic article information

  • Title: Optimized Content Extraction from web pages using Composite Approaches
  • Authors: Sheba Gaikwad; G. Naveen Sundar
  • Journal: International Journal of Computer Trends and Technology
  • Electronic ISSN: 2231-2803
  • Year of publication: 2013
  • Volume: 4
  • Issue: 3-3
  • Publisher: Seventh Sense Research Group
  • Abstract: The volume of information available on the web today is enormous, and it brings significant challenges. Content extraction identifies the main content of a web page and removes the surrounding clutter. The main difficulty lies in the newer architectures of web pages and the diversity of their structure. This work on optimized content extraction from HTML documents using collective approaches proposes a hybrid model that operates on the Document Object Model (DOM) tree of the HTML document to extract the content accurately. It combines techniques such as statistical feature extraction and formatting characteristics. Content type identification is used alongside the collective approach to handle the wide variety of web pages and to achieve higher extraction accuracy.
  • Keywords: Data mining; Information Extraction; Content extraction; HTML; Open source intelligence; Information filtering
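
The abstract mentions statistical features computed over the DOM tree but gives no algorithmic detail. The sketch below is therefore not the authors' model. It is a minimal illustration of one common statistical feature, text length weighted by link density, assuming well-formed HTML and using only the Python standard library. The names `BlockStats`, `extract_main_content`, `BLOCK_TAGS` and `SKIP_TAGS`, and the scoring formula itself, are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "article", "section", "main", "td"}
# Text inside these tags is never treated as page content.
SKIP_TAGS = {"script", "style"}


class BlockStats(HTMLParser):
    """Collect text length and link-text length for each block element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open blocks, innermost last
        self.blocks = []   # blocks whose end tag has been seen
        self.in_link = 0   # depth of open <a> elements
        self.skip = 0      # depth of open <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "text": [], "chars": 0, "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or self.skip or not self.stack:
            return
        block = self.stack[-1]  # credit text to the innermost open block only
        block["text"].append(text)
        block["chars"] += len(text)
        if self.in_link:
            block["link_chars"] += len(text)


def extract_main_content(html):
    """Return the text of the block with the most non-link text."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()

    def score(block):
        if block["chars"] == 0:
            return 0.0
        link_density = block["link_chars"] / block["chars"]
        # Long text with few links scores high; navigation menus score low.
        return block["chars"] * (1.0 - link_density)

    best = max(parser.blocks, key=score, default=None)
    return " ".join(best["text"]) if best else ""
```

Calling `extract_main_content(page_html)` on a page whose navigation bar consists only of links and whose article body is plain text should return the article text, since the navigation block has a link density of 1 and scores zero. A hybrid model of the kind the abstract describes would presumably combine several such features with formatting cues rather than rely on a single score.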