Journal: International Journal of Computer Trends and Technology
Electronic ISSN: 2231-2803
Publication year: 2013
Volume: 4
Issue: 3-3
Publisher: Seventh Sense Research Group
Abstract: The amount of information available on the web today is tremendous and brings considerable challenges. Content extraction identifies the main content of a web page and removes the clutter around it. The main difficulty in extracting content from a web page is the newer architecture of web pages and the diversity of their structure. Optimized content extraction from HTML documents using collective approaches proposes a hybrid model that operates on the Document Object Model (DOM) tree of the corresponding HTML document to extract the content accurately. It combines approaches and techniques such as statistical feature extraction and formatting characteristics. Content type identification is used along with the collective approach to handle the wide variety of web pages and to achieve higher accuracy in extracting the content.
Keywords: Data mining; Information extraction; Content extraction; HTML; Open source intelligence; Information filtering
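
The abstract names DOM-based statistical feature extraction but does not spell out the algorithm. The following is a minimal sketch of the general idea only, not the authors' method: it assumes two common statistical features (text length and link density) computed on leaf-level block elements of a simplified DOM tree. All names (`DomBuilder`, `extract_content`) and the thresholds `min_chars` and `max_link_density` are hypothetical, and the formatting-characteristic and content type identification steps mentioned in the abstract are not modeled.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real extractor would use fuller lists.
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section", "blockquote"}
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects and text strings, in document order


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree (no implicit-close handling)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never hold text, so they are not tracked
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.current.children.append(text)


def text_parts(node, in_link=False):
    """Yield (text, inside_link) pairs for all text under node, in order."""
    if node.tag in SKIP_TAGS:
        return
    in_link = in_link or node.tag == "a"
    for child in node.children:
        if isinstance(child, str):
            yield child, in_link
        else:
            yield from text_parts(child, in_link)


def has_block_descendant(node):
    return any(
        isinstance(c, Node) and (c.tag in BLOCK_TAGS or has_block_descendant(c))
        for c in node.children
    )


def leaf_blocks(node):
    """Yield block elements that contain no further block elements."""
    for child in node.children:
        if isinstance(child, Node):
            if child.tag in BLOCK_TAGS and not has_block_descendant(child):
                yield child
            else:
                yield from leaf_blocks(child)


def extract_content(html, min_chars=40, max_link_density=0.3):
    """Return the text of blocks that are long enough and not link-heavy."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    kept = []
    for block in leaf_blocks(builder.root):
        parts = list(text_parts(block))
        total = sum(len(t) for t, _ in parts)
        if total == 0:
            continue
        linked = sum(len(t) for t, is_link in parts if is_link)
        if total >= min_chars and linked / total <= max_link_density:
            kept.append(" ".join(t for t, _ in parts))
    return kept
```

Calling `extract_content(page_html)` returns a list of text blocks judged to be main content. Under these assumptions, navigation lists and footers tend to be dropped because they are short or consist mostly of link text, while text placed directly inside a container that also holds block children is ignored by this simplified traversal.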