Journal: International Journal of Computer Science & Technology
Print ISSN: 2229-4333
Online ISSN: 0976-8491
Publication year: 2012
Volume: 3
Issue: 4
Pages: 372-375
Language: English
Publisher: Ayushmaan Technologies
Abstract: Content extraction is the process of identifying the main content of a web page, or equivalently of removing the additional content that surrounds it. The main difficulty in extracting content from web pages lies in their newer architectures and the diversity of their structure. Many content extraction strategies work from a DOM tree representation, extracted features, or the tag ratios of the HTML page, and estimate the useful content from these. This paper presents a comparative study of various content extraction algorithms.
Keywords: Data Mining; Information Extraction; Content Extraction; HTML; Open Source Intelligence; Information Filtering