Journal: International Journal of Innovative Research in Computer and Communication Engineering
Print ISSN: 2320-9798
Electronic ISSN: 2320-9801
Publication year: 2016
Volume: 4
Issue: 4
Pages: 8085
DOI: 10.15680/IJIRCCE.2016.0404319
Publisher: S&S Publications
Abstract: Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. This project aims to extract less structured web content, such as news articles, that appears only once in noisy web pages. Our approach classifies text blocks by first removing noise, then segmenting the page into visual and text units, extracting features from those units, and applying a PCA-based feature transformation before classification.
Keywords: Web Pages; Visual Unit; Text Unit; Extracting Features; PCA
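
The last two stages named in the abstract (PCA-based feature transformation, then classification of text blocks) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not taken from the paper: the three per-block features (word count, link density, text-to-tag ratio), the toy training values, the standardization step, and the nearest-centroid classifier standing in for whatever classifier the authors used.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on standardized features; return (mean, std, components)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd((X - mean) / std, full_matrices=False)
    return mean, std, vt[:n_components]

def pca_transform(X, mean, std, components):
    """Project standardized features onto the retained principal directions."""
    return ((X - mean) / std) @ components.T

def centroid_fit(Z, y):
    """Compute one centroid per class in the PCA-transformed space."""
    labels = np.unique(y)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in labels])
    return labels, centroids

def centroid_predict(Z, labels, centroids):
    """Assign each row of Z the label of its nearest class centroid."""
    dist = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dist.argmin(axis=1)]

# Hypothetical features per text block:
# [word count, link density, text-to-tag ratio]
X_train = np.array([
    [250.0, 0.02, 0.90],  # article-like blocks (label 1)
    [310.0, 0.05, 0.85],
    [180.0, 0.03, 0.80],
    [12.0, 0.90, 0.10],   # noise blocks such as menus or ads (label 0)
    [8.0, 0.95, 0.15],
    [20.0, 0.80, 0.20],
])
y_train = np.array([1, 1, 1, 0, 0, 0])

mean, std, components = pca_fit(X_train, n_components=2)
labels, centroids = centroid_fit(
    pca_transform(X_train, mean, std, components), y_train
)

# Two unseen blocks: one article-like, one noise-like.
X_new = np.array([[220.0, 0.04, 0.88], [10.0, 0.92, 0.12]])
pred = centroid_predict(
    pca_transform(X_new, mean, std, components), labels, centroids
)
print(pred)
```

Standardizing before the projection keeps the raw word count from dominating the principal directions; whether the original work scales its features this way is not stated in the abstract.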