Journal: Journal of Theoretical and Applied Information Technology
Print ISSN: 1992-8645
Electronic ISSN: 1817-3195
Year: 2012
Volume: 45
Issue: 2
Pages: 710-715
Publisher: Journal of Theoretical and Applied
Abstract: Domain-oriented web page extraction is a new and practical direction in information extraction. This paper focuses on representing the topic features of domain-oriented web pages and proposes a hierarchical vector space (HVS) model. By taking into account the hierarchical characteristics of the web page itself, the HVS model expresses a page's topic features more effectively in terms of both page structure and content. The problem of identifying topic-related pages is then addressed through similarity calculation. Experimental results show that the system achieves good accuracy and applicability for domain-oriented web extraction.
Keywords: Domain-Oriented; Hierarchical Vector Space Model; Information Extraction
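
The abstract does not give the HVS formulas, so the following is only an illustrative sketch of the general idea: a page is represented by one term vector per structural level, and topic relevance is scored by a weighted sum of per-level cosine similarities. The level names (`title`, `headings`, `body`), the weights, and the function names are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical level weights (assumption, not from the paper): structural
# levels closer to the page's "topic signal" get a larger weight.
LEVEL_WEIGHTS = {"title": 0.5, "headings": 0.3, "body": 0.2}


def term_vector(text):
    """Build a raw term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hvs_similarity(page, topic, weights=LEVEL_WEIGHTS):
    """Weighted sum of per-level cosine similarities.

    `page` and `topic` map level names to text; missing levels count as empty.
    """
    return sum(
        w * cosine(term_vector(page.get(level, "")),
                   term_vector(topic.get(level, "")))
        for level, w in weights.items()
    )


if __name__ == "__main__":
    topic = {"title": "laptop review",
             "headings": "battery screen price",
             "body": "laptop battery life screen quality price"}
    page = {"title": "budget laptop review",
            "headings": "battery price",
            "body": "this laptop has good battery life for the price"}
    print(round(hvs_similarity(page, topic), 3))
```

A page would then be treated as topic-related when its score exceeds a chosen threshold; the threshold, like the weights, would have to be tuned per domain.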