
Article Information

  • Title: Template Extraction from Heterogeneous Web Pages Using Text Clustering
  • Authors: T.L.N. Divya; G. Loshma; Dr. Nagaratna P Hegde
  • Journal: International Journal of Computer Trends and Technology
  • Electronic ISSN: 2231-2803
  • Year of publication: 2012
  • Volume: 3
  • Issue: 3-2
  • Publisher: Seventh Sense Research Group
  • Abstract: Nowadays, most information is stored in text databases, which consist of large collections of documents drawn from heterogeneous web pages. We extract templates from these pages by measuring the similarity of the underlying template structures in the documents and clustering the web documents on that similarity, so that templates are extracted from the resulting clusters. Earlier algorithms for measuring similarity between web pages include RTDM, TextHash, and TextMax, but these algorithms consume considerable time and space. In this paper we use the WaveKMeans algorithm to measure similarity between web pages. It consumes less time and space than RTDM, TextHash, and TextMax. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm.
  • Keywords: Template Extraction; RTDM; Text-Hash; Text-Max; WaveK-means; Clustering