
Article Information

  • Title: EXTRACTION OF HTML DOCUMENTS FROM HETEROGENEOUS WEBPAGES USING CLUSTER TECHNIQUES
  • Authors: Sruthi Kamban K.S; M.Sindhuja
  • Journal: International Journal of Engineering and Computer Science
  • Print ISSN: 2319-7242
  • Year of publication: 2013
  • Volume: 2
  • Issue: 4
  • Pages: 1106-1110
  • Publisher: IJECS
  • Abstract: The World Wide Web is a vast and rapidly growing source of information. Most of this information is unstructured text, which makes it hard to query. Template extraction is used to make queries easier and their results more accurate. The extraction techniques in existing systems are inefficient and suffer from delay, reduced accuracy, and duplicate data. The proposed system uses a hypergraph technique to extract templates from a large number of web documents generated from heterogeneous templates, making web search more efficient in cost, performance, and time. In addition, the proposed approach uses a clustering technique to retrieve web documents based on the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously, and it provides a goodness measure with a fast approximation for clustering.
  • Keywords: Document Object Model; Min Hash; Minimum description length; Jaccard coefficient; Template Extraction
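
The keywords name Min Hash and the Jaccard coefficient, which are commonly paired to approximate set similarity quickly. This record does not give the paper's actual procedure, so the following is only a minimal sketch of that general idea, assuming each page is represented as a set of DOM path strings. The sample paths, the signature length `NUM_HASHES`, and all function names are hypothetical and chosen for illustration.

```python
import hashlib

NUM_HASHES = 128  # assumed signature length; not taken from the paper


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash of a string, standing in for one random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: set, num_hashes: int = NUM_HASHES) -> list:
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions estimates the Jaccard coefficient."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical DOM path sets for two pages that share part of a template.
page_a = {"html/body/div/h1", "html/body/div/p", "html/body/ul/li"}
page_b = {"html/body/div/h1", "html/body/div/p", "html/body/table/tr"}

exact = jaccard(page_a, page_b)  # 2 shared paths out of 4 distinct paths
approx = estimate_jaccard(minhash_signature(page_a), minhash_signature(page_b))
print(f"exact={exact:.2f} approx={approx:.2f}")
```

Comparing fixed-length signatures instead of full path sets keeps the cost per pair of pages constant, which is why such an approximation suits clustering a large collection of documents.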