Article Information

Title: EXTRACTION OF HTML DOCUMENTS FROM HETEROGENEOUS WEBPAGES USING CLUSTER TECHNIQUES
Authors: Sruthi Kamban K.S; M. Sindhuja
Journal: International Journal of Engineering and Computer Science
Print ISSN: 2319-7242
Publication year: 2013
Volume: 2
Issue: 4
Pages: 1106-1110
Publisher: IJECS
Abstract: The World Wide Web is a vast and rapidly growing source of information. Most of this information is in the form of unstructured text, which makes it hard to query. To make queries easier and their results more accurate, a template extraction technique is used. In the existing system, the techniques used to extract the data are inefficient and lead to problems such as delay, reduced accuracy, and duplicate data. The proposed system uses a hypergraph technique to extract templates from a large number of web documents generated from heterogeneous templates, making web search more efficient in terms of cost, performance, and time. In addition, the proposed approach uses a clustering technique to cluster the web documents based on the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously, and it provides a goodness measure with a fast approximation for clustering.
Keywords: Document Object Model; MinHash; Minimum description length; Jaccard coefficient; Template extraction
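
Illustrative sketch: the abstract describes clustering documents by the similarity of their template structures, and the keywords name MinHash and the Jaccard coefficient. The record does not give the authors' algorithm, so the following is only a minimal, generic illustration of how MinHash signatures can approximate the Jaccard coefficient between two documents. It assumes each document is represented as a set of DOM path strings; the example paths, the function names, and the choice of BLAKE2b as the hash are all assumptions made for this sketch, not details from the article.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed (assumed scheme)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """For each seeded hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree; approximates the Jaccard coefficient."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical DOM paths for two pages that share most of a template.
doc_a = {"html/head/title", "html/body/div/h1", "html/body/div/p", "html/body/ul/li"}
doc_b = {"html/head/title", "html/body/div/h1", "html/body/div/p", "html/body/table/tr"}

exact = jaccard(doc_a, doc_b)
approx = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"exact Jaccard = {exact:.2f}, MinHash estimate = {approx:.2f}")
```

The signature length trades accuracy for cost: comparing two fixed-length signatures is cheaper than intersecting large path sets, which is the usual motivation for using such an approximation when many documents have to be compared during clustering.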