Basic Article Information

Title: A Framework for Hierarchical Clustering Based Indexing in Search Engines
Authors: Parul Gupta; A.K. Sharma
Journal: BVICAM's International Journal of Information Technology
Print ISSN: 0973-5658
Publication year: 2011
Volume: 3
Issue: 2
Publisher: Bharati Vidyapeeth's Institute of Computer Applications and Management
Abstract: Granting efficient and fast access to the index is a key issue for the performance of Web Search Engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use Inverted File (IF) indexes, which consist of an array of posting lists; each posting list is associated with a term and contains the term together with the identifiers of the documents containing it. Because the document identifiers are stored in sorted order, they can be stored as the differences between successive identifiers, which reduces the size of the index. This paper describes a clustering algorithm that partitions the set of documents into ordered clusters so that documents within the same cluster are similar and are assigned close document identifiers. The average difference between successive identifiers is thereby minimized, which saves storage space. The paper further extends this clustering algorithm to hierarchical clustering, in which similar clusters are merged to form a mega cluster and similar mega clusters are then combined to form a super cluster. The resulting levels of clustering optimize the search process by directing the search along a specific path from higher levels to lower levels, i.e. from super clusters to mega clusters, then to clusters, and finally to individual documents, so that the user gets the best possible matching results in the minimum possible time.
Keywords: Inverted files; Index compression; Document Identifiers Assignment; Hierarchical Clustering
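
The gap-encoding idea summarized in the abstract can be illustrated with a small sketch. It stores a sorted posting list as its first identifier followed by the differences between successive identifiers, then estimates the storage cost. The function names and the variable-byte size estimate (7 payload bits per byte) are assumptions made for illustration; the record above does not say which integer coding the paper uses.

```python
from typing import List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a sorted posting list into [first id, gap, gap, ...]."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps: List[int]) -> List[int]:
    """Rebuild the original document identifiers from the gap list."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vbyte_len(n: int) -> int:
    """Bytes needed for n under variable-byte coding (assumed scheme)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def encoded_size(doc_ids: List[int]) -> int:
    """Total bytes for a posting list stored as variable-byte gaps."""
    return sum(vbyte_len(gap) for gap in to_gaps(doc_ids))


# Hypothetical posting lists for one term: the same four documents,
# first with scattered identifiers, then with close identifiers such as
# a similarity-based clustering might assign.
scattered = [3, 1200, 45000, 90000]
clustered = [10, 11, 12, 13]

print(to_gaps(scattered), encoded_size(scattered))   # [3, 1197, 43800, 45000] 9
print(to_gaps(clustered), encoded_size(clustered))   # [10, 1, 1, 1] 4
```

In this toy case the closely numbered documents need 4 bytes instead of 9, which is the effect the clustering-based identifier assignment aims for: smaller gaps lead to shorter codes.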