Journal: International Journal of Computer Technology and Applications
Electronic ISSN: 2229-6093
Year: 2012
Volume: 3
Issue: 6
Pages: 1971-1978
Publisher: Technopark Publications
Abstract: Accurate clustering of text is a challenging problem in the information retrieval community. In some cases experts possess prior knowledge about the data that can enhance clustering performance. In this paper a two-layer semi-supervised clustering method is proposed to improve text clustering accuracy. The approach uses the Space Level Constraints Clustering (SLCC) method as a first layer to categorize the data, which provides the prior knowledge for the second layer. K-means clustering is an efficient method, but its bottleneck is its sensitivity to the number of clusters and to the initial centers. K-means is employed as the second layer in the proposed structure, and its drawbacks are addressed by incorporating the prior knowledge found by SLCC in the first layer, namely the number of partitions and their centers. The Reuters-21578 dataset, along with some standard sets from the UCI repository, is selected as a rich benchmark to evaluate the method, so that the accuracy of the clustering methods can be precisely determined. The combined scheme is applied to the high-dimensional Reuters-21578 data, and the clustering results show higher accuracy than using SLCC or K-means alone on that dataset; the scheme also yields substantial improvement on the other datasets.
Keywords: space level constraints clustering; K-means; text retrieval; semi-supervised clustering
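
Illustrative sketch: the abstract describes a second layer in which K-means receives the number of partitions and their centers from the first layer instead of guessing them. The Python sketch below shows only that seeding idea, under stated assumptions. It does not implement SLCC, whose details the abstract does not give; the first-layer output is stood in for by a hypothetical hand-written list `first_layer_groups`, and the names `centroid` and `seeded_kmeans` are illustrative, not taken from the paper.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def seeded_kmeans(points, init_centers, max_iter=100):
    """Lloyd-style K-means where k and the starting centers are supplied.

    The number of clusters is simply len(init_centers), so neither k nor
    the initial centers are chosen at random.
    """
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center (Euclidean).
        labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Update step: recompute each center; keep the old one if a cluster is empty.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else old)
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers


# ASSUMPTION: stand-in for the first layer's output. In the paper this
# would come from SLCC; here it is a hand-written toy grouping.
first_layer_groups = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(5.0, 5.0), (6.0, 5.0)],
]
init_centers = [centroid(g) for g in first_layer_groups]

points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (5.0, 5.0), (6.0, 5.0), (5.0, 6.0),
]
labels, centers = seeded_kmeans(points, init_centers)
print(labels)
```

With seeds taken from the first-layer groups, the number of clusters follows from the number of groups, which is the sensitivity of K-means that the abstract says the first layer is meant to remove.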