Abstract: Clustering is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes or are very fragile in the presence of outliers. We propose a clustering algorithm called HCKM that is more robust to outliers and identifies clusters with spherical or non-spherical shapes and wide variance in size. HCKM achieves this by representing each cluster by several points, namely the means of the smaller sub-clusters that form it. Having more than one representative point per cluster allows HCKM to adjust well to the geometry of non-spherical shapes. Our experimental results confirm that the quality of the clusters produced by HCKM is better than that of the clusters found by existing algorithms. This is because the first phase, which creates the sample, is an enhanced k-means procedure that enables us to remove the outliers. Furthermore, the results demonstrate that sampling enables HCKM not only to outperform existing algorithms but also to scale well to large databases without sacrificing clustering quality.
Keywords: Hierarchical clustering, Cluster analysis, Data analysis
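
The two-phase idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the HCKM implementation: the abstract does not specify the details, so the farthest-first seeding, the `min_size` outlier threshold, the single-link merge rule, and all function and parameter names (`hckm_sketch`, `n_sub`, `n_clusters`) are assumptions made for illustration only. The sketch runs k-means to form small sub-clusters, drops sub-clusters that are too small as outliers, and then merges the remaining sub-clusters hierarchically, using their means as representative points.

```python
import math

def farthest_first(points, k):
    # Assumed seeding: start at the first point, then repeatedly add the
    # point that is farthest from every seed chosen so far.
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, c) for c in seeds)))
    return seeds

def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iters=20):
    # Plain Lloyd iterations; an empty sub-cluster keeps its old centroid.
    centroids = farthest_first(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign(points, centroids)

def hckm_sketch(points, n_sub, n_clusters, min_size=2):
    # Phase 1: k-means produces n_sub small sub-clusters (the "sample").
    centroids, labels = kmeans(points, n_sub)
    sizes = [labels.count(j) for j in range(n_sub)]
    # Assumption: sub-clusters with fewer than min_size points are outliers.
    kept = [j for j in range(n_sub) if sizes[j] >= min_size]
    # Phase 2: each group is a list of sub-cluster indices whose means act
    # as the group's representative points; merge the two closest groups
    # (single link over representatives) until n_clusters remain.
    groups = [[j] for j in kept]
    while len(groups) > n_clusters:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = min(math.dist(centroids[i], centroids[j])
                        for i in groups[a] for j in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)
    sub_to_cluster = {j: c for c, g in enumerate(groups) for j in g}
    # Points that fell into a dropped sub-cluster are labelled -1.
    return [sub_to_cluster.get(l, -1) for l in labels]
```

Calling `hckm_sketch(points, n_sub=5, n_clusters=2)` on a list of coordinate tuples returns one cluster label per point, with `-1` marking points treated as outliers. Because every final cluster keeps several sub-cluster means rather than a single centroid, the merge step can follow elongated or irregular shapes, which is the property the abstract attributes to multiple representative points.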