Abstract: K-means clustering is an important and popular technique in data mining. Unfortunately, for any given dataset (as opposed to a knowledge base), it is very difficult for a user to estimate the proper number of clusters in advance, and K-means tends to become trapped in a local optimum when the initial seeds are chosen randomly. Genetic algorithms (GAs) are commonly used to determine the number of clusters automatically and to find an optimal solution that serves either as the initial seeds for K-means clustering or as the clustering result itself. However, they typically choose the genes of chromosomes randomly, which leads to poor clustering results, whereas a carefully selected initial population can improve the final clustering results. Hence, some GA-based techniques carefully select a high-quality initial population, but at a high computational complexity. This paper proposes an adaptive GA (AGA) with an improved initial population for K-means clustering (SeedClust). In SeedClust, an improved density estimation method and an improved K-means++ are presented to capture higher-quality initial seeds and to generate the initial population with low complexity, and adaptive crossover and mutation probabilities are designed to avoid premature convergence and to maintain population diversity, respectively. Together, these allow SeedClust to determine the proper number of clusters automatically and to capture an improved initial solution. Finally, the best chromosomes (centers) are obtained and fed into K-means as initial seeds, which generates even higher-quality clustering results by allowing the initial seeds to readjust as needed. Experimental results on low-dimensional taxi GPS (Global Positioning System) data sets demonstrate that SeedClust achieves higher performance and effectiveness.
Keywords: automatic K-means clustering; adaptive genetic algorithm; improved K-means++; density estimation; taxi GPS data
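
As a rough illustration of the final stage described in the abstract (selected centers are fed into K-means as initial seeds, and K-means then readjusts them), the sketch below uses the standard k-means++ seeding rule. It is not the improved K-means++ proposed in the paper, and it leaves out the density estimation and adaptive GA steps entirely. The function names (`sq_dist`, `kmeans_pp_seeds`, `kmeans`) and the toy 2-D points are hypothetical and are not taken from the paper or from its taxi GPS data.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Standard k-means++ seeding (not the paper's improved variant).

    The first seed is drawn uniformly; each later seed is drawn with
    probability proportional to its squared distance from the nearest
    seed chosen so far.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:
            # Assumption: the data holds at least k distinct points.
            raise ValueError("fewer distinct points than requested seeds")
        seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


def kmeans(points, seeds, max_iters=100):
    """Lloyd's iterations started from the given seeds."""
    centers = [tuple(s) for s in seeds]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Toy data: two well-separated groups of 2-D points (hypothetical values).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
rng = random.Random(0)
seeds = kmeans_pp_seeds(points, 2, rng)
centers = sorted(kmeans(points, seeds))
print(centers)
```

In this sketch the number of clusters is passed in by hand. In SeedClust, according to the abstract, that number and the seeds themselves come out of the adaptive GA, so `kmeans_pp_seeds` merely stands in for that stage.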