Article Information

Title: An Efficient Hierarchical Clustering Algorithm for Protein Sequences
Authors: P. A. Vijaya; M. Narasimha Murty; D. K. Subramanian et al.
Journal: International Journal of Computer Science & Applications
Print ISSN: 0972-9038
Year of publication: 2004
Volume: II
Issue: II
Publisher: Technomathematics Research Foundation
Abstract: Clustering is the division of data into groups of similar objects. The main objective of this unsupervised learning technique is to find a natural grouping or meaningful partition by using a distance or similarity function. Clustering is mainly used for dimensionality reduction, prototype selection/abstraction for pattern classification, data reorganization and indexing, and for detecting outliers and noisy patterns. Clustering techniques are applied in pattern classification schemes, bioinformatics, data mining, web mining, biometrics, document processing, remotely sensed data analysis, biomedical data analysis, etc., in which the data size is very large. In this paper, an efficient incremental clustering algorithm, 'Leaders-Subleaders', an extension of the leader algorithm suitable for protein sequences in bioinformatics, is proposed for effective clustering and prototype selection for pattern classification. It is a simple and efficient technique for generating a hierarchical structure that exposes the subgroups/subclusters within each cluster, which may be used to find the family and subfamily relationships of protein sequences. The experimental results of the proposed algorithm (classification accuracy using the prototypes obtained, and computation time) are compared with those of the leader-based and nearest neighbour classifier (NNC) methods. The proposed algorithm is found to be computationally efficient when compared to NNC. Classification accuracy obtained using the representatives generated by the Leaders-Subleaders method is found to be better than that obtained using leaders as representatives, and it approaches that of NNC if sequential search is used on the sequences from the selected subcluster. Even when a larger number of prototypes is generated, classification time is lower because only a part of the hierarchical structure is searched in the Leaders-Subleaders method.
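
The two-level idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the mismatch-based `distance` function, the rule of joining the first leader within the threshold, the threshold values, and all function names are assumptions made for illustration only. The paper itself may use an alignment-based similarity score and a different assignment rule.

```python
def distance(a: str, b: str) -> float:
    """Toy distance (assumption): fraction of mismatched positions,
    counting any length difference as mismatches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def leader_clustering(items, threshold):
    """Single-pass leader algorithm: an item joins the first leader within
    `threshold`; otherwise it becomes a new leader."""
    clusters = {}  # leader -> members (leader included), in insertion order
    for item in items:
        for leader in clusters:
            if distance(item, leader) <= threshold:
                clusters[leader].append(item)
                break
        else:
            clusters[item] = [item]
    return clusters


def leaders_subleaders(items, threshold, sub_threshold):
    """Two-level hierarchy: leaders at `threshold`, then subleaders inside
    each cluster at the tighter `sub_threshold`."""
    top = leader_clustering(items, threshold)
    return {leader: leader_clustering(members, sub_threshold)
            for leader, members in top.items()}


def nearest_subcluster(hierarchy, query):
    """Search only part of the hierarchy: nearest leader first, then the
    nearest subleader inside that leader's cluster."""
    leader = min(hierarchy, key=lambda l: distance(query, l))
    subleader = min(hierarchy[leader], key=lambda s: distance(query, s))
    return leader, subleader, hierarchy[leader][subleader]


# Hypothetical toy sequences, not data from the paper.
sequences = ["MKVLA", "MKVLG", "MRILA", "GGSTP", "GGSTA"]
hierarchy = leaders_subleaders(sequences, threshold=0.5, sub_threshold=0.25)
match = nearest_subcluster(hierarchy, "MRILG")
```

The members returned by `nearest_subcluster` are the candidates on which a sequential nearest-neighbour search could then be run, which is why classification touches only a fraction of the stored sequences.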