Journal: Computational and Structural Biotechnology Journal
Print ISSN: 2001-0370
Year: 2021
Volume: 19
Pages: 3628-3639
DOI: 10.1016/j.csbj.2021.06.021
Publisher: Computational and Structural Biotechnology Journal
Abstract: Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitous in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g., percent identity) or more intricate concepts (e.g., substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g., DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at https://github.com/ramirandaq/tn_Comparisons.
Keywords: Multiple comparisons; DNA sequences; Protein sequences; Diversity analysis; Similarity indices; Consistency; ANOVA; Human protein kinases; Human SH2 domains; Cytochrome P450