Basic article information

Title: Extended many-item similarity indices for sets of nucleotide and protein sequences
Authors: Dávid Bajusz; Ramón Alain Miranda-Quintana; Anita Rácz et al.
Journal: Computational and Structural Biotechnology Journal
Print ISSN: 2001-0370
Publication year: 2021
Volume: 19
Pages: 3628-3639
DOI: 10.1016/j.csbj.2021.06.021
Publisher: Computational and Structural Biotechnology Journal
Abstract: Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitously present in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g. percent identity) or more intricate concepts (e.g. substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g. DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.
Keywords: Multiple comparisons; DNA sequences; Protein sequences; Diversity analysis; Similarity indices; Consistency; ANOVA; Human protein kinases; Human SH2 domains; Cytochrome P450
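
Illustration (not part of the original record): the abstract contrasts the conventional approach, in which set similarity is assembled from many pairwise comparisons, with direct set-level indices. The sketch below shows only the conventional baseline, a mean pairwise percent identity over pre-aligned, equal-length sequences. It is not the authors' extended index and does not reproduce the code in their repository; the function names `percent_identity` and `mean_pairwise_identity` and the toy sequences are assumptions made for this example.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of positions at which two aligned sequences carry the same character."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def mean_pairwise_identity(seqs: list[str]) -> float:
    """Average percent identity over all n*(n-1)/2 unordered pairs (hypothetical helper)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Toy DNA sequences (t = 4 possible characters), assumed to be already aligned.
toy_seqs = ["ACGT", "ACGA", "ACCA"]
toy_score = mean_pairwise_identity(toy_seqs)
```

The number of pairs grows quadratically with the number of sequences, which is the cost the abstract alludes to when it notes that tasks such as clustering rely on a large number of pairwise comparisons.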