首页    期刊浏览 2025年02月28日 星期五
登录注册

文章基本信息

  • 标题:Huffbit Compress €“ Algorithm to Compress Dna Sequences Using Extended Binary Trees
  • 本地全文:下载
  • 作者:P.raja Rajeswari ; Dr. Allam Apparao ; Dr. R.kiran Kumar
  • 期刊名称:Journal of Theoretical and Applied Information Technology
  • 印刷版ISSN:1992-8645
  • 电子版ISSN:1817-3195
  • 出版年度:2010
  • 卷号:13
  • 期号:02
  • 出版社:Journal of Theoretical and Applied
  • 摘要:

    Compressing DNA sequences is an important task as every day thousands of gigabytes of sequences of nucleotides and amino acids gets stored in Genbank. Storing large Genomes in a personal computer in the compressed form is an efficient means of using DNA sequences for biological functions.
    We present a compression algorithm, “HuffBit Compress” for DNA sequences based on a novel algorithm of assigning binary bit codes(0 and 1) for each base(A,C,G,T) to compress both repetitive and non repetitive DNA sequence. Our proposed algorithm achieves the compression ratio using the concept of Extended Binary Tree. In this algorithm Extended binary tree concept is applied to derieve a special class of variable-length codes that satisfy prefix property. This class of codes is called Huffman Codes. Huffman codes are based on relative frequency with which DNA symbols (A,C,G,T) appear in the sequence.
    The specific bit sequence assigned to an individual base is determined by tracing out the path from the root of the tree to the leaf that represents that Base. By convention, bit '0' represents following the left child when tracing out the path and bit '1' represents following the right child when tracing out the path. The extended binary tree is constructed such that the paths from the root to the most frequently used bases are short while the paths to less frequently used bases are long. This results in short codes for frequently used bases and long codes for less frequently used bases.
    If the sequences are encoded using the old trivial method of assigning 2 bits per base.(A=00,C=01,G=10,T=11) The encoded text is of higher length.(1000 bases needs 2000 bits of space). This proposed algorithm “HuffBit Compress” compresses DNA sequences more efficiently.(1000 bases may need only 1006 bits of space in Best case.). Analysis of Best case and worst case is defined for the first time in this Algorithm. Compression Ratio of 1.6 bits/base can be achieved using this proposed algorithm. This algorithm makes the near choice (minimum value) at every step that appears to lead to an overall optimal base encoding.

  • 关键词:Huffman Codes; Compression Ratio; Internal Node; External Node; Extended Binary Tree; Encode; Decode
国家哲学社会科学文献中心版权所有