Deoxyribonucleic acid (DNA) is the physical medium in which all properties of living organisms are encoded. Molecular sequence databases (e.g., EMBL, GenBank, DDBJ, Entrez, SwissProt) currently hold hundreds of thousands of nucleotide and amino acid sequences, amounting to thousands of gigabytes, and are under continuous expansion. The need for compression arises from this scale: the GenBank database alone contains approximately 44,575,745,176 bases in 40,604,319 sequence records (http://www.ncbi.nlm.nih.gov/Genbank/). Efficient compression may also reveal some biological functions and help in tasks such as phylogenetic tree reconstruction.
We present “GenBit Compress”, a compression algorithm for DNA sequences based on a novel scheme that assigns binary bits to small segments of DNA bases, compressing both repetitive and non-repetitive DNA sequences. The proposed algorithm achieves the best compression ratio for DNA sequences of larger genomes, and accepts inputs of up to 800,000 (8 lakh) characters. Its significantly better compression results show that GenBit Compress outperforms the other compression algorithms compared. While achieving the best compression ratios for DNA sequences (genomes), GenBit Compress also significantly improves on the running time of all previous DNA compression programs. We also find it useful to express the performance of an algorithm as a function of the input size, and, for the first time, we define the worst case, average case, and best case for DNA compression using the proposed algorithm. Assigning binary bits to fragments of a DNA sequence is likewise a concept introduced for the first time in DNA compression by this algorithm.
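To make the idea of assigning binary bits to DNA bases concrete, the following minimal sketch packs each base into a fixed 2-bit code, four bases per byte. This is only the standard baseline encoding, not the GenBit Compress scheme itself, whose segment-to-bit assignment is not specified in this abstract. The mapping `CODES` and the function names `pack`, `unpack`, and `bits_per_base` are assumptions introduced for illustration.

```python
# Illustrative 2-bit baseline; NOT the GenBit Compress algorithm itself.
# The base-to-code mapping below is an assumption chosen for this sketch.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i is the base whose code is i


def pack(seq: str) -> bytes:
    """Pack a DNA string over {A, C, G, T} into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so decoding can read fixed positions.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from data produced by pack()."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n])  # drop padding from the last byte


def bits_per_base(seq: str) -> float:
    """Compression ratio expressed as output bits per input base."""
    return 8 * len(pack(seq)) / len(seq)
```

With this baseline, any sequence costs about 2 bits per base regardless of its content, so a scheme that exploits repeated segments has to come in below that figure to be worthwhile. The original sequence length must be stored alongside the packed bytes, since the final byte may contain padding.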