Abstract
Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these technologies has grown exponentially over the last decade. Storage, archival and dissemination of such huge data sets require efficient solutions from both the hardware and the software perspective. The present paper describes BIND, an algorithm specialized for compressing nucleotide sequence data. By adopting a unique ‘block-length’ encoding for representing binary data as a key step, BIND achieves significant compression gains over the widely used general-purpose compression algorithms (gzip, bzip2 and lzma). Moreover, in contrast to implementations of existing specialized genomic compression approaches, the implementation of BIND can handle non-ATGC and lowercase characters. This makes BIND a loss-less compression approach that is suitable for practical use. More importantly, validation of BIND on real-world data sets indicates that reasonable compression and decompression speeds can be achieved with minimal processor and memory usage. BIND is available for download at http://metagenomics.atc.tcs.com/compression/BIND. No license is required for academic or non-profit use.
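The abstract names ‘block-length’ encoding of binary data as a key step but does not spell it out. The Python sketch below shows one plausible reading, offered only as an illustration and not as BIND's actual implementation: each base is mapped to a 2-bit code, the codes are split into two parallel bit streams, and each stream is stored as the lengths of its alternating blocks of identical bits. The 2-bit mapping in `CODES`, the two-stream split, and all function names are assumptions made for this example; handling of non-ATGC and lowercase characters, and any downstream general-purpose compression, is omitted.

```python
from itertools import groupby

# Hypothetical 2-bit codes (an assumption; the abstract gives no mapping).
CODES = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
BASES = {bits: base for base, bits in CODES.items()}


def split_bit_streams(seq):
    """Split an uppercase ATGC-only sequence into two parallel bit streams."""
    pairs = [CODES[base] for base in seq]
    return [hi for hi, _ in pairs], [lo for _, lo in pairs]


def block_lengths(bits):
    """Return (first_bit, lengths of the alternating blocks of equal bits)."""
    if not bits:
        return 0, []
    return bits[0], [len(list(run)) for _, run in groupby(bits)]


def expand_blocks(first_bit, lengths):
    """Invert block_lengths: rebuild the bit stream from its block lengths."""
    out, bit = [], first_bit
    for n in lengths:
        out.extend([bit] * n)
        bit ^= 1  # consecutive blocks always alternate between 0 and 1
    return out


def encode(seq):
    """Encode a sequence as the block lengths of its two bit streams."""
    stream1, stream2 = split_bit_streams(seq)
    return block_lengths(stream1), block_lengths(stream2)


def decode(encoded):
    """Rebuild the original sequence from the two block-length lists."""
    (first1, lengths1), (first2, lengths2) = encoded
    stream1 = expand_blocks(first1, lengths1)
    stream2 = expand_blocks(first2, lengths2)
    return "".join(BASES[pair] for pair in zip(stream1, stream2))
```

Under this mapping, `encode("AAATTGGC")` yields block lengths `[5, 3]` for the first stream and `[3, 2, 2, 1]` for the second, and `decode` restores the input exactly. Long runs of identical bits collapse into single integers, which is where a scheme of this kind would gain over storing the raw bits.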
Additional information
Corresponding editor: Reiner A Veitia
[Bose T, Mohammed MH, Dutta A and Mande SS 2012 BIND – An algorithm for loss-less compression of nucleotide sequence data. J. Biosci. 37 1-5] DOI 10.1007/s12038-012-9230-6
Supplementary materials pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/jbiosci/sep2012/supp/bose.pdf
Cite this article
Bose, T., Mohammed, M.H., Dutta, A. et al. BIND – An algorithm for loss-less compression of nucleotide sequence data. J Biosci 37, 785–789 (2012). https://doi.org/10.1007/s12038-012-9230-6