BIND – An algorithm for loss-less compression of nucleotide sequence data

Journal of Biosciences

Abstract

Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these technologies has grown exponentially over the last decade. Storing, archiving and disseminating such huge data sets requires efficient solutions in both hardware and software. The present paper describes BIND, an algorithm specialized for compressing nucleotide sequence data. By adopting, as a key step, a unique ‘block-length’ encoding for representing binary data, BIND achieves significant compression gains over the widely used general-purpose compression algorithms (gzip, bzip2 and lzma). Moreover, in contrast to implementations of existing specialized genomic compression approaches, the implementation of BIND can handle non-ATGC and lowercase characters. This makes BIND a loss-less compression approach that is suitable for practical use. More importantly, validation of BIND on real-world data sets indicates that reasonable compression and decompression speeds can be achieved with minimal processor and memory usage. BIND is available for download at http://metagenomics.atc.tcs.com/compression/BIND. No license is required for academic or non-profit use.
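
The abstract names a ‘block-length’ encoding of binary data without spelling it out. The Python sketch below is one possible reading, not the BIND implementation. It assumes that each base maps to two bits (the mapping in `CODE` is hypothetical), that the bits are split into two streams, and that each stream is stored as the lengths of its alternating blocks of 0s and 1s. All function names are illustrative, and the handling of non-ATGC and lowercase characters is left out.

```python
from itertools import groupby

# Hypothetical 2-bit mapping; the abstract does not specify one.
CODE = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
BASE = {bits: base for base, bits in CODE.items()}


def block_lengths(bits):
    """Describe a bit stream as (first bit, lengths of alternating blocks)."""
    runs = [len(list(group)) for _, group in groupby(bits)]
    first = bits[0] if bits else 0
    return first, runs


def expand(first, runs):
    """Rebuild the bit stream from its first bit and block lengths."""
    bits = []
    bit = first
    for length in runs:
        bits.extend([bit] * length)
        bit ^= 1  # blocks alternate between 0 and 1
    return bits


def encode(seq):
    """Split an uppercase ATGC string into two block-length-coded streams."""
    pairs = [CODE[base] for base in seq]  # raises KeyError on other characters
    high = [p[0] for p in pairs]
    low = [p[1] for p in pairs]
    return block_lengths(high), block_lengths(low)


def decode(encoded):
    """Invert encode(): expand both streams and map bit pairs back to bases."""
    (first_high, runs_high), (first_low, runs_low) = encoded
    high = expand(first_high, runs_high)
    low = expand(first_low, runs_low)
    return "".join(BASE[(h, l)] for h, l in zip(high, low))
```

Under these assumptions, long stretches of identical bits in either stream collapse to a single number, and `decode(encode(seq))` returns the original sequence, which is the loss-less property the abstract emphasises.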

Author information

Corresponding author

Correspondence to Sharmila S Mande.

Additional information

Corresponding editor: Reiner A Veitia

[Bose T, Mohammed MH, Dutta A and Mande SS 2012 BIND – An algorithm for loss-less compression of nucleotide sequence data. J. Biosci. 37 1-5] DOI 10.1007/s12038-012-9230-6

Supplementary materials pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/jbiosci/sep2012/supp/bose.pdf

About this article

Cite this article

Bose, T., Mohammed, M.H., Dutta, A. et al. BIND – An algorithm for loss-less compression of nucleotide sequence data. J Biosci 37, 785–789 (2012). https://doi.org/10.1007/s12038-012-9230-6

  • DOI: https://doi.org/10.1007/s12038-012-9230-6
