BIND – An algorithm for loss-less compression of nucleotide sequence data

Bose, Tungadri; Mohammed, Monzoorul Haque; Dutta, Anirban; Mande, Sharmila S

doi:10.1007/s12038-012-9230-6

Published: 26 August 2012

Volume 37, pages 785–789, (2012)

Journal of Biosciences

Abstract

Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these technologies has been increasing exponentially over the last decade. Storing, archiving and disseminating such huge data sets requires efficient solutions from both the hardware and the software perspective. The present paper describes BIND, an algorithm specialized for compressing nucleotide sequence data. By adopting a unique ‘block-length’ encoding for representing binary data as a key step, BIND achieves significant compression gains over the widely used general-purpose compression algorithms gzip, bzip2 and lzma. Moreover, in contrast to implementations of existing specialized genomic compression approaches, the implementation of BIND can handle non-ATGC and lowercase characters. This makes BIND a loss-less compression approach that is suitable for practical use. More importantly, validation of BIND on real-world data sets indicates that reasonable compression and decompression speeds can be achieved with minimal processor and memory usage. BIND is available for download at http://metagenomics.atc.tcs.com/compression/BIND. No license is required for academic or non-profit use.
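The abstract names a ‘block-length’ encoding of binary data as the key step but does not describe it. The sketch below shows one plausible reading of that idea, not BIND's actual implementation: each base is given a 2-bit code, the codes are split into two bit-streams, and each stream is stored as the lengths of its alternating runs of equal bits. The code table and every function name here are assumptions made for illustration, and the sketch covers only uppercase A, T, G and C.

```python
from itertools import groupby

# Hypothetical 2-bit code table; the abstract does not state which codes BIND uses.
CODES = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
BASES = {bits: base for base, bits in CODES.items()}


def split_streams(seq):
    """Split a sequence into two bit-streams, one bit of each base's code per stream."""
    first = [CODES[base][0] for base in seq]
    second = [CODES[base][1] for base in seq]
    return first, second


def block_lengths(bits):
    """Return the first bit and the lengths of the alternating runs of equal bits."""
    if not bits:
        return None, []
    return bits[0], [len(list(run)) for _, run in groupby(bits)]


def expand(first_bit, lengths):
    """Rebuild a bit-stream from its first bit and its run lengths."""
    bits, bit = [], first_bit
    for n in lengths:
        bits.extend([bit] * n)
        bit ^= 1  # runs alternate between 0 and 1
    return bits


def encode(seq):
    """Encode a sequence as one (first_bit, run_lengths) pair per bit-stream."""
    return tuple(block_lengths(stream) for stream in split_streams(seq))


def decode(encoded):
    """Invert encode(): expand both streams and map bit pairs back to bases."""
    first, second = (expand(first_bit, lengths) for first_bit, lengths in encoded)
    return "".join(BASES[pair] for pair in zip(first, second))
```

For example, `encode("AAAATTGGCC")` yields `((0, [6, 4]), (0, [4, 2, 2, 2]))`, and `decode` restores the original string. A real compressor would still have to write the run lengths out compactly and record non-ATGC and lowercase characters separately; both are left out of this sketch.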

Author information

Authors and Affiliations

Bio-Sciences R&D Division, TCS Innovation Labs, 54B Hadapsar Industrial Estate, Tata Consultancy Services Limited, Hadapsar, Pune, 411 013, India
Tungadri Bose, Monzoorul Haque Mohammed, Anirban Dutta & Sharmila S Mande

Corresponding author

Correspondence to Sharmila S Mande.

Additional information

Corresponding editor: Reiner A Veitia

[Bose T, Mohammed MH, Dutta A and Mande SS 2012 BIND – An algorithm for loss-less compression of nucleotide sequence data. J. Biosci. 37 1-5] DOI 10.1007/s12038-012-9230-6

Supplementary materials pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/jbiosci/sep2012/supp/bose.pdf

Electronic supplementary material

Esm 1 (PDF 32.6 kb)

About this article

Cite this article

Bose, T., Mohammed, M.H., Dutta, A. et al. BIND – An algorithm for loss-less compression of nucleotide sequence data. J Biosci 37, 785–789 (2012). https://doi.org/10.1007/s12038-012-9230-6

Received: 23 January 2012
Accepted: 14 May 2012
Published: 26 August 2012
Issue Date: September 2012
DOI: https://doi.org/10.1007/s12038-012-9230-6
