Assembly Algorithms for Deep Sequencing Data: Basics and Pitfalls

Deep Sequencing Data Analysis

Part of the book series: Methods in Molecular Biology (MIMB, volume 1038)

Abstract

Our ability to sequence genomic material is limited: in any single experiment we can reliably read only a short fraction of even the smallest genome. We are therefore faced with the challenge of assembly, that is, combining the short patches we have into a correct reconstruction of as large a fragment of the original sample as possible. The problem has been thoroughly researched, and many commercial and academic tools exist to carry it out. However, owing to basic features of the problem, the results of even our best efforts will sometimes disappoint the researcher. In this chapter we try to explain why the assembly problem is so hard, what developments may alleviate it in the near future, and what can realistically be expected from a current assembly experiment.




Copyright information

© 2013 Springer Science+Business Media New York

About this protocol

Cite this protocol

Kol, N., Shomron, N. (2013). Assembly Algorithms for Deep Sequencing Data: Basics and Pitfalls. In: Shomron, N. (ed) Deep Sequencing Data Analysis. Methods in Molecular Biology, vol 1038. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-514-9_5


  • DOI: https://doi.org/10.1007/978-1-62703-514-9_5

  • Publisher Name: Humana Press, Totowa, NJ

  • Print ISBN: 978-1-62703-513-2

  • Online ISBN: 978-1-62703-514-9

  • eBook Packages: Springer Protocols
