Abstract
Our ability to sequence the genomic data at our disposal is limited. In each experiment we can reliably sequence only a small fraction of even the smallest genome. We are then faced with the challenge of assembly: combining the short patches we have into a correct reconstruction of as large a fragment of the original sample as possible. The problem has been thoroughly researched, and many commercial and academic tools exist to tackle it. However, due to basic features of the problem, the results of even our best efforts will sometimes be disappointing for the researcher. In this chapter we try to explain why the assembly problem is so hard, what developments may alleviate it in the near future, and what can realistically be expected from a current assembly experiment.
© 2013 Springer Science+Business Media New York
About this protocol
Cite this protocol
Kol, N., Shomron, N. (2013). Assembly Algorithms for Deep Sequencing Data: Basics and Pitfalls. In: Shomron, N. (eds) Deep Sequencing Data Analysis. Methods in Molecular Biology, vol 1038. Humana Press, Totowa, NJ. https://doi.org/10.1007/978-1-62703-514-9_5
DOI: https://doi.org/10.1007/978-1-62703-514-9_5
Publisher Name: Humana Press, Totowa, NJ
Print ISBN: 978-1-62703-513-2
Online ISBN: 978-1-62703-514-9
eBook Packages: Springer Protocols