Annotation Pipelines for Next-Generation Sequencing Projects

  • Chapter
Comparative Gene Finding

Part of the book series: Computational Biology ((COBO,volume 20))

Abstract

Next-generation sequencing technologies have caused an explosion in the availability of genomic sequence data. This creates both opportunities and challenges, not least within the bioinformatics field. The opportunities include the possibility of sequencing and analyzing a wide variety of organisms, spanning distant parts of the tree of life. The challenges include shorter sequence lengths, reduced data quality, and the training and quality-control issues that arise when dealing with completely novel sequences. In this chapter we present the issues involved in building a genome annotation pipeline, with particular attention to next-generation sequencing data.

References

  1. Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amantides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al.: The genome sequence of Drosophila melanogaster. Science 287, 2185–2195 (2000)

    Article  Google Scholar 

  2. Allen, J.E., Salzberg, S.L.: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21, 3596–3603 (2005)

    Article  Google Scholar 

  3. Allen, J.E., Majoros, W.H., Pertea, M., Salzberg, S.L.: JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biol. 7, S9 (2007)

    Article  Google Scholar 

  4. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)

    Article  Google Scholar 

  5. Avery, O.T., MacLeod, C.M., McCarty, M.: Studies of the chemical nature of the substance inducing transformation of pneumococcal types. Induction of transformation by a desoxyribonucleic acid fraction isolated from pneumococcus type III. J. Exp. Med. 79, 137–158 (1944)

    Article  Google Scholar 

  6. Baertsch, R., Diekhans, M., Kent, W.J., Haussler, D., Brosius, J.: Retrocopy contributions to the evolution of the human genome. BMC Genomics 9, 466 (2008)

    Article  Google Scholar 

  7. Bartlett, J.M., Stirling, D.: A short history of the polymerase chain reaction. Methods Mol. Biol. 226, 3–6 (2003)

    Google Scholar 

  8. Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, K., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J.P., Lander, E.S.: ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177–189 (2002)

    Article  Google Scholar 

  9. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W.: Genbank Nucleic Acids Res. 37, D26–D31 (2009)

    Article  Google Scholar 

  10. Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R., et al.: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)

    Article  Google Scholar 

  11. Bergman, C.M., Quesneville, H.: Discovering and detecting transposable elements in genome sequences. Brief. Bioinform. 8, 382–392 (2007)

    Article  Google Scholar 

  12. Bianconi, E., Piovesan, A., Beraudi, A., Casadei, R., Frabetti, F., Vitale, L., Pelleri, M.C., Tassani, S., Piva, F., Perez-Amodio, S., Strippoli, P., Canaider, S.: An estimation of the number of cells in the human body. Ann. Hum. Biol. 40, 463–471 (2013)

    Article  Google Scholar 

  13. Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., Gregor, J., Davis, N.W., Kirkpatrick, H.A., Goeden, M.A., Rose, D.J., Mau, B., Shao, Y.: The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474 (1997)

    Article  Google Scholar 

  14. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., Pilbout, S., Schneider, M.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003)

    Article  Google Scholar 

  15. Bradnam, K.R., Fass, J.N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J.A., Chapuis, G., Chikhi, R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2, 10 (2013)

    Article  Google Scholar 

  16. Brady, A., Salzberg, S.L.: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat. Methods 6, 673–676 (2009)

    Article  Google Scholar 

  17. Breitbart, M., Salamon, P., Andresen, B., Mahaffy, J.M., Segall, A.M., Mead, D., Azam, F., Rohwer, F.: Genomic analysis of uncultured marine viral communities. Proc. Natl. Acad. Sci. USA 99, 14250–14255 (2002)

    Article  Google Scholar 

  18. Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., et al.: Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18, 630–634 (2000)

    Article  Google Scholar 

  19. Burge, C., Karlin, S.: Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997)

    Article  Google Scholar 

  20. Campbell, M.S., Law, M., Holt, C., Stein, J.C., Moghe, G.D., Hufnagel, D.E., Lei, J., Achawanantakun, R., Jiao, D., Lawrence, C.J., et al.: MAKER-p: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 164, 513–524 (2014)

    Article  Google Scholar 

  21. Cantarel, B.L., Korf, I., Robb, S.M.C., Parra, G., Ross, E., Moore, B., Holt, C., Sanches Alvarado, A., Yandell, M.: MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008)

    Article  Google Scholar 

  22. Caspi, A., Pachter, L.: Identification of transposable elements using multiple alignments of related genomes. Genome Res. 16, 260–270 (2006)

    Article  Google Scholar 

  23. Chain, P.S.G., Grafham, D.V., Fulton, R.S., FitzGerald, M.G., Hostetler, J., Muzny, D., Ali, J., Birren, B., Bruce, D.C., Buhay, C., et al.: Genome project standards in a new era of sequencing. Science 326, 236–237 (2009)

    Article  Google Scholar 

  24. Chen, K., Pachter, L.: Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput. Biol. 1, e24 (2005)

    Article  Google Scholar 

  25. Clarke, J., Wu, H.-C., Jayasinghe, L., Patel, A., Reid, S., Bayley, H.: Continuouos base identification for single-molecule nanopore DNA sequencing. Nat. Nanotechnol. 4, 265–270 (2009)

    Article  Google Scholar 

  26. Collins, F.S., Green, E.D., Guttmacher, A.E., Guyer, M.S.: A vision for the future of genomics research. Nature 422, 835–847 (2003)

    Article  Google Scholar 

  27. Cunningham, F., Amode, M.R., Barrell, D., Beal, K., Billis, K., Brent, S., Carvalho-Silva, D., Clapham, P., Coates, G., Fitzgerald, S., et al.: Ensembl 2015. Nucleic Acids Res. 43, D662–D669 (2015)

    Article  Google Scholar 

  28. Dahm, R.: Discovering DNA: Friedrich Miescher and the early years of nucleic acid research. Hum. Genet. 122, 565–581 (2008)

    Article  Google Scholar 

  29. Dayhoff, M.O.: Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington (1969)

    Google Scholar 

  30. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model of evolutionary change in proteins. In: Dayhoff, M.O. (ed.) Atlas of Protein Sequence and Structure, vol. 5, pp. 345–352. Washington, Natl. Biomed. Res. Found (1978)

    Google Scholar 

  31. de Brujin, N.G.: A combinatorial problem. Koninklije Nederlandse Akademie v. Wetenschappen 49, 758–764 (1946)

    Google Scholar 

  32. de Filippo, C., Ramazzotti, M., Fontana, P., Cavalieri, D.: Bioinformatic approaches for functional annotation and pathway inference in metagenomics data. Brief. Bioinform. 13, 696–710 (2012)

    Article  Google Scholar 

  33. de la Bastide, M., McCombie, W.R.: Assembling genomic DNA sequences with PHRAP. Curr. Protoc. Bioinform. Chapter 11, Unit 11.4 (2007)

    Google Scholar 

  34. Donlin, M.J.: Using the generic genome browser (GBrowse). In: Current Protocols in Bioinformatics, Chapter 9, Unit 9.9 (2009)

    Google Scholar 

  35. Earl, D., Bradnam, K., John, J.S., Darling, A., Lin, D., Fass, J., Yu, H.O.K., Buffalo, V., Zerbino, D.R., Diekhans, M., et al.: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2010)

    Article  Google Scholar 

  36. Eid, J., Fehr, A., Grey, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., et al.: Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009)

    Article  Google Scholar 

  37. Eilbeck, K., Lewis, S.E., Mungall, C.J., Yandell, M., Stein, L., Durbin, R., Ashburner, M.: The sequence ontology: a tool for the unification of genome annotations. Genome Biol. 6, R44 (2005)

    Article  Google Scholar 

  38. Eilbeck, K., Moore, B., Holt, C., Yandell, M.: Quantitative measures for the management and comparison of annotated genomes. BMC Bioinform. 10, 67 (2009)

    Article  Google Scholar 

  39. El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequencing assembly: four stages of data processing and computational challenges. PLoS One 9, e1003345 (2013)

    Google Scholar 

  40. Elsik, C.G., Mackey, A.J., Reese, J.T., Milshina, N.V., Roos, D.S., Weinstock, G.M.: Creating a honey bee consensus gene set. Genome Biol. 8, R13 (2007)

    Article  Google Scholar 

  41. Engels, R.: Argo Genome Browser. http://www.broadinstitute.organnotationargo

  42. Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, S.J., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L.L.: The Pfam protein families database. Nucleic Acids Res. 36, D281–D288 (2007)

    Article  Google Scholar 

  43. Fleischmann, R., Adams, M., White, O., Clayton, R., Kirkness, E., Kerlavage, A., Bult, C., Tomb, J., Dougherty, B., Merrick, J.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496–512 (1995)

    Article  Google Scholar 

  44. Flicek, P., Birney, E.: Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6, S6–S12 (2009)

    Article  Google Scholar 

  45. Generic Feature Format (GFF). http://www.sequenceontology.orggff3.shtml

  46. Gilbert, W., Maxam, A.: The nucleotide of the lac operator. Proc. Natl. Acad. Sci. USA 70, 3581–3584 (1973)

    Article  Google Scholar 

  47. Gill, S.R., Pop, M., DeBoy, R.T., Eckburg, P.B., Turnbaugh, P.J., Samuel, B.S., Gordon, J.I., Relman, D.A., Fraser-Liggett, C.M., Nelson, K.E.: Metagenomic analysis of the human distal gut microbiome. Science 312, 1355–1359 (2006)

    Article  Google Scholar 

  48. Gish, W., States, D.J.: Identification of protein coding regions by database similarity search. Nat. Genet. 3, 266–272 (1993)

    Article  Google Scholar 

  49. Glass, E.M., Wilkening, J., Wilke, A., Antonopoulos, D., Meyer, F.: Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harbor protocols 2010, doi:10.1101/pdb.prot5368 (2010)

  50. Gnerre, S., Maccallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker, B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., Berlin, A.M., Aird, D., Costello, M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E.S., Jaffe, D.B.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. USA 108, 1513–1518 (2011)

    Article  Google Scholar 

  51. Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., Louis, E.J., Mewes, H.W., Murakami, Y., Philippsen, P., Tettelin, H., Oliver, S.G.: Life with 6000 genes. Science 274(546), 563–567 (1996)

    Google Scholar 

  52. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., et al.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 15, 644–652 (2011)

    Article  Google Scholar 

  53. Guttman, M., Garber, M., Levin, J.Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L., Koziol, M.J., Gnirke, A., Nusbaum, C., Rinn, J.L., Lander, E.S., Regev, A.: Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010)

    Article  Google Scholar 

  54. Haas, B.J., Zody, M.C.: Advancing RNA-Seq analysis. Nat. Biotechnol. 28, 421–423 (2010)

    Article  Google Scholar 

  55. Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Hannick Jr, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., et al.: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003)

    Article  Google Scholar 

  56. Haas, B.J., Salzberg, S.L., Zhu, W., Pertea, M., Allen, J.E., Orvis, J., White, O., Buell, C.R., Wortman, J.R.: Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008)

    Article  Google Scholar 

  57. Handelsman, J., Rondon, M.R., Brady, S.F., Clardy, J., Goodman, R.M.: Molecular biology access to the chemistry of unknown soil microbes: a new Frontier for natural products. Chem. Biol. 5, R245–R249 (1998)

    Article  Google Scholar 

  58. Hartl, D.L.: Fly meets shotgun: shotgun wins. Nat. Genet. 24, 327–328 (2000)

    Article  Google Scholar 

  59. Havlak, P., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Weinstock, G.M., Gibbs, R.A.: The atlas genome assembly system. Genome Res. 14, 721–732 (2004)

    Article  Google Scholar 

  60. Hesper, B., Hogeweg, P.: Bioinformatica: een werkconcept. Kameleon 1, 28–29 (1970)

    Google Scholar 

  61. Hess, M., Sczyrba, A., Egan, R., Kim, T.-W., Chokhawala, H., Schroth, G., Luo, S., Clark, D.S., Chen, F., Zhang, T., et al.: Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463–467 (2011)

    Article  Google Scholar 

  62. Hoff, K.: The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 10, 520 (2009)

    Article  Google Scholar 

  63. Hoff, K.J., Lingner, T., Meinicke, P., Tech, M.: Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 37, W101–105 (2009)

    Article  Google Scholar 

  64. Holley, R.W., Apgar, J., Everett, G.A., Madison, J.T., Marquisee, M., Merrill, S.H., Penswick, J.R., Zamir, A.: Structure of a ribonucleic acid. Science 147, 1462–1465 (1965)

    Article  Google Scholar 

  65. Holt, C., Yandell, M.: MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 12, 491 (2011)

    Article  Google Scholar 

  66. Huang, X., Madan, A.: CAP3: a DNA sequence assembly program. Genome Res. 9, 868–877 (1999)

    Article  Google Scholar 

  67. Huang, X., Wang, J., Aluru, S., Yang, S.P., Hillier, L.: PCAP: a whole-genome assembly program. Genome Res. 13, 2164–2170 (2003)

    Article  Google Scholar 

  68. Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., Lucas, W.J., Wang, X., Xie, B., Ni, P., et al.: The genome of the cucumber. Cucumis sativus L. Nat. Genet. 41, 1275–1281 (2009)

    Google Scholar 

  69. Huson, D.H., Mitra, S., Ruscheweyh, H.J., Weber, N., Schuster, S.C.: Integrative analysis of environmental sequences using MEGAN4. Genome Res. 21, 1552–1560 (2011)

    Article  Google Scholar 

  70. International Human Genome Sequencing Consortium: Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004)

    Google Scholar 

  71. Ju, J., Kim, D.H., Bi, L., Meng, Q., Bai, X., Li, Z., Li, X., Marma, M.S., Shi, S., Wu, J., Edwards, J.R., Romu, A., Turro, N.J.: Four-color DNA sequencing by synthesis using cleavable flourescent nucleotide reversible terminators. Proc. Natl. Acad. Sci. USA 103, 19635–19640 (2006)

    Article  Google Scholar 

  72. Kapustin, Y., Souvorov, A., Tatusova, T., Lipman, D.: Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008)

    Article  Google Scholar 

  73. Kelly, T.J., Smith, H.O.: A restriction enzyme from Hemophilus influenzae II. J. Mol. Biol. 51, 393–409 (1970)

    Article  Google Scholar 

  74. Kent, W.J.: BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002)

    Article  MathSciNet  Google Scholar 

  75. Kim, M., Lee, K.H., Yoon, S.W., Kim, B.S., Chun, J., Yi, H.: Analytical tools and databases for metagenomics in the next-generation sequencing era. Genomics Inform. 11, 102–113 (2013)

    Article  Google Scholar 

  76. Korf, I., Yandell, M., Bedell, J.: BLAST: An Essential Guide to the Basic Local Alignment Search Tool. O’Reilly & Asscociates, Sebastopol (2003)

    Google Scholar 

  77. Korf, I.: Gene finding in novel genomes. BMC Bioinform. 5, 59 (2004)

    Article  Google Scholar 

  78. Lander, E.S., Waterman, M.S.: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988)

    Article  Google Scholar 

  79. Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al.: Initial sequencing and analysis of the human genome. Nature 409, 745–964 (2001)

    Article  Google Scholar 

  80. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)

    Article  Google Scholar 

  81. Lerat, E.: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Hered. (Edinb) 104, 520–533 (2010)

    Article  Google Scholar 

  82. Leung, H.C., Yiu, S.M., Yang, B., Peng, Y., Wang, Y., Liu, Z., Chen, J., Qin, J., Li, R., Chin, F.Y.: A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics 27, 1489–1495 (2011)

    Article  Google Scholar 

  83. Lewis, S.E., Searle, S.M., Harris, N., Gibson, M., Lyer, V., Richter, J., Wiel, C., Bayraktaroglir, L., Birney, E., Crosby, M.A.: Apollo: a sequence annotation editor. Genome Biol. 3, research0082 (2002)

    Google Scholar 

  84. Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., et al.: The sequence and De Novo assembly of the giant panda genome. Nature 463, 311–317 (2010)

    Article  Google Scholar 

  85. Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G., Kristiansen, K., Li, S., Yang, H., Wang, J., Wang, J.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010)

    Article  Google Scholar 

  86. Li, Z., Zhang, Z., Yan, P., Huang, S., Fei, Z., Lin, K.: RNA-Seq improves annotation of protein-coding genes in the cucumber genome. BMC Genomics 12, 540 (2011)

    Article  Google Scholar 

  87. Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X., Liu, B., Yang, B., Fan, W.: Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-brujin-graph. Brief. Funct. Genomics 11, 25–37 (2012)

    Article  Google Scholar 

  88. Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe, D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas III, E.J., Zody, M.C.: Genome sequence, comparative, analysis and haplotype structure of the domestic dog. Nature 438, 803–819 (2005)

    Article  Google Scholar 

  89. Liu, B., Gibbons, T., Ghodsi, M., Treangen, T., Pop, M.: Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 12 (Suppl 2), S4 (2011)

    Google Scholar 

  90. Liu, Q., Mackey, A.J., Roos, D.S., Pereira, F.C.N.: Evigan: a hidden variable model for integrating gene evidence for eukaryotic gene prediction. Bioinformatics 24, 597–605 (2008)

    Article  MATH  Google Scholar 

  91. Loftus, B.J., Fung, E., Roncaglia, P., Rowley, D., Amedeo, P., Bruno, D., Vamathevan, J., Miranda, M., Anderson, I.J., Fraser, J.A., et al.: The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science 307, 1321–1324 (2005)

    Article  Google Scholar 

  92. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y.O., Borodovsky, M.: Gene identification in novel eukaryotic genomes by self-traning algorithm. Nucleic Acids Res. 33, 6494–6506 (2005)

    Article  Google Scholar 

  93. Lorenz, P., Eck, J.: Metagenomics and industrial applications. Nat. Rev. Microbiol. 3, 510–516 (2005)

    Article  Google Scholar 

  94. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.-J., Chen, Z., et al.: Genome Sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–380 (2005)

    Google Scholar 

  95. Maxam, A.M., Gilbert, W.: A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74, 560–564 (1977)

    Article  Google Scholar 

  96. McCallum, D., Smith, M.: Computer processing of DNA sequence data. J. Mol. Biol. 116, 29–30 (1977)

    Article  Google Scholar 

  97. McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P., Rigoutsos, I.: Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4, 63–72 (2007)

    Article  Google Scholar 

  98. Miller, J.R., Delcher, A.L., Koren, S., Venter, E., Walenz, B.P., Brownley, A., Johnson, J., Li, K., Mobarry, C., Sutton, G.: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics 24, 2818–2824 (2008)

    Article  Google Scholar 

  99. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010)

    Article  Google Scholar 

  100. Monzoorul Haque, M., Ghosh, T.S., Komanduri, D., Mande, S.S.: SOrt-ITEMS: sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 25, 1722–1730 (2009)

    Article  Google Scholar 

  101. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., Wold, B.: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621–628 (2008)

    Article  Google Scholar 

  102. Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002)

    Google Scholar 

  103. Mulder, N., Apweiler, R.: InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol. Biol. 396, 59–70 (2007)

    Article  Google Scholar 

  104. Mullikin, J.C., Ning, Z.: The Phusion assembler. Genome Res. 13, 81–90 (2003)

    Article  Google Scholar 

  105. Myers, E.W.: The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005)

    Google Scholar 

  106. Myers, E.W., Sutton, C.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al.: A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000)

    Article  Google Scholar 

  107. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., Snyder, M.: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320, 1344–1349 (2008)

    Article  Google Scholar 

  108. Namiki, T., Hachiya, T., Tanaka, H., Sakakibara, Y.: MetaVelvet: an extension of Velvet assembler to De Novo metagenome assembly from short sequence reads. Nucleic Acids Res. 40, e155 (2012)

    Article  Google Scholar 

  109. Nene, V., Wortman, J.R., Lawson, D., Haas, B., Kodira, C., Tu, Z.J., Loftus, B., Xi, Z., Megy, K., Grabherr, M., et al.: Genome sequence of Aedes aegypti, a major arbovirus vector. Science 316, 1718–1723 (2007)

    Article  Google Scholar 

  110. Noguchi, H., Taniguchi, T., Itoh, T.: MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res. 15, 387–396 (2008)

    Article  Google Scholar 

  111. Nygaard, S., Zhang, G., Schiott, M., Li, C., Wurm, Y., Hu, H., Zhou, J., Ji, L., Qiu, F., Rasmussen, M., et al.: The genome of the leaf-cutting ant Acromyrmex echinatior suggests key adaptations to advanced social life and fungus farming. Genome Res. 21, 1339–1348 (2011)

    Article  Google Scholar 

  112. Overbeek, R., Begley, T., Butler, R.M., Choudhuri, J.V., Chuang, H.Y., Cohoon, M., de Crecy-Lagard, V., Diaz, N., Disz, T., Edwards, R., et al.: The subsystems approach to genome annoation and its use in the project project to annotate 1000 genomes. Nucleic Acids Res. 33, 5691–5702 (2005)

    Article  Google Scholar 

  113. Pagani, I., Liolios, K., Jansson, J., Chen, I.A., Smirnova, T., Nosrat, B., Markowitz, V.M., Kyrpides, N.C.: The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 40, D571–D579 (2011)

    Google Scholar 

  114. Park, P.J.: ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009)

    Article  Google Scholar 

  115. Parra, G., Bradnam, K., Korf, I.: CEGMA: A pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007)

    Article  Google Scholar 

  116. Parra, G., Bradnam, K., Korf, I.: Assessing the gene space in draft genomes. Nucleic Acids Res. 37, 289–297 (2009)

    Article  Google Scholar 

  117. Paszkiewicz, K., Studholme, D.J.: De Novo assembly of short sequence reads. Brief. Bioinform. 11, 457–472 (2010)

    Article  Google Scholar 

  118. Peng, Y., Leung, H.C., Yiu, S.M., Chin, F.Y.: Meta-IDBA: a De Novo assembler for metagenomic data. Bioinformatics 27, i94–101 (2011)

    Article  Google Scholar 

  119. Petrosino, J.F., Highlander, S., Luna, R.A., Gibbs, R.A., Versalovic, J.: Metagenomic pyrosequencing and microbial identification. Clin. Chem. 55, 856–866 (2009)

    Article  Google Scholar 

  120. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98, 9748–9753 (2001)

    Article  MATH  MathSciNet  Google Scholar 

  121. Pevzner, P.A., Tang, H., Tesler, G.: De Novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004)

    Article  Google Scholar 

  122. Pop, M., Phillippy, A., Delcher, A.L., Salzberg, S.L.: Comparative genome assembly. Brief. Bioinform. 5, 237–248 (2004)

    Article  Google Scholar 

  123. Pushkarev, D., Neff, N.F., Quake, S.R.: Single-molecule sequencing of an individual human genome. Nat. Biotechnol. 27, 847–850 (2009)

    Article  Google Scholar 

  124. Rat Genome Sequencing Project Consortium: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493–521 (2004)

    Google Scholar 

  125. Rhesus Macaque Genome Sequencing and Analysis Consortium: Evolutionary and biomedical insights from the rhesus macaque genome. Science 316, 222–234 (2007)

    Google Scholar 

  126. Rho, M., Tang, H., Ye, Y.: FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 38, e191 (2010)

    Article  Google Scholar 

  127. Rondon, M.R., August, P.R., Betterman, A.D., Brady, S.F., Grossman, T.H., Liles, M.R., Loiacono, K.A., Lynch, B.A., MacNeil, I.A., Minor, C., Tiong, C.L., Gilman, M., Osburne, M.S., Clardy, J., Handelsman, J., Goodman, R.M.: Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl. Environ. Microbiol. 66, 2541–2547 (2000)

    Article  Google Scholar 

  128. Rosen, G.L., Reichenberger, E.R., Rosenfeld, A.M.: NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27, 127–129 (2011)

    Article  Google Scholar 

  129. Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, J., Mileski, W., Davey, M., Leamon, J.H., Johnson, K., Milgrew, M.J., Edwards, M., et al.: An integrated semiconductor device enabling non-optical genome sequencing. Nature 475, 348–352 (2011)

    Article  Google Scholar 

  130. Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., Barrell, B.: Artemis: sequence visualization and annotation. Bioinformatics 16, 944–945 (2000)

    Article  Google Scholar 

  131. Salamov, A.A., Solovyev, V.V.: Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000)

    Article  Google Scholar 

  132. Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., Marcais, G., Pop, M., Yorke, J.A.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012)

    Article  Google Scholar 

  133. Sanger, F., Air, G.M., Barrell, B.G., Brown, N.L., Coulson, A.R., Fiddes, C.A., Hutchison, C.A., Slocombe, P.M., Smith, M.: Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265, 687–695 (1977)

    Google Scholar 

  134. Sanger, F., Coulson, A.R.: A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol. 94, 441–448 (1975)

    Article  Google Scholar 

  135. Sanger, F., Niclen, S., Coulson, A.R.: DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463–5467 (1977)

    Article  Google Scholar 

  136. Sato, T., Terabe, M., Watanabe, H., Gojobori, T., Hori-Takemoto, C., Miura, K.: Codon and base biases after the initiation codon of the open reading frames in the Escherichia coli genome and their influence on the translation efficiency. J. Biochem. 129, 851–860 (2001)

    Article  Google Scholar 

  137. Sayers, E.W., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., DiCuccio, M., Edgar, R., et al.: Database resources of the national center for biotechnology information. Nucleic Acids Res. 37, D5–D15 (2009)

    Article  Google Scholar 

  138. Schadt, E.E., Turner, S., Kasarskis, A.: A window into third-generation sequencing. Hum. Mol. Genet. 19, R227–R240 (2010)

    Article  Google Scholar 

  139. Schloss, J.A.: How to get genomes at one ten-thousandth the cost. Nat. Biotechnol. 26, 1113–1115 (2008)

    Article  Google Scholar 

  140. Shendure, J., Porreca, G.J., Reppas, N.B., Lin, X., McCutcheon, J.P., Rosenbaum, A.M., Wang, M.D., Zhang, K., Mitra, R.D., Church, G.M.: Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309, 1728–1732 (2005)

    Article  Google Scholar 

  141. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009)

    Article  Google Scholar 

  142. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012)

    Article  Google Scholar 

  143. Skinner, M.E., Uzilov, A.V., Stein, L.D., Mungall, C.J., Holmes, I.H.: JBrowse: a next-generation genome browser. Genome Res. 19, 1630–1638 (2009)

  144. Slater, G.S., Birney, E.: Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 6, 31 (2005)

  145. Smit, A.F.A., Hubley, R., Green, P.: RepeatMasker at http://www.repeatmasker.org

  146. Smith, H.O., Wilcox, K.W.: A restriction enzyme from Hemophilus influenzae. I. Purification and general properties. J. Mol. Biol. 51, 379–391 (1970)

  147. Smith, L.M., Sanders, J.Z., Kaiser, R.J., Hughes, P., Dodd, C., Connell, C.R., Heiner, C., Kent, S.B., Hood, L.E.: Fluorescence detection in automated DNA sequence analysis. Nature 321, 674–679 (1986)

  148. Smith, C.D., Edgar, R.C., Yandell, M.D., Smith, D.R., Celniker, S.E., Myers, E.W., Karpen, G.H.: Improved repeat identification and masking in Dipterans. Gene 389, 1–9 (2007)

  149. Smith, C.C., Zimin, A., Holt, C., Abouheif, E., Benton, R., Cash, E., Croset, V., Currie, C.R., Elhaik, E., Elsik, C.G., et al.: Draft genome of the globally widespread and invasive Argentine ant (Linepithema humile). Proc. Natl. Acad. Sci. USA 108, 5673–5678 (2011)

  150. Staden, R.: Sequence data handling by computer. Nucleic Acids Res. 4, 4037–4051 (1977)

  151. Staden, R., Beal, K.F., Bonfield, J.K.: The Staden package, 1998. Methods Mol. Biol. 132, 115–130 (2000)

  152. Stanke, M., Waack, S.: Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215–ii225 (2003)

  153. Stanke, M., Steinkamp, R., Waack, S., Morgenstern, B.: AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004)

  154. Suen, G., Teiling, C., Li, L., Holt, C., Abouheif, E., Bornberg-Bauer, E., Bouffard, P., Caldera, E.J., Cash, E., Cavanaugh, A., et al.: The genome sequence of the leaf-cutter ant Atta cephalotes reveals insights into its obligate symbiotic lifestyle. PLoS Genet. 7, e1002007 (2011)

  155. The Bovine Genome Sequencing and Analysis Consortium: The genome sequence of taurine cattle: a window to ruminant biology and evolution. Science 324, 522–528 (2009)

  156. The Generic Model Organism Database. http://www.gmod.org

  157. The Reference Genome Group of the Gene Ontology Consortium: The gene ontology’s reference genome project: a unified framework for functional annotation across species. PLoS Comput. Biol. 5, e1000431 (2009)

  158. The Rice Genome Project: A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002)

  159. The UniProt Consortium: The universal protein resource (UniProt) 2009. Nucleic Acids Res. 37, D169–D174 (2009)

  160. The University of California, Santa Cruz (UCSC) Genome Browser: http://genome.ucsc.edu

  161. The C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012–2018 (1998)

  162. Trapnell, C., Pachter, L., Salzberg, S.L.: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009)

  163. Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., Pachter, L.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010)

  164. Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13, 36–46 (2011)

  165. Tyson, G.W., Chapman, J., Hugenholtz, P., Allen, E.E., Ram, R.J., Richardson, P.M., Solovyev, V.V., Rubin, E.M., Rokhsar, D.S., Banfield, J.F.: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004)

  166. Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K., Malek, J.A., Costa, G., McKernan, K., Sidow, A., Fire, A., Johnson, S.M.: A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res. 18, 1051–1063 (2008)

  167. van Dijk, E.L., Auger, H., Jaszczyszyn, Y., Thermes, C.: Ten years of next-generation sequencing technology. Trends Genet. 30, 418–426 (2014)

  168. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al.: The sequence of the human genome. Science 291, 1304–1351 (2001)

  169. Venter, J.C., Remington, K., Heidelberg, J.F., Halpern, A.L., Rusch, D., Eisen, J.A., Wu, D., Paulsen, I., Nelson, K.E., Nelson, W., et al.: Environmental genome sequencing of the Sargasso Sea. Science 304, 66–74 (2004)

  170. Wang, J., Wong, G.K., Ni, P., Han, Y., Huang, X., Zhang, J., Ye, C., Zhang, Y., Hu, J., Zhang, K., et al.: RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Res. 12, 821–831 (2002)

  171. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009)

  172. Warren, R.L., Sutton, G.G., Jones, S.J., Holt, R.A.: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23, 500–501 (2007)

  173. Watson, J.D., Crick, F.H.C.: Molecular structure of nucleic acids. Nature 171, 737–738 (1953)

  174. Whiteford, N., Haslam, N., Weber, G., Prügel-Bennett, A., Essex, J.W., Roach, P.L., Bradley, M., Neylon, C.: An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 33, e171 (2005)

  175. Wold, B., Myers, R.M.: Sequence census methods for functional genomics. Nat. Methods 5, 19–21 (2008)

  176. Worley, K.C., Gibbs, R.A.: Genetics: decoding a national treasure. Nature 463, 303–304 (2010)

  177. Wu, R., Kaiser, A.D.: Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. J. Mol. Biol. 35, 523–537 (1968)

  178. Wu, R., Taylor, E.: Nucleotide sequence analysis of DNA. II. Complete nucleotide sequence of the cohesive ends of bacteriophage lambda DNA. J. Mol. Biol. 57, 491–511 (1971)

  179. Wu, T.D., Nacu, S.: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010)

  180. Yandell, M., Ence, D.: A beginner’s guide to eukaryotic genome annotation. Nat. Rev. Genet. 13, 329–342 (2012)

  181. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008)

  182. Zhang, W., Chen, J., Yang, Y., Tang, Y., Shang, J., Shen, B.: A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One 6, e17915 (2011)

Author information

Correspondence to Marina Axelson-Fisk.

Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Axelson-Fisk, M. (2015). Annotation Pipelines for Next-Generation Sequencing Projects. In: Comparative Gene Finding. Computational Biology, vol 20. Springer, London. https://doi.org/10.1007/978-1-4471-6693-1_8

  • DOI: https://doi.org/10.1007/978-1-4471-6693-1_8

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-6692-4

  • Online ISBN: 978-1-4471-6693-1

  • eBook Packages: Computer Science, Computer Science (R0)
