
Genome Mapping Statistics and Bioinformatics

Topics in Biostatistics

Part of the book series: Methods in Molecular Biology™ (MIMB, volume 404)

Abstract

The unprecedented availability of genome sequences, coupled with user-friendly, web-enabled search and analysis tools, allows practitioners to locate interesting genome features or sequence tracts with relative ease. Although many public model organism and genome-mapping resources offer pre-mapped genome browsing, biologists still need to perform de novo mapping analyses. Correct interpretation of the results in genome annotation databases, or of one’s own analyses, requires at least a conceptual understanding of the statistics and mechanics of genome searches, the results expected from statistical considerations, and the algorithms used by different search tools. This chapter introduces the basic statistical results that underlie the mapping of nucleotide sequences to genomes and briefly surveys the common programs and algorithms used to perform genome mapping, all of which are available via publicly hosted web sites. Selecting the appropriate sequence search and mapping tool often demands tradeoffs between sensitivity and specificity that relate to the statistics of the search.
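
As a rough illustration of the kind of statistical expectation the abstract refers to, the sketch below estimates how many exact matches a short query would produce by chance in a random genome. It is not taken from the chapter: the function name `expected_hits` is hypothetical, and the model assumes independent, equiprobable bases (each with probability 1/4), which real genomes do not satisfy.

```python
def expected_hits(query_len: int, genome_len: int, both_strands: bool = True) -> float:
    """Expected number of chance exact matches of a query in a random genome.

    Assumption (not from the chapter): bases are independent and
    equiprobable, so any given position matches with probability (1/4)**k.
    """
    if query_len <= 0 or genome_len < query_len:
        return 0.0
    # Number of start positions where a query of this length can be placed.
    positions = genome_len - query_len + 1
    # Searching both strands doubles the number of places a match can occur.
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** query_len


if __name__ == "__main__":
    genome_len = 3_000_000_000  # illustrative size, roughly human-genome scale
    for k in (10, 16, 20, 25):
        print(f"k={k:2d}  expected chance hits: {expected_hits(k, genome_len):.3g}")
```

Under this simplified model, short queries are expected to match many times by chance, whereas longer queries are expected to match rarely, which is one way to see why query length affects the specificity of a search.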


Notes

  1. As of April 2, 2007, the Genomes Online Database (http://www.genomesonline.org) listed 2,469 genomes.

  2. There are 234 euchromatic sequence gaps and 64 non-euchromatic gaps. See also Ref. 3.

  3. According to the GenBank file annotation, the first 100 bases of the transcript lie in its 5′ untranslated region (5′ UTR).

  4. These differ from English text palindromes, such as the oft-recited “Madam I’m Adam.” Reverse complement palindromic sequences are sometimes referred to as containing tandem inverted repeats.
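
The distinction drawn in Note 4 can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `reverse_complement` and `is_revcomp_palindrome` are hypothetical, and the code assumes sequences contain only the unambiguous bases A, C, G, and T.

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_revcomp_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    # GAATTC reads the same on both strands in the 5'-to-3' direction.
    print(is_revcomp_palindrome("GAATTC"))
    # ACGCA is a text palindrome (same letters backwards) but not a
    # reverse complement palindrome.
    print(is_revcomp_palindrome("ACGCA"))
```

A sequence such as ACGCA reads the same backwards as text, like “Madam I’m Adam,” yet it differs from its reverse complement (TGCGT), so it is not a palindrome in the sense used for nucleotide sequences.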

References

  1. Waterman, M. S. (1995) Introduction to Computational Biology. London, Chapman & Hall.

  2. Ewens, W. J., and Grant, G. R. (2001) Statistical Methods in Bioinformatics. New York, Springer-Verlag.

  3. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431, 931–945.

  4. Kent, W. J. (2002) BLAT-the BLAST-like alignment tool. Genome Res. 12, 656–664.

  5. Shine, J., and Dalgarno, L. (1974) The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. U. S. A. 71, 1342–1346.

  6. Forsdyke, D. R., and Mortimer, J. R. (2000) Chargaff’s legacy. Gene 261, 127–137.

  7. Prabhu, V. V. (1993) Symmetry observations in long nucleotide sequences. Nucleic Acids Res. 21, 2797–2800.

  8. Qi, D., and Cuticchia, A. J. (2001) Compositional symmetries in complete genomes. Bioinformatics 17, 557–559.

  9. Zimmermann, K., Schogl, D., and Mannhalter, J. W. (1998) Digestion of terminal restriction endonuclease recognition sites on PCR products. Biotechniques 24, 582–584.

  10. Ma, J., Campbell, A., and Karlin, S. (2002) Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J. Bacteriol. 184, 5733–5745.

  11. van Helden, J., Rios, A. F., and Collado-Vides, J. (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28, 1808–1818.

  12. Klock, G., Strahle, U., and Schutz, G. (1987) Oestrogen and glucocorticoid responsive elements are closely related but distinct. Nature 329, 734–736.

  13. Klinge, C. M. (2001) Estrogen receptor interaction with estrogen response elements. Nucleic Acids Res. 29, 2905–2919.

  14. van Helden, J., Andre, B., and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281, 827–842.

  15. Needleman, S. B., and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.

  16. Smith, T. F., and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.

  17. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.

  18. Weber, J. L., David, D., Heil, J., Fan, Y., Zhao, C., and Marth, G. (2002) Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. 71, 854–862.

  19. Karlin, S., and Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U. S. A. 87, 2264–2268.

  20. Karlin, S., and Altschul, S. F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U. S. A. 90, 5873–5877.

  21. Korf, I., Yandell, M., and Bedell, J. (2003) BLAST. Sebastopol, O’Reilly & Associates.

  22. Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.

  23. Ning, Z., Cox, A. J., and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725–1729.

  24. Bensen, J. T., Dawson, P. A., Mychaleckyj, J. C., and Bowden, D. W. (2001) Identification of a novel human cytokine gene in the interleukin gene cluster on chromosome 2q12–14. J. Interferon Cytokine Res. 21, 899–904.

Copyright information

© 2007 Humana Press Inc., Totowa, NJ

About this protocol

Cite this protocol

Mychaleckyj, J.C. (2007). Genome Mapping Statistics and Bioinformatics. In: Ambrosius, W.T. (ed.) Topics in Biostatistics. Methods in Molecular Biology™, vol 404. Humana Press. https://doi.org/10.1007/978-1-59745-530-5_22

  • DOI: https://doi.org/10.1007/978-1-59745-530-5_22

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-58829-531-6

  • Online ISBN: 978-1-59745-530-5

  • eBook Packages: Springer Protocols
