Abstract
The unprecedented availability of genome sequences, coupled with user-friendly, web-enabled search and analysis tools, allows practitioners to locate interesting genome features or sequence tracts with relative ease. Although many public model-organism and genome-mapping resources offer pre-mapped genome browsing, biologists still need to perform de novo mapping analyses. Correct interpretation of the results in genome annotation databases, or of one’s own analyses, requires at least a conceptual understanding of the statistics and mechanics of genome searches, the results expected from statistical considerations, and the algorithms used by different search tools. This chapter introduces the basic statistical results that underlie the mapping of nucleotide sequences to genomes and briefly surveys the common programs and algorithms used to perform genome mapping, all of which are available through publicly hosted web sites. Selecting the appropriate sequence search and mapping tool often demands tradeoffs between sensitivity and specificity that follow from the statistics of the search.
Notes
1. There are 2,469 genomes as of April 2, 2007, according to the Genomes Online Database (http://www.genomesonline.org).
2. Two hundred thirty-four euchromatic sequence gaps, 64 non-euchromatic gaps. Also see Ref. 3.
3. According to the GenBank file annotation, the first 100 bases of the transcript are in the 5′ untranslated region (5′ UTR) of the transcript.
4. Unlike English text palindromes, such as the oft-recited “Madam I’m Adam.” Reverse complement palindromic sequences are sometimes referred to as containing tandem inverted repeats.
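The distinction drawn in note 4 can be made concrete with a minimal Python sketch. It is an illustration only, not code from the chapter: the function names are hypothetical, and the sketch assumes sequences contain only the four unambiguous bases A, C, G, and T. A reverse complement palindrome reads the same on both strands in the 5′ to 3′ direction, whereas a text palindrome reads the same forwards and backwards on a single string.

```python
# Complement table for the four unambiguous DNA bases
# (assumption: no IUPAC ambiguity codes such as N or R appear in the input).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_revcomp_palindrome(seq: str) -> bool:
    """True if the sequence is identical to its own reverse complement."""
    s = seq.upper()
    return s == reverse_complement(s)


def is_text_palindrome(text: str) -> bool:
    """True if the letters of the text read the same in both directions,
    ignoring case, spaces, and punctuation."""
    letters = [c.lower() for c in text if c.isalpha()]
    return letters == letters[::-1]
```

For example, `is_revcomp_palindrome("GAATTC")` holds for the EcoRI recognition site, while `is_text_palindrome("GAATTC")` does not; conversely, the phrase in note 4 is a text palindrome but has no meaning as a nucleotide sequence.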
References
Waterman, M. S. (1995) Introduction to Computational Biology. London, Chapman & Hall.
Ewens, W. J., and Grant, G. R. (2001) Statistical Methods in Bioinformatics. New York, Springer-Verlag.
International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431, 931–945.
Kent, W. J. (2002) BLAT-the BLAST-like alignment tool. Genome Res. 12, 656–664.
Shine, J., and Dalgarno, L. (1974) The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. U. S. A. 71, 1342–1346.
Forsdyke, D. R., and Mortimer, J. R. (2000) Chargaff’s legacy. Gene 261, 127–137.
Prabhu, V. V. (1993) Symmetry observations in long nucleotide sequences. Nucleic Acids Res. 21, 2797–2800.
Qi, D., and Cuticchia, A. J. (2001) Compositional symmetries in complete genomes. Bioinformatics 17, 557–559.
Zimmermann, K., Schogl, D., and Mannhalter, J. W. (1998) Digestion of terminal restriction endonuclease recognition sites on PCR products. Biotechniques 24, 582–584.
Ma, J., Campbell, A., and Karlin, S. (2002) Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J. Bacteriol. 184, 5733–5745.
van Helden, J., Rios, A. F., and Collado-Vides, J. (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28, 1808–1818.
Klock, G., Strahle, U., and Schutz, G. (1987) Oestrogen and glucocorticoid responsive elements are closely related but distinct. Nature 329, 734–736.
Klinge, C. M. (2001) Estrogen receptor interaction with estrogen response elements. Nucleic Acids Res. 29, 2905–2919.
van Helden, J., Andre, B., and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281, 827–842.
Needleman, S. B., and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
Smith, T. F., and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Weber, J. L., David, D., Heil, J., Fan, Y., Zhao, C., and Marth, G. (2002) Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. 71, 854–862.
Altschul, S. F., and Karlin, S. (1990) Methods for assessing the statistical significance of molecular sequences by using general scoring schemes. Proc. Natl. Acad. Sci. U. S. A. 87, 2264–2268.
Karlin, S., and Altschul, S. F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U. S. A. 90, 5873–5877.
Korf, I., Yandell, M., and Bedell, B. (2003) BLAST. Sebastopol, O’Reilly & Associates.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.
Ning, Z., Cox, A. J., and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725–1729.
Bensen, J. T., Dawson, P. A., Mychaleckyj, J. C., and Bowden, D. W. (2001) Identification of a novel human cytokine gene in the interleukin gene cluster on chromosome 2q12–14. J. Interferon Cytokine Res. 21, 899–904.
Copyright information
© 2007 Humana Press Inc., Totowa, NJ
About this protocol
Cite this protocol
Mychaleckyj, J.C. (2007). Genome Mapping Statistics and Bioinformatics. In: Ambrosius, W.T. (ed.) Topics in Biostatistics. Methods in Molecular Biology™, vol 404. Humana Press. https://doi.org/10.1007/978-1-59745-530-5_22
DOI: https://doi.org/10.1007/978-1-59745-530-5_22
Publisher Name: Humana Press
Print ISBN: 978-1-58829-531-6
Online ISBN: 978-1-59745-530-5
eBook Packages: Springer Protocols