Genome Mapping Statistics and Bioinformatics

Mychaleckyj, Josyf C.

doi:10.1007/978-1-59745-530-5_22

Part of the book series: Methods in Molecular Biology™ ((MIMB,volume 404))

Abstract

The unprecedented availability of genome sequences, coupled with user-friendly, web-enabled search and analysis tools, allows practitioners to locate interesting genome features or sequence tracts with relative ease. Although many public model organism- and genome-mapping resources offer pre-mapped genome browsing, biologists still need to perform de novo mapping analyses. Correct interpretation of the results in genome annotation databases, or of one’s own analyses, requires at least a conceptual understanding of the statistics and mechanics of genome searches, the results expected from statistical considerations, and the algorithms used by different search tools. This chapter introduces the basic statistical results that underlie the mapping of nucleotide sequences to genomes and briefly surveys the common programs and algorithms used to perform genome mapping, all of which are available via publicly hosted web sites. Selecting the appropriate sequence search and mapping tool often demands tradeoffs between sensitivity and specificity that relate to the statistics of the search.
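
As a rough illustration of the kind of statistical expectation the abstract refers to, the sketch below estimates how many exact matches to a query of a given length would arise purely by chance in a genome. It rests on a simplified model that is an assumption here, not a result quoted from the chapter: bases are independent and equally frequent (probability 1/4 each), so the expected count is the number of start positions times 4^-k, doubled when both strands are searched. The function names are hypothetical, and real genomes, with biased composition and repeats, will deviate from these figures.

```python
from typing import Optional


def expected_chance_matches(query_len: int, genome_len: int,
                            both_strands: bool = True) -> float:
    """Expected number of exact chance matches of a query in a random genome.

    Assumes independent, equiprobable bases (an illustrative simplification).
    """
    if query_len <= 0 or genome_len < query_len:
        return 0.0
    positions = genome_len - query_len + 1   # possible start positions
    per_position = 0.25 ** query_len         # chance that one position matches
    strands = 2 if both_strands else 1       # forward only, or forward + reverse
    return strands * positions * per_position


def min_unique_length(genome_len: int,
                      both_strands: bool = True) -> Optional[int]:
    """Smallest query length whose expected chance-match count falls below 1."""
    for k in range(1, genome_len + 1):
        if expected_chance_matches(k, genome_len, both_strands) < 1.0:
            return k
    return None


if __name__ == "__main__":
    human_like = 3_000_000_000  # approximate size, used only as an example
    for k in (12, 16, 20, 25):
        print(k, expected_chance_matches(k, human_like))
    print("shortest length expected to be unique:", min_unique_length(human_like))
```

Under this model, larger genomes require longer queries before a chance match becomes unlikely, which is one source of the sensitivity and specificity tradeoffs mentioned above.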

Notes

1. There are 2,469 genomes as of April 2, 2007, according to the Genomes Online Database (http://www.genomesonline.org).
2. Two hundred thirty-four euchromatic sequence gaps, 64 non-euchromatic gaps. Also see Ref. 3.
3. According to the GenBank file annotation, the first 100 bases of the transcript are in the 5′ untranslated region (5′ UTR) of the transcript.
4. Unlike English text palindromes, such as the oft-recited “Madam I’m Adam.” Reverse complement palindromic sequences are sometimes referred to as containing tandem inverted repeats.
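
The distinction drawn in the last note can be sketched in a few lines of code. A text palindrome reads the same when its characters are reversed, whereas a reverse complement palindrome equals the reverse of its own base-paired complement. The helper names below are hypothetical, and the EcoRI recognition site GAATTC is used only as a familiar example.

```python
# Complement table for unambiguous DNA bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_text_palindrome(text: str) -> bool:
    """True if the letters of the text read the same in both directions."""
    letters = [c.lower() for c in text if c.isalpha()]
    return len(letters) > 0 and letters == letters[::-1]


def is_reverse_complement_palindrome(seq: str) -> bool:
    """True if the sequence is identical to its own reverse complement."""
    s = seq.upper()
    return len(s) > 0 and s == reverse_complement(s)


if __name__ == "__main__":
    print(is_text_palindrome("Madam I'm Adam"))          # text palindrome
    print(is_reverse_complement_palindrome("GAATTC"))    # EcoRI site
    print(is_reverse_complement_palindrome("GATTAG"))    # text palindrome only
```

Note that a sequence such as GATTAG is a palindrome in the English-text sense but not in the reverse complement sense, since its reverse complement is CTAATC.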

References

1. Waterman, M. S. (1995) Introduction to Computational Biology. London, Chapman & Hall.
2. Ewens, W. J., and Grant, G. R. (2001) Statistical Methods in Bioinformatics. New York, Springer-Verlag.
3. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431, 931–945.
4. Kent, W. J. (2002) BLAT-the BLAST-like alignment tool. Genome Res. 12, 656–664.
5. Shine, J., and Dalgarno, L. (1974) The 3′-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. U. S. A. 71, 1342–1346.
6. Forsdyke, D. R., and Mortimer, J. R. (2000) Chargaff’s legacy. Gene 261, 127–137.
7. Prabhu, V. V. (1993) Symmetry observations in long nucleotide sequences. Nucleic Acids Res. 21, 2797–2800.
8. Qi, D., and Cuticchia, A. J. (2001) Compositional symmetries in complete genomes. Bioinformatics 17, 557–559.
9. Zimmermann, K., Schogl, D., and Mannhalter, J. W. (1998) Digestion of terminal restriction endonuclease recognition sites on PCR products. Biotechniques 24, 582–584.
10. Ma, J., Campbell, A., and Karlin, S. (2002) Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J. Bacteriol. 184, 5733–5745.
11. van Helden, J., Rios, A. F., and Collado-Vides, J. (2000) Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28, 1808–1818.
12. Klock, G., Strahle, U., and Schutz, G. (1987) Oestrogen and glucocorticoid responsive elements are closely related but distinct. Nature 329, 734–736.
13. Klinge, C. M. (2001) Estrogen receptor interaction with estrogen response elements. Nucleic Acids Res. 29, 2905–2919.
14. van Helden, J., Andre, B., and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281, 827–842.
15. Needleman, S. B., and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.
16. Smith, T. F., and Waterman, M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
17. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
18. Weber, J. L., David, D., Heil, J., Fan, Y., Zhao, C., and Marth, G. (2002) Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet. 71, 854–862.
19. Altschul, S. F., and Karlin, S. (1990) Methods for assessing the statistical significance of molecular sequences by using general scoring schemes. Proc. Natl. Acad. Sci. U. S. A. 87, 2264–2268.
20. Karlin, S., and Altschul, S. F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. U. S. A. 90, 5873–5877.
21. Korf, I., Yandell, M., and Bedell, B. (2003) BLAST. Sebastopol, O’Reilly & Associates.
22. Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.
23. Ning, Z., Cox, A. J., and Mullikin, J. C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725–1729.
24. Bensen, J. T., Dawson, P. A., Mychaleckyj, J. C., and Bowden, D. W. (2001) Identification of a novel human cytokine gene in the interleukin gene cluster on chromosome 2q12–14. J. Interferon Cytokine Res. 21, 899–904.

Author information

Authors and Affiliations

Center for Public Health Genomics, University of Virginia, Charlottesville, VA
Josyf C. Mychaleckyj PhD

Editor information

Editors and Affiliations

Department of Biostatistical Sciences, Wake Forest University Health Sciences, Winston-Salem, NC
Walter T. Ambrosius

About this protocol

Cite this protocol

Mychaleckyj, J.C. (2007). Genome Mapping Statistics and Bioinformatics. In: Ambrosius, W.T. (eds) Topics in Biostatistics. Methods in Molecular Biology™, vol 404. Humana Press. https://doi.org/10.1007/978-1-59745-530-5_22

Publisher Name: Humana Press
Print ISBN: 978-1-58829-531-6
Online ISBN: 978-1-59745-530-5
eBook Packages: Springer Protocols
