Biological Sequence Analysis

Cawley, Simon E.

doi:10.1007/978-1-4614-1347-9_14

Part of the book series: Selected Works in Probability and Statistics (SWPS)

Abstract

Shortly after the start of my graduate studies at the U.C. Berkeley Statistics department in 1995, I had the good fortune to meet Terry and learn about some of his work applying statistics to genetics and molecular biology. Not having thought about biology since high school, I was very impressed by the large impact statistical approaches were making in a field I had naively considered to have little to do with quantitative analysis. I eagerly dove into a collaboration that Terry had put in place with the Human and Drosophila Genome Projects at Lawrence Berkeley National Laboratory and spent the next few years having a great time working on interesting and practical statistical problems that arose in the context of the ongoing genome sequencing efforts.

In this section we present some of Terry’s contributions in the area of Sequence Analysis – generally speaking, the area of analysis of biological sequences such as DNA or protein sequences. The papers presented here relate to the interpretation of DNA sequences.

DNA sequence analysis has been an area of growing importance since DNA sequencing techniques started to emerge in the early 1970s. The chain-terminator method developed by Frederick Sanger at the University of Cambridge [7] marked a pivotal moment, enabling the first rapid scaling up of DNA sequencing capabilities. The rate of sequencing was further accelerated through the 1980s and 1990s as ever-greater levels of automation were brought to bear on Sanger's original concept.

As the level of automation increased, it became possible to sequence entire genomes of successively more complex organisms with larger genomes, ranging from bacteriophage phiX174 in the late 1970s, through various microbial genomes in the mid-1990s, to the draft of the human genome sequence published in 2001. The Sanger method showed remarkable longevity and was at the core of the vast majority of sequencing efforts through to the early 2000s.

The dominance of Sanger sequencing finally ended in the early 2000s with a renaissance of sorts, as multiple new massively parallel technologies emerged: 454 pyrosequencing, followed soon after by Solexa (Illumina), SOLiD, polony, DNA nanoball and Ion Torrent sequencing.

As DNA sequencing technologies scaled up, huge opportunities arose along the way for the application of statistics, both in the area of analysis of the signals generated from each of the various instruments and technologies to improve DNA sequencing accuracy (the subject of Chapter 13), and in the downstream analysis of the DNA sequence collected. In particular, as the volumes of sequence generated started to exceed what an expert molecular biologist could manually browse and interpret, it became crucial to develop statistical models for assembling and interpreting the sequences.

The papers presented in this chapter cover two important areas in the interpretation of DNA sequences. The first, Cawley et al. [3], addresses the problem of analyzing stretches of DNA to search for the collections of sub-sequences that correspond to gene transcripts. The model presented was not the first of its kind; similar Hidden Markov Models (HMMs) had been published before [2, 4, 5]. Its novel contributions were various observations about computational shortcuts that can be made, at no cost to accuracy, taking advantage of some of the structure of the problem of applying HMMs to gene finding. This paper was also the first instance where the probabilistic formulation of the HMM gene finder was used to derive posterior probabilities of bases being part of the gene; previous attempts focused exclusively on the use of the Viterbi algorithm to predict gene structures. The software implementing the gene finder was also the first HMM gene finder made available as open-source software, something of value given the rate at which new organisms were then being sequenced.
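The distinction drawn above between Viterbi decoding and posterior probabilities can be illustrated with a minimal sketch. It is not the gene finder of [3]: the two-state model (N for non-coding, C for coding), every probability, and all function names are invented for illustration. A real gene finder uses many more states and works in log space or with scaling to avoid numerical underflow on long sequences.

```python
# Toy two-state HMM. All numbers below are made-up assumptions, not values from [3].
STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def forward(seq):
    """f[i][s] = P(seq[0..i], state at position i is s)."""
    f = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({s: EMIT[s][x] * sum(prev[r] * TRANS[r][s] for r in STATES)
                  for s in STATES})
    return f


def backward(seq):
    """b[i][s] = P(seq[i+1..] | state at position i is s)."""
    b = [{s: 1.0 for s in STATES}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(TRANS[s][r] * EMIT[r][x] * nxt[r] for r in STATES)
                     for s in STATES})
    return b


def posterior_coding(seq):
    """Per-base posterior probability of the coding state, summed over all paths."""
    f, b = forward(seq), backward(seq)
    total = sum(f[-1][s] for s in STATES)  # P(seq)
    return [f[i]["C"] * b[i]["C"] / total for i in range(len(seq))]


def viterbi(seq):
    """Single most probable state path, returned as a string of N and C."""
    v = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    back = []
    for x in seq[1:]:
        prev = v[-1]
        ptr = {s: max(STATES, key=lambda r: prev[r] * TRANS[r][s]) for s in STATES}
        v.append({s: prev[ptr[s]] * TRANS[ptr[s]][s] * EMIT[s][x] for s in STATES})
        back.append(ptr)
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))
```

Here `viterbi(seq)` commits to one best labelling of the sequence, whereas `posterior_coding(seq)` reports for each base how much of the total probability mass passes through the coding state, which gives a per-base measure of confidence that a single path cannot.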

As an interesting side note, while doing some of the work described in that publication, I had a near-death experience with the very malaria parasite that was the subject of the work. A pure coincidence – the work had involved nothing more than electronic interaction with the parasite!

The second paper, Zhao et al. [8], introduced the novel concept of a Permuted Variable Length Markov Model (PVLMM), a generalization of the VLMM [1, 6]. VLMMs themselves are a generalization of Markov models. When applied to sequence analysis, they have the advantage of allowing long context dependencies to be modeled without necessarily incurring an exponential increase in the number of parameters to estimate. However, the dependencies that VLMMs model best are still relatively local, and they are ill-suited to describing the long-range dependencies that sometimes occur between particular positions in a sequence. PVLMMs offer a way around that limitation by providing a framework in which the modeled sequence can be permuted to bring dependent positions together, turning long-range dependencies into local ones.
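The two ideas in this paragraph, variable-length contexts and permutation of positions, can be sketched in a few lines. This is a simplified illustration and not the method of [8]: the count-threshold pruning stands in for the context algorithm of [1, 6], the model ignores position within the site, add-one smoothing is an arbitrary choice, and the names `train_vlmm`, `prob` and `permute` are hypothetical.

```python
from collections import defaultdict


def train_vlmm(seqs, max_order=3, min_count=2):
    """Count next-symbol frequencies for every context of length 0..max_order.

    Contexts seen fewer than min_count times are dropped (a crude stand-in
    for principled context-tree pruning); the empty context is always kept.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i, sym in enumerate(seq):
            for k in range(0, min(i, max_order) + 1):
                counts[seq[i - k:i]][sym] += 1
    return {ctx: dict(c) for ctx, c in counts.items()
            if ctx == "" or sum(c.values()) >= min_count}


def prob(model, context, sym, alphabet="ACGT"):
    """P(sym | longest suffix of context kept in the model), add-one smoothed."""
    for k in range(len(context), -1, -1):
        ctx = context[len(context) - k:]
        if ctx in model:
            c = model[ctx]
            return (c.get(sym, 0) + 1) / (sum(c.values()) + len(alphabet))
    return 1.0 / len(alphabet)  # untrained model: fall back to uniform


def permute(seq, order):
    """Reorder positions so that dependent positions become neighbours."""
    return "".join(seq[j] for j in order)


if __name__ == "__main__":
    # Hypothetical sites in which the first and last bases always agree.
    sites = ["AACGTA", "CGTACC", "GTTCAG", "TCAGGT"]
    order = (0, 5, 1, 2, 3, 4)  # place position 5 right after position 0
    model = train_vlmm([permute(s, order) for s in sites], max_order=1)
    print(prob(model, "A", "A"))
```

The number of stored contexts grows only with the contexts that are actually observed often enough, rather than with 4^k for a fixed order k, and after permutation a dependency between the two ends of a site falls within a context of length one.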

The paper provides some impressive work, putting the new theory into practice in two substantial applications: modeling of splice sites, a sub-component of gene sequences; and modeling of Transcription Factor Binding Sites (TFBS), important regions of DNA to which regulatory molecules known as transcription factors bind as part of the regulation mechanism for gene expression. By showing effective performance in two different sequence analysis problems, a strong case is made for the PVLMM as a general tool that will be well suited to a broad range of applications.

These papers, along with the diverse range of publications reviewed in the other chapters, provide a sense of the amazing breadth of Terry’s work. I am a direct beneficiary of his diverse interests – when he introduced me to the field of statistics applied to molecular biology, I enjoyed it so much that it ended up being the basis of my career to date. I will always be grateful to him for how selflessly he shared his time and insights, and for the patient guidance he provided during my graduate years and beyond.

References

[1] P. Bühlmann and A. J. Wyner. Model selection for variable length Markov chains and tuning the context algorithm. Ann. Inst. Stat. Math., 52(2):287–315, 2000.
[2] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268:78–94, 1997.
[3] S. Cawley, A. Wirth, and T. P. Speed. Phat—a gene finding program for Plasmodium falciparum. Mol. Biochem. Parasit., 118:167–174, 2001.
[4] A. Krogh. Two methods for improving performance of an HMM and their application for gene finding. In Proc. Int. Conf. Intell. Syst. Mol. Biol., volume 5, pages 179–186. ISMB, 1997.
[5] D. Kulp, D. Haussler, M. G. Reese, and F. H. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. In Proc. Int. Conf. Intell. Syst. Mol. Biol., volume 4, pages 134–142. ISMB, 1996.
[6] J. Rissanen. Complexity of strings in the class of Markov sources. IEEE Trans. Inform. Theory, 32:526–532, 1986.
[7] F. Sanger and A. R. Coulson. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol., 94(3):441–448, 1975.
[8] X. Zhao, H. Huang, and T. P. Speed. Finding short DNA motifs using permuted Markov models. J. Comput. Biol., 12(6):894–906, 2005.

Author information

Authors and Affiliations

Ion Torrent, San Francisco, USA
Simon E. Cawley

Corresponding author

Correspondence to Simon E. Cawley.

Editor information

Editors and Affiliations

School of Public Health, Div. Biostatistics, University of California, Earl Warren Hall 140, Berkeley, 94720, California, USA
Sandrine Dudoit

About this chapter

Cite this chapter

Cawley, S.E. (2012). Biological Sequence Analysis. In: Dudoit, S. (eds) Selected Works of Terry Speed. Selected Works in Probability and Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1347-9_14

DOI: https://doi.org/10.1007/978-1-4614-1347-9_14
Published: 09 January 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1346-2
Online ISBN: 978-1-4614-1347-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
