Molecular Evolution

Evans, Steven N.

doi:10.1007/978-1-4614-1347-9_11

Part of the book series: Selected Works in Probability and Statistics ((SWPS))

Although the Department of Statistics at Berkeley decided they wanted to hire me in 1987, I didn’t take up my position there until 1989. I don’t have any recollection of meeting Terry when I interviewed, but, due in part to our shared Australian nationality, we became good friends shortly after I moved to Berkeley. Two years later, I jumped at the chance to move from my gloomy, north-facing office to one next to Terry’s. Its corner location with a view across the San Francisco Bay through the Golden Gate was merely an added inducement.

The resulting proximity led us to discuss scientific matters to a much greater extent. I was soon meeting with Terry and his students, serving on his students’ dissertation committees, and attending Terry’s weekly “statistics and biology” seminar. The thing that got me irredeemably hooked on the applications of statistics and probability to biology arose out of Terry’s work with his student Trang Nguyen on phylogenetics, the enterprise that seeks to reconstruct the evolutionary family tree of some collection of taxa (typically, species) using data such as DNA sequences. Phylogenetics was already a huge field in the early 1990s with a variety of statistical and non-statistical methods, and it has expanded greatly since then. Some idea of its scope may be gleaned from Semple and Steel [43], Felsenstein [19], Gascuel [20], and Lemey et al. [28].

Phylogenetic inference can be viewed as a standard statistical estimation problem [22]. One has a probability model for the observed DNA sequences that involves two kinds of parameters: those that define the mechanism by which DNA changes over time down a lineage and those that define the tree. The latter can be thought of as being further divided into a discrete parameter, the shape of the tree, and a set of numerical parameters, the lengths of the various branches (which represent either chronological time or evolutionary distance). In principle, the problem is therefore amenable to standard inferential techniques such as maximum likelihood or Bayesian methods. Unfortunately, likelihoods are somewhat expensive to compute for large numbers of taxa because they consist of large sums of products: essentially, one has to sum over all the possibilities for the genetic types of the unobserved ancestors at each of the internal nodes of the tree. Even more forbidding is the fact that the number of possible trees for even a modest number of taxa is enormous. Any approach that involves naively searching over tree space for the tree with maximal likelihood, or summing and integrating over tree space to compute a posterior distribution, therefore quickly becomes computationally infeasible, although there are widely used software packages that incorporate effective heuristics for maximizing the likelihood [21, 46, 45] and MCMC methods for computing posterior distributions [23, 24]. This computational difficulty is particularly galling because a significant proportion of the effort is expended to estimate the edge lengths of the tree and the parameters of the DNA substitution model, while the main object of interest is often just the shape of the tree.
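To make the two computational costs concrete, here is a minimal sketch in Python. It is an illustration under assumptions that the text does not make: the substitution model is taken to be Jukes-Cantor, the tree encoding (nested tuples of child and branch-length pairs) and the four-taxon example are hypothetical, and the recursion is the standard "pruning" way of organizing the sum over ancestral states rather than the method of any particular package cited above. The count of unrooted binary tree shapes on n labelled taxa is the classical double factorial (2n-5)!!.

```python
import math
from itertools import product

STATES = "ACGT"

def num_unrooted_topologies(n):
    """Number of unrooted binary tree shapes on n >= 3 labelled taxa: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

def jc_prob(a, b, t):
    """Jukes-Cantor probability that state a becomes b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, site):
    """Likelihood of the leaf data below `node`, for each possible state at `node`."""
    if isinstance(node, str):                      # leaf: a taxon name
        return {s: 1.0 if s == site[node] else 0.0 for s in STATES}
    out = {s: 1.0 for s in STATES}
    for child, t in node:                          # internal node: (child, length) pairs
        below = partial(child, site)
        for s in STATES:
            out[s] *= sum(jc_prob(s, c, t) * below[c] for c in STATES)
    return out

def site_likelihood(tree, site):
    """Sum over the root state, using the uniform distribution at the root."""
    root = partial(tree, site)
    return sum(0.25 * root[s] for s in STATES)

# Hypothetical four-taxon tree; branch lengths are expected substitutions per site.
left = (("t1", 0.1), ("t2", 0.2))
right = (("t3", 0.15), ("t4", 0.05))
tree = ((left, 0.3), (right, 0.1))

print(site_likelihood(tree, {"t1": "A", "t2": "A", "t3": "G", "t4": "G"}))

# The probabilities of all 4^4 site patterns should sum to 1 up to rounding.
taxa = ("t1", "t2", "t3", "t4")
print(sum(site_likelihood(tree, dict(zip(taxa, x))) for x in product(STATES, repeat=4)))

# The number of tree shapes grows super-exponentially with the number of taxa.
print([num_unrooted_topologies(n) for n in (4, 5, 10, 20)])
```

The recursion keeps the cost of one site likelihood linear in the number of taxa for a fixed tree, so in this sketch the dominant obstacle is the size of tree space rather than the sum over ancestral states.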

Trang and Terry had come across an intriguing alternative approach to phylogenetic inference, the use of phylogenetic invariants, that had been proposed in Lake [27] and Cavender and Felsenstein [12] and developed further in Cavender [10, 11]. The idea behind this approach is the following. Assume we have N taxa. At any site there are 4^N possibilities for the nucleotides exhibited by the taxa. Each of these possibilities has an associated probability that is a function of the parameters in our model. It is usual to assume that these probabilities are the same for each site and that different sites behave independently. Suppose that for a given tree there is a collection of polynomial functions of these probabilities such that each function has the property that it has value zero regardless of the values of the numerical parameters. Such functions are called phylogenetic invariants. Moreover, suppose that the values of these polynomials are typically non-zero when they are evaluated on the corresponding probabilities associated with other trees for generic values of the numerical parameters. The hope is that one can find enough invariants to distinguish between any pair of trees, estimate their values using the observed empirical frequencies of vectors of nucleotides across many sites, and then determine which tree appears to have the estimates of the values of “its” invariants close to zero and hence is a suitable estimate of the underlying phylogenetic tree.
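The strategy can be sketched on the smallest interesting case. The following Python example is an illustration under simplifying assumptions that are not in the text: it uses a two-state symmetric model (as in the "simple case with discrete states" setting of [12]) instead of four nucleotides, four taxa indexed 0 to 3, and hypothetical flip probabilities. The quadratic polynomial `quartet_invariant` is written in terms of signed sums of pattern probabilities; it vanishes for every choice of edge parameters on the tree that groups leaves 0,1 against 2,3, and is generically non-zero on the tree that groups 0,2 against 1,3.

```python
import random
from collections import Counter
from itertools import product

def edge(s, t, p):
    """Two-state symmetric edge: the state flips with probability p."""
    return p if s != t else 1.0 - p

def pattern_probs(cherries, p_leaf, p_internal):
    """Exact probabilities of the 2^4 leaf patterns on a quartet tree.

    `cherries` = ((a, b), (c, d)) gives the leaf indices on each side of the
    internal edge; the hidden internal states u and v are summed out.
    """
    (a, b), (c, d) = cherries
    probs = {}
    for x in product((0, 1), repeat=4):
        total = 0.0
        for u, v in product((0, 1), repeat=2):
            term = 0.5 * edge(u, v, p_internal)
            term *= edge(u, x[a], p_leaf[a]) * edge(u, x[b], p_leaf[b])
            term *= edge(v, x[c], p_leaf[c]) * edge(v, x[d], p_leaf[d])
            total += term
        probs[x] = total
    return probs

def signed_sum(freqs, subset):
    """Pattern frequencies weighted by (-1)^(number of 1s among `subset`)."""
    return sum(f * (-1) ** sum(x[i] for i in subset) for x, f in freqs.items())

def quartet_invariant(freqs):
    """Polynomial that is zero on every tree with leaves 0,1 | 2,3."""
    return (signed_sum(freqs, (0, 2)) * signed_sum(freqs, (1, 3))
            - signed_sum(freqs, (0, 3)) * signed_sum(freqs, (1, 2)))

def empirical_freqs(columns):
    """Observed frequency of each site pattern in a list of alignment columns."""
    counts = Counter(columns)
    return {x: n / len(columns) for x, n in counts.items()}

# Hypothetical parameters: the same edge probabilities on two different shapes.
p_leaf, p_internal = [0.1, 0.1, 0.1, 0.1], 0.2
true_tree = pattern_probs(((0, 1), (2, 3)), p_leaf, p_internal)
other_tree = pattern_probs(((0, 2), (1, 3)), p_leaf, p_internal)

# Simulate 5000 independent sites from the first tree.
rng = random.Random(0)
patterns = list(true_tree)
sites = rng.choices(patterns, weights=[true_tree[x] for x in patterns], k=5000)

print(quartet_invariant(true_tree))               # exact probabilities, matching tree
print(quartet_invariant(other_tree))              # exact probabilities, other tree
print(quartet_invariant(empirical_freqs(sites)))  # estimate from simulated data
```

With data, the value on the matching tree is only approximately zero, so a practical procedure would also need some measure of sampling variability for the estimated invariants.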

In order to implement this strategy, one needs ideally a procedure for finding all the invariants for a given tree. Because a linear combination of invariants is an invariant and the product of an invariant and an arbitrary polynomial is an invariant, the invariants form an ideal in the ring of polynomials, and so one actually wants to characterize an algebraically independent generating set. When Terry and I discussed this problem, we realized that the models of DNA substitution for which others had been successful in finding specific examples of invariants by ad hoc means were ones in which there is an underlying group structure. More specifically, if one identifies the nucleotides {A, G, C, T} with the elements of the abelian group \({\mathbb{Z}}_{2} \oplus {\mathbb{Z}}_{2}\) in an appropriate manner, then the substitution dynamics are just those of a continuous time random walk (that is, a process with stationary independent increments) on this group. This suggested that we should attack the problem with Fourier theory for abelian groups. I should note that similar observations about the substitution models were made by others such as Székely et al. [50] around the same time. We found in our joint paper reproduced in this volume that the algebraic structure of the likelihoods looks much simpler in “Fourier coordinates” and that one can determine a generating set for the family of invariants of a given tree using essentially linear algebra. We also proposed some conjectures on the number of “independent” invariants for various models that were verified subsequently in Evans and Zhou [17] and Evans [18].
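A small numerical check may help to show why the passage to Fourier coordinates simplifies matters. The Python sketch below is illustrative only and rests on assumptions that go beyond the text: the four group elements are encoded as the two-bit integers 0 to 3 with XOR as the group operation (which nucleotide corresponds to which element is left unspecified), each edge carries an arbitrary hypothetical distribution for the change along it (standing in for the random walk run for that edge's length), the tree is a quartet with leaves 1,2 on one side and 3,4 on the other, and the root distribution is uniform. Under these assumptions the Fourier transform of the joint distribution of the leaf states is either zero or a single product with one factor per edge.

```python
from itertools import product

G = range(4)  # the group as two-bit integers; the group operation is XOR

def chi(a, g):
    """Character indexed by a, evaluated at g (values are +1 or -1)."""
    return -1 if bin(a & g).count("1") % 2 else 1

def fourier(f):
    """Fourier transform of a function on the group, given as a list of 4 values."""
    return [sum(f[h] * chi(a, h) for h in G) for a in G]

def joint(f):
    """Leaf-pattern probabilities on the quartet 1,2 | 3,4 with a uniform root.

    f[e][h] is the probability that the state changes by the group element h
    along edge e; edges 0-3 are pendant and edge 4 is the internal edge.
    """
    p = {}
    for g in product(G, repeat=4):
        p[g] = sum(0.25 * f[0][u ^ g[0]] * f[1][u ^ g[1]] * f[4][u ^ v]
                   * f[2][v ^ g[2]] * f[3][v ^ g[3]]
                   for u, v in product(G, repeat=2))
    return p

def joint_fourier(p, a):
    """Fourier coefficient of the joint distribution at characters a = (a1, ..., a4)."""
    return sum(prob * chi(a[0], g[0]) * chi(a[1], g[1])
               * chi(a[2], g[2]) * chi(a[3], g[3])
               for g, prob in p.items())

def monomial(f_hat, a):
    """Closed form: zero, or a product with one edge coefficient per edge."""
    if a[0] ^ a[1] ^ a[2] ^ a[3]:
        return 0.0
    return (f_hat[0][a[0]] * f_hat[1][a[1]] * f_hat[2][a[2]] * f_hat[3][a[3]]
            * f_hat[4][a[2] ^ a[3]])

# Hypothetical edge distributions: probability of no change, then of each change.
f = [[0.85, 0.08, 0.04, 0.03], [0.90, 0.05, 0.03, 0.02], [0.80, 0.10, 0.06, 0.04],
     [0.88, 0.06, 0.04, 0.02], [0.70, 0.15, 0.10, 0.05]]
f_hat = [fourier(fe) for fe in f]
p = joint(f)

# Largest discrepancy between the brute-force transform and the closed form.
worst = max(abs(joint_fourier(p, a) - monomial(f_hat, a))
            for a in product(G, repeat=4))
print(worst)
```

Because every non-zero Fourier coordinate is a monomial in the edge coefficients, relations among the coordinates come from relations among exponent vectors, which is the sense in which finding invariants reduces to something close to linear algebra in this sketch.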

It turned out that Terry and I had been like Molière’s Monsieur Jourdain in Le Bourgeois Gentilhomme, who “for more than forty years” had been “speaking prose without knowing it.” The simple structure we observed after the passage to Fourier coordinates is an instance of a toric ideal, and we had unwittingly reproduced some of the elementary theory related to such objects. This connection was made in Sturmfels and Sullivant [47], and it led to a large amount of work using tools from commutative algebra to construct and analyze phylogenetic invariants in a number of different settings [1, 2, 8, 15, 6, 3, 5, 4, 9, 14]. Even tools from the representation theory of non-abelian groups have turned out to be useful in this context [49, 48]. Moreover, the investigation of phylogenetic invariants led in part to an appreciation of the extent to which many statistical models could be profitably studied from the perspective of commutative algebra and algebraic geometry, and this point of view is the basis of the field of algebraic statistics [37, 38, 41, 16].

An extremely important observation in phylogenetics is that evolution occurs at the level of genes and that different genes can have evolutionary family trees that disagree with the associated species tree. For example, genes can be duplicated and the duplicate can mutate to take on a new function, sometimes resulting in the loss of another gene that originally performed that function. Also, the lineages of orthologous genes (that is, genes descended from a common ancestral gene) in two taxa will split some time before the corresponding split in the species tree, and if this difference is sufficiently great the shape of the tree for a given gene will differ from that of the species tree. This means that in constructing a species tree one needs to resolve the incompatibilities observed between the trees constructed for various genes. On the other hand, if one has an accepted species tree, then it is desirable to reconcile a discordant gene tree with the species tree by describing how the above phenomena might have conspired to produce the differences between the two. This general problem is discussed in Pamilo and Nei [40], Page and Charleston [39], Nichols [36], and Maddison [32].
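The effect of gene lineages splitting before the species do can be quantified in the simplest case. The sketch below uses a standard result from neutral coalescent theory that is not stated in the text: for three species, if the internal branch of the species tree has length t in coalescent units, the gene tree has the same shape as the species tree with probability 1 - (2/3)e^{-t}. The function names and the simulation are illustrative, and the model ignores gene duplication and loss.

```python
import math
import random

def concordance_probability(t):
    """P(gene tree shape matches the species tree) for three species.

    t is the length of the internal species-tree branch in coalescent units.
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-t)

def simulate_concordance(t, n_genes, seed=0):
    """Monte Carlo check: fraction of simulated genes whose tree matches."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_genes):
        # The two lineages from the sister species coalesce within the internal
        # branch if an Exp(1) waiting time is shorter than the branch.
        if rng.expovariate(1.0) < t:
            matches += 1
        # Otherwise three lineages reach the ancestral population, and each of
        # the three pairs is equally likely to be the first to coalesce.
        elif rng.randrange(3) == 0:
            matches += 1
    return matches / n_genes

for t in (0.1, 0.5, 1.0, 2.0):
    print(t, concordance_probability(t), simulate_concordance(t, 20000))
```

As t shrinks to zero the probability falls to 1/3, so for short internal branches most gene trees would disagree with the species tree under this model.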

The papers by Bourgon et al. [7] and Wilkinson et al. [51] carry out the task of clarifying the connection between a gene tree and a species tree in two important instances: the evolution of the serine repeat antigen in various Plasmodium species (including P. falciparum, the parasite responsible for the most acute form of malaria in humans) and the evolution of relaxin-like peptides across species ranging from humans to the zebrafish and the African clawed frog.

There has been considerable fascinating theoretical research on the problem of constructing species trees from gene trees, some of it showing quite paradoxical behavior. For example, the most likely gene tree can differ from the species tree, and inferring a species tree by concatenating the sequences of several genes and treating the result as one gene can lead to an incorrect species tree with high probability [42, 13, 31, 25, 33, 34]. Some recent approaches to constructing well-behaved estimates of species trees using data from several genes are those of Liu and Pearl [30], Liu [29], Kubatko et al. [26], and Mossel and Roch [35].

The last of Terry’s work on molecular evolution is his paper with Sidow and Nguyen [44] on estimating invariable codons using capture-recapture methods. Invariable codons are those that are conserved across different species because of structural or functional constraints. In essence, they are codons that are prevented from changing because any change would have fatal biochemical consequences. It is not possible to observe which codons are invariable by simply looking at sequence data because some codons might be conserved by chance across all species even though there is no biochemical reason preventing a change, and so the invariable codons form some unknown fraction of the conserved ones. This paper is another example of Terry at his best: it provides answers of genuine scientific importance using simple, sensible statistical ideas that are normally not associated with the analysis of molecular data and that he probably learned from his extensive teaching and consulting experience.
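The flavor of a capture-recapture calculation can be conveyed with the classical two-sample Lincoln-Petersen estimator. The Python sketch below is a generic illustration and should not be read as the procedure of [44]: the mapping of "captures" to codons seen to vary in two groups of species, the assumption that the two groups behave independently, and all of the counts are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Classical two-sample capture-recapture estimate of a population size.

    n1, n2: numbers caught in each sample; m: number caught in both samples.
    """
    if m == 0:
        raise ValueError("no recaptures: the estimate is undefined")
    return n1 * n2 / m

# Hypothetical counts for an alignment of 300 codons.
total_codons = 300
varied_in_group_1 = 120   # codons seen to vary within a first group of species
varied_in_group_2 = 100   # codons seen to vary within a second group of species
varied_in_both = 80       # codons seen to vary within both groups

seen_to_vary = varied_in_group_1 + varied_in_group_2 - varied_in_both
conserved = total_codons - seen_to_vary

# Estimated number of codons that are able to vary, whether or not they did.
est_variable = lincoln_petersen(varied_in_group_1, varied_in_group_2, varied_in_both)
est_invariable = total_codons - est_variable

print(seen_to_vary, conserved, est_variable, est_invariable)
```

In this made-up example the estimated number of invariable codons is smaller than the number of conserved ones, the difference being the codons estimated to be conserved by chance.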

Working with Terry has been one of the high points of my career at Berkeley. He has affected deeply the areas in which I have worked and my general attitude to research. Perhaps more importantly, by my good fortune of being his neighbor for around twenty years I have had an unrivaled opportunity to witness the humanity, dedication and commitment that he always shows to his students and collaborators. I may not have always lived up to the wonderful example he continues to set, but that does not make me any the less grateful for it.

References

[1] E. S. Allman and J. A. Rhodes. Phylogenetic invariants for the general Markov model of sequence mutation. Math. Biosci., 186:113–144, 2003.
[2] E. S. Allman and J. A. Rhodes. Phylogenetic invariants for stationary base composition. J. Symbolic Comput., 41:138–150, 2006.
[3] E. S. Allman and J. A. Rhodes. Phylogenetic invariants. In Reconstructing Evolution, pages 108–146. Oxford University Press, 2007.
[4] E. S. Allman and J. A. Rhodes. Molecular phylogenetics from an algebraic viewpoint. Stat. Sinica, 17:1299–1316, 2007.
[5] E. S. Allman and J. A. Rhodes. Phylogenetic ideals and varieties for the general Markov model. Adv. in Appl. Math., 40:127–148, 2008.
[6] C. Bocci. Topics on phylogenetic algebraic geometry. Expo. Math., 25:235–259, 2007.
[7] R. Bourgon, M. Delorenzi, T. Sargeant, A. N. Hodder, B. S. Crabb, and T. P. Speed. The serine repeat antigen (SERA) gene family phylogeny in Plasmodium: The impact of GC content and reconciliation of gene and species trees. Mol. Biol. Evol., 21:2161–2171, 2004.
[8] W. Buczyńska and J. A. Wiśniewski. On geometry of binary symmetric models of phylogenetic trees. J. Eur. Math. Soc. (JEMS), 9:609–635, 2007.
[9] M. Casanellas and J. Fernández-Sánchez. Geometry of the Kimura 3-parameter model. Adv. in Appl. Math., 41:265–292, 2008.
[10] J. Cavender. Mechanized derivation of linear invariants. Mol. Biol. Evol., 6:301–316, 1989.
[11] J. Cavender. Necessary conditions for the method of inferring phylogeny by linear invariants. Math. Biosci., 103:69–75, 1991.
[12] J. Cavender and J. Felsenstein. Invariants of phylogenies in a simple case with discrete states. J. Class., 4:57–71, 1987.
[13] J. H. Degnan and N. A. Rosenberg. Discordance of species trees with their most likely gene trees. PLoS Genetics, 2, 2006.
[14] J. Draisma and J. Kuttler. On the ideals of equivariant tree models. Math. Ann., 344:619–644, 2009.
[15] A. Dress and M. Steel. Phylogenetic diversity over an abelian group. Ann. Comb., 11:143–160, 2007.
[16] M. Drton, B. Sturmfels, and S. Sullivant. Lectures on Algebraic Statistics, volume 39 of Oberwolfach Seminars. Birkhäuser Verlag, 2009.
[17] S. Evans and X. Zhou. Constructing and counting phylogenetic invariants. J. Comput. Biol., 5:713–724, 1998.
[18] S. N. Evans. Fourier analysis and phylogenetic trees. In D. Healy, Jr. and D. Rockmore, editors, Modern Signal Processing (Lecture notes from an MSRI Summer School). Cambridge University Press, 2004.
[19] J. Felsenstein. Inferring Phylogenies. Sinauer, 2004.
[20] O. Gascuel, editor. Mathematics of Evolution and Phylogeny. Oxford University Press, 2007.
[21] S. Guindon and O. Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol., 52:696–704, 2003.
[22] S. Holmes. Statistics for phylogenetic trees. Theor. Popul. Biol., 63:17–32, 2003.
[23] J. P. Huelsenbeck and F. Ronquist. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17:754–755, 2001.
[24] J. P. Huelsenbeck and F. Ronquist. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19:1572–1574, 2003.
[25] L. S. Kubatko and J. H. Degnan. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol., 56:17–24, 2007.
[26] L. S. Kubatko, B. C. Carstens, and L. L. Knowles. STEM: Species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics, 25:971–973, 2009.
[27] J. Lake. A rate-independent technique for analysis of nucleic acid sequences: Evolutionary parsimony. Mol. Biol. Evol., 4:167–191, 1987.
[28] P. Lemey, M. Salemi, and A.-M. Vandamme, editors. The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis Testing. Cambridge University Press, 2nd edition, 2009.
[29] L. Liu. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics, 24:2542–2543, 2008.
[30] L. Liu and D. K. Pearl. Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol., 56:504–514, 2007.
[31] W. Maddison and L. Knowles. Inferring phylogeny despite incomplete lineage sorting. Syst. Biol., 55:21–30, 2006.
[32] W. P. Maddison. Gene trees in species trees. Syst. Biol., 46:523–536, 1997.
[33] F. A. Matsen and M. Steel. Phylogenetic mixtures on a single tree can mimic a tree of another topology. Syst. Biol., 56:767–775, 2007.
[34] F. A. Matsen, E. Mossel, and M. Steel. Mixed-up trees: The structure of phylogenetic mixtures. Bull. Math. Biol., 70:1115–1139, 2008.
[35] E. Mossel and S. Roch. Incomplete lineage sorting: Consistent phylogeny estimation from multiple loci. IEEE Comp. Bio. and Bioinformatics, 7:166–171, 2010.
[36] R. Nichols. Gene trees and species trees are not the same. Trends Ecol. Evol., 16:358–364, 2001.
[37] L. Pachter and B. Sturmfels, editors. Algebraic Statistics for Computational Biology. Cambridge University Press, 2005.
[38] L. Pachter and B. Sturmfels. The mathematics of phylogenomics. SIAM Rev., 49:3–31, 2007.
[39] R. D. M. Page and M. A. Charleston. From gene to organismal phylogeny: Reconciled trees and the gene tree/species tree problem. Mol. Phylogenet. Evol., 7:231–240, 1997.
[40] P. Pamilo and M. Nei. Relationships between gene trees and species trees. Mol. Biol. Evol., 5:568–583, 1988.
[41] G. Pistone, E. Riccomagno, and H. P. Wynn. Algebraic Statistics, volume 89 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, 2001.
[42] N. A. Rosenberg. The probability of topological concordance of gene trees and species trees. Theor. Popul. Biol., 61:225–247, 2002.
[43] C. Semple and M. Steel. Phylogenetics, volume 22 of Mathematics and its Applications. Oxford University Press, 2003.
[44] A. Sidow, T. Nguyen, and T. P. Speed. Estimating the fraction of invariable codons with a capture-recapture method. J. Mol. Evol., 35:253–260, 1992.
[45] A. Stamatakis. RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22:2688–2690, 2006.
[46] A. Stamatakis, T. Ludwig, and H. Meier. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics, 21:456–463, 2005.
[47] B. Sturmfels and S. Sullivant. Toric ideals of phylogenetic invariants. J. Comput. Biol., 12:204–228, 2005.
[48] J. Sumner and P. Jarvis. Markov invariants and the isotropy subgroup of a quartet tree. J. Theoret. Biol., 258:302–310, 2009.
[49] J. Sumner, M. Charleston, L. Jermiin, and P. Jarvis. Markov invariants, plethysms, and phylogenetics. J. Theoret. Biol., 253:601–615, 2008.
[50] L. A. Székely, M. A. Steel, and P. L. Erdős. Fourier calculus on evolutionary trees. Adv. in Appl. Math., 14:200–210, 1993.
[51] T. N. Wilkinson, T. P. Speed, G. W. Tregear, and R. A. Bathgate. Evolution of the relaxin-like peptide family from neuropeptide to reproduction. Ann. N.Y. Acad. Sci., 1041:530–533, 2005.

Author information

Authors and Affiliations

Department of Statistics, University of California, Berkeley, USA
Steven N. Evans

Corresponding author

Correspondence to Steven N. Evans.

Editor information

Editors and Affiliations

School of Public Health, Div. Biostatistics, University of California, Earl Warren Hall 140, Berkeley, 94720, California, USA
Sandrine Dudoit

About this chapter

Cite this chapter

Evans, S.N. (2012). Molecular Evolution. In: Dudoit, S. (eds) Selected Works of Terry Speed. Selected Works in Probability and Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1347-9_11

DOI: https://doi.org/10.1007/978-1-4614-1347-9_11
Published: 09 January 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1346-2
Online ISBN: 978-1-4614-1347-9
eBook Packages: Mathematics and Statistics (R0)
