- Split View
-
Views
-
Cite
Cite
Gráinne McGuire, Michael C. Denham, David J. Balding, Models of Sequence Evolution for DNA Sequences Containing Gaps, Molecular Biology and Evolution, Volume 18, Issue 4, April 2001, Pages 481–490, https://doi.org/10.1093/oxfordjournals.molbev.a003827
- Share Icon Share
Abstract
Most evolutionary tree estimation methods for DNA sequences ignore or inefficiently use the phylogenetic information contained within shared patterns of gaps. This is largely due to the computational difficulties in implementing models for insertions and deletions. A simple way to incorporate this information is to treat a gap as a fifth character (with the four nucleotides being the other four) and to incorporate it within a Markov model of nucleotide substitution. This idea has been dismissed in the past, since it treats a multiple-site insertion or deletion as a sequence of independent events rather than a single event. While this is true, we have found that under many circumstances it is better to incorporate gap information inadequately than to ignore it, at least for topology estimation. We propose an extension to a class of nucleotide substitution models to incorporate the gap character and show that, for data sets (both real and simulated) with short and medium gaps, these models do lead to effective use of the information contained within insertions and deletions. We also implement an ad hoc method in which the likelihood at columns containing multiple-site gaps is downweighted in order to avoid giving them undue influence. The precision of the estimated tree, assessed using Markov chain Monte Carlo techniques to find the posterior distribution over tree space, improves under these five-state models compared with standard methods which effectively ignore gaps.
Introduction
The efficiency of methods for phylogeny reconstruction from DNA multiple alignments is always important, but it is particularly so in reconstructing the history of gene families or other specific parts of the genome. This is because if a method leads to ambiguities in the structure of part of the tree, we cannot simply sequence more data, as can be done for the species tree problem. Due to the statistical nature of the problem of finding phylogenetic trees for DNA (and amino acid) sequences, likelihood methods are widely regarded as the most powerful tools for these estimation problems. In the past, these methods have been dogged by computational issues; nowadays, however, increasing computer power means that likelihood principles may be used to estimate the tree for reasonably large data sets.
Likelihood methods are model-dependent; they usually provide the optimal approach if the model is adequate, but if an inappropriate model is used, they may well return poor results. If a model ignores some of the information in the data, then the likelihood analysis may be inefficient. This latter case is certainly a worry for many data sets—the commonly used Markov models of evolution describe only the substitution process and ignore insertion and deletion events within the sequences. When analyzing alignments using these models, gaps are generally handled in one of two ways. The first possibility is to ignore any columns containing gaps. The second, used in most likelihood programs, is to treat a gap as an ambiguity character (N). This permits likelihood calculations to be carried out (essentially coding a gap as N has the effect of removing the corresponding branch from the tree). Both of these approaches are suboptimal, since shared patterns of insertions and deletions (indels) may reveal much about the relationships among different sequences (e.g., see the alignment in Williams and Holland [1998] ).
Within the pairwise alignment literature, there has been some work on developing stochastic models of sequence evolution which include indel events. Early work by Bishop and Thompson (1986) on evolutionary models for aligning two sequences was improved on by Thorne, Kishino, and Felsenstein (1991, 1992) and Thorne and Churchill (1995) . Their model allows for base substitution, as well as for single or multiple-base insertions and deletions. To do this, they assume that the DNA or amino acid sequences consist of short blocks (fragments) of one or more bases/residues joined by imaginary links. A birth-death process (Grimmett and Stirzaker 1982 , pp. 157–161) operates on the links. If a “birth” occurs, a new link and fragment are inserted, while the “death” of a link means that the link itself and its corresponding fragment are deleted. This model could, in principle, be used to model sequence evolution in a phylogenetic tree, but standard likelihood calculations (summing over all possible data assignments at internal nodes of a tree) would be intractable.
An alternate way of using indels in the estimation of a phylogenetic tree is through the use of hidden Markov models (see, e.g., MacDonald and Zucchini 1997 ). Mitchison (1999) expands on earlier work (Mitchison and Durbin 1995 ) to tackle the problem of simultaneously aligning and estimating the phylogeny for a set of sequences. He uses profile hidden Markov models (Krogh et al. 1994 ), but instead of concentrating on the emission probabilities (i.e., the probability of observing a given nucleotide, amino acid, or gap character given that the underlying state is either a gap or match state), he focuses on changes in the path though the model. For instance, given two sites with nucleotides, these could be represented as MM (M = match). If over the course of time a deletion occurred at the second site, then the path through the model would become MG (G = gap). The probability of path changes like these, together with the usual nucleotide or amino acid substitution probabilities, is used to evaluate a given phylogenetic tree. This algorithm appears to work well in general but has difficulties with some evolutionary events such as a multiple-base/residue deletion followed by an insertion at short distances.
Some implementations of the parsimony principle for estimating a tree treat a gap as a fifth character in the nucleotide alphabet, which seems an obvious first step to using the information contained within gaps (e.g., Simmons and Ochoterena 2000 ). For likelihood analyses, existing models of nucleotide or amino acid substitution could be extended to incorporate this additional state. Extending simple models like the Jukes-Cantor (JC; Jukes and Cantor 1969 ) or Felsenstein81 (F81; Felsenstein 1981 ) model is straightforward. This idea has been mentioned previously in the literature (e.g., Durbin et al. 1998 , p. 217) yet appears to have been largely dismissed because it assumes unrealistically that occurrences of gaps at adjacent sites are independent. However, this overweighting of gaps may well be an improvement over simply ignoring them. Treating gap characters as an extra state in nucleotide and amino acid models has proved useful in other applications; for example, Koshi and Goldstein (1995) use a Bayesian formalism to derive amino acid substitution models, including a gap as a 21st character, for different constituent parts of proteins.
This paper describes some extensions to standard nucleotide models which incorporate a gap character and examines their performance on some simulated and real data sets. In addition to identifying the tree favored under a model, it is important to examine the precision allocated to this estimate. One way this may be done is to carry out Bayesian inference, which involves calculating the posterior probability of a tree being the true one given the data. Within the Bayesian framework, a commonly used point estimate of the true tree is the maximum a posteriori (MAP) estimate, while the 95% credible set (the smallest set of trees whose cumulative probabilities sum to 0.95) indicates the precision of this estimate. While the posterior distribution cannot be calculated analytically, it is possible to develop efficient sampling strategies using Markov chain Monte Carlo (MCMC).
MCMC methods are generally used to return the posterior distribution of the parameters of interest (in this case, the tree) given the data. Working with the posterior distribution (proportional to the likelihood times a prior distribution for the parameters) is attractive because posterior probabilities have a simple interpretation which contrasts with the difficulty in interpreting nonparametric bootstrap values (e.g., Efron, Halloran, and Holmes 1996 ). Mau, Newton, and Larget (1999) described a computationally feasible approach for estimating the posterior distribution of trees from relatively large data sets conforming to a molecular clock using MCMC. This was subsequently extended to nonclocklike data by Larget and Simon (1999) , who also released software (BAMBE) implementing this analysis (Simon and Larget 2000). Compared with standard likelihood and bootstrapping approaches in finding the best tree and measuring its reliability, BAMBE returns the posterior distribution over tree space in a fraction of the time required by bootstrapping. Hence, MCMC seems a sensible approach for evaluating the precision of an estimate.
Method
Models for Sequence Evolution
In common with other authors, we assume that the sequences to be analyzed are aligned unambiguously and without error. This assumption will not be strictly valid in many cases and is important since the chosen alignment in effect imposes assumptions on the evolutionary history of the sequences. If a nucleotide, say, an A, is deleted, and subsequently another nucleotide, say, T, is inserted at the same location, it may be argued that the appropriate way to represent the alignment of ancestral with derived nucleotides involves two columns, i.e., as A- and -T. Although our method does incorporate the correct history in this scenario, it requires the A and T to be aligned in the same column, so that all the nucleotide substitution paths from A to T are also possible. Thus, in our framework, the alignment of two nucleotides does not necessarily imply a direct homology. One drawback to this is that given two aligned homologous nucleotides, the probability of evolving from one to the other will include the probability of a deletion followed by an insertion. In practice, however, the probability of this event will generally be small, thereby making a negligible contribution to the total probability.
The models described below are similar to the F84 model of nucleotide substitution (Felsenstein and Churchill 1996 ) and its special cases. The F84 model differentiates between transitions and transversions and allows for unequal base composition. It is very similar to the HKY85 model (Hasegawa, Kishino, and Yano 1985 ) but, despite having a more complicated instantaneous rate matrix, is mathematically more tractable.
Our extension to the F84 model, the F84+Gaps model, makes the usual sequence evolution assumptions, that sites evolve independently, which means that the likelihood is a product over all the individual site likelihoods. We also assume that sites evolve according to the same evolutionary process, and that character (nucleotides, gaps) frequencies have reached equilibrium. The F84+Gaps model, like the F84 model, is time-reversible, which means that it cannot be used to root a phylogenetic tree unless a further assumption of a molecular clock is made.
Following Felsenstein and Churchill (1996) , there are two types of events in the F84 model: In a type I event, a nucleotide is replaced by one drawn from its own class (e.g., a purine is replaced by one of the two purines, A and G). In a type II event, any nucleotide replaces the existing one. In both cases, a nucleotide may be “replaced” by the same one, resulting in no change.
The F84+Gaps model can deal with a higher frequency of transitions and can also model different rates of within-nucleotide and nucleotide-gap changes. It accomplishes this by using the rate parameters γ and, where applicable, α, and the stationary frequencies of each character. For example, if indels appear to be infrequent (a small number of gaps in the alignment), then the stationary frequency of the gap character will be low, and, hence, so too will the rate of change from a nucleotide to a gap.
Special cases of this model include the K2P+Gaps model (extension of the model described in Kimura [1980] ), which assumes that character frequencies are all equal (to 0.2). In this case, the transition rate, δ, corresponds to 0.5α + 0.2γ, while ϵ = 0.2γ is the rate of all other changes. Substitution probabilities for this model may easily be found from the F84+Gaps probabilities.
These substitution probabilities may be used in likelihood calculations. Since rates and time are confounded, we follow the usual convention here that the overall rate is scaled to 1. This results in the time (branch length) corresponding to the number of character substitutions that have occurred. Should branch lengths be important, some thought is needed to decide on an appropriate measure. If the researcher is interested in measuring distances between the sequences in the alignment itself, then the branch length estimate from the five-state model may be an appropriate measure, since it takes into account gaps which can be an important source of differences between otherwise closely related sequences. On the other hand, if reference is to be made to previously published work, it may be necessary to ignore the gaps and use the usual (four-state) nucleotide substitution models to estimate the number of nucleotide substitutions per site, this being the most commonly employed measure of distance between DNA sequences.
The main criticism that has been leveled at five- state models is that they treat each site independently. For example, a 20-bp insertion is treated as 20 independent events, implying strong evidence for any phylogenetic split supported by this event. At the other extreme, some approaches remove columns where insertion and deletion events have occurred, thereby removing all of the information contained within the indels and nucleotide substitutions at those sites. A simple ad hoc approach to finding a compromise between these extremes is to downweight the likelihoods at sites containing indel events so that they count approximately as one event. To implement this, for any column in an alignment containing a gap, we sum the lengths of any indels which pass through this site in all sequences. Dividing by the number of sequences with indels at that particular site yields the average indel length at that site. The inverse of this is used to downweight the log likelihood. This quantity is calculated for all sites (columns with no indels receive a weighting of 1), and all of the site log likelihoods are weighted to yield the overall log likelihood. This is still too conservative, however, since it essentially views (for example) shared 3-bp gaps and 25-bp gaps as carrying the same evidence that two sequences are monophyletic.
The Posterior Distribution over Tree Space
Following Bayes' theorem, the posterior distribution over the parameter space is proportional to the likelihood times the prior distribution. The likelihood function for a given tree (topology and branch lengths) may be calculated in the usual way (i.e., using the pruning algorithm described in Felsenstein [1981] ).
Prior distributions are required for all parameters used. Obviously, these include the topology (branching pattern) and the branch lengths. All possible topologies were considered equally likely a priori (this uniform distribution is a common choice; e.g., it is used in BAMBE [see Larget and Simon 1999] ). A weak (i.e., almost flat, therefore vague) gamma prior was placed on the branch lengths which assigned positive probability to a wide range of branch lengths. Alternatives include assuming that branch lengths are uniformly distributed over the positive real line or assuming that they lie uniformly within an interval bounded by 0 and some large number. The latter is a better choice than the former, since it is a proper prior (i.e., integrates to a finite number) but has the disadvantage of imposing an upper limit on the branch lengths. The gamma distribution is both a proper prior and does not impose an upper bound (although in practice, branch lengths will generally lie between 0 and 1, so a large enough upper bound should not materially change the posterior distribution).
Prior distributions may also be required for the model parameters. The five character frequencies must lie between 0 and 1 and sum to 1; an obvious prior for these is a Dirichlet distribution with all five parameters equal to 1. This is a vague prior in that it assumes all possible combinations of frequencies are equally likely. In the F84 model, once it has been assumed that the mean rate is 1, given the character frequencies, a simple relationship exists between α and γ. Thus, one can be written in terms of the other, so only one prior is required. We chose to represent γ in terms of α, with the prior for α being a uniform distribution over the interval (0, 2). Ideally, the upper bound should be the largest value of α consistent with γ remaining nonnegative. However, fixing the bound at 2 is easier to implement while allowing a range of transition-transversion bias from none to extremely high.
In this application, the posterior is known only up to the constant of proportionality. While this does not pose a problem for MCMC, it does lead to one drawback. Since the five-state models view a gap as another character, they allow a positive probability for a column consisting entirely of gaps. In practice, this will never be observed, so the correct likelihood would condition on that. Thus, the likelihood of seeing a tree T with all other parameter values (branch lengths, character frequencies, etc.) represented by 𝛉 would be Pr(actual data | T, 𝛉)/[1 − Pr(all gaps | T, 𝛉)], where Pr(all gaps | T, 𝛉) is the likelihood of T and 𝛉 for a site containing gap characters only. Being unable to make this correction is unlikely to have any noticeable effect on the posterior distribution, since, especially for larger trees, Pr(all gaps | T, 𝛉) will be very small.
MCMC Algorithm
The Metropolis-Hastings algorithm (Metropolis et al. 1953 ; Hastings 1970 ; Chib and Greenberg 1995 ) is used here to sample from the posterior distribution of trees. The steps of this algorithm are as follows:
Suppose the chain is in state xt at time t. Sample a candidate point y from a proposal distribution q(· | xt) which may or may not depend on the current state of the chain, xt.
- This candidate point is accepted with probability β, whereand π(x) is the unscaled posterior probability of x.
If y is accepted, then xt+1 = y, else xt+1 = xt.
In theory, q may have any form provided only that q(x | y) ≠ 0 whenever q(y | x) ≠ 0, but in practice it is important to select q carefully so that the chain moves rapidly around tree space yet has a reasonable proportion of candidate points accepted (which is usually achieved by ensuring that sampled points from q tend to be close to xt, yet not so close that it takes many steps to traverse the parameter space). Furthermore, if the parameter space is complex, it is often more efficient to split the parameters into related components and update these one by one (e.g., Gilks, Richardson, and Spiegelhalter 1995 ). This has been implemented below with three components. The first consists of the tree (both topology and branch lengths) and the second is the set of character frequencies, while the third is the parameter α from the F84+Gaps model. A suitable proposal distribution for each component is described below.
The proposal distribution for trees adopted here depends on the current tree and changes a small part of that tree. This distribution is symmetric (i.e., q(x | y) = q(y | x)) so that the acceptance probability, β, does not depend on q. Although our method does not employ the molecular-clock assumption and leads to inferences about unrooted trees, the proposal distribution generates rooted trees for ease of computer coding and likelihood calculations. It is a simple matter to identify the unrooted tree that corresponds to any given rooted tree. The steps to generate a candidate tree given the current tree are as follows:
Select an internal branch. Let p be the parent node and c be the child node.
Suppose p is not the root node. Then
Let pc be the other child node of p (i.e., not node c). Let cl and cr be the left and right children, respectively, of node c.
There are three possibilities: leave the branching order unchanged, switch pc and cl, or switch pc and cr. Each of these is equally likely.
Suppose p is the root node. Then, with probability 1/ 2, let c become the new root node.
If p remains the root node, follow steps 1–2 above to implement branch swapping.
If the root changes to node c then
p becomes one of the children of c.
Select one of c's original children, cl or cr, to remain as c's other child. The unselected child node and the existing child node of p, pc, become the two children of p.
Modify the four branch lengths involved in this move (|p, c|, |p, pc|, |c, cl|, |c, cr|) by multiplying each one by a different factor chosen uniformly from the interval [1/(1 + ϵ), 1 + ϵ] for positive ϵ ≪ 1.
Accept the proposed tree with probability β calculated using equation (2).
A Dirichlet distribution is used to generate new values for the character frequencies, thus assuring that they both lie in the interval (0, 1) and sum to 1. The parameters of this Dirichlet distribution are proportional to the current values of the character frequencies, thereby generating new values close to the existing ones. This proposal distribution is not symmetric, so the probabilities q(x | y) and q(y | x) contribute to the Metropolis-Hastings acceptance probability.
To vary the F84+Gaps model parameters, new values are proposed for α, the transition rate in the F84 model. Since we assume that the mean rate is 1, given α and the character frequencies, γ may easily be determined. To propose a new value for α, a value is selected uniformly from an interval centered around the existing value of α, leading to a symmetrical proposal distribution. Should α be close to its boundaries such that part of this interval lies outside, values can be reflected back into range; in such an event, the proposal distribution remains symmetric.
An initial burn-in period is necessary to ensure that the chain has reached its stationary distribution. During this time, the proportion of each of the components accepted is monitored. If any of these values is too low (e.g., <0.05), then the proposal distributions may be tuned to increase acceptance rates. For example, in the tree proposal algorithm, the value of ϵ can be reduced, leading to new branch lengths closer to the existing ones. For the character frequencies, using a higher constant of proportionality should lead to an increased chance of accepting new values, while the interval from which new α values are proposed can be narrowed. Post–burn-in, these aspects of the proposal distributions are required to remain fixed.
Since the tree proposal distribution in particular makes only small changes in the tree, successive points in the chain will be highly correlated. To reduce this correlation, the output should be thinned; i.e., only every jth iteration should be stored. Sensible choices of burn- in period and the level of thinning required may be made by running several short chains from different starting points for a particular data set and observing the output.
We have written a C++ implementation of this algorithm, MAC5, which returns a number of files containing different information including the topologies (branching order only) at every jth iteration. This application is available on the World Wide Web at http://www.reading.ac.uk/statistics/genetics/ in the software section. The BAMBE program SUMMARIZE (Simon and Larget 2000 ) is used to generate a listing of all topologies in descending order of frequency.
Results
Simulated Data
A small simulation study was carried out to investigate the performance of the five-state evolution models and the MCMC algorithm in estimating the posterior distribution over tree space. Initially, data were simulated under one of the five-state models, which allowed the performance of the MCMC algorithm to be validated. Data were also simulated using a model incorporating multiple-site insertions and deletions, which may be more realistic. Hence, the performance of our method based on the five-state model on data which did not conform to an independent-site assumption could be assessed.
All data sets were simulated under a simple eight- species bifurcating tree, (((1, 2), (3, 4)), ((5, 6), (7, 8))), with all branches having the same length. The data in the first part of the study were simulated under the JC+Gaps model. The sequences were 1,000 bp long, and three data sets having branch lengths of 0.05, 0.1, and 0.4 were generated.
For the second half of the study, the same tree was again used, but with branch lengths of 0.1, 0.2, 0.3, and 0.4. Nucleotide substitution followed the JC model. The root sequence of the tree was 500 bp long and contained no gap characters. At each site along an evolving sequence, there were three possible events: insertion, deletion, or substitution. If either an insertion or a deletion event occurred at a site, then the additional insertion/ deletion length l was randomly generated from a geometric distribution with parameter 0.5. For a deletion, l + 1 bases, including the current base, were replaced by gap characters. If an insertion occurred, then l + 1 bases were placed in the sequence at that point, with the nucleotides being randomly selected (the JC model assumes that nucleotides occur with equal frequencies). In all other sequences, l + 1 gap characters were added in the corresponding locations to maintain the sequence alignment. The smaller of the insertion and deletion probabilities was on the order of 0.005 per site per branch of length 0.1. Three groups of data were simulated; all consisted of four data sets (the four different branch lengths) with the first setting Prob(deletion) = 2 × Prob(insertion), the second having Prob(deletion) = Prob(insertion), and the third having Prob(insertion) = 2 × Prob(deletion). Part of one of these data sets is shown in figure 1 . Note that the lengths of the alignments returned will be variable; the higher the probability of insertion, the longer the alignment.
For all three data sets, the performance under the MCMC algorithm based on the JC+Gaps model (MCMC-5*) was examined and compared with the results from BAMBE and DNAML (maximum-likelihood [ML] analysis from the PHYLIP package; Felsenstein 1993 ). For the second set of data (multiple-site gaps), an implementation of the algorithm downweighting gaps as discussed above was also applied (referred to as MCMC-5*W). For the first type of data (simulated under the five-state model), the Jukes-Cantor or, if applicable, the JC+Gaps model was used for the likelihood calculations in all four methods. The second type of data is unlikely to have equal proportions of all characters (i.e., the gap character is unlikely to account for 20% of the sequence), so the F81 models were more appropriate here. BAMBE and DNAML deal with gap states by considering a gap to mean an A or C or G or T, which has the effect of removing the branch leading to a site with a gap from the tree. Comparisons between the output of MCMC-5*/MCMC-5*W and BAMBE give an idea of how the precision of the MAP tree was affected by including the insertion-deletion data. The ML tree is calculated using DNAML as a further check to see if including the gap states might affect the estimated topology. Note that in order to compare MCMC tree estimation with the ML tree from DNAML, the character frequencies were not updated in MCMC-5* or BAMBE.
Following some exploratory runs of the Markov chain for some of the simulated data sets, the burn-in time chosen for BAMBE, MCMC-5*, and MCMC-5*W was 200,000 iterations. Post–burn-in, 10,000 values were returned, sampled every 200 iterations.
When implementing MCMC, it is important to carry out diagnostics on the chain to check that it has reached the stationary distribution and that it is mixing well. The former was checked by starting the chain from different points and ensuring that the log posterior probabilities all reached a similar plateau; the latter was checked by examining the log posterior probability and branch length information (average branch length for MCMC-5*, half tree-diameter for BAMBE) to see if these values were mixing well with low amounts of serial dependency (this may be checked by examining the autocorrelation function). All the data sets appeared to have passed burn-in and were mixing well (data not shown). An example of these plots is given in figure 2 for a mitochondrial data set.
The results of the simulation study for the MCMC algorithm based on the five-state model are given in table 1 . It would be expected that all methods would perform well on these data sets but that using the five- state model would be the best because the data are simulated under this model. For trees with short branches (0.05 and 0.1), both BAMBE (nucleotide substitution model only) and MCMC-5* (five-state model) perform similarly, while for a branch length of 0.4, MCMC-5* returns the MAP tree with higher precision than BAMBE (0.99 vs. 0.91), meaning that the 95% credible set from BAMBE contains two trees, in comparison to the one in the MCMC-5* set. In all cases, the MAP and ML estimates correspond to the correct tree, while branch lengths for MCMC-5* are accurately estimated. BAMBE returns the half tree diameter (in effect, an average over large branches), so it would be expected to overestimate the true branch length. DNAML tends to underestimate the average branch length, which may be due to the fact that the JC model is inadequate for this data set.
The MCMC algorithms appear to be satisfactory for tree estimation if the model is correct. Next, the performance of the five-state model for the data sets with multiple-site insertions and deletions is examined. The results of the simulation study are given in table 2 .
A similar pattern is seen again—for small branch lengths (i.e., 0.1, 0.2), and thus fewer gaps, both MCMC-5* and BAMBE perform similarly, with the correct tree having a posterior probability close to or equal to 1. While the MAP estimate returned by BAMBE for larger branch lengths is generally the true tree, the precision of this estimate decreases significantly compared with the precision of the MCMC-5* and MCMC-5*W estimates (which also find the true tree).
Further evidence of the increased precision of the estimates comes from examining the number of trees in the 95% credible sets from substitution (BAMBE) and five-state (MCMC-5* and MCMC-5*W) models. All of the 95% credible sets using the latter models consist of only one tree, while, for larger branch lengths, the nucleotide substitution model sets contain more than one tree.
The comparison between the unweighted and weighted methods using the gap models is quite interesting. MCMC-5*W gives less weight to gaps and may represent an approximately minimum use of the gap information, while MCMC-5* tends to overweight gap characters because it treats each of them as resulting from independent indel events, effectively ignoring the possibility of multiple-site indel events. MCMC-5*W leads to a considerable gain in precision over BAMBE for longer trees. There is a further increase in precision from MCMC-5*W to MCMC-5*, but it is much smaller. This suggests that using the information contained within the gaps can lead to a considerable improvement in inference, while, for data sets with short to medium- length gaps, using a straightforward five-state model may not greatly overestimate the precision of the estimate.
The branch length estimation for longer trees appears to be poor for MCMC-5* and MCMC-5*W, which is not surprising. The longer trees, with their increased probability of insertions and deletions, contain a significant number of sites with matching gap characters. Since the five-state model treats a gap as a character, this fools it into believing that the sequences are more closely related than they actually are.
Real Data
We next applied the model to a set of four mitochondrial D-loop sequences (human, chimp, pygmy chimp, and gorilla) discussed by Saitou and Ueda (1994) , who noted that in this data set (and in others included in that paper), single-nucleotide insertions and deletions occurred at a considerably higher frequency than multiple insertions/deletions. This suggested that the five-state model was appropriate here. The data set was taken from Foran, Hixson, and Brown (1988) ; leaving out uncertain regions of the alignment, it is 990 bp long.
The F84+Gaps model was applied to this data set, with both the character frequencies and the transition- transversion ratio (via α) being updated alongside the tree in the MCMC algorithm. A burn-in of 200,000 iterations was used, and 100,000 values were returned, sampled every 200 iterations. The top graphs in figure 2 show part of the trace of the log posterior probability and the autocorrelation function of the entire post–burn- in log posterior probability. These show that the chain converges quickly and then mixes well.
This phylogeny is easily resolvable into ((chimp, pygmy chimp), human, gorilla); this tree is found in all iterations of the process. This agrees with standard ML analysis (using DNAML), which also finds that the correct tree has a considerably higher likelihood than the alternatives. Figure 2 also shows the marginal distributions of the mean branch length and the transition-transversion ratio as yielded by the MCMC algorithm. For this data set, defining the branch length to be the average number of mutations (substitutions, insertions, deletions) per site may be sensible, since most indels affect one site only and thus could be viewed mathematically as a substitution process. Inclusion of the indels leads to significantly longer branches (an average branch length of 0.0697 compared with 0.0599 in the ML tree under a no-gaps model). There is a transition/transversion bias in this data set, with an estimate of 1.58 for the transition-transversion ratio.
A further primate data set was examined. This consisted of the β-globin spacer for seven primate sequences (two human alleles, chimpanzee, gorilla, orangutan, rhesus macaque, and spider monkey) and was taken from Maeda et al. (1988) . This alignment contained a CT-rich region, which was removed. While many insertion and deletion events affected only one site, there were a number of longer events (lengths of 10 or 12 bp). Consequently, both the standard application and the weighted versions of the five-state model were applied to the data.
Again the full F84+Gaps model was used (i.e., both the character frequencies and α were updated in the MCMC algorithm). Once again, a burn-in of 200,000 iterations was applied, and 100,000 values were returned, sampled at intervals of 200. The diagnostic plots (data not shown) showed that the Markov chain rapidly converged and thereafter mixed well (the autocorrelation function rapidly tails off to 0).
The MAP estimate returned by both MCMC-5* and MCMC-5*W was the tree (((((human1, human2), chimp), gorilla), orang), macaque, spider monkey), which agrees with the currently accepted phylogeny. The posterior probability was estimated to be 1. BAMBE estimates the posterior probability of this tree as 0.69 (HKY model, kappa = 4.3), which is a considerable reduction in precision. The 95% credible interval returned by BAMBE contained two trees. The other tree exchanges chimp and gorilla in the tree above and has a posterior probability of 0.3.
Discussion
In the past, the idea of extending substitution models by considering a gap position as a fifth character alongside the nucleotides has been largely ignored except by some parsimony methods. The main argument against following this approach is that it would treat each gap position in a multiple-site insertion or deletion as an independent event, thereby giving too much weight to insertions and deletions. Simulation studies in this paper have suggested that using the information from gaps often adds more to the inference than the problems of overweighting the gaps detract from them—for some data sets simulated with multiple-site indel events, there was a significant gain in information for estimating the topology when the gap sites were used.
The main difficulties with this overweighting of gaps will be in sequences with long insertion and deletion events where some support relationships different from those in the correct tree. Treating each position independently results in a great deal of support for the incorrect relationship and may change the tree inferred. A good example of a case in which this occurs is that of the chloroplast data described by Golenberg et al. (1993) . These data consist of a region of noncoding chloroplast DNA for nine members of the grass family. Golenberg et al. (1993) found evidence that a limited number of sites were prone to insertion and deletion events and that consequently multiple indels had taken place along the same stretches of DNA. This led to conflicting phylogenetic signal which overwhelmed that from the rest of the (closely related) sequences, resulting in an incorrect tree being inferred (data not shown).
One suggestion to avoid this problem is to treat each indel as being one event, which can be approximated by dividing the site log likelihood by the average gap length at that site. As discussed previously, this may be regarded as roughly the other extreme for treating gaps—they are not ignored, but short and long gaps contribute similar amounts to the likelihood. However, our simulation study suggests that even doing this achieved a considerable gain in the precision of the estimate. This method is also better able to handle data sets such as the chloroplast one; it did not return the tree inferred by the gaps as the MAP estimate, although there was some support for it.
The best set of weights to correctly incorporate gap information using a five-state model may lie somewhere between the two cases presented here (weight each site equally vs. downweight each site likelihood by the average gap length if an indel event passes through it in at least one of the sequences). This possibility is currently being explored.
Of course, a fully statistical method would be preferable to using an ad hoc approach like downweighting gaps. The difficulty here will be in the likelihood computations. Multiple-site indel events clearly violate the assumption that each site evolves independently, suggesting that a model that allows some dependence between neighboring sites is necessary. However, independent sites are the key which makes tree likelihood calculations tractable, so this will not be a straightforward task. The method of Mitchison (1999) allows the state at a site to depend on whether or not there is a gap at neighboring sites; this may suggest a way to proceed. Implementing more realistic models is the subject of current work.
A further issue is the treatment of gaps at the start or end of a sequence alignment. These leading and trailing gaps do not generally correspond to indel events but are, rather, artifacts of the sequencing procedure (the sequencing procedure does not necessarily start or finish in the same place in all of the sequences). Clearly, these should not be treated in the same way as interior gaps. One suggestion is to code these gaps as missing data (e.g., N instead of –) before carrying out any inference.
One important application of this work is to the gene tree problem (the estimation of the hierarchical relationships among a gene family—sequences from the same gene in different species and/or similar genes within an individual). Since there is a limited amount of data (i.e., the gene) available for inferring the gene tree, it is important to use these data as efficiently as possible. Many gene families will have a considerable number of insertions and deletions (a good example is the aforementioned alignment in Williams and Holland [1998] ), so these models could be of use. The principles behind the five-state models can also be applied to models for amino acid replacement, leading to 21-state models.
Brandon Gaut, Reviewing Editor
Keywords: DNA sequences evolution models gaps likelihood Markov chain Monte Carlo phylogenetics
Address for correspondence and reprints: David Balding, Department of Applied Statistics, University of Reading, P.O. Box 240, Reading RG6 6FN, United Kingdom. d.j.balding@reading.ac.uk.
This work was funded by the BBSRC under grant number 45/G09600. We would like to thank Peter Holland and Simon Whelan for some useful discussions and the anonymous referees for helpful comments. Some of the computational work in the development of this algorithm was carried out on the HGMP-MRC computing system.
literature cited
Bishop, M. J., and E. A. Thompson.
Chib, S., and E. Greenberg.
Durbin, R., S. R. Eddy, A. Krogh, and G. Mitchison.
Efron, B., E. Halloran, and S. Holmes.
Felsenstein, J.
———.
Felsenstein, J., and G. A. Churchill.
Foran, D. R., J. E. Hixson, and W. M. Brown.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter.
Golenberg, E. M., M. T. Clegg, M. L. Durbin, J. Doebley, and D. P. Ma.
Grimmett, G. R., and D. R. Stirzaker.
Hasegawa, M., H. Kishino, and T. Yano.
Hastings, W. K.
Jukes, T. H., and C. R. Cantor.
Kimura, M.
Koshi, J. M., and R. A. Goldstein.
Krogh, A., M. Brown, I. S. Mian, and D. Haussler.
Larget, B., and D. L. Simon.
MacDonald, I. L., and W. Zucchini.
Maeda, N., C.-I. Wu, J. Bliska, and J. Reneke.
Mau, B., M. A. Newton, and B. Larget.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Mitchison, G. J.
Mitchison, G. J., and R. M. Durbin.
Saitou, N., and S. Ueda.
Simmons, M. P., and H. Ochoterena.
Thorne, J. L., and G. A. Churchill.
Thorne, J. L., H. Kishino, and J. Felsenstein.
———.