Finite-state models in the alignment of macromolecules

Allison, L.; Wallace, C. S.; Yee, C. N.

doi:10.1007/BF00160262

Published: July 1992

Volume 35, pages 77–89, (1992)

Journal of Molecular Evolution

Summary

Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself an hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.
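
The comparison of the r-theory with the null-theory can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified one-state machine whose parameter values (`p_match`, `p_change`, `p_insert`, `p_delete`) are invented for illustration, a four-letter DNA alphabet with uniform character probabilities, equal prior odds for the two theories, and no cost for encoding string lengths or parameters. All function names are hypothetical. The r-theory probability is summed over all alignments by dynamic programming, and the difference in message lengths (in bits) gives the log odds-ratio.

```python
import math


def null_length_bits(a, b, alphabet_size=4):
    """Message length under the null-theory: the strings are unrelated, so
    every character costs log2(alphabet_size) bits (lengths not encoded)."""
    return (len(a) + len(b)) * math.log2(alphabet_size)


def r_length_bits(a, b, p_match=0.7, p_change=0.1, p_insert=0.1,
                  p_delete=0.1, alphabet_size=4):
    """Message length under a simplified one-state r-theory, summing the
    probability of (a, b) over all alignments. Parameter values are
    illustrative assumptions, not estimates from the paper."""
    k = alphabet_size
    match = p_match / k                 # one shared character, uniform
    change = p_change / (k * (k - 1))   # ordered pair of distinct characters
    ins = p_insert / k                  # one character of b only
    dele = p_delete / k                 # one character of a only

    # prob[i][j]: total probability of generating a[:i] and b[:j].
    # Plain floats are adequate for short strings; long strings would
    # need log-space arithmetic to avoid underflow.
    prob = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    prob[0][0] = 1.0
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                step = match if a[i - 1] == b[j - 1] else change
                total += prob[i - 1][j - 1] * step
            if i > 0:
                total += prob[i - 1][j] * dele
            if j > 0:
                total += prob[i][j - 1] * ins
            prob[i][j] = total
    return -math.log2(prob[len(a)][len(b)])


def log_odds_bits(a, b):
    """Log2 posterior odds of r-theory over null-theory, assuming equal
    priors. Positive values favour the strings being related."""
    return null_length_bits(a, b) - r_length_bits(a, b)
```

Under these assumptions, `2 ** log_odds_bits(a, b)` is the odds-ratio in favour of the r-theory. Two identical short strings give a positive value, while two strings with no characters in common give a negative one, so the null-theory is preferred. A three- or five-state machine would replace the single table with one table per state.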

Author information

Authors and Affiliations

Department of Computer Science, Monash University, 3168, Australia
L. Allison, C. S. Wallace & C. N. Yee

Additional information

Offprint requests to: L. Allison

About this article

Cite this article

Allison, L., Wallace, C.S. & Yee, C.N. Finite-state models in the alignment of macromolecules. J Mol Evol 35, 77–89 (1992). https://doi.org/10.1007/BF00160262

Received: 05 June 1991
Revised: 02 December 1991
Accepted: 23 December 1991
Issue Date: July 1992
DOI: https://doi.org/10.1007/BF00160262
