Abstract
The information capacity of nucleotide sequences is defined through the specific entropy of their frequency dictionary. The specific entropy of the frequency dictionary is calculated against the reconstructed dictionary, which bears the most probable continuations of the shorter strings. This measure distinguishes the sequences both from random ones and from those with a high level of (rather simple) order. Some implications of the developed methodology for genetics, bioinformatics, and molecular biology are discussed.
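The abstract gives no formulas, so the following is only a minimal sketch of how such a measure might be computed, under stated assumptions. It assumes that the reconstructed frequency of a string of length q is the product of the frequencies of its two overlapping substrings of length q-1, divided by the frequency of their common middle part of length q-2, and that the sequence is closed into a ring so that the shorter dictionaries are exact marginals of the longer one. The function names `frequency_dictionary` and `information_capacity` are illustrative and do not come from the article.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Frequencies of all strings of length q, read along the ring of seq."""
    ring = seq + seq[:q - 1]  # assumption: the sequence is closed into a ring
    counts = Counter(ring[i:i + q] for i in range(len(seq)))
    return {word: n / len(seq) for word, n in counts.items()}


def information_capacity(seq, q):
    """Specific entropy of the real q-dictionary against the reconstructed one.

    Assumed reconstruction (not stated in the abstract):
        expected(w) = f(w[:-1]) * f(w[1:]) / f(w[1:-1])
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    real = frequency_dictionary(seq, q)
    shorter = frequency_dictionary(seq, q - 1)
    # For q == 2 the common middle part is the empty string, with frequency 1.
    middle = frequency_dictionary(seq, q - 2) if q > 2 else {"": 1.0}
    total = 0.0
    for word, f in real.items():
        expected = shorter[word[:-1]] * shorter[word[1:]] / middle[word[1:-1]]
        total += f * log(f / expected)
    return total
```

Under these assumptions, a strictly periodic sequence such as `"ACGT" * 50` gives a zero value at q = 3, because every triplet is fully predicted by its overlapping pairs; this is the "simple order" case mentioned in the abstract. A long random sequence is expected to give values close to zero as well, for the opposite reason: its strings carry no dependence beyond what the shorter strings already show.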
Cite this article
Sadovsky, M.G. Comparison of Real Frequencies of Strings vs. the Expected Ones Reveals the Information Capacity of Macromoleculae. Journal of Biological Physics 29, 23–38 (2003). https://doi.org/10.1023/A:1022554613105
DOI: https://doi.org/10.1023/A:1022554613105