Comparison of Real Frequencies of Strings vs. the Expected Ones Reveals the Information Capacity of Macromoleculae

Journal of Biological Physics

Abstract

The information capacity of nucleotide sequences is defined through the specific entropy of their frequency dictionary. The specific entropy of a frequency dictionary is calculated against the reconstructed dictionary, which consists of the most probable continuations of the shorter strings. This measure distinguishes such sequences both from random ones and from those with a high level of (rather simple) order. Some implications of the methodology for genetics, bioinformatics, and molecular biology are discussed.
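The abstract gives no formulas, so the following is only a minimal sketch of how such a measure might be computed, not the author's implementation. It assumes that the reconstructed frequency of a string of length q is the maximum-entropy extension of the shorter strings, f(w[:-1]) · f(w[1:]) / f(w[1:-1]), which reduces to the product of single-letter frequencies for q = 2. It also assumes that the specific entropy is the relative entropy of the real dictionary against the reconstructed one. The function names and the cyclic closure of the sequence are illustrative choices.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Relative frequencies of all overlapping strings of length q.

    Assumption: the sequence is closed into a ring, so every position
    starts exactly one string and dictionaries of different q agree.
    """
    n = len(seq)
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def specific_entropy(seq, q):
    """Relative entropy of the real q-dictionary against the reconstructed one.

    Assumed reconstruction rule (not stated in the abstract):
    expected(w) = f(w[:-1]) * f(w[1:]) / f(w[1:-1]) for q > 2,
    expected(w) = f(w[0]) * f(w[1]) for q = 2.
    """
    if q < 2 or len(seq) < q:
        raise ValueError("need 2 <= q <= len(seq)")
    real = frequency_dictionary(seq, q)
    shorter = frequency_dictionary(seq, q - 1)
    core = frequency_dictionary(seq, q - 2) if q > 2 else None
    total = 0.0
    for word, f in real.items():
        expected = shorter[word[:-1]] * shorter[word[1:]]
        if core is not None:
            expected /= core[word[1:-1]]
        total += f * log(f / expected)
    return total


if __name__ == "__main__":
    periodic = "ACGT" * 50
    print(specific_entropy(periodic, 2))
    print(specific_entropy(periodic, 3))
```

For the strictly periodic sequence in the usage lines, the pair dictionary is far from what single-letter frequencies predict, while the triplet dictionary is fully determined by the pairs, so under these assumptions the measure is large at q = 2 and vanishes at q = 3. This is the kind of simple order the abstract refers to.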


Sadovsky, M.G. Comparison of Real Frequencies of Strings vs. the Expected Ones Reveals the Information Capacity of Macromoleculae. Journal of Biological Physics 29, 23–38 (2003). https://doi.org/10.1023/A:1022554613105

  • DOI: https://doi.org/10.1023/A:1022554613105
