DNA sequencing and string learning

Mathematical Systems Theory

Abstract

In laboratories, the majority of large-scale DNA sequencing is done following the shotgun strategy, which is to sequence a large number of relatively short fragments at random and then heuristically find a shortest common superstring of the fragments [26].

We study mathematical frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn substrings. Under certain restrictions, this may be viewed as string learning in Valiant's distribution-free learning model and in this case we give an efficient learning algorithm and a quantitative bound on how many examples suffice.

One major obstacle to our approach turns out to be a quite well-known open question on how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last 10 years [9], [29], [30]. We give the first provably good algorithm which approximates a shortest superstring of length n by a superstring of length O(n log n). The algorithm works equally well even in the presence of negative examples, i.e., when merging of some strings is prohibited.
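As a rough illustration of the superstring step in the shotgun strategy, the sketch below implements the classical greedy merge heuristic studied in the superstring literature (compare [29]). It is not the O(n log n) approximation algorithm of this paper, and it ignores negative examples. The function names `overlap` and `greedy_superstring` are hypothetical.

```python
from __future__ import annotations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap.

    Illustrative heuristic only: it always returns a common superstring
    of the fragments, but not necessarily a shortest one.
    """
    # Drop duplicates and fragments already contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(strings) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue the two chosen strings together along their overlap.
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)]
        strings.append(merged)

    return strings[0] if strings else ""


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "ACGGTT"]
    print(greedy_superstring(reads))
```

Each merge scans all ordered pairs of strings, so the sketch favors clarity over speed; a practical implementation would precompute the overlaps.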

References

  1. D. Angluin and P. D. Laird. Learning from noisy examples. Machine Learning 2(4), 343–370, 1988.

  2. A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. To appear in Journal of the ACM; also presented at 23rd ACM Symp. on Theory of Computing, New Orleans, 1991.

  3. A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters 24, 377–380, 1987.

  4. A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and Vapnik-Chervonenkis dimension. Journal of the ACM 36(4), 1989.

  5. V. Chvátal. A greedy heuristic for the set covering problem. Mathematics of Operations Research 4(3), 1979.

  6. R. Drmanac and C. Crkvenjakov. Sequencing by hybridization (SBH) with oligonucleotide probes as an integral approach for the analysis of complex genomes. International Journal of Genomic Research 1(1), 59–79, 1992.

  7. D. Freifelder. Molecular Biology. Jones & Bartlett, 1983.

  8. P. Friedland and L. Kedes. Discovering the secrets of DNA. Communications of the ACM 28(11), 1164–1186, 1985.

  9. J. Gallant, D. Maier, and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences 20, 50–58, 1980.

  10. M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.

  11. D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's model. Artificial Intelligence 36(2), 177–221, 1988.

  12. D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed classes. Proc. 2nd Workshop on Computational Learning Theory, pp. 41–56, 1989.

  13. T. Jiang and M. Li. On the complexity of learning strings and sequences. Theoretical Computer Science 119, 363–371, 1993.

  14. T. Jiang and M. Li. Approximating shortest superstrings with constraints. To appear in Theoretical Computer Science.

  15. R. Karp. Mapping the genome: some combinatorial problems arising in molecular biology. Proc. 25th ACM Symp. on Theory of Computing, pp. 278–285, 1993.

  16. M. Kearns. The computational complexity of machine learning. Ph.D. Thesis, Report TR-13-89, Harvard University, 1989.

  17. M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing 22(4), 807–837, 1993.

  18. M. Kearns, M. Li, L. Pitt, and L. G. Valiant. On the learnability of Boolean formulae. Proc. 19th ACM Symp. on Theory of Computing, pp. 285–295, 1987.

  19. G. Landau and U. Vishkin. Efficient string matching in the presence of errors. Proc. 26th IEEE Symp. on Foundations of Computer Science, pp. 126–136, 1985.

  20. A. Lesk (editor). Computational Molecular Biology, Sources and Methods for Sequence Analysis. Oxford University Press, Oxford, 1988.

  21. M. Li. Towards a DNA sequencing theory. Proc. 31st IEEE Symp. on Foundations of Computer Science, pp. 125–134, 1990.

  22. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1993.

  23. R. Michalski, J. Carbonell, and T. Mitchell. Machine Learning. Morgan Kaufmann, Los Altos, CA, 1983.

  24. H. Peltola, H. Soderlund, J. Tarhio, and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Information Processing 83 (Proc. IFIP Congress, 1983), pp. 53–64.

  25. R. Rivest. Learning decision-lists. Machine Learning 2(3), 229–246, 1987.

  26. L. Smith. The future of DNA sequencing. Science 262, 530–532, 1993.

  27. R. Staden. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research 10(15), 4731–4751, 1982.

  28. J. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD, 1988.

  29. J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science 57, 131–145, 1988.

  30. J. Turner. Approximation algorithms for the shortest common superstring problem. Information and Computation 83, 1–20, 1989.

  31. L. G. Valiant. A theory of the learnable. Communications of the ACM 27(11), 1134–1142, 1984.

  32. L. G. Valiant. Deductive learning. Philosophical Transactions of the Royal Society of London. Series A 312, 441–446, 1984.

Additional information

Some of the results presented in this paper appeared in the Proceedings of the 31st IEEE Symposium on the Foundations of Computer Science, pp. 125–134, 1990 [21]. The first author was supported in part by NSERC Operating Grant OGP0046613. The second author was supported in part by NSERC Operating Grants OGP0036747 and OGP0046506.

About this article

Cite this article

Jiang, T., Li, M. DNA sequencing and string learning. Math. Systems Theory 29, 387–405 (1996). https://doi.org/10.1007/BF01192694

  • DOI: https://doi.org/10.1007/BF01192694
