DNA sequencing and string learning

Jiang, Tao; Li, Ming

doi:10.1007/BF01192694

Published: August 1996

Volume 29, pages 387–405, (1996)
Mathematical systems theory

Tao Jiang¹ &
Ming Li²

Abstract

In laboratories the majority of large-scale DNA sequencing is done following the shotgun strategy, which is to sequence a large number of relatively short fragments randomly and then heuristically find a shortest common superstring of the fragments [26].

We study mathematical frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn substrings. Under certain restrictions, this may be viewed as string learning in Valiant's distribution-free learning model and in this case we give an efficient learning algorithm and a quantitative bound on how many examples suffice.
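
As a rough illustration of the learning setup described above, the following sketch draws random substrings from a hidden target string and checks whether a candidate string is consistent with them. The function names, the uniform sampling, and the length bounds are assumptions made for this example only; the distribution-free model allows an arbitrary distribution over substrings, and this is not the authors' learning algorithm.

```python
import random


def sample_fragments(target, count, min_len, max_len, seed=0):
    """Draw `count` random substrings of `target`.

    Uniform sampling of length and start position is an assumption of
    this sketch, not something the abstract specifies.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(count):
        # Pick a fragment length, capped by the length of the target.
        length = rng.randint(min_len, min(max_len, len(target)))
        # Pick a start position so the fragment fits inside the target.
        start = rng.randint(0, len(target) - length)
        fragments.append(target[start:start + length])
    return fragments


def is_consistent(candidate, fragments):
    """Return True if every sampled fragment occurs in `candidate`."""
    return all(f in candidate for f in fragments)
```

A hypothesis string that passes `is_consistent` on enough samples is the kind of object a learner in this setting would output; how many samples suffice is the quantitative question the paper addresses.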

One major obstacle to our approach turns out to be a quite well-known open question on how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last 10 years [9], [29], [30]. We give the first provably good algorithm which approximates a shortest superstring of length n by a superstring of length O(n log n). The algorithm works equally well even in the presence of negative examples, i.e., when merging of some strings is prohibited.
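
To make the superstring problem concrete, here is a minimal sketch of the classic greedy merge heuristic, which repeatedly joins the two strings with the largest overlap. It is not the O(n log n) approximation algorithm of this paper, whose details are not given in the abstract, and it ignores negative examples; the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Build a common superstring by greedy pairwise merging.

    Illustrative heuristic only; it returns some superstring, not
    necessarily a shortest one.
    """
    # Drop duplicates and fragments already contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    strings = [s for s in unique if not any(s != t and s in t for t in unique)]

    while len(strings) > 1:
        # Find the ordered pair (i, j) with the largest overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the pair, writing the shared overlap only once.
        merged = strings[best_i] + strings[best_j][best_k:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)]
        strings.append(merged)

    return strings[0] if strings else ""
```

For example, `greedy_superstring(["ACGT", "GTAC", "TACG", "CG"])` returns a string that contains all four fragments. The quadratic pair search keeps the sketch short; it is not meant to be efficient.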

References

[1] D. Angluin and P. D. Laird. Learning from noisy examples. Machine Learning 2(4), 343–370, 1988.
[2] A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. To appear in Journal of the ACM; also presented at 23rd ACM Symp. on Theory of Computing, New Orleans, 1991.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters 24, 377–380, 1987.
[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and Vapnik-Chervonenkis dimension. Journal of the ACM 35(4), 1989.
[5] V. Chvátal. A greedy heuristic for the set covering problem. Mathematics of Operations Research 4(3), 1979.
[6] R. Drmanac and C. Crkvenjakov. Sequencing by hybridization (SBH) with oligonucleotide probes as an integral approach for the analysis of complex genomes. International Journal of Genomic Research 1(1), 59–79, 1992.
[7] D. Freifelder. Molecular Biology. Jones & Bartlett, 1983.
[8] P. Friedland and L. Kedes. Discovering the secrets of DNA. Communications of the ACM 28(11), 1164–1186, 1985.
[9] J. Gallant, D. Maier, and J. Storer. On finding minimal length superstring. Journal of Computer and System Sciences 20, 50–58, 1980.
[10] M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.
[11] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's model. Artificial Intelligence 36(2), 177–221, 1988.
[12] D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed classes. Proc. 2nd Workshop on Computational Learning Theory, pp. 41–56, 1989.
[13] T. Jiang and M. Li. On the complexity of learning strings and sequences. Theoretical Computer Science 119, 363–371, 1993.
[14] T. Jiang and M. Li. Approximating shortest superstrings with constraints. To appear in Theoretical Computer Science.
[15] R. Karp. Mapping the genome: some combinatorial problems arising in molecular biology. Proc. 23rd ACM Symp. on Theory of Computing, pp. 278–285, 1993.
[16] M. Kearns. The computational complexity of machine learning. Ph.D. Thesis, Report TR-13-89, Harvard University, 1989.
[17] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing 22(4), 807–837, 1993.
[18] M. Kearns, M. Li, L. Pitt, and L. G. Valiant. On the learnability of Boolean formulae. Proc. 19th ACM Symp. on Theory of Computing, pp. 285–295, 1987.
[19] G. Landau and U. Vishkin. Efficient string matching in the presence of errors. Proc. 26th IEEE Symp. on Foundations of Computer Science, pp. 126–136, 1985.
[20] A. Lesk (editor). Computational Molecular Biology, Sources and Methods for Sequence Analysis. Oxford University Press, Oxford, 1988.
[21] M. Li. Towards a DNA sequencing theory. Proc. 31st IEEE Symp. on Foundations of Computer Science, pp. 125–134, 1990.
[22] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1993.
[23] R. Michalski, J. Carbonell, and T. Mitchell. Machine Learning. Morgan Kaufmann, Los Altos, CA, 1983.
[24] H. Peltola, H. Soderlund, J. Tarhio, and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Information Processing 83 (Proc. IFIP Congress, 1983), pp. 53–64.
[25] R. Rivest. Learning decision-lists. Machine Learning 2(3), 229–246, 1987.
[26] L. Smith. The future of DNA sequencing. Science 262, 530–532, 1993.
[27] R. Staden. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research 10(15), 4731–4751, 1982.
[28] J. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD, 1988.
[29] J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science 57, 131–145, 1988.
[30] J. Turner. Approximation algorithms for the shortest common superstring problem. Information and Computation 83, 1–20, 1989.
[31] L. G. Valiant. A theory of the learnable. Communications of the ACM 27(11), 1134–1142, 1984.
[32] L. G. Valiant. Deductive learning. Philosophical Transactions of the Royal Society of London. Series A 312, 441–446, 1984.

Author information

Authors and Affiliations

Department of Computer Science, McMaster University, L8S 4K1, Hamilton, Ontario, Canada
Tao Jiang
Department of Computer Science, University of Waterloo, N2L 3G1, Waterloo, Ontario, Canada
Ming Li

Additional information

Some of the results presented in this paper appeared in the Proceedings of the 31st IEEE Symposium on the Foundations of Computer Science, pp. 125–134, 1990 [21]. The first author was supported in part by NSERC Operating Grant OGP0046613. The second author was supported in part by NSERC Operating Grants OGP0036747 and OGP0046506.

About this article

Cite this article

Jiang, T., Li, M. DNA sequencing and string learning. Math. Systems Theory 29, 387–405 (1996). https://doi.org/10.1007/BF01192694

Received: 15 December 1993
Accepted: 04 October 1994
Issue Date: August 1996
DOI: https://doi.org/10.1007/BF01192694
