Abstract
In laboratories, the majority of large-scale DNA sequencing follows the shotgun strategy: a large number of relatively short fragments are sequenced at random, and a shortest common superstring of the fragments is then found heuristically [26].
We study mathematical frameworks, under plausible assumptions, suitable for massive automated DNA sequencing and for analyzing DNA sequencing algorithms. We model the DNA sequencing problem as learning a string from its randomly drawn substrings. Under certain restrictions, this may be viewed as string learning in Valiant's distribution-free learning model, and in this case we give an efficient learning algorithm and a quantitative bound on how many examples suffice.
One major obstacle to our approach turns out to be a well-known open question: how to approximate a shortest common superstring of a set of strings, raised by a number of authors in the last 10 years [9], [29], [30]. We give the first provably good algorithm, which approximates a shortest superstring of length n by a superstring of length O(n log n). The algorithm works equally well in the presence of negative examples, i.e., when merging of some strings is prohibited.
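As a rough illustration of what "heuristically finding a short common superstring" can look like, here is a minimal sketch of the classic greedy merge heuristic (repeatedly merge the two fragments with the largest overlap). It is not the O(n log n) algorithm of this paper, and it ignores negative examples; the function names `overlap` and `greedy_superstring` are hypothetical and chosen only for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Greedy heuristic: merge the pair with the largest overlap until one string remains.

    The result always contains every fragment as a substring, but it is
    not guaranteed to be a shortest common superstring.
    """
    # Drop exact duplicates, then fragments contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Find the ordered pair (i, j) with the largest suffix/prefix overlap.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the chosen pair, writing the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ACGTT", "GTTGC", "TGCA"]
    print(greedy_superstring(reads))
```

On the three reads above, the heuristic first merges `ACGTT` with `GTTGC` (overlap 3) and then appends `TGCA` (overlap 3), giving `ACGTTGCA`. The quadratic pair search is kept for clarity rather than speed.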
References
D. Angluin and P. D. Laird. Learning from noisy examples. Machine Learning 2(4), 343–370, 1988.
A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. To appear in Journal of the ACM; also presented at 23rd ACM Symp. on Theory of Computing, New Orleans, 1991.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters 24, 377–380, 1987.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and Vapnik-Chervonenkis dimension. Journal of the ACM 35(4), 1989.
V. Chvátal. A greedy heuristic for the set covering problem. Mathematics of Operations Research 4(3), 1979.
R. Drmanac and C. Crkvenjakov. Sequencing by hybridization (SBH) with oligonucleotide probes as an integral approach for the analysis of complex genomes. International Journal of Genomic Research 1(1), 59–79, 1992.
D. Freifelder. Molecular Biology. Jones & Bartlett, 1983.
P. Friedland and L. Kedes. Discovering the secrets of DNA. Communications of the ACM 28(11), 1164–1186, 1985.
J. Gallant, D. Maier, and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences 20, 50–58, 1980.
M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.
D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's model. Artificial Intelligence 36(2), 177–221, 1988.
D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed classes. Proc. 2nd Workshop on Computational Learning Theory, pp. 41–56, 1989.
T. Jiang and M. Li. On the complexity of learning strings and sequences. Theoretical Computer Science 119, 363–371, 1993.
T. Jiang and M. Li. Approximating shortest superstrings with constraints. To appear in Theoretical Computer Science.
R. Karp. Mapping the genome: some combinatorial problems arising in molecular biology. Proc. 23rd ACM Symp. on Theory of Computing, pp. 278–285, 1993.
M. Kearns. The computational complexity of machine learning. Ph.D. Thesis, Report TR-13-89, Harvard University, 1989.
M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing 22(4), 807–837, 1993.
M. Kearns, M. Li, L. Pitt, and L. G. Valiant. On the learnability of Boolean formulae. Proc. 19th ACM Symp. on Theory of Computing, pp. 285–295, 1987.
G. Landau and U. Vishkin. Efficient string matching in the presence of errors. Proc. 26th IEEE Symp. on Foundations of Computer Science, pp. 126–136, 1985.
A. Lesk (editor). Computational Molecular Biology, Sources and Methods for Sequence Analysis. Oxford University Press, Oxford, 1988.
M. Li. Towards a DNA sequencing theory. Proc. 31st IEEE Symp. on Foundations of Computer Science, pp. 125–134, 1990.
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag, New York, 1993.
R. Michalski, J. Carbonell, and T. Mitchell. Machine Learning. Morgan Kaufmann, Los Altos, CA, 1983.
H. Peltola, H. Soderlund, J. Tarhio, and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Information Processing 83 (Proc. IFIP Congress, 1983), pp. 53–64.
R. Rivest. Learning decision-lists. Machine Learning 2(3), 229–246, 1987.
L. Smith. The future of DNA sequencing. Science 262, 530–532, 1993.
R. Staden. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research 10(15), 4731–4751, 1982.
J. Storer. Data Compression: Methods and Theory. Computer Science Press, Rockville, MD, 1988.
J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science 57, 131–145, 1988.
J. Turner. Approximation algorithms for the shortest common superstring problem. Information and Computation 83, 1–20, 1989.
L. G. Valiant. A theory of the learnable. Communications of the ACM 27(11), 1134–1142, 1984.
L. G. Valiant. Deductive learning. Philosophical Transactions of the Royal Society of London. Series A 312, 441–446, 1984.
Additional information
Some of the results presented in this paper appeared in the Proceedings of the 31st IEEE Symposium on the Foundations of Computer Science, pp. 125–134, 1990 [21]. The first author was supported in part by NSERC Operating Grant OGP0046613. The second author was supported in part by NSERC Operating Grants OGP0036747 and OGP0046506.
Cite this article
Jiang, T., Li, M. DNA sequencing and string learning. Math. Systems Theory 29, 387–405 (1996). https://doi.org/10.1007/BF01192694
DOI: https://doi.org/10.1007/BF01192694