
Linear approximation of shortest superstrings

Published: 01 July 1994

Abstract

We consider the following problem: given a collection of strings s_1, …, s_m, find the shortest string s such that each s_i appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of (distinct) strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result.

We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the superstring problem to be MAXSNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.
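
The greedy procedure described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's own presentation: the function names `overlap` and `greedy_superstring` are hypothetical, the quadratic pair search is chosen for readability rather than speed, ties are broken by scan order, and the preprocessing step that discards strings contained in other strings is an assumption added here so that every merge involves two distinct, substring-free strings.

```python
def overlap(a, b):
    """Length of the longest proper-or-full suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with maximum overlap until one string remains."""
    # Assumption (not stated in the abstract): drop duplicates and any
    # string that already occurs inside another input string.
    uniq = list(dict.fromkeys(strings))
    pool = [s for s in uniq if not any(s != t and s in t for t in uniq)]

    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Scan all ordered pairs of distinct strings for the largest overlap.
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge: keep the first string whole, append the non-overlapping tail.
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool)
                if idx not in (best_i, best_j)] + [merged]

    return pool[0] if pool else ""


print(greedy_superstring(["abc", "bcd", "cde"]))
```

On the input `["abc", "bcd", "cde"]` the sketch first merges `abc` and `bcd` (overlap 2) and then merges the result with `cde`, giving `abcde`. The paper's 4n bound concerns the length of the string such a procedure returns relative to the optimal length n; the sketch does not implement the modified 3n variant.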



      Reviews

      Ralph Walter Wilkerson

      A greedy algorithm to solve the following problem is studied: for a finite set of strings, find the shortest common superstring S such that every string in the set is a substring of S. This problem occurs in data compression and most recently has received considerable attention due to its relevance in the DNA sequencing problem. While the problem is known to be NP-hard, the paper shows that the greedy algorithm produces a superstring of length O(n), where n is the length of the optimal superstring. In fact, an upper bound of 4n is achieved. A modified version of this algorithm can achieve 3n as an upper bound. The paper includes all basic notation and detailed proofs concerning the correctness of the algorithms.
