Abstract
We consider the following problem: given a collection of strings s1,…, sm, find the shortest string s such that each si appears as a substring (a consecutive block) of s. Although this problem is known to be NP-hard, a simple greedy procedure appears to do quite well and is routinely used in DNA sequencing and data compression practice, namely: repeatedly merge the pair of (distinct) strings with maximum overlap until only one string remains. Let n denote the length of the optimal superstring. A common conjecture states that the above greedy procedure produces a superstring of length O(n) (in fact, 2n), yet the only previous nontrivial bound known for any polynomial-time algorithm is a recent O(n log n) result.
We show that the greedy algorithm does in fact achieve a constant factor approximation, proving an upper bound of 4n. Furthermore, we present a simple modified version of the greedy algorithm that we show produces a superstring of length at most 3n. We also show the superstring problem to be MAXSNP-hard, which implies that a polynomial-time approximation scheme for this problem is unlikely.
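The greedy procedure described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names `overlap` and `greedy_superstring` are hypothetical, the quadratic pair scan is chosen for clarity rather than speed, and the preprocessing step that discards duplicates and strings contained in other strings is an assumption the abstract does not state. The sketch covers only the plain greedy algorithm; the modified version with the 3n bound is not specified in the abstract and is not attempted here.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of distinct strings with maximum overlap."""
    # Assumed preprocessing: drop duplicates and any string that already
    # occurs inside another input string.
    uniq = list(dict.fromkeys(strings))
    pool = [s for s in uniq if not any(s != t and s in t for t in uniq)]

    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, left index, right index)
        for i, j in permutations(range(len(pool)), 2):
            k = overlap(pool[i], pool[j])
            if k > best[0]:
                best = (k, i, j)
        k, i, j = best
        # Glue the two strings together, writing the shared part only once.
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)]
        pool.append(merged)

    return pool[0] if pool else ""


if __name__ == "__main__":
    print(greedy_superstring(["abc", "bcd", "cde"]))
```

Ties between pairs with equal overlap are broken by scan order here, so different tie-breaking rules may yield different superstrings of different lengths. The result always contains every input string, but it is not guaranteed to be shortest.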
Index Terms
- Linear approximation of shortest superstrings