Abstract
Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically.
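To make pattern matching on such an index concrete, here is a minimal Python sketch of backward search over the Burrows-Wheeler transform, the mechanism behind the FM-index of Ferragina and Manzini [2000]. It is an illustration only: all tables are stored uncompressed, the suffix array is built naively, and the names `build_bwt_index` and `backward_search` are assumptions made for this example rather than anything defined in the cited work.

```python
def build_bwt_index(text):
    """Build an uncompressed BWT-based index for text (illustrative, not space-efficient)."""
    s = text + "\0"  # sentinel, assumed smaller than every character of text
    # Naive suffix array: start positions of all suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (wraps around for suffix 0).
    bwt = [s[i - 1] for i in sa]
    alphabet = sorted(set(s))
    # C[c] = number of characters in s that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(s) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted start positions of pattern, reading pattern right to left."""
    lo, hi = 0, len(sa)  # current suffix array range [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

For the text `abracadabra`, `backward_search("abra", *build_bwt_index("abracadabra"))` reports positions 0 and 7. Each pattern character costs two table lookups; a compressed index replaces the explicit `occ` table and suffix array with succinct structures.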
This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem: we give an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted into or deleted from L. The index also supports searching for the occurrences of any pattern P in all texts in L in O(|P| log n + occ log^2 n) time, where occ is the number of occurrences.
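The interface of the library management problem can be pinned down with a naive, uncompressed baseline. The Python sketch below only fixes the semantics of the three operations; it stores the texts verbatim and scans them on every query, so it achieves none of the space or time bounds stated above. The class name `NaiveLibrary` and the integer text identifiers are assumptions made for this illustration.

```python
class NaiveLibrary:
    """Reference semantics for a dynamic text collection (no compression, no index)."""

    def __init__(self):
        self.texts = {}    # text id -> text
        self.next_id = 0   # hypothetical id scheme: consecutive integers

    def insert(self, text):
        """Add a text to the collection and return its id."""
        tid = self.next_id
        self.next_id += 1
        self.texts[tid] = text
        return tid

    def delete(self, tid):
        """Remove the text with the given id from the collection."""
        del self.texts[tid]

    def search(self, pattern):
        """Return all (text id, position) pairs where pattern occurs, in sorted order."""
        hits = []
        for tid, text in self.texts.items():
            pos = text.find(pattern)
            while pos != -1:
                hits.append((tid, pos))
                pos = text.find(pattern, pos + 1)  # allow overlapping occurrences
        return sorted(hits)
```

A compressed index has to return the same answers as `search` while keeping only O(n) bits and never rescanning the stored texts.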
Our second result is a compressed solution to the dictionary matching problem: we give an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log^2 d) time when a pattern P is inserted into or deleted from D. The index also supports searching for the occurrences of all patterns of D in any text T in O((|T| + occ) log^2 d) time. Compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compressed solution increases the query time by only a factor of roughly log d.
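The query side of dictionary matching, reporting every occurrence of every pattern of D in a text T, can be illustrated with the classical static automaton of Aho and Corasick [1975]. The Python sketch below is not the compressed dynamic index of this article: it uses pointer-based tables and has to be rebuilt whenever D changes. The function names `build_automaton` and `find_all` are assumptions made for this example, and patterns are assumed to be non-empty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of non-empty patterns."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state of the longest proper suffix in the trie
    out = [[]]    # out[state] -> patterns that end at this state
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    # Breadth-first traversal: failure links of shallower states are final
    # before they are used for deeper states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit patterns ending at the suffix
            queue.append(nxt)
    return goto, fail, out


def find_all(text, automaton):
    """Return sorted (start position, pattern) pairs for all occurrences in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return sorted(hits)
```

With the dictionary `he`, `she`, `his`, `hers` and the text `ushers`, `find_all` reports `she` at position 1 and both `he` and `hers` at position 2.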
The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Specifically, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log^2 n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In obtaining this result, we also derive the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n/log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.
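The abstract does not list the operations supported on the balanced parentheses. As an assumption, the Python sketch below uses the operations common in the succinct-tree literature (matching a parenthesis, finding the enclosing pair, and inserting or deleting a pair) and implements them by linear scans over a plain list. It only fixes the semantics; it uses far more than O(n) bits and does not achieve the O(log n/log log n) bound. The class and method names are hypothetical.

```python
class NaiveParentheses:
    """Linear-time reference semantics for a dynamic balanced parentheses sequence."""

    def __init__(self, seq=""):
        self.seq = list(seq)  # assumed to be balanced

    def find_close(self, i):
        """Index of the ')' matching the '(' at index i."""
        depth = 0
        for j in range(i, len(self.seq)):
            depth += 1 if self.seq[j] == "(" else -1
            if depth == 0:
                return j
        raise ValueError("no matching close parenthesis")

    def find_open(self, j):
        """Index of the '(' matching the ')' at index j."""
        depth = 0
        for i in range(j, -1, -1):
            depth += 1 if self.seq[i] == ")" else -1
            if depth == 0:
                return i
        raise ValueError("no matching open parenthesis")

    def enclose(self, i):
        """Index of the '(' of the closest pair enclosing the pair opened at i, or None."""
        depth = 0
        for k in range(i - 1, -1, -1):
            depth += 1 if self.seq[k] == ")" else -1
            if depth < 0:
                return k
        return None

    def insert_pair(self, i, j):
        """Wrap seq[i:j] in a new pair; the caller must ensure seq[i:j] is balanced."""
        self.seq.insert(j, ")")  # insert the later position first so i stays valid
        self.seq.insert(i, "(")

    def delete_pair(self, i):
        """Delete the '(' at index i together with its matching ')'."""
        j = self.find_close(i)
        del self.seq[j]
        del self.seq[i]
```

In a parentheses encoding of a tree, each pair corresponds to a node, so `enclose` moves to the parent and `insert_pair` or `delete_pair` adds or removes a node.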
- Aho, A., and Corasick, M. 1975. Efficient string matching: An aid to bibliographic search. Commun. ACM 18, 6, 333--340.
- Amir, A., and Farach, M. 1991. Adaptive dictionary matching. In Proceedings of the Symposium on Foundations of Computer Science. 760--766.
- Amir, A., Farach, M., Galil, Z., Giancarlo, R., and Park, K. 1994. Dynamic dictionary matching. J. Comput. Syst. Sci. 49, 2, 208--222.
- Amir, A., Farach, M., Idury, R., Poutre, A. L., and Schaffer, A. 1995. Improved dynamic dictionary matching. Inf. Comput. 119, 2, 258--282.
- Amir, A., Farach, M., and Matias, Y. 1992. Efficient randomized dictionary matching algorithms (extended abstract). In Proceedings of the Symposium on Combinatorial Pattern Matching. 262--275.
- Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California.
- Elias, P. 1975. Universal codeword sets and representation of the integers. IEEE Trans. Inf. Theory 21, 2, 194--203.
- Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the Symposium on Foundations of Computer Science. 390--398.
- Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581.
- Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the International Symposium on String Processing and Information Retrieval. 150--160.
- Fredman, M. L., and Saks, M. E. 1989. The cell probe complexity of dynamic data structures. In Proceedings of the Symposium on Theory of Computing. 345--354.
- Grossi, R., Gupta, A., and Vitter, J. S. 2003. High-order entropy-compressed text indexes. In Proceedings of the Symposium on Discrete Algorithms. 841--850.
- Grossi, R., Gupta, A., and Vitter, J. S. 2004. When indexing equals compression: Experiments with compressing suffix arrays and applications. In Proceedings of the Symposium on Discrete Algorithms. 636--645.
- Grossi, R., and Vitter, J. S. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the Symposium on Theory of Computing. 397--406.
- Grossi, R., and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.
- Hon, W. K., Lam, T. W., Sung, W. K., Tse, W. L., Wong, C. K., and Yiu, S. M. 2004. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the Workshop on Algorithm Engineering and Experiments. 31--38.
- Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the Symposium on Foundations of Computer Science. 549--554.
- Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Experien. 29, 1149--1171.
- Lam, T. W., Sadakane, K., Sung, W. K., and Yiu, S. M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings of the International Conference on Computing and Combinatorics. 401--410.
- Mäkinen, V., and Navarro, G. 2004. Run-length FM-index. In Proceedings of the DIMACS Workshop: The Burrows-Wheeler Transform: Ten Years Later. 17--19.
- Manber, U., and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.
- McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272.
- Mewes, H. W., and Heumann, K. 1995. Genome analysis: Pattern search in biological macromolecules. In Proceedings of the Symposium on Combinatorial Pattern Matching. 261--285.
- Munro, J. I., and Raman, V. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 3, 762--776.
- Overmars, M. H. 1983. The design of dynamic data structures. Lecture Notes in Computer Science, vol. 156.
- Raman, R., Raman, V., and Rao, S. S. 2001. Succinct dynamic data structures. In Proceedings of the Workshop on Algorithms and Data Structures. 426--437.
- Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on compressed suffix array. In Proceedings of the International Symposium on Algorithms and Computation. 410--421.
- Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the Symposium on Discrete Algorithms. 225--232.
- Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comput. Syst. to appear.
- Sahinalp, S. C., and Vishkin, U. 1996. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proceedings of the Symposium on Foundations of Computer Science. 320--328.
- Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings of the Symposium on Switching and Automata Theory. 1--11.
- Yao, A. C. 1981. Should tables be sorted? J. ACM 28, 3, 615--628.
Index Terms
- Compressed indexes for dynamic text collections