Compressed indexes for dynamic text collections

Published: 01 May 2007

Abstract

Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically.

This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem: we give an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted into or deleted from L. The index also supports searching for the occurrences of any pattern P in all texts in L in O(|P| log n + occ log² n) time, where occ is the number of occurrences.
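The interface of the library management problem can be made concrete with a small uncompressed baseline. This is only a sketch of the operations (insert a text, delete a text, report all occurrences of a pattern), not the article's index: the class name `NaiveLibrary` and its method names are hypothetical, texts are stored verbatim, and searching scans every text, so neither the O(n)-bit space bound nor the stated time bounds are met.

```python
class NaiveLibrary:
    """Uncompressed baseline for the library management interface.

    All names here are illustrative assumptions, not taken from the article.
    """

    def __init__(self):
        self._texts = {}    # text id -> text, stored verbatim
        self._next_id = 0

    def insert(self, text):
        """Add a text to the collection and return its id."""
        text_id = self._next_id
        self._texts[text_id] = text
        self._next_id += 1
        return text_id

    def delete(self, text_id):
        """Remove a previously inserted text."""
        del self._texts[text_id]

    def search(self, pattern):
        """Return sorted (text_id, offset) pairs for every occurrence."""
        if not pattern:
            raise ValueError("pattern must be non-empty")
        hits = []
        for text_id, text in self._texts.items():
            pos = text.find(pattern)
            while pos != -1:
                hits.append((text_id, pos))
                pos = text.find(pattern, pos + 1)  # step by 1: overlaps count
        return sorted(hits)
```

A compressed index would keep the same three operations while replacing the verbatim storage and the linear scan.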

Our second result is a compressed solution to the dictionary matching problem: we give an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log² d) time when a pattern P is inserted into or deleted from D. The index also supports searching for the occurrences of all patterns of D in any text T in O((|T| + occ) log² d) time. Compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by only roughly a factor of log d.
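Dictionary matching reverses the roles of the previous problem: the pattern set changes over time and each query is a text. The following sketch illustrates that interface with a plain trie of nested dictionaries. It is an assumption-laden baseline, not the article's structure: `TrieDictionary` and its methods are hypothetical names, the trie is uncompressed, and a query restarts from every text position, so its cost grows with the longest pattern rather than meeting the bounds above.

```python
class TrieDictionary:
    """Plain-trie baseline for dynamic dictionary matching (illustrative only)."""

    _END = None  # key marking "a pattern ends at this node"

    def __init__(self):
        self._root = {}

    def insert(self, pattern):
        """Add a pattern to the dictionary."""
        if not pattern:
            raise ValueError("pattern must be non-empty")
        node = self._root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def delete(self, pattern):
        """Remove a pattern; return False if it was not present."""
        path = []
        node = self._root
        for ch in pattern:
            if ch not in node:
                return False
            path.append((node, ch))
            node = node[ch]
        if self._END not in node:
            return False
        del node[self._END]
        # Prune branches that no longer lead to any pattern.
        for parent, ch in reversed(path):
            if parent[ch]:
                break
            del parent[ch]
        return True

    def match(self, text):
        """Return (start, pattern) for every occurrence of every pattern."""
        hits = []
        for start in range(len(text)):
            node = self._root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break
                if self._END in node:
                    hits.append((start, text[start:end + 1]))
        return hits
```

Hits are reported in order of start position, and for a fixed start position in order of increasing pattern length.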

The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log² n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In deriving this result, we also obtain the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n / log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.
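A balanced parentheses sequence encodes a tree, with each matching pair standing for one node. The abstract does not list the operations being maintained, so the sketch below assumes a typical set: find the matching parenthesis, and insert or delete a matching pair. The class `NaiveParens` and its method names are hypothetical. Every operation here is a linear scan over a Python list, which only illustrates the semantics and is far from the O(log n / log log n) bound.

```python
def _is_balanced(seq):
    """True if seq is a balanced string of '(' and ')'."""
    depth = 0
    for c in seq:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0


class NaiveParens:
    """Linear-time baseline for dynamic balanced parentheses (illustrative)."""

    def __init__(self, s=""):
        if any(c not in "()" for c in s) or not _is_balanced(s):
            raise ValueError("not a balanced parentheses string")
        self._seq = list(s)

    def __str__(self):
        return "".join(self._seq)

    def find_close(self, i):
        """Index of the ')' matching the '(' at index i."""
        if self._seq[i] != "(":
            raise ValueError("position does not hold '('")
        depth = 0
        for j in range(i, len(self._seq)):
            depth += 1 if self._seq[j] == "(" else -1
            if depth == 0:
                return j

    def find_open(self, j):
        """Index of the '(' matching the ')' at index j."""
        if self._seq[j] != ")":
            raise ValueError("position does not hold ')'")
        depth = 0
        for i in range(j, -1, -1):
            depth += 1 if self._seq[i] == ")" else -1
            if depth == 0:
                return i

    def insert_pair(self, i, j):
        """Wrap the balanced slice seq[i:j] in a new matching pair."""
        if not 0 <= i <= j <= len(self._seq):
            raise IndexError("invalid range")
        if not _is_balanced(self._seq[i:j]):
            raise ValueError("slice to wrap must itself be balanced")
        self._seq.insert(j, ")")  # insert the later position first
        self._seq.insert(i, "(")

    def delete_pair(self, i):
        """Remove the '(' at index i together with its matching ')'."""
        j = self.find_close(i)
        del self._seq[j]
        del self._seq[i]
```

In tree terms, `insert_pair` adds a node that adopts a run of consecutive siblings, and `delete_pair` removes a node and hands its children to its parent.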

References

  1. Aho, A., and Corasick, M. 1975. Efficient string matching: An aid to bibliographic search. Commun. ACM 18, 6, 333--340.
  2. Amir, A., and Farach, M. 1991. Adaptive dictionary matching. In Proceedings of the Symposium on Foundations of Computer Science. 760--766.
  3. Amir, A., Farach, M., Galil, Z., Giancarlo, R., and Park, K. 1994. Dynamic dictionary matching. J. Comput. Syst. Sci. 49, 2, 208--222.
  4. Amir, A., Farach, M., Idury, R., Poutre, A. L., and Schaffer, A. 1995. Improved dynamic dictionary matching. Inf. Comput. 119, 2, 258--282.
  5. Amir, A., Farach, M., and Matias, Y. 1992. Efficient randomized dictionary matching algorithms (extended abstract). In Proceedings of the Symposium on Combinatorial Pattern Matching. 262--275.
  6. Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California.
  7. Elias, P. 1975. Universal codeword sets and representation of the integers. IEEE Trans. Inf. Theory 21, 2, 194--203.
  8. Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the Symposium on Foundations of Computer Science. 390--398.
  9. Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581.
  10. Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the International Symposium on String Processing and Information Retrieval. 150--160.
  11. Fredman, M. L., and Saks, M. E. 1989. The cell probe complexity of dynamic data structures. In Proceedings of the Symposium on Theory of Computing. 345--354.
  12. Grossi, R., Gupta, A., and Vitter, J. S. 2003. High-order entropy-compressed text indexes. In Proceedings of the Symposium on Discrete Algorithms. 841--850.
  13. Grossi, R., Gupta, A., and Vitter, J. S. 2004. When indexing equals compression: Experiments with compressing suffix arrays and applications. In Proceedings of the Symposium on Discrete Algorithms. 636--645.
  14. Grossi, R., and Vitter, J. S. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the Symposium on Theory of Computing. 397--406.
  15. Grossi, R., and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.
  16. Hon, W. K., Lam, T. W., Sung, W. K., Tse, W. L., Wong, C. K., and Yiu, S. M. 2004. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the Workshop on Algorithm Engineering and Experiments. 31--38.
  17. Jacobson, G. 1989. Space-efficient static trees and graphs. In Proceedings of the Symposium on Foundations of Computer Science. 549--554.
  18. Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Experien. 29, 1149--1171.
  19. Lam, T. W., Sadakane, K., Sung, W. K., and Yiu, S. M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings of the International Conference on Computing and Combinatorics. 401--410.
  20. Mäkinen, V., and Navarro, G. 2004. Run-Length FM-index. In Proceedings of the DIMACS Workshop: The Burrows-Wheeler Transform: Ten Years Later. 17--19.
  21. Manber, U., and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.
  22. McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272.
  23. Mewes, H. W., and Heumann, K. 1995. Genome analysis: Pattern search in biological macromolecules. In Proceedings of the Symposium on Combinatorial Pattern Matching. 261--285.
  24. Munro, J. I., and Raman, V. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 3, 762--776.
  25. Overmars, M. H. 1983. The design of dynamic data structures. Lecture Notes in Computer Science, vol. 156.
  26. Raman, R., Raman, V., and Rao, S. S. 2001. Succinct dynamic data structures. In Proceedings of the Workshop on Algorithms and Data Structures. 426--437.
  27. Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on compressed suffix array. In Proceedings of the International Symposium on Algorithms and Computation. 410--421.
  28. Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the Symposium on Discrete Algorithms. 225--232.
  29. Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comput. Syst. to appear.
  30. Sahinalp, S. C., and Vishkin, U. 1996. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proceedings of the Symposium on Foundations of Computer Science. 320--328.
  31. Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings of the Symposium on Switching and Automata Theory. 1--11.
  32. Yao, A. C. 1981. Should tables be sorted? J. ACM 28, 3, 615--628.

Published in

ACM Transactions on Algorithms, Volume 3, Issue 2
May 2007
338 pages
ISSN: 1549-6325
EISSN: 1549-6333
DOI: 10.1145/1240233

Copyright © 2007 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
