Compressed indexes for dynamic text collections

Authors:
Ho-Leung Chan (University of Hong Kong)
Wing-Kai Hon (National Tsing Hua University)
Tak-Wah Lam (University of Hong Kong)
Kunihiko Sadakane (Kyushu University)

ACM Transactions on Algorithms, Volume 3, Issue 2, pp. 21–es. https://doi.org/10.1145/1240233.1240244

Published: 01 May 2007

Abstract

Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically.
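As a rough illustration of the kind of pattern matching these static indexes support, the sketch below runs backward search over a Burrows-Wheeler transform, the idea underlying the FM-index of Ferragina and Manzini. It is a toy under stated assumptions, not the structure from either cited paper: the suffix array and occurrence tables are built naively and stored uncompressed, `$` is assumed to be a sentinel absent from the text, and the names `bwt_index` and `backward_search` are hypothetical.

```python
def bwt_index(text):
    """Build a toy, uncompressed FM-index; '$' is an assumed sentinel."""
    assert "$" not in text
    s = text + "$"
    # Naive suffix array: sort suffix start positions by the suffix itself.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character cyclically preceding each sorted suffix.
    bwt = "".join(s[i - 1] for i in sa)
    alphabet = sorted(set(s))
    # C[c] = number of characters in s strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(s) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(index, pattern):
    """Return the sorted start positions of pattern in the indexed text."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("abracadabra")
print(backward_search(index, "abra"))
```

Each pattern character narrows the row range with two table lookups, so the search loop itself does not depend on the text length. The tables are fixed once built, which is why inserting or deleting text is the hard part.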

This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem, where we show an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted into or deleted from L; also, the index supports searching the occurrences of any pattern P in all texts in L in O(|P| log n + occ log² n) time, where occ is the number of occurrences.
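To make the library management interface concrete, here is a brute-force reference sketch of the three operations named above: insert a text, delete a text, and report every occurrence of a pattern across the collection. It stores the texts verbatim and scans them on each query, so it shows only what the operations mean and has none of the space or time bounds stated in the abstract. The class name `NaiveLibrary` and the integer text identifiers are assumptions made for illustration.

```python
class NaiveLibrary:
    """Brute-force stand-in for a dynamic text collection (illustrative only)."""

    def __init__(self):
        self._texts = {}   # text id -> text
        self._next_id = 0

    def insert(self, text):
        """Add a text to the collection and return its (hypothetical) id."""
        tid = self._next_id
        self._next_id += 1
        self._texts[tid] = text
        return tid

    def delete(self, tid):
        """Remove the text with the given id."""
        del self._texts[tid]

    def search(self, pattern):
        """Return all (text id, offset) occurrences, overlapping ones included."""
        hits = []
        for tid, text in self._texts.items():
            pos = text.find(pattern)
            while pos != -1:
                hits.append((tid, pos))
                pos = text.find(pattern, pos + 1)
        return sorted(hits)


lib = NaiveLibrary()
first = lib.insert("banana")
second = lib.insert("bandana")
print(lib.search("ana"))
lib.delete(first)
print(lib.search("ana"))
```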

Our second result is a compressed solution to the dictionary matching problem, where we show an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log² d) time when a pattern P is inserted into or deleted from D; also, the index supports searching the occurrences of all patterns of D in any text T in O((|T| + occ)log² d) time. When compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by roughly a factor of log d only.
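For readers unfamiliar with dictionary matching, the sketch below shows the classical static approach of Aho and Corasick [1975]: build an automaton over all patterns, then scan the text once. It is not the compressed dynamic index described in this article; the automaton here uses pointer-based dictionaries and must be rebuilt from scratch whenever the pattern set changes. The function names `build_automaton` and `find_all` are hypothetical, and patterns are assumed to be nonempty.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of nonempty patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of patterns recognized at each state
    for idx, p in enumerate(patterns):
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(patterns, text):
    """Return sorted (start offset, pattern) pairs for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return sorted(hits)


print(find_all(["he", "she", "his", "hers"], "ushers"))
```

The scan visits each text character once and follows failure links in amortized constant time per character, which is the behavior a dynamic, space-efficient index has to approximate while also supporting pattern insertions and deletions.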

The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log² n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In the study of the aforementioned result, we also derive the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n/log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.
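The balanced-parentheses result can be pictured with a small sketch. A tree's shape is encoded by writing `(` on entering a node and `)` on leaving it, and navigation reduces to matching queries on that sequence. The code below implements three such operations plus insertion and deletion of a pair using plain linear scans over a Python string, so it shows only the semantics; it makes no attempt at the O(n)-bit space or O(log n/log log n) time per operation stated above. All function names are assumptions chosen for illustration.

```python
def find_close(parens, i):
    """Index of the ')' matching the '(' at position i (linear scan)."""
    assert parens[i] == "("
    depth = 0
    for j in range(i, len(parens)):
        depth += 1 if parens[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def enclose(parens, i):
    """Index of the '(' of the closest pair enclosing position i, or None."""
    depth = 0
    for j in range(i - 1, -1, -1):
        depth += 1 if parens[j] == "(" else -1
        if depth == 1:
            return j
    return None


def insert_pair(parens, i, j):
    """Wrap the balanced span parens[i:j] in a new pair (a new tree node)."""
    return parens[:i] + "(" + parens[i:j] + ")" + parens[j:]


def delete_pair(parens, i):
    """Remove the '(' at position i together with its matching ')'."""
    j = find_close(parens, i)
    return parens[:i] + parens[i + 1:j] + parens[j + 1:]


tree = "(()())"                       # a root with two leaf children
print(find_close(tree, 0))            # matching ')' of the root
print(enclose(tree, 3))               # parent of the second child
wrapped = insert_pair(tree, 1, 5)     # new node adopting both children
print(wrapped)
print(delete_pair(wrapped, 1))        # removing it restores the original
```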

References

Aho, A., and Corasick, M. 1975. Efficient string matching: An aid to bibliographic search. Commun. ACM 18, 6, 333--340.
Amir, A., and Farach, M. 1991. Adaptive dictionary matching. In Proceedings of the Symposium on Foundations of Computer Science. 760--766.
Amir, A., Farach, M., Galil, Z., Giancarlo, R., and Park, K. 1994. Dynamic dictionary matching. J. Comput. Syst. Sci. 49, 2, 208--222.
Amir, A., Farach, M., Idury, R., Poutre, A. L., and Schaffer, A. 1995. Improved dynamic dictionary matching. Inf. Comput. 119, 2, 258--282.
Amir, A., Farach, M., and Matias, Y. 1992. Efficient randomized dictionary matching algorithms (extended abstract). In Proceedings of the Symposium on Combinatorial Pattern Matching. 262--275.
Burrows, M., and Wheeler, D. J. 1994. A block-sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, California.
Elias, P. 1975. Universal codeword sets and representation of the integers. IEEE Trans. Inf. Theory 21, 2, 194--203.
Ferragina, P., and Manzini, G. 2000. Opportunistic data structures with applications. In Proceedings of the Symposium on Foundations of Computer Science. 390--398.
Ferragina, P., and Manzini, G. 2005. Indexing compressed text. J. ACM 52, 4, 552--581.
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the International Symposium on String Processing and Information Retrieval. 150--160.
Fredman, M. L., and Saks, M. E. 1989. The cell probe complexity of dynamic data structures. In Proceedings of the Symposium on Theory of Computing. 345--354.
Grossi, R., Gupta, A., and Vitter, J. S. 2003. High-order entropy-compressed text indexes. In Proceedings of the Symposium on Discrete Algorithms. 841--850.
Grossi, R., Gupta, A., and Vitter, J. S. 2004. When indexing equals compression: Experiments with compressing suffix arrays and applications. In Proceedings of the Symposium on Discrete Algorithms. 636--645.
Grossi, R., and Vitter, J. S. 2000. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the Symposium on Theory of Computing. 397--406.
Grossi, R., and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.
Hon, W. K., Lam, T. W., Sung, W. K., Tse, W. L., Wong, C. K., and Yiu, S. M. 2004. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the Workshop on Algorithm Engineering and Experiments. 31--38.
Jacobson, G. 1989. Space-Efficient static trees and graphs. In Proceedings of the Symposium on Foundations of Computer Science. 549--554.
Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Experien. 29, 1149--1171.
Lam, T. W., Sadakane, K., Sung, W. K., and Yiu, S. M. 2002. A space and time efficient algorithm for constructing compressed suffix arrays. In Proceedings of the International Conference on Computing and Combinatorics. 401--410.
Mäkinen, V., and Navarro, G. 2004. Run-Length FM-index. In Proceedings of the DIMACS Workshop: The Burrows-Wheeler Transform: Ten Years Later. 17--19.
Manber, U., and Myers, G. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.
McCreight, E. M. 1976. A space-economical suffix tree construction algorithm. J. ACM 23, 2, 262--272.
Mewes, H. W., and Heumann, K. 1995. Genome analysis: Pattern search in biological macromolecules. In Proceedings of the Symposium on Combinatorial Pattern Matching. 261--285.
Munro, J. I., and Raman, V. 2001. Succinct representation of balanced parentheses and static trees. SIAM J. Comput. 31, 3, 762--776.
Overmars, M. H. 1983. The design of dynamic data structures. Lecture Notes in Computer Science, vol. 156.
Raman, R., Raman, V., and Rao, S. S. 2001. Succinct dynamic data structures. In Proceedings of the Workshop on Algorithms and Data Structures. 426--437.
Sadakane, K. 2000. Compressed text databases with efficient query algorithms based on compressed suffix array. In Proceedings of the International Symposium on Algorithms and Computation. 410--421.
Sadakane, K. 2002. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings of the Symposium on Discrete Algorithms. 225--232.
Sadakane, K. 2007. Compressed suffix trees with full functionality. Theor. Comput. Syst. to appear.
Sahinalp, S. C., and Vishkin, U. 1996. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proceedings of the Symposium on Foundations of Computer Science. 320--328.
Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings of the Symposium on Switching and Automata Theory. 1--11.
Yao, A. C. 1981. Should tables be sorted? J. ACM 28, 3, 615--628.

Published in

ACM Transactions on Algorithms, Volume 3, Issue 2
May 2007
338 pages
ISSN: 1549-6325
EISSN: 1549-6333
DOI: 10.1145/1240233

Copyright © 2007 ACM

Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
Compressed suffix tree
string matching