Abstract
We consider the problem of pattern search in compressed text in a context in which: (a) the text is stored as a sequence of factors against a static phrase-book; (b) factors are decoded from right to left; and (c) extraction of each symbol in each factor requires Θ(log σ) time, where σ is the size of the original alphabet. To determine possible alignments given information about decoded characters, we introduce two Boyer-Moore-like searching mechanisms, including one that makes use of a suffix array constructed over the pattern. The new mechanisms decode fewer than half the symbols that are required by a sequential left-to-right search such as the Knuth-Morris-Pratt approach, a saving that translates directly into improved execution time. Experiments with a two-level suffix array index structure for 4 GB of English text demonstrate the usefulness of the new techniques.
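To illustrate why right-to-left comparison can avoid touching many symbols, the following minimal sketch implements the classic Boyer-Moore-Horspool search over plain, uncompressed text and counts how many text symbols it accesses. It is not the mechanism introduced in the paper: the function name `horspool_search` and the access counter are assumptions made for illustration, and each access merely stands in for the Θ(log σ) cost of extracting one symbol from a factor.

```python
def horspool_search(text, pattern):
    """Return (match positions, number of text symbol accesses).

    Illustrative Boyer-Moore-Horspool search over uncompressed text;
    a stand-in for the idea of skipping symbols, not the paper's method.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0

    # Shift table: for each symbol in pattern[:-1], the distance from its
    # last occurrence to the final pattern position.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}

    matches = []
    accesses = 0   # each access models one (costly) symbol extraction
    pos = 0        # text position currently aligned with pattern[0]
    while pos <= n - m:
        j = m - 1
        # Compare right to left, counting every text symbol touched.
        while j >= 0:
            accesses += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            matches.append(pos)
        # Shift according to the text symbol under the last pattern position;
        # symbols absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches, accesses


if __name__ == "__main__":
    positions, touched = horspool_search("abracadabra", "abra")
    print(positions, touched)
```

On this small input the search finds both occurrences while accessing fewer symbols than the text contains, whereas a left-to-right scan reads every symbol at least once. The saving grows when mismatches at the right-hand end of an alignment allow long shifts.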
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gog, S., Moffat, A., Petri, M. (2014). Strategic Pattern Search in Factor-Compressed Text. In: Moura, E., Crochemore, M. (eds) String Processing and Information Retrieval. SPIRE 2014. Lecture Notes in Computer Science, vol 8799. Springer, Cham. https://doi.org/10.1007/978-3-319-11918-2_1
DOI: https://doi.org/10.1007/978-3-319-11918-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11917-5
Online ISBN: 978-3-319-11918-2
eBook Packages: Computer Science, Computer Science (R0)