Geometric BWT: Compressed Text Indexing via Sparse Suffixes and Range Searching

Algorithmica

Abstract

We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry. We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show strong equivalence between data structural problems in geometric range searching and text pattern matching. This allows us to apply the lower bounds known in the field of orthogonal range searching to the problems in compressed text indexing. In addition, we give the first succinct (compact) index for I/O-efficient pattern matching in external memory, and show how this index can be further improved to achieve higher-order entropy compressed space.
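The abstract does not spell out the construction, so the following is only a toy sketch of the general idea of mapping sparse suffixes to 2-D points and answering pattern queries with range conditions. It assumes a block size `d`, pads the text with `$` as in Note 4, and uses hypothetical helper names (`prefix_range`, `build_points`, `find`); a naive scan stands in for a real orthogonal range-searching structure, and the pattern is assumed not to contain `$`.

```python
from bisect import bisect_left, bisect_right

def prefix_range(sorted_keys, q):
    """Half-open index range of the keys in sorted_keys that start with q."""
    trunc = [s[:len(q)] for s in sorted_keys]  # truncation preserves sorted order
    return bisect_left(trunc, q), bisect_right(trunc, q)

def build_points(text, d, pad="$"):
    """Map each sparse suffix (one per block of d characters) to a point."""
    if len(text) % d:                          # pad to a multiple of d (cf. Note 4)
        text += pad * (d - len(text) % d)
    starts = sorted(range(0, len(text), d), key=lambda i: text[i:])
    suffixes = [text[i:] for i in starts]      # y-axis: sparse suffixes in sorted order
    # One point per sparse suffix: (reversed preceding block, suffix rank, start position).
    points = [(text[max(i - d, 0):i][::-1], y, i) for y, i in enumerate(starts)]
    return suffixes, points

def find(pattern, suffixes, points, d):
    """Report occurrences of pattern that cover at least one block boundary."""
    hits = set()
    for k in range(min(d, len(pattern))):      # k pattern characters lie in the preceding block
        head_rev, tail = pattern[:k][::-1], pattern[k:]
        y_lo, y_hi = prefix_range(suffixes, tail)
        # Naive scan in place of a 2-D range query: the x-condition (reversed block
        # starts with head_rev) is also a contiguous range in sorted x-order.
        for x, y, i in points:
            if y_lo <= y < y_hi and x.startswith(head_rev):
                hits.add(i - k)
    return sorted(hits)

suffixes, points = build_points("abracadabra", 3)
print(find("abra", suffixes, points, 3))       # [0, 7]
```

Any pattern of length at least `d` must cover a block boundary, so this sketch reports all of its occurrences; occurrences of shorter patterns that fall strictly inside one block are missed and would need separate handling.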

Notes

  1. Strictly, the optimal first term would be \(|P|/(B\log_{|\Sigma|} n)\), because the block size is measured in words, the word size is assumed to be \(\log n\) bits, and each character of the pattern takes \(\log|\Sigma|\) bits.

  2. The notation \(\tilde{O}\) ignores poly-logarithmic factors. Precisely, \(\tilde {O}(f(n)) \equiv O(f(n)\log^{O(1)} n)\).

  3. For simplicity, we assume that \(n\) is a power of \(B\), so that \(\log_B n\) is an integer. Otherwise, we simply consider the range of values in \(A\) as \([1, n']\), where \(n' = B^{\lceil\log_{B} n \rceil}\), so that both the space and query bounds in our proposed scheme follow.

  4. For simplicity, we assume n is a multiple of d. Otherwise, T is first padded with enough special character $ at the end to make the length a multiple of d.

  5. For simplicity, we assume that d is an integer. If not, we can slightly modify the data structures without affecting the overall complexity.

  6. Without loss of generality, we assume here that \(|\Sigma| < \sqrt{n}\). The parameters can be appropriately adjusted for the more general case when \(|\Sigma| = O(n^{1-\epsilon})\) for any fixed \(\epsilon > 0\).

  7. Here, we make a slight modification: one extra bit is spent for each meta-character, such that if our \(k\)th-order encoding of the next \(o(\log_{|\Sigma|} n)\) characters already exceeds \(0.5\log n\), we shall instead encode the next \(0.5\log_{|\Sigma|} n\) characters (i.e., more characters) in plain form. The extra bit indicates whether we use the plain encoding or the \(k\)th-order encoding.

  8. As mentioned, there is also an extra bit of overhead per meta-character; however, we will soon see that the number of meta-characters is \(O((nH_k + o(n\log|\Sigma|))/\log n)\), so this overhead is negligible.

  9. Note that when we switch back to a node in \(\Delta_{sbt}\), we choose the top-most node in \(\Delta_{sbt}\) corresponding to the node \(v\).

  10. Note that choosing a larger \(d\) allows more sparsification, but in such cases it is not possible to design the Four-Russians data structure for small patterns.

References

  1. Agarwal, P.K., Erickson, J.: Geometric range searching and its relatives. Adv. Discret. Comput. Geom. 23, 1–56 (1999)

  2. Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)

  3. Aref, W.G., Ilyas, I.F.: SP-GiST: an extensible database index for supporting space partitioning trees. J. Intell. Inf. Syst. 17(2–3), 215–240 (2001)

  4. Arge, L., Brodal, G.S., Fagerberg, R., Laustsen, M.: Cache-oblivious planar orthogonal range searching and counting. In: Proceedings of Symposium on Computational Geometry, pp. 160–169 (2005)

  5. Arge, L., Samoladas, V., Vitter, J.S.: Two-dimensional indexability and optimal range search indexing. In: Proceedings of Symposium on Principles of Database Systems, pp. 346–357 (1999)

  6. Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. In: Proceedings of Symposium on Combinatorial Pattern Matching, pp. 83–94 (2007)

  7. Baeza-Yates, R., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Inf. Syst. 21(6), 497–514 (1996)

  8. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation, Palo Alto CA, USA (1994)

  9. Chazelle, B.: Lower bounds for orthogonal range searching. I: The reporting case. J. ACM 37, 200–212 (1990)

  10. Clark, D., Munro, I.: Efficient suffix trees on secondary storage. In: Proceedings of Symposium on Discrete Algorithms, pp. 383–391 (1996)

  11. Chien, Y.F., Hon, W.K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler transform: linking range searching and text indexing. In: Proceedings of Data Compression Conference, pp. 252–261 (2008)

  12. Chiu, S.Y., Hon, W.K., Shah, R., Vitter, J.S.: I/O-efficient compressed text indexes: from theory to practice. In: Proceedings of Data Compression Conference, pp. 426–434 (2010)

  13. Ferragina, P., Grossi, R.: The string B-tree: a new data structure for string searching in external memory and its application. J. ACM 46(2), 236–280 (1999)

  14. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

  15. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. In: Proceedings of Symposium on Discrete Algorithms, pp. 690–696 (2007)

  16. Fischer, J., Gagie, T., Kopelowitz, T., Lewenstein, M., Mäkinen, V., Salmela, L., Välimäki, N.N.: Forbidden patterns. In: Proceedings of Latin American Theoretical Informatics, pp. 327–337 (2012)

  17. Gagie, T., Gawrychowski, P.: Linear-space substring range counting over polylogarithmic alphabets. (2012). CoRR. arXiv:1202.3208 [cs.DS]

  18. González, R., Navarro, G.: A compressed text index on secondary memory. In: Proceedings of International Workshop on Combinatorial Algorithms, pp. 80–91 (2007)

  19. Grossi, R., Gupta, A., Vitter, J.S.: High-order entropy-compressed text indexes. In: Proceedings of Symposium on Discrete Algorithms, pp. 841–850 (2003)

  20. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

  21. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of International Conference on Management of Data, pp. 47–57 (1984)

  22. Hellerstein, J.M., Naughton, J.F., Pfeffer, A.: Generalized search trees for database systems. In: Proceedings of International Conference on Very Large Data Bases, pp. 562–573 (1995)

  23. Hon, W.K., Lam, T.W., Shah, R., Lung, S.L., Vitter, J.S.: Succinct index for dynamic dictionary matching. In: Proceedings of Symposium on Algorithms and Computation, pp. 1034–1043 (2009)

  24. Hon, W.K., Lam, T.W., Shah, R., Lung, S.L., Vitter, J.S.: Compressed index for dictionary matching. In: Proceedings of Data Compression Conference, pp. 23–32 (2008)

  25. Hon, W.K., Shah, R., Vitter, J.S.: Ordered pattern matching: towards full-text retrieval. Technical report TR-06-008, Purdue University (2006)

  26. Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: On entropy-compressed text indexing in external memory. In: Proceedings of International Symposium on String Processing and Information Retrieval, pp. 75–89 (2009)

  27. Hon, W.K., Ku, T.H., Shah, R., Thankachan, S.V., Vitter, J.S.: Compressed text indexing with wildcards. In: Proceedings of International Symposium on String Processing and Information Retrieval, pp. 267–277 (2011)

  28. Hon, W.K., Ku, T.H., Shah, R., Thankachan, S.V., Vitter, J.S.: Compressed dictionary matching with one error. In: Proceedings of Data Compression Conference, pp. 113–122 (2011)

  29. Hon, W.K., Shah, R., Vitter, J.S.: Compression, indexing, and retrieval for massive string data. In: Proceedings of Symposium on Combinatorial Pattern Matching, pp. 260–274 (2010)

  30. Jacobson, G.: Space-efficient static trees and graphs. In: Proceedings of Symposium on Foundations of Computer Science, pp. 549–554 (1989)

  31. Kanth, K.V.R., Singh, A.K.: Optimal dynamic range searching in non-replicating index structures. In: Proceedings of International Conference on Database Theory, pp. 257–276 (1999)

  32. Kärkkäinen, J., Ukkonen, E.: Sparse suffix trees. In: Proceedings of International Conference on Computing and Combinatorics, pp. 219–230 (1996)

  33. Kolpakov, R., Kucherov, G., Starikovskaya, T.A.: Pattern matching on sparse suffix trees. In: International Conference on Data Compression, Communications and Processing (2011). doi:10.1109/CCP.2011.45

  34. Mäkinen, V., Navarro, G.: Compressed full-text indexes. ACM Comput. Surv. 39(1) (2007)

  35. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. Technical report TR/DCC-2006-10, University of Chile (2006)

  36. Mäkinen, V., Navarro, G.: Position-restricted substring searching. In: Proceedings of Latin American Theoretical Informatics Symposium, pp. 703–714 (2006)

  37. Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching-efficient secondary memory and distributed implementation of compressed suffix arrays. In: Proceedings of Symposium on Algorithms and Computation, pp. 681–692 (2004)

  38. Manber, U., Myers, G.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  39. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

  40. Munro, J.I.: Tables. In: Proceedings of Conference on Foundations of Software Technology and Theoretical Computer Science, pp. 37–42 (1996)

  41. Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully compressed suffix trees. ACM Trans. Algorithms 7(4), 53 (2011)

  42. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)

  43. Samet, H.: The quadtree and related hierarchical data structures. ACM Comput. Surv. 16(2), 187–260 (1984)

  44. Subramanian, S., Ramaswamy, S.: The P-range tree: a new data structure for range searching in secondary memory. In: Proceedings of Symposium on Discrete Algorithms, pp. 378–387 (1995)

  45. Thankachan, S.V.: Compressed indexes for aligned pattern matching. In: Proceedings of International Symposium on String Processing and Information Retrieval, pp. 410–419 (2011)

  46. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of Symposium on Switching and Automata Theory, pp. 1–11 (1973)

  47. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space Θ(N). Inf. Process. Lett. 17(2), 81–84 (1983)

  48. Yu, C.C., Hon, W.K., Wang, B.F.: Efficient data structures for orthogonal range successor problem. In: Proceedings of International Computing and Combinatorics Conference, pp. 96–105 (2009)

Author information

Corresponding author

Correspondence to Rahul Shah.

Additional information

This work is supported in part by Taiwan NSC Grant 99-2221-E-007-123 (W.-K. Hon) and US NSF Grant CCF-1017623 (R. Shah).

Early parts of this work appeared in DCC 2008 [11] and SPIRE 2009 [26].

About this article

Cite this article

Chien, YF., Hon, WK., Shah, R. et al. Geometric BWT: Compressed Text Indexing via Sparse Suffixes and Range Searching. Algorithmica 71, 258–278 (2015). https://doi.org/10.1007/s00453-013-9792-1

  • DOI: https://doi.org/10.1007/s00453-013-9792-1
