DOI: 10.1145/1376916.1376943
research-article

On searching compressed string collections cache-obliviously

Published: 09 June 2008

ABSTRACT

Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
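As background for the front coding mentioned in the abstract, the following is a minimal sketch of the textbook technique: each string in a sorted list is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. It is not the paper's linearization or dictionary encoding; the function names `lcp_length`, `front_encode`, and `front_decode` and the sample words are assumptions made for illustration.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings):
    """Encode a sorted list as (shared_prefix_length, suffix) pairs.

    Hypothetical helper: plain front coding, with no bucketing or
    bit-level packing.
    """
    encoded = []
    previous = ""
    for s in sorted_strings:
        shared = lcp_length(previous, s)
        encoded.append((shared, s[shared:]))
        previous = s
    return encoded


def front_decode(encoded):
    """Rebuild the original strings by scanning the pairs left to right."""
    decoded = []
    previous = ""
    for shared, suffix in encoded:
        current = previous[:shared] + suffix
        decoded.append(current)
        previous = current
    return decoded


# Illustrative input only; any lexicographically sorted list works.
words = sorted(["car", "card", "care", "cart", "cat"])
encoded = front_encode(words)
decoded = front_decode(encoded)
```

In this sketch, decoding one string requires scanning from the start of the list, which is why practical variants usually restart the encoding at regular intervals; that trade-off between space and access cost is the kind of question the abstract refers to.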

References

  1. R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1):11--26, 1977.
  2. M. Bender, M. Farach-Colton, and B. Kuszmaul. Cache-oblivious string B-trees. In Proc. ACM PODS, 233--242, 2006.
  3. D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 43:275--292, 2005.
  4. J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. ACM-SIAM SODA, 360--369, 1996.
  5. G. S. Brodal and R. Fagerberg. Cache-oblivious string dictionaries. In Proc. ACM-SIAM SODA, 581--590, 2006.
  6. V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan. A data structure for a sequence of string accesses in external memory. ACM Transactions on Algorithms, 3(1), 2007.
  7. P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236--280, 1999.
  8. P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. IEEE FOCS, 184--193, 2005.
  9. P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In Proc. WWW, 751--760, 2006.
  10. P. Ferragina and R. Venturini. Compressed permuterm index. In Proc. ACM SIGIR, 535--542, 2007.
  11. M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE FOCS, 285--298, 1999.
  12. A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao. On the size of succinct indices. In Proc. ESA, LNCS 4698, 371--382, 2007.
  13. M. He, J. I. Munro, and S. S. Rao. Succinct ordinal trees based on tree covering. In Proc. ICALP, LNCS 4596, 509--520, 2007.
  14. G. Jacobson. Space-efficient static trees and graphs. In Proc. IEEE FOCS, 549--554, 1989.
  15. J. Jansson, K. Sadakane, and W. Sung. Ultra-succinct representation of ordered trees. In Proc. ACM-SIAM SODA, 575--584, 2007.
  16. D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, second edition, 1998.
  17. P. Ko and S. Aluru. Optimal self-adjusting trees for dynamic string data in secondary storage. In Proc. SPIRE, LNCS 4726, 184--194, 2007.
  18. G. Manku, A. Jain, and A.-D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, 141--150, 2007.
  19. K. Mehlhorn and A. K. Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), 301--342, 1990.
  20. J. I. Munro. Succinct data structures. Electr. Notes Theor. Comput. Sci., 91(3), 2004.
  21. G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
  22. R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. ACM-SIAM SODA, 233--242, 2002.
  23. F. Ruskey. Combinatorial Generation, 2007. In preparation.
  24. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.

      • Published in

        PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
        June 2008
        330 pages
        ISBN:9781605581521
        DOI:10.1145/1376916

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        PODS '08 Paper Acceptance Rate: 28 of 159 submissions, 18%. Overall Acceptance Rate: 642 of 2,707 submissions, 24%.
