ABSTRACT
Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
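As a minimal illustration of the front coding [24] mentioned above, the following Python sketch encodes a sorted list of strings as pairs (length of the prefix shared with the previous string, remaining suffix). It shows only the textbook scheme, not the linearizations or the dictionary encoding proposed in the paper; the function names `lcp_len`, `front_encode`, and `front_decode` are hypothetical and chosen for this example.

```python
from typing import List, Tuple


def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings: List[str]) -> List[Tuple[int, str]]:
    """Front-code a lexicographically sorted list of strings.

    Each string is stored as (k, suffix), where k is the number of
    leading characters it shares with the previous string.
    """
    encoded = []
    prev = ""
    for s in sorted_strings:
        k = lcp_len(prev, s)
        encoded.append((k, s[k:]))
        prev = s
    return encoded


def front_decode(pairs: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original strings by scanning the pairs left to right."""
    decoded = []
    prev = ""
    for k, suffix in pairs:
        s = prev[:k] + suffix
        decoded.append(s)
        prev = s
    return decoded


words = ["car", "card", "care", "cat"]
encoded = front_encode(words)
assert front_decode(encoded) == words
```

Decoding scans the sequence in order, which is what makes the layout cache-friendly, but reaching an arbitrary string requires decoding from the start of its block.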
REFERENCES
- R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1):11--26, 1977.
- M. Bender, M. Farach-Colton, and B. Kuszmaul. Cache-oblivious string B-trees. In Proc. ACM PODS, 233--242, 2006.
- D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 43:275--292, 2005.
- J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. ACM-SIAM SODA, 360--369, 1996.
- G. S. Brodal and R. Fagerberg. Cache-oblivious string dictionaries. In Proc. ACM-SIAM SODA, 581--590, 2006.
- V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan. A data structure for a sequence of string accesses in external memory. ACM Transactions on Algorithms, 3(1), 2007.
- P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236--280, 1999.
- P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. IEEE FOCS, 184--193, 2005.
- P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In Proc. WWW, 751--760, 2006.
- P. Ferragina and R. Venturini. Compressed permuterm index. In Proc. ACM SIGIR, 535--542, 2007.
- M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE FOCS, 285--298, 1999.
- A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao. On the size of succinct indices. In Proc. ESA, LNCS 4698, 371--382, 2007.
- M. He, J. I. Munro, and S. S. Rao. Succinct ordinal trees based on tree covering. In Proc. ICALP, LNCS 4596, 509--520, 2007.
- G. Jacobson. Space-efficient static trees and graphs. In Proc. IEEE FOCS, 549--554, 1989.
- J. Jansson, K. Sadakane, and W. Sung. Ultra-succinct representation of ordered trees. In Proc. ACM-SIAM SODA, 575--584, 2007.
- D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, second edition, 1998.
- P. Ko and S. Aluru. Optimal self-adjusting trees for dynamic string data in secondary storage. In Proc. SPIRE, LNCS 4726, 184--194, 2007.
- G. Manku, A. Jain, and A.-D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, 141--150, 2007.
- K. Mehlhorn and A. K. Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), 301--342, 1990.
- J. I. Munro. Succinct data structures. Electr. Notes Theor. Comput. Sci., 91(3), 2004.
- G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
- R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. ACM-SIAM SODA, 233--242, 2002.
- F. Ruskey. Combinatorial Generation, 2007. In preparation.
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.
Index Terms
- On searching compressed string collections cache-obliviously