On searching compressed string collections cache-obliviously

Authors:
Paolo Ferragina, Università di Pisa, Pisa, Italy
Roberto Grossi, Università di Pisa, Pisa, Italy
Ankur Gupta, Butler University, Indianapolis, IN, USA
Rahul Shah, Louisiana State University, Baton Rouge, LA, USA
Jeffrey Scott Vitter, Purdue University, West Lafayette, IN, USA

PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 2008, Pages 181–190. https://doi.org/10.1145/1376916.1376943

Published: 09 June 2008

ABSTRACT

Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
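For readers unfamiliar with front coding, the following is a minimal sketch of the textbook technique on a sorted string list, not the linearizations or the dictionary encoding proposed in the paper. Each string is stored as the length of the prefix it shares with its predecessor plus the remaining suffix. The function names `lcp_len`, `front_encode`, and `front_decode` and the sample word list are illustrative assumptions.

```python
from typing import List, Tuple


def lcp_len(a: str, b: str) -> int:
    """Return the length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def front_encode(sorted_strings: List[str]) -> List[Tuple[int, str]]:
    """Encode each string as (prefix length shared with predecessor, suffix).

    Assumes the input is already in lexicographic order (hypothetical helper,
    not an API from the paper).
    """
    encoded = []
    prev = ""
    for s in sorted_strings:
        k = lcp_len(prev, s)
        encoded.append((k, s[k:]))
        prev = s
    return encoded


def front_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original strings by scanning the codes from the start."""
    decoded = []
    prev = ""
    for k, suffix in encoded:
        s = prev[:k] + suffix
        decoded.append(s)
        prev = s
    return decoded


# Illustrative data only.
words = ["cache", "cached", "caching", "trie", "tries"]
codes = front_encode(words)
restored = front_decode(codes)
```

In this sketch, recovering a string means decoding every predecessor first. Practical variants therefore tend to store a full string at regular intervals so that a lookup touches only one bounded block.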

References

[1] R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1):11–26, 1977.
[2] M. Bender, M. Farach-Colton, and B. Kuszmaul. Cache-oblivious string B-trees. In Proc. ACM PODS, 233–242, 2006.
[3] D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 43:275–292, 2005.
[4] J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. ACM-SIAM SODA, 360–369, 1996.
[5] G. S. Brodal and R. Fagerberg. Cache-oblivious string dictionaries. In Proc. ACM-SIAM SODA, 581–590, 2006.
[6] V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan. A data structure for a sequence of string accesses in external memory. ACM Transactions on Algorithms, 3(1), 2007.
[7] P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236–280, 1999.
[8] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. IEEE FOCS, 184–193, 2005.
[9] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In Proc. WWW, 751–760, 2006.
[10] P. Ferragina and R. Venturini. Compressed permuterm index. In Proc. ACM SIGIR, 535–542, 2007.
[11] M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE FOCS, 285–298, 1999.
[12] A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao. On the size of succinct indices. In Proc. ESA, LNCS 4698, 371–382, 2007.
[13] M. He, J. I. Munro, and S. S. Rao. Succinct ordinal trees based on tree covering. In Proc. ICALP, LNCS 4596, 509–520, 2007.
[14] G. Jacobson. Space-efficient static trees and graphs. In Proc. IEEE FOCS, 549–554, 1989.
[15] J. Jansson, K. Sadakane, and W. Sung. Ultra-succinct representation of ordered trees. In Proc. ACM-SIAM SODA, 575–584, 2007.
[16] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, second edition, 1998.
[17] P. Ko and S. Aluru. Optimal self-adjusting trees for dynamic string data in secondary storage. In Proc. SPIRE, LNCS 4726, 184–194, 2007.
[18] G. Manku, A. Jain, and A.-D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, 141–150, 2007.
[19] K. Mehlhorn and A. K. Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), 301–342, 1990.
[20] J. I. Munro. Succinct data structures. Electr. Notes Theor. Comput. Sci., 91(3), 2004.
[21] G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
[22] R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. ACM-SIAM SODA, 233–242, 2002.
[23] F. Ruskey. Combinatorial Generation, 2007. In preparation.
[24] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.

Index Terms

1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        1. Data compression
  2. Information storage systems
    1. Record storage systems

Published in
PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2008, 330 pages
ISBN: 9781605581521
DOI: 10.1145/1376916
General Chair: Phokion Kolaitis, IBM Almaden Research Center, USA
Program Chair: Maurizio Lenzerini, SAPIENZA University of Rome, Italy
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
b-tree
cache efficiency
data compression
front coding
string searching
Qualifiers
- research-article
Acceptance Rates
PODS '08 Paper Acceptance Rate: 28 of 159 submissions, 18%
Overall Acceptance Rate: 642 of 2,707 submissions, 24%
Article Metrics
- Total Citations: 30
- Total Downloads: 657
- Downloads (Last 12 months): 13
- Downloads (Last 6 weeks): 1