Abstract
In this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal, difficult-to-tune tradeoff among these metrics. We trace the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false-positive rates across the LSM-tree’s different levels.
We present Monkey, an LSM-tree-based key-value store that strikes the optimal balance between the costs of updates and lookups for any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false-positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits per element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false-positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of RocksDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50--80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best tradeoff among lookup cost, update cost, and main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.
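The memory allocation the abstract describes admits a closed-form sketch: under the standard Bloom-filter approximation FPR ≈ exp(−(bits/entry) · ln²2), minimizing the sum of per-level false-positive rates subject to a total memory budget (a Lagrangian argument) yields rates proportional to the level sizes, so larger levels receive fewer bits per entry. The Python below is an illustrative sketch of this calculation, not code from Monkey's implementation; the function names and parameters are hypothetical, and it assumes the budget is large enough that every resulting rate is below 1.

```python
import math

LN2_SQ = math.log(2) ** 2  # ln^2(2), from the standard Bloom-filter approximation


def monkey_fprs(level_sizes, total_bits):
    """Per-level false-positive rates minimizing their sum (hypothetical sketch).

    Uses the approximation bits_i = -n_i * ln(p_i) / ln^2(2) and the
    constraint sum(bits_i) = total_bits. The minimizer is p_i = c * n_i,
    i.e., proportional to level size; c follows from the memory constraint.
    Assumes total_bits is large enough that every p_i < 1.
    """
    n_total = sum(level_sizes)
    ln_c = -(total_bits * LN2_SQ
             + sum(n * math.log(n) for n in level_sizes)) / n_total
    return [math.exp(ln_c) * n for n in level_sizes]


def uniform_fprs(level_sizes, total_bits):
    """Baseline: the same bits per entry (hence the same FPR) at every level."""
    bits_per_entry = total_bits / sum(level_sizes)
    p = math.exp(-bits_per_entry * LN2_SQ)
    return [p for _ in level_sizes]
```

For example, with levels of 10^6, 10^7, and 10^8 entries (size ratio 10) and 10 bits per entry overall, the proportional allocation yields a smaller sum of false-positive rates, and hence a lower expected number of wasted I/Os per zero-result lookup, than the uniform baseline, while consuming the same total memory.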
Index Terms
- Optimal Bloom Filters and Adaptive Merging for LSM-Trees