Optimal Bloom Filters and Adaptive Merging for LSM-Trees

Published: 8 December 2018

Abstract

In this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal and difficult-to-tune tradeoff among these metrics. We trace the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false-positive rates across the LSM-tree’s different levels.

We present Monkey, an LSM-tree-based key-value store that strikes the optimal balance between the costs of updates and lookups for any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false-positive rates of the Bloom filters across all levels of the LSM-tree. In contrast to state-of-the-art key-value stores, which assign a fixed number of bits per element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false-positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of RocksDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50-80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best tradeoff among lookup cost, update cost, and main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.
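The allocation idea in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: it assumes the standard Bloom filter approximation FPR = exp(-(bits/entries) · ln(2)^2) with an optimal number of hash functions, and it assumes one filter per level. Under those assumptions, minimizing the sum of false-positive rates for a fixed total number of bits gives each level an FPR proportional to its number of entries (capped at 1, which means no filter at all). The function names, the bisection search, and the example level sizes are all hypothetical choices made for illustration.

```python
import math

LN2_SQ = math.log(2) ** 2


def fpr(bits, entries):
    """Approximate Bloom filter FPR, assuming an optimal number of hash functions."""
    return math.exp(-(bits / entries) * LN2_SQ)


def bits_for(p, entries):
    """Bits a filter over `entries` keys needs to reach FPR p (none if p >= 1)."""
    return 0.0 if p >= 1.0 else -entries * math.log(p) / LN2_SQ


def uniform_fprs(level_sizes, total_bits):
    """Baseline: every level gets the same number of bits per entry."""
    bits_per_entry = total_bits / sum(level_sizes)
    return [fpr(bits_per_entry * n, n) for n in level_sizes]


def optimized_fprs(level_sizes, total_bits, iters=100):
    """Set p_i = min(1, c * n_i) and bisect on log(c) until the budget is spent.

    Assumes every level holds at least one entry and that the budget is
    smaller than what log(c) = -700 would require.
    """
    def bits_used(log_c):
        return sum(-n * min(0.0, log_c + math.log(n)) / LN2_SQ
                   for n in level_sizes)

    lo, hi = -700.0, 0.0  # at log(c) = 0 every p_i is 1, so no bits are used
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bits_used(mid) > total_bits:
            lo = mid  # over budget: allow higher false-positive rates
        else:
            hi = mid  # within budget: try lower false-positive rates
    return [math.exp(min(0.0, hi + math.log(n))) for n in level_sizes]


if __name__ == "__main__":
    sizes = [10_000, 100_000, 1_000_000, 10_000_000]  # hypothetical levels
    budget = 5 * sum(sizes)                           # 5 bits per entry overall
    print("uniform sum of FPRs:  ", sum(uniform_fprs(sizes, budget)))
    print("optimized sum of FPRs:", sum(optimized_fprs(sizes, budget)))
```

With the same memory budget, the optimized allocation spends more bits per entry on the small, shallow levels and fewer on the largest level, so the sum of false-positive rates (and hence the modeled worst-case lookup cost) comes out lower than with a uniform bits-per-entry setting.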

Published in

ACM Transactions on Database Systems, Volume 43, Issue 4
Best of SIGMOD 2017 Papers
December 2018
173 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/3298792

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 8 December 2018
• Revised: 1 September 2018
• Accepted: 1 September 2018
• Received: 1 December 2017

Qualifiers

• research-article
• Research
• Refereed
