Abstract
In this article, we show that key-value stores backed by a log-structured merge-tree (LSM-tree) exhibit an intrinsic tradeoff between lookup cost, update cost, and main memory footprint, yet all existing designs expose a suboptimal, difficult-to-tune tradeoff among these metrics. We trace the problem to the fact that modern key-value stores suboptimally co-tune the merge policy, the buffer size, and the Bloom filters’ false-positive rates across the LSM-tree’s different levels.
We present Monkey, an LSM-tree-based key-value store that strikes the optimal balance between the costs of updates and lookups for any given main memory budget. The core insight is that worst-case lookup cost is proportional to the sum of the false-positive rates of the Bloom filters across all levels of the LSM-tree. Contrary to state-of-the-art key-value stores that assign a fixed number of bits per element to all Bloom filters, Monkey allocates memory to filters across different levels so as to minimize the sum of their false-positive rates. We show analytically that Monkey reduces the asymptotic complexity of the worst-case lookup I/O cost, and we verify empirically using an implementation on top of RocksDB that Monkey reduces lookup latency by an increasing margin as the data volume grows (50--80% for the data sizes we experimented with). Furthermore, we map the design space onto a closed-form model that enables adapting the merging frequency and memory allocation to strike the best tradeoff among lookup cost, update cost, and main memory, depending on the workload (proportion of lookups and updates), the dataset (number and size of entries), and the underlying hardware (main memory available, disk vs. flash). We show how to use this model to answer what-if design questions about how changes in environmental parameters impact performance and how to adapt the design of the key-value store for optimal performance.
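The memory allocation the abstract describes admits a closed-form sketch: under the standard Bloom-filter approximation FPR ≈ exp(−(bits/entry) · ln²2), minimizing the sum of per-level false-positive rates subject to a total memory budget (a Lagrangian argument) yields rates proportional to the level sizes, so larger levels receive fewer bits per entry. The Python below is an illustrative sketch of this calculation, not code from Monkey's implementation; the function names and parameters are hypothetical, and it assumes the budget is large enough that every resulting rate is below 1.

```python
import math

LN2_SQ = math.log(2) ** 2  # ln^2(2), from the standard Bloom-filter approximation


def monkey_fprs(level_sizes, total_bits):
    """Per-level false-positive rates minimizing their sum (hypothetical sketch).

    Uses the approximation bits_i = -n_i * ln(p_i) / ln^2(2) and the
    constraint sum(bits_i) = total_bits. The minimizer is p_i = c * n_i,
    i.e., proportional to level size; c follows from the memory constraint.
    Assumes total_bits is large enough that every p_i < 1.
    """
    n_total = sum(level_sizes)
    ln_c = -(total_bits * LN2_SQ
             + sum(n * math.log(n) for n in level_sizes)) / n_total
    return [math.exp(ln_c) * n for n in level_sizes]


def uniform_fprs(level_sizes, total_bits):
    """Baseline: the same bits per entry (hence the same FPR) at every level."""
    bits_per_entry = total_bits / sum(level_sizes)
    p = math.exp(-bits_per_entry * LN2_SQ)
    return [p for _ in level_sizes]
```

For example, with levels of 10^6, 10^7, and 10^8 entries (size ratio 10) and 10 bits per entry overall, the proportional allocation yields a smaller sum of false-positive rates, and hence a lower expected number of wasted I/Os per zero-result lookup, than the uniform baseline, while consuming the same total memory.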
Index Terms
- Optimal Bloom Filters and Adaptive Merging for LSM-Trees