ABSTRACT
Inspired by Google's BigTable, a variety of scalable, semi-structured, weak-semantic table stores have been developed and optimized for different priorities such as query speed, ingest speed, availability, and interactivity. As these systems mature, performance benchmarking will advance from measuring the rate of simple workloads to understanding and debugging the performance of advanced features, such as ingest speed-up techniques and function shipping of filters from clients to servers. This paper describes YCSB++, a set of extensions to the Yahoo! Cloud Serving Benchmark (YCSB) that improve performance understanding and debugging of these advanced features. YCSB++ includes multi-tester coordination for increased load and eventual consistency measurement; multi-phase workloads to quantify the consequences of work deferment and the benefits of anticipatory configuration optimization, such as B-tree pre-splitting or bulk loading; and abstract APIs for explicit incorporation of advanced features in benchmark tests. To enhance performance debugging, we customized an existing cluster monitoring tool to gather the internal statistics of YCSB++, table stores, system services like HDFS, and operating systems, and to offer easy post-test correlation and reporting of performance behaviors. YCSB++ features are illustrated in case studies of two BigTable-like table stores, Apache HBase and Accumulo, developed to emphasize high ingest rates and fine-grained security.
Index Terms
- YCSB++: benchmarking and performance debugging advanced features in scalable table stores