ABSTRACT
We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. What sets our approach apart is its ability to overcome the path/state-space explosion problem when testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms (state symmetry, event independence, and parallel flips), which collectively make our approach on average 16x (up to 78x) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.
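To give a feel for the state-space explosion mentioned above, the following is a minimal sketch of generic symmetry reduction as known from the model checking literature. It is not the paper's state symmetry algorithm, which the abstract does not describe. The toy model (three identical nodes that each receive one "vote" message) and the names `apply_event`, `explore`, `identity`, and `symmetric` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def apply_event(state, event):
    """Deliver one message: the target node appends the payload to its local log."""
    node, payload = event
    new_state = list(state)
    new_state[node] = new_state[node] + (payload,)
    return tuple(new_state)


def explore(events, num_nodes, canonical):
    """Replay every delivery order and collect the distinct (canonicalized) states."""
    seen = set()
    initial = tuple(() for _ in range(num_nodes))
    for order in permutations(events):
        state = initial
        for event in order:
            state = apply_event(state, event)
            seen.add(canonical(state))
    return seen


def identity(state):
    """No reduction: node identities matter."""
    return state


def symmetric(state):
    """Assumed symmetry: nodes are interchangeable, so sort the per-node logs."""
    return tuple(sorted(state))


# Hypothetical workload: one "vote" message delivered to each of three nodes.
events = [(0, "vote"), (1, "vote"), (2, "vote")]
plain = explore(events, 3, identity)
reduced = explore(events, 3, symmetric)
print(len(plain), len(reduced))
```

In this toy model, treating nodes as interchangeable collapses states that differ only in which node has already received its message, so the reduced set is smaller than the plain one. Real systems need a far more careful definition of which states count as equivalent.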
Index Terms
- FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems