DOI: 10.1145/3302424.3303986

FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems

Published: 25 March 2019

ABSTRACT

We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. Our approach is distinguished by its ability to overcome the path/state-space explosion problem when testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms (state symmetry, event independence, and parallel flips) that collectively make our approach on average 16x (up to 78x) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.


Published in

EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
March 2019
714 pages
ISBN: 9781450362818
DOI: 10.1145/3302424

Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, Research, Refereed limited

Overall Acceptance Rate: 241 of 1,308 submissions, 18%
