DOI: 10.1145/3236024.3236071
research-article

CloudRaid: hunting concurrency bugs in the cloud via log-mining

Authors: Jie Lu, Feng Li, Lian Li, and Xiaobing Feng
ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA
Published: 26 October 2018

ABSTRACT

Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outages. This paper presents CloudRaid, a new and effective tool for battling distributed concurrency bugs. CloudRaid automatically detects concurrency bugs in cloud systems by analyzing and testing those message orderings that are likely to expose errors. We observe that large-scale online cloud applications process millions of user requests per second, extensively exercising many permutations of message orderings. Message orderings that are already sufficiently tested are unlikely to expose errors. Hence, CloudRaid mines logs from previous executions to uncover message orderings that are feasible but not sufficiently tested. Specifically, CloudRaid tries to flip the order of a pair of messages <S,P> if they may happen in parallel but S always arrives before P in existing logs, i.e., it exercises the order PS. This log-based approach makes CloudRaid suitable for live systems.
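The pair-selection rule described above can be illustrated with a minimal Python sketch. It is not CloudRaid's implementation: the function name `flip_candidates`, the message names, and the log representation (each run as a list of message identifiers in arrival order) are hypothetical, and the `may_be_parallel` set simply stands in for whatever analysis decides that two messages may happen in parallel. Only the first occurrence of each message in a run is considered, which is a simplifying assumption.

```python
from itertools import combinations


def flip_candidates(runs, may_be_parallel):
    """Return ordered pairs (S, P) worth flipping.

    A pair qualifies when S arrived before P in every logged run that
    contains both messages, and the unordered pair {S, P} is listed in
    may_be_parallel (an assumed input, not derived here).
    """
    observed = set()  # ordered pairs (a, b) seen with a arriving before b
    for run in runs:
        # Position of the first occurrence of each message in this run.
        first_pos = {}
        for index, msg in enumerate(run):
            first_pos.setdefault(msg, index)
        for a, b in combinations(first_pos, 2):
            if first_pos[a] < first_pos[b]:
                observed.add((a, b))
            else:
                observed.add((b, a))
    return sorted(
        (s, p)
        for (s, p) in observed
        if (p, s) not in observed                 # the order P-then-S was never logged
        and frozenset((s, p)) in may_be_parallel  # and the pair may run in parallel
    )


# Hypothetical logs: three runs over three made-up message types.
runs = [
    ["register", "allocate", "launch"],
    ["register", "launch", "allocate"],
    ["register", "allocate", "launch"],
]
may_be_parallel = {
    frozenset(("register", "launch")),
    frozenset(("allocate", "launch")),
}
print(flip_candidates(runs, may_be_parallel))  # [('register', 'launch')]
```

In this toy input, `allocate` and `launch` already appear in both orders, so that pair counts as tested and is skipped; `register` always precedes `launch`, so the untested order `launch` then `register` is the one a tester would try to force.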

We have applied CloudRaid to automatically test four representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS, and Cassandra. CloudRaid can automatically test 40 different versions of the 4 systems (10 versions per system) in 35 hours, and can successfully trigger 28 concurrency bugs, including 8 new bugs that have never been found before. All 8 new bugs have been confirmed by their original developers, and 3 of them are considered critical bugs that have already been fixed.


      • Published in

        ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
        October 2018
        987 pages
        ISBN: 9781450355735
        DOI: 10.1145/3236024

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 112 of 543 submissions, 21%
