ABSTRACT
Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outages. This paper presents CloudRaid, a new and effective tool for battling distributed concurrency bugs. CloudRaid automatically detects concurrency bugs in cloud systems by analyzing and testing those message orderings that are likely to expose errors. We observe that large-scale online cloud applications process millions of user requests per second, extensively exercising many permutations of message orderings. Orderings that have already been sufficiently tested are unlikely to expose errors. Hence, CloudRaid mines logs from previous executions to uncover message orderings that are feasible but not sufficiently tested. Specifically, for a pair of messages <S,P> that may happen in parallel but for which S always arrives before P in existing logs, CloudRaid tries to flip the order, i.e., to exercise the order P ↣ S. This log-based approach makes CloudRaid suitable for live systems.
We have applied CloudRaid to automatically test four representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS and Cassandra. CloudRaid automatically tests 40 different versions of the 4 systems (10 versions per system) in 35 hours and successfully triggers 28 concurrency bugs, including 8 new bugs that had never been found before. All 8 new bugs have been confirmed by their original developers, and 3 of them are considered critical and have already been fixed.
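The pair-selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not CloudRaid's implementation: the log representation (each run as a list of message names in arrival order), the function name `always_before_pairs`, and the caller-supplied `may_be_parallel` predicate are all assumptions made for illustration. The abstract does not say how CloudRaid decides that two messages may happen in parallel, so that decision is left to the caller here.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, List, Sequence, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def always_before_pairs(
    runs: Iterable[Sequence[Hashable]],
    may_be_parallel: Callable[[Hashable, Hashable], bool],
) -> List[Pair]:
    """Return pairs (S, P) such that S arrived before P in every logged run
    containing both, and the caller says S and P may happen in parallel.

    Hypothetical simplification: each message name is treated as one event
    per run, positioned at its first occurrence.
    """
    observed: Set[Pair] = set()  # (a, b): a was seen before b in some run
    for run in runs:
        first_pos = {}
        for index, msg in enumerate(run):
            first_pos.setdefault(msg, index)
        for a, b in combinations(first_pos, 2):
            if first_pos[a] < first_pos[b]:
                observed.add((a, b))
            else:
                observed.add((b, a))

    # Keep only pairs whose reverse order never appeared in any run.
    return [
        (s, p)
        for (s, p) in sorted(observed)
        if (p, s) not in observed and may_be_parallel(s, p)
    ]


# Hypothetical logs: three runs over messages A, B, C.
runs = [["A", "B", "C"], ["A", "C", "B"], ["A", "B"]]
# Assumption for the example: only A and B may happen in parallel.
candidates = always_before_pairs(runs, lambda s, p: {s, p} == {"A", "B"})
for s, p in candidates:
    print(f"untested order to exercise: {p} before {s}")
```

In this toy input, B and C appear in both orders, so neither is a candidate; A always precedes B and the pair is marked as possibly parallel, so the reversed order (B before A) is the one a tester would try to trigger.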
ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA. Jie Lu, Feng Li, Lian Li, and Xiaobing Feng.
Index Terms
- CloudRaid: hunting concurrency bugs in the cloud via log-mining