research-article

CloudRaid: hunting concurrency bugs in the cloud via log-mining

Authors:

Jie Lu
Institute of Computing Technology at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China

Feng Li
Institute of Computing Technology at Chinese Academy of Sciences, China / Institute of Information Engineering at Chinese Academy of Sciences, China

Lian Li
Institute of Computing Technology at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China

Xiaobing Feng
Institute of Computing Technology at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China

ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
October 2018
Pages 3–14
https://doi.org/10.1145/3236024.3236071

Published: 26 October 2018

ABSTRACT

Cloud systems suffer from distributed concurrency bugs, which are notoriously difficult to detect and often lead to data loss and service outages. This paper presents CloudRaid, a new and effective tool for combating distributed concurrency bugs. CloudRaid automatically detects concurrency bugs in cloud systems by analyzing and testing those message orderings that are likely to expose errors. We observe that large-scale online cloud applications process millions of user requests per second, extensively exercising many permutations of message orderings. Message orderings that are already sufficiently tested are unlikely to expose errors. Hence, CloudRaid mines logs from previous executions to uncover message orderings that are feasible but not sufficiently tested. Specifically, CloudRaid tries to flip the order of a pair of messages <S,P> if they may happen in parallel but S always arrives before P in existing logs, i.e., it exercises the order P ↣ S. The log-based approach makes CloudRaid suitable for live systems.
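The pair-selection idea in the abstract can be illustrated with a minimal sketch. This is not CloudRaid's implementation: it assumes each logged run has already been reduced to an ordered list of message names, that each message appears at most once per run, and that the set of pairs that may happen in parallel is supplied as input. The function name `mine_untested_orders` and the message names are hypothetical.

```python
from itertools import combinations


def mine_untested_orders(runs, may_be_parallel):
    """Return pairs (s, p) where s preceded p in every run that logged both,
    and the pair may happen in parallel, so the order p -> s is untested.

    runs: list of runs, each an ordered list of message names
          (assumption: each message appears at most once per run).
    may_be_parallel: set of frozenset({a, b}) pairs, assumed to be given.
    """
    seen_before = set()  # (a, b) means a was observed before b in some run
    for run in runs:
        # combinations() preserves list order, so a always precedes b here
        for a, b in combinations(run, 2):
            seen_before.add((a, b))

    candidates = []
    for s, p in sorted(seen_before):
        never_flipped = (p, s) not in seen_before
        if never_flipped and frozenset((s, p)) in may_be_parallel:
            candidates.append((s, p))
    return candidates


if __name__ == "__main__":
    # Hypothetical message names, for illustration only
    runs = [
        ["register", "assign", "commit"],
        ["register", "commit", "assign"],
    ]
    may_be_parallel = {
        frozenset(("register", "commit")),
        frozenset(("assign", "commit")),
    }
    print(mine_untested_orders(runs, may_be_parallel))
```

In this toy input, `assign` and `commit` were already seen in both orders, and `register` and `assign` are not marked as parallel, so only the pair `<register, commit>` remains as a candidate whose flipped order would be worth testing.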

We have applied CloudRaid to automatically test four representative distributed systems: Apache Hadoop2/Yarn, HBase, HDFS, and Cassandra. CloudRaid can automatically test 40 different versions of the 4 systems (10 versions per system) in 35 hours, and can successfully trigger 28 concurrency bugs, including 8 new bugs that have never been found before. The 8 new bugs have all been confirmed by their original developers, and 3 of them are considered critical and have already been fixed.


Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging
  2. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Published in
ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
October 2018
987 pages
ISBN: 9781450355735
DOI: 10.1145/3236024
General Chair: Gary T. Leavens, University of Central Florida, USA
Program Chairs: Alessandro Garcia, PUC-Rio, Brazil; Corina S. Păsăreanu, NASA Ames Research Center, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bug Detection
Cloud Computing
Concurrency Bugs
Distributed Systems
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%