research-article

How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform

Authors:
Domenico Cotroneo, Federico II University of Naples, Italy
Luigi De Simone, Federico II University of Naples, Italy
Pietro Liguori, Federico II University of Naples, Italy
Roberto Natella, Federico II University of Naples, Italy
Nematollah Bidokhti, Futurewei Technologies, USA

ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, August 2019, Pages 200–211. https://doi.org/10.1145/3338906.3338916

Published: 12 August 2019

ABSTRACT

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the widespread OpenStack cloud management system by performing fault injection and by analyzing the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis shows that most failures are not detected and reported in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.

Conference: ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia

Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging
  2. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance
      2. Software reliability

Published in
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2019, 1264 pages
ISBN: 9781450355728
DOI: 10.1145/3338906
General Chairs: Marlon Dumas (University of Tartu, Estonia), Dietmar Pfahl (University of Tartu, Estonia)
Program Chairs: Sven Apel (Saarland University, Germany), Alessandra Russo (Imperial College, UK)
Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Badges
- Artifacts Available
- Artifacts Evaluated & Reusable
Author Tags
Bug analysis
Fault injection
OpenStack
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%