ABSTRACT
Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not detected and notified in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.
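The three analysis dimensions named in the abstract (fail-stop behavior, detection through logging, and silent failure) can be illustrated with a toy sketch. This is not the authors' tooling and not OpenStack code: the `toy_cloud` logger, the `create_volume` functions, the `inject_fault` flag, and `run_experiment` are all hypothetical names chosen for illustration, and the "injected bug" is a simple wrong-value assignment assumed here as an example.

```python
import io
import logging

LOG = logging.getLogger("toy_cloud")  # hypothetical component logger


def create_volume(size_gb, inject_fault=False):
    """Toy API with no run-time check on its own result."""
    volume = {"size_gb": size_gb, "status": "available"}
    if inject_fault:
        # Injected bug (assumed for illustration): wrong value assigned.
        volume["size_gb"] = 0
    return volume


def create_volume_checked(size_gb, inject_fault=False):
    """Same toy API, with a run-time check that logs and fails fast."""
    volume = create_volume(size_gb, inject_fault)
    if volume["size_gb"] != size_gb:
        LOG.error("volume size mismatch: requested %s, got %s",
                  size_gb, volume["size_gb"])
        raise RuntimeError("volume creation failed")
    return volume


def run_experiment(operation, checker):
    """Run one operation and classify the outcome.

    operation: callable that performs the request and returns its result.
    checker:   callable that returns True if the result is correct.
    """
    # Capture ERROR-level (or higher) log output emitted during the run.
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setLevel(logging.ERROR)
    LOG.addHandler(handler)
    raised = False
    result = None
    try:
        result = operation()
    except Exception:
        raised = True
    finally:
        LOG.removeHandler(handler)

    logged = bool(buffer.getvalue())
    failed = raised or not checker(result)
    return {
        "failed": failed,
        "fail_stop": raised,  # the call itself reported the error
        "logged": logged,     # an error-level log entry was produced
        "silent": failed and not raised and not logged,
    }


def size_is_10(volume):
    return volume is not None and volume["size_gb"] == 10


unchecked = run_experiment(lambda: create_volume(10, inject_fault=True), size_is_10)
checked = run_experiment(lambda: create_volume_checked(10, inject_fault=True), size_is_10)
baseline = run_experiment(lambda: create_volume(10), size_is_10)
```

In this sketch the unchecked variant returns a wrong result without raising or logging, which is the kind of silent failure the abstract warns about, while the checked variant turns the same injected bug into a logged, fail-stop error.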
ESEC/FSE ’19, August 26–30, 2019, Tallinn, Estonia. Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, and Nematollah Bidokhti.
Index Terms
- How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform