DOI: 10.1145/3338906.3338916

How bad can a bug get? An empirical analysis of software failures in the OpenStack cloud computing platform

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella, and Nematollah Bidokhti

Published: 12 August 2019

ABSTRACT

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system. These findings call for more thorough run-time checks and fault containment.
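To make the methodology concrete, the sketch below illustrates the kind of fault-injection experiment the abstract describes, in Python (OpenStack's implementation language). This is not the authors' actual tool: the VolumeService class, the inject_fault helper, and the outcome labels are hypothetical stand-ins. The idea matches the paper's setup: inject a fault into one component, drive a workload against it, and classify the outcome as fail-stop (an explicit, logged error) versus a silent failure.

```python
# Minimal sketch of a fault-injection experiment, assuming a hypothetical
# cloud component (VolumeService) and injection helper (inject_fault).
# The experiment checks whether the injected fault surfaces as a
# fail-stop, logged error, or completes silently with a wrong result.

import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("experiment")


class VolumeService:
    """Stand-in for a cloud management component (e.g., block storage)."""

    def attach_volume(self, instance_id, volume_id):
        return {"instance": instance_id, "volume": volume_id, "status": "attached"}


def inject_fault(obj, method_name, fault):
    """Replace a method with a faulty wrapper (emulates a residual bug
    that raises an exception); returns the original for later restoration."""
    original = getattr(obj, method_name)

    def faulty(*args, **kwargs):
        log.info("fault injected in %s", method_name)
        raise fault

    setattr(obj, method_name, faulty)
    return original


def run_workload(service):
    """Drive the (possibly faulty) component and classify the outcome."""
    try:
        result = service.attach_volume("vm-1", "vol-1")
        # A wrong result here, with no exception and no log entry,
        # would be a silent, non-fail-stop failure.
        log.info("workload completed: %s", result)
        return "no-failure"
    except RuntimeError as exc:
        # Fail-stop behavior: the call fails explicitly and is logged.
        log.error("API error: %s", exc)
        return "api-error"


service = VolumeService()
inject_fault(service, "attach_volume", RuntimeError("emulated residual bug"))
print(run_workload(service))  # -> api-error (a fail-stop, logged outcome)
```

In the paper's terms, the worrying cases are the ones this sketch would classify as anything other than an explicit, logged error: failures that go unreported, or that corrupt state which later propagates to other components.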



Published in

ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
August 2019, 1264 pages
ISBN: 9781450355728
DOI: 10.1145/3338906

            Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

research-article

Acceptance Rates

Overall acceptance rate: 112 of 543 submissions, 21%
