DOI: 10.1145/1508244.1508250
research-article

ASSURE: automatic software self-healing using rescue points

Published: 07 March 2009

ABSTRACT

Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover software from unknown faults while maintaining both system integrity and availability, by mimicking system behavior under known error conditions. Rescue points are locations in existing application code for handling a given set of programmer-anticipated failures, which are automatically repurposed and tested for safely enabling fault recovery from a larger class of (unanticipated) faults. When a fault occurs at an arbitrary location in the program, ASSURE restores execution to an appropriate rescue point and induces the program to recover execution by virtualizing the program's existing error-handling facilities. Rescue points are identified using fuzzing, implemented using a fast coordinated checkpoint-restart mechanism that handles multi-process and multi-threaded applications, and, after testing, are injected into production code using binary patching. We have implemented an ASSURE Linux prototype that operates without application source code and without base operating system kernel changes. Our experimental results on a set of real-world server applications and bugs show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.
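The recovery idea in the abstract can be illustrated with a small sketch. The Python below is a toy model under stated assumptions, not ASSURE's implementation, which checkpoints whole processes and patches binaries. The names `rescue_point`, `handle_request`, and `server_state` are hypothetical. The decorator checkpoints application state on entry to a function that already has an error return; if an unanticipated fault occurs, it restores the checkpoint and forces the function to return its known error value.

```python
import copy
import functools

def rescue_point(state, error_value):
    """Toy rescue point: checkpoint `state` on entry; on any unanticipated
    fault, roll `state` back and return the function's known error value."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            checkpoint = copy.deepcopy(state)   # stand-in for a process checkpoint
            try:
                return func(*args, **kwargs)
            except Exception:                   # stand-in for a detected fault
                state.clear()
                state.update(checkpoint)        # roll back side effects in place
                return error_value              # mimic the anticipated error path
        return wrapper
    return decorate

# Hypothetical server state and request handler, for illustration only.
server_state = {"served": 0, "log": []}

@rescue_point(server_state, error_value=-1)
def handle_request(payload):
    if payload is None:
        return -1                               # programmer-anticipated failure
    server_state["log"].append("start")         # side effect made before the fault
    share = 100 // len(payload)                 # unanticipated fault on empty payload
    server_state["served"] += 1
    return share

ok = handle_request("abcd")
failed = handle_request("")
print(ok, failed, server_state)
```

In this sketch the caller of `handle_request` sees the same `-1` it already knows how to handle, and the partial log entry written before the fault is undone by the rollback.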



        Reviews

        Rafael Corchuelo

        Software crashes caused by programming bugs are a major problem for systems that must be available 24 hours a day, seven days a week. Many researchers are investigating techniques that allow applications to detect their own failures and recover from them automatically. Here, "recovering" means that the application rolls back to a safe state and returns a reasonable error code to the client; this keeps the system available while a programmer creates an appropriate patch.

        Sidiroglou et al. have devised a tool called ASSURE that allows server applications running on Linux 2.6 systems to recover from their failures. ASSURE is innovative insofar as it can deal with applications that are available in binary form only, run on multiple threads and processes, handle polymorphic or encrypted input, or have deterministic bugs (not necessarily memory leaks); furthermore, it does not require any modifications to the underlying operating system. The authors have tested ASSURE on a number of actual bugs in well-known server applications, such as Apache, Squid, and MySQL, and their results indicate that the tool is very efficient.

        ASSURE builds on so-called rescue points, which are functions that return integer error codes or null pointers when a known error is detected. Rescue points and error codes are identified automatically by running the application in a testing environment where it is fed invalid inputs. When a failure is detected for the first time, ASSURE analyzes the problem in a sandbox. First, it determines which function failed. Then, it identifies the closest rescue point to which the application can be rolled back and still keep working. A piece of code is then inserted at this rescue point that returns an error code, which prevents the application from continuing into the failure. Finally, the application is restarted.

        The paper is not at all difficult to read, although it does not provide enough detail for other scientists to repeat the work. The authors' writing style is didactic and they get to the point very straightforwardly. They also make it very clear what their original contributions are. I recommend this paper and ASSURE for system administrators who need to keep their servers highly available.

        Online Computing Reviews Service
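The selection step the review describes (find the failing function, then the closest rescue point) can be sketched as follows. This is a hedged illustration in Python, not the tool's actual algorithm: the call stack, the function names, and the `candidates` table of error values observed during testing are all made up for the example.

```python
from __future__ import annotations

from typing import Optional

def closest_rescue_point(call_stack: list[str],
                         candidates: dict[str, int]) -> Optional[tuple[str, int]]:
    """Walk outward from the faulting frame and return the first function that
    is a known rescue point, together with the error value to force."""
    for function in reversed(call_stack):       # innermost frame is last
        if function in candidates:
            return function, candidates[function]
    return None                                 # no rescue point on this path

# Hypothetical data: functions seen returning error codes on invalid input,
# and the call stack captured when the fault was detected.
candidates = {"process_request": -1, "read_headers": 0}
stack_at_fault = ["main", "process_request", "read_headers", "parse_uri", "copy_field"]

print(closest_rescue_point(stack_at_fault, candidates))
```

Here the fault occurs in `copy_field`, and the nearest enclosing candidate is `read_headers`, so the sketch picks it over the more distant `process_request`.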

        • Published in

          ACM Conferences
          ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
          March 2009
          358 pages
          ISBN: 9781605584065
          DOI: 10.1145/1508244
          • ACM SIGPLAN Notices, Volume 44, Issue 3
            ASPLOS 2009
            March 2009
            346 pages
            ISSN: 0362-1340
            EISSN: 1558-1160
            DOI: 10.1145/1508284
          • ACM SIGARCH Computer Architecture News, Volume 37, Issue 1
            ASPLOS 2009
            March 2009
            346 pages
            ISSN: 0163-5964
            DOI: 10.1145/2528521

          Copyright © 2009 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 535 of 2,713 submissions, 20%
