DOI: 10.1145/2380356.2380375
research-article

Operating system support for redundant multithreading

Published: 07 October 2012

ABSTRACT

In modern commodity operating systems, core functionality is usually designed assuming that the underlying processor hardware always functions correctly. Shrinking hardware feature sizes break this assumption. Existing approaches to cope with these issues either use hardware functionality that is not available in commercial-off-the-shelf (COTS) systems or pose additional requirements on the software development side, making reuse of existing software hard, if not impossible.

In this paper we present Romain, a framework that provides transparent redundant multithreading as an operating system service for hardware error detection and recovery. When applied to a standard benchmark suite, Romain incurs a maximum runtime overhead of 30% for triple-modular redundancy (while in many cases remaining below 5%). Furthermore, our approach minimizes the complexity added to the operating system for the sake of replication.
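
To make the idea of triple-modular redundancy concrete, the following is a minimal sketch of majority voting over replica outputs. It is not Romain's mechanism: the abstract does not say how replicas are compared, and the names `vote` and `run_replicated` are hypothetical, chosen only for illustration. The sketch assumes each replica is a deterministic function and that a hardware error shows up as a differing output.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple


def vote(outputs: Sequence[Hashable]) -> Tuple[Hashable, List[int]]:
    """Return (majority_value, indices_of_disagreeing_replicas).

    Raises RuntimeError when no value is produced by a strict majority,
    meaning the error is detected but cannot be corrected by voting.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


def run_replicated(replicas: Sequence[Callable[[int], int]],
                   arg: int) -> Tuple[Hashable, List[int]]:
    """Run every replica on the same input, then vote on the outputs."""
    outputs = [replica(arg) for replica in replicas]
    return vote(outputs)


if __name__ == "__main__":
    def good(x: int) -> int:
        return x * 2

    def bad(x: int) -> int:
        # Hypothetical fault model: flip the lowest bit of the result.
        return (x * 2) ^ 1

    print(run_replicated([good, bad, good], 21))
```

With three replicas, one faulty output is both detected and outvoted. With two replicas, a mismatch can only be detected, so the sketch raises an error in that case.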


    Published in

      EMSOFT '12: Proceedings of the tenth ACM international conference on Embedded software
      October 2012
      266 pages
      ISBN:9781450314251
      DOI:10.1145/2380356

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 60 of 203 submissions, 30%
