Optimistic recovery in distributed systems

Published: 01 August 1985

Abstract

Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure, using process rollback and message replay.

Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
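
The causal dependency tracking mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the class `RecoveryUnit`, its per-unit interval counters, and the `is_orphan` check are hypothetical names chosen for illustration, and incarnation numbers, checkpoints, and message replay are left out. The sketch assumes that each received message starts a new state interval and that every outgoing message carries the sender's dependency table.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryUnit:
    """Hypothetical recovery unit that tracks which state intervals it depends on."""
    name: str
    interval: int = 0                         # index of the current state interval
    deps: dict = field(default_factory=dict)  # unit name -> highest interval depended on

    def __post_init__(self):
        self.deps[self.name] = self.interval

    def send(self):
        """Return the dependency table piggybacked on an outgoing message."""
        return dict(self.deps)

    def receive(self, sender_deps):
        """Start a new state interval and merge the sender's dependencies."""
        self.interval += 1
        for unit, idx in sender_deps.items():
            self.deps[unit] = max(self.deps.get(unit, -1), idx)
        self.deps[self.name] = self.interval

    def is_orphan(self, failed_unit, last_logged_interval):
        """True if this unit depends on an interval the failed unit could not recover."""
        return self.deps.get(failed_unit, -1) > last_logged_interval


# a receives an external message, then a -> b -> c.
a, b, c = RecoveryUnit("a"), RecoveryUnit("b"), RecoveryUnit("c")
a.receive({})
b.receive(a.send())
c.receive(b.send())

# If a fails having logged only interval 0, both b and c depend on lost state.
orphans_if_unlogged = [u.name for u in (b, c) if u.is_orphan("a", 0)]
# If a had already logged interval 1, nobody needs to roll back.
orphans_if_logged = [u.name for u in (b, c) if u.is_orphan("a", 1)]
```

In this toy run, `c` never exchanged a message with `a` directly, yet it is still identified as dependent on `a` because the dependency table is transitive.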

Reviews

Brent T. Hailpern

This paper discusses a processor-failure recovery technique for a system of communicating processes. The technique is implemented as a component of the process-scheduling and network-interface portion of the operating system. By being part of the operating system, the technique is transparent to application processes. The technique is optimistic in that (almost) all actions by the application processes are allowed to proceed on the assumption that the system will not fail. If failure occurs, the system rolls back to a previous checkpoint, replays those actions that the system has saved, and proceeds with execution. The recovery technique assumes a pure message-based distributed system (that is, no shared memory between processes). Processes are grouped into recovery units. Messages passing into or out of a recovery unit are logged in stable storage (that is, storage that is not lost during a failure). Periodically, process state is also logged in stable storage. Recovery is controlled by retaining state-dependency information for all messages sent and received by a recovery unit. Hence, when a recovery unit fails and then restarts with a new incarnation, other recovery units can determine which of their sent messages are now invalid because of dependencies on the failed unit. Dependent units roll back their state and in effect recall their invalid messages. The technique guarantees that there is a fixed and finite amount of necessary rollback. Performance tuning modifications and garbage collection of recovery information are discussed.
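
The checkpoint-and-replay step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: `LoggedProcess` and all of its method names are hypothetical, the process state is a single running total, message handling is assumed to be deterministic, and a checkpoint is assumed to flush the log first so that it always corresponds to a logged prefix of the input.

```python
class LoggedProcess:
    """Hypothetical process whose input log reaches stable storage asynchronously."""

    def __init__(self):
        self.total = 0             # application state (deliberately trivial)
        self.volatile_buffer = []  # messages processed but not yet on stable storage
        self.stable_log = []       # messages that survive a failure
        self.checkpoint = (0, 0)   # (saved total, count of logged messages it reflects)

    def deliver(self, msg):
        """Process a message immediately; logging happens later."""
        self.total += msg
        self.volatile_buffer.append(msg)

    def flush_log(self):
        """Move buffered messages to stable storage."""
        self.stable_log.extend(self.volatile_buffer)
        self.volatile_buffer.clear()

    def take_checkpoint(self):
        """Assumption: flush first, so the checkpoint matches a logged prefix."""
        self.flush_log()
        self.checkpoint = (self.total, len(self.stable_log))

    def crash_and_recover(self):
        """Lose volatile data, restore the checkpoint, replay the logged suffix."""
        self.volatile_buffer.clear()
        saved_total, replay_from = self.checkpoint
        self.total = saved_total
        for msg in self.stable_log[replay_from:]:
            self.total += msg
        return self.total


p = LoggedProcess()
p.deliver(5)
p.deliver(7)
p.take_checkpoint()   # checkpoint reflects messages 5 and 7
p.deliver(3)
p.flush_log()         # message 3 is now on stable storage
p.deliver(4)          # processed, but not logged before the failure
recovered_total = p.crash_and_recover()
```

The effect of the unlogged message (4) is lost on recovery. Any other unit whose state depends on it would be one of the dependent units that the review says must roll back.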
The technique does require system overhead, in terms of special communication sessions between recovery units, dependency data for process state, a recovery manager process (per recovery unit) containing log and failure information, global broadcast of log and checkpoint information (possibly as overhead to every transmitted message), and special operations on the output boundary (that is, the interface to the “outside world” where messages cannot be recalled). The advantages of the system are: (1) the system is transparent, given a sophisticated distributed operating system; (2) logging proceeds in parallel with and asynchronous to application computation; and (3) if the system does not fail, then processes are not delayed due to “write-ahead” logs or two-phase commit protocols. The paper is self-contained, reasonably easy to read, and provides good references to related work. My only criticism is that the comparison to other work in database systems and distributed system snapshots is shallow, but at least references are there for interested readers. The notion of optimistic algorithms is a gem; it may provide a useful handle on the problem of how to use multiple processors effectively in a general purpose computing system.
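
One plausible reading of the "special operations on the output boundary" is that a message to the outside world is held until every state interval it depends on is known to be logged, since it cannot be recalled afterwards. The sketch below illustrates that idea only. `OutputBoundary`, `submit`, and `note_logged` are hypothetical names, the dependency format (unit name mapped to interval index) is an assumption, and releasing messages strictly in submission order is a design choice made here for simplicity.

```python
from collections import deque


class OutputBoundary:
    """Hypothetical gate that releases external messages only when they are safe."""

    def __init__(self):
        self.logged_upto = {}   # unit name -> highest interval known to be logged
        self.pending = deque()  # (message, dependencies) awaiting release, FIFO
        self.released = []      # messages actually sent to the outside world

    def submit(self, message, deps):
        """Queue an external message together with the intervals it depends on."""
        self.pending.append((message, dict(deps)))
        self._release_ready()

    def note_logged(self, unit, interval):
        """Record that `unit` has logged everything up to `interval`."""
        self.logged_upto[unit] = max(self.logged_upto.get(unit, -1), interval)
        self._release_ready()

    def _is_committable(self, deps):
        return all(self.logged_upto.get(u, -1) >= i for u, i in deps.items())

    def _release_ready(self):
        # Release from the front only, so external output keeps submission order.
        while self.pending and self._is_committable(self.pending[0][1]):
            message, _ = self.pending.popleft()
            self.released.append(message)


boundary = OutputBoundary()
boundary.submit("print receipt", {"a": 1, "b": 1})
held_before_logging = list(boundary.released)
boundary.note_logged("a", 1)
boundary.note_logged("b", 1)
released_after_logging = list(boundary.released)
```

Internal messages are not gated this way in the sketch, which matches the review's point that application processes are not delayed while the system is running without failures.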

Published in

ACM Transactions on Computer Systems, Volume 3, Issue 3
Aug. 1985
94 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/3959

Copyright © 1985 ACM

Publisher

Association for Computing Machinery
New York, NY, United States