Optimistic recovery in distributed systems

Authors:
Rob Strom, IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Shaula Yemini, IBM Thomas J. Watson Research Center, Yorktown Heights, NY

ACM Transactions on Computer Systems, Volume 3, Issue 3, pp. 204–226
https://doi.org/10.1145/3959.3962

Published: 01 August 1985

Abstract

Optimistic Recovery is a new technique supporting application-independent transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure using process rollback and message replay.

Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
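The causal dependency tracking mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the class name `Process`, the integer state-interval indices, and the rule that each received message starts a new interval are hypothetical simplifications, and incarnation numbers are omitted. Each process piggybacks its dependency data on outgoing messages, so that after a failure any process depending on a lost state can be identified and rolled back.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical, simplified process; all names are illustrative."""
    name: str
    interval: int = 0                          # index of the current state interval
    logged: int = 0                            # highest own interval recoverable from the log
    deps: dict = field(default_factory=dict)   # process name -> latest interval depended on

    def __post_init__(self):
        self.deps[self.name] = self.interval

    def send(self):
        """Return the dependency data piggybacked on an outgoing message."""
        return dict(self.deps)

    def receive(self, msg_deps):
        """A received message starts a new interval and merges the sender's dependencies."""
        self.interval += 1
        for proc, idx in msg_deps.items():
            self.deps[proc] = max(self.deps.get(proc, 0), idx)
        self.deps[self.name] = self.interval

    def log(self):
        """Asynchronous logging catches up with the current interval."""
        self.logged = self.interval

    def is_orphan(self, failed, last_recoverable):
        """True if this process depends on a state of `failed` that was lost."""
        return self.deps.get(failed, 0) > last_recoverable


a, b, c = Process("a"), Process("b"), Process("c")
a.receive({})        # a enters interval 1
a.log()              # interval 1 is now recoverable
early = a.send()     # message that depends only on logged state
a.receive({})        # a enters interval 2, not yet logged
late = a.send()      # message that depends on unlogged state
b.receive(late)
c.receive(early)

# a fails: only intervals up to a.logged can be rebuilt by replay.
b_must_roll_back = b.is_orphan("a", a.logged)
c_must_roll_back = c.is_orphan("a", a.logged)
```

In this sketch `b` has consumed a message produced from a state of `a` that was never logged, so it must roll back, while `c` is unaffected. No process waited for logging before sending, which is the sense in which the scheme is optimistic.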

Reviews

Reviewer: Brent T. Hailpern

This paper discusses a processor-failure recovery technique for a system of communicating processes. The technique is implemented as a component of the process-scheduling and network-interface portion of the operating system. By being part of the operating system, the technique is transparent to application processes. The technique is optimistic in that (almost) all actions by the application processes are allowed to proceed on the assumption that the system will not fail. If failure occurs, the system rolls back to a previous checkpoint, replays those actions that the system has saved, and proceeds with execution.

The recovery technique assumes a pure message-based distributed system (that is, no shared memory between processes). Processes are grouped into recovery units. Messages passing into or out of a recovery unit are logged in stable storage (that is, storage that is not lost during a failure). Periodically, process state is also logged in stable storage.

Recovery is controlled by retaining state-dependency information for all messages sent and received by a recovery unit. Hence, when a recovery unit fails and then restarts with a new incarnation, other recovery units can determine which of their sent messages are now invalid because of dependencies on the failed unit. Dependent units roll back their state and in effect recall their invalid messages. The technique guarantees that there is a fixed and finite amount of necessary rollback. Performance-tuning modifications and garbage collection of recovery information are discussed.

The technique does require system overhead, in terms of special communication sessions between recovery units, dependency data for process state, a recovery manager process (per recovery unit) containing log and failure information, global broadcast of log and checkpoint information (possibly as overhead to every transmitted message), and special operations on the output boundary (that is, the interface to the “outside world” where messages cannot be recalled).

The advantages of the system are: (1) the system is transparent, given a sophisticated distributed operating system; (2) logging proceeds in parallel with and asynchronously to application computation; and (3) if the system does not fail, then processes are not delayed due to “write-ahead” logs or two-phase commit protocols.

The paper is self-contained, reasonably easy to read, and provides good references to related work. My only criticism is that the comparison to other work in database systems and distributed system snapshots is shallow, but at least the references are there for interested readers. The notion of optimistic algorithms is a gem; it may provide a useful handle on the problem of how to use multiple processors effectively in a general-purpose computing system.
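The special handling at the output boundary can be sketched as follows. This is an illustrative assumption rather than the paper's code: the names `OutputBoundary`, `committable`, and `on_log_update` are hypothetical, and the rule shown (hold a message bound for the outside world until every state interval it depends on is known to be logged) is a simplified reading of why broadcast log information is needed.

```python
def committable(msg_deps, logged_upto):
    """A message can be released once every interval it depends on is logged."""
    return all(idx <= logged_upto.get(unit, 0) for unit, idx in msg_deps.items())


class OutputBoundary:
    """Hypothetical buffer for messages that cannot be recalled once released."""

    def __init__(self):
        self.pending = []    # (payload, dependencies) pairs still held back
        self.released = []   # payloads already delivered to the outside world

    def submit(self, payload, msg_deps):
        """Queue an external message together with its dependency data."""
        self.pending.append((payload, dict(msg_deps)))

    def on_log_update(self, logged_upto):
        """Called when fresh log information arrives from the recovery units."""
        still_pending = []
        for payload, deps in self.pending:
            if committable(deps, logged_upto):
                self.released.append(payload)
            else:
                still_pending.append((payload, deps))
        self.pending = still_pending


boundary = OutputBoundary()
boundary.submit("print receipt", {"a": 2, "b": 1})

boundary.on_log_update({"a": 1, "b": 1})   # interval 2 of unit a is not logged yet
held_back = list(boundary.released)

boundary.on_log_update({"a": 2, "b": 1})   # now every dependency is on stable storage
delivered = list(boundary.released)
```

Only external output waits for logging in this sketch; messages between recovery units are never delayed, because they can still be invalidated by a later rollback.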

Published in

ACM Transactions on Computer Systems Volume 3, Issue 3
Aug. 1985
94 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/3959
Copyright © 1985 ACM
Publisher
Association for Computing Machinery
New York, NY, United States