Abstract
Optimistic recovery is a new technique supporting application-independent, transparent recovery from processor failures in distributed systems. In optimistic recovery, communication, computation, and checkpointing proceed asynchronously. Synchronization is replaced by causal dependency tracking, which enables a posteriori reconstruction of a consistent distributed system state following a failure, using process rollback and message replay.
Because there is no synchronization among computation, communication, and checkpointing, optimistic recovery can tolerate the failure of an arbitrary number of processors and yields better throughput and response time than other general recovery techniques whenever failures are infrequent.
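The causal dependency tracking mentioned above can be illustrated with a minimal sketch. It is not the paper's protocol: it assumes each received message starts a new "state interval", that every message carries a copy of its sender's dependency vector, and that a failed process can recover only up to the last interval whose input message reached stable storage. The names `Process`, `flush_log`, and `orphans` are hypothetical, and details such as incarnation numbers and the actual rollback and replay steps are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    interval: int = 0  # index of the current state interval
    deps: dict = field(default_factory=dict)  # process name -> latest interval depended on
    logged: int = 0  # highest interval whose input message is on stable storage (assumed model)

    def __post_init__(self):
        # A process always depends on its own current interval.
        self.deps[self.name] = self.interval

    def send(self):
        # A message carries a copy of the sender's dependency vector.
        return dict(self.deps)

    def receive(self, msg_deps):
        # Each received message starts a new state interval.
        self.interval += 1
        # Merge the sender's dependencies by taking the element-wise maximum.
        for proc, idx in msg_deps.items():
            self.deps[proc] = max(self.deps.get(proc, 0), idx)
        self.deps[self.name] = self.interval

    def flush_log(self):
        # Logging is asynchronous; calling this models the log catching up.
        self.logged = self.interval


def orphans(processes, failed):
    """Names of processes that depend on an unrecoverable interval of `failed`."""
    return sorted(
        p.name
        for p in processes
        if p is not failed and p.deps.get(failed.name, 0) > failed.logged
    )
```

In this sketch, if process A receives an input, sends a message to B, and then fails before its log is flushed, `orphans` reports B as a process that would need to roll back; a process that never heard from A, directly or transitively, is unaffected.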