Abstract
This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
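As an illustration of the log-based idea summarized above, the following is a minimal sketch of pessimistic logging, assuming a single process whose only nondeterministic events are message deliveries. The names `Determinant`, `Process`, `stable_log`, and the toy state update are hypothetical and are not taken from the survey; stable storage is simulated with in-memory fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Hypothetical determinant: what is needed to replay one delivery."""
    sender: str
    send_seq: int   # sequence number assigned by the sender
    recv_seq: int   # position of the delivery at the receiver
    payload: int


@dataclass
class Process:
    state: int = 0
    delivered: int = 0
    # Both fields below stand in for stable storage (an assumption of this sketch).
    checkpoint: tuple = (0, 0)
    stable_log: list = field(default_factory=list)

    def take_checkpoint(self):
        # Save the state; determinants logged before it are no longer needed.
        self.checkpoint = (self.state, self.delivered)
        self.stable_log.clear()

    def deliver(self, sender, send_seq, payload):
        det = Determinant(sender, send_seq, self.delivered + 1, payload)
        # Pessimistic: the determinant is logged before the event takes effect.
        self.stable_log.append(det)
        self._apply(det)

    def _apply(self, det):
        # Order-sensitive toy update, so replay order matters.
        self.state = self.state * 2 + det.payload
        self.delivered = det.recv_seq

    def crash(self):
        # Volatile state is lost; checkpoint and log survive.
        self.state, self.delivered = -1, -1

    def recover(self):
        # Restore the checkpoint, then replay determinants in delivery order.
        self.state, self.delivered = self.checkpoint
        for det in sorted(self.stable_log, key=lambda d: d.recv_seq):
            self._apply(det)


p = Process()
p.deliver("q", 1, 3)
p.take_checkpoint()
p.deliver("q", 2, 5)
p.deliver("r", 1, 7)
before_crash = (p.state, p.delivered)
p.crash()
p.recover()
after_recovery = (p.state, p.delivered)
```

In this sketch, recovery reproduces the pre-crash state because every delivery was logged before it was applied. An optimistic or causal variant would relax when and where the determinant is made stable, which this sketch does not model.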