
A survey of rollback-recovery protocols in message-passing systems

Authors:
E. N. (Mootaz) Elnozahy, IBM Research, Austin, TX
Lorenzo Alvisi, The University of Texas at Austin, Austin, TX
Yi-Min Wang, Microsoft Research, Redmond, WA
David B. Johnson, Rice University, Houston, TX

ACM Computing Surveys, Volume 34, Issue 3, pp. 375–408. https://doi.org/10.1145/568522.568525

Published: 01 September 2002


Abstract

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
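The log-based idea in the abstract can be illustrated with a minimal sketch. It is not taken from the survey: the `Determinant` fields, the `Process` class, and the toy state transition are all assumptions made for illustration. It models a pessimistic variant, where each determinant is written to a log that is assumed to survive crashes before the message is applied, so recovery can restore the last checkpoint and replay the logged deliveries in their original order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Determinant:
    """Hypothetical record of one nondeterministic event (a message delivery)."""
    sender: str
    ssn: int       # send sequence number at the sender (assumed field)
    receiver: str
    rsn: int       # receive sequence number: delivery order at the receiver
    payload: int   # message content, stored with the determinant for simplicity


@dataclass
class Process:
    name: str
    state: int = 0                        # volatile application state
    rsn: int = 0                          # volatile delivery counter
    checkpoint: Tuple[int, int] = (0, 0)  # (state, rsn), assumed to be on stable storage
    log: List[Determinant] = field(default_factory=list)  # assumed stable storage

    def _apply(self, payload: int) -> None:
        # Deterministic transition: the same inputs in the same order
        # always produce the same state.
        self.state = self.state * 31 + payload

    def deliver(self, sender: str, ssn: int, payload: int) -> None:
        # Pessimistic logging: record the determinant before the event
        # is allowed to affect the state.
        self.rsn += 1
        self.log.append(Determinant(sender, ssn, self.name, self.rsn, payload))
        self._apply(payload)

    def take_checkpoint(self) -> None:
        self.checkpoint = (self.state, self.rsn)
        # Determinants covered by the checkpoint are no longer needed.
        self.log = [d for d in self.log if d.rsn > self.rsn]

    def crash(self) -> None:
        # Volatile state is lost; checkpoint and log survive.
        self.state, self.rsn = -1, -1

    def recover(self) -> None:
        # Restore the checkpoint, then replay logged deliveries in rsn order.
        self.state, self.rsn = self.checkpoint
        for d in sorted(self.log, key=lambda d: d.rsn):
            self._apply(d.payload)
            self.rsn = d.rsn


if __name__ == "__main__":
    p = Process("p")
    p.deliver("q", 1, 5)
    p.take_checkpoint()
    p.deliver("r", 1, 7)
    before = (p.state, p.rsn)
    p.crash()
    p.recover()
    print(before == (p.state, p.rsn))
```

A checkpoint-based protocol in this sketch would correspond to calling `recover` with an empty log: the process returns to its last checkpoint and any work done since then is lost rather than replayed.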



Reviews

Reviewer: Bayard Kohlhepp

Computer applications now span the globe, and incorporate devices ranging in size and power from watches to clustered supercomputers. The further a system reaches, and the more its heterogeneity increases, the more fragile (susceptible to exceptions and errors) it becomes. Every system we design and build is more likely than ever to encounter, and to have to recover from, unreliable communication. It is time for rollback-recovery techniques to become mainstream software design topics.

This paper surveys the daunting volume of research literature that explores such techniques, concentrating on those approaches that can be implemented in any application environment (for example, those with no language dependencies). It splits these techniques into checkpoint-based and log-based techniques, and then subdivides each of those families. While this taxonomy alone is helpful, the authors go even deeper, and analyze the key ideas underlying each technique, along with the problems that accompany their implementation.

There is no recovery technique that is universally satisfying, so the paper must be read to determine which solution, or family of solutions, is appropriate for a particular situation. However, reading this paper to discover the appropriate research reports is substantially quicker than diving into a literature search and trying to correlate all of the raw material yourself.

Certainly, every practicing programmer, software designer, and architect should read this survey. There is no such thing as a standalone system anymore, and all current implementers would benefit from understanding recovery techniques. Managers would also benefit from a high-level understanding of recovery issues. Their projects will succeed or fail according to product reliability, and this paper covers a core technology for building reliable systems.

Elnozahy et al. have done a stunning job of creating Rollback-Recovery Techniques 101. They did not introduce any new research themselves, but, rather, have brought order out of chaos, by integrating and explaining seemingly contradictory, or at least unrelated, findings. This paper should become a classic reference work, on the desk of every distributed systems programmer.

Online Computing Reviews Service


Published in

ACM Computing Surveys Volume 34, Issue 3
September 2002
106 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/568522
Copyright © 2002 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags
message logging
rollback-recovery
Qualifiers
- article

