A survey of rollback-recovery protocols in message-passing systems

Published: 01 September 2002

Abstract

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
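To make the log-based idea concrete, here is a minimal sketch of pessimistic logging for a single process. It is an illustration under stated assumptions, not a protocol taken from the survey: the `Determinant` fields, the `Process` class, and the order-sensitive state update are all hypothetical, and stable storage is simulated with in-memory fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Hypothetical record of one nondeterministic event (a message delivery)."""
    sender: str
    send_seq: int   # sequence number assigned by the sender
    recv_seq: int   # position of the delivery in the receiver's order
    payload: int


@dataclass
class Process:
    state: int = 0
    delivered: int = 0
    # The two fields below stand in for stable storage (an assumption).
    checkpoint: tuple = (0, 0)                 # (state, delivered)
    log: list = field(default_factory=list)    # determinants since checkpoint

    def take_checkpoint(self):
        self.checkpoint = (self.state, self.delivered)
        self.log.clear()  # earlier determinants are no longer needed

    def deliver(self, sender, send_seq, payload):
        # Pessimistic: log the determinant before the event affects the state.
        det = Determinant(sender, send_seq, self.delivered + 1, payload)
        self.log.append(det)
        self._apply(det)

    def _apply(self, det):
        # Order-sensitive update, so replay order matters.
        self.state = self.state * 2 + det.payload
        self.delivered = det.recv_seq

    def recover(self):
        # Restore the checkpoint, then replay logged events in delivery order.
        self.state, self.delivered = self.checkpoint
        for det in sorted(self.log, key=lambda d: d.recv_seq):
            self._apply(det)


p = Process()
p.deliver("q", 1, 3)
p.take_checkpoint()
p.deliver("q", 2, 1)
p.deliver("r", 1, 4)
before_crash = p.state

p.state, p.delivered = -1, 0   # simulate losing volatile state
p.recover()
print(p.state == before_crash)
```

A checkpoint-based protocol would keep only `take_checkpoint` and the restore step. The optimistic and causal variants named in the abstract differ in when and where determinants reach stable storage, which this sketch does not model.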

Reviews

Bayard Kohlhepp

Computer applications now span the globe, and incorporate devices ranging in size and power from watches to clustered supercomputers. The further a system reaches, and the more its heterogeneity increases, the more fragile (susceptible to exceptions and errors) it becomes. Every system we design and build is more likely than ever to encounter, and to have to recover from, unreliable communication. It is time for rollback-recovery techniques to become mainstream software design topics. This paper surveys the daunting volume of research literature that explores such techniques, concentrating on those approaches that can be implemented in any application environment (for example, those with no language dependencies). It splits these techniques into checkpoint-based and log-based techniques, and then subdivides each of those families. While this taxonomy alone is helpful, the authors go even deeper, and analyze the key ideas underlying each technique, along with the problems that accompany their implementation. There is no recovery technique that is universally satisfying, so the paper must be read to determine which solution, or family of solutions, is appropriate for a particular situation. However, reading this paper to discover the appropriate research reports is substantially quicker than diving into a literature search and trying to correlate all of the raw material yourself. Certainly, every practicing programmer, software designer, and architect should read this survey. There is no such thing as a standalone system anymore, and all current implementers would benefit from understanding recovery techniques. Managers would also benefit from a high-level understanding of recovery issues. Their projects will succeed or fail according to product reliability, and this paper covers a core technology for building reliable systems. Elnozahy et al. have done a stunning job of creating Rollback-Recovery Techniques 101.
They did not introduce any new research themselves, but, rather, have brought order out of chaos, by integrating and explaining seemingly contradictory, or at least unrelated, findings. This paper should become a classic reference work, on the desk of every distributed systems programmer.

Online Computing Reviews Service


  • Published in

    ACM Computing Surveys, Volume 34, Issue 3
    September 2002
    106 pages
    ISSN: 0360-0300
    EISSN: 1557-7341
    DOI: 10.1145/568522

    Copyright © 2002 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States
