Abstract
In this paper we consider rollback propagation in, and the performance of, a fault-tolerant multiprocessor with a rollback recovery mechanism (FTMR2M) [1], which was designed to tolerate hardware failures with minimum time overhead. Rollback propagation between cooperating processes is usually required to ensure correct recovery from a failure. To minimize the wasted processor time and the storage overhead required for handling sophisticated rollback propagations, the FTMR2M always keeps one recoverable state. Approaches for evaluating the recovery overhead and analyzing the performance of the FTMR2M are presented. Two methods for detecting rollback propagations and multi-step rollbacks between cooperating processes are also proposed.
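As a rough illustration of what rollback propagation means, the sketch below computes which processes would have to roll back when one of them fails. It is not either of the two detection methods proposed in the paper, which the abstract does not describe. It assumes a hypothetical log of `(sender, receiver)` interactions recorded since each process's most recent recoverable state, and it assumes that only processes that received information from a rolled-back process are affected. The names `rollback_set` and `log` are invented for this example.

```python
from collections import defaultdict, deque


def rollback_set(failed, interactions):
    """Return every process that must roll back when `failed` rolls back.

    `interactions` is an iterable of (sender, receiver) pairs recorded since
    the processes' most recent recoverable states (an assumed log format).
    """
    # Map each sender to the processes that received information from it.
    receivers = defaultdict(set)
    for sender, receiver in interactions:
        receivers[sender].add(receiver)

    # Breadth-first search: a rollback spreads to every process that
    # consumed information from a process that is itself rolling back.
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in receivers[current]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected


# Hypothetical interaction log: P1 sent to P2, P2 sent to P3, P4 sent to P1.
log = [("P1", "P2"), ("P2", "P3"), ("P4", "P1")]
print(sorted(rollback_set("P1", log)))  # ['P1', 'P2', 'P3']
```

Under these assumptions a failure in P1 drags P2 and P3 back with it, while P4 is unaffected because it only sent information and received none from a rolled-back process.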
- [1] A. M. Feridun and K. G. Shin, "A Fault-Tolerant Multiprocessor System with Rollback Recovery Capabilities", Proc. 2nd Int'l Conf. on Distributed Computing Systems, April 1981.
- [2] B. Randell, "System Structure for Software Fault Tolerance", IEEE Trans. on Software Eng., June 1975, pp. 220-232.
- [3] K. M. Chandy and C. V. Ramamoorthy, "Rollback and Recovery Strategies for Computer Programs", IEEE Trans. on Comp., June 1972, pp. 546-556.
- [4] F. T. O'Brien, "Rollback Point Insertion Strategies", Proc. of the 6th Int'l Symp. on Fault-Tolerant Computing, Pittsburgh, 1976, pp. 138-142.
- [5] K. Kant and A. Silberschatz, "Error Recovery in Concurrent Processes", Proc. COMPSAC 80, Fall 1980, pp. 608-614.
- [6] C. Meraud and F. Browaeys, "Automatic Rollback Techniques of the COPRA Computer", Proc. of 6th Int'l Conf. on Fault-Tolerant Computing, 1976, pp. 23-29.
- [7] K. H. Kim, "An Approach to Programmer-Transparent Coordination of Recovering Parallel Processes and its Efficient Implementation Rules", Proc. 1978 Int'l Conf. on Parallel Processing, Aug. 1978, pp. 58-68.
- [8] K. H. Kim, "An Implementation of a Programmer-Transparent Scheme for Coordinating Concurrent Processes in Recovery", Proc. COMPSAC 80, Fall 1980, pp. 615-621.
- [9] R. J. Swan, S. H. Fuller, and D. P. Siewiorek, "Cm*: a Modular Multi-Microprocessor", AFIPS Conf. Proc., Vol. 46, 1977, pp. 637-644.
- [10] K. H. Kim, "Error Detection, Reconfiguration and Recovery in Distributed Processing System", Proc. Int'l Conf. on Distributed Computing Systems, Oct. 1979, pp. 284-295.
- [11] S. H. Fuller, J. K. Ornstein, L. Raskin, P. I. Rubinfeld, P. J. Swan, "Multi-Microprocessors: An Overview and Working Example", Proceedings of the IEEE, Vol. 66, No. 2, pp. 216-228, Feb. 1978.
- [12] X. Castillo, D. P. Siewiorek, "A Performance-Reliability Model for Computing Systems", 10th Int'l Conf. on Fault-Tolerant Computing, 1980, pp. 187-192.
Index Terms
- Rollback propagation detection and performance evaluation of FTMR2M—a fault-tolerant multiprocessor
ISCA '82: Proceedings of the 9th annual symposium on Computer Architecture