Heartbeat: A timeout-free failure detector for quiescent reliable communication

Kawazoe Aguilera, Marcos; Chen, Wei; Toueg, Sam

doi:10.1007/BFb0030680

Marcos Kawazoe Aguilera¹,
Wei Chen¹ &
Sam Toueg¹

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1320))

Included in the following conference series:

International Workshop on Distributed Algorithms

423 Accesses
49 Citations
4 Altmetric

Abstract

We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show how to solve it using a new failure detector, called heartbeat. In contrast to previous failure detectors that have been used to circumvent impossibility results, the heartbeat failure detector is implementable, and its implementation does not use timeouts. These results have wide applicability: they can be used to transform many existing algorithms that tolerate only process crashes into quiescent algorithms that tolerate both process crashes and message losses. This can be applied to consensus, atomic broadcast, k-set agreement, atomic commitment, etc. The heartbeat failure detector is novel: besides being implementable without timeouts, it does not output lists of suspects as typical failure detectors do. If we restrict failure detectors to output only lists of suspects, quiescent reliable communication requires ◊P [2], which is not implementable. Combined with the results of this paper, this shows that traditional failure detectors that output only lists of suspects have fundamental limitations.

Research partially supported by NSF grant CCR-9402896, by ARPA/ONR grant N00014-961-1014, and by an Olin Fellowship.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

M. K. Aguilera, W. Chen, and S. Toueg. Heartbeat: a timeout-free failure detector for quiescent reliable communication. Technical Report 97–1631, Department of Computer Science, Cornell University, May 1997.
Google Scholar
M. K. Aguilera, W Chen, and S. Toueg. On the weakest failure detector for quiescent reliable communication. Technical report, Department of Computer Science, Cornell University, July 1997.
Google Scholar
M. K. Aguilera, W. Chen, and S. Toueg. Quiescent reliable communication and quiescent consensus in partitionable networks. Technical Report 97–1632, Department of Computer Science, Cornell University, June 1997.
Google Scholar
M. K. Aguilera and S. Toueg. Randomization and failure detection: a hybrid approach to solve consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 29–39. Springer-Verlag, Oct. 1996.
Google Scholar
Ö. Babaoĝlu, R. Davoli, and A. Montresor. Partitionable group membership: specification and algorithms. Technical Report UBLCS-97-1, Dept. of Computer Science, University of Bologna, Bologna, Italy, January 1997.
Google Scholar
A. Basu, B. Charron-Bost, and S. Toueg. Simulating reliable links with unreliable links in the presence of process crashes. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 105–122. Springer-Verlag, Oct. 1996.
Google Scholar
R. Bazzi and G. Neiger. Simulating crash failures with many faulty processors. In A. Segal and S. Zaks, editors, Proceedings of the 6th International Workshop on Distributed Algorithms, volume 647 of Lecture Notes on Computer Science, pages 166-184. Springer-Verlag, 1992.
Google Scholar
M. Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing, pages 27–30, Aug. 1983.
Google Scholar
G. Bracha and S. Toueg. Asynchronous consensus and broadcast protocols. J. ACM, 32(4):824–840, Oct. 1985.
Article Google Scholar
T D. Chandra, April 1997. Private Communication.
Google Scholar
T D. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996.
Article Google Scholar
T D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.
Article Google Scholar
S. Chaudhuri. More choices allow more faults: Set consensus problems in totally asynchronous systems. Information and Computation, 105(1):132–158, July 1993.
Article Google Scholar
B. Chor, M. Merritt, and D. B. Shmoys. Simple constant-time consensus protocols in realistic failure models. Journal of the ACM, 36(3):591–614,1989.
Article Google Scholar
D. Dolev, R. Friedman, I. Keidar, and D. Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, 1996.
Google Scholar
D. Dolev, N. A. Lynch, S. S. Pinter, E. W. Stark, and W. E. Weihl. Reaching approximate agreement in the presence of faults. J. ACM, 33(3):499–516, July 1986.
Article Google Scholar
P Feldman and S. Micali. An optimal algorithm for synchronous Byzantine agreement. Technical Report MIT/LCS/TM-425, Laboratory for Computer Science, Massachusetts Institute of Technology, June 1990.
Google Scholar
M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, Apr. 1985.
Article Google Scholar
A. Gopal. Fault-Tolerant Broadcasts and Multicasts: The Problem of Inconsistency and Contamination. PhD thesis, Corneal University, Jan. 1992.
Google Scholar
R. Guerraoui. Revisiting the relationship between non-blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms, pages 87–100, Le Mont-St-Michel, France, 1995. Springer Verlag, LNCS 972.
Google Scholar
R. Guerraoui, M. Larrea, and A. Schiper. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, pages 13–15, 1995.
Google Scholar
V. Hadzilacos and S. Toueg. A modular approach to fault-tolerant broadcasts and related problems. Technical Report 94-1425, Department of Computer Science, Cornell University, Ithaca, New York, May 1994.
Google Scholar
R. Koo and S. Toueg. Effects of message loss on the termination of distributed protocols. Inf. Process. Lett., 27(4):181–188, Apr. 1988.
Article Google Scholar
W.-K. Lo and V Hadzilacos. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms, pages 280–295, Terschelling, The Netherlands, 1994.
Google Scholar
N. A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., 1996.
Google Scholar
M. Rabin. Randomized Byzantine generals. In Proceedings of the 24th Symposium on Foundations of Computer Science, pages 403–409. IEEE Computer Society Press, Nov. 1983.
Google Scholar
L. S. Sabel and K. Marzullo. Election vs. consensus in asynchronous systems. Technical Report 95-1488, Department of Computer Science, Cornell University, Ithaca, New York, Febrary 1995.
Google Scholar
R. van Renesse, April 1997. Private Communication.
Google Scholar

Download references

Author information

Authors and Affiliations

Computer Science Department, Cornell University, 14853-7501, Ithaca, NY, USA
Marcos Kawazoe Aguilera, Wei Chen & Sam Toueg

Authors

Marcos Kawazoe Aguilera
View author publications
You can also search for this author in PubMed Google Scholar
Wei Chen
View author publications
You can also search for this author in PubMed Google Scholar
Sam Toueg
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Marios Mavronicolas Philippas Tsigas

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kawazoe Aguilera, M., Chen, W., Toueg, S. (1997). Heartbeat: A timeout-free failure detector for quiescent reliable communication. In: Mavronicolas, M., Tsigas, P. (eds) Distributed Algorithms. WDAG 1997. Lecture Notes in Computer Science, vol 1320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0030680

Download citation

DOI: https://doi.org/10.1007/BFb0030680
Published: 20 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63575-8
Online ISBN: 978-3-540-69600-1
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics