Skip to main content

Heartbeat: A timeout-free failure detector for quiescent reliable communication

  • Contributed Papers
  • Conference paper
  • First Online:
Distributed Algorithms (WDAG 1997)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1320))

Included in the following conference series:

Abstract

We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this problem without failure detectors. We then show how to solve it using a new failure detector, called heartbeat. In contrast to previous failure detectors that have been used to circumvent impossibility results, the heartbeat failure detector is implementable, and its implementation does not use timeouts. These results have wide applicability: they can be used to transform many existing algorithms that tolerate only process crashes into quiescent algorithms that tolerate both process crashes and message losses. This can be applied to consensus, atomic broadcast, k-set agreement, atomic commitment, etc. The heartbeat failure detector is novel: besides being implementable without timeouts, it does not output lists of suspects as typical failure detectors do. If we restrict failure detectors to output only lists of suspects, quiescent reliable communication requires ◊P [2], which is not implementable. Combined with the results of this paper, this shows that traditional failure detectors that output only lists of suspects have fundamental limitations.

Research partially supported by NSF grant CCR-9402896, by ARPA/ONR grant N00014-961-1014, and by an Olin Fellowship.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. M. K. Aguilera, W. Chen, and S. Toueg. Heartbeat: a timeout-free failure detector for quiescent reliable communication. Technical Report 97–1631, Department of Computer Science, Cornell University, May 1997.

    Google Scholar 

  2. M. K. Aguilera, W Chen, and S. Toueg. On the weakest failure detector for quiescent reliable communication. Technical report, Department of Computer Science, Cornell University, July 1997.

    Google Scholar 

  3. M. K. Aguilera, W. Chen, and S. Toueg. Quiescent reliable communication and quiescent consensus in partitionable networks. Technical Report 97–1632, Department of Computer Science, Cornell University, June 1997.

    Google Scholar 

  4. M. K. Aguilera and S. Toueg. Randomization and failure detection: a hybrid approach to solve consensus. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 29–39. Springer-Verlag, Oct. 1996.

    Google Scholar 

  5. Ö. Babaoĝlu, R. Davoli, and A. Montresor. Partitionable group membership: specification and algorithms. Technical Report UBLCS-97-1, Dept. of Computer Science, University of Bologna, Bologna, Italy, January 1997.

    Google Scholar 

  6. A. Basu, B. Charron-Bost, and S. Toueg. Simulating reliable links with unreliable links in the presence of process crashes. In Proceedings of the 10th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science, pages 105–122. Springer-Verlag, Oct. 1996.

    Google Scholar 

  7. R. Bazzi and G. Neiger. Simulating crash failures with many faulty processors. In A. Segal and S. Zaks, editors, Proceedings of the 6th International Workshop on Distributed Algorithms, volume 647 of Lecture Notes on Computer Science, pages 166-184. Springer-Verlag, 1992.

    Google Scholar 

  8. M. Ben-Or. Another advantage of free choice: Completely asynchronous agreement protocols. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing, pages 27–30, Aug. 1983.

    Google Scholar 

  9. G. Bracha and S. Toueg. Asynchronous consensus and broadcast protocols. J. ACM, 32(4):824–840, Oct. 1985.

    Article  Google Scholar 

  10. T D. Chandra, April 1997. Private Communication.

    Google Scholar 

  11. T D. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996.

    Article  Google Scholar 

  12. T D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.

    Article  Google Scholar 

  13. S. Chaudhuri. More choices allow more faults: Set consensus problems in totally asynchronous systems. Information and Computation, 105(1):132–158, July 1993.

    Article  Google Scholar 

  14. B. Chor, M. Merritt, and D. B. Shmoys. Simple constant-time consensus protocols in realistic failure models. Journal of the ACM, 36(3):591–614,1989.

    Article  Google Scholar 

  15. D. Dolev, R. Friedman, I. Keidar, and D. Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, 1996.

    Google Scholar 

  16. D. Dolev, N. A. Lynch, S. S. Pinter, E. W. Stark, and W. E. Weihl. Reaching approximate agreement in the presence of faults. J. ACM, 33(3):499–516, July 1986.

    Article  Google Scholar 

  17. P Feldman and S. Micali. An optimal algorithm for synchronous Byzantine agreement. Technical Report MIT/LCS/TM-425, Laboratory for Computer Science, Massachusetts Institute of Technology, June 1990.

    Google Scholar 

  18. M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, Apr. 1985.

    Article  Google Scholar 

  19. A. Gopal. Fault-Tolerant Broadcasts and Multicasts: The Problem of Inconsistency and Contamination. PhD thesis, Corneal University, Jan. 1992.

    Google Scholar 

  20. R. Guerraoui. Revisiting the relationship between non-blocking atomic commitment and consensus. In Proceedings of the 9th International Workshop on Distributed Algorithms, pages 87–100, Le Mont-St-Michel, France, 1995. Springer Verlag, LNCS 972.

    Google Scholar 

  21. R. Guerraoui, M. Larrea, and A. Schiper. Non blocking atomic commitment with an unreliable failure detector. In Proceedings of the 14th IEEE Symposium on Reliable Distributed Systems, pages 13–15, 1995.

    Google Scholar 

  22. V. Hadzilacos and S. Toueg. A modular approach to fault-tolerant broadcasts and related problems. Technical Report 94-1425, Department of Computer Science, Cornell University, Ithaca, New York, May 1994.

    Google Scholar 

  23. R. Koo and S. Toueg. Effects of message loss on the termination of distributed protocols. Inf. Process. Lett., 27(4):181–188, Apr. 1988.

    Article  Google Scholar 

  24. W.-K. Lo and V Hadzilacos. Using failure detectors to solve consensus in asynchronous shared-memory systems. In Proceedings of the 8th International Workshop on Distributed Algorithms, pages 280–295, Terschelling, The Netherlands, 1994.

    Google Scholar 

  25. N. A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., 1996.

    Google Scholar 

  26. M. Rabin. Randomized Byzantine generals. In Proceedings of the 24th Symposium on Foundations of Computer Science, pages 403–409. IEEE Computer Society Press, Nov. 1983.

    Google Scholar 

  27. L. S. Sabel and K. Marzullo. Election vs. consensus in asynchronous systems. Technical Report 95-1488, Department of Computer Science, Cornell University, Ithaca, New York, Febrary 1995.

    Google Scholar 

  28. R. van Renesse, April 1997. Private Communication.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Marios Mavronicolas Philippas Tsigas

Rights and permissions

Reprints and permissions

Copyright information

© 1997 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kawazoe Aguilera, M., Chen, W., Toueg, S. (1997). Heartbeat: A timeout-free failure detector for quiescent reliable communication. In: Mavronicolas, M., Tsigas, P. (eds) Distributed Algorithms. WDAG 1997. Lecture Notes in Computer Science, vol 1320. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0030680

Download citation

  • DOI: https://doi.org/10.1007/BFb0030680

  • Published:

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-63575-8

  • Online ISBN: 978-3-540-69600-1

  • eBook Packages: Springer Book Archive

Publish with us

Policies and ethics