Scalable Fault Tolerant Protocol for Parallel Runtime Environments

Angskun, Thara; Fagg, Graham E.; Bosilca, George; Pješivac–Grbović, Jelena; Dongarra, Jack J.

doi:10.1007/11846802_25


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 4192)

Included in the following conference series:

European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting


Abstract

The number of processors embedded in high performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Parallel runtime environments have to support and adapt to the underlying libraries and hardware, which requires a high degree of scalability in dynamic environments. This paper presents the design of a scalable and fault tolerant protocol for supporting parallel runtime environment communications. The protocol is designed to support transmission of messages across multiple nodes within a self-healing topology to protect against recursive node and process failures. A formal protocol verification has validated the protocol for both the normal and failure cases. We have implemented multiple routing algorithms for the protocol and concluded that the variant rule-based routing algorithm yields the best overall results for damaged and incomplete topologies.



Author information

Authors and Affiliations

Dept. of Computer Science, The University of Tennessee, 1122 Volunteer Blvd., Suite 413, Knoxville, TN, 37996-3450, USA
Thara Angskun, Graham E. Fagg, George Bosilca, Jelena Pješivac–Grbović & Jack J. Dongarra


Editor information

Editors and Affiliations

Forschungszentrum Jülich, ZAM, 52425, Jülich, Germany
Bernd Mohr
NEC Europe Ltd., NEC Laboratories Europe, Rathausallee 10, D-53757, Sankt Augustin, Germany
Jesper Larsson Träff
Dolphin Interconnect Solutions ASA R&D Germany, Siebengebirgsblick 26, 53343, Wachtberg, Germany
Joachim Worringen
Computer Science Department, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra

About this paper

Cite this paper

Angskun, T., Fagg, G.E., Bosilca, G., Pješivac–Grbović, J., Dongarra, J.J. (2006). Scalable Fault Tolerant Protocol for Parallel Runtime Environments. In: Mohr, B., Träff, J.L., Worringen, J., Dongarra, J. (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. EuroPVM/MPI 2006. Lecture Notes in Computer Science, vol 4192. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846802_25


DOI: https://doi.org/10.1007/11846802_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39110-4
Online ISBN: 978-3-540-39112-8
eBook Packages: Computer Science, Computer Science (R0)
