A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing

Jafar, Samir; Gautier, Thierry; Krings, Axel; Roch, Jean-Louis

doi:10.1007/11549468_74

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3648)

Included in the following conference series: European Conference on Parallel Processing

Abstract

This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments, such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing recovery on a different platform and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. Both methods are shown to have very small overhead, and the trade-off between checkpointing cost and recovery cost can be controlled.
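
The central idea of the abstract, treating the dataflow graph itself as the state of the computation instead of a process memory image, can be illustrated with a toy sketch. The Python below is not the authors' implementation: it does not model work-stealing, Systematic Event Logging, or Theft-Induced Checkpointing, and the graph layout, the function names (`new_state`, `run`, `checkpoint`, `restore`), and the use of JSON are all assumptions made for illustration. It only shows why a graph-level checkpoint is independent of the platform and of the number of workers: the saved state is plain data describing tasks and the results produced so far.

```python
import json

# Hypothetical operations for the toy graph; not taken from the paper.
OPS = {"add": lambda xs: sum(xs), "mul": lambda xs: xs[0] * xs[1]}


def new_state():
    """Build a toy dataflow graph computing (2 + 3) * (3 + 4)."""
    tasks = {
        "x": {"op": "const", "deps": [], "value": 2},
        "y": {"op": "const", "deps": [], "value": 3},
        "z": {"op": "const", "deps": [], "value": 4},
        "s1": {"op": "add", "deps": ["x", "y"]},
        "s2": {"op": "add", "deps": ["y", "z"]},
        "out": {"op": "mul", "deps": ["s1", "s2"]},
    }
    return {"tasks": tasks, "results": {}}


def run(state, max_steps=None):
    """Execute ready tasks; stop after max_steps to simulate a failure."""
    tasks, results = state["tasks"], state["results"]
    steps = 0
    while len(results) < len(tasks):
        if max_steps is not None and steps >= max_steps:
            break
        # A task is ready once all of its inputs have been produced.
        name = next(
            n for n, t in tasks.items()
            if n not in results and all(d in results for d in t["deps"])
        )
        task = tasks[name]
        if task["op"] == "const":
            results[name] = task["value"]
        else:
            results[name] = OPS[task["op"]]([results[d] for d in task["deps"]])
        steps += 1
    return state


def checkpoint(state):
    """Serialize the graph and the results so far as portable text."""
    return json.dumps(state)


def restore(blob):
    """Rebuild the computation state from a checkpoint."""
    return json.loads(blob)
```

In this sketch, `run(new_state(), max_steps=4)` stops partway through, `checkpoint` saves the partial state, and `run(restore(blob))` completes only the tasks that had not yet produced a result. Because the checkpoint is plain data, the restored state could be handed to a different machine or scheduler.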

This work is supported by the CNRS, ACI-GRID DOC-G and Région Rhône-Alpes project RAGTIME.

Author information

Authors and Affiliations

Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), 51, Avenue Jean Kuntzmann, 38330, Montbonnot St. Martin, France
Samir Jafar, Thierry Gautier & Jean-Louis Roch
Computer Science Dept, University of Idaho, Moscow, ID, 83844-1010, USA
Axel Krings

Editor information

Editors and Affiliations

Topic Chairs,
José C. Cunha
Faculdade de Ciências e Technologia CITI Centre, Quinta da Torre, Universidade Nova de Lisboa, 2829-516, Caparica, Portugal
Pedro D. Medeiros

About this paper

Cite this paper

Jafar, S., Gautier, T., Krings, A., Roch, JL. (2005). A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing. In: Cunha, J.C., Medeiros, P.D. (eds) Euro-Par 2005 Parallel Processing. Euro-Par 2005. Lecture Notes in Computer Science, vol 3648. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11549468_74

DOI: https://doi.org/10.1007/11549468_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28700-1
Online ISBN: 978-3-540-31925-2
eBook Packages: Computer Science, Computer Science (R0)
