Abstract
This paper presents a new checkpoint/recovery method for dataflow computations using work-stealing in heterogeneous environments such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing for recovery on different platforms with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. It is shown that both methods have very small overhead and that the trade-off between checkpointing cost and recovery cost can be controlled.
This work is supported by the CNRS, ACI-GRID DOC-G and Région Rhône-Alpes project RAGTIME.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jafar, S., Gautier, T., Krings, A., Roch, JL. (2005). A Checkpoint/Recovery Model for Heterogeneous Dataflow Computations Using Work-Stealing. In: Cunha, J.C., Medeiros, P.D. (eds) Euro-Par 2005 Parallel Processing. Euro-Par 2005. Lecture Notes in Computer Science, vol 3648. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11549468_74
DOI: https://doi.org/10.1007/11549468_74
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28700-1
Online ISBN: 978-3-540-31925-2
eBook Packages: Computer Science, Computer Science (R0)