Abstract
Failures are likely to be more frequent in systems with thousands of processors, so schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution away from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages features of current hardware devices that provide early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors that are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique for tolerating faults in MPI applications.
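The core idea of evacuating tasks from a processor that has issued a failure warning can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use the Charm++ or AMPI APIs, the function name `evacuate` and the least-loaded placement rule are hypothetical choices made for illustration, and the actual mechanism described in the paper lives inside the runtime system.

```python
def evacuate(placement, warned, alive):
    """Move every task on `warned` to the least-loaded surviving processor.

    placement: dict mapping task id -> processor id (mutated in place)
    warned:    processor for which a failure has been predicted
    alive:     set of usable processors (mutated in place)
    Returns a list of (task, old_processor, new_processor) moves.
    """
    alive.discard(warned)
    if not alive:
        raise RuntimeError("no surviving processor to migrate to")

    # Count how many tasks each surviving processor already hosts.
    load = {p: 0 for p in alive}
    for proc in placement.values():
        if proc in load:
            load[proc] += 1

    moves = []
    for task in sorted(t for t, p in placement.items() if p == warned):
        # Hypothetical policy: pick the least-loaded survivor,
        # breaking ties by the lowest processor id.
        target = min(load, key=lambda p: (load[p], p))
        placement[task] = target
        load[target] += 1
        moves.append((task, warned, target))
    return moves


# Eight tasks (virtual processors) spread over four physical processors.
placement = {t: t % 4 for t in range(8)}
alive = {0, 1, 2, 3}

# A (simulated) early warning arrives for processor 2.
moves = evacuate(placement, warned=2, alive=alive)
print(moves)
```

Because there are more tasks than physical processors, the work of the warned processor can be spread over several survivors rather than moved to a single spare, which is the property that processor virtualization is meant to provide.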
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chakravorty, S., Mendes, C.L., Kalé, L.V. (2006). Proactive Fault Tolerance in MPI Applications Via Task Migration. In: Robert, Y., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing - HiPC 2006. HiPC 2006. Lecture Notes in Computer Science, vol 4297. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11945918_47
DOI: https://doi.org/10.1007/11945918_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68039-0
Online ISBN: 978-3-540-68040-6
eBook Packages: Computer Science (R0)