Abstract
A distributed computer network (DCN) consists of computing modules (CMs). This paper describes two analytical performance models of tolerant computing networks. The first model combines two submodels: one that evaluates performance as a function of the number of serviceable CMs, and one that evaluates performance as a function of the method used to ensure the tolerance of the network. This model assumes that a computing module cannot be restored. The second model assumes that, during the life cycle of a tolerant DCN, each CM can be in one of three possible states: non-functional, functional – working, and functional – controlled. It also assumes that a CM can be restored.
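The dependence of performance on the number of serviceable CMs can be illustrated with a minimal sketch. It is not the chapter's model. It assumes, purely for illustration, that CMs fail independently at a constant rate, that a failed CM is never restored, and that network performance is proportional to the number of serviceable CMs. The names `failure_rate`, `perf_per_module`, and all function names are hypothetical.

```python
import math


def survival_probability(failure_rate: float, t: float) -> float:
    """Probability that one non-restorable CM is still serviceable at time t.

    Assumption (not from the paper): exponentially distributed time to failure.
    """
    return math.exp(-failure_rate * t)


def serviceable_distribution(n_modules: int, failure_rate: float, t: float) -> list[float]:
    """P(k CMs serviceable at time t) for k = 0..n_modules.

    Assumption: CMs fail independently, so the count is binomial.
    """
    p = survival_probability(failure_rate, t)
    return [
        math.comb(n_modules, k) * p**k * (1.0 - p) ** (n_modules - k)
        for k in range(n_modules + 1)
    ]


def expected_performance(
    n_modules: int, failure_rate: float, t: float, perf_per_module: float = 1.0
) -> float:
    """Expected performance at time t.

    Assumption: performance is linear in the number of serviceable CMs.
    """
    dist = serviceable_distribution(n_modules, failure_rate, t)
    return sum(k * perf_per_module * prob for k, prob in enumerate(dist))


if __name__ == "__main__":
    # Hypothetical example: 8 CMs, failure rate 0.1 per hour, evaluated at 5 hours.
    print(expected_performance(8, 0.1, 5.0))
```

Under these assumptions the expected performance reduces to `n_modules * perf_per_module * exp(-failure_rate * t)`. A model with restorable CMs, such as the second model described above, would need additional states and transition rates that this sketch does not attempt to represent.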
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Brekhov, O. (2020). Integrated Tolerant Distributed Computing Network. In: Vishnevskiy, V.M., Samouylov, K.E., Kozyrev, D.V. (eds) Distributed Computer and Communication Networks: Control, Computation, Communications. DCCN 2020. Communications in Computer and Information Science, vol 1337. Springer, Cham. https://doi.org/10.1007/978-3-030-66242-4_25
DOI: https://doi.org/10.1007/978-3-030-66242-4_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66241-7
Online ISBN: 978-3-030-66242-4
eBook Packages: Computer Science, Computer Science (R0)