ABSTRACT
In May 2022, Frontier at Oak Ridge National Laboratory became the newest supercomputer to top the TOP500 list, demonstrating the capability to compute more than 1.1 quintillion (10^18) floating-point calculations every second. Driving this ground-breaking rate of computing are Frontier’s more than 37,000 graphics processing units (GPUs) and 9,408 central processing units (CPUs). In total, Frontier contains more than 60 million parts. At this scale, even the smallest margin of error may generate hundreds of hardware errors across the system. If not found, these errors can directly hinder the world-class science performed on Frontier. In this work, we describe and evaluate two strategies for finding hardware-level faults in Frontier’s 9,408 compute nodes. The first strategy uses the Slurm scheduler to scavenge available compute time to run the node screen; the second builds upon the lessons learned from the first and enforces a weekly screen of each node. Using June 2023 as a case study, we find that the first scheduling strategy consumed more than ten times the resources of the second, but successfully detected five hardware defects in Frontier. We summarize the lessons learned while developing and running a node screen on the world’s first exascale supercomputer.
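The weekly strategy described above can be pictured as a simple selection over per-node screening history. The following is a minimal sketch only, not the authors' implementation: the function name `nodes_due_for_screen`, the node names, the in-memory dictionary of timestamps, and the reading of "weekly" as a 7-day interval are all assumptions made for illustration, and no Slurm interaction is shown.

```python
from datetime import datetime, timedelta

# Assumption: "weekly" is interpreted as a fixed 7-day interval.
SCREEN_INTERVAL = timedelta(days=7)


def nodes_due_for_screen(last_screened, now, interval=SCREEN_INTERVAL):
    """Return hypothetical node names that are due for a screen.

    last_screened maps node name -> datetime of the last completed
    screen, or None if the node has never been screened.
    """
    due = [
        node
        for node, stamp in last_screened.items()
        if stamp is None or now - stamp >= interval
    ]
    # Never-screened nodes come first, then the longest-overdue nodes.
    due.sort(
        key=lambda n: (last_screened[n] is not None,
                       last_screened[n] or datetime.min)
    )
    return due


if __name__ == "__main__":
    now = datetime(2023, 6, 15)
    history = {  # hypothetical screening history
        "node0001": datetime(2023, 6, 14),
        "node0002": datetime(2023, 6, 1),
        "node0003": None,
        "node0004": datetime(2023, 6, 8),
    }
    print(nodes_due_for_screen(history, now))
    # ['node0003', 'node0002', 'node0004']
```

Under these assumptions, a node screened the previous day is skipped, while a node that has never been screened is placed at the front of the list. How the selected nodes are then handed to the scheduler is outside the scope of this sketch.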
Index Terms
- Experiences Detecting Defective Hardware in Exascale Supercomputers