DOI: 10.1145/3624062.3624134
research-article

Experiences Detecting Defective Hardware in Exascale Supercomputers

Published: 12 November 2023

ABSTRACT

In May 2022, Frontier at Oak Ridge National Laboratory became the newest supercomputer to top the TOP500 list, demonstrating the capability of computing more than 1.1 quintillion (10^18) floating-point calculations every second. Driving this ground-breaking rate of computing are Frontier’s more than 37,000 graphics processing units (GPUs) and 9,408 central processing units (CPUs). In total, Frontier contains more than 60 million parts. At this scale, even the smallest margin of error may generate hundreds of hardware errors across the system. If not found, these errors can directly hinder the world-class science performed on Frontier. In this work, we describe and evaluate two strategies for finding hardware-level faults in Frontier’s 9,408 compute nodes. The first uses the Slurm scheduler to scavenge available compute time to run the node screen; the second builds upon the lessons learned from the first and enforces a weekly screen of each node. Using June 2023 as a case study, we find that the first scheduling strategy consumed more than ten times the resources of the second, but successfully detected five hardware defects in Frontier. We summarize the lessons learned while developing and running a node screen on the world’s first exascale supercomputer.
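
The scale argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below takes the part count from the abstract; the per-part defect rates and the function name `expected_defects` are hypothetical, chosen only to show how a very small rate becomes a large absolute count. It also assumes parts fail independently, which the abstract does not state. Under these assumptions, a rate of one defect per 100,000 parts corresponds to roughly 600 defective parts.

```python
# Back-of-the-envelope estimate. The defect rates below are hypothetical,
# not figures reported for Frontier.
TOTAL_PARTS = 60_000_000  # "more than 60 million parts" (from the abstract)


def expected_defects(total_parts: int, defect_rate: float) -> float:
    """Expected number of defective parts, assuming independent failures."""
    return total_parts * defect_rate


if __name__ == "__main__":
    # Illustrative rates only: one in a million, one in 100,000, one in 10,000.
    for rate in (1e-6, 1e-5, 1e-4):
        count = expected_defects(TOTAL_PARTS, rate)
        print(f"defect rate {rate:.0e}: about {count:,.0f} defective parts")
```
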

Published in

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN: 9798400707858
DOI: 10.1145/3624062

Copyright © 2023 ACM

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article, Research, Refereed limited