ABSTRACT
Preparing large scientific and engineering codes for deployment on upcoming exascale systems with GPU-dense nodes is challenging because of the unprecedented diversity of device architectures and heterogeneous programming models. In this work, we evaluate the process of porting a massively parallel fluid dynamics code written in CUDA to SYCL, HIP, and Kokkos with a range of backends, using a combination of automated tools and manual tuning. We use a proxy application along with a custom performance model to inform the results and identify additional optimization strategies. The at-scale performance of the programming model implementations is evaluated on pre-production GPU node architectures for Frontier and Aurora, as well as on the current NVIDIA GPU-based systems Summit and Polaris. Real-world workloads representing 3D blood flow calculations in complex vasculature are assessed. Our analysis highlights critical trade-offs between code performance, portability, and development time.
Index Terms
- Performance Evaluation of Heterogeneous GPU Programming Frameworks for Hemodynamic Simulations