Memory Access Optimization of High-Order CFD Stencil Computations on GPU

Wang, Shengxiang; Li, Zhuoqian; Che, Yonggang

doi:10.1007/978-3-030-69244-5_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12606)

Included in the following conference series:

International Conference on Parallel and Distributed Computing: Applications and Technologies

Abstract

Stencil computations are a class of computations commonly found in scientific and engineering applications. They have relatively low arithmetic intensity, so their performance is largely determined by memory access. This paper studies memory access optimization for the key stencil computations of a high-order CFD program on an NVIDIA GPU. Two methods are used to optimize performance. First, we use registers to cache the data used by the stencil computations in the kernel: we use the CUDA warp shuffle functions to exchange data between neighboring grid points, and adjust the thread computation granularity to increase data reuse. Second, we use shared memory to buffer the grid data used by the stencil computations in the kernel, and apply loop tiling to reduce redundant accesses to global memory. Performance is evaluated on an NVIDIA Tesla K80 GPU. The results show that, compared to the original implementation that uses only global memory, the register-based implementation achieves maximum speedups of 2.59 and 2.79 for the 15M and 60M grids, respectively, and the shared-memory implementation achieves maximum speedups of 3.51 and 3.36 for the 15M and 60M grids, respectively.
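The two approaches named in the abstract can be illustrated with a minimal CUDA sketch. This is not the paper's code: the 1D grid, the fourth-order central first-derivative stencil, the radius `R = 2`, the block size `BLOCK = 128`, and the names `d1`, `stencil_shfl`, `stencil_smem`, and `run_stencil` are all assumptions chosen for illustration. The granularity adjustment mentioned in the abstract (several points per thread) is not shown.

```cuda
#include <cuda_runtime.h>

constexpr int R = 2;        // stencil radius (assumed: 4th-order central difference)
constexpr int WARP = 32;    // warp size
constexpr int BLOCK = 128;  // threads per block (assumed; must be a multiple of WARP)

// 4th-order central first derivative from the four neighbors of a point.
__host__ __device__ inline float d1(float m2, float m1, float p1, float p2, float inv12h) {
    return (m2 - 8.0f * m1 + 8.0f * p1 - p2) * inv12h;
}

// Variant 1: register caching. Each lane keeps one grid value in a register and
// reads its neighbors from other lanes via warp shuffles. Consecutive warps
// overlap by 2*R points, so only lanes R..WARP-R-1 write a result.
__global__ void stencil_shfl(const float* in, float* out, int n, float inv12h) {
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x % WARP;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    int i = warp * (WARP - 2 * R) + lane - R;    // grid point held by this lane
    float v = (i >= 0 && i < n) ? in[i] : 0.0f;  // one global load per lane
    // All 32 lanes execute the shuffles; no thread exits before this point.
    float m1 = __shfl_up_sync(full, v, 1);
    float m2 = __shfl_up_sync(full, v, 2);
    float p1 = __shfl_down_sync(full, v, 1);
    float p2 = __shfl_down_sync(full, v, 2);
    if (lane >= R && lane < WARP - R && i >= R && i < n - R)
        out[i] = d1(m2, m1, p1, p2, inv12h);
}

// Variant 2: shared-memory tiling. Each block stages BLOCK points plus a halo of
// R points per side, so every global value is loaded once per block.
// Assumes the kernel is launched with blockDim.x == BLOCK.
__global__ void stencil_smem(const float* in, float* out, int n, float inv12h) {
    __shared__ float tile[BLOCK + 2 * R];
    int base = blockIdx.x * BLOCK;               // first output point of this block
    for (int t = threadIdx.x; t < BLOCK + 2 * R; t += BLOCK) {
        int g = base + t - R;                    // global index, including halo
        tile[t] = (g >= 0 && g < n) ? in[g] : 0.0f;
    }
    __syncthreads();
    int i = base + threadIdx.x;
    int s = threadIdx.x + R;                     // position of point i in the tile
    if (i >= R && i < n - R)
        out[i] = d1(tile[s - 2], tile[s - 1], tile[s + 1], tile[s + 2], inv12h);
}

// Hypothetical host helper: runs one variant on host arrays with grid spacing h.
// Boundary points (i < R or i >= n - R) are left at zero.
bool run_stencil(const float* h_in, float* h_out, int n, float h, bool use_shuffle) {
    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);
    float inv12h = 1.0f / (12.0f * h);
    if (use_shuffle) {
        int warps = (n + (WARP - 2 * R) - 1) / (WARP - 2 * R);  // 28 outputs per warp
        int blocks = (warps * WARP + BLOCK - 1) / BLOCK;
        stencil_shfl<<<blocks, BLOCK>>>(d_in, d_out, n, inv12h);
    } else {
        stencil_smem<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n, inv12h);
    }
    bool ok = cudaGetLastError() == cudaSuccess;
    ok = (cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) && ok;
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

In this sketch the shuffle variant trades redundant loads at warp boundaries (2*R of every 32 lanes) for neighbor exchange without shared memory, while the tiled variant pays for one `__syncthreads()` per block but only reloads the halo once per 128 points. Which trade-off wins on real hardware depends on the stencil width and occupancy, which is what the paper's evaluation measures.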

Acknowledgments

This work was partially supported by the National Key R&D Program under Grant No. 2017YFB0202403 and the National Natural Science Foundation of China under Grant Nos. 61561146395 and 61772542.

Author information

Authors and Affiliations

Institute for Quantum Information and State Key Laboratory of High Performance Computing, College of Computer, National University of Defense Technology, Changsha, China
Shengxiang Wang, Zhuoqian Li & Yonggang Che

Corresponding author

Correspondence to Yonggang Che.

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology, Shenzhen, China
Yong Zhang
Shenzhen Institutes of Advanced Technology, Shenzhen, China
Yicheng Xu
Griffith University, Gold Coast, QLD, Australia
Hui Tian

About this paper

Cite this paper

Wang, S., Li, Z., Che, Y. (2021). Memory Access Optimization of High-Order CFD Stencil Computations on GPU. In: Zhang, Y., Xu, Y., Tian, H. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2020. Lecture Notes in Computer Science, vol 12606. Springer, Cham. https://doi.org/10.1007/978-3-030-69244-5_4

DOI: https://doi.org/10.1007/978-3-030-69244-5_4
Published: 21 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-69243-8
Online ISBN: 978-3-030-69244-5
eBook Packages: Computer Science, Computer Science (R0)
