Memory Access Optimization of High-Order CFD Stencil Computations on GPU

  • Conference paper
Parallel and Distributed Computing, Applications and Technologies (PDCAT 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12606)

Abstract

Stencil computations are a class of computations commonly found in scientific and engineering applications. They have relatively low arithmetic intensity, so their performance is strongly affected by memory access. This paper studies memory access optimization for the key stencil computations of a high-order CFD program on NVIDIA GPUs. Two methods are used to optimize performance. First, we use registers to cache the data used by the stencil computations in the kernel. We use the CUDA warp shuffle functions to exchange data between neighboring grid points, and adjust the thread computation granularity to increase data reuse. Second, we use shared memory to buffer the grid data used by the stencil computations in the kernel, and apply loop tiling to reduce redundant accesses to global memory. Performance is evaluated on an NVIDIA Tesla K80 GPU. The results show that, compared to the original implementation that uses only global memory, the register-based implementation achieves maximum speedups of 2.59 and 2.79 for the 15M and 60M grids, respectively, and the shared-memory implementation achieves maximum speedups of 3.51 and 3.36 for the 15M and 60M grids, respectively.
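The two strategies named in the abstract can be illustrated with a minimal CUDA sketch. This is not the authors' code: the paper targets the 3D stencils of a high-order CFD program, whereas the example below assumes a 1D fourth-order central difference with unit grid spacing, purely to keep the kernels short. The names `stencil_shuffle`, `stencil_shared`, and `max_abs_error`, the block size, and the test input are all hypothetical. The `__shfl_up_sync` and `__shfl_down_sync` intrinsics assume CUDA 9 or later.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int R = 2;        // stencil radius (assumed for illustration)
constexpr int WARP = 32;    // warp size
constexpr int BLOCK = 128;  // threads per block, a multiple of WARP
constexpr float C1 = 8.0f / 12.0f, C2 = -1.0f / 12.0f;  // 4th-order central difference

// Register caching: each lane keeps one grid value in a register and reads its
// neighbors from other lanes via warp shuffles. Adjacent warps overlap by 2*R
// points, so each warp produces WARP - 2*R outputs.
__global__ void stencil_shuffle(const float* in, float* out, int n) {
    int lane = threadIdx.x % WARP;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    int i = warp * (WARP - 2 * R) + lane - R;     // grid point held by this lane
    float v = (i >= 0 && i < n) ? in[i] : 0.0f;   // one global load per lane
    // All 32 lanes take part in every shuffle; no thread exits early.
    float l1 = __shfl_up_sync(0xffffffffu, v, 1);
    float l2 = __shfl_up_sync(0xffffffffu, v, 2);
    float r1 = __shfl_down_sync(0xffffffffu, v, 1);
    float r2 = __shfl_down_sync(0xffffffffu, v, 2);
    // Only interior lanes have all neighbors inside the warp.
    if (lane >= R && lane < WARP - R && i >= R && i < n - R)
        out[i] = C1 * (r1 - l1) + C2 * (r2 - l2);
}

// Shared-memory tiling: the block stages its tile plus halos once, then every
// thread reads its neighbors from shared memory instead of global memory.
// Assumes the kernel is launched with blockDim.x == BLOCK.
__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float s[BLOCK + 2 * R];
    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK + tid;
    s[tid + R] = (gid < n) ? in[gid] : 0.0f;
    if (tid < R) {  // the first R threads also load the left and right halos
        int left = gid - R, right = gid + BLOCK;
        s[tid] = (left >= 0) ? in[left] : 0.0f;
        s[tid + R + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();
    if (gid >= R && gid < n - R)
        out[gid] = C1 * (s[tid + R + 1] - s[tid + R - 1])
                 + C2 * (s[tid + R + 2] - s[tid + R - 2]);
}

// Runs one of the kernels and returns the largest deviation from a CPU reference.
float max_abs_error(int n, bool use_shuffle) {
    std::vector<float> h_in(n), h_ref(n, 0.0f), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = std::sin(0.01f * i);
    for (int i = R; i < n - R; ++i)
        h_ref[i] = C1 * (h_in[i + 1] - h_in[i - 1]) + C2 * (h_in[i + 2] - h_in[i - 2]);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);  // boundary points stay zero, as in h_ref

    if (use_shuffle) {
        int warps = (n + WARP - 2 * R - 1) / (WARP - 2 * R);
        int blocks = (warps * WARP + BLOCK - 1) / BLOCK;
        stencil_shuffle<<<blocks, BLOCK>>>(d_in, d_out, n);
    } else {
        stencil_shared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    }
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float err = 0.0f;
    for (int i = 0; i < n; ++i) err = std::fmax(err, std::fabs(h_out[i] - h_ref[i]));
    return err;
}
```

Calling `max_abs_error(n, true)` exercises the shuffle kernel and `max_abs_error(n, false)` the shared-memory kernel; both should agree with the CPU reference up to floating-point rounding. In this sketch the shuffle version pays for its register reuse with 2*R redundant loads per warp, which hints at why the abstract mentions adjusting the per-thread computation granularity.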


Acknowledgments

This work was partially supported by the National Key R&D Program under Grant No. 2017YFB0202403 and by the National Natural Science Foundation of China under Grant Nos. 61561146395 and 61772542.

Corresponding author

Correspondence to Yonggang Che.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, S., Li, Z., Che, Y. (2021). Memory Access Optimization of High-Order CFD Stencil Computations on GPU. In: Zhang, Y., Xu, Y., Tian, H. (eds) Parallel and Distributed Computing, Applications and Technologies. PDCAT 2020. Lecture Notes in Computer Science, vol 12606. Springer, Cham. https://doi.org/10.1007/978-3-030-69244-5_4

  • DOI: https://doi.org/10.1007/978-3-030-69244-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69243-8

  • Online ISBN: 978-3-030-69244-5

  • eBook Packages: Computer Science, Computer Science (R0)
