LC-MEMENTO: A Memory Model for Accelerated Architectures

Conference paper in Languages and Compilers for Parallel Computing (LCPC 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13181)

Abstract

With the advent of heterogeneous architectures, and in particular the ubiquity of multi-GPU systems, managing device memory efficiently is increasingly important for reaping the benefits of the additional core count. To date, this responsibility falls mainly on the programmer: device-to-host data communication (and vice versa), if not handled properly, may incur costly memory transfer operations and synchronization. The problem is compounded by the additional requirement of maintaining system-wide memory consistency, which can involve expensive synchronization overhead. In this paper, we present the Location Consistency Memory Model for Enhanced Transfer Operations (LC-MEMENTO), a framework that incorporates runtime techniques for multi-GPU memory management to support relaxed synchronization semantics and automatic memory transfer operations. Specifically, we implement a relaxed memory consistency model based on Location Consistency (LC) in an Asynchronous Many-Task Runtime (ARTS) and demonstrate that this memory model enables additional optimization opportunities for three representative applications encompassing different computational patterns (scientific computation, graphs, and data streaming).
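To make the manual burden concrete, the sketch below (plain CUDA written for this summary; it is not the LC-MEMENTO or ARTS API, and the kernel, buffer sizes, and names are hypothetical) shows the explicit per-GPU transfers and the conservative, system-wide synchronization point that such a runtime aims to relax or hide.

    // Illustrative sketch only (plain CUDA): the hand-written multi-GPU
    // transfer/synchronization pattern that a runtime-managed approach such as
    // LC-MEMENTO targets. This is not the ARTS API; kernel and sizes are
    // hypothetical.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        int avail = 0;
        cudaGetDeviceCount(&avail);
        const int nGpus = (avail >= 2) ? 2 : 1;
        const int chunk = 1 << 20;

        float *host;
        cudaMallocHost(&host, nGpus * chunk * sizeof(float));  // pinned host buffer
        for (int i = 0; i < nGpus * chunk; ++i) host[i] = 1.0f;

        float *dev[2];
        cudaStream_t stream[2];
        for (int g = 0; g < nGpus; ++g) {
            cudaSetDevice(g);
            cudaMalloc(&dev[g], chunk * sizeof(float));
            cudaStreamCreate(&stream[g]);
            // Programmer-managed host-to-device copy of this GPU's chunk.
            cudaMemcpyAsync(dev[g], host + g * chunk, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, stream[g]);
            scale<<<(chunk + 255) / 256, 256, 0, stream[g]>>>(dev[g], chunk);
            // Programmer-managed device-to-host copy of the result.
            cudaMemcpyAsync(host + g * chunk, dev[g], chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[g]);
        }
        // Conservative, system-wide synchronization: every GPU must drain its
        // stream before the host may observe any of the data.
        for (int g = 0; g < nGpus; ++g) {
            cudaSetDevice(g);
            cudaStreamSynchronize(stream[g]);
            cudaFree(dev[g]);
            cudaStreamDestroy(stream[g]);
        }
        printf("host[0] = %f\n", host[0]);
        cudaFreeHost(host);
        return 0;
    }

Under the approach described in the abstract, the runtime would take on the transfer and consistency management that is spelled out explicitly here.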

Notes

  1. Their ancillary functionality can be set and reset across different phases of the program.

  2. A practical memory model is one that application developers can use to write non-chaotic codes, since all of its non-determinism can be contained by special operators.

  3. A concept in computer science and mathematics in which an operator can be applied multiple times without changing the result/state of the computation after the first application.
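As a minimal sketch of the idempotence property in note 3 (not taken from the paper; the clampTo operator is a hypothetical example), applying the operator a second time leaves the result unchanged:

    // Minimal illustration of idempotence (note 3); clampTo is a hypothetical
    // operator, not part of LC-MEMENTO or ARTS.
    #include <algorithm>
    #include <cassert>

    // Idempotent operator: a value already at or below the bound is unchanged
    // by further applications, so clampTo(clampTo(x)) == clampTo(x).
    int clampTo(int x, int bound = 100) { return std::min(x, bound); }

    int main() {
        int x = 250;
        assert(clampTo(clampTo(x)) == clampTo(x));  // second application changes nothing
        return 0;
    }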

Author information

Correspondence to Kiran Ranganath or Joshua Suetterlein.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Ranganath, K. et al. (2022). LC-MEMENTO: A Memory Model for Accelerated Architectures. In: Li, X., Chandrasekaran, S. (eds) Languages and Compilers for Parallel Computing. LCPC 2021. Lecture Notes in Computer Science, vol 13181. Springer, Cham. https://doi.org/10.1007/978-3-030-99372-6_5

  • DOI: https://doi.org/10.1007/978-3-030-99372-6_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-99371-9

  • Online ISBN: 978-3-030-99372-6

  • eBook Packages: Computer Science (R0)
