ABSTRACT
This paper develops an algorithmic foundation for automated management of the multilevel-memory systems common to new supercomputers. In particular, the High-Bandwidth Memory (HBM) of these systems has latency similar to DRAM's and a smaller capacity, but much higher bandwidth. Systems equipped with HBM do not fit classic memory-hierarchy models because of HBM's atypical characteristics.
Unlike caches, which are generally managed automatically by the hardware, programmers of some current HBM-equipped supercomputers can choose to explicitly manage HBM themselves. This process is problem specific and resource intensive. Vendors offer this option because there is no consensus on how to automatically manage HBM to guarantee good performance, or whether this is even possible.
In this paper, we give theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems. HBM management is starkly different from traditional caching, both in its optimization objectives and in algorithm development. Since DRAM and HBM have similar latencies, minimizing HBM misses (provably) turns out not to be the right memory-management objective. Instead, we directly focus on minimizing makespan. In addition, while cache-management algorithms need only decide which pages to keep in cache, HBM management requires answering two questions: (1) which pages to keep in HBM and (2) how to use the limited bandwidth from HBM to DRAM. The natural approach of using LRU for the first question and FCFS (First-Come-First-Serve) for the second turns out to be provably bad. Instead, we provide a priority-based approach that is simple, efficiently implementable, and $O(1)$-competitive for makespan when all multicore threads are independent.
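To make the two decisions concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: each thread is assigned a fixed priority, pending DRAM-to-HBM transfers are served in priority order rather than FCFS, and eviction prefers pages owned by the lowest-priority thread. The class name `HBMManager` and its interface are hypothetical, invented only for this example.

```python
import heapq

class HBMManager:
    """Hypothetical priority-based manager (illustration only).
    Lower priority values are more urgent."""

    def __init__(self, hbm_capacity):
        self.capacity = hbm_capacity
        self.hbm = {}       # page -> priority of the owning thread
        self.pending = []   # min-heap of (priority, arrival_seq, page)
        self.seq = 0

    def request(self, page, priority):
        """A thread asks for a page; misses queue a transfer request."""
        if page in self.hbm:
            return "hit"
        heapq.heappush(self.pending, (priority, self.seq, page))
        self.seq += 1
        return "miss"

    def transfer_one(self):
        """Model one use of the limited DRAM->HBM channel: serve the
        highest-priority pending request (FCFS would instead serve in
        arrival order). Evict a low-priority page if HBM is full."""
        if not self.pending:
            return None
        prio, _, page = heapq.heappop(self.pending)
        if len(self.hbm) >= self.capacity:
            victim = max(self.hbm, key=self.hbm.get)  # lowest-priority owner
            del self.hbm[victim]
        self.hbm[page] = prio
        return page
```

The point of the sketch is the contrast with FCFS: under contention, the scarce DRAM-to-HBM bandwidth goes to the most urgent thread first, which is the kind of prioritization the paper's competitive analysis relies on.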
How to Manage High-Bandwidth Memory Automatically