DOI: 10.1145/3350755.3400233
Research article · Public Access

How to Manage High-Bandwidth Memory Automatically

Published: 09 July 2020

ABSTRACT

This paper develops an algorithmic foundation for automated management of the multilevel-memory systems common in new supercomputers. In particular, the High-Bandwidth Memory (HBM) in these systems has latency similar to DRAM's and a smaller capacity, but much higher bandwidth. Systems equipped with HBM do not fit classic memory-hierarchy models because of HBM's atypical characteristics.

Unlike caches, which are generally managed automatically by the hardware, programmers of some current HBM-equipped supercomputers can choose to explicitly manage HBM themselves. This process is problem specific and resource intensive. Vendors offer this option because there is no consensus on how to automatically manage HBM to guarantee good performance, or whether this is even possible.

In this paper, we give theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems. HBM management is starkly different from traditional caching, both in terms of optimization objectives and algorithm development. Since DRAM and HBM have similar latencies, minimizing HBM misses (provably) turns out not to be the right memory-management objective. Instead, we directly focus on minimizing makespan. In addition, while cache-management algorithms need only decide which pages to keep in cache, HBM management requires answering two questions: (1) which pages to keep in HBM and (2) how to use the limited bandwidth between HBM and DRAM. It turns out that the natural approach of using LRU for the first question and FCFS (First-Come-First-Serve) for the second is provably bad. Instead, we provide a priority-based approach that is simple, efficiently implementable, and $O(1)$-competitive for makespan when all multicore threads are independent.
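The two decisions the abstract distinguishes can be illustrated in a toy simulation. This is a minimal sketch under assumed simplifications, not the paper's actual algorithm: the `HBMManager` class, the arbitrary eviction choice, and the fewest-queued-requests priority rule are all illustrative placeholders.

```python
import collections

class HBMManager:
    """Toy model of the two HBM-management decisions: what to keep in
    HBM, and which queued transfer gets the DRAM<->HBM bandwidth next.
    Illustrative only; the priority rule here is an assumption, not the
    paper's O(1)-competitive policy."""

    def __init__(self, hbm_capacity):
        self.hbm_capacity = hbm_capacity
        self.hbm = set()            # pages currently resident in HBM
        self.channel_queue = []     # pending (thread, page) transfer requests

    def access(self, thread_id, page):
        """Decision 1: which pages to keep in HBM."""
        if page in self.hbm:
            return "hit"
        if len(self.hbm) >= self.hbm_capacity:
            self.hbm.pop()          # placeholder eviction choice
        self.channel_queue.append((thread_id, page))
        return "miss"

    def service_one(self, policy="priority"):
        """Decision 2: which queued transfer uses the bandwidth next."""
        if not self.channel_queue:
            return None
        if policy == "fcfs":
            idx = 0                 # first-come-first-serve
        else:
            # Illustrative priority: serve the thread with the fewest
            # queued requests first, so no thread's makespan balloons
            # behind a request-heavy neighbor.
            counts = collections.Counter(t for t, _ in self.channel_queue)
            idx = min(range(len(self.channel_queue)),
                      key=lambda i: counts[self.channel_queue[i][0]])
        tid, page = self.channel_queue.pop(idx)
        self.hbm.add(page)
        return (tid, page)
```

Under FCFS, a thread that floods the queue can starve the others; the priority rule services the lightly loaded thread first, which is the flavor of imbalance the makespan objective penalizes.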


Published in

SPAA '20: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures
July 2020, 601 pages
ISBN: 9781450369350
DOI: 10.1145/3350755

Copyright © 2020 ACM. Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States

