ABSTRACT
This paper develops an algorithmic foundation for automated management of the multilevel-memory systems common to new supercomputers. In particular, the High-Bandwidth Memory (HBM) of these systems has latency similar to DRAM's and a smaller capacity, but much higher bandwidth. Systems equipped with HBM do not fit classic memory-hierarchy models because of HBM's atypical characteristics.
Unlike caches, which are generally managed automatically by the hardware, programmers of some current HBM-equipped supercomputers can choose to explicitly manage HBM themselves. This process is problem specific and resource intensive. Vendors offer this option because there is no consensus on how to automatically manage HBM to guarantee good performance, or whether this is even possible.
In this paper, we give theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems. HBM management is starkly different from traditional caching, both in its optimization objectives and in algorithm development. Since DRAM and HBM have similar latencies, minimizing HBM misses (provably) turns out not to be the right memory-management objective. Instead, we directly focus on minimizing makespan. In addition, while cache-management algorithms need only decide which pages to keep in cache, HBM management requires answering two questions: (1) which pages to keep in HBM and (2) how to use the limited bandwidth from HBM to DRAM. The natural approach of using LRU for the first question and FCFS (First-Come-First-Serve) for the second turns out to be provably bad. Instead, we provide a priority-based approach that is simple, efficiently implementable, and $O(1)$-competitive for makespan when all multicore threads are independent.
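To make the two decisions concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: each thread is assigned a fixed priority, pending DRAM-to-HBM transfers are served in priority order rather than FCFS, and eviction prefers pages owned by the lowest-priority thread. The class name `HBMManager` and its interface are hypothetical, invented only for this example.

```python
import heapq

class HBMManager:
    """Hypothetical priority-based manager (illustration only).
    Lower priority values are more urgent."""

    def __init__(self, hbm_capacity):
        self.capacity = hbm_capacity
        self.hbm = {}       # page -> priority of the owning thread
        self.pending = []   # min-heap of (priority, arrival_seq, page)
        self.seq = 0

    def request(self, page, priority):
        """A thread asks for a page; misses queue a transfer request."""
        if page in self.hbm:
            return "hit"
        heapq.heappush(self.pending, (priority, self.seq, page))
        self.seq += 1
        return "miss"

    def transfer_one(self):
        """Model one use of the limited DRAM->HBM channel: serve the
        highest-priority pending request (FCFS would instead serve in
        arrival order). Evict a low-priority page if HBM is full."""
        if not self.pending:
            return None
        prio, _, page = heapq.heappop(self.pending)
        if len(self.hbm) >= self.capacity:
            victim = max(self.hbm, key=self.hbm.get)  # lowest-priority owner
            del self.hbm[victim]
        self.hbm[page] = prio
        return page
```

The point of the sketch is the contrast with FCFS: under contention, the scarce DRAM-to-HBM bandwidth goes to the most urgent thread first, which is the kind of prioritization the paper's competitive analysis relies on.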
How to Manage High-Bandwidth Memory Automatically