Abstract
An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.
This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, whereas High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that analyzing the memory footprints of production HPC applications is complex: it requires an understanding of application scalability and of the target usage category, i.e., whether users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also identify applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases whose memory footprints could be served by 3D-stacked memory chiplets, taking a first step toward the adoption of this novel technology in the HPC domain.
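To illustrate why HPL is capacity-sensitive, the following minimal sketch (not taken from the paper) estimates the HPL problem size N that fits in a given per-node memory capacity, assuming the standard HPL memory model in which the dominant allocation is the dense double-precision N x N matrix (8 * N^2 bytes). The node capacities, fill fraction, and core count below are illustrative assumptions, not measurements from the study.

    # Minimal sketch: largest HPL problem size N whose matrix fits in a
    # fraction of node memory, and the resulting per-core matrix footprint.
    # All node parameters are hypothetical examples.
    import math

    def hpl_problem_size(mem_bytes, fill_fraction=0.8):
        """Largest N such that the 8*N^2-byte matrix uses at most fill_fraction of memory."""
        return int(math.sqrt(fill_fraction * mem_bytes / 8))

    for label, mem_gib, cores in [("3D-stacked node (assumed 16 GiB)", 16, 64),
                                  ("DIMM node (assumed 64 GiB)", 64, 64)]:
        mem_bytes = mem_gib * 2**30
        n = hpl_problem_size(mem_bytes)
        per_core_mib = 8 * n * n / cores / 2**20
        print(f"{label}: N ~ {n}, ~{per_core_mib:.0f} MiB/core for the HPL matrix")

Since HPL is typically run with N chosen to fill most of node memory, smaller 3D-stacked capacities directly cap the attainable problem size, consistent with the observation above that HPL is likely to be constrained by 3D memory capacity.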