Main Memory in HPC: Do We Need More or Could We Live with Less?

Published: 06 March 2017

Abstract

An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now.

This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that analyzing the memory footprints of production HPC applications is complex and requires an understanding of application scalability and of the target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases whose memory footprints could be served by 3D-stacked memory chiplets, taking a first step toward the adoption of this novel technology in the HPC domain.
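To make the capacity argument concrete, the following minimal sketch (not taken from the paper) shows the kind of back-of-the-envelope check the abstract implies: given a node's 3D-stacked memory capacity and core count, does a per-core footprint of hundreds of megabytes fit, and does a footprint of gigabytes per core exceed the budget? The 16 GB capacity and 68 cores are illustrative assumptions, loosely resembling a Knights Landing-style node with on-package MCDRAM.

```python
# Back-of-the-envelope check: does a per-core memory footprint fit in
# on-package 3D-stacked memory? The node parameters below are illustrative
# assumptions (16 GB of stacked memory shared by 68 cores), not figures
# reported in the paper.

STACKED_CAPACITY_GB = 16.0   # assumed on-package 3D-stacked memory per node
CORES_PER_NODE = 68          # assumed number of cores per node


def fits_in_stacked_memory(footprint_mb_per_core: float) -> bool:
    """Return True if the given per-core footprint fits within the
    per-core share of the node's stacked memory."""
    per_core_budget_mb = STACKED_CAPACITY_GB * 1024 / CORES_PER_NODE
    return footprint_mb_per_core <= per_core_budget_mb


# Hundreds of MB per core (typical of many applications in the study):
print(fits_in_stacked_memory(200))    # True: fits within the ~241 MB/core budget
# Gigabytes per core (the heavier use cases the study also detects):
print(fits_in_stacked_memory(2048))   # False: exceeds the per-core budget
```

Under these assumed node parameters, a footprint of a few hundred megabytes per core fits comfortably, while gigabyte-per-core use cases would still need conventional DIMM capacity, which mirrors the distinction the abstract draws between HPCG-like and HPL-like workloads.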

        • Published in

          ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 1 (March 2017), 258 pages
          ISSN: 1544-3566
          EISSN: 1544-3973
          DOI: 10.1145/3058793

          Copyright © 2017 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 March 2017
          • Accepted: 1 December 2016
          • Revised: 1 November 2016
          • Received: 1 May 2016
          Published in TACO Volume 14, Issue 1

          Qualifiers

          • research-article
          • Research
          • Refereed
