ABSTRACT
The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads. A promising paradigm for alleviating this data movement bottleneck is near-memory computing (NMC), which places compute units close to the memory/storage units. Substantial research effort proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. To enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models.
We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously-unseen applications.
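The overall approach — train an ensemble model on a small, well-chosen sample of simulated configurations, then predict performance for the rest of the design space — can be illustrated with a minimal sketch. This is not NAPEL's implementation: the feature names, the design space, and the synthetic "simulator" below are invented for illustration; only the model family (a random forest, which the paper builds on) comes from the source.

```python
# Hypothetical sketch of ensemble-learning-based performance prediction.
# All feature names and data here are invented; only the random-forest
# model family reflects the paper's approach.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Invented design space: each row is one NMC configuration plus an
# application characteristic.
# Columns: [num_cores, core_freq_GHz, memory_bw_GBps, ops_per_byte]
X = rng.uniform([1, 0.5, 50, 0.1], [16, 2.5, 320, 8.0], size=(200, 4))

# Synthetic stand-in for slow simulation: a roofline-like runtime
# (compute-bound vs. bandwidth-bound) with measurement noise.
compute_time = X[:, 3] / (X[:, 0] * X[:, 1])
memory_time = 1.0 / X[:, 2]
y = np.maximum(compute_time, memory_time) * (1 + 0.05 * rng.standard_normal(200))

# Train on a small subset (standing in for a design-of-experiments
# sample) and predict the held-out configurations.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:150], y[:150])
pred = model.predict(X[150:])
mape = float(np.mean(np.abs(pred - y[150:]) / y[150:]) * 100)
print(f"held-out MAPE: {mape:.1f}%")
```

In the real framework, the training labels would come from the NMC simulator rather than a closed-form model, and the training points would be chosen by a design-of-experiments technique rather than taken in order, so that few simulation runs cover the design space representatively.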