Abstract
On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.
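The core idea above is to identify the short chain of instructions connecting a cache-missing load to the dependent load it feeds. The following is a minimal conceptual sketch of that dependence-chain identification, not the paper's hardware mechanism: it assumes a hypothetical micro-op trace of `(opcode, dest_reg, src_regs)` tuples and tracks, via register taint, which ops consume the miss data until a dependent load is reached.

```python
def dependence_chain(trace, miss_dest):
    """Collect the (usually short) chain of micro-ops that transforms
    the value produced by a cache-missing load into the address of a
    dependent load.

    trace     -- list of (opcode, dest_reg, src_regs) tuples in program order
    miss_dest -- destination register of the first cache-missing load
    """
    tainted = {miss_dest}   # registers whose values derive from the miss data
    chain = []
    for op, dest, srcs in trace:
        if tainted & set(srcs):   # this op consumes miss data (transitively)
            chain.append((op, dest, srcs))
            tainted.add(dest)
            if op == "LOAD":      # dependent load reached: chain is complete
                break
    return chain


# Example: an ADD computes an address from the miss data in r1,
# an unrelated MUL is skipped, then a dependent LOAD uses that address.
trace = [
    ("ADD", "r2", ("r1", "r3")),
    ("MUL", "r5", ("r6", "r7")),
    ("LOAD", "r4", ("r2",)),
]
chain = dependence_chain(trace, "r1")
```

In this sketch `chain` contains only the ADD and the dependent LOAD; in the paper's proposal, an analogous (hardware-identified) chain is what gets migrated to the EMC so the dependent request can issue as soon as the miss data arrives from DRAM.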