Abstract
On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.
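The core idea above is to identify the short chain of instructions connecting a cache-missing load to the dependent load it feeds. The following is a minimal conceptual sketch of that dependence-chain identification, not the paper's hardware mechanism: it assumes a hypothetical micro-op trace of `(opcode, dest_reg, src_regs)` tuples and tracks, via register taint, which ops consume the miss data until a dependent load is reached.

```python
def dependence_chain(trace, miss_dest):
    """Collect the (usually short) chain of micro-ops that transforms
    the value produced by a cache-missing load into the address of a
    dependent load.

    trace     -- list of (opcode, dest_reg, src_regs) tuples in program order
    miss_dest -- destination register of the first cache-missing load
    """
    tainted = {miss_dest}   # registers whose values derive from the miss data
    chain = []
    for op, dest, srcs in trace:
        if tainted & set(srcs):   # this op consumes miss data (transitively)
            chain.append((op, dest, srcs))
            tainted.add(dest)
            if op == "LOAD":      # dependent load reached: chain is complete
                break
    return chain


# Example: an ADD computes an address from the miss data in r1,
# an unrelated MUL is skipped, then a dependent LOAD uses that address.
trace = [
    ("ADD", "r2", ("r1", "r3")),
    ("MUL", "r5", ("r6", "r7")),
    ("LOAD", "r4", ("r2",)),
]
chain = dependence_chain(trace, "r1")
```

In this sketch `chain` contains only the ADD and the dependent LOAD; in the paper's proposal, an analogous (hardware-identified) chain is what gets migrated to the EMC so the dependent request can issue as soon as the miss data arrives from DRAM.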