
Accelerating dependent cache misses with an enhanced memory controller

Published: 18 June 2016

Abstract

On-chip contention increases memory access latency for multicore processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory-intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.
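The latency-critical pattern the abstract targets is easiest to see in pointer-chasing code, where the address of one load is produced by the data returned from a previous load that missed in the cache. The C sketch below is a hypothetical illustration of such a dependent-miss chain; it is not code from the paper.

/*
 * Hypothetical illustration (not from the paper): a linked-list walk in
 * which each node's address comes from the data returned by the previous
 * load. If a node is not cached, the dereference of its `next` pointer
 * cannot be issued until the miss returns from DRAM, so consecutive
 * misses serialize -- the dependent-cache-miss pattern the EMC targets.
 */
#include <stddef.h>

struct node {
    struct node *next;   /* loaded from memory; supplies the next load's address */
    long payload;
};

long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *cur = head; cur != NULL; cur = cur->next) {
        /* Only a few instructions separate the load that produced `cur`
         * (the previous iteration's `cur->next`) from the dependent loads
         * below; the paper proposes executing such short dependent slices
         * at the memory controller as soon as the first miss's data
         * arrives from DRAM. */
        sum += cur->payload;
    }
    return sum;
}

Because so few instructions sit between the two misses, issuing the second request from the memory controller rather than the core is what yields the roughly 20% latency reduction the abstract reports.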


Published in

ACM SIGARCH Computer Architecture News, Volume 44, Issue 3 (ISCA '16), June 2016, 730 pages.
ISSN: 0163-5964
DOI: 10.1145/3007787

ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture, June 2016, 756 pages.
ISBN: 9781467389471

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

