ABSTRACT
This paper introduces the Irregular Stream Buffer (ISB), a prefetcher that targets irregular sequences of temporally correlated memory references. The key idea is to use an extra level of indirection to translate arbitrary pairs of correlated physical addresses into consecutive addresses in a new structural address space, which is visible only to the ISB. This structural address space allows the ISB to organize prefetching meta-data so that it is simultaneously temporally and spatially ordered, which yields benefits in coverage, accuracy, and memory traffic overhead.
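The indirection described above can be illustrated with a toy model. This is a minimal sketch rather than the paper's hardware design: the class name `StructuralMapper`, the `stream_span` parameter, the first-seen assignment policy, and the example addresses are all assumptions made for illustration. It only shows how physical addresses that occur back to back can be given consecutive structural addresses, so that a prefetch becomes a lookup of the next few structural addresses.

```python
class StructuralMapper:
    """Toy model of a physical-to-structural address indirection.

    Hypothetical simplification: an address keeps the first structural
    address it is given and is never remapped.
    """

    def __init__(self, stream_span=16):
        self.stream_span = stream_span   # structural slots reserved per stream (assumed)
        self.phys_to_struct = {}         # physical address -> structural address
        self.struct_to_phys = {}         # structural address -> physical address
        self.next_stream_base = 0        # base of the next unallocated stream
        self.prev_phys = None            # previously observed physical address

    def _bind(self, phys, struct):
        self.phys_to_struct[phys] = struct
        self.struct_to_phys[struct] = phys

    def train(self, phys):
        """Observe one access and extend the mapping if the address is new."""
        prev, self.prev_phys = self.prev_phys, phys
        if phys in self.phys_to_struct:
            return  # already mapped; this sketch never remaps
        if prev is not None:
            # Try to place the new address right after its temporal predecessor.
            candidate = self.phys_to_struct[prev] + 1
            in_same_stream = candidate % self.stream_span != 0
            if in_same_stream and candidate not in self.struct_to_phys:
                self._bind(phys, candidate)
                return
        # Otherwise start a fresh stream at the next free base.
        self._bind(phys, self.next_stream_base)
        self.next_stream_base += self.stream_span

    def predict(self, phys, degree=2):
        """Return up to `degree` physical addresses that follow `phys` structurally."""
        struct = self.phys_to_struct.get(phys)
        if struct is None:
            return []
        candidates = []
        for offset in range(1, degree + 1):
            nxt = self.struct_to_phys.get(struct + offset)
            if nxt is None:
                break
            candidates.append(nxt)
        return candidates


# An irregular but repeatable access sequence (made-up addresses).
mapper = StructuralMapper()
trace = [0x9F40, 0x1200, 0x77C0, 0x0340]
for addr in trace:
    mapper.train(addr)

# The scattered physical addresses now occupy consecutive structural slots,
# so the successors of the first address are found by counting upward.
predicted = mapper.predict(0x9F40, degree=3)
print([hex(a) for a in predicted])
```

In this sketch the two dictionaries stand in for the translation in each direction, and the policy for starting a new stream is deliberately naive.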
We evaluate the ISB using the Marss full system simulator and the irregular memory-intensive programs of SPEC CPU 2006 for both single-core and multi-core systems. For example, on a single core, the ISB exhibits an average speedup of 23.1% with 93.7% accuracy, compared to 9.9% speedup and 64.2% accuracy for an idealized prefetcher that over-approximates the STMS prefetcher, the previous best temporal stream prefetcher. This ISB prefetcher uses 32 KB of on-chip storage and incurs 8.4% memory traffic overhead due to meta-data accesses. We also show that a hybrid prefetcher that combines a stride prefetcher and an ISB with just 8 KB of on-chip storage exhibits 40.8% speedup and 66.2% accuracy.
Index Terms
- Linearizing irregular memory accesses for improved correlated prefetching