ABSTRACT
Most existing research on emerging multicore machines focuses on parallelism extraction and architecture-level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most previous data locality optimization techniques have been proposed and evaluated in the context of single-core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to the shared on-chip caches that most of them accommodate. To optimize data locality for multicore machines, however, the first step is to understand the data reuse characteristics of multithreaded applications and the potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we define inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core-centric) code/data optimizations exploit the inter-core data reuse available in multithreaded applications. Third, we demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that aggressively optimizing for inter-core reuse, without considering the impact of doing so on intra-core reuse, can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that carefully balances inter-core and intra-core reuse optimizations to maximize the benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality on multicores.
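To make the inter-core versus intra-core distinction concrete, the sketch below classifies reuses in an interleaved memory-access trace: a reuse is intra-core when the previous access to the same data block came from the same core, and inter-core otherwise. This is an illustrative approximation only; the trace format `(core_id, block_address)` and the classification rule are our assumptions, not the paper's exact definition.

```python
# Illustrative sketch (not the paper's exact definition): classify each
# data reuse by which core last touched the block being reused.

def classify_reuse(trace):
    """Given an interleaved trace of (core_id, block_address) pairs,
    return (intra_core_reuses, inter_core_reuses)."""
    last_core = {}      # block address -> core that touched it last
    intra = inter = 0
    for core, addr in trace:
        if addr in last_core:           # a reuse of a previously seen block
            if last_core[addr] == core:
                intra += 1              # same core touched it last
            else:
                inter += 1              # a different core touched it last
        last_core[addr] = core
    return intra, inter

# Example: two cores alternating on block 0x10, core 0 reusing block 0x20.
trace = [(0, 0x10), (1, 0x10), (0, 0x10), (0, 0x20), (0, 0x20)]
print(classify_reuse(trace))  # -> (1, 2)
```

A shared cache can, in principle, turn the inter-core reuses counted here into hits, whereas private per-core caches can exploit only the intra-core ones, which is why the balance between the two matters for the optimization strategy described above.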
Index Terms
- Studying inter-core data reuse in multicores