ABSTRACT
Most existing research on emerging multicore machines focuses on parallelism extraction and architecture-level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most previous data locality optimization techniques have been proposed and evaluated in the context of single-core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities due to the shared on-chip caches that most of them accommodate. To optimize data locality for multicore machines, however, the first step is to understand the data reuse characteristics of multithreaded applications and the potential benefits shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we define inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core-centric) code/data optimizations exploit the inter-core data reuse available in multithreaded applications. Third, we demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that aggressively optimizing for inter-core reuse, without considering the impact of doing so on intra-core reuse, can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that carefully balances inter-core and intra-core reuse optimizations to maximize the benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality on multicores.
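To make the inter-core versus intra-core distinction concrete, the sketch below classifies reuses in an interleaved memory-access trace: a reuse is intra-core when the previous access to the same data block came from the same core, and inter-core otherwise. This is an illustrative approximation only; the trace format `(core_id, block_address)` and the classification rule are our assumptions, not the paper's exact definition.

```python
# Illustrative sketch (not the paper's exact definition): classify each
# data reuse by which core last touched the block being reused.

def classify_reuse(trace):
    """Given an interleaved trace of (core_id, block_address) pairs,
    return (intra_core_reuses, inter_core_reuses)."""
    last_core = {}      # block address -> core that touched it last
    intra = inter = 0
    for core, addr in trace:
        if addr in last_core:           # a reuse of a previously seen block
            if last_core[addr] == core:
                intra += 1              # same core touched it last
            else:
                inter += 1              # a different core touched it last
        last_core[addr] = core
    return intra, inter

# Example: two cores alternating on block 0x10, core 0 reusing block 0x20.
trace = [(0, 0x10), (1, 0x10), (0, 0x10), (0, 0x20), (0, 0x20)]
print(classify_reuse(trace))  # -> (1, 2)
```

A shared cache can, in principle, turn the inter-core reuses counted here into hits, whereas private per-core caches can exploit only the intra-core ones, which is why the balance between the two matters for the optimization strategy described above.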
Index Terms
- Studying inter-core data reuse in multicores