ABSTRACT
Scientific applications face serious performance challenges on multicore processors, one of which is access contention in the last level shared cache among multiple running threads. The contention increases the number of long-latency memory accesses and consequently increases application execution times. Optimizing shared cache performance is critical to significantly reducing the execution times of multi-threaded programs on multicores. However, two unique problems must be solved before cache optimization techniques can be implemented on multicores at the user level. First, the cache space available to each running thread in a last level cache is difficult to predict due to access contention in the shared space, which makes cache-conscious algorithms designed for single cores ineffective on multicores. Second, at the user level, programmers cannot allocate cache space to running threads at will, so data sets with strong locality may not receive sufficient cache space, and cache pollution can easily occur. To address these two critical issues, we have designed ULCC (User Level Cache Control), a software runtime library that enables programmers to explicitly manage and optimize last level cache usage by allocating appropriate cache space to the different data sets of different threads. We have implemented ULCC at the user level based on a page-coloring technique for last level cache usage management. Through multiple case studies on an Intel multicore processor, we show that with ULCC, scientific applications can achieve significant performance improvements by fully exploiting the benefits of cache optimization algorithms and by partitioning the cache space to protect frequently reused data sets and to avoid cache pollution. Our experiments with various applications show that ULCC can improve application performance by nearly 40%.
ULCC: a user-level facility for optimizing shared cache performance on multicores. PPoPP '11.