DOI: 10.1145/1941553.1941568
research-article

ULCC: a user-level facility for optimizing shared cache performance on multicores

Published: 12 February 2011

ABSTRACT

Scientific applications face serious performance challenges on multicore processors, one of which is caused by access contention in last-level shared caches among multiple running threads. The contention increases the number of long-latency memory accesses and consequently increases application execution times. Optimizing shared cache performance is therefore critical to significantly reducing the execution times of multi-threaded programs on multicores. However, two unique problems must be solved before cache optimization techniques can be implemented on multicores at the user level. First, the cache space available to each running thread in a last-level cache is difficult to predict because of access contention in the shared space, which makes cache-conscious algorithms designed for single cores ineffective on multicores. Second, at the user level, programmers cannot allocate cache space at will to running threads in the shared cache; thus data sets with strong locality may not be allocated sufficient cache space, and cache pollution can easily happen. To address these two critical issues, we have designed ULCC (User Level Cache Control), a software runtime library that enables programmers to explicitly manage and optimize last-level cache usage by allocating proper cache space for different data sets of different threads. We have implemented ULCC at the user level based on a page-coloring technique for last-level cache usage management. By means of multiple case studies on an Intel multicore processor, we show that with ULCC, scientific applications can achieve significant performance improvements by fully exploiting the benefit of cache optimization algorithms and by partitioning the cache space accordingly to protect frequently reused data sets and to avoid cache pollution. Our experiments with various applications show that ULCC can improve application performance by nearly 40%.
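
The abstract names page coloring as the mechanism behind ULCC but does not spell it out. The general idea can be sketched as follows; this is a minimal illustration, not ULCC's actual API. It assumes a physically indexed, set-associative cache with plain modulo set indexing (real processors may hash addresses differently), 4 KiB pages, and a hypothetical 4 MiB, 16-way cache. The function names and the data-set names "reused" and "streaming" are invented for the example.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages


def num_page_colors(cache_size, associativity, page_size=PAGE_SIZE):
    """Number of page colors: bytes per cache way divided by the page size."""
    way_size, remainder = divmod(cache_size, associativity)
    if remainder or way_size < page_size or way_size % page_size:
        raise ValueError("way size must be a positive multiple of the page size")
    return way_size // page_size


def page_color(phys_addr, n_colors, page_size=PAGE_SIZE):
    """Color of the physical page holding phys_addr (modulo indexing assumed)."""
    return (phys_addr // page_size) % n_colors


def partition_colors(n_colors, shares):
    """Give each named data set a contiguous, disjoint range of colors.

    shares maps a (hypothetical) data-set name to its number of colors.
    """
    if sum(shares.values()) != n_colors:
        raise ValueError("shares must add up to the number of colors")
    ranges, start = {}, 0
    for name, count in shares.items():
        ranges[name] = range(start, start + count)
        start += count
    return ranges


if __name__ == "__main__":
    colors = num_page_colors(4 * 1024 * 1024, 16)  # hypothetical cache geometry
    print(colors)                                  # 64
    print(page_color(0x12345000, colors))          # 5
    print(partition_colors(colors, {"reused": 48, "streaming": 16}))
```

Pages of the same color compete for the same group of cache sets, so backing a frequently reused data set only with pages from its own color range keeps a streaming data set, confined to the remaining colors, from evicting it. How ULCC obtains pages of a given color from the operating system is not described in the abstract and is not modeled here.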


      Published in

        PPoPP '11: Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
        February 2011
        326 pages
        ISBN: 9781450301190
        DOI: 10.1145/1941553
        General Chair: Calin Cascaval
        Program Chair: Pen-Chung Yew

        ACM SIGPLAN Notices, Volume 46, Issue 8
        PPoPP '11
        August 2011
        300 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/2038037

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 230 of 1,014 submissions, 23%
