DOI: 10.1145/1993744.1993748
research-article

Studying inter-core data reuse in multicores

Published: 07 June 2011

ABSTRACT

Most existing research on emerging multicore machines focuses on parallelism extraction and architecture-level optimizations. While these optimizations are critical, complementary approaches such as data locality enhancement can also bring significant benefits. Most previous data locality optimization techniques have been proposed and evaluated in the context of single-core architectures. While one can expect these optimizations to be useful for multicore machines as well, multicores present further opportunities because most of them include shared on-chip caches. To optimize data locality for multicore machines, however, the first step is to understand the data reuse characteristics of multithreaded applications and the potential benefits that shared caches can bring. Motivated by these observations, we make the following contributions in this paper. First, we give a definition of inter-core data reuse and quantify it on multicores using a set of ten multithreaded application programs. Second, we show that neither the on-chip cache hierarchies of current multicore architectures nor state-of-the-art (single-core-centric) code/data optimizations exploit the inter-core data reuse available in multithreaded applications. Third, we demonstrate that exploiting all available inter-core reuse could boost overall application performance by around 21.3% on average, indicating that there is significant scope for optimization. However, we also show that optimizing aggressively for inter-core reuse without considering the impact on intra-core reuse can actually perform worse than optimizing for intra-core reuse alone. Finally, we present a novel, compiler-based data locality optimization strategy for multicores that carefully balances inter-core and intra-core reuse optimizations to maximize the benefits that can be extracted from shared caches. Our experiments with this strategy reveal that it is very effective in optimizing data locality in multicores.
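
The abstract does not state the paper's formal definition of inter-core data reuse, so the following is only a rough illustrative sketch under a simplifying assumption: a reuse is counted as inter-core when the previous access to the same cache block came from a different core, and as intra-core when it came from the same core. The function name `classify_reuse`, the trace format, and the 64-byte block size are hypothetical choices made for this example. Cache capacity and replacement are ignored entirely.

```python
from collections import Counter


def classify_reuse(trace, block_size=64):
    """Classify accesses in a globally ordered memory trace.

    trace: iterable of (core_id, byte_address) pairs.
    Returns a Counter with keys 'cold', 'intra_core', 'inter_core'.

    Assumption (not taken from the paper): a reuse is 'inter_core'
    when the most recent access to the same block was issued by a
    different core, and 'intra_core' otherwise.
    """
    last_core = {}  # cache block index -> core that touched it last
    counts = Counter()
    for core, addr in trace:
        block = addr // block_size
        prev = last_core.get(block)
        if prev is None:
            counts["cold"] += 1        # first touch of this block
        elif prev == core:
            counts["intra_core"] += 1  # same core touched it last
        else:
            counts["inter_core"] += 1  # a different core touched it last
        last_core[block] = core
    return counts


# Toy trace: two cores sharing blocks 0 and 1; block 2 is private to core 0.
trace = [(0, 0), (1, 8), (0, 128), (0, 130), (1, 64), (0, 70)]
counts = classify_reuse(trace)
print(dict(counts))
```

On a real trace, the ratio of `inter_core` to total reuses would give a first-order indication of how much sharing a shared cache could potentially capture, though a faithful measurement would also need to model cache size and reuse distance.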

Supplemental Material

metrics_1b_1.mp4 (MP4, 132.6 MB)

    • Published in

      SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
      June 2011
      376 pages
      ISBN: 9781450308144
      DOI: 10.1145/1993744

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 459 of 2,691 submissions, 17%
