DOI: 10.1145/1346281.1346317
research-article

Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs

Published: 01 March 2008

ABSTRACT

Extracting high performance from emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, some multi-threaded applications will benefit from the increased number of threads, whereas the performance of others will become limited by data-synchronization and off-chip bandwidth. For applications limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application is limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power.

This paper proposes Feedback-Driven Threading (FDT), a framework to dynamically control the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization. Our evaluation shows that SAT can reduce execution time by up to 66% and power by up to 78%. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, average execution time is reduced by 17% and power by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.
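
The abstract does not state how SAT and BAT compute their predictions, so the following Python sketch rests on a simple analytical model that is assumed here for illustration only. For SAT, it assumes execution time with P threads is `t_parallel / P + P * t_critical`, where the second term stands for serialized critical-section time. For BAT, it assumes bus utilization grows linearly with the thread count until it saturates. All function and parameter names are hypothetical, and the inputs stand in for values that a short sampling phase might read from performance counters.

```python
import math


def sat_threads(t_parallel, t_critical, num_cores):
    """Thread count minimizing the ASSUMED model T(P) = t_parallel / P + P * t_critical."""
    if t_critical <= 0:
        return num_cores  # no measurable serialization: use every core
    ideal = math.sqrt(t_parallel / t_critical)  # real-valued minimum of T(P)
    # T(P) is convex, so the best integer is the floor or the ceiling of the ideal.
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    best = min(candidates, key=lambda p: (t_parallel / p + p * t_critical, p))
    return min(best, num_cores)


def bat_threads(bus_utilization_one_thread, num_cores):
    """Fewest threads that saturate the bus, ASSUMING utilization scales linearly.

    bus_utilization_one_thread is a fraction in (0, 1] measured with one thread.
    """
    if bus_utilization_one_thread <= 0:
        return num_cores  # no bus traffic observed: bandwidth is not the limit
    needed = math.ceil(1.0 / bus_utilization_one_thread)
    return max(1, min(needed, num_cores))


def combined_threads(t_parallel, t_critical, bus_utilization_one_thread, num_cores):
    """Take the tighter of the two limits (an assumed way of combining SAT and BAT)."""
    return min(
        sat_threads(t_parallel, t_critical, num_cores),
        bat_threads(bus_utilization_one_thread, num_cores),
    )


if __name__ == "__main__":
    # Hypothetical sampled values, in arbitrary time units.
    print(sat_threads(100.0, 1.0, 32))
    print(bat_threads(0.25, 32))
    print(combined_threads(100.0, 1.0, 0.25, 32))
```

Under these assumptions, a kernel limited by synchronization and one limited by bandwidth both end up with fewer threads than cores, while a kernel limited by neither keeps one thread per core.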

    • Published in

      ACM Conferences
      ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
      March 2008
      352 pages
      ISBN: 9781595939586
      DOI: 10.1145/1346281
      • ACM SIGARCH Computer Architecture News, Volume 36, Issue 1
        ASPLOS '08
        March 2008
        339 pages
        ISSN: 0163-5964
        DOI: 10.1145/1353534
      • ACM SIGOPS Operating Systems Review, Volume 42, Issue 2
        ASPLOS '08
        March 2008
        339 pages
        ISSN: 0163-5980
        DOI: 10.1145/1353535
      • ACM SIGPLAN Notices, Volume 43, Issue 3
        ASPLOS '08
        March 2008
        339 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/1353536

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      ASPLOS XIII paper acceptance rate: 31 of 127 submissions, 24%. Overall acceptance rate: 535 of 2,713 submissions, 20%.
