Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs (ASPLOS '08)

ABSTRACT
Extracting high performance from the emerging Chip Multiprocessors (CMPs) requires that an application be divided into multiple threads, each executing on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, some multi-threaded applications will benefit from the additional threads, whereas others will become limited by data-synchronization or off-chip bandwidth. For applications limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application becomes limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power.
This paper proposes Feedback-Driven Threading (FDT), a framework that dynamically controls the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads based on the amount of data-synchronization. Our evaluation shows that SAT reduces execution time and power by up to 66% and 78%, respectively. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, execution time is reduced by 17% on average and power by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.
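To make the feedback-driven idea concrete, the sketch below shows one possible shape of such a controller. This is an illustrative assumption, not the paper's actual algorithm: the inputs (`sync_fraction`, a measured fraction of time spent waiting on locks, and `bus_utilization`, a measured fraction of off-chip bandwidth in use) stand in for whatever run-time counters an implementation would sample, and the thresholds are arbitrary placeholders.

```python
# Hypothetical sketch of a feedback-driven thread-count controller.
# Counter inputs and thresholds are illustrative assumptions, not the
# mechanism described in the paper.

MAX_THREADS = 32  # assumed core count of the CMP

def choose_thread_count(sync_fraction, bus_utilization, current_threads):
    """Pick the next thread count from run-time feedback.

    sync_fraction   -- fraction of time threads spend waiting on locks
    bus_utilization -- fraction of off-chip bandwidth currently in use
    """
    # Synchronization-aware: when threads mostly wait on each other,
    # adding threads only increases contention, so scale the count down.
    if sync_fraction > 0.5:
        return max(1, current_threads // 2)
    # Bandwidth-aware: once the off-chip bus is saturated, extra threads
    # burn on-chip power without improving performance, so hold steady.
    if bus_utilization > 0.9:
        return current_threads
    # Otherwise, grow the thread count to expose more concurrency.
    return min(MAX_THREADS, current_threads * 2)
```

In a real system the two measurements would come from hardware performance counters sampled by the threading library, and the decision would be re-evaluated at, for example, each parallel-region boundary.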