Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs (ASPLOS '08)

ABSTRACT
Extracting high performance from the emerging Chip Multiprocessors (CMPs) requires that an application be divided into multiple threads, each executing on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, some multi-threaded applications will benefit from the additional threads, whereas others will become limited by data-synchronization or off-chip bandwidth. For applications limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application becomes limited by data-synchronization, by bandwidth, or by neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power.
This paper proposes Feedback-Driven Threading (FDT), a framework that dynamically controls the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads based on the amount of data-synchronization. Our evaluation shows that SAT reduces execution time and power by up to 66% and 78%, respectively. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, execution time is reduced by 17% on average and power by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.
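To make the feedback-driven idea concrete, the sketch below shows one possible shape of such a controller. This is an illustrative assumption, not the paper's actual algorithm: the inputs (`sync_fraction`, a measured fraction of time spent waiting on locks, and `bus_utilization`, a measured fraction of off-chip bandwidth in use) stand in for whatever run-time counters an implementation would sample, and the thresholds are arbitrary placeholders.

```python
# Hypothetical sketch of a feedback-driven thread-count controller.
# Counter inputs and thresholds are illustrative assumptions, not the
# mechanism described in the paper.

MAX_THREADS = 32  # assumed core count of the CMP

def choose_thread_count(sync_fraction, bus_utilization, current_threads):
    """Pick the next thread count from run-time feedback.

    sync_fraction   -- fraction of time threads spend waiting on locks
    bus_utilization -- fraction of off-chip bandwidth currently in use
    """
    # Synchronization-aware: when threads mostly wait on each other,
    # adding threads only increases contention, so scale the count down.
    if sync_fraction > 0.5:
        return max(1, current_threads // 2)
    # Bandwidth-aware: once the off-chip bus is saturated, extra threads
    # burn on-chip power without improving performance, so hold steady.
    if bus_utilization > 0.9:
        return current_threads
    # Otherwise, grow the thread count to expose more concurrency.
    return min(MAX_THREADS, current_threads * 2)
```

In a real system the two measurements would come from hardware performance counters sampled by the threading library, and the decision would be re-evaluated at, for example, each parallel-region boundary.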