
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Authors:
Chi-Keung Luk (Intel Corporation, Hudson, MA)
Sunpyo Hong (Georgia Institute of Technology, Atlanta, GA)
Hyesoon Kim (Georgia Institute of Technology, Atlanta, GA)

MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, December 2009, Pages 45–55
https://doi.org/10.1145/1669112.1669121

Published: 12 December 2009

ABSTRACT

Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. For software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach relies on the programmer to specify this mapping manually and statically. This approach is not only labor-intensive but also unable to adapt to changes in the runtime environment, such as problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves, on average, a 25% reduction in execution time and a 20% reduction in energy consumption compared with static mappings for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
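
The abstract does not spell out how the work is divided, so the following is only an illustrative sketch of one way a runtime could pick a CPU/GPU split from measured timings. It assumes that execution time on each device grows roughly linearly with the number of items assigned to it; the function names `fit_linear` and `choose_cpu_fraction`, the linear model, and the sample timings are all hypothetical and are not taken from Qilin.

```python
def fit_linear(samples):
    """Least-squares fit of time = a + b * n from (n, seconds) samples.

    Assumption for this sketch: per-device time is roughly linear in n.
    """
    k = len(samples)
    mean_n = sum(n for n, _ in samples) / k
    mean_t = sum(t for _, t in samples) / k
    var_n = sum((n - mean_n) ** 2 for n, _ in samples)
    cov = sum((n - mean_n) * (t - mean_t) for n, t in samples)
    b = cov / var_n
    a = mean_t - b * mean_n
    return a, b


def choose_cpu_fraction(cpu_model, gpu_model, n):
    """Return the fraction of n items to run on the CPU.

    The split is chosen so that both devices are predicted to finish at
    the same time, which minimizes the larger of the two predicted times
    when both slopes are positive. The result is clamped to [0, 1], so a
    much slower device receives no work at all.
    """
    a_c, b_c = cpu_model
    a_g, b_g = gpu_model
    # Solve a_c + b_c * beta * n == a_g + b_g * (1 - beta) * n for beta.
    beta = (a_g - a_c + b_g * n) / ((b_c + b_g) * n)
    return min(1.0, max(0.0, beta))


# Hypothetical timings (items, seconds) from short training runs.
cpu_model = fit_linear([(100, 1.0), (200, 2.0)])
gpu_model = fit_linear([(100, 0.35), (200, 0.6)])
beta = choose_cpu_fraction(cpu_model, gpu_model, 1000)
print(f"CPU share: {beta:.3f}, GPU share: {1 - beta:.3f}")
```

Refitting the two models when the problem size or hardware changes is what would make such a split adaptive rather than static; a fixed, programmer-chosen fraction corresponds to the static mapping the abstract argues against.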


Published in
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
December 2009
601 pages
ISBN: 9781605587981
DOI: 10.1145/1669112
General Chairs: David Albonesi (Cornell), Margaret Martonosi (Princeton)
Program Chairs: David August (Princeton/Parakinetics), José Martínez (Cornell)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
adaptive
dynamic compilation
heterogeneous
mapping
multicore
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 484 of 2,242 submissions, 22%
