ABSTRACT
Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. For software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach relies on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also unable to adapt to changes in the runtime environment, such as problem size and hardware/software configuration. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping reduces execution time by 25% and energy consumption by 20% on average, relative to static mappings, for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
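The abstract does not spell out how the mapping is computed, so the following is only an illustrative sketch of one way work could be split between a CPU and a GPU. It assumes, purely for illustration, that the execution time on each device grows linearly with the amount of work assigned, that a few timed sample runs are available per device, and that both devices run concurrently. The function names `fit_line` and `choose_gpu_fraction` are hypothetical and are not part of Qilin's API.

```python
def fit_line(samples):
    """Least-squares fit of time = a + b * size over (size, time) samples.

    Assumption: a linear model is adequate; needs at least two distinct sizes.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def choose_gpu_fraction(cpu_model, gpu_model, n):
    """Return the fraction of n work items to give the GPU.

    The CPU gets the rest. The choice minimizes the predicted finish time,
    max(cpu_time, gpu_time), under the linear models (hypothetical helper).
    """
    a_c, b_c = cpu_model
    a_g, b_g = gpu_model

    def makespan(beta):
        # A device that receives no work is assumed to cost nothing.
        t_cpu = a_c + b_c * (1.0 - beta) * n if beta < 1.0 else 0.0
        t_gpu = a_g + b_g * beta * n if beta > 0.0 else 0.0
        return max(t_cpu, t_gpu)

    # Always consider the two static mappings: all-CPU and all-GPU.
    candidates = [0.0, 1.0]
    denom = (b_c + b_g) * n
    if denom > 0:
        # Split at which both devices are predicted to finish together.
        beta = (a_c - a_g + b_c * n) / denom
        if 0.0 < beta < 1.0:
            candidates.append(beta)
    return min(candidates, key=makespan)


if __name__ == "__main__":
    # Made-up sample timings (size, seconds), for illustration only.
    cpu_model = fit_line([(100, 0.4), (200, 0.8)])
    gpu_model = fit_line([(100, 0.2), (200, 0.3)])
    for size in (10, 1000):
        print(size, choose_gpu_fraction(cpu_model, gpu_model, size))
```

With these made-up timings the GPU has a fixed startup cost, so the sketch keeps small problems entirely on the CPU and shifts most of a large problem to the GPU. This mirrors, in a simplified form, the kind of adaptation to input problem size that the abstract describes.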
Index Terms
- Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping