DOI: 10.1145/1669112.1669121
research-article

Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Published: 12 December 2009

ABSTRACT

Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. For software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach relies on the programmer to specify this mapping manually and statically. This approach is labor-intensive and cannot adapt to changes in the runtime environment, such as problem size and hardware/software configuration. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping reduces execution time by 25% and energy consumption by 20% on average, compared with static mappings, for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
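
The abstract does not spell out how the mapping is computed, so the following is only a minimal sketch of one way a CPU/GPU work split could adapt to problem size. It assumes that execution time on each device is roughly linear in the number of work items, that the model is fitted from a few timed sample runs, and that the two devices run concurrently. The names `fit_line` and `choose_cpu_fraction` are hypothetical and are not part of Qilin's API.

```python
def fit_line(sizes, times):
    """Least-squares fit of time = intercept + slope * size (assumed linear model)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def choose_cpu_fraction(cpu_model, gpu_model, n):
    """Pick the fraction of n work items to run on the CPU.

    Each model is an (intercept, slope) pair. The devices are assumed to run
    concurrently, so the predicted time of a split is the slower of the two.
    Returns (cpu_fraction, predicted_time).
    """
    a_c, b_c = cpu_model
    a_g, b_g = gpu_model

    def predicted(beta):
        # A device that receives no work is assumed to cost nothing.
        cpu = a_c + b_c * beta * n if beta > 0 else 0.0
        gpu = a_g + b_g * (1 - beta) * n if beta < 1 else 0.0
        return max(cpu, gpu)

    # Candidates: GPU only, CPU only, and the split where both finish together.
    candidates = [0.0, 1.0]
    denom = (b_c + b_g) * n
    if denom > 0:
        balanced = (a_g - a_c + b_g * n) / denom
        if 0 < balanced < 1:
            candidates.append(balanced)

    best = min(candidates, key=predicted)
    return best, predicted(best)


# Hypothetical timings (seconds) from sample runs at three problem sizes.
sizes = [100, 200, 400]
cpu_model = fit_line(sizes, [3.0, 6.0, 12.0])
gpu_model = fit_line(sizes, [1.0, 2.0, 4.0])
cpu_fraction, predicted_time = choose_cpu_fraction(cpu_model, gpu_model, 1000)
```

Because the split is recomputed from the fitted models for each problem size, a small input on a GPU with a large fixed startup cost may be sent entirely to the CPU, whereas a large input is shared between both devices. A static mapping would fix one choice for all cases.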

Published in

MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
December 2009
601 pages
ISBN: 9781605587981
DOI: 10.1145/1669112

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 484 of 2,242 submissions, 22%
