Research Article (Best Paper)
DOI: 10.1145/2967938.2967946

Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

Published: 11 September 2016

ABSTRACT

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best-effort basis. Our algorithms take advantage of data-flow-style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% local memory accesses on a 192-core system with 24 NUMA nodes, up to 5x higher performance than NUMA-aware hierarchical work-stealing, and even 5.6x compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot keep up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
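The abstract describes the placement mechanism but the page carries no code, so the following is a minimal, hypothetical C sketch of the core idea, not the authors' runtime: bind a task's output buffer to the NUMA node of the core that will consume it, so that accesses to task output data stay node-local. It assumes a Linux system with libnuma (link with -lnuma); the helper alloc_output_for_consumer and the way the consumer core is chosen are illustrative inventions.

#define _GNU_SOURCE
#include <numa.h>    /* numa_available, numa_node_of_cpu, numa_alloc_* */
#include <sched.h>   /* sched_getcpu */
#include <stdio.h>

/* Hypothetical helper: allocate 'size' bytes on the NUMA node hosting
 * 'consumer_core', falling back to the caller's node if the lookup fails. */
static void *alloc_output_for_consumer(size_t size, int consumer_core)
{
    int node = numa_node_of_cpu(consumer_core);
    if (node < 0)
        return numa_alloc_local(size);    /* fallback: caller's own node */
    return numa_alloc_onnode(size, node); /* pages bound to 'node' */
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Stand-in for a scheduling decision: a real runtime would derive the
     * consuming task's core from inter-task dependence information; here
     * we simply pretend the consumer runs on the current core. */
    int consumer_core = sched_getcpu();

    size_t size = 1 << 20;  /* 1 MiB output buffer */
    double *out = alloc_output_for_consumer(size, consumer_core);
    if (out == NULL)
        return 1;

    out[0] = 42.0;  /* producer writes; consumer later reads locally */

    numa_free(out, size);
    return 0;
}

In the paper, this decision is taken automatically by the run-time system from the dependence information it already has; the sketch only makes the allocation step concrete.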


Published in:

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016, 474 pages
ISBN: 978-1-4503-4121-9
DOI: 10.1145/2967938
Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

PACT '16 paper acceptance rate: 31 of 119 submissions (26%). Overall acceptance rate: 121 of 471 submissions (26%).
