ABSTRACT
Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing, and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best-effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adaptive to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% local memory accesses on a 192-core system with 24 NUMA nodes, up to 5x higher performance than NUMA-aware hierarchical work-stealing, and up to 5.6x compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot keep up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
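The two placement mechanisms named in the abstract can be pictured concretely: output buffers are privatized and backed by memory on the executing core's own NUMA node, while input locality is a scheduling preference derived from where the input pages already reside. The following C sketch illustrates both ideas using the Linux libnuma API (link with -lnuma); it is a minimal illustration under that assumption, not the paper's runtime, and the helper names (alloc_output_local, preferred_node_for_input) are hypothetical.

```c
/*
 * Sketch of the two placement mechanisms described in the abstract,
 * assuming a Linux system with libnuma. Helper names are hypothetical.
 */
#include <numa.h>
#include <numaif.h>
#include <stdlib.h>

/* Output placement: back a task's privatized output buffer with memory
 * on the executing core's NUMA node, so every write to task output data
 * is a local access. Buffers from numa_alloc_local() must later be
 * released with numa_free(ptr, size). */
static void *alloc_output_local(size_t size)
{
    if (numa_available() < 0)
        return malloc(size);         /* non-NUMA fallback */
    return numa_alloc_local(size);   /* pages placed on the caller's node */
}

/* Input placement heuristic: query which node currently holds a task's
 * input buffer, so the scheduler can prefer (best effort) a worker on
 * that node when dispatching the consumer task. */
static int preferred_node_for_input(void *input)
{
    int status = -1;
    /* With a NULL node array, numa_move_pages() only reports where the
     * given pages currently reside; it does not migrate them. */
    if (numa_move_pages(0 /* calling process */, 1, &input,
                        NULL, &status, 0) != 0)
        return -1;                   /* unknown: let the scheduler decide */
    return status;                   /* NUMA node id, or a negative errno */
}
```

In a data-flow runtime of the kind the abstract describes, alloc_output_local would be invoked when a worker creates a task's output buffer, and preferred_node_for_input would feed the work-pushing or stealing decision for the consumer task; both calls are cheap enough to make per task.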