ABSTRACT
Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If left unchecked, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to strike this compromise by optimizing their code manually, a process that is labor intensive, requires deep expertise, and degrades code quality. Recent work on heartbeat scheduling offers a promising alternative: it manifests the potentially vast amounts of available, latent parallelism at a regular rate, on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has had no known practical implementation.
In this paper, we propose a practical approach to heartbeat scheduling that equips the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts so that parallelism can remain latent until a heartbeat, when it is manifested at low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and realize the abstract machine as a set of compiler transformations for C/C++ code together with a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
Index Terms: Task parallel assembly language for uncompromising parallelism