ABSTRACT
Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If left unchecked, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to strike this compromise by optimizing their code manually, a process that is labor intensive, requires deep expertise, and degrades code quality. Recent work on heartbeat scheduling offers a promising alternative: it manifests the potentially vast amounts of available, latent parallelism at a regular rate, on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has had no known practical implementation.
In this paper, we propose a practical approach to heartbeat scheduling that equips the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts so that parallelism can remain latent until a heartbeat, when it is manifested at low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and realize the abstract machine as a set of compiler transformations for C/C++ code together with a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
Index Terms: Task parallel assembly language for uncompromising parallelism