DOI: 10.1145/3453483.3460969

Research Article · Open Access · Artifacts Evaluated & Functional v1.1

Task parallel assembly language for uncompromising parallelism

Published: 18 June 2021

ABSTRACT

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation.

In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.


Published in

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation
June 2021, 1341 pages
ISBN: 9781450383912
DOI: 10.1145/3453483
Copyright © 2021 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 406 of 2,067 submissions, 20%
