ABSTRACT
Helper locks allow programs with large parallel critical sections, called parallel regions, to execute more efficiently by enlisting processors that might otherwise be waiting on the helper lock to aid in the execution of the parallel region. Suppose that a processor p is executing a parallel region A after having acquired the lock L protecting A. If another processor p′ tries to acquire L, then instead of blocking and waiting for p to complete A, processor p′ joins p to help it complete A. Additional processors not blocked on L may also help to execute A.
The HELPER runtime system can execute fork-join computations augmented with helper locks and parallel regions. HELPER supports the unbounded nesting of parallel regions. We provide theoretical completion-time and space-usage bounds for a design of HELPER based on work stealing. Specifically, let V be the number of parallel regions in a computation, let T1 be its work, and let T∞ be its "aggregate span" --- the sum of the spans (critical-path lengths) of all its parallel regions. We prove that HELPER completes the computation in expected time O(T1/P + T∞ + PV) on P processors. This bound indicates that programs with a small number of highly parallel critical sections can attain linear speedup. For the space bound, we prove that HELPER completes a program using only O(PS1) stack space, where S1 is the sum, over all regions, of the stack space used by each region in a serial execution. Finally, we describe a prototype of HELPER implemented by modifying the Cilk multithreaded runtime system. We used this prototype to implement a concurrent hash table with a resize operation protected by a helper lock.
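The helper-lock discipline described in the abstract can be sketched in ordinary threaded code. The following toy Python class is an illustration only, not the paper's Cilk-based HELPER runtime; the class and method names are hypothetical. The key idea it demonstrates is that a thread finding the lock busy pulls queued subtasks of the active parallel region and executes them, rather than blocking.

```python
import threading
import time
from collections import deque

class HelperLock:
    """Toy sketch of helper-lock semantics (not the paper's HELPER runtime):
    a thread that finds the lock held executes subtasks of the active
    parallel region instead of blocking on the lock."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._done = threading.Condition(self._mutex)
        self._tasks = deque()   # unstarted subtasks of the active region
        self._owner = None      # thread currently holding the lock
        self._pending = 0       # subtasks queued but not yet finished

    def _help_one(self):
        """Run one queued subtask; return False if none was available."""
        with self._mutex:
            if not self._tasks:
                return False
            task = self._tasks.popleft()
        task()                  # run the subtask outside the mutex
        with self._done:
            self._pending -= 1
            if self._pending == 0:
                self._done.notify_all()
        return True

    def run_region(self, subtasks):
        """Acquire the lock, helping while it is busy, then execute the
        parallel region made up of `subtasks` (a list of callables)."""
        while True:
            with self._mutex:
                if self._owner is None:       # lock is free: take it
                    self._owner = threading.current_thread()
                    self._pending = len(subtasks)
                    self._tasks.extend(subtasks)
                    break
            # Lock is busy: help the current region instead of waiting.
            if not self._help_one():
                time.sleep(0.001)             # nothing to help with; retry
        # As the region's owner, drain its subtasks (helpers may chip in).
        while self._help_one():
            pass
        with self._done:
            while self._pending:              # wait for in-flight helpers
                self._done.wait()
            self._owner = None                # release the lock
```

With a conventional mutex, the losing thread would idle until the critical section completed; here it spends that time on the region's own subtasks, which is the intuition behind charging waiting processors' cycles to the region's work in the completion-time bound.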