Article

Parallel depth first vs. work stealing schedulers on CMP architectures

Authors:
Vasileios Liaskovitis (Carnegie Mellon University)
Shimin Chen (Intel Research Pittsburgh)
Phillip B. Gibbons (Intel Research Pittsburgh)
Anastassia Ailamaki (Carnegie Mellon University)
Guy E. Blelloch (Carnegie Mellon University)
Babak Falsafi (Carnegie Mellon University)
Limor Fix (Intel Research Pittsburgh)
Nikos Hardavellas (Carnegie Mellon University)
Michael Kozuch (Intel Research Pittsburgh)
Todd C. Mowry (Carnegie Mellon University and Intel Research Pittsburgh)
Chris Wilkerson (Intel Microprocessor Research Lab)

SPAA '06: Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
July 2006
Pages 330
https://doi.org/10.1145/1148109.1148167

Published: 30 July 2006

ABSTRACT

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.

Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because, when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.

CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies shrink from 90 nm to 32 nm.

Summary of findings. We studied a variety of benchmark programs, with the following findings.

For several application classes, PDF enables significant constructive sharing between threads, leading to better utilization of the on-chip caches and less off-chip traffic than WS. In particular, bandwidth-limited irregular programs and parallel divide-and-conquer programs achieve a relative speedup of 1.3-1.6X over WS, with a 13-41% reduction in off-chip traffic. An example is shown in Figure 1, for parallel merge sort. For each scheduler, the number of L2 misses (i.e., the off-chip traffic) is shown on the left and the speedup over running on one core is shown on the right, for 1 to 32 cores. Note that reducing the off-chip traffic has the additional benefit of reducing power consumption. Moreover, PDF's smaller working sets provide opportunities to power down segments of the cache without increasing the running time. Furthermore, when multiple programs are active concurrently, the PDF version is less of a cache hog, and its smaller working set is more likely to remain in the cache across context switches.

For several other application classes, PDF and WS have roughly the same execution times, either because there is only limited data reuse that can be exploited or because the programs are not limited by off-chip bandwidth. In the latter case, the constructive sharing that PDF enables still provides the power and multiprogramming benefits discussed above.

Finally, most parallel benchmarks to date, written for SMPs, use such coarse-grained threading that they cannot exploit the constructive cache behavior inherent in PDF. We find that mechanisms for making multithreaded applications fine-grained are crucial to achieving good performance on CMPs.
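As a rough illustration of the two policies described in the overview of schedulers, the sketch below shows how each one might choose the next task to run. It is not the implementation evaluated in the study. The names pdf_pick, ws_next, and seq_index are hypothetical, and two details are assumptions rather than statements from the abstract: a core takes local work from the top of its own queue, and queues are scanned in index order when looking for a victim.

```python
import heapq
from collections import deque


def pdf_pick(ready, num_cores):
    """PDF sketch: choose up to num_cores tasks from one shared ready pool.

    ready is a list of (seq_index, task_name) pairs, where seq_index is the
    (hypothetical) position of the task in the sequential execution order.
    Tasks the sequential program would run earlier get higher priority.
    """
    return [name for _, name in heapq.nsmallest(num_cores, ready)]


def ws_next(queues, core):
    """WS sketch: return the next task for `core`, or None if no work exists.

    queues[i] is the local deque of core i. The right end is treated as the
    top and the left end as the bottom (an assumed convention).
    """
    if queues[core]:
        return queues[core].pop()  # local work: take from the top
    for victim in range(len(queues)):
        if victim != core and queues[victim]:
            # Local queue is empty: steal from the bottom of the first
            # non-empty queue found.
            return queues[victim].popleft()
    return None


# PDF: two cores receive the two earliest tasks in sequential order.
ready_tasks = [(5, "t5"), (1, "t1"), (3, "t3"), (2, "t2")]
print(pdf_pick(ready_tasks, 2))

# WS: core 0 works from the top of its own queue; core 1 has nothing
# local, so it steals from the bottom of core 0's queue.
local_queues = [deque(["a", "b", "c"]), deque()]
print(ws_next(local_queues, 0))
print(ws_next(local_queues, 1))
```

In this toy setting, PDF hands the cores tasks that are adjacent in the sequential order, which is the property the abstract credits for overlapping working sets. Under WS, the stolen task comes from the opposite end of the victim's queue from where the victim itself is working, so the two cores tend to touch different data.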

References

[1] G. E. Blelloch and P. B. Gibbons. Effectively sharing a cache among threads. In Proc. ACM SPAA, 2004.
[2] G. E. Blelloch, P. B. Gibbons, and Y. Matias. Provably efficient scheduling for languages with fine-grained parallelism. JACM, 46(2), 1999.
[3] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. JACM, 46(5), 1999.
[4] V. Liaskovitis, S. Chen, P. B. Gibbons, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, M. Kozuch, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. Intel Research Pittsburgh tech. rep., June 2006.

Index Terms

Parallel depth first vs. work stealing schedulers on CMP architectures
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multithreading
        Scheduling

Published in
SPAA '06: Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures
July 2006
344 pages
ISBN: 1595934529
DOI: 10.1145/1148109
General Chair: Phillip B. Gibbons, Intel Research, USA
Program Chair: Uzi Vishkin, UMIACS, University of Maryland, USA
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
caches
chip multiprocessors
scheduling
Acceptance Rates
Overall Acceptance Rate: 447 of 1,461 submissions, 31%