ABSTRACT
We propose, implement, and evaluate a work-stealing scheduler, called HWS, for graph processing on heterogeneous CPU-FPGA systems that tightly couple the CPU and the FPGA to share system memory. HWS addresses unique concerns that arise with work stealing on our target system. We evaluate HWS on the Intel Heterogeneous Architecture Research Platform (HARPv2), using three key graph processing kernels and seven real-world graphs. We show that HWS effectively balances workloads. Further, HWS delivers better graph processing performance than static scheduling and a representative of existing adaptive partitioning techniques, called HAP. Improvements vary with the graph processing application, the input graph, and the number of threads, reaching up to 100% over static scheduling and up to 17% over HAP. We also compare HWS to an oracle chunk self-scheduler, in which the best chunk size is known a priori for each thread count and each input graph; HWS performs within 1-3% of this oracle in most cases. Finally, graph processing throughput scales well with increasing threads. These results collectively demonstrate the effectiveness of work stealing for graph processing on our heterogeneous target platform.
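To make the core technique concrete, the following is a minimal work-stealing sketch, not the paper's HWS implementation: each worker owns a deque of tasks, pops work from its own end, and, when its deque runs dry, steals from the opposite end of a randomly chosen victim's deque. All names here (`Worker`, `steal`, the demo workload) are illustrative assumptions, not taken from the paper.

```python
# Minimal work-stealing sketch (illustrative; NOT the paper's HWS scheduler).
# Each worker pops tasks LIFO from its own deque and steals FIFO from others.
import random
import threading
from collections import deque

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()           # local task deque
        self.lock = threading.Lock()   # guards both ends of the deque

    def push(self, task):
        with self.lock:
            self.tasks.append(task)

    def pop(self):
        # Owner takes from its own end (LIFO).
        with self.lock:
            return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thieves take from the opposite end (FIFO).
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

def run(workers, me):
    others = [w for w in workers if w is not me]
    while True:
        task = me.pop()
        if task is None:
            # Local deque empty: try to steal from a random victim.
            random.shuffle(others)
            for victim in others:
                task = victim.steal()
                if task is not None:
                    break
            if task is None:
                return  # no work anywhere; terminate (simplified)
        task()  # process the unit of work (e.g., a chunk of vertices/edges)

# Demo: 4 workers, all 100 tasks initially on worker 0 (a skewed load).
workers = [Worker(i) for i in range(4)]
done = []
for i in range(100):
    workers[0].push(lambda i=i: done.append(i))
threads = [threading.Thread(target=run, args=(workers, w)) for w in workers]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done))  # 100: all tasks complete despite the initial imbalance
```

A real scheduler for the paper's setting would additionally handle FPGA/CPU throughput asymmetry and a proper termination protocol; this sketch only shows the deque discipline that lets idle workers rebalance a skewed load.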