ABSTRACT
We propose, implement, and evaluate a work-stealing scheduler, called HWS, for graph processing on heterogeneous CPU-FPGA systems that tightly couple the CPU and the FPGA to share system memory. HWS addresses unique concerns that arise with work stealing on our target system. We evaluate HWS on the Intel Heterogeneous Architecture Research Platform (HARPv2), using three key graph processing kernels and seven real-world graphs. We show that HWS effectively balances workloads. Further, HWS delivers better graph processing performance than static scheduling and a representative of existing adaptive partitioning techniques, called HAP. Improvements vary with the graph processing application, the input graph, and the number of threads, reaching up to 100% over static scheduling and up to 17% over HAP. We also compare HWS to an oracle chunk self-scheduler, in which the best chunk size is known a priori for each thread count and each input graph; HWS performs within 1-3% of this oracle in most cases. Finally, graph processing throughput scales well with increasing threads. These results collectively demonstrate the effectiveness of work stealing for graph processing on our heterogeneous target platform.
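To make the core technique concrete, the following is a minimal work-stealing sketch, not the paper's HWS implementation: each worker owns a deque of tasks, pops work from its own end, and, when its deque runs dry, steals from the opposite end of a randomly chosen victim's deque. All names here (`Worker`, `steal`, the demo workload) are illustrative assumptions, not taken from the paper.

```python
# Minimal work-stealing sketch (illustrative; NOT the paper's HWS scheduler).
# Each worker pops tasks LIFO from its own deque and steals FIFO from others.
import random
import threading
from collections import deque

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()           # local task deque
        self.lock = threading.Lock()   # guards both ends of the deque

    def push(self, task):
        with self.lock:
            self.tasks.append(task)

    def pop(self):
        # Owner takes from its own end (LIFO).
        with self.lock:
            return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thieves take from the opposite end (FIFO).
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

def run(workers, me):
    others = [w for w in workers if w is not me]
    while True:
        task = me.pop()
        if task is None:
            # Local deque empty: try to steal from a random victim.
            random.shuffle(others)
            for victim in others:
                task = victim.steal()
                if task is not None:
                    break
            if task is None:
                return  # no work anywhere; terminate (simplified)
        task()  # process the unit of work (e.g., a chunk of vertices/edges)

# Demo: 4 workers, all 100 tasks initially on worker 0 (a skewed load).
workers = [Worker(i) for i in range(4)]
done = []
for i in range(100):
    workers[0].push(lambda i=i: done.append(i))
threads = [threading.Thread(target=run, args=(workers, w)) for w in workers]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done))  # 100: all tasks complete despite the initial imbalance
```

A real scheduler for the paper's setting would additionally handle FPGA/CPU throughput asymmetry and a proper termination protocol; this sketch only shows the deque discipline that lets idle workers rebalance a skewed load.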