skip to main content
10.1145/3404397.3404433acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems

Published:17 August 2020Publication History

ABSTRACT

We propose, implement and evaluate a work stealing based scheduler, called HWS, for graph processing on heterogeneous CPU-FPGA systems that tightly couple the CPU and the FPGA to share system memory. HWS addresses unique concerns that arise with work stealing in the context of our target system. Our evaluation is conducted on the Intel Heterogeneous Architecture Research Platform (HARPv2), using three key processing kernels and seven real-world graphs. We show that HWS effectively balances workloads. Further, the use of HWS results in better graph processing performance compared to static scheduling and a representative of existing adaptive partitioning techniques, called HAP. Improvements vary by graph processing application, input graph and number of threads, and can be up to 100% over static scheduling, and up to 17% over HAP. We also compare to an oracle chunk self-scheduler, in which the best chunk size is known a priori for each number of threads and each input graph. HWS performs no worse than 1-3% in most cases. Finally, our graph processing throughput scales well with increasing threads. These results collectively demonstrate the effectiveness of work stealing for graph processing on our heterogeneous target platform.

References

  1. U. A. Acar, A. Chargueraud, and M. Rainey. 2013. Scheduling Parallel Programs by Work Stealing with Private Deques. In Proc. of Symp. on Principles and Practice of Parallel Programming. 219–228.Google ScholarGoogle Scholar
  2. M. E. Belviranli, L. N. Bhuyan, and R. Gupta. 2013. A Dynamic Self-Scheduling Scheme for Heterogeneous Multiprocessor Architectures. ACM Trans. Archit. Code Optim. 9, 4 (2013).Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. D. Blumofe and Ch. E. Leiserson. 1999. J. ACM 46, 5 (Sept. 1999), 720–748.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. Chacko and S. Ranganathan. 2011. Graphs in Bioinformatics. In Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications, A. Y. Zomaya and M. Elloumi (Eds.). O’Reily, Chapter 10.Google ScholarGoogle Scholar
  5. G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen. 2008. Solving Large, Irregular Graph Problems Using Adaptive Work-Stealing. In Proc. of Int’l Conf. on Parallel Processing. 536–545.Google ScholarGoogle Scholar
  6. G. Dai, Y. Chi, Y. Wang, and H. Yang. 2016. FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search. In Proc. of Symp. on Field-Programmable Gate Arrays. 105–110.Google ScholarGoogle Scholar
  7. J. Dinan, D. B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha. 2009. Scalable Work Stealing. In Proc. of Conf. on High Performance Computing Networking, Storage and Analysis. 53:1–53:11.Google ScholarGoogle Scholar
  8. N. Engelhardt and H. K. So. 2016. GraVF: A vertex-centric distributed graph processing framework on FPGAs. In Proc. of Int’l Conf. on Field Programmable Logic and Applications (FPL). 1–4.Google ScholarGoogle ScholarCross RefCross Ref
  9. M. Frigo, C.E. Leiserson, and K.H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In Proc. of Conf. on Programming Language Design and Implementation. 212–223.Google ScholarGoogle Scholar
  10. M. D. Galanis, A. Milidonis, G. Theodoridis, D. Soudris, and C. E. Goutis. 2005. A partitioning methodology for accelerating applications in hybrid reconfigurable platforms. In Proc. of Design, Automation and Test in Europe. 247–252 Vol. 3.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Graph 500. 2019. Graph500 Benchmarks. http://www.graph500.orgGoogle ScholarGoogle Scholar
  12. P. Gupta. 2015. Xeon+FPGA Platform for the Data Center. http://www.ece.cmu.edu/~calcm/carl/doku.php?id=pk_gupta_intel_xeon_fpga_platform_for_the_data_centerGoogle ScholarGoogle Scholar
  13. D. Hendler and N. Shavit. 2002. Non-Blocking Steal-Half Work Queues. In Proc. of Symp. on Principles of Distributed Computing. 280–289.Google ScholarGoogle Scholar
  14. C. Hong, A. Sukumaran-Rajam, J. Kim, and P. Sadayappan. 2017. MultiGraph: Efficient Graph Processing on GPUs. In Proc. of Parallel Architectures and Compilation Techniques.Google ScholarGoogle Scholar
  15. Intel Corp.2019. Intel Acceleration Stack for Intel Xeon CPU with FPGAs Core Cache Interface (CCI-P) Reference Manual. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl-ias-ccip.pdfGoogle ScholarGoogle Scholar
  16. Intel Corp.2019. The Open Programmable Acceleration Engine (OPAE). https://01.org/opaeGoogle ScholarGoogle Scholar
  17. Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. 2013. Mizan: A System for Dynamic Load Balancing in Large-Scale Graph Processing. In Proc. of the European Conference on Computer Systems. 169–182.Google ScholarGoogle Scholar
  18. H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media?. In Proc. of int’l Conf. on World Wide Web. 591–600.Google ScholarGoogle Scholar
  19. J. Leskovec and A. Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.Google ScholarGoogle Scholar
  20. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. 2012. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow. 5, 8 (2012), 716–727.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Malewicz, M. H. Austern, A. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proc. of Int’l Conf. on Management of Data. 135–146.Google ScholarGoogle Scholar
  22. D. Merrill, M. Garland, and A. Grimshaw. 2012. Scalable GPU Graph Traversal. In Proc. of Symp. on Principles and Practice of Parallel Programming. 117–128.Google ScholarGoogle Scholar
  23. R. Nakashima, H. Yoritaka, M. Yasugi, T. Hiraishi, and S. Umatani. 2019. Extending a Work-Stealing Framework with Priorities and Weights. In Proc. of Workshop on Irregular Applications: Architectures and Algorithms. 9–16.Google ScholarGoogle Scholar
  24. A. Navarro, F. Corbera, A. Rodriguez, A. Vilches, and R. Asenjo. 2019. Heterogeneous Parallel_for Template for CPU—GPU Chips. J. Parallel Programming 47, 2 (April 2019), 213–233.Google ScholarGoogle Scholar
  25. D. Nguyen, A. Lenharth, and K. Pingali. 2013. A Lightweight Infrastructure for Graph Analytics. In Proc. of Symp. on Operating Systems Principles. 456–471.Google ScholarGoogle Scholar
  26. F. O’Brien. 2020. A Streamig Accelerator for Graph Analytics on Tightly-Coupled CPU-FPGA Systems. Master’s thesis. University of Toronto.Google ScholarGoogle Scholar
  27. S. Perarnau and M. Sato. 2014. Victim Selection and Distributed Work Stealing Performance: A Case Study. In Proc. of Parallel and Distributed Processing Symp.659–668.Google ScholarGoogle Scholar
  28. N. Ramanathan, J. Wickerson, F. Winterstein, and G. Constantinides. 2016. A Case for Work-Stealing on FPGAs with OpenCL Atomics. In Proc. of Int’l Symp. on Field-Programmable Gate Arrays. 48–53.Google ScholarGoogle Scholar
  29. A. Rodriguez, A. Navarro, R. Asenjo, F. Corbera, R. Gran Tejero, D. Suarez Gracia, and J. Nunez-Yanez. 2019. Parallel multiprocessing and scheduling on the heterogeneous Xeon+FPGA platform. Journal of Supercomputing (06 2019).Google ScholarGoogle Scholar
  30. A. Roy, I. Mihailovic, and W. Zwaenepoel. 2013. X-Stream: Edge-Centric Graph Processing Using Streaming Partitions. In Proc. of Symp. on Operating Systems Principles. 472–488.Google ScholarGoogle Scholar
  31. X. Shi, Z. Zheng, Y. Zhou, H. Jin, L. He, B. Liu, and Q. Hua. 2018. Graph Processing on GPUs: A Survey. ACM Comput. Surv. 50, 6 (2018).Google ScholarGoogle Scholar
  32. S. Song, M. Li, X. Zheng, M. LeBeane, J. H. Ryoo, R. Panda, A. Gerstlauer, and L. K. John. 2016. Proxy-Guided Load Balancing of Graph Processing Workloads on Heterogeneous Clusters. In Proc. of Int’l Conf. on Parallel Processing. 77–86.Google ScholarGoogle ScholarCross RefCross Ref
  33. P. Stutz, A. Bernstein, and W. Cohen. 2010. Signal/Collect: Graph Algorithms for the (Semantic) Web. In Proc. of Int’l Semantic Web Conference on The Semantic Web - Volume Part I. 764–780.Google ScholarGoogle Scholar
  34. J. L. Tripp, A. A. Hanson, M. Gokhale, and H. Mortveit. 2005. Partitioning Hardware and Software for Reconfigurable Supercomputing Applications: A Case Study. In Proc. of Conference on Supercomputing. 27–27.Google ScholarGoogle Scholar
  35. Q. D. Truong, Q. B. Truong, and T. Dkaki. 2016. Graph Methods for Social Network Analysis. In Nature of Computation and Communication, P. C. Vinh and L. Barolli (Eds.). 276–286.Google ScholarGoogle Scholar
  36. A. Vilches, R. Asenjo, A. G. Navarro, F. Corbera, R. Gran Tejero, and M. Garzarán. 2015. Adaptive Partitioning for Irregular Applications on Heterogeneous CPU-GPU Chips. In Proc. of the Int’l Conf. on Computational Science, Vol. 51. 140–149.Google ScholarGoogle Scholar
  37. Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. 2016. Gunrock: A High-Performance Graph Processing Library on the GPU. In Proc. of Symp. on Principles and Practice of Parallel Programming. 117–128.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Y. Wang, J. C. Hoe, and E. Nurvitadhi. 2019. Processor Assisted Worklist Scheduling for FPGA Accelerated Graph Processing on a Shared-Memory Platform. In Proc. of Symp. on Field-Programmable Custom Computing Machines. 136–144.Google ScholarGoogle Scholar
  39. B. Wile. 2014. CAPI is Core to POWER. http://www-03.ibm.com/linux/blogs/capi/Google ScholarGoogle Scholar
  40. Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. 2019. A Comprehensive Survey on Graph Neural Networks. CoRR abs/1901.00596(2019). arxiv:1901.00596http://arxiv.org/abs/1901.00596Google ScholarGoogle Scholar
  41. Xilinx Inc.2014. Zynq-7000: all programmable SoC. http://www.xilinx.com/products/silicon-devices/soc/zynq-7000.htmlGoogle ScholarGoogle Scholar
  42. Y. Xuejun, C. Haibo, C. Yungui, C. Fujie, and C. Lijie. 1990. Processor self-scheduling for parallel loops in preemptive environments. Future Generation Computer Systems 6, 1 (1990), 97–103.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. J. H. C. Yeung, C. C. Tsang, K. H. Tsoi, B. S. H. Kwan, C. C. C. Cheung, A. P. C. Chan, and P. H. W. Leong. 2008. Map-reduce as a Programming Model for Custom Computing Machines. In Proc. of Symp. on Field-Programmable Custom Computing Machines. 149–159.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. S. Zhou, R. Kannan, V. K. Prasanna, G. Seetharaman, and Q. Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA. IEEE Transactions on Parallel and Distributed Systems 30, 10 (Oct 2019).Google ScholarGoogle ScholarCross RefCross Ref
  45. S. Zhou and V. K. Prasanna. 2017. Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform. In Proc. of Int’l Symp. on Computer Architecture and High Performance Computing (SBAC-PAD). 137–144.Google ScholarGoogle ScholarCross RefCross Ref

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
    August 2020
    844 pages
    ISBN:9781450388160
    DOI:10.1145/3404397

    Copyright © 2020 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 17 August 2020

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate91of313submissions,29%

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format