ABSTRACT
Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important due to its broad applicability, from machine learning to the social sciences. However, because graph computations have irregular data access patterns, performance is a major challenge for graph processing systems. The algorithms, software, and hardware that have been tailored for mainstream parallel applications are generally not effective for massive, sparse graphs from real-world problems, due to their complex and irregular structures. To address the performance issues in large-scale graph analytics, we leverage the exceptional random access performance of the emerging Hybrid Memory Cube (HMC) combined with the flexibility and efficiency of modern FPGAs. In particular, we develop a collaborative software/hardware technique to perform a level-synchronized Breadth First Search (BFS) on an FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that exploits the FPGA-HMC platform's capability to improve data locality and memory access efficiency. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by designing a memory request merging unit to take advantage of the increased data locality resulting from graph clustering. We evaluate the performance of our BFS implementation using the AC-510 development kit from Micron and achieve a $2.8\times$ average performance improvement compared to the latest FPGA-HMC based graph processing system over a set of benchmarks from a wide range of applications.
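For readers unfamiliar with the terminology, the following is a minimal software sketch of a level-synchronized BFS and of why locality helps request merging. It is not the paper's FPGA implementation: the CSR layout (`row_ptr`, `col_idx`), the function names, and the `block_size` parameter are assumptions chosen for illustration, and `merged_requests` is only a hypothetical stand-in that counts how many fixed-size memory blocks a set of addresses touches.

```python
from typing import Iterable, List


def bfs_levels(row_ptr: List[int], col_idx: List[int], source: int) -> List[int]:
    """Level-synchronized BFS over a CSR graph.

    Returns the BFS level of every vertex, or -1 if it is unreachable.
    The whole current frontier is expanded before the next level starts.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            # Neighbors of u occupy one contiguous slice of col_idx.
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                if level[v] == -1:
                    level[v] = depth + 1
                    next_frontier.append(v)
        frontier = next_frontier  # level barrier: swap frontiers
        depth += 1
    return level


def merged_requests(addresses: Iterable[int], block_size: int) -> int:
    """Hypothetical illustration: number of block-sized requests needed
    if all accesses that fall into the same block are merged into one."""
    return len({a // block_size for a in addresses})


# Undirected example graph: 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
row_ptr = [0, 2, 4, 6, 9, 10, 10]
col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
levels = bfs_levels(row_ptr, col_idx, 0)

# Clustered addresses share blocks; scattered ones do not.
clustered = merged_requests([0, 1, 2, 3, 8, 9], 4)
scattered = merged_requests([0, 5, 10, 15, 20, 25], 4)
```

In this toy setting the clustered address set needs fewer merged requests than the scattered one, which is the intuition behind pairing graph clustering with a request merging unit; the actual hardware design and clustering algorithm are described in the paper itself.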
Index Terms
- Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform