ABSTRACT
FPGAs are promising platforms for accelerating irregular applications due to their ability to implement highly specialized hardware designs for each kernel. However, the design and implementation of FPGA-accelerated kernels can take several months using hardware description languages. High-Level Synthesis (HLS) tools provide fast, high-quality results for regular applications, but lack the support to effectively accelerate more irregular, complex workloads. This work analyzes the challenges and benefits of using a commercial state-of-the-art HLS tool and its available optimizations to accelerate graph sampling. We evaluate the resulting designs and their effectiveness when deployed in a state-of-the-art heterogeneous framework that implements the Influence Maximization with Martingales (IMM) algorithm, a complex graph analytics algorithm. We discuss future opportunities for improvement in hardware, HLS tools, and hardware/software co-design methodology to better support complex irregular applications such as IMM.
Index Terms
- High-Level Synthesis of Irregular Applications: A Case Study on Influence Maximization