ABSTRACT
Graph analytics delivers deep knowledge by processing large volumes of highly connected data. In real-world graphs, the degree distribution tends to follow a power law: a small fraction of nodes have a very large number of neighbors. This high irregularity in the degree distribution is a major barrier to efficient processing on GPU architectures, which are designed primarily to accelerate computations on regular data with SIMD execution. Existing solutions to the inefficiency of GPU-based graph analytics either modify the graph programming abstraction or rely on changes to the low-level thread execution model. The former demands more programming effort to design and maintain graph analytics, while the latter is coupled to the underlying architecture, making it difficult to adapt as architectures evolve quickly. Unlike prior efforts, this work proposes to address the problem at its origin: the irregular graph data itself. It raises a critical question in irregular graph processing: Is it possible to transform irregular graphs into more regular ones so that they can be processed more efficiently on GPU-like architectures while still producing the same results? Inspired by this question, this work introduces Tigr, a graph transformation framework that can effectively reduce the irregularity of real-world graphs with correctness guarantees for a wide range of graph analytics. To make the transformations practical, Tigr features a lightweight virtual transformation scheme, which can substantially reduce the cost of graph transformations while preserving the benefits of reduced irregularity. Evaluation of Tigr-based GPU graph processing shows significant and consistent speedups over state-of-the-art GPU graph processing frameworks across a spectrum of irregular graphs.
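To make the idea of reducing degree irregularity without rewriting the graph more concrete, the following is a minimal sketch. It is not taken from the paper: the abstract does not describe Tigr's actual scheme, so the CSR layout, the degree bound `k`, and the function name `virtual_split` are all assumptions chosen for illustration. The sketch overlays "virtual" nodes on an unchanged edge array, so that each unit of work covers at most `k` edges.

```python
from typing import List, Tuple


def virtual_split(row_offsets: List[int], k: int) -> List[Tuple[int, int, int]]:
    """Overlay bounded-degree virtual nodes on a CSR graph (illustrative only).

    Each returned tuple is (physical_node, edge_start, edge_end) and covers
    at most k consecutive entries of the CSR edge array. The edge array
    itself is never copied or modified. Nodes with no outgoing edges get
    no virtual node in this sketch.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    virtual = []
    for node in range(len(row_offsets) - 1):
        start, end = row_offsets[node], row_offsets[node + 1]
        # Cut this node's edge range into chunks of at most k edges.
        for chunk_start in range(start, end, k):
            virtual.append((node, chunk_start, min(chunk_start + k, end)))
    return virtual


# Toy graph (hypothetical data): node 0 has 5 neighbors, nodes 1 and 2
# have 1 neighbor each, and node 3 has none.
row_offsets = [0, 5, 6, 7, 7]
col_indices = [1, 2, 3, 1, 2, 0, 0]

virtual_nodes = virtual_split(row_offsets, k=2)
max_virtual_degree = max(end - start for _, start, end in virtual_nodes)
covered_edges = sum(end - start for _, start, end in virtual_nodes)
```

In this sketch, a GPU thread would be assigned one virtual node rather than one physical node, so no thread handles more than `k` edges, while every virtual node still refers back to its physical node for reading and writing values. Whether results stay identical depends on the analytics being run; the abstract states that Tigr provides such guarantees but does not give the conditions.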
Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing
ASPLOS '18