ABSTRACT
We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-Tree outperforms the state of the art: a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). B-Tree insertions are also faster than LSM and sorted-array insertions unless insertions arrive in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention, and we describe the design choices that allow us to achieve this high performance.
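To make the supported operations concrete, the following is a minimal sequential CPU sketch of the dictionary interface named above (point, range, and successor queries; insertions and deletions). It is a reference model only, not the GPU B-Tree: it uses a sorted Python list rather than tree nodes and has no concurrency. The class and method names are hypothetical, and two semantic choices are assumptions: range bounds are treated as inclusive, and the successor of a key is taken to be the smallest stored key strictly greater than it.

```python
from bisect import bisect_left, bisect_right, insort


class SortedDictionary:
    """Sequential reference model of the query/update interface.

    Hypothetical names; this is NOT the paper's GPU B-Tree, only a
    sorted-list model of what each operation is expected to return.
    """

    def __init__(self):
        self._keys = []    # keys kept in ascending order
        self._values = {}  # key -> value

    def insert(self, key, value):
        # Insert a new pair, or overwrite the value of an existing key.
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def delete(self, key):
        # Remove the key if present; deleting a missing key is a no-op.
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect_left(self._keys, key))

    def point_query(self, key):
        # Return the value stored for key, or None if absent.
        return self._values.get(key)

    def range_query(self, lo, hi):
        # Return all pairs with lo <= key <= hi (inclusive: an assumption).
        i = bisect_left(self._keys, lo)
        j = bisect_right(self._keys, hi)
        return [(k, self._values[k]) for k in self._keys[i:j]]

    def successor(self, key):
        # Return the pair with the smallest key strictly greater than
        # key (strictness is an assumption), or None if there is none.
        i = bisect_right(self._keys, key)
        if i == len(self._keys):
            return None
        k = self._keys[i]
        return (k, self._values[k])
```

A model like this can serve as a correctness oracle when testing a concurrent implementation: replay the same operations sequentially and compare results.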
Index Terms
- Engineering a high-performance GPU B-Tree