Engineering a high-performance GPU B-Tree

Authors:
Muhammad A. Awad (University of California)
Saman Ashkiani (University of California)
Rob Johnson (VMWare Research)
Martín Farach-Colton (Rutgers University)
John D. Owens (University of California)

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019, Pages 145–157
https://doi.org/10.1145/3293883.3295706

Published: 16 February 2019

ABSTRACT

We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-Tree outperforms the state of the art: a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). Furthermore, B-Tree insertions are also faster than LSM and sorted array insertions unless insertions come in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention and describe the design choices that allow us to achieve this high performance.
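
The operations named in the abstract can be pinned down with a small CPU reference. The sketch below is not the paper's GPU B-Tree. It is a plain sorted list in Python that only illustrates what point, range, and successor queries and insert/delete updates are expected to return. The class name `SortedKeys`, its method names, and the choice that "successor" means the smallest key strictly greater than the argument are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


class SortedKeys:
    """Hypothetical CPU reference for the query semantics (not the GPU B-Tree)."""

    def __init__(self):
        self.keys = []  # kept sorted, no duplicates

    def insert(self, key):
        # Find the slot for key; skip the insert if key is already present.
        i = bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            self.keys.insert(i, key)

    def delete(self, key):
        # Remove key if present; do nothing otherwise.
        i = bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.keys.pop(i)

    def point_query(self, key):
        # True if key is stored.
        i = bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def range_query(self, lo, hi):
        # All stored keys k with lo <= k <= hi, in sorted order.
        return self.keys[bisect_left(self.keys, lo):bisect_right(self.keys, hi)]

    def successor(self, key):
        # Assumption: smallest stored key strictly greater than key, else None.
        i = bisect_right(self.keys, key)
        return self.keys[i] if i < len(self.keys) else None


if __name__ == "__main__":
    d = SortedKeys()
    for k in [5, 1, 9, 3]:
        d.insert(k)
    print(d.point_query(3), d.range_query(2, 5), d.successor(5))
```

Each insertion into a sorted list shifts later elements, so this reference is useful only for checking answers, not for reasoning about the throughput claims in the abstract.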

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms

Published in
PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN: 9781450362252
DOI: 10.1145/3293883
General Chair: Jeff Hollingsworth, University of Maryland
Program Chair: Idit Keidar, Technion, Israel
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
B-tree
GPU
data structures
dynamic
mutable
Qualifiers
- research-article
Acceptance Rates
PPoPP '19 Paper Acceptance Rate: 29 of 152 submissions, 19%
Overall Acceptance Rate: 230 of 1,014 submissions, 23%
Article Metrics
- Total Citations: 27
- Total Downloads: 2,197
- Downloads (Last 12 months): 528
- Downloads (Last 6 weeks): 64