DOI:10.1145/3293883.3295706
research-article
Open Access

Engineering a high-performance GPU B-Tree

Published: 16 February 2019

ABSTRACT

We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-Tree outperforms the state of the art: a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). Furthermore, B-Tree insertions are faster than LSM and sorted array insertions unless insertions come in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention, and we describe the design choices that allow us to achieve this high performance.
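
To make the operation set named in the abstract concrete, the following is a minimal sequential sketch in Python of the dictionary interface: point, range, and successor queries plus insertions and deletions. It is not the GPU implementation described in the paper and says nothing about concurrency, caching, or node layout. The class name `SortedDictionary`, its method names, and the choice that a successor query returns the smallest key strictly greater than the query key are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right, insort


class SortedDictionary:
    """Sequential reference model of the operations listed in the abstract.

    Hypothetical illustration only: a sorted Python list stands in for the
    tree, so this shows query semantics, not the paper's GPU design.
    """

    def __init__(self):
        self.keys = []    # keys kept in ascending order
        self.values = {}  # key -> value

    def insert(self, key, value):
        # Insert a new key in sorted position, or overwrite an existing value.
        if key not in self.values:
            insort(self.keys, key)
        self.values[key] = value

    def delete(self, key):
        # Remove the key if present; deleting a missing key is a no-op.
        if key in self.values:
            del self.values[key]
            self.keys.pop(bisect_left(self.keys, key))

    def point_query(self, key):
        # Return the value stored for key, or None if the key is absent.
        return self.values.get(key)

    def range_query(self, lo, hi):
        # Return all (key, value) pairs with lo <= key <= hi, in key order.
        start = bisect_left(self.keys, lo)
        end = bisect_right(self.keys, hi)
        return [(k, self.values[k]) for k in self.keys[start:end]]

    def successor_query(self, key):
        # Assumed semantics: smallest stored key strictly greater than key.
        i = bisect_right(self.keys, key)
        if i == len(self.keys):
            return None
        k = self.keys[i]
        return (k, self.values[k])
```

A model like this can serve as a correctness oracle when testing a faster implementation: apply the same sequence of operations to both and compare the query results.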

      • Published in
        PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
        February 2019
        472 pages
        ISBN:9781450362252
        DOI:10.1145/3293883

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article

        Acceptance Rates

        PPoPP '19 Paper Acceptance Rate: 29 of 152 submissions, 19%
        Overall Acceptance Rate: 230 of 1,014 submissions, 23%
