Scaling SQL to the Supercomputer for Interactive Analysis of Simulation Data

Glaser, Jens; Aramburú, Felipe; Malpica, William; Hernández, Benjamín; Baker, Matthew; Aramburú, Rodrigo

doi:10.1007/978-3-030-96498-6_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1512)

Included in the following conference series:

Smoky Mountains Computational Sciences and Engineering Conference

Abstract

AI and simulation workloads consume and generate large amounts of data that need to be searched, transformed and merged with other data. With the goal of treating data as a first-class citizen inside a traditionally compute-centric HPC environment, we explore how the use of accelerators and high-speed interconnects can speed up tasks that otherwise constitute bottlenecks in computational discovery workflows. BlazingSQL is a SQL engine that runs natively on NVIDIA GPUs and supports internode communication for fast analytics on terabyte-scale tabular data sets. We show how a fast interconnect improves query performance when it is leveraged through the Unified Communication X (UCX) middleware. We envision that future computing platforms will integrate accelerated database query capabilities for immediate and interactive analysis of large simulation data.

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Acknowledgments

We are grateful to Oscar Hernandez (NVIDIA) for initial conceptualization of this research. We thank Arjun Shankar (ORNL) for support. This research used resources of the Oak Ridge Leadership Computing Facility (OLCF) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Author information

Authors and Affiliations

Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN, 37831, USA
Jens Glaser, Benjamín Hernández & Matthew Baker
Voltron Data, Inc., Mountain View, USA
Felipe Aramburú, William Malpica & Rodrigo Aramburú

Corresponding author

Correspondence to Jens Glaser.

Editor information

Editors and Affiliations

Oak Ridge National Laboratory, Oak Ridge, TN, USA
Jeffrey Nichols
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Arthur ‘Barney’ Maccabe
Oak Ridge National Laboratory, Oak Ridge, TN, USA
James Nutaro
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Swaroop Pophale
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Pravallika Devineni
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Theresa Ahearn
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Becky Verastegui

About this paper

Cite this paper

Glaser, J., Aramburú, F., Malpica, W., Hernández, B., Baker, M., Aramburú, R. (2022). Scaling SQL to the Supercomputer for Interactive Analysis of Simulation Data. In: Nichols, J., et al. Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation. SMC 2021. Communications in Computer and Information Science, vol 1512. Springer, Cham. https://doi.org/10.1007/978-3-030-96498-6_19

DOI: https://doi.org/10.1007/978-3-030-96498-6_19
Published: 10 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96497-9
Online ISBN: 978-3-030-96498-6
eBook Packages: Computer Science, Computer Science (R0)
