Hardware-oblivious parallelism for in-memory column-stores

Authors:
Max Heimel (Technische Universität Berlin),
Michael Saecker (ParStream GmbH),
Holger Pirk (CWI Amsterdam),
Stefan Manegold (CWI Amsterdam),
Volker Markl (Technische Universität Berlin)

Proceedings of the VLDB Endowment, Volume 6, Issue 9, pp. 709–720. https://doi.org/10.14778/2536360.2536370

Published: 01 July 2013

Abstract

The multi-core architectures of today's computer systems make parallelism a necessity for performance-critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem. Current database systems therefore rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures such as multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators that are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines while achieving performance competitive with hand-tuned systems.

We provide a proof of concept for this design by integrating operators written using the parallel programming framework OpenCL into the open-source database MonetDB. Following this approach, we achieve efficient yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H derived queries and observed performance that rivals that of MonetDB's query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.
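
As a rough illustration of the idea (not the paper's OpenCL code), the sketch below writes a range-selection operator once as a per-element kernel and lets a small launcher pick the work-group size for each device. The names `range_select_kernel`, `launch`, and `DEVICE_WORK_GROUP_SIZE`, as well as the sizes 4 and 256, are assumptions made for this example; plain Python loops stand in for the parallel runtime.

```python
from typing import Callable, List, Sequence

# Hypothetical per-device launch parameter (illustrative values only).
DEVICE_WORK_GROUP_SIZE = {"cpu": 4, "gpu": 256}


def range_select_kernel(gid: int, column: Sequence[int],
                        lo: int, hi: int, flags: List[int]) -> None:
    """Device-independent kernel: one work item tests one column value."""
    flags[gid] = 1 if lo <= column[gid] <= hi else 0


def launch(kernel: Callable[..., None], n: int, device: str, *args) -> None:
    """Stand-in for a runtime that maps work items onto a device.

    Only the work-group size depends on the device; the kernel is untouched.
    The loops run sequentially here, where a real runtime would run in parallel.
    """
    group = DEVICE_WORK_GROUP_SIZE[device]
    for start in range(0, n, group):                     # one work group
        for gid in range(start, min(start + group, n)):  # its work items
            kernel(gid, *args)


def range_select(column: Sequence[int], lo: int, hi: int,
                 device: str) -> List[int]:
    """Return the positions of all values v with lo <= v <= hi."""
    flags = [0] * len(column)
    launch(range_select_kernel, len(column), device, column, lo, hi, flags)
    return [i for i, flag in enumerate(flags) if flag]


if __name__ == "__main__":
    data = [5, 12, 7, 30, 9]
    print(range_select(data, 6, 12, "cpu"))
    print(range_select(data, 6, 12, "gpu"))
```

Both calls run the same kernel and produce the same positions; only the launch configuration differs, which is the property a hardware-oblivious operator relies on.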

Index Terms

Hardware-oblivious parallelism for in-memory column-stores
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 6, Issue 9
July 2013
180 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
Qualifiers: article
