Abstract
The multi-core architectures of today's computer systems make parallelism a necessity for performance-critical applications. Writing such applications in a generic, hardware-oblivious manner is challenging, so current database systems rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware such as multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators that are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines while achieving performance competitive with hand-tuned systems.
We provide a proof of concept for this design by integrating operators written in the parallel programming framework OpenCL into the open-source database MonetDB. This approach yields efficient yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H-derived queries and observed performance that rivals MonetDB's query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.