Abstract
This paper describes our new algorithm for sorting an array of structures by efficiently exploiting the SIMD instructions and cache memory of today's processors. Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for integer values. For sorting an array of structures with SIMD instructions, a frequently used approach is to first pack the key and index of each record into an integer value, sort the key-index pairs using SIMD instructions, and then rearrange the records based on the sorted key-index pairs. This approach can efficiently exploit SIMD instructions because it sorts the key-index pairs while they are packed into integer values; hence, it can use existing high-performance implementations of SIMD-based multiway mergesort for integers. However, this approach incurs frequent cache misses in the final rearranging phase because of its random and scattered memory accesses, and this phase therefore limits both single-thread performance and scalability with multiple cores. Our approach is also based on multiway mergesort, but it avoids the costly random accesses for rearranging the records while still efficiently exploiting the SIMD instructions. Our results showed that our approach exhibited up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core. Our approach also yielded better performance when we used multiple cores. Compared to an optimized radix sort, our vectorized multiway mergesort achieved better performance when each record is large. Our vectorized multiway mergesort also yielded higher scalability with multiple cores than the radix sort.
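The key-index baseline described above can be illustrated with a minimal scalar sketch. This is not the authors' algorithm or code: the `Record` layout (a 32-bit key plus 12 bytes of payload) and the function name `sort_key_index` are assumptions made for illustration, and `std::sort` stands in for the SIMD-based multiway mergesort. The final loop is the rearranging phase whose scattered reads the abstract identifies as the source of cache misses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-byte record: a 32-bit key followed by 12 bytes of payload.
struct Record {
    std::uint32_t key;
    std::uint32_t payload[3];
};

// Key-index approach (illustrative): pack, sort the packed integers, rearrange.
std::vector<Record> sort_key_index(const std::vector<Record>& in) {
    // The index is stored in the low 32 bits, so it must fit there.
    assert(in.size() <= 0xFFFFFFFFu);

    // Phase 1: pack each (key, index) pair into one 64-bit integer,
    // with the key in the high half so integer order follows key order.
    std::vector<std::uint64_t> packed(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        packed[i] = (static_cast<std::uint64_t>(in[i].key) << 32) |
                    static_cast<std::uint64_t>(i);
    }

    // Phase 2: sort plain integers. std::sort is a stand-in here for a
    // SIMD-based multiway mergesort over integers.
    std::sort(packed.begin(), packed.end());

    // Phase 3: rearrange records by the sorted indices. Each read of `in`
    // lands at an essentially random position, which is what causes the
    // cache misses for large inputs.
    std::vector<Record> out(in.size());
    for (std::size_t i = 0; i < packed.size(); ++i) {
        out[i] = in[static_cast<std::uint32_t>(packed[i])];
    }
    return out;
}
```

Because the index occupies the low bits of each packed value, records with equal keys keep their original relative order in this sketch.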
Index Terms
- SIMD- and cache-friendly algorithm for sorting an array of structures