research-article

SIMD- and cache-friendly algorithm for sorting an array of structures

Authors:
Hiroshi Inoue (IBM Research, Toyosu, Tokyo, Japan and University of Tokyo, Tokyo, Japan)
Kenjiro Taura (University of Tokyo, Tokyo, Japan)

Proceedings of the VLDB Endowment, Volume 8, Issue 11, pp. 1274–1285. https://doi.org/10.14778/2809974.2809988

Published: 01 July 2015

Abstract

This paper describes our new algorithm for sorting an array of structures by efficiently exploiting the SIMD instructions and cache memory of today's processors. Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for integer values. For sorting an array of structures with SIMD instructions, a frequently used approach is to first pack the key and index of each record into an integer value, sort the key-index pairs using SIMD instructions, and then rearrange the records based on the sorted key-index pairs. This approach can efficiently exploit SIMD instructions because it sorts the key-index pairs while they are packed into integer values; hence, it can reuse existing high-performance implementations of SIMD-based multiway mergesort for integers. However, this approach incurs frequent cache misses in the final rearranging phase because of its random and scattered memory accesses, so this phase limits both single-thread performance and scalability with multiple cores. Our approach is also based on multiway mergesort, but it avoids costly random accesses for rearranging the records while still efficiently exploiting SIMD instructions. Our results show that our approach achieved up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core. Our approach also yielded better performance when we used multiple cores. Compared to an optimized radix sort, our vectorized multiway mergesort achieved better performance when each record is large. Our vectorized multiway mergesort also scaled better with multiple cores than the radix sort.
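The key-index approach that the abstract uses as its baseline can be sketched in a few lines. This is a minimal scalar illustration, not the paper's SIMD implementation: the function name `key_index_sort`, the 32-bit index width, and the sample records are assumptions made for the example, and keys are assumed to be non-negative integers.

```python
def key_index_sort(records, index_bits=32):
    """Sort (key, payload) records via packed key-index integers.

    Hypothetical helper for illustration; assumes non-negative integer
    keys and fewer than 2**index_bits records.
    """
    mask = (1 << index_bits) - 1

    # Phase 1: pack each record's key and its position into one integer.
    # The key occupies the high bits, so integer order follows key order.
    packed = [(key << index_bits) | i
              for i, (key, _payload) in enumerate(records)]

    # Phase 2: sort plain integers. A real implementation would use a
    # SIMD multiway mergesort here; list.sort() stands in for it.
    packed.sort()

    # Phase 3: rearrange the records. Each lookup records[p & mask] jumps
    # to an arbitrary position in the input, which is the random,
    # scattered access pattern the abstract blames for cache misses.
    return [records[p & mask] for p in packed]


records = [(42, "d"), (7, "a"), (42, "c"), (19, "b")]
sorted_records = key_index_sort(records)
print(sorted_records)
```

Because the original index sits in the low bits, records with equal keys keep their input order. The first two phases touch only a compact integer array; the cost the abstract describes comes from the third phase, where whole records are fetched in sorted-key order rather than memory order.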

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Information systems
  1. Data management systems
    1. Database management system engines

Published in
Proceedings of the VLDB Endowment, Volume 8, Issue 11
July 2015
264 pages
ISSN: 2150-8097
Editors: Chen Li (University of California, Irvine), Volker Markl (TU Berlin)
Publisher: VLDB Endowment