research-article

SIMD- and cache-friendly algorithm for sorting an array of structures

Published: 01 July 2015

Abstract

This paper describes our new algorithm for sorting an array of structures that efficiently exploits the SIMD instructions and cache memory of today's processors. Recently, multiway mergesort implemented with SIMD instructions has been used as a high-performance in-memory sorting algorithm for integer values. For sorting an array of structures with SIMD instructions, a frequently used approach is to first pack the key and index of each record into an integer value, sort the key-index pairs using SIMD instructions, and then rearrange the records based on the sorted key-index pairs. This approach can efficiently exploit SIMD instructions because it sorts the key-index pairs while they are packed into integer values; hence, it can use existing high-performance implementations of SIMD-based multiway mergesort for integers. However, this approach incurs frequent cache misses in the final rearranging phase because of its random and scattered memory accesses, so this phase limits both single-thread performance and scalability with multiple cores. Our approach is also based on multiway mergesort, but it avoids costly random accesses for rearranging the records while still efficiently exploiting SIMD instructions. Our results show that our approach achieved up to 2.1x better single-thread performance than the key-index approach implemented with SIMD instructions when sorting 512M 16-byte records on one core. Our approach also yielded better performance when we used multiple cores. Compared to an optimized radix sort, our vectorized multiway mergesort achieved better performance when each record is large. Our vectorized multiway mergesort also scaled better with multiple cores than the radix sort.
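The key-index approach that the abstract uses as its baseline can be illustrated with a minimal sketch. It is not the paper's implementation: the record layout (a non-negative 32-bit key plus an opaque payload), the 64-bit packing, and the name `sort_by_key_index` are assumptions made for illustration, and Python's built-in sort stands in for a SIMD-based multiway mergesort over integers.

```python
from typing import List, Tuple

# Assumed record layout for illustration: (32-bit non-negative key, payload).
Record = Tuple[int, bytes]


def sort_by_key_index(records: List[Record]) -> List[Record]:
    """Sort records by key using the three-phase key-index approach."""
    # Phase 1: pack the key (high 32 bits) and the record's index
    # (low 32 bits) into one integer, so that integer order equals
    # key order, with ties broken by original position.
    packed = [(key << 32) | idx for idx, (key, _) in enumerate(records)]

    # Phase 2: sort the packed integers. A SIMD integer sort would go
    # here; the built-in sort is only a stand-in.
    packed.sort()

    # Phase 3: rearrange the records. Each records[idx] lookup lands at
    # an arbitrary position in the input, which is the random and
    # scattered access pattern the abstract blames for cache misses.
    return [records[p & 0xFFFFFFFF] for p in packed]
```

Phases 1 and 2 touch memory sequentially, while phase 3 reads one full record per packed value from an effectively random location. That contrast is why the abstract singles out the rearranging phase as the bottleneck.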

Published in

Proceedings of the VLDB Endowment, Volume 8, Issue 11
July 2015
264 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
