research-article

Efficient parallel lists intersection and index compression algorithms using graphics processing units

Published: 01 May 2011

Abstract

Major web search engines answer thousands of queries per second requesting information about billions of web pages. The data sizes and query loads are growing at an exponential rate. To manage the heavy workload, we consider techniques for utilizing a Graphics Processing Unit (GPU). We investigate new approaches to improve two important operations of search engines -- lists intersection and index compression.

For lists intersection, we develop techniques for efficient implementation of the binary search algorithm for parallel computation. We inspect some representative real-world datasets and find that a sufficiently long inverted list has an overall linear rate of increase. Based on this observation, we propose Linear Regression and Hash Segmentation techniques for contracting the search range. For index compression, the traditional d-gap based compression schemata are not well-suited for parallel computation, so we propose a Linear Regression Compression schema which has an inherent parallel structure. We further discuss how to efficiently intersect the compressed lists on a GPU. Our experimental results show significant improvements in the query processing throughput on several datasets.
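The two regression-based ideas in the abstract can be illustrated with a minimal CPU-side sketch in Python. It is an illustration under stated assumptions, not the authors' GPU implementation: the function names (`fit_line`, `build_model`, `contains`, `intersect`, `compress`, `decode_at`) are hypothetical, the least-squares fit is over (position, docID) pairs, and lists are assumed to be non-empty and strictly increasing. The sketch stores the worst position errors of the fitted line so that each lookup only needs a binary search over a small window, and it encodes each docID as an offset from the line so that any entry can be decoded independently of the others. Hash Segmentation is not shown.

```python
import math
from bisect import bisect_left


def fit_line(postings):
    """Least-squares fit of docID ~ a * i + b, where i is the list position."""
    n = len(postings)
    if n < 2:
        # Assumption: a one-element list gets a trivial line through its docID.
        return 1.0, float(postings[0])
    mean_i = (n - 1) / 2
    mean_d = sum(postings) / n
    sxx = sum((i - mean_i) ** 2 for i in range(n))
    sxy = sum((i - mean_i) * (d - mean_d) for i, d in enumerate(postings))
    a = sxy / sxx
    return a, mean_d - a * mean_i


def build_model(postings):
    """Fit the line and record the worst position errors on either side."""
    a, b = fit_line(postings)
    devs = [i - (d - b) / a for i, d in enumerate(postings)]
    return a, b, min(devs), max(devs)


def contains(postings, model, doc_id):
    """Binary search restricted to the window predicted by the model."""
    a, b, dev_min, dev_max = model
    guess = (doc_id - b) / a
    lo = max(0, math.floor(guess + dev_min))
    hi = min(len(postings), math.ceil(guess + dev_max) + 1)
    if lo >= hi:
        return False
    pos = bisect_left(postings, doc_id, lo, hi)
    return pos < hi and postings[pos] == doc_id


def intersect(short_list, long_list):
    """Each lookup is independent, so on a GPU it could map to one thread."""
    model = build_model(long_list)
    return [d for d in short_list if contains(long_list, model, d)]


def compress(postings):
    """Store each docID as a non-negative offset from the fitted line."""
    a, b = fit_line(postings)
    residuals = [d - math.floor(a * i + b) for i, d in enumerate(postings)]
    base = min(residuals)
    return a, b, base, [r - base for r in residuals]


def decode_at(compressed, i):
    """Recover entry i alone, with no dependence on earlier entries."""
    a, b, base, offsets = compressed
    return math.floor(a * i + b) + base + offsets[i]


long_list = [3, 7, 11, 14, 20, 24, 27, 33, 36, 41]
short_list = [7, 20, 25, 41, 50]
print(intersect(short_list, long_list))
packed = compress(long_list)
print([decode_at(packed, i) for i in range(len(long_list))] == long_list)
```

The contrast with d-gap coding is visible in `decode_at`: a d-gap decoder needs the running sum of all earlier gaps, whereas here entry `i` depends only on the line parameters and its own offset. The closer the list is to linear, the smaller the offsets and the narrower the search window.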

Published in

Proceedings of the VLDB Endowment, Volume 4, Issue 8
May 2011
58 pages

Publisher: VLDB Endowment