Abstract
Major web search engines answer thousands of queries per second, each requesting information about billions of web pages. Both data sizes and query loads are growing at an exponential rate. To manage this heavy workload, we consider techniques for utilizing a Graphics Processing Unit (GPU). We investigate new approaches to improving two important operations of search engines: lists intersection and index compression.
For lists intersection, we develop techniques for implementing the binary search algorithm efficiently for parallel computation. We inspect some representative real-world datasets and find that a sufficiently long inverted list increases at an overall linear rate. Based on this observation, we propose Linear Regression and Hash Segmentation techniques for contracting the search range. For index compression, the traditional d-gap based compression schemata are not well suited to parallel computation, so we propose a Linear Regression Compression schema, which has an inherently parallel structure. We further discuss how to intersect the compressed lists efficiently on a GPU. Our experimental results show significant improvements in query processing throughput on several datasets.
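The two regression-based ideas can be illustrated with a minimal CPU-side sketch in Python. It is not the authors' GPU implementation. It assumes sorted lists of distinct integer document identifiers, and every function name (`fit_line`, `build_search_model`, `contains`, `intersect`, `compress`, `decode_at`) is hypothetical. For intersection, a line fitted to the long list predicts where an identifier should sit, and the worst prediction errors bound the range that binary search has to cover. For compression, only each identifier's small offset from the fitted line is kept, so any position can be decoded on its own, whereas d-gaps require a running sum. Hash Segmentation and the bit-packing of offsets are omitted.

```python
from bisect import bisect_left

def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ slope * xs + intercept."""
    n = len(xs)
    if n == 0:
        return 0.0, 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def build_search_model(postings):
    """Fit position ~ docID and record the extreme prediction errors."""
    slope, intercept = fit_line(postings, range(len(postings)))
    errors = [i - round(slope * d + intercept) for i, d in enumerate(postings)]
    return slope, intercept, min(errors, default=0), max(errors, default=0)

def contains(postings, model, doc_id):
    """Binary search restricted to the range the fitted line allows."""
    slope, intercept, lo_err, hi_err = model
    guess = round(slope * doc_id + intercept)
    lo = max(0, guess + lo_err)
    hi = min(len(postings), guess + hi_err + 1)
    if lo >= hi:
        return False
    k = bisect_left(postings, doc_id, lo, hi)
    return k < hi and postings[k] == doc_id

def intersect(short_list, long_list):
    """Look up each element of the short list in the long list."""
    model = build_search_model(long_list)
    # Each lookup is independent of the others, so they could run in parallel.
    return [d for d in short_list if contains(long_list, model, d)]

def compress(postings):
    """Fit docID ~ position and keep non-negative offsets from the line."""
    slope, intercept = fit_line(range(len(postings)), postings)
    residuals = [d - round(slope * i + intercept) for i, d in enumerate(postings)]
    base = min(residuals, default=0)
    return slope, intercept, base, [r - base for r in residuals]

def decode_at(compressed, i):
    """Recover the docID at position i without touching other positions."""
    slope, intercept, base, offsets = compressed
    return round(slope * i + intercept) + base + offsets[i]
```

The narrowed range is safe for membership tests because an identifier that is present always lies within the recorded error bounds of its predicted position. The closer the list is to linear, the tighter the search range and the smaller the stored offsets.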
Index Terms
- Efficient parallel lists intersection and index compression algorithms using graphics processing units