Abstract
Conjunctive Boolean queries are a fundamental operation in web search engines. These queries can be reduced to the problem of intersecting ordered sets of integers, where each set represents the documents containing one of the query terms. But there is tension between the desire to store the lists effectively, in a compressed form, and the desire to carry out intersection operations efficiently, using non-sequential processing modes. In this paper we evaluate intersection algorithms on compressed sets, comparing them to the best non-sequential array-based intersection algorithms. By adding a simple, low-cost, auxiliary index, we show that compressed storage need not hinder efficient and high-speed intersection operations.
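To make the idea concrete, here is a minimal sketch of how a small auxiliary index can support non-sequential lookups in a gap-encoded list. It is an illustration under stated assumptions, not the scheme evaluated in the paper: the names `build_index`, `contains`, and `intersect`, and the sampling interval `k`, are hypothetical, and the gaps are kept as plain Python integers rather than being bit- or byte-coded.

```python
from bisect import bisect_right

def build_index(postings, k=4):
    """Gap-encode a sorted list of document numbers and sample every
    k-th value into an uncompressed auxiliary array (assumed layout)."""
    gaps, samples, prev = [], [], 0
    for i, doc in enumerate(postings):
        if i % k == 0:
            samples.append(doc)      # first value of each block
        gaps.append(doc - prev)      # difference from the previous value
        prev = doc
    return gaps, samples, k

def contains(index, target):
    """Binary-search the samples to pick a block, then decode at most
    k - 1 gaps sequentially inside that block."""
    gaps, samples, k = index
    block = bisect_right(samples, target) - 1
    if block < 0:
        return False                 # target precedes the whole list
    value = samples[block]
    if value == target:
        return True
    start = block * k
    for gap in gaps[start + 1 : start + k]:
        value += gap
        if value >= target:
            return value == target
    return False

def intersect(short_list, index):
    """Probe the indexed list once for each element of the shorter list."""
    return [doc for doc in short_list if contains(index, doc)]

long_list = [2, 5, 9, 14, 20, 21, 30, 41, 57]
index = build_index(long_list, k=4)
result = intersect([5, 14, 21, 40, 57, 100], index)
```

In this sketch each probe costs one binary search over the samples plus a bounded sequential decode, so a larger `k` shrinks the auxiliary array at the price of longer decoding runs.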
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Culpepper, J.S., Moffat, A. (2007). Compact Set Representation for Information Retrieval. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_13
DOI: https://doi.org/10.1007/978-3-540-75530-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75529-6
Online ISBN: 978-3-540-75530-2
eBook Packages: Computer Science (R0)