Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study

  • Conference paper
Advances in Information Retrieval (ECIR 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11437)

Abstract

Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al., and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe the implementation challenges encountered, and explore the performance characteristics of our clean-room reimplementation. We successfully reproduce the core results of the original paper and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.
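
As a rough illustration of why reordering document identifiers can reduce storage costs, the sketch below compares a toy index under two identifier assignments. It assumes a simplified cost model in which each gap between consecutive identifiers in a postings list costs about log2(gap + 1) bits; the function names (`gap_cost_bits`, `remap`, `total_cost`) and the toy data are hypothetical and are not taken from the paper or its implementation.

```python
import math

def gap_cost_bits(postings):
    """Approximate bits for one postings list: sum of ceil(log2(gap + 1)) over its d-gaps."""
    cost = 0
    prev = -1  # so the first gap equals doc_id + 1
    for doc_id in sorted(postings):
        gap = doc_id - prev
        cost += math.ceil(math.log2(gap + 1))
        prev = doc_id
    return cost

def remap(index, order):
    """Renumber documents, where order[new_id] == old_id."""
    new_id = {old: new for new, old in enumerate(order)}
    return {term: sorted(new_id[d] for d in docs) for term, docs in index.items()}

def total_cost(index):
    """Total approximate cost of all postings lists in the index."""
    return sum(gap_cost_bits(p) for p in index.values())

# Toy index (assumed data): two terms whose documents interleave.
index = {
    "a": [0, 2, 4, 6],
    "b": [1, 3, 5, 7],
}

# Place the documents that share a term next to each other.
clustered = remap(index, [0, 2, 4, 6, 1, 3, 5, 7])

print(total_cost(index), total_cost(clustered))
```

Under this toy model the clustered assignment yields smaller gaps and therefore a lower total cost. A reordering algorithm searches for such an assignment automatically; the sketch only evaluates two fixed orderings.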

Notes

  1. https://www.threadingbuildingblocks.org/

  2. https://software.intel.com/en-us/get-started-with-pstl

  3. https://github.com/pisa-engine/pisa

  4. https://github.com/pisa-engine/ecir19-bisection

  5. https://github.com/attardi/wikiextractor

  6. https://github.com/commoncrawl/news-crawl

References

  1. Arguello, J., Diaz, F., Lin, J., Trotman, A.: SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR). In: Proceedings of SIGIR, pp. 1147–1148 (2015)

  2. Blanco, R., Barreiro, Á.: Document identifier reassignment through dimensionality reduction. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 375–387. Springer, Heidelberg (2005). https://doi.org/10.1007/978-3-540-31865-1_27

  3. Blanco, R., Barreiro, Á.: Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem. In: Proceedings of SIGIR, pp. 587–588 (2005)

  4. Blanco, R., Barreiro, Á.: TSP and cluster-based solutions to the reassignment of document identifiers. Inf. Retr. 9(4), 499–517 (2006)

  5. Blandford, D., Blelloch, G.: Index compression through document reordering. In: Proceedings DCC 2002, Data Compression Conference, pp. 342–352 (2002)

  6. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)

  7. Chierichetti, F., Kumar, R., Lattanzi, S., Mitzenmacher, M., Panconesi, A., Raghavan, P.: On compressing social networks. In: Proceedings of SIGKDD, pp. 219–228 (2009)

  8. Crane, M., Culpepper, J.S., Lin, J., Mackenzie, J., Trotman, A.: A comparison of Document-at-a-Time and Score-at-a-Time query evaluation. In: Proceedings of WSDM, pp. 201–210 (2017)

  9. Dean, J.: Challenges in building large-scale information retrieval systems: invited talk. In: Proceedings of WSDM, pp. 1–1 (2009)

  10. Dhulipala, L., Kabiljo, I., Karrer, B., Ottaviano, G., Pupyrev, S., Shalita, A.: Compressing graphs and indexes with recursive graph bisection. In: Proceedings of SIGKDD, pp. 1535–1544 (2016)

  11. Ding, S., Suel, T.: Faster top-\(k\) document retrieval using block-max indexes. In: Proceedings of SIGIR, pp. 993–1002 (2011)

  12. Ding, S., Attenberg, J., Suel, T.: Scalable techniques for document identifier assignment in inverted indexes. In: Proceedings of the WWW, pp. 311–320 (2010)

  13. Fredriksson, K., Kilpeläinen, P.: Practically efficient array initialization. Soft. Prac. Exp. 46(4), 435–467 (2016)

  14. Hasibi, F., Balog, K., Bratsberg, S.E.: On the reproducibility of the TAGME entity linking system. In: Ferro, N., Crestani, F., Moens, M.-F., Mothe, J., Silvestri, F., Di Nunzio, G.M., Hauff, C., Silvello, G. (eds.) ECIR 2016. LNCS, vol. 9626, pp. 436–449. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30671-1_32

  15. Hawking, D., Jones, T.: Reordering an index to speed query processing without loss of effectiveness. In: Proceedings of ADCS, pp. 17–24 (2012)

  16. Kane, A., Tompa, F.W.: Split-lists and initial thresholds for WAND-based search. In: Proceedings of SIGIR, pp. 877–880 (2018)

  17. Lemire, D., Kurz, N., Rupp, C.: Stream vbyte: faster byte-oriented integer compression. Inf. Proc. Lett. 130, 1–6 (2018)

  18. Mallia, A., Ottaviano, G., Porciani, E., Tonellotto, N., Venturini, R.: Faster BlockMax WAND with variable-sized blocks. In: Proceedings of SIGIR, pp. 625–634 (2017)

  19. Moffat, A., Stuiver, L.: Binary interpolative coding for effective index compression. Inf. Retr. 3(1), 25–47 (2000)

  20. Ottaviano, G., Venturini, R.: Partitioned Elias-Fano indexes. In: Proceedings of SIGIR, pp. 273–282 (2014)

  21. Richardson, M., Prakash, A., Brill, E.: Beyond pagerank: machine learning for static ranking. In: Proceedings of WWW, pp. 707–715 (2006)

  22. Shieh, W.-Y., Chen, T.-F., Shann, J.J.-J., Chung, C.-P.: Inverted file compression through document identifier reassignment. Inf. Proc. Man. 39(1), 117–131 (2003)

  23. Silvestri, F.: Sorting out the document identifier assignment problem. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 101–112. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71496-5_12

  24. Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proceedings of WWW, pp. 401–410 (2009)

Acknowledgments

This work was supported by the National Science Foundation (IIS-1718680), the Australian Research Council (DP170102231), and the Australian Government (RTP Scholarship).

Author information

Corresponding author

Correspondence to Joel Mackenzie.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Mackenzie, J., Mallia, A., Petri, M., Culpepper, J.S., Suel, T. (2019). Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol 11437. Springer, Cham. https://doi.org/10.1007/978-3-030-15712-8_22

  • DOI: https://doi.org/10.1007/978-3-030-15712-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15711-1

  • Online ISBN: 978-3-030-15712-8

  • eBook Packages: Computer Science (R0)
