DOI: 10.1145/2983323.2983733
research-article

Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval

Published: 24 October 2016

ABSTRACT

The previous two-phase method for searching versioned documents seeks a cost tradeoff by first ranking document versions using non-positional information; the second phase then re-ranks the top document versions using positional information with fragment-based index compression. This paper proposes an alternative approach that uses cluster-based retrieval, guided by version representatives, to quickly narrow the search scope in Phase 1, and develops a hybrid index structure with adaptive runtime data traversal to speed up the Phase 2 search. The hybrid scheme exploits the advantages of both the forward index and the inverted index, chosen according to term characteristics, to minimize the time spent extracting positional and other feature information during runtime search. The paper compares several indexing and data traversal options with different time and space tradeoffs and describes evaluation results that demonstrate their effectiveness. The experimental results show that the proposed scheme can be up to about 4x as fast as the previous work on solid state drives while retaining good relevance.
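
The two-phase flow summarized above can be illustrated with a small toy sketch. It is not the paper's implementation: the collection, the function names, the set-of-terms cluster representative, the posting-list-length threshold used to pick between the forward and inverted index, and the minimal-span proximity score are all assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import product

# Toy collection (made up): cluster id -> {version id -> text}.
CLUSTERS = {
    "docA": {"A1": "hybrid index for versioned search",
             "A2": "hybrid index structure for fast versioned document search"},
    "docB": {"B1": "cooking pasta with tomato sauce",
             "B2": "cooking fresh pasta with tomato and basil sauce"},
}

def build_indexes(clusters):
    representative = {}            # cluster -> set of terms (non-positional)
    forward = {}                   # version -> term -> positions
    inverted = defaultdict(dict)   # term -> version -> positions
    for cid, versions in clusters.items():
        representative[cid] = set()
        for vid, text in versions.items():
            positions = defaultdict(list)
            for pos, term in enumerate(text.split()):
                positions[term].append(pos)
            forward[vid] = dict(positions)
            for term, plist in positions.items():
                inverted[term][vid] = plist
            representative[cid].update(positions)
    return representative, forward, inverted

def phase1(query, representative, k):
    """Keep the k clusters whose representative matches the most query terms."""
    overlap = {c: len(set(query) & terms) for c, terms in representative.items()}
    ranked = sorted(overlap, key=lambda c: (-overlap[c], c))
    return [c for c in ranked[:k] if overlap[c] > 0]

def positions_for(term, vid, forward, inverted, threshold):
    """Assumed hybrid rule: terms with long posting lists are read from the
    forward index, other terms from the inverted index. Both paths return
    the same positions; only the access cost would differ in a real system."""
    if len(inverted.get(term, {})) >= threshold:
        return forward[vid].get(term, [])
    return inverted.get(term, {}).get(vid, [])

def phase2(query, candidates, clusters, forward, inverted, threshold):
    """Re-rank versions in candidate clusters by the smallest span covering all terms."""
    results = []
    for cid in candidates:
        for vid in clusters[cid]:
            plists = [positions_for(t, vid, forward, inverted, threshold) for t in query]
            if not all(plists):
                continue  # a query term is missing from this version
            span = min(max(c) - min(c) for c in product(*plists))
            results.append((vid, span))
    return sorted(results, key=lambda r: (r[1], r[0]))

def search(query, clusters=CLUSTERS, k=1, threshold=2):
    representative, forward, inverted = build_indexes(clusters)
    candidates = phase1(query, representative, k)
    return phase2(query, candidates, clusters, forward, inverted, threshold)

print(search(["versioned", "search"]))
```

In this toy setting the choice of index only changes where positions are read from, not the ranking, which mirrors the idea that the hybrid structure targets runtime cost rather than relevance.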

    • Published in

      CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
      October 2016
      2566 pages
      ISBN: 9781450340731
      DOI: 10.1145/2983323

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      CIKM '16 Paper Acceptance Rate: 160 of 701 submissions, 23%. Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%.
