ABSTRACT
The previous two-phase method for searching versioned documents trades off cost by first ranking document versions with non-positional information. Its second phase then re-ranks the top document versions using positional information with fragment-based index compression. This paper proposes an alternative approach that uses cluster-based retrieval, guided by version representatives, to quickly narrow the search scope in Phase 1, and develops a hybrid index structure with adaptive runtime data traversal to speed up the Phase 2 search. The hybrid scheme exploits the respective advantages of the forward index and the inverted index, chosen according to term characteristics, to minimize the time spent extracting positional and other feature information during runtime search. This paper compares several indexing and data traversal options with different time and space tradeoffs and describes evaluation results that demonstrate their effectiveness. The experimental results show that the proposed scheme can be up to about 4x as fast as the previous work on solid-state drives while retaining good relevance.
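The two-phase flow described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names, the term-frequency score used in Phase 1, the minimal-span proximity score used in Phase 2, and the `df_threshold` rule for choosing between the forward and inverted index are all assumptions made for illustration; the abstract does not specify any of them.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def build_indexes(versions):
    """Build a forward index (version -> term -> positions) and an
    inverted positional index (term -> version -> positions)."""
    forward, inverted = {}, defaultdict(dict)
    for vid, text in versions.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokenize(text)):
            positions[term].append(pos)
        forward[vid] = dict(positions)
        for term, plist in positions.items():
            inverted[term][vid] = plist
    return forward, dict(inverted)

def phase1(query_terms, representatives, k):
    """Phase 1 (assumed scoring): rank clusters by query-term frequency
    in each cluster's representative and keep the top k matching clusters."""
    scores = {}
    for cid, text in representatives.items():
        tokens = tokenize(text)
        scores[cid] = sum(tokens.count(t) for t in query_terms)
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return [c for c in ranked[:k] if scores[c] > 0]

def positions_for(term, vid, forward, inverted, df_threshold):
    """Hypothetical adaptive traversal: for terms that occur in many
    versions, read the candidate's forward entry; otherwise use the
    term's posting list. Both paths return the same positions."""
    postings = inverted.get(term, {})
    if len(postings) > df_threshold:
        return forward[vid].get(term, [])
    return postings.get(vid, [])

def min_span(pos_lists):
    """Width of the smallest window holding one position from every list,
    or None if some list is empty."""
    if any(not plist for plist in pos_lists):
        return None
    merged = sorted((p, i) for i, plist in enumerate(pos_lists) for p in plist)
    count, covered, left, best = defaultdict(int), 0, 0, None
    for p, i in merged:
        if count[i] == 0:
            covered += 1
        count[i] += 1
        while covered == len(pos_lists):
            width = p - merged[left][0] + 1
            if best is None or width < best:
                best = width
            j = merged[left][1]
            count[j] -= 1
            if count[j] == 0:
                covered -= 1
            left += 1
    return best

def phase2(query_terms, candidates, forward, inverted, df_threshold=2):
    """Phase 2 (assumed scoring): re-rank candidates by smallest span."""
    scored = []
    for vid in candidates:
        lists = [positions_for(t, vid, forward, inverted, df_threshold)
                 for t in query_terms]
        span = min_span(lists)
        if span is not None:
            scored.append((span, vid))
    return [vid for _, vid in sorted(scored)]

def search(query, clusters, representatives, forward, inverted, k=1):
    terms = tokenize(query)
    top_clusters = phase1(terms, representatives, k)
    candidates = [v for c in top_clusters for v in clusters[c]]
    return phase2(terms, candidates, forward, inverted)

# Toy data: two versions of one document plus an unrelated document.
versions = {
    "a1": "hybrid index for versioned search",
    "a2": "hybrid positional index for versioned document search",
    "b1": "cooking pasta with tomato sauce",
}
clusters = {"A": ["a1", "a2"], "B": ["b1"]}
representatives = {"A": versions["a1"], "B": versions["b1"]}
forward, inverted = build_indexes(versions)
results = search("hybrid index", clusters, representatives, forward, inverted)
```

In this toy setup, Phase 1 discards cluster B without touching any positional data, and Phase 2 only reads positions for the versions in cluster A. The forward and inverted paths return identical positions, so the choice between them affects only how much data is traversed, not the ranking.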
Index Terms
- Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval