Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval

Authors:
Xin Jin, Daniel Agun, Tao Yang, Qinghao Wu, Yifan Shen, Susen Zhao

University of California, Santa Barbara, Santa Barbara, USA

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, October 2016, Pages 377–386
https://doi.org/10.1145/2983323.2983733

Published: 24 October 2016

ABSTRACT

The previous two-phase method for searching versioned documents seeks a cost tradeoff by first ranking document versions using non-positional information. The second phase then re-ranks the top document versions using positional information with fragment-based index compression. This paper proposes an alternative approach that, in Phase 1, uses cluster-based retrieval guided by version representatives to quickly narrow the search scope and, in Phase 2, uses a hybrid index structure with adaptive runtime data traversal to speed up search. The hybrid scheme exploits the respective advantages of a forward index and an inverted index, chosen according to term characteristics, to minimize the time spent extracting positional and other feature information during runtime search. This paper compares several indexing and data traversal options with different time and space tradeoffs and describes evaluation results that demonstrate their effectiveness. The experimental results show that the proposed scheme can be up to about 4x as fast as the previous work on solid state drives while retaining good relevance.
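The two-phase flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy data, the function names (`phase1`, `positions`, `phase2`, `search`), the use of a term-union as the cluster representative, the document-frequency threshold that picks between forward and inverted index, and the span-based proximity score are all assumptions made for illustration. The abstract only states that the choice depends on "term characteristics".

```python
from collections import defaultdict
from itertools import product

# Toy collection (made up): cluster id -> {version id -> token list}.
CLUSTERS = {
    "docA": {"A.v1": "hybrid index for versioned search".split(),
             "A.v2": "hybrid index speeds up versioned document search".split()},
    "docB": {"B.v1": "cooking pasta with tomato sauce".split(),
             "B.v2": "cooking pasta search tips index".split()},
}

# Forward index: version -> tokens. Inverted index: term -> version -> positions.
FORWARD = {v: toks for versions in CLUSTERS.values() for v, toks in versions.items()}
INVERTED = defaultdict(lambda: defaultdict(list))
for v, toks in FORWARD.items():
    for pos, term in enumerate(toks):
        INVERTED[term][v].append(pos)

# Assumed representative: the union of all terms across a cluster's versions.
REPRESENTATIVE = {c: set().union(*vs.values()) for c, vs in CLUSTERS.items()}


def phase1(query, k=1):
    """Rank clusters by how many query terms their representative contains."""
    ranked = sorted(CLUSTERS, key=lambda c: -len(REPRESENTATIVE[c] & set(query)))
    return ranked[:k]


def positions(term, version, df_threshold=2):
    """Hybrid lookup (assumed rule): scan the forward index for terms that occur
    in many versions, otherwise read the positional postings."""
    postings = INVERTED.get(term, {})
    if len(postings) > df_threshold:
        return [i for i, t in enumerate(FORWARD[version]) if t == term]
    return list(postings.get(version, []))


def phase2(query, clusters):
    """Re-rank versions in the chosen clusters by matched terms plus proximity."""
    scored = []
    for c in clusters:
        for v in CLUSTERS[c]:
            pos_lists = [p for p in (positions(t, v) for t in query) if p]
            if not pos_lists:
                continue
            # Smallest window holding one occurrence of each matched term.
            span = min(max(combo) - min(combo) for combo in product(*pos_lists))
            scored.append((len(pos_lists) + 1.0 / (1 + span), v))
    return [v for _, v in sorted(scored, reverse=True)]


def search(query, k=1):
    return phase2(query, phase1(query, k))


print(search(["hybrid", "index", "search"]))
```

In this sketch, Phase 1 never touches positional data, so only the versions inside the selected clusters pay the cost of position extraction in Phase 2. The brute-force span computation is only suitable for a toy example.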

Index Terms

Information systems > Information retrieval > Search engine architectures and scalability > Search engine indexing

Published in
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN: 9781450340731
DOI: 10.1145/2983323
General Chairs: Snehasis Mukhopadhyay (Indiana University Purdue University Indianapolis, USA), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA)
Program Chairs: Elisa Bertino (Purdue University), Fabio Crestani (University of Lugano), Javed Mostafa (University of North Carolina), Jie Tang (Tsinghua University), Luo Si (Alibaba Group Inc & Purdue University), Xiaofang Zhou (University of Queensland), Yi Chang (Yahoo Research), Yunyao Li (IBM Research - Almaden), Parikshit Sondhi (WalmartLabs)
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
positional inverted index
query processing
search in document archives
versioned data
Qualifiers
- research-article
Acceptance Rates
CIKM '16 Paper Acceptance Rate: 160 of 701 submissions, 23%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%