research-article

Supporting sub-document updates and queries in an inverted index

Authors:
Vuk Ercegovac, IBM, San Jose, CA, USA
Vanja Josifovski, Yahoo! Inc., Sunnyvale, CA, USA
Ning Li, IBM, San Jose, CA, USA
Mauricio R. Mediano, Yahoo! Inc., Sunnyvale, CA, USA
Eugene J. Shekita, IBM, San Jose, CA, USA

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management, October 2008, Pages 659–668
https://doi.org/10.1145/1458082.1458171

Published: 26 October 2008

ABSTRACT

Inverted indexes have become the standard indexing method for supporting search queries in a variety of content-based applications. Examples of such applications include enterprise document management, e-mail, web search, and social networks. One shortcoming in current inverted index designs is that they support only document-level updates, forcing a full document to be reindexed even if just part of it changes. This paper describes a new inverted index design that enables applications to break a document into semantically meaningful sub-documents or "sections". Each section of a document can be updated separately, but search queries can still work seamlessly across sections. Our index design is motivated by applications where the metadata associated with each document tends to be smaller and more frequently updated than the document's content, yet it is desirable to search the metadata and content with the same index structure. A novel self-optimizing query execution algorithm is described to efficiently join the sections of a document in the inverted index. Experimental results on TREC and patent data are provided, showing that sections can dramatically improve overall system throughput on a mixed workload of updates and queries.
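The idea of section-level updates can be illustrated with a minimal in-memory sketch. It is not the paper's index layout or its self-optimizing join algorithm: the class `SectionedIndex`, its methods, and the section names "content" and "metadata" are hypothetical, and a plain set intersection stands in for the join across sections. The sketch only shows that one section can be reindexed without touching the others, while a conjunctive query still matches terms that live in different sections of the same document.

```python
from collections import defaultdict


class SectionedIndex:
    """Toy inverted index with per-section postings (illustrative assumption)."""

    def __init__(self):
        # term -> section name -> set of doc ids containing the term there
        self.postings = defaultdict(lambda: defaultdict(set))
        # (doc_id, section) -> terms currently indexed for that section
        self.section_terms = {}

    def update_section(self, doc_id, section, text):
        """Reindex one section of one document; other sections are untouched."""
        key = (doc_id, section)
        # Remove the postings left by the previous version of this section.
        for term in self.section_terms.get(key, ()):
            self.postings[term][section].discard(doc_id)
        terms = set(text.lower().split())
        for term in terms:
            self.postings[term][section].add(doc_id)
        self.section_terms[key] = terms

    def docs_with(self, term):
        """Doc ids containing `term` in any section."""
        result = set()
        for ids in self.postings.get(term.lower(), {}).values():
            result |= ids
        return result

    def search_all(self, *terms):
        """Conjunctive query; each term may match in a different section."""
        sets = [self.docs_with(t) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []


index = SectionedIndex()
index.update_section(1, "content", "inverted index design")
index.update_section(1, "metadata", "status draft")
index.update_section(2, "content", "inverted index compression")
index.update_section(2, "metadata", "status published")

before = index.search_all("inverted", "draft")
# Only the small metadata section of document 1 is reindexed here.
index.update_section(1, "metadata", "status published")
after = index.search_all("inverted", "draft")
published = index.search_all("inverted", "published")
```

In this toy setup, the metadata update rewrites two postings rather than the whole document, which is the kind of saving the abstract attributes to sections. A real index would keep sorted, compressed posting lists on disk and join them with cursors instead of materializing sets.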


Index Terms

Supporting sub-document updates and queries in an inverted index
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)


Published in
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
October 2008
1562 pages
ISBN:9781595939913
DOI:10.1145/1458082
General Chair: James G. Shanahan (Church and Duncan Group Inc, USA)
Program Chairs: Sihem Amer-Yahia (Yahoo! Research, USA), Ioana Manolescu (INRIA, France), Yi Zhang (University of California, Santa Cruz, USA), David A. Evans (JustSystems Evans Research, USA), Alek Kolcz (Microsoft Live Labs, USA), Key-Sun Choi (KAIST, Korea), Abdur Chowdury (Twitter, USA)
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
inverted index
section
update
zig-zag