skip to main content
10.1145/1099554.1099739acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

Fast on-line index construction by geometric partitioning

Published:31 October 2005Publication History

ABSTRACT

Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction -- that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.

References

  1. V. N. Anh and A. Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8 (1): 151--166, January 2005. Source code available from www.cs.mu.oz.au/~alistair/carry/. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Biliris. An efficient database storage structure for large dynamic objects. In F. Golshani, editor, Proc. IEEE Int. Conf. on Data Engineering, pages 301--308, Tempe, Arizona, February 1992. IEEE Computer Society. ISBN 0-8186-2545-7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Biliris. The performance of three database storage structures for managing large objects. In M. Stonebraker, editor, Proc. ACM-SIGMOD Int. Conf. on the Management of Data, pages 276--285, San Diego, California, June 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. W. Brown, J. P. Callan, and W. B. Croft. Fast incremental indexing for full-text information retrieval. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. Int. Conf. on Very Large Databases, pages 192--202, Santiago, Chile, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Büttcher and C. L. A. Clarke. Indexing time vs. query time trade-offs in dynamic information retrieval systems. In N. Fuhr, H.-J. Schek, and A. Chowdhury, editors, Proc. ACM CIKM Int. Conf. on Information and Knowledge Management, Bremen, Germany, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita. Object and file management in the EXODUS extensible database system. In W. W. Chu, G. Gardarin, S. Ohsuga, and Y. Kambayashi, editors, Proc. Int. Conf. on Very Large Databases, pages 91--100, Kyoto, Japan, August 1986. Morgan Kaufmann. ISBN 0-934613-18-4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. J. Carey, D. J. DeWitt, J. E. Richardson, and E. J. Shekita. Storage management for objects in EXODUS. In W. Kim and F. H. Lochovsky, editors, Object-Oriented Concepts, Databases, and Applications, pages 341--369. Addison-Wesley Longman, New York, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. L. A. Clarke and G. V. Cormack. Dynamic inverted indexes for a distributed full-text retrieval system. Technical Report MT-95-01, Department of Computer Science, University of Waterloo, Waterloo, Canada, 1995. MultiText Project Technical Report.Google ScholarGoogle Scholar
  9. D. R. Cutting and J. O. Pedersen. Optimizations for dynamic inverted index maintenance. In J.-Luc Vidick, editor, Proc. ACM-SIGIR Int. Conf. on Research and Development in Information Retrieval, pages 405--411, Brussels, Belgium, September 1990. ACM. ISBN 0-89791-408-2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. Hawking, N. Craswell, and P. Thistlewaite. Overview of TREC-7 very large collection track. In E. M. Voorhees and D. K. Harman, editors, The Eighth Tert REtrieval Conference (TREC-8), pages 91--104, Gaithersburg, MD, 1999. National Institute of Standards and Technology Special Publication 500-246.Google ScholarGoogle Scholar
  11. S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Jour. of the American Society for Information Science and Technology, 54 (8): 713--729, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. T. J. Lehman and B. G. Lindsay. The Starburst long field manager. In P. M. G. Apers and G. Wiederhold, editors, Proc. Int. Conf. on Very Large Databases, pages 375--383, Amsterdam, The Netherlands, August 1989. ISBN 1-55860-101-5. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. N. Lester, J. Zobel, and H.E. Williams. In-place versus re-build versus re-merge: Index maintenance strategies for text retrieval systems. In V. Estivill-Castro, editor, Proc. Australasian Computer Science Conf., pages 15--22, January 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Moffat and T. A. H. Bell. In situ generation of compressed inverted files. Journal of the American Society of Information Science, 46 (7): 537--550, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng, editors, Proc. ACM-SIGIR Int. Conf. on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, August 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. W.-Y. Shieh and C.-P. Chung. A statistics-based approach to incrementally update inverted files. In H. R. Arabnia, editor, Proc. Int. Conf. on Information and Knowledge Engineering, pages 38--43, Las Vegas, Nevada, June 2003. CSREA Press.Google ScholarGoogle Scholar
  17. K. Shoens, A. Tomasic, and H. García-Molina. Synthetic workload performance analysis of incremental updates. In W. B. Croft and C. J. van Rijsbergen, editors, Proc. International ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 329--338, Dublin, Ireland, July 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Spink, D. Wolfram, B. J. Jansen, and T. Saracevic. Searching the web: The public and their queries. Jour. of the American Society for Information Science, 52 (3): 226--234, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. In Proc. ACM-SIGMOD Int. Conf. on the Management of Data, pages 289--300, Minneapolis, Minnesota, May 1994. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. H.E. Williams and J. Zobel. Searchable words on the web. International Journal of Digital Libraries, 5 (2): 99--105, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, California, second edition, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast on-line index construction by geometric partitioning

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in
                • Published in

                  cover image ACM Conferences
                  CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
                  October 2005
                  854 pages
                  ISBN:1595931406
                  DOI:10.1145/1099554

                  Copyright © 2005 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 31 October 2005

                  Permissions

                  Request permissions about this article.

                  Request Permissions

                  Check for updates

                  Qualifiers

                  • Article

                  Acceptance Rates

                  CIKM '05 Paper Acceptance Rate77of425submissions,18%Overall Acceptance Rate1,861of8,427submissions,22%

                  Upcoming Conference

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader