DOI: 10.1145/1529282.1529669
research-article

Fast error-tolerant search on very large texts

Published: 08 March 2009

ABSTRACT

We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem arises naturally in error-tolerant full-text search of the following kind: for a given query, return not only the documents that match the query words exactly but also those that match their spelling variants. This is the inverse of the well-known "Did you mean: … ?" feature of web search engines, where the error tolerance is on the side of the query rather than on the side of the documents.
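To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the paper, which the abstract does not spell out. It assumes that two words count as spelling variants when their Levenshtein distance is at most a threshold, and the names `edit_distance`, `cluster_variants`, `expand_query`, and `max_dist` are hypothetical. The quadratic pairwise comparison is only feasible for small lexicons.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_variants(lexicon, max_dist=1):
    """Map each word to the other lexicon words within max_dist edits.

    Assumption: "spelling variant" means edit distance <= max_dist.
    Clusters may overlap, as in the problem statement.
    """
    clusters = defaultdict(set)
    words = sorted(set(lexicon))
    for i, w in enumerate(words):
        for v in words[i + 1:]:
            # The length difference is a lower bound on the edit distance.
            if abs(len(w) - len(v)) > max_dist:
                continue
            if edit_distance(w, v) <= max_dist:
                clusters[w].add(v)
                clusters[v].add(w)
    return clusters


def expand_query(query_words, clusters):
    """Replace each query word by the sorted list of itself plus its variants."""
    return [sorted({w} | clusters.get(w, set())) for w in query_words]


lexicon = ["algorithm", "algoritm", "algorithms", "search", "serach"]
clusters = cluster_variants(lexicon, max_dist=1)
expanded = expand_query(["algoritm"], clusters)
```

With this toy lexicon, the cluster of "algorithm" contains both "algoritm" and "algorithms", while the cluster of "algoritm" contains only "algorithm", which shows how clusters can overlap without being identical. A query for "algoritm" is expanded so that documents containing "algorithm" also match. Note that "search" and "serach" are not clustered at `max_dist=1`, because a transposition costs two edits under plain Levenshtein distance.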

We combine various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.


Published in

SAC '09: Proceedings of the 2009 ACM symposium on Applied Computing
March 2009
2347 pages
ISBN: 9781605581668
DOI: 10.1145/1529282

Copyright © 2009 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,650 of 6,669 submissions, 25%
