Fast error-tolerant search on very large texts

Authors:
Marjan Celikik, Max Planck Institute for Computer Science, Saarbrücken, Germany
Holger Bast, Max Planck Institute for Computer Science, Saarbrücken, Germany

SAC '09: Proceedings of the 2009 ACM symposium on Applied Computing, March 2009, Pages 1724–1731, https://doi.org/10.1145/1529282.1529669

Published: 08 March 2009

ABSTRACT

We consider the following spelling variants clustering problem: given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem arises naturally in error-tolerant full-text search of the following kind: for a given query, return not only the documents that match the query words exactly but also those that match their spelling variants. This is the inverse of the well-known "Did you mean: … ?" feature of web search engines, where the error tolerance is on the side of the query and not on the side of the documents.
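To make the problem statement concrete, here is a minimal sketch of a naive quadratic baseline. It is not the algorithm of this paper: it assumes that "spelling variant" means "within a fixed Levenshtein distance", and the names `edit_distance`, `naive_variant_clusters`, and `max_dist` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def naive_variant_clusters(lexicon, max_dist=2):
    """Map each word to the set of lexicon words within max_dist edits.

    Assumption: the similarity criterion is plain Levenshtein distance.
    Every word belongs to its own cluster, and clusters may overlap.
    """
    words = sorted(set(lexicon))
    clusters = {w: {w} for w in words}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            # A length gap larger than max_dist already rules the pair out.
            if abs(len(u) - len(v)) > max_dist:
                continue
            if edit_distance(u, v) <= max_dist:
                clusters[u].add(v)
                clusters[v].add(u)
    return clusters


lexicon = ["algorithm", "alogrithm", "algorithms", "search"]
clusters = naive_variant_clusters(lexicon, max_dist=2)
```

In this toy lexicon the cluster of `algorithm` contains both `alogrithm` and `algorithms`, while those two words are not in each other's cluster, which is why the clusters are allowed to overlap. Comparing all pairs is hopeless for a lexicon of millions of words; that cost is what a fast algorithm has to avoid.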

We combine various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
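The abstract does not say how the integration into CompleteSearch works. As a rough illustration of document-side error tolerance in general, the sketch below expands a query word to its precomputed variants and merges their posting lists. The function `error_tolerant_lookup`, the `variants` map, and the toy `inverted_index` are all hypothetical and are not taken from the paper.

```python
def error_tolerant_lookup(query_word, variants, inverted_index):
    """Return sorted ids of documents containing query_word or a variant.

    variants: word -> set of spelling variants (assumed precomputed).
    inverted_index: word -> list of document ids containing that word.
    """
    doc_ids = set()
    # Fall back to the query word itself when it has no known variants.
    for word in variants.get(query_word, {query_word}):
        doc_ids.update(inverted_index.get(word, ()))
    return sorted(doc_ids)


# Toy data, invented for illustration only.
variants = {
    "algorithm": {"algorithm", "alogrithm"},
    "alogrithm": {"alogrithm", "algorithm"},
}
inverted_index = {
    "algorithm": [1, 3],
    "alogrithm": [2],
    "search": [1, 2],
}

hits = error_tolerant_lookup("algorithm", variants, inverted_index)
```

Here document 2 is found for the query `algorithm` even though it only contains the misspelling `alogrithm`. Merging one posting list per variant at query time is the simple approach whose cost grows with cluster size, so it should not be read as the paper's scheme.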

Index Terms

Fast error-tolerant search on very large texts
1. Human-centered computing
  1. Human computer interaction (HCI)
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
SAC '09: Proceedings of the 2009 ACM symposium on Applied Computing
March 2009
2347 pages
ISBN: 9781605581668
DOI: 10.1145/1529282
Conference Chairs:
Sung Y. Shin, South Dakota State University, United States
Sascha Ossowski, University Rey Juan Carlos, Spain
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
approximate string matching
error-tolerant search
spelling variants
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%