ABSTRACT
We consider the following spelling variants clustering problem: Given a list of distinct words, called a lexicon, compute (possibly overlapping) clusters of words that are spelling variants of each other. This problem naturally arises in error-tolerant full-text search of the following kind: For a given query, return not only documents matching the query words exactly but also those matching their spelling variants. This is the inverse of the well-known "Did you mean: … ?" web search engine feature, where the error tolerance is on the side of the query and not on the side of the documents.
We combine various ideas from the large body of literature on approximate string searching and spelling correction into a new algorithm for the spelling variants clustering problem that is both accurate and very efficient in time and space. Our largest lexicon, containing roughly 10 million words, can be processed in about 16 minutes on a standard PC using 10 MB of additional space. This beats the previously best scheme by a factor of two in running time and by a factor of more than ten in space usage. We have integrated our algorithms into the CompleteSearch engine in a way that achieves error-tolerant search without significant blowup in either index size or query processing time.
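To make the problem statement concrete, the following is a minimal brute-force sketch of spelling variants clustering. It is not the algorithm of this paper, which the abstract does not spell out. It assumes that two words count as spelling variants when their Levenshtein distance is at most a threshold; the names `edit_distance`, `variant_clusters`, and `max_dist` are hypothetical, and the quadratic pairwise comparison would not scale to a lexicon of 10 million words.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def variant_clusters(lexicon, max_dist=1):
    """Map each word to the set of words within max_dist edits of it.

    Assumption: "spelling variant" means Levenshtein distance <= max_dist.
    Clusters may overlap, as the problem statement allows.
    """
    words = list(lexicon)
    clusters = {w: {w} for w in words}
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            # A length gap larger than max_dist rules the pair out cheaply.
            if abs(len(u) - len(v)) > max_dist:
                continue
            if edit_distance(u, v) <= max_dist:
                clusters[u].add(v)
                clusters[v].add(u)
    return clusters


if __name__ == "__main__":
    lexicon = ["algorithm", "alogrithm", "algorithms", "search"]
    for word, cluster in variant_clusters(lexicon, max_dist=2).items():
        print(word, sorted(cluster))
```

With `max_dist=2`, the cluster of "algorithm" contains both "alogrithm" and "algorithms", while the cluster of "alogrithm" does not contain "algorithms", which illustrates why clusters can overlap without coinciding.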
Index Terms
- Fast error-tolerant search on very large texts