ABSTRACT
We study the problem of approximate membership extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. We solve this problem in two steps. In the first step, for each substring in the text, we filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. We develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text. Our experimental study of the proposed algorithms on real and synthetic datasets showed that our solutions significantly outperform existing methods in the literature.
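The two-step scheme described above can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the signature scheme, so the sketch makes its own assumptions: signatures are lowercase whitespace tokens, similarity is Jaccard over token sets, and the names `build_index`, `extract`, `threshold`, and `max_window` are hypothetical. It is not the paper's algorithm. It only shows the general idea of filtering through inverted lists, verifying the surviving candidates, and reusing work across overlapping windows by scanning only the inverted list of the token that was just added.

```python
from collections import defaultdict


def build_index(dictionary):
    """Map each token (the assumed signature) to the ids of entries containing it."""
    entry_tokens = [frozenset(entry.lower().split()) for entry in dictionary]
    index = defaultdict(list)
    for entry_id, tokens in enumerate(entry_tokens):
        for token in tokens:
            index[token].append(entry_id)
    return entry_tokens, index


def extract(text, dictionary, threshold=0.6, max_window=4):
    """Return (start, end, entry) triples in token positions, with end exclusive."""
    entry_tokens, index = build_index(dictionary)
    words = text.lower().split()
    matches = []
    for start in range(len(words)):
        window = set()              # distinct tokens in the current window
        overlap = defaultdict(int)  # entry id -> number of tokens shared with the window
        for end in range(start, min(start + max_window, len(words))):
            token = words[end]
            if token not in window:
                window.add(token)
                # Filter step: growing the window by one token scans only
                # that token's inverted list; entries sharing no token with
                # the window never become candidates.
                for entry_id in index.get(token, ()):
                    overlap[entry_id] += 1
            # Verify step: compute Jaccard similarity for each candidate.
            for entry_id, shared in overlap.items():
                union = len(window) + len(entry_tokens[entry_id]) - shared
                if shared / union >= threshold:
                    matches.append((start, end + 1, dictionary[entry_id]))
    return matches


if __name__ == "__main__":
    entities = ["new york university", "stanford university"]
    print(extract("she studied at new york univ before stanford", entities))
```

In this sketch the window `new york` is extracted as an approximate match for `new york university` (Jaccard 2/3), while `stanford` alone falls below the assumed threshold of 0.6 (Jaccard 1/2).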
Index Terms
- Efficient algorithms for approximate member extraction using signature-based inverted lists