skip to main content
10.1145/1645953.1645995acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Efficient algorithms for approximate member extraction using signature-based inverted lists

Published:02 November 2009Publication History

ABSTRACT

We study the problem of approximate membership extraction (AME), i.e., how to efficiently extract substrings in a text document that approximately match some strings in a given dictionary. This problem is important in a variety of applications such as named entity recognition and data cleaning. We solve this problem in two steps. In the first step, for each substring in the text, we filter away the strings in the dictionary that are very different from the substring. In the second step, each candidate string is verified to decide whether the substring should be extracted. We develop an incremental algorithm using signature-based inverted lists to minimize the duplicate list-scan operations of overlapping windows in the text. Our experimental study of the proposed algorithms on real and synthetic datasets showed that our solutions significantly outperform existing methods in the literature.

References

  1. A. Arasu, V. Ganti, R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. K. Chakrabarti, S. Chaudhuri, V. Ganti, D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M.R.Garey and D.S.Johnson. Computers and Intractability: Guidance to the Theory of NP-Completeness. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Li, B,Wang, X. Yang, VGRAM: Improving performance of approximate queries on string collections using variable length grams. In VLDB 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31--88, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Sarawagi, A.Kirpal, Efficient set joins on similarity predicates. In SIGMOD Conference, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35--43, 2001.Google ScholarGoogle Scholar
  12. E. Sutinen and J. Tarhio. On using q-grams locations in approximate string matching. In ESA, pages 327--340, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. Wang, C. Xiao, X. Lin, C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. C. Yao and F. F. Yao. Dictionary loop-up with small errors. In CPM, pages 387--394, 1995Google ScholarGoogle Scholar

Index Terms

  1. Efficient algorithms for approximate member extraction using signature-based inverted lists

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
            November 2009
            2162 pages
            ISBN:9781605585123
            DOI:10.1145/1645953

            Copyright © 2009 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 2 November 2009

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate1,861of8,427submissions,22%

            Upcoming Conference

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader