DOI: 10.1145/775047.775087

Interactive deduplication using active learning

Published: 23 July 2002

ABSTRACT

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can determine when a pair of records refers to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method hinges critically on providing a covering and challenging set of training pairs that bring out the subtleties of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.

We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
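The idea of letting the learner pick its own training pairs can be illustrated with a small sketch. This is not the system described in the paper: the abstract does not specify the classifier or the selection rule, so the sketch assumes a deliberately simple setup, namely a single string-similarity score per record pair, a threshold classifier, and uncertainty sampling (ask the user about the pair closest to the current threshold). All function names here (`pair_similarity`, `fit_threshold`, `active_learn`) are hypothetical.

```python
from difflib import SequenceMatcher


def pair_similarity(pair):
    """Similarity in [0, 1] between the two record strings of a pair (case-insensitive)."""
    a, b = pair
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(labeled):
    """Toy classifier (an assumption, not the paper's): the decision threshold is the
    midpoint between the least similar known duplicate and the most similar known
    non-duplicate. Needs at least one example of each class."""
    dup = [s for s, is_dup in labeled if is_dup]
    non = [s for s, is_dup in labeled if not is_dup]
    return (min(dup) + max(non)) / 2.0


def active_learn(pool, oracle, seed, rounds, score=pair_similarity):
    """Run `rounds` of uncertainty sampling.

    pool   -- unlabeled candidate pairs
    oracle -- callable standing in for the user; returns True if a pair is a duplicate
    seed   -- initial (score, is_duplicate) examples, at least one of each class
    Returns the final threshold and the list of (score, is_duplicate) examples.
    """
    labeled = list(seed)
    remaining = [(score(p), p) for p in pool]
    for _ in range(rounds):
        if not remaining:
            break
        threshold = fit_threshold(labeled)
        # The most "challenging" pair is the one the current classifier is least sure about.
        idx = min(range(len(remaining)), key=lambda i: abs(remaining[i][0] - threshold))
        s, pair = remaining.pop(idx)
        labeled.append((s, oracle(pair)))
    return fit_threshold(labeled), labeled
```

In this sketch each round asks for a label on the pair nearest the current decision boundary, so the labels are spent on ambiguous pairs rather than on obvious duplicates or obvious non-duplicates. A realistic system would use several similarity features and a stronger classifier in place of the single threshold.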

Published in

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN: 158113567X
DOI: 10.1145/775047

Copyright © 2002 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '02 paper acceptance rate: 44 of 307 submissions, 14%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
