ABSTRACT
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method hinges critically on providing a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
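The idea of interactively discovering challenging training pairs can be illustrated with a minimal sketch of committee-based active learning. This is not the system described above: it assumes a single string-similarity feature, a committee of simple threshold classifiers that are all consistent with the labels collected so far, and vote entropy as the disagreement measure. All function names (`similarity`, `committee_thresholds`, `vote_entropy`, `pick_query`) are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher
from math import log2


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def committee_thresholds(labeled, size=5):
    """Return `size` thresholds consistent with the labeled pairs.

    `labeled` is a list of (score, is_duplicate) tuples. A committee
    member predicts "duplicate" when score >= its threshold. Thresholds
    are spread evenly between the highest-scoring known non-duplicate
    and the lowest-scoring known duplicate (an illustrative assumption).
    """
    lo = max((s for s, dup in labeled if not dup), default=0.0)
    hi = min((s for s, dup in labeled if dup), default=1.0)
    if lo > hi:  # inconsistent labels: collapse to the midpoint
        lo = hi = (lo + hi) / 2
    step = (hi - lo) / (size + 1)
    return [lo + step * (i + 1) for i in range(size)]


def vote_entropy(score, thresholds):
    """Disagreement of the committee on one pair, in bits."""
    p = sum(score >= t for t in thresholds) / len(thresholds)
    if p in (0.0, 1.0):
        return 0.0  # unanimous vote carries no disagreement
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def pick_query(unlabeled_scores, labeled, size=5):
    """Index of the unlabeled pair the committee disagrees on most."""
    thresholds = committee_thresholds(labeled, size)
    return max(
        range(len(unlabeled_scores)),
        key=lambda i: vote_entropy(unlabeled_scores[i], thresholds),
    )
```

In an interactive loop under these assumptions, the pair returned by `pick_query` would be shown to the user, its label appended to `labeled`, and the committee rebuilt. Pairs that are clearly duplicates or clearly distinct receive unanimous votes and are never queried, which is the intuition behind needing fewer labeled instances.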
Index Terms
- Interactive deduplication using active learning