ABSTRACT
The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999. Google ScholarDigital Library
- M. Bilenko and R. J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, Feb. 2002.Google Scholar
- W. W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, Aug. 2000. Google ScholarDigital Library
- W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, 2002. Google ScholarDigital Library
- D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.Google ScholarDigital Library
- R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.Google ScholarCross Ref
- I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210, 1969.Google ScholarCross Ref
- Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), Bled, Slovenia, 1999. Google ScholarDigital Library
- D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997. Google ScholarDigital Library
- M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD-95), pages 127--138, San Jose, CA, May 1995. Google ScholarDigital Library
- T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169--184. MIT Press, 1999. Google ScholarDigital Library
- T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), Bled, Slovenia, June 1999. Google ScholarDigital Library
- A. K. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000), pages 169--178, Boston, MA, Aug. 2000. Google ScholarDigital Library
- A. E. Monge and C. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 267--270, Portland, OR, Aug. 1996.Google Scholar
- A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 23--29, Tuscon, AZ, May 1997.Google Scholar
- U. Y. Nahm and R. J. Mooney. Using information extraction to aid the discovery of prediction rules from texts. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, Boston, MA, Aug. 2000.Google Scholar
- S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443--453, 1970.Google ScholarCross Ref
- H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954--959, 1959.Google ScholarCross Ref
- J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 185--208. MIT Press, 1999.Google Scholar
- L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257--286, 1989.Google ScholarCross Ref
- E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 1998. Google ScholarDigital Library
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, 2002. Google ScholarDigital Library
- S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, 2002. Google ScholarDigital Library
- V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. Google ScholarDigital Library
- W. E. Winkler. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Bureau of the Census, Wachington, DC, 1999.Google Scholar
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 1999. Google ScholarDigital Library
- B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of 18th International Conference on Machine Learning (ICML-2001), Williamstown, MA, 2001. Google ScholarDigital Library
Index Terms
- Adaptive duplicate detection using learnable string similarity measures
Recommendations
Learning similarity measures from data
AbstractDefining similarity measures is a requirement for some machine learning methods. One such method is case-based reasoning (CBR) where the similarity measure is used to retrieve the stored case or a set of cases most similar to the query case. ...
Some cosine similarity measures and distance measures between q‐rung orthopair fuzzy sets
AbstractIn this paper, we consider some cosine similarity measures and distance measures between q‐rung orthopair fuzzy sets (q‐ROFSs). First, we define a cosine similarity measure and a Euclidean distance measure of q‐ROFSs, their properties are also ...
Adaptive stereo similarity fusion using confidence measures
We propose similarity fusion strategy based on stereo confidences.We propose a consensus strategy to exploit spatial correlation between pixels.Our fusion increases the accuracy of global and local stereo algorithms.We out-perform other fusion ...
Comments