Article

Canonicalization of database records using adaptive similarity measures

Authors:
Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum
University of Massachusetts

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 201–209
https://doi.org/10.1145/1281192.1281217

Published: 12 August 2007

ABSTRACT

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization.

In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. These approaches can incorporate arbitrary textual evidence to select a canonical record. We evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.
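
The idea of a "central" record under adjustable edit costs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the three cost parameters, and the choice to score each candidate by the total cost of editing every record into it are assumptions made for illustration, and the learning of costs by stochastic hill-climbing is not shown.

```python
def weighted_edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Cost of editing `source` into `target` with per-operation costs.

    Standard dynamic program over prefixes; with all costs equal to 1 this
    is the ordinary Levenshtein distance. The parameterization is an assumption.
    """
    n = len(target)
    # Row for the empty source prefix: build target[:j] using j insertions.
    prev = [j * ins_cost for j in range(n + 1)]
    for i in range(1, len(source) + 1):
        # Turning source[:i] into the empty string takes i deletions.
        curr = [i * del_cost] + [0.0] * n
        for j in range(1, n + 1):
            match = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            curr[j] = min(
                prev[j] + del_cost,      # delete source[i-1]
                curr[j - 1] + ins_cost,  # insert target[j-1]
                prev[j - 1] + match,     # keep or substitute
            )
        prev = curr
    return prev[n]


def canonical_record(records, **costs):
    """Pick the record that every record can be edited into most cheaply."""
    def total_cost(candidate):
        return sum(weighted_edit_distance(r, candidate, **costs) for r in records)
    return min(records, key=total_cost)


records = ["ACM KDD", "KDD", "ACM KDD Conf"]
print(canonical_record(records))                # uniform costs
print(canonical_record(records, del_cost=0.1))  # cheap deletions
```

With uniform costs the sketch selects "ACM KDD", the record closest to the other two. Lowering `del_cost` makes it cheap to edit the longer records down to "KDD", so the abbreviated form is selected instead, which mirrors the abbreviated-versus-expanded preference described in the abstract.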

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: data cleaning, data mining, information extraction
Qualifiers: Article, Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%