DOI: 10.1145/1281192.1281217

Canonicalization of database records using adaptive similarity measures

Published: 12 August 2007

ABSTRACT

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from online papers. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is little existing work on canonicalization.

In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different styles of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. KDD versus Conference on Knowledge Discovery and Data Mining). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. These approaches can incorporate arbitrary textual evidence to select a canonical record. We evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.
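
As a rough illustration of the idea in the abstract (not the paper's implementation), the Python sketch below picks the record that is cheapest to reach from all other records under a weighted edit distance, and shows how lowering the deletion cost shifts the choice toward an abbreviated form. The function names `edit_distance` and `canonical`, the example records, and the deletion cost of 0.1 are assumptions made for illustration. The costs are set by hand here, whereas the paper describes learning them from annotated data.

```python
from typing import Sequence


def edit_distance(source: str, target: str,
                  ins_cost: float = 1.0,
                  del_cost: float = 1.0,
                  sub_cost: float = 1.0) -> float:
    """Cost of the cheapest edit script that turns `source` into `target`."""
    # prev[j] = cost of turning the empty source prefix into target[:j].
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, t_ch in enumerate(target, start=1):
            keep_or_sub = prev[j - 1] + (0.0 if s_ch == t_ch else sub_cost)
            delete = prev[j] + del_cost        # drop s_ch from the source
            insert = curr[j - 1] + ins_cost    # add t_ch to the output
            curr.append(min(keep_or_sub, delete, insert))
        prev = curr
    return prev[-1]


def canonical(records: Sequence[str], **costs: float) -> str:
    """Return the record with the lowest total cost of being reached
    from every record in the cluster (a "central" record)."""
    return min(
        records,
        key=lambda cand: sum(edit_distance(r, cand, **costs) for r in records),
    )


# Hypothetical cluster of noisy references to the same venue.
records = [
    "KDD",
    "Conference on Knowledge Discovery and Data Mining",
    "Conference on Knowledge Discovery and Data Minimg",  # misspelling
    "Conference on Knowledge Discovery amd Data Mining",  # misspelling
]

# Uniform costs: the correctly spelled expanded form is most central.
print(canonical(records))
# Cheap deletions: long records collapse cheaply into the abbreviation.
print(canonical(records, del_cost=0.1))  # prints "KDD"
```

In this sketch the distance is directional: it measures the cost of editing each record into the candidate. That is why a low deletion cost makes "KDD" cheap to reach from the expanded forms, while the expanded forms remain expensive to reach from "KDD" because insertions keep their full cost.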

    • Published in

      KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2007
      1080 pages
      ISBN: 9781595936097
      DOI: 10.1145/1281192

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      KDD '07 paper acceptance rate: 111 of 573 submissions (19%)
      Overall acceptance rate: 1,133 of 8,635 submissions (13%)
