DOI: 10.1145/2487575.2487662
poster

Exploiting user clicks for automatic seed set generation for entity matching

Published: 11 August 2013

ABSTRACT

Matching entities across different information sources is an important problem in data analysis and data integration. It is challenging, however, because of the number and diversity of the sources involved and the significant editorial effort required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that (i) with 360K pages from 6 major travel websites, we obtain 84K matchings (of 179K pages) that refer to the same entities, with an average precision of 0.826; and (ii) the quality of matching obtained from a classifier trained on the resulting seed data is promising: its performance matches that of editorial data at small sizes and improves with size.
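The random-walk step named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the click graph is built, which restart probability is used, or how convergence is decided, so the bipartite query-page graph, the `restart=0.15` default, the tolerance, and the toy queries and page names below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def build_click_graph(clicks):
    """Build an undirected, weighted bipartite graph from
    (query, page, click_count) triples. Nodes are tagged tuples:
    ("q", query) for queries and ("p", page) for pages."""
    graph = defaultdict(dict)
    for query, page, count in clicks:
        q, p = ("q", query), ("p", page)
        graph[q][p] = graph[q].get(p, 0) + count
        graph[p][q] = graph[p].get(q, 0) + count
    return dict(graph)


def random_walk_with_restart(graph, start, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate r <- (1 - restart) * W r + restart * e_start, where W moves
    probability mass along edges in proportion to click counts.
    Returns a dict mapping every node to its visiting probability."""
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(max_iter):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            total = sum(graph[node].values())
            for nbr, weight in graph[node].items():
                nxt[nbr] += (1.0 - restart) * mass * weight / total
        nxt[start] += restart  # the walker jumps back to the start node
        delta = sum(abs(nxt[n] - scores[n]) for n in graph)
        scores = nxt
        if delta < tol:
            break
    return scores


# Hypothetical click log: two hotel entities seen on two made-up sites.
clicks = [
    ("hilton paris", "siteA/hilton-paris", 5),
    ("hilton paris", "siteB/hilton-paris-opera", 3),
    ("hilton hotel paris opera", "siteB/hilton-paris-opera", 2),
    ("ritz london", "siteA/ritz-london", 4),
    ("ritz london", "siteB/the-ritz", 1),
]
graph = build_click_graph(clicks)
scores = random_walk_with_restart(graph, ("p", "siteA/hilton-paris"))
```

In this toy log, the query "hilton hotel paris opera" was never clicked through to `siteA/hilton-paris`, yet the walk assigns it a nonzero score because it shares a clicked page with "hilton paris". That is the sense in which such a walk can densify a sparse click graph; nodes in the unrelated "ritz london" component receive a score of zero.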


Published in

KDD '13: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2013, 1534 pages
ISBN: 9781450321747
DOI: 10.1145/2487575

Copyright © 2013 ACM

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '13 paper acceptance rate: 125 of 726 submissions, 17%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
