poster

Exploiting user clicks for automatic seed set generation for entity matching

Authors:
Xiao Bai (Yahoo! Research, Barcelona, Spain),
Flavio P. Junqueira (Microsoft Research, Cambridge, United Kingdom),
Srinivasan H. Sengamedu (Komli Labs, Bangalore, India)

KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2013, Pages 980–988
https://doi.org/10.1145/2487575.2487662

Published: 11 August 2013

ABSTRACT

Matching entities from different information sources is an important problem in data analysis and data integration. It is challenging, however, because of the number and diversity of the information sources involved and the significant editorial effort required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) with 360K pages from 6 major travel websites, we obtain 84K matchings (of 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) the quality of matching obtained from a classifier trained on the resulting seed data is promising: its performance matches that of editorial data at small sizes and improves as the seed set grows.
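
The abstract names random walk with restart over click data but does not give the formulation, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes a bipartite query-page graph weighted by click counts; the click triples, the function names, the restart probability of 0.15, and the iteration count are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_click_graph(clicks):
    """Build an undirected, weighted bipartite graph from (query, page, count) triples."""
    graph = defaultdict(dict)
    for query, page, count in clicks:
        q, p = ("q", query), ("p", page)  # tag nodes so queries and pages never collide
        graph[q][p] = graph[q].get(p, 0) + count
        graph[p][q] = graph[p].get(q, 0) + count
    return graph


def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate RWR scores from `start` by power iteration.

    `restart` and `iterations` are illustrative defaults, not values from the paper.
    """
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            total = sum(graph[node].values())
            for neighbor, weight in graph[node].items():
                # Follow an edge with probability proportional to its click count.
                nxt[neighbor] += (1 - restart) * mass * weight / total
        nxt[start] += restart  # jump back to the start node
        scores = nxt
    return scores


# Hypothetical click log: (query, clicked page, click count).
clicks = [
    ("hotel arts barcelona", "siteA/hotel-arts", 5),
    ("hotel arts barcelona", "siteB/arts-hotel", 3),
    ("arts hotel bcn", "siteB/arts-hotel", 2),
    ("arts hotel bcn", "siteC/hotel-arts-bcn", 4),
    ("ritz paris", "siteA/ritz-paris", 6),
]

graph = build_click_graph(clicks)
scores = random_walk_with_restart(graph, ("p", "siteA/hotel-arts"))
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

In this toy log, the pages on siteA and siteC share no query, yet the walk assigns the siteC page a positive score through the intermediate siteB page, which illustrates how such a walk can reduce sparseness in click data. The unrelated "ritz paris" page is unreachable from the start node and keeps a score of zero.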


Index Terms

Computing methodologies → Machine learning → Learning paradigms → Unsupervised learning → Cluster analysis
Information systems → Information systems applications → Data mining


Published in
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2013
1534 pages
ISBN: 9781450321747
DOI: 10.1145/2487575
Editors: Rayid Ghani (University of Chicago), Ted E. Senator (SAIC), Paul Bradley (MethodCare, Inc.), Rajesh Parekh (Groupon), Jingrui He (Stevens Institute of Technology)
General Chairs: Robert L. Grossman (University of Chicago and Open Data Group), Ramasamy Uthurusamy (General Motors Corporation, retired)
Program Chairs: Inderjit S. Dhillon (University of Texas), Yehuda Koren (Google)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
co-clustering
entity matching
random walk
user clicks
Qualifiers
- poster
Acceptance Rates
KDD '13 Paper Acceptance Rate: 125 of 726 submissions, 17%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%