research-article

Crowdsourcing algorithms for entity resolution

Authors:
Norases Vesdapunt (Stanford University)
Kedar Bellare (Facebook)
Nilesh Dalvi (Facebook)

Proceedings of the VLDB Endowment, Volume 7, Issue 12, pp. 1071–1082. https://doi.org/10.14778/2732977.2732982

Published: 01 August 2014

Abstract

In this paper, we study a hybrid human-machine approach to the problem of Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity and are therefore duplicates of each other. Our input is a graph over all the records in a database, where each edge carries a probability denoting our prior belief (based on machine learning models) that the pair of records it connects are duplicates. Our objective is to resolve all the duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred given a = b and b = c). We consider the problem of designing optimal strategies for asking questions to humans that minimize the expected number of questions asked. Using our theoretical framework, we analyze several strategies and show that a strategy claimed as "optimal" for this problem in a recent work can perform arbitrarily badly in theory. We propose alternative strategies with theoretical guarantees. Using both public datasets and the production system at Facebook, we show that our techniques are effective in practice.
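
To make the role of transitivity concrete, the following is a minimal sketch rather than any of the strategies analyzed in the paper. It assumes that human answers are always correct, that a known non-match can also be propagated (a = b and b ≠ c imply a ≠ c), and that edges are asked in decreasing order of prior probability purely for illustration. All names (`Resolver`, `resolve`, `ask`) are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Edge = Tuple[str, str]


class Resolver:
    """Tracks human answers and what follows from them by transitivity."""

    def __init__(self, records: List[str]) -> None:
        self.parent: Dict[str, str] = {r: r for r in records}
        # Pairs of cluster roots that are known to be different entities.
        self.non_matches: Set[FrozenSet[str]] = set()

    def find(self, r: str) -> str:
        # Union-find lookup with path halving.
        while self.parent[r] != r:
            self.parent[r] = self.parent[self.parent[r]]
            r = self.parent[r]
        return r

    def inferred(self, a: str, b: str) -> Optional[bool]:
        """True/False if the answer is already implied, None if unknown."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if frozenset((ra, rb)) in self.non_matches:
            return False
        return None

    def record(self, a: str, b: str, same: bool) -> None:
        ra, rb = self.find(a), self.find(b)
        if not same:
            self.non_matches.add(frozenset((ra, rb)))
        elif ra != rb:
            self.parent[rb] = ra
            # Re-root stored non-matches that mentioned the absorbed root.
            self.non_matches = {
                frozenset(ra if x == rb else x for x in pair)
                for pair in self.non_matches
            }


def resolve(
    records: List[str],
    priors: Dict[Edge, float],
    ask: Callable[[str, str], bool],
) -> Tuple[List[Set[str]], int]:
    """Ask only the edges whose answer is not yet implied; count questions."""
    state = Resolver(records)
    asked = 0
    for (a, b), _ in sorted(priors.items(), key=lambda kv: -kv[1]):
        if state.inferred(a, b) is None:
            state.record(a, b, ask(a, b))
            asked += 1
    clusters: Dict[str, Set[str]] = {}
    for r in records:
        clusters.setdefault(state.find(r), set()).add(r)
    return list(clusters.values()), asked


# Toy instance (made-up data): a, b, c are one entity; d is another.
truth = {"a": 1, "b": 1, "c": 1, "d": 2}
priors = {
    ("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.7,
    ("c", "d"): 0.3, ("a", "d"): 0.2, ("b", "d"): 0.1,
}
clusters, asked = resolve(list(truth), priors, lambda x, y: truth[x] == truth[y])
print(clusters, asked)
```

On this toy graph only some of the six edges need a human answer; the rest follow from answers already given. How many questions are needed depends on the order in which edges are asked, which is the quantity the strategies studied in the paper aim to minimize in expectation.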

Published in
Proceedings of the VLDB Endowment, Volume 7, Issue 12
August 2014
296 pages
ISSN: 2150-8097
Editors: H. V. Jagadish (University of Michigan), Aoying Zhou (East China Normal University, China)
Publisher: VLDB Endowment