Crowdsourcing algorithms for entity resolution

Research article. Published: 1 August 2014.

Abstract

In this paper, we study a hybrid human-machine approach to Entity Resolution (ER). The goal of ER is to identify all records in a database that refer to the same underlying entity and are therefore duplicates of one another. Our input is a graph over all the records in a database, where each edge carries a probability denoting our prior belief (derived from machine-learning models) that the two records it connects are duplicates. Our objective is to resolve all duplicates by asking humans to verify the equality of a subset of edges, leveraging the transitivity of the equality relation to infer the remaining edges (e.g., a = c can be inferred from a = b and b = c). We study the problem of designing question-asking strategies that minimize the expected number of questions posed to humans. Using our theoretical framework, we analyze several strategies and show that a strategy claimed to be "optimal" for this problem in recent work can perform arbitrarily badly in theory. We propose alternative strategies with theoretical guarantees. Using both public datasets and the production system at Facebook, we show that our techniques are effective in practice.
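The setup in the abstract lends itself to a compact illustration. The sketch below is not the paper's algorithm; it is a minimal baseline that processes edges in decreasing order of prior probability and uses a union-find structure to skip any question whose answer already follows from earlier answers, either positively (a = b and b = c imply a = c) or negatively (a = b and b ≠ c imply a ≠ c). The `ask_human` callback, the edge representation, and the descending-probability ordering are illustrative assumptions; choosing the question order well is precisely the optimization problem the paper studies. The sketch also assumes crowd answers are always correct.

```python
# Minimal sketch of transitivity-aware crowdsourced entity resolution.
# Assumptions (not from the paper): edges are (record_a, record_b, prob)
# triples, ask_human(a, b) is a hypothetical oracle returning True if the
# crowd says the two records are the same entity, and answers are error-free.

class UnionFind:
    """Tracks clusters of records confirmed to be duplicates."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def resolve(edges, ask_human):
    """Ask about edges in decreasing order of prior probability, skipping
    any question whose answer is already implied by transitivity.
    Returns the clusters and the number of questions actually asked."""
    uf = UnionFind()
    not_same = set()   # pairs of cluster roots known to be distinct
    questions = 0
    for a, b, _p in sorted(edges, key=lambda e: -e[2]):
        ra, rb = uf.find(a), uf.find(b)
        if ra == rb:
            continue                  # implied duplicate: skip question
        if (ra, rb) in not_same or (rb, ra) in not_same:
            continue                  # implied non-duplicate: skip question
        questions += 1
        if ask_human(a, b):           # crowd says "same entity"
            uf.union(a, b)
            # re-canonicalize negative knowledge onto the new roots
            not_same = {(uf.find(x), uf.find(y)) for x, y in not_same}
        else:
            not_same.add((ra, rb))
    clusters = {}
    for node in uf.parent:
        clusters.setdefault(uf.find(node), []).append(node)
    return list(clusters.values()), questions

# Hypothetical usage: three records that are all duplicates of each other.
edges = [("r1", "r2", 0.9), ("r2", "r3", 0.8), ("r1", "r3", 0.7)]
clusters, asked = resolve(edges, ask_human=lambda a, b: True)
# asked == 2: the r1 = r3 question is never posed, since it follows
# by transitivity from r1 = r2 and r2 = r3.
```

Even this naive ordering shows why strategy matters: on the three-record example it saves one of three questions, and on larger cliques the savings grow, while a poorly chosen ordering can force many more questions, which is the behavior the paper analyzes.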



Published in

Proceedings of the VLDB Endowment, Volume 7, Issue 12 (August 2014), 296 pages. ISSN 2150-8097. Publisher: VLDB Endowment.
