Turkers, Scholars, "Arafat" and "Peace": Cultural Communities and Algorithmic Gold Standards

Authors:
Shilad Sen, Macalester College, St. Paul, USA
Margaret E. Giesel, Macalester College, St. Paul, MN, USA
Rebecca Gold, Macalester College, St. Paul, MN, USA
Benjamin Hillmann, Macalester College, St. Paul, MN, USA
Matt Lesicko, Macalester College, St. Paul, MN, USA
Samuel Naden, Macalester College, St. Paul, MN, USA
Jesse Russell, Macalester College, St. Paul, MN, USA
Zixiao (Ken) Wang, Macalester College, St. Paul, MN, USA
Brent Hecht, University of Minnesota, Minneapolis, USA

CSCW '15: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, February 2015, Pages 826–838. https://doi.org/10.1145/2675133.2675285

Published: 28 February 2015

ABSTRACT

In just a few years, crowdsourcing markets like Mechanical Turk have become the dominant mechanism for building "gold standard" datasets in areas of computer science ranging from natural language processing to audio transcription. The assumption behind this sea change, one that is central to the approaches taken in hundreds of research projects, is that crowdsourced markets can accurately replicate the judgments of the general population for knowledge-oriented tasks. Focusing on the important domain of semantic relatedness algorithms and leveraging Clark's theory of common ground as a framework, we demonstrate that this assumption can be highly problematic. Using 7,921 semantic relatedness judgments from 72 scholars and 39 crowdworkers, we show that crowdworkers on Mechanical Turk produce significantly different semantic relatedness gold standard judgments than people from other communities. We also show that algorithms that perform well against Mechanical Turk gold standard datasets do significantly worse when evaluated against other communities' gold standards. Our results call into question the broad use of Mechanical Turk for the development of gold standard datasets and demonstrate the importance of understanding these datasets from a human-centered point of view. More generally, our findings problematize the notion that a universal gold standard dataset exists for all knowledge tasks.
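
To make the evaluation setup concrete, the following is a minimal sketch of scoring one relatedness algorithm against two communities' gold standards. It assumes Spearman rank correlation as the agreement measure, which the abstract does not specify. The function names and all ratings below are hypothetical toy values chosen for illustration; they are not data or results from the study.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: one algorithm's scores for five word pairs, plus
# mean human ratings for the same pairs from two invented communities.
algorithm_scores = [0.9, 0.7, 0.4, 0.2, 0.1]
gold_community_a = [3.8, 3.1, 2.0, 1.2, 0.5]  # orders the pairs like the algorithm
gold_community_b = [2.0, 3.5, 3.9, 0.4, 1.1]  # orders the pairs differently

rho_a = spearman(algorithm_scores, gold_community_a)
rho_b = spearman(algorithm_scores, gold_community_b)
print(f"agreement with community A: {rho_a:.2f}")
print(f"agreement with community B: {rho_b:.2f}")
```

In this toy setup the same algorithm output earns a different score depending only on which community's ratings serve as the gold standard, which is the kind of effect the abstract describes.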

Index Terms

Human-centered computing


Published in
CSCW '15: Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing
February 2015
1956 pages
ISBN:9781450329224
DOI:10.1145/2675133
General Chairs: Dan Cosley (Cornell University, USA), Andrea Forte (Drexel University, USA)
Program Chairs: Luigina Ciolfi (Sheffield Hallam University, UK), David McDonald (University of Washington, USA)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
amazon mechanical turk
cultural communities
gold standard datasets
natural language processing
semantic relatedness
user studies
Qualifiers
- research-article
Acceptance Rates
CSCW '15 Paper Acceptance Rate: 161 of 575 submissions, 28%. Overall Acceptance Rate: 2,235 of 8,521 submissions, 26%.