ABSTRACT
In just a few years, crowdsourcing markets like Mechanical Turk have become the dominant mechanism for building "gold standard" datasets in areas of computer science ranging from natural language processing to audio transcription. The assumption behind this sea change - an assumption that is central to the approaches taken in hundreds of research projects - is that crowdsourced markets can accurately replicate the judgments of the general population for knowledge-oriented tasks. Focusing on the important domain of semantic relatedness algorithms and leveraging Clark's theory of common ground as a framework, we demonstrate that this assumption can be highly problematic. Using 7,921 semantic relatedness judgements from 72 scholars and 39 crowdworkers, we show that crowdworkers on Mechanical Turk produce significantly different semantic relatedness gold standard judgements than people from other communities. We also show that algorithms that perform well against Mechanical Turk gold standard datasets do significantly worse when evaluated against other communities' gold standards. Our results call into question the broad use of Mechanical Turk for the development of gold standard datasets and demonstrate the importance of understanding these datasets from a human-centered point of view. More generally, our findings problematize the notion that a universal gold standard dataset exists for all knowledge tasks.
Index Terms
- Turkers, Scholars, "Arafat" and "Peace": Cultural Communities and Algorithmic Gold Standards