
FourEyes: Leveraging Tool Diversity as a Means to Improve Aggregate Accuracy in Crowdsourcing

Published: 09 August 2019

Abstract

Crowdsourcing is a common means of collecting image segmentation training data for use in a variety of computer vision applications. However, designing accurate crowd-powered image segmentation systems is challenging, because defining object boundaries in an image requires significant fine motor skills and hand-eye coordination, which makes these tasks error-prone. Typically, special segmentation tools are created and then answers from multiple workers are aggregated to generate more accurate results. However, individual tool designs can bias how and where people make mistakes, resulting in shared errors that remain even after aggregation. In this article, we introduce a novel crowdsourcing approach that leverages tool diversity as a means of improving aggregate crowd performance. Our idea is that, given a diverse set of tools, aggregating answers across tools can improve collective performance by offsetting the systematic biases induced by the individual tools themselves. To demonstrate the effectiveness of the proposed approach, we design four different tools and present FourEyes, a crowd-powered image segmentation system that aggregates answers across them. We then conduct a series of studies that evaluate different aggregation conditions and show that using multiple tools can significantly improve aggregate accuracy. Furthermore, we investigate post-processing for multi-tool aggregation as a correction mechanism. We introduce a novel region-based method for synthesizing more accurate bounds for image segmentation tasks by averaging surrounding annotations. In addition, we explore the effect of adjusting the threshold parameter of an EM-based aggregation method. Our results suggest that not only the individual tool’s design, but also the correction mechanism, can affect the performance of multi-tool aggregation. This article extends work presented at ACM IUI 2018 [46] by providing a novel region-based error-correction method and an additional in-depth evaluation of the proposed approach.
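
To make the cross-tool aggregation idea concrete, the sketch below shows a minimal per-pixel voting scheme over binary masks produced by different tools, written in Python with NumPy. It is an illustration only, not the FourEyes implementation: the names aggregate_masks, masks, and threshold are ours, and the article's EM-based method additionally estimates per-worker reliability, whereas this sketch weights every tool equally. The threshold argument merely plays a role analogous to the EM threshold parameter discussed in the abstract.

    import numpy as np

    def aggregate_masks(masks, threshold=0.5):
        # masks: list of H x W binary (0/1) arrays, one segmentation per tool or worker.
        # threshold: fraction of masks that must mark a pixel as foreground for it to
        # be kept in the aggregate (0.5 corresponds to simple majority voting).
        stack = np.stack(masks, axis=0).astype(float)  # shape: (n_tools, H, W)
        vote_fraction = stack.mean(axis=0)             # per-pixel agreement in [0, 1]
        return (vote_fraction >= threshold).astype(np.uint8)

    # Example: three hypothetical tools disagree on a 2 x 2 patch; voting resolves it.
    tool_a = np.array([[1, 1], [0, 0]])
    tool_b = np.array([[1, 0], [0, 0]])
    tool_c = np.array([[1, 1], [1, 0]])
    print(aggregate_masks([tool_a, tool_b, tool_c]))   # -> [[1 1] [0 0]]

Raising the threshold trades recall for precision in the aggregate mask, which is the kind of sensitivity the article examines when varying the threshold of the EM-based aggregation.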

References

  1. Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. 2016. What’s the point: Semantic segmentation with point supervision. In Proceedings of the European Conference on Computer Vision. Springer, 549--565.
  2. Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. 2013. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Trans. Graph. 32, 4 (2013), 111.
  3. Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. 2010. Soylent: A word processor with a crowd inside. In Proceedings of the 23rd ACM Symposium on User Interface Software and Technology. ACM, 313--322.
  4. Jonathan Bragg, Mausam, and Daniel S. Weld. 2013. Crowdsourcing multi-label classification for taxonomy creation. In Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing.
  5. Axel Carlier, Vincent Charvillat, Amaia Salvador, Xavier Giro-i Nieto, and Oge Marques. 2014. Click’n’Cut: Crowdsourced interactive segmentation with object candidates. In Proceedings of the International ACM Workshop on Crowdsourcing for Multimedia. ACM, 53--56.
  6. Alexander Philip Dawid and Allan M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Appl. Stat. 28, 1 (1979), 20--28.
  7. Thomas G. Dietterich et al. 2000. Ensemble methods in machine learning. Mult. Class. Syst. 1857 (2000), 1--15.
  8. Steven Dow, Anand Kulkarni, Scott Klemmer, and Björn Hartmann. 2012. Shepherding the crowd yields better work. In Proceedings of the ACM Conference on Computer Supported Cooperative Work. ACM, 1013--1022.
  9. Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the European Conference on Computational Learning Theory. Springer, 23--37.
  10. Timnit Gebru, Jonathan Krause, Jia Deng, and Li Fei-Fei. 2017. Scalable annotation of fine-grained categories without experts. In Proceedings of the International Conference on Human Factors in Computing Systems. ACM, 1877--1881.
  11. Mitchell Gordon, Jeffrey P. Bigham, and Walter S. Lasecki. 2015. LegionTools: A toolkit + UI for recruiting and routing crowds to synchronous real-time tasks. In Adjunct Proceedings of the 28th ACM Symposium on User Interface Software & Technology. ACM, 81--82.
  12. Sai Gouravajhala, Jean Y. Song, Jinyeong Yim, Raymond Fok, Yanda Huang, Fan Yang, Kyle Wang, Yilei An, and Walter S. Lasecki. 2017. Towards hybrid intelligence for robotics. In Proceedings of the Collective Intelligence Conference (CI’17).
  13. Danna Gurari, Mehrnoosh Sameki, and Margrit Betke. 2016. Investigating the influence of data familiarity to improve the design of a crowdsourcing image annotation system. In Proceedings of the AAAI Conference on Human Computation & Crowdsourcing (HCOMP’16).
  14. Lars Kai Hansen and Peter Salamon. 1990. Neural network ensembles. IEEE Trans. Pattern Anal. Machine Intell. 12, 10 (1990), 993--1001.
  15. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. CoRR abs/1703.06870.
  16. Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation. ACM, 64--67.
  17. Alexandre Kaspar, Genevieve Patterson, Changil Kim, Yagiz Aksoy, Wojciech Matusik, and Mohamed Elgharib. 2018. Crowd-guided ensembles: How can we choreograph crowd workers for video segmentation? In Proceedings of the Conference on Human Factors in Computing Systems (CHI’18). ACM, New York, NY, Article 111, 12 pages.
  18. Harmanpreet Kaur, Mitchell Gordon, Yiwei Yang, Jeffrey P. Bigham, Jaime Teevan, Ece Kamar, and Walter S. Lasecki. 2017. CrowdMask: Using crowds to preserve privacy in crowd-powered systems via progressive filtering. In Proceedings of the AAAI Conference on Human Computation (HCOMP’17), Vol. 17.
  19. Juho Kim, Phu Tran Nguyen, Sarah Weir, Philip J. Guo, Robert C. Miller, and Krzysztof Z. Gajos. 2014. Crowdsourcing step-by-step information extraction to enhance existing how-to videos. In Proceedings of the 32nd ACM Conference on Human Factors in Computing Systems (CHI’14). ACM, New York, NY, 4017--4026.
  20. Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E. Kraut. 2011. CrowdForge: Crowdsourcing complex work. In Proceedings of the 24th ACM Symposium on User Interface Software and Technology. ACM, 43--52.
  21. Anand Kulkarni, Matthew Can, and Björn Hartmann. 2012. Collaboratively crowdsourcing workflows with Turkomatic. In Proceedings of the ACM Conference on Computer Supported Cooperative Work. ACM, 1003--1012.
  22. Walter Lasecki and Jeffrey Bigham. 2012. Self-correcting crowds. In CHI’12 Extended Abstracts on Human Factors in Computing Systems. ACM, 2555--2560.
  23. Walter S. Lasecki, Mitchell Gordon, Danai Koutra, Malte F. Jung, Steven P. Dow, and Jeffrey P. Bigham. 2014. Glance: Rapidly coding behavioral video with the crowd. In Proceedings of the 27th ACM Symposium on User Interface Software and Technology. ACM, 551--562.
  24. Walter S. Lasecki, Christopher Miller, Adam Sadilek, Andrew Abumoussa, Donato Borrello, Raja Kushalnagar, and Jeffrey Bigham. 2012. Real-time captioning by groups of non-experts. In Proceedings of the 25th ACM Symposium on User Interface Software and Technology. ACM, 23--34.
  25. Walter S. Lasecki, Christopher D. Miller, and Jeffrey P. Bigham. 2013. Warping time for more effective real-time crowdsourcing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’13). ACM, New York, NY, 2033--2036.
  26. Walter S. Lasecki, Kyle I. Murray, Samuel White, Robert C. Miller, and Jeffrey P. Bigham. 2011. Real-time crowd control of existing interfaces. In Proceedings of the 24th ACM Symposium on User Interface Software and Technology. ACM, 23--32.
  27. Walter S. Lasecki, Young Chol Song, Henry Kautz, and Jeffrey P. Bigham. 2013. Real-time crowd labeling for deployable activity recognition. In Proceedings of the Conference on Computer Supported Cooperative Work. ACM, 1203--1212.
  28. Matthew Lease, Jessica Hullman, Jeffrey P. Bigham, Michael S. Bernstein, Juho Kim, Walter S. Lasecki, Saeideh Bakhshi, Tanushree Mitra, and Robert C. Miller. 2013. Mechanical Turk is not anonymous. Soc. Sci. Res. Netw. (2013).
  29. Christopher Lin, Mausam, and Daniel S. Weld. 2012. Dynamically switching between synergistic workflows for crowdsourcing. In Proceedings of the AAAI Conference on Artificial Intelligence.
  30. Christopher H. Lin, Mausam, and Daniel S. Weld. 2012. Crowdsourcing control: Moving beyond multiple choice. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI’12).
  31. Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. 2016. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3159--3167.
  32. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. Springer, 740--755.
  33. Greg Little, Lydia B. Chilton, Max Goldman, and Robert C. Miller. 2010. TurKit: Human computation algorithms on Mechanical Turk. In Proceedings of the 23rd ACM Symposium on User Interface Software and Technology. ACM, 57--66.
  34. Ching Liu, Juho Kim, and Hao-Chuan Wang. 2018. ConceptScape: Collaborative concept mapping for video learning. In Proceedings of the Conference on Human Factors in Computing Systems. ACM, 387.
  35. Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15).
  36. Alan Lundgard, Yiwei Yang, Maya L. Foster, and Walter S. Lasecki. 2018. Bolt: Instantaneous crowdsourcing via just-in-time training. In Proceedings of the Conference on Human Factors in Computing Systems (CHI’18). ACM, New York, NY.
  37. Kurt Luther, Nathan Hahn, Steven P. Dow, and Aniket Kittur. 2015. Crowdlines: Supporting synthesis of diverse information sources through crowdsourced outlines. In Proceedings of the 3rd AAAI Conference on Human Computation and Crowdsourcing.
  38. Allan MacLean, Richard M. Young, Victoria M. E. Bellotti, and Thomas P. Moran. 1991. Questions, options, and criteria: Elements of design space analysis. Human--Comput. Interact. 6, 3--4 (1991), 201--250.
  39. Andrew Mao, Ece Kamar, Yiling Chen, Eric Horvitz, Megan E. Schwamb, Chris J. Lintott, and Arfon M. Smith. 2013. Volunteering versus work for pay: Incentives and tradeoffs in crowdsourcing. In Proceedings of the 1st AAAI Conference on Human Computation and Crowdsourcing.
  40. Christian A. Meissner and John C. Brigham. 2001. Thirty years of investigating the own-race bias in memory for faces: A meta-analytic review. Psychology, Public Policy, and Law 7, 1 (2001), 3.
  41. Tom Ouyang and Yang Li. 2012. Bootstrapping personal gesture shortcuts with the wisdom of the crowd and handwriting recognition. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’12). ACM, New York, NY, 2895--2904.
  42. Akshay Rao, Harmanpreet Kaur, and Walter S. Lasecki. 2018. Plexiglass: Multiplexing passive and active tasks for more efficient crowdsourcing. In Proceedings of the AAAI Conference on Human Computation. ACM.
  43. Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 77, 1 (2008), 157--173.
  44. Jeffrey M. Rzeszotarski and Aniket Kittur. 2011. Instrumenting the crowd: Using implicit behavioral measures to predict task performance. In Proceedings of the 24th ACM Symposium on User Interface Software and Technology. ACM, 13--22.
  45. Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 254--263.
  46. Jean Y. Song, Raymond Fok, Alan Lundgard, Fan Yang, Juho Kim, and Walter S. Lasecki. 2018. Two tools are better than one: Tool diversity as a means of improving aggregate crowd performance. In Proceedings of the 23rd International Conference on Intelligent User Interfaces (IUI’18). ACM, New York, NY, 559--570.
  47. Saiganesh Swaminathan, Raymond Fok, Fanglin Chen, Ting-Hao Kenneth Huang, Irene Lin, Rohan Jadvani, Walter S. Lasecki, and Jeffrey P. Bigham. 2017. WearMail: On-the-go access to information in your email with a privacy-preserving human computation workflow. In Proceedings of the 30th ACM Symposium on User Interface Software and Technology. ACM, 807--815.
  48. Shane Torbert. 2016. Applied Computer Science. Springer. 158 pages.
  49. Peter Welinder, Steve Branson, Pietro Perona, and Serge J. Belongie. 2010. The multidimensional wisdom of crowds. In Proceedings of the Conference on Advances in Neural Information Processing Systems. Curran Associates, Inc., 2424--2432.
  50. Jacob Whitehill, Ting Fan Wu, Jacob Bergsma, Javier R. Movellan, and Paul L. Ruvolo. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Proceedings of the Conference on Advances in Neural Information Processing Systems. Curran Associates, Inc., 2035--2043.
  51. Joseph Jay Williams, Juho Kim, Anna Rafferty, Samuel Maldonado, Krzysztof Z. Gajos, Walter S. Lasecki, and Neil Heffernan. 2016. AXIS: Generating explanations at scale with learnersourcing and machine learning. In Proceedings of the 3rd ACM Conference on Learning @ Scale. ACM, 379--388.


        • Published in

          ACM Transactions on Interactive Intelligent Systems, Volume 10, Issue 1
          Special Issue on IUI 2018
          March 2020
          347 pages
          ISSN: 2160-6455
          EISSN: 2160-6463
          DOI: 10.1145/3352585

          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 9 August 2019
          • Revised: 1 July 2018
          • Accepted: 1 July 2018
          • Received: 1 May 2018
          Published in TiiS Volume 10, Issue 1


          Qualifiers

          • research-article
          • Research
          • Refereed
