
Scaling up crowd-sourcing to very large datasets: a case for active learning

Published: 01 October 2014
Abstract

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs).
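The loop described above can be made concrete with a minimal sketch. The code below is illustrative only, not the paper's algorithm: it assumes scikit-learn, binary labels, and a hypothetical `crowd_label` function standing in for questions posted to a crowd-sourcing platform, and it uses simple uncertainty sampling as a stand-in for the paper's bootstrap-based selection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def crowd_label(items):
    """Hypothetical placeholder: in a real deployment this would post the
    items as questions to a crowd-sourcing platform and collect answers."""
    raise NotImplementedError

def active_label(X, seed_idx, budget, batch_size=10):
    """Crowd-label up to `budget` items, chosen by uncertainty sampling,
    and machine-label the rest (binary labels assumed)."""
    labels = dict(zip(seed_idx, crowd_label(X[seed_idx])))
    while len(labels) < budget:
        idx = np.fromiter(labels, dtype=int)
        clf = LogisticRegression().fit(X[idx], [labels[i] for i in idx])
        pool = np.setdiff1d(np.arange(len(X)), idx)
        # Ask the crowd about the items the classifier is least sure of,
        # i.e., predicted class probability closest to 0.5.
        margin = np.abs(clf.predict_proba(X[pool])[:, 1] - 0.5)
        ask = pool[np.argsort(margin)[:batch_size]]
        labels.update(zip(ask.tolist(), crowd_label(X[ask])))
    # Everything the budget did not cover is labeled by the final
    # classifier rather than the crowd -- this is where the cost savings
    # over pure crowd-sourcing come from.
    idx = np.fromiter(labels, dtype=int)
    clf = LogisticRegression().fit(X[idx], [labels[i] for i in idx])
    pool = np.setdiff1d(np.arange(len(X)), idx)
    return labels, dict(zip(pool.tolist(), clf.predict(X[pool])))
```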

Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements.
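One reason the nonparametric bootstrap suits these requirements is that it yields an uncertainty score for any classifier, treated as a black box. The sketch below illustrates that general idea under stated assumptions (scikit-learn, binary labels); it is not the authors' exact estimator.

```python
import numpy as np
from sklearn.base import clone

def bootstrap_uncertainty(clf, X_lab, y_lab, X_pool, k=20, seed=0):
    """Score each pool item by the variance of predicted probabilities
    across k classifiers trained on bootstrap resamples of the labeled
    data. Works with any scikit-learn classifier exposing predict_proba;
    assumes binary labels and that each resample contains both classes."""
    rng = np.random.default_rng(seed)
    n = len(X_lab)
    preds = []
    for _ in range(k):
        # Nonparametric bootstrap: resample the labeled set with
        # replacement and retrain a fresh, unfitted copy of the classifier.
        idx = rng.integers(0, n, size=n)
        model = clone(clf).fit(X_lab[idx], y_lab[idx])
        preds.append(model.predict_proba(X_pool)[:, 1])
    # High variance across bootstrap replicas means the prediction is
    # unstable, so these items benefit most from a crowd question.
    return np.var(np.stack(preds), axis=0)
```

Because the resampling treats the classifier as a black box, the same scoring function applies to any learning model, which is what makes the approach generic and usable by practitioners who are not machine learning experts.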

Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1--2 orders of magnitude fewer questions than the baseline, and 4.5--44× fewer than existing active learning algorithms.


Published in: Proceedings of the VLDB Endowment, Volume 8, Issue 2, October 2014, 84 pages.
Publisher: VLDB Endowment
