Scaling up crowd-sourcing to very large datasets: a case for active learning

Abstract
Crowd-sourcing has become a popular means of acquiring labeled data for tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases, combining the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we minimize the number of questions asked of the crowd, allowing crowd-sourced applications to scale (i.e., to label much larger datasets at lower cost).
Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of the nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements.
Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk and on 15 UCI datasets, show that our methods on average ask 1--2 orders of magnitude fewer questions than the baseline, and 4.5--44× fewer than existing active learning algorithms.
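To make the idea concrete, the following is a minimal, hypothetical sketch of bootstrap-based uncertainty sampling, one common way to turn the nonparametric bootstrap into an active learning criterion (it is not the paper's exact algorithm): train one classifier per bootstrap resample of the labeled pool, score each unlabeled item by how much the resampled classifiers disagree on it, and send the highest-scoring items to the crowd. All names here (`bootstrap_uncertainty`, the toy `train_threshold` learner) are illustrative, not from the paper.

```python
import random
from collections import Counter

def bootstrap_uncertainty(train, labeled, unlabeled, n_boot=50, seed=0):
    """Score each unlabeled item by classifier disagreement across
    bootstrap resamples of the labeled pool (higher = more uncertain)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_boot):
        # Resample the labeled pool with replacement (nonparametric bootstrap).
        resample = [rng.choice(labeled) for _ in labeled]
        models.append(train(resample))
    scores = []
    for x in unlabeled:
        votes = Counter(model(x) for model in models)
        top_count = votes.most_common(1)[0][1]
        # 0.0 = all resampled models agree; 0.5 = maximal disagreement (binary).
        scores.append(1.0 - top_count / n_boot)
    return scores

def train_threshold(data):
    """Toy 1-D learner: predict 1 for x at or above the midpoint between
    the average positive and average negative example."""
    pos = [x for x, y in data if y == 1] or [1.0]
    neg = [x for x, y in data if y == 0] or [0.0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= mid else 0

labeled = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
unlabeled = [0.05, 0.5, 0.95]
scores = bootstrap_uncertainty(train_threshold, labeled, unlabeled)
# Items near the decision boundary typically score highest and would be
# routed to the crowd; items far from it get a (cheap) machine label.
to_crowd = unlabeled[scores.index(max(scores))]
```

The design choice to query the crowd only where the bootstrap replicas disagree is what lets labeling budgets shrink: confident regions of the input space are labeled by the classifier alone.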