
Scaling up crowd-sourcing to very large datasets: a case for active learning

Published: 01 October 2014
Abstract

Crowd-sourcing has become a popular means of acquiring labeled data for many tasks where humans are more accurate than computers, such as image tagging, entity resolution, and sentiment analysis. However, due to the time and cost of human labor, solutions that rely solely on crowd-sourcing are often limited to small datasets (i.e., a few thousand items). This paper proposes algorithms for integrating machine learning into crowd-sourced databases in order to combine the accuracy of human labeling with the speed and cost-effectiveness of machine learning classifiers. By using active learning as our optimization strategy for labeling tasks in crowd-sourced databases, we can minimize the number of questions asked to the crowd, allowing crowd-sourced applications to scale (i.e., label much larger datasets at lower costs).
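The loop described above can be made concrete with a minimal sketch. The code below is illustrative only, not the paper's algorithm: it assumes scikit-learn, binary labels, and a hypothetical `crowd_label` function standing in for questions posted to a crowd-sourcing platform, and it uses simple uncertainty sampling as a stand-in for the paper's bootstrap-based selection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def crowd_label(items):
    """Hypothetical placeholder: in a real deployment this would post the
    items as questions to a crowd-sourcing platform and collect answers."""
    raise NotImplementedError

def active_label(X, seed_idx, budget, batch_size=10):
    """Crowd-label up to `budget` items, chosen by uncertainty sampling,
    and machine-label the rest (binary labels assumed)."""
    labels = dict(zip(seed_idx, crowd_label(X[seed_idx])))
    while len(labels) < budget:
        idx = np.fromiter(labels, dtype=int)
        clf = LogisticRegression().fit(X[idx], [labels[i] for i in idx])
        pool = np.setdiff1d(np.arange(len(X)), idx)
        # Ask the crowd about the items the classifier is least sure of,
        # i.e., predicted class probability closest to 0.5.
        margin = np.abs(clf.predict_proba(X[pool])[:, 1] - 0.5)
        ask = pool[np.argsort(margin)[:batch_size]]
        labels.update(zip(ask.tolist(), crowd_label(X[ask])))
    # Everything the budget did not cover is labeled by the final
    # classifier rather than the crowd -- this is where the cost savings
    # over pure crowd-sourcing come from.
    idx = np.fromiter(labels, dtype=int)
    clf = LogisticRegression().fit(X[idx], [labels[i] for i in idx])
    pool = np.setdiff1d(np.arange(len(X)), idx)
    return labels, dict(zip(pool.tolist(), clf.predict(X[pool])))
```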

Designing active learning algorithms for a crowd-sourced database poses many practical challenges: such algorithms need to be generic, scalable, and easy to use, even for practitioners who are not machine learning experts. We draw on the theory of nonparametric bootstrap to design, to the best of our knowledge, the first active learning algorithms that meet all these requirements.
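One reason the nonparametric bootstrap suits these requirements is that it yields an uncertainty score for any classifier, treated as a black box. The sketch below illustrates that general idea under stated assumptions (scikit-learn, binary labels); it is not the authors' exact estimator.

```python
import numpy as np
from sklearn.base import clone

def bootstrap_uncertainty(clf, X_lab, y_lab, X_pool, k=20, seed=0):
    """Score each pool item by the variance of predicted probabilities
    across k classifiers trained on bootstrap resamples of the labeled
    data. Works with any scikit-learn classifier exposing predict_proba;
    assumes binary labels and that each resample contains both classes."""
    rng = np.random.default_rng(seed)
    n = len(X_lab)
    preds = []
    for _ in range(k):
        # Nonparametric bootstrap: resample the labeled set with
        # replacement and retrain a fresh, unfitted copy of the classifier.
        idx = rng.integers(0, n, size=n)
        model = clone(clf).fit(X_lab[idx], y_lab[idx])
        preds.append(model.predict_proba(X_pool)[:, 1])
    # High variance across bootstrap replicas means the prediction is
    # unstable, so these items benefit most from a crowd question.
    return np.var(np.stack(preds), axis=0)
```

Because the resampling treats the classifier as a black box, the same scoring function applies to any learning model, which is what makes the approach generic and usable by practitioners who are not machine learning experts.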

Our results, on 3 real-world datasets collected with Amazon's Mechanical Turk, and on 15 UCI datasets, show that our methods on average ask 1--2 orders of magnitude fewer questions than the baseline, and 4.5--44× fewer than existing active learning algorithms.


Published in: Proceedings of the VLDB Endowment, Volume 8, Issue 2, October 2014, 84 pages.
Publisher: VLDB Endowment
