ABSTRACT
Semi-Supervised Learning (SSL) is a data mining approach that sits between supervised and unsupervised learning. It is useful when only a small number of instances in a dataset are labelled but a large amount of unlabelled data is also available. This is the case with user reviews in application stores such as the Apple App Store or Google Play, where a vast number of reviews are available but classifying them into categories such as bug-related reviews or feature requests is expensive, or at least labor intensive. SSL techniques are well suited to this problem, as manually classifying every review not only takes time and effort but may also be unnecessary. In this work, we analyse SSL techniques on a dataset of reviews collected from the App Store to show their viability and capabilities, evaluating both transductive performance (predicting the labels of the unlabelled instances used during training) and inductive performance (predicting labels on unseen future data).
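To make the transductive and inductive settings concrete, the following is a minimal self-training sketch in Python. It is an illustration under stated assumptions, not the method evaluated in this work: the nearest-centroid base classifier, the confidence margin, the `threshold` value, and the example reviews are all hypothetical choices made for the example.

```python
from collections import Counter
import math


def tokenize(text):
    return text.lower().split()


def train_centroids(labelled):
    """Build one bag-of-words centroid per class from (text, label) pairs."""
    centroids = {}
    for text, label in labelled:
        centroids.setdefault(label, Counter()).update(tokenize(text))
    return centroids


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def predict(centroids, text):
    """Return (label, margin); margin is the gap between the two best similarities."""
    vec = Counter(tokenize(text))
    scores = sorted(((cosine(vec, c), label) for label, c in centroids.items()),
                    reverse=True)
    best_score, best_label = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0.0
    return best_label, best_score - runner_up


def self_train(labelled, unlabelled, threshold=0.2, max_rounds=5):
    """Repeatedly move confidently predicted unlabelled reviews into the labelled set."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        centroids = train_centroids(labelled)
        confident, rest = [], []
        for text in pool:
            label, margin = predict(centroids, text)
            (confident if margin >= threshold else rest).append((text, label))
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        pool = [text for text, _ in rest]
    return train_centroids(labelled)


# Hypothetical toy data: two labelled reviews and three unlabelled ones.
labelled = [("app crashes on startup", "bug"),
            ("please add dark mode", "feature")]
unlabelled = ["crashes every time i open the app",
              "add an option for dark theme please",
              "it crashes and freezes on login"]

model = self_train(labelled, unlabelled)

# Transductive: label the unlabelled reviews that took part in training.
transductive = [predict(model, text)[0] for text in unlabelled]

# Inductive: label a review that was never seen during training.
inductive = predict(model, "freezes at login")[0]

print(transductive)
print(inductive)
```

In this toy run, the unseen review shares no words with the two original labelled reviews, so any correct prediction on it relies on vocabulary picked up from the unlabelled pool. This is the kind of benefit SSL aims for when labelled data is scarce.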
Index Terms
- Preliminary Study on Applying Semi-Supervised Learning to App Store Analysis