ABSTRACT
Background: Machine Learning (ML) has been widely used as a powerful tool to support Software Engineering (SE). The fundamental assumptions that specific ML methods make about data characteristics must be carefully examined before those methods are applied in SE. In the context of Continuous Integration (CI) and Continuous Deployment (CD) practices, two vital characteristics of the data are prone to being violated in SE research. First, the logs generated during CI/CD and used for training are imbalanced, which conflicts with the design of common balanced classifiers; second, these logs are time-series data, which violates the independence assumption underlying cross-validation.

Objective: We aim to systematically study these two data characteristics and provide a comprehensive evaluation of predictive CI/CD models using data from real projects.

Method: We conduct an experimental study that evaluates 67 CI/CD predictive models using both cross-validation and time-series validation.

Results: Our evaluation shows that cross-validation yields optimistic estimates of model performance in most cases, although there are a few counter-examples. The top 10 imbalanced-learning models outperform the balanced models in predicting failed builds, even on balanced data. The degree of data imbalance has a negative impact on prediction performance.

Conclusion: In both research and practice, the assumptions of the ML methods employed should be examined carefully to preserve the validity of the research. Even when used only to compare the relative performance of models, cross-validation may not be applicable to problems with time-series characteristics. The research community needs to revisit the evaluation results reported in some existing research.
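To make the methodological point concrete, the following is a minimal sketch, not the paper's actual evaluation pipeline, contrasting shuffled k-fold cross-validation with time-series validation on an imbalanced classification task. It assumes scikit-learn; the synthetic dataset merely stands in for real CI/CD build logs and, being drawn i.i.d., will not reproduce the optimism gap reported above. It only illustrates the mechanics of the two validation schemes.

# A minimal sketch (assumed setup, not the paper's evaluation pipeline)
# contrasting shuffled k-fold cross-validation with time-series
# validation on an imbalanced classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Synthetic stand-in for CI/CD build logs: ~10% minority class
# (label 1 plays the role of "failed build").
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" is one common imbalanced-learning remedy;
# the 67 models evaluated in the paper span many more techniques
# (resampling, cost-sensitive learning, ensembles, ...).
clf = RandomForestClassifier(class_weight="balanced", random_state=0)

# Shuffled k-fold ignores temporal order: a training fold may contain
# builds that happened *after* the builds in the test fold, leaking
# future information into the model.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring="f1")

# TimeSeriesSplit always trains on earlier samples and tests on later
# ones, matching how a build-outcome predictor is actually deployed.
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(clf, X, y, cv=tscv, scoring="f1")

print(f"shuffled k-fold   F1: {cv_scores.mean():.3f}")
print(f"time-series split F1: {ts_scores.mean():.3f}")

On real, temporally ordered build logs the shuffled splits tend to score higher than the time-series splits, which is exactly the optimism the paper's evaluation quantifies.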