ABSTRACT
Background: Machine Learning (ML) has been widely used as a powerful tool to support Software Engineering (SE). The fundamental assumptions that specific ML methods make about data characteristics must be carefully examined before those methods are applied in SE. In the context of Continuous Integration (CI) and Continuous Deployment (CD) practices, two vital characteristics of the data are prone to being violated in SE research. First, the logs generated during CI/CD and used for training are imbalanced, which conflicts with the design of common balanced classifiers; second, these logs are time-series data, which violates the independence assumption underlying cross-validation.

Objective: We aim to systematically study these two data characteristics and provide a comprehensive evaluation of predictive CI/CD models using data from real projects.

Method: We conduct an experimental study that evaluates 67 CI/CD predictive models using both cross-validation and time-series validation.

Results: Our evaluation shows that cross-validation yields optimistic estimates of model performance in most cases, although there are a few counter-examples. The top 10 imbalanced-learning models outperform the balanced models in predicting failed builds, even on balanced data. The degree of data imbalance has a negative impact on prediction performance.

Conclusion: In both research and practice, the assumptions of the ML methods employed should be examined carefully to preserve the validity of the research. Even when used only to compare the relative performance of models, cross-validation may not be applicable to problems with time-series characteristics. The research community needs to revisit the evaluation results reported in some existing research.
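To make the methodological point concrete, the following is a minimal sketch, not the paper's actual evaluation pipeline, contrasting shuffled k-fold cross-validation with time-series validation on an imbalanced classification task. It assumes scikit-learn; the synthetic dataset merely stands in for real CI/CD build logs and, being drawn i.i.d., will not reproduce the optimism gap reported above. It only illustrates the mechanics of the two validation schemes.

# A minimal sketch (assumed setup, not the paper's evaluation pipeline)
# contrasting shuffled k-fold cross-validation with time-series
# validation on an imbalanced classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Synthetic stand-in for CI/CD build logs: ~10% minority class
# (label 1 plays the role of "failed build").
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" is one common imbalanced-learning remedy;
# the 67 models evaluated in the paper span many more techniques
# (resampling, cost-sensitive learning, ensembles, ...).
clf = RandomForestClassifier(class_weight="balanced", random_state=0)

# Shuffled k-fold ignores temporal order: a training fold may contain
# builds that happened *after* the builds in the test fold, leaking
# future information into the model.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X, y, cv=kfold, scoring="f1")

# TimeSeriesSplit always trains on earlier samples and tests on later
# ones, matching how a build-outcome predictor is actually deployed.
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(clf, X, y, cv=tscv, scoring="f1")

print(f"shuffled k-fold   F1: {cv_scores.mean():.3f}")
print(f"time-series split F1: {ts_scores.mean():.3f}")

On real, temporally ordered build logs the shuffled splits tend to score higher than the time-series splits, which is exactly the optimism the paper's evaluation quantifies.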