
An Experimental Evaluation of Imbalanced Learning and Time-Series Validation in the Context of CI/CD Prediction

Research article. Published: 17 April 2020. DOI: 10.1145/3383219.3383222

ABSTRACT

Background: Machine Learning (ML) has been widely used as a powerful tool to support Software Engineering (SE). The fundamental assumptions that specific ML methods make about data characteristics must be carefully considered before applying them in SE. Within the context of Continuous Integration (CI) and Continuous Deployment (CD) practices, two vital characteristics of the data are prone to be violated in SE research. First, the logs generated during CI/CD for training are imbalanced data, which runs contrary to the assumptions of common balanced classifiers; second, these logs are also time-series data, which violates the assumptions of cross-validation. Objective: We aim to systematically study these two data characteristics and provide a comprehensive evaluation of predictive CI/CD using data from real projects. Method: We conduct an experimental study that evaluates 67 CI/CD predictive models using both cross-validation and time-series validation. Results: Our evaluation shows that cross-validation makes the evaluation of the models optimistic in most cases, although there are a few counter-examples. The top 10 imbalanced models outperform the balanced models in predicting failed builds, even on balanced data. The degree of data imbalance has a negative impact on prediction performance. Conclusion: In research and practice, the assumptions of the various ML methods should be seriously considered to ensure the validity of research. Even when used only to compare the relative performance of models, cross-validation may not be applicable to problems with time-series features. The research community needs to revisit the evaluation results reported in some existing research.
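
The abstract contrasts standard cross-validation with time-series validation, and balanced with imbalance-aware learners. The sketch below is not from the paper; it only illustrates those two distinctions with scikit-learn on synthetic data. The synthetic "build" features, the class-weighted random forest, and the F1 metric are illustrative assumptions rather than the authors' experimental setup.

# Illustrative sketch (not from the paper): contrasts shuffled k-fold
# cross-validation with forward-chaining time-series validation, using a
# class-weighted ("imbalance-aware") learner on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Synthetic stand-in for CI/CD build logs: roughly 10% "failed build" minority class.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Class weighting makes the classifier imbalance-aware instead of treating
# both build outcomes as equally frequent.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Shuffled k-fold ignores temporal order, so "future" builds can end up in the
# training folds used to predict "past" ones.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Forward-chaining splits: each test fold strictly follows its training data,
# matching how build logs arrive in practice.
tscv = TimeSeriesSplit(n_splits=5)

for name, cv in [("cross-validation", kfold), ("time-series validation", tscv)]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")

On real build logs, which are ordered in time, the two schemes can produce noticeably different estimates; on the i.i.d. synthetic data above they mainly illustrate how the folds are constructed. Libraries such as imbalanced-learn add resampling alternatives (e.g. SMOTE) to the class-weighting approach shown here.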


Published in

EASE '20: Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering, April 2020, 544 pages. ISBN: 9781450377317. DOI: 10.1145/3383219.
General Chairs: Jingyue Li, Letizia Jaccheri. Program Chairs: Torgeir Dingsøyr, Ruzanna Chitchyan.
Copyright © 2020 ACM.
Publisher: Association for Computing Machinery, New York, NY, United States.

Acceptance rate: 71 of 232 submissions (31%).
