ABSTRACT
Among the major questions a practicing tester faces are where to focus additional testing effort and when to stop testing. A reasonable answer is to test the least-tested code and to stop when all code is well tested. Many measures of "testedness" have been proposed; unfortunately, we do not know whether they are truly effective. In this paper we propose a novel evaluation of two of the most important and widely used measures of test suite quality. The first is statement coverage, the simplest and best-known code coverage measure. The second is mutation score, a supposedly more powerful, though more expensive, measure.
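As a concrete illustration (not drawn from the paper itself), both measures reduce to simple ratios over a program element's statements and mutants. The sketch below uses hypothetical counts, chosen only to show the arithmetic.

```python
# Illustrative sketch of the two testedness measures discussed above.
# All counts are hypothetical examples, not data from the study.

def statement_coverage(covered_statements: int, total_statements: int) -> float:
    """Fraction of executable statements exercised by at least one test."""
    return covered_statements / total_statements

def mutation_score(killed_mutants: int, total_mutants: int) -> float:
    """Fraction of generated mutants detected (killed) by the test suite."""
    return killed_mutants / total_mutants

print(statement_coverage(850, 1000))  # 0.85 -> 85% statement coverage
print(mutation_score(600, 1000))      # 0.60 -> 60% mutation score
```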
We evaluate these measures using the actual criterion of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a "poorly tested" element. If not, it seems likely that we are not effectively measuring testedness. Using a large number of open-source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements under various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
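A minimal sketch of the kind of analysis the abstract describes, on hypothetical per-element data: a rank correlation between a testedness measure and subsequent bug-fix counts, plus the binary covered-versus-uncovered comparison. The element values below are invented for illustration; the paper's actual data and statistical procedures may differ.

```python
# Minimal sketch of the evaluation described above, on hypothetical data.
# Each tuple: (statement coverage of a program element, later bug-fixes).
from scipy.stats import kendalltau

elements = [(0.95, 0), (0.80, 1), (0.60, 1), (0.40, 2), (0.00, 3), (0.00, 4)]
coverage = [c for c, _ in elements]
fixes = [f for _, f in elements]

# A weak negative correlation shows up as tau somewhat below zero;
# on this toy data the trend is artificially strong.
tau, p = kendalltau(coverage, fixes)
print(f"Kendall tau = {tau:.2f}, p = {p:.3f}")

# Binary criterion: mean bug-fixes for elements covered by any test vs. not.
covered = [f for c, f in elements if c > 0.0]
uncovered = [f for c, f in elements if c == 0.0]
print(sum(covered) / len(covered), sum(uncovered) / len(uncovered))
```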