Research Article
DOI: 10.1145/2950290.2950324

Can testedness be effectively measured?

Published: 01 November 2016

ABSTRACT

Among the major questions a practicing tester faces are where to focus additional testing effort and when to stop testing. A reasonable answer is to test the least-tested code and to stop when all code is well tested. Many measures of "testedness" have been proposed; unfortunately, we do not know whether they are truly effective. In this paper we propose a novel evaluation of two of the most important and widely used measures of test suite quality. The first is statement coverage, the simplest and best-known code coverage measure. The second is mutation score, a supposedly more powerful, though expensive, measure.
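As a rough illustration of the two measures, the sketch below computes both as per-element ratios from hypothetical numbers. This is a minimal sketch, not the paper's tooling; the study itself measures Java code with coverage and mutation tools (the latter in the family of PIT), and all function names and values here are invented for exposition.

```python
# Minimal sketch (hypothetical numbers, not the paper's tooling): both measures
# are ratios computed per program element from test-execution data.

def statement_coverage(executed_stmts, all_stmts):
    """Fraction of an element's statements executed by at least one test."""
    if not all_stmts:
        return 0.0
    return len(set(executed_stmts) & set(all_stmts)) / len(all_stmts)

def mutation_score(killed_mutants, total_mutants):
    """Fraction of generated mutants that at least one test detects ("kills")."""
    if total_mutants == 0:
        return 0.0
    return killed_mutants / total_mutants

# A hypothetical method with 5 statements, 3 of which some test executes,
# and 10 mutants, 7 of which the suite kills:
print(statement_coverage({1, 2, 5}, {1, 2, 3, 4, 5}))      # 0.6
print(mutation_score(killed_mutants=7, total_mutants=10))  # 0.7
```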

We evaluate these measures using the actual criterion of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug fixes than a "poorly tested" element. If not, then it seems likely that we are not effectively measuring testedness. Using a large number of open-source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements for various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
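The following sketch shows the kind of analysis this evaluation implies: given a testedness measure and a count of later bug fixes for each program element, compute a rank correlation and compare elements that satisfy a binary criterion (covered by any test) against those that do not. This is toy data with assumed SciPy calls, not the paper's actual pipeline or statistics.

```python
# Illustrative sketch only: toy data, and SciPy stands in for whatever
# statistical tests the authors actually used.
from scipy.stats import kendalltau, mannwhitneyu

# One record per program element: a testedness measure in [0, 1] and the
# number of later bug-fixing changes that touched the element (hypothetical).
elements = [
    {"measure": 0.0, "bug_fixes": 3},
    {"measure": 0.0, "bug_fixes": 2},
    {"measure": 0.2, "bug_fixes": 2},
    {"measure": 0.6, "bug_fixes": 1},
    {"measure": 0.9, "bug_fixes": 0},
    {"measure": 1.0, "bug_fixes": 1},
]

measures = [e["measure"] for e in elements]
fixes = [e["bug_fixes"] for e in elements]

# Rank correlation between testedness and future bug fixes; a weak negative
# value would mirror the paper's headline finding.
tau, p_tau = kendalltau(measures, fixes)
print(f"Kendall tau = {tau:.2f} (p = {p_tau:.2f})")

# Binary criterion: elements covered by any test (measure > 0) vs. not covered.
covered = [e["bug_fixes"] for e in elements if e["measure"] > 0]
uncovered = [e["bug_fixes"] for e in elements if e["measure"] == 0]
stat, p_u = mannwhitneyu(covered, uncovered, alternative="two-sided")
print(f"mean bug fixes: covered = {sum(covered) / len(covered):.2f}, "
      f"uncovered = {sum(uncovered) / len(uncovered):.2f} (U test p = {p_u:.2f})")
```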


Published in
FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering
November 2016, 1156 pages
ISBN: 9781450342186
DOI: 10.1145/2950290

Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates
Overall Acceptance Rate: 17 of 128 submissions, 13%
