ABSTRACT
Among the major questions a practicing tester faces are where to focus additional testing effort and when to stop testing. A reasonable answer is to test the least-tested code and to stop when all code is well tested. Many measures of "testedness" have been proposed; unfortunately, we do not know whether they are truly effective. In this paper we propose a novel evaluation of two of the most important and widely used measures of test suite quality. The first is statement coverage, the simplest and best-known code coverage measure. The second is mutation score, a supposedly more powerful, though more expensive, measure.
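As a concrete illustration (not drawn from the paper itself), both measures reduce to simple ratios over a program element's statements and mutants. The sketch below uses hypothetical counts, chosen only to show the arithmetic.

```python
# Illustrative sketch of the two testedness measures discussed above.
# All counts are hypothetical examples, not data from the study.

def statement_coverage(covered_statements: int, total_statements: int) -> float:
    """Fraction of executable statements exercised by at least one test."""
    return covered_statements / total_statements

def mutation_score(killed_mutants: int, total_mutants: int) -> float:
    """Fraction of generated mutants detected (killed) by the test suite."""
    return killed_mutants / total_mutants

print(statement_coverage(850, 1000))  # 0.85 -> 85% statement coverage
print(mutation_score(600, 1000))      # 0.60 -> 60% mutation score
```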
We evaluate these measures using the actual criterion of interest: if a program element is (by these measures) well tested at a given point in time, it should require fewer future bug-fixes than a "poorly tested" element. If not, it seems likely that we are not effectively measuring testedness. Using a large number of open-source Java programs from GitHub and Apache, we show that both statement coverage and mutation score have only a weak negative correlation with bug-fixes. Despite the lack of strong correlation, there are statistically and practically significant differences between program elements under various binary criteria. Program elements (other than classes) covered by any test case see about half as many bug-fixes as those not covered, and a similar line can be drawn for mutation score thresholds. Our results have important implications for both software engineering practice and research evaluation.
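A minimal sketch of the kind of analysis the abstract describes, on hypothetical per-element data: a rank correlation between a testedness measure and subsequent bug-fix counts, plus the binary covered-versus-uncovered comparison. The element values below are invented for illustration; the paper's actual data and statistical procedures may differ.

```python
# Minimal sketch of the evaluation described above, on hypothetical data.
# Each tuple: (statement coverage of a program element, later bug-fixes).
from scipy.stats import kendalltau

elements = [(0.95, 0), (0.80, 1), (0.60, 1), (0.40, 2), (0.00, 3), (0.00, 4)]
coverage = [c for c, _ in elements]
fixes = [f for _, f in elements]

# A weak negative correlation shows up as tau somewhat below zero;
# on this toy data the trend is artificially strong.
tau, p = kendalltau(coverage, fixes)
print(f"Kendall tau = {tau:.2f}, p = {p:.3f}")

# Binary criterion: mean bug-fixes for elements covered by any test vs. not.
covered = [f for c, f in elements if c > 0.0]
uncovered = [f for c, f in elements if c == 0.0]
print(sum(covered) / len(covered), sum(uncovered) / len(uncovered))
```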