On the reliability of coverage-based fuzzer benchmarking

Research article · Open Access
DOI: 10.1145/3510003.3510230
Published: 05 July 2022

ABSTRACT

Given a program where none of our fuzzers finds any bugs, how do we know which fuzzer is better? In practice, we often look to code coverage as a proxy measure of fuzzer effectiveness and consider the fuzzer which achieves more coverage as the better one.

Indeed, evaluating 10 fuzzers for 23 hours on 24 programs, we find that a fuzzer that covers more code also finds more bugs. There is a very strong correlation between the coverage achieved and the number of bugs found by a fuzzer. Hence, it might seem reasonable to compare fuzzers in terms of coverage achieved, and from that derive empirical claims about a fuzzer's superiority at finding bugs.
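The "very strong correlation" claim refers to rank correlation between per-fuzzer coverage totals and per-fuzzer bug counts. As a minimal sketch of how such a correlation would be computed, the following uses Spearman's rank correlation on made-up numbers (the fuzzer data below is illustrative only, not from the paper):

```python
# Hedged sketch: Spearman rank correlation between coverage achieved and
# bugs found per fuzzer. All numbers below are hypothetical.

def rank(values):
    """Assign ranks 1..n, averaging ranks for ties."""
    sorted_vals = sorted(values)
    ranks = []
    for v in values:
        first = sorted_vals.index(v) + 1       # rank of first occurrence
        count = sorted_vals.count(v)           # number of tied values
        ranks.append(first + (count - 1) / 2)  # average rank over the ties
    return ranks

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-fuzzer totals over a benchmark campaign:
coverage = [12000, 15000, 9000, 14000, 11000]  # branches covered
bugs     = [5,     9,     2,    8,     4]      # distinct bugs found
rho = spearman(coverage, bugs)  # 1.0 here: identical rank orderings
```

In this toy data the two metrics rank the fuzzers identically, so rho is 1.0; real campaigns yield weaker but still strong correlations, which is the paper's first observation.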

Curiously enough, however, we find no strong agreement on which fuzzer is superior if we compare multiple fuzzers in terms of coverage achieved instead of the number of bugs found. The fuzzer that is best at achieving coverage may not be best at finding bugs.
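The disagreement can be made concrete: when each fuzzer is scored once by coverage and once by bugs found, the two metrics can crown different winners. The data below is invented purely to illustrate this:

```python
# Hedged sketch: the fuzzer that wins on coverage need not win on bugs.
# Fuzzer names and all numbers are hypothetical.

results = {
    # fuzzer: (branches covered, bugs found), aggregated over a campaign
    "fuzzer_a": (15200, 3),
    "fuzzer_b": (14900, 7),
    "fuzzer_c": (13100, 5),
}

best_by_coverage = max(results, key=lambda f: results[f][0])  # "fuzzer_a"
best_by_bugs     = max(results, key=lambda f: results[f][1])  # "fuzzer_b"
```

Here a coverage-based benchmark would declare `fuzzer_a` superior, while a bug-based one would pick `fuzzer_b`; this is the disagreement the paper warns about when coverage is used as a proxy for bug-finding ability.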


Published in

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022, 2508 pages
ISBN: 9781450392211
DOI: 10.1145/3510003
Publisher: Association for Computing Machinery, New York, NY, United States

Copyright © 2022 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Overall acceptance rate: 276 of 1,856 submissions (15%)
