On the reliability of coverage-based fuzzer benchmarking

ABSTRACT
Given a program where none of our fuzzers finds any bugs, how do we know which fuzzer is better? In practice, we often use code coverage as a proxy measure of fuzzer effectiveness and consider the fuzzer that achieves more coverage to be the better one.
Indeed, evaluating 10 fuzzers for 23 hours on 24 programs, we find that a fuzzer that covers more code also finds more bugs. There is a very strong correlation between the coverage achieved and the number of bugs found by a fuzzer. Hence, it might seem reasonable to compare fuzzers in terms of coverage achieved, and from that derive empirical claims about a fuzzer's superiority at finding bugs.
Curiously enough, however, we find no strong agreement on which fuzzer is superior when we compare multiple fuzzers in terms of coverage achieved rather than the number of bugs found. The fuzzer that is best at achieving coverage may not be the best at finding bugs.
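The statistical point above can be made concrete with a small sketch. All fuzzer names and numbers below are invented for illustration; the sketch computes Spearman's rank correlation between coverage and bugs found across fuzzers, then checks whether the top-ranked fuzzer agrees between the two metrics.

```python
# Hypothetical sketch: coverage and bug counts can be strongly rank-correlated
# while still disagreeing on which single fuzzer is "best".

def ranks(values):
    """Return the rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Branch coverage and distinct bugs per fuzzer (hypothetical numbers).
coverage = {"fuzzA": 11200, "fuzzB": 12900, "fuzzC": 12500, "fuzzD": 9800}
bugs     = {"fuzzA": 3,     "fuzzB": 7,     "fuzzC": 9,     "fuzzD": 2}

names = sorted(coverage)
rho = spearman([coverage[n] for n in names], [bugs[n] for n in names])
print(f"Spearman rho = {rho:.2f}")   # prints 0.80: a strong correlation

# ...and yet the best fuzzer differs per metric:
print("best by coverage:", max(names, key=coverage.get))  # fuzzB
print("best by bugs:    ", max(names, key=bugs.get))      # fuzzC
```

With these invented numbers, rho = 0.80, i.e. a strong positive correlation, yet `fuzzB` wins on coverage while `fuzzC` wins on bugs: exactly the kind of aggregate-versus-ranking disagreement the abstract describes.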