On the reliability of coverage-based fuzzer benchmarking

Research article · Open Access
DOI: 10.1145/3510003.3510230
Published: 05 July 2022

ABSTRACT

Given a program where none of our fuzzers finds any bugs, how do we know which fuzzer is better? In practice, we often look to code coverage as a proxy measure of fuzzer effectiveness and consider the fuzzer which achieves more coverage as the better one.

Indeed, evaluating 10 fuzzers for 23 hours on 24 programs, we find that a fuzzer that covers more code also finds more bugs. There is a very strong correlation between the coverage achieved and the number of bugs found by a fuzzer. Hence, it might seem reasonable to compare fuzzers in terms of coverage achieved, and from that derive empirical claims about a fuzzer's superiority at finding bugs.
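The "very strong correlation" claim refers to rank correlation between per-fuzzer coverage totals and per-fuzzer bug counts. As a minimal sketch of how such a correlation would be computed, the following uses Spearman's rank correlation on made-up numbers (the fuzzer data below is illustrative only, not from the paper):

```python
# Hedged sketch: Spearman rank correlation between coverage achieved and
# bugs found per fuzzer. All numbers below are hypothetical.

def rank(values):
    """Assign ranks 1..n, averaging ranks for ties."""
    sorted_vals = sorted(values)
    ranks = []
    for v in values:
        first = sorted_vals.index(v) + 1       # rank of first occurrence
        count = sorted_vals.count(v)           # number of tied values
        ranks.append(first + (count - 1) / 2)  # average rank over the ties
    return ranks

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-fuzzer totals over a benchmark campaign:
coverage = [12000, 15000, 9000, 14000, 11000]  # branches covered
bugs     = [5,     9,     2,    8,     4]      # distinct bugs found
rho = spearman(coverage, bugs)  # 1.0 here: identical rank orderings
```

In this toy data the two metrics rank the fuzzers identically, so rho is 1.0; real campaigns yield weaker but still strong correlations, which is the paper's first observation.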

Curiously enough, however, we find no strong agreement on which fuzzer is superior if we compare multiple fuzzers in terms of coverage achieved instead of the number of bugs found. The fuzzer that is best at achieving coverage may not be best at finding bugs.
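The disagreement can be made concrete: when each fuzzer is scored once by coverage and once by bugs found, the two metrics can crown different winners. The data below is invented purely to illustrate this:

```python
# Hedged sketch: the fuzzer that wins on coverage need not win on bugs.
# Fuzzer names and all numbers are hypothetical.

results = {
    # fuzzer: (branches covered, bugs found), aggregated over a campaign
    "fuzzer_a": (15200, 3),
    "fuzzer_b": (14900, 7),
    "fuzzer_c": (13100, 5),
}

best_by_coverage = max(results, key=lambda f: results[f][0])  # "fuzzer_a"
best_by_bugs     = max(results, key=lambda f: results[f][1])  # "fuzzer_b"
```

Here a coverage-based benchmark would declare `fuzzer_a` superior, while a bug-based one would pick `fuzzer_b`; this is the disagreement the paper warns about when coverage is used as a proxy for bug-finding ability.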


Published in

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022, 2508 pages
ISBN: 9781450392211
DOI: 10.1145/3510003
Publisher: Association for Computing Machinery, New York, NY, United States

Copyright © 2022 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Overall acceptance rate: 276 of 1,856 submissions (15%)
