DOI: 10.1145/3433210.3437533
research-article

BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network

Published: 04 June 2021

ABSTRACT

Binary code similarity detection, which answers whether two pieces of binary code are similar, has been used in a number of applications, such as vulnerability detection and automatic patching. Existing approaches face two hurdles in their efforts to achieve high accuracy and coverage: (1) the problem of source-binary code similarity detection, where the target code to be analyzed is in binary format while the comparing code (with ground truth) is in source code format. Meanwhile, the source code is compiled to the comparing binary code with either a random or fixed configuration (e.g., architecture, compiler family, compiler version, and optimization level), which significantly increases the difficulty of code similarity detection; and (2) the existence of different degrees of code similarity. Less similar code is known to be equally, if not more, important in various applications such as binary vulnerability study. To address these challenges, we design BugGraph, which performs source-binary code similarity detection in two steps. First, BugGraph identifies the compilation provenance of the target binary and compiles the comparing source code to a binary with the same provenance. Second, BugGraph utilizes a new graph triplet-loss network on the attributed control flow graph to produce a similarity ranking. The experiments on four real-world datasets show that BugGraph achieves 90% and 75% true positive rate for syntax-equivalent and similar code, respectively, an improvement of 16% and 24% over state-of-the-art methods. Moreover, BugGraph is able to identify 140 vulnerabilities in six commercial firmware.
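The abstract does not give the exact loss or ranking procedure, so the following is only a minimal sketch of the general idea, assuming the standard margin-based triplet loss and Euclidean distance between embeddings. The embedding vectors, the names `triplet_loss` and `rank_by_similarity`, and the margin value are all hypothetical; in the actual system the embeddings would come from a graph network over attributed control flow graphs rather than being written by hand.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (an assumption, not the paper's exact form).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def rank_by_similarity(query, candidates):
    """Return candidate names ordered from most to least similar to `query`."""
    return sorted(candidates, key=lambda name: euclidean(query, candidates[name]))


# Hypothetical 2-D embeddings standing in for learned function embeddings.
query = [0.0, 0.0]
candidates = {
    "different": [4.0, 3.0],
    "equivalent": [0.1, 0.0],
    "similar": [1.0, 0.5],
}

ranking = rank_by_similarity(query, candidates)
well_separated = triplet_loss(query, candidates["equivalent"], candidates["different"])
violated = triplet_loss(query, [3.0, 4.0], [0.0, 0.0])
```

Training with such a loss pushes less similar code farther from the anchor than more similar code, which is what lets distances be read as a ranking rather than a yes/no match.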

Published in

ASIA CCS '21: Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security
May 2021
975 pages
ISBN: 9781450382878
DOI: 10.1145/3433210
General Chairs: Jiannong Cao, Man Ho Au
Program Chairs: Zhiqiang Lin, Moti Yung

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 418 of 2,322 submissions, 18%