BugGraph: Differentiating Source-Binary Code Similarity with Graph Triplet-Loss Network

Authors:
Yuede Ji, George Washington University, Washington, DC, USA
Lei Cui, George Washington University, Washington, DC, USA
H. Howie Huang, George Washington University, Washington, DC, USA

ASIA CCS '21: Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, May 2021, Pages 702–715. https://doi.org/10.1145/3433210.3437533

Published: 04 June 2021

ABSTRACT

Binary code similarity detection, which answers whether two pieces of binary code are similar, has been used in a number of applications, such as vulnerability detection and automatic patching. Existing approaches face two hurdles in their efforts to achieve high accuracy and coverage. (1) The problem of source-binary code similarity detection, where the target code to be analyzed is in binary format while the comparing code (with ground truth) is in source code format. The source code is compiled to the comparing binary code with either a random or a fixed configuration (e.g., architecture, compiler family, compiler version, and optimization level), which significantly increases the difficulty of code similarity detection. (2) The existence of different degrees of code similarity. Less similar code is known to be equally, if not more, important in various applications such as binary vulnerability study. To address these challenges, we design BugGraph, which performs source-binary code similarity detection in two steps. First, BugGraph identifies the compilation provenance of the target binary and compiles the comparing source code to a binary with the same provenance. Second, BugGraph utilizes a new graph triplet-loss network on the attributed control flow graph to produce a similarity ranking. The experiments on four real-world datasets show that BugGraph achieves 90% and 75% true positive rate for syntax equivalent and similar code, respectively, an improvement of 16% and 24% over state-of-the-art methods. Moreover, BugGraph is able to identify 140 vulnerabilities in six commercial firmware.
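
To make the second step more concrete, the following is a minimal sketch of the generic triplet margin loss and of ranking candidates by embedding distance. It is not BugGraph's implementation: the abstract does not give the network or loss formulation, so the Euclidean distance, the margin value, the 2-D embeddings, and every function and variable name here are illustrative assumptions. In the paper's setting the embeddings would come from a graph network over attributed control flow graphs; here they are hand-written vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed hyperparameter).
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def rank_by_similarity(query, candidates):
    """Return candidate names ordered from most to least similar to `query`."""
    return sorted(candidates, key=lambda name: euclidean(query, candidates[name]))


# Hypothetical embeddings standing in for graph-network outputs.
query = [0.0, 0.0]
candidates = {
    "same_function": [0.1, 0.0],
    "similar_function": [1.0, 0.0],
    "unrelated_function": [3.0, 4.0],
}

ranking = rank_by_similarity(query, candidates)
loss_satisfied = triplet_loss(
    query, candidates["same_function"], candidates["unrelated_function"]
)
loss_violated = triplet_loss(
    query, candidates["unrelated_function"], candidates["same_function"]
)
```

A well-ordered triplet contributes no loss, while a triplet whose negative sits closer than its positive is penalized. Training on such triplets is what lets a single distance metric express degrees of similarity as a ranking rather than a yes/no answer.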

Published in
ASIA CCS '21: Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security
May 2021
975 pages
ISBN: 9781450382878
DOI: 10.1145/3433210
General Chairs: Jiannong Cao (The Hong Kong Polytechnic University, Hong Kong), Man Ho Au (The University of Hong Kong, Hong Kong)
Program Chairs: Zhiqiang Lin (The Ohio State University, USA), Moti Yung (Google and Columbia University, USA)
Copyright © 2021 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: binary code, code similarity, graph embedding, vulnerability
Overall Acceptance Rate: 418 of 2,322 submissions, 18%