Position Distribution Matters: A Graph-Based Binary Function Similarity Analysis Method
Abstract
1. Introduction
- First, the node attribute representations currently in use are too shallow for models to capture deep semantics.
- Second, some graph embedding models build on node embedding methods such as random walks, which sample adjacent nodes at random, so these models never see the function graph as a whole.
- Third, position distribution is an often-overlooked feature: as Figure 1 shows, basic blocks can be grouped into different functional modules according to their attributes and their surroundings.
- We propose a graph-based method for binary function similarity analysis, which enriches information in graph embedding and takes position distribution information into consideration.
- We built a graph embedding model for the binary function using CapsGNN to provide the high-level information of the graph and used dynamic routing to transfer features to a suitable place of the embedding based on the distribution information.
- In the graph embedding model, we added a DiffPool layer that generates subgraphs by clustering nodes on their attributes and position information. These subgraphs supplement the graph information and represent the position distribution feature.
- We implemented a prototype, PDM, and evaluated it on two tasks to demonstrate its effectiveness. Experiments showed that PDM outperforms prior work on binary function similarity detection by up to three times in accuracy. In the vulnerable function ranking task, PDM is competitive with state-of-the-art tools and outperforms them in some circumstances.
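To make the node clustering idea in the contributions above concrete, the following is a minimal sketch of one DiffPool-style coarsening step in plain Python. It is not the PDM implementation. It assumes that the node embeddings and the cluster assignment scores are already available (in DiffPool both are produced by GNN layers, which are omitted here), and the names `matmul`, `transpose`, `row_softmax`, and `diffpool_coarsen` as well as the toy graph are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def row_softmax(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def diffpool_coarsen(adj, feats, assign_logits):
    """One coarsening step (assumed inputs, no learned layers).

    adj: n x n adjacency, feats: n x d node embeddings,
    assign_logits: n x k unnormalized cluster scores.
    Returns (pooled_feats k x d, pooled_adj k x k, assignment n x k).
    """
    s = row_softmax(assign_logits)             # soft cluster membership per node
    s_t = transpose(s)
    pooled_feats = matmul(s_t, feats)          # X' = S^T Z
    pooled_adj = matmul(matmul(s_t, adj), s)   # A' = S^T A S
    return pooled_feats, pooled_adj, s


# Toy chain graph 0-1-2-3 with two clusters: {0, 1} and {2, 3}.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
logits = [[10.0, -10.0], [10.0, -10.0], [-10.0, 10.0], [-10.0, 10.0]]
pooled_feats, pooled_adj, assignment = diffpool_coarsen(adj, feats, logits)
```

In this toy example the assignment is almost hard, so the pooled graph has two nodes whose connecting weight reflects the single edge between nodes 1 and 2. Each pooled node therefore summarizes a region of the original graph, which is one way to think about position distribution.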
2. Related Work
2.1. Binary Code Similarity Detection
2.2. Graph-Based Methods in BCSD
3. Motivation
4. Methodology
4.1. ACFG+ Construction
4.2. Graph Embedding Model Based on Position Distribution
4.2.1. DiffPool Preprocessing
4.2.2. Capsule Graph Neural Network as the Backbone
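As a rough illustration of the dynamic routing that capsule networks such as CapsGNN rely on, the sketch below implements generic routing-by-agreement between two capsule layers in plain Python. It is a toy under stated assumptions, not the PDM code: the prediction vectors `u_hat` are taken as given rather than learned, and the names `squash`, `softmax`, and `dynamic_routing` are hypothetical.

```python
import math


def squash(vec):
    """Scale a vector so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(v * v for v in vec)
    if sq_norm == 0.0:
        return [0.0 for _ in vec]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * v for v in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """Route lower capsules to upper capsules by agreement.

    u_hat[i][j] is the vector that lower capsule i predicts for upper capsule j.
    Returns (upper capsule outputs, coupling coefficients of the last iteration).
    """
    n_lower, n_upper = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    logits = [[0.0] * n_upper for _ in range(n_lower)]
    outputs, coupling = [], []
    for _ in range(iterations):
        # Each lower capsule distributes its vote across the upper capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_upper):
            weighted = [
                sum(coupling[i][j] * u_hat[i][j][d] for i in range(n_lower))
                for d in range(dim)
            ]
            outputs.append(squash(weighted))
        # Votes that agree with the resulting output gain routing weight.
        for i in range(n_lower):
            for j in range(n_upper):
                logits[i][j] += sum(a * b for a, b in zip(u_hat[i][j], outputs[j]))
    return outputs, coupling


# Two lower capsules agree on upper capsule 0 and disagree on upper capsule 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [-1.0, 0.0]]]
outputs, coupling = dynamic_routing(u_hat)
```

In the toy input, the predictions for upper capsule 1 cancel out, so routing shifts coupling weight toward upper capsule 0, where the lower capsules agree.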
4.2.3. Similarity Comparison
5. Evaluation
5.1. Implementation Details
5.2. Experimental Setup
5.2.1. Hardware Configuration
5.2.2. Tasks
5.2.3. Datasets
5.2.4. Baselines
5.3. Task 1: Binary Function Similarity Detection
5.3.1. ROC and AUC Performance
5.3.2. Detection Accuracy Performance
5.4. Task 2: Vulnerability Detection
6. Discussion
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
ACFG | Attributed Control Flow Graph |
ACFG+ | Enhanced Attributed Control Flow Graph |
AST | Abstract Syntax Tree |
BCSD | Binary Code Similarity Detection |
CapsGNN | Capsule Graph Neural Network |
CFG | Control Flow Graph |
CVE | Common Vulnerabilities and Exposures |
DFG | Data Flow Graph |
GCN | Graph Convolution Network |
GGNN | Gated Graph Sequence Neural Networks |
GMN | Graph Matching Network |
GraphSAGE | Graph SAmple and aggreGatE |
HBMP | Hierarchy-like structure of BiLSTM layers with Max Pooling |
LSFG | Labeled Semantic Flow Graph |
MPNN | Message-Passing Neural Network |
TADW | Text-Associated DeepWalk algorithm |
Model | MRR | Precision (Top-10) | Precision (Top-50) | Recall (Top-1) | F1 (Top-10) | F1 (Top-50) | cmp_acc |
---|---|---|---|---|---|---|---|
VulSeeker | 13.78% | 15.15% | 34.85% | 7.58% | 10.10% | 12.45% | 89.21% |
PalmTree + VulSeeker | 22.50% | 30.30% | 69.70% | 15.15% | 20.20% | 24.89% | 90.80% |
PDMs-diffpool | 33.03% | 57.81% | 94.53% | 22.65% | 32.55% | 36.54% | 91.79% |
PDMt-diffpool | 40.26% | 64.06% | 93.75% | 30.46% | 41.29% | 45.98% | 91.60% |
PDMs | 37.32% | 63.28% | 96.87% | 28.12% | 38.94% | 43.59% | 90.23% |
PDMt | 46.81% | 69.53% | 96.88% | 37.50% | 48.72% | 54.07% | 91.99% |
Model | MRR (XO) | Recall (XO) | MRR (XC) | Recall (XC) | MRR (XA) | Recall (XA) |
---|---|---|---|---|---|---|
VulSeeker | 20.20% | 16.21% | 14.74% | 11.52% | 19.21% | 15.95% |
PalmTree + VulSeeker | 34.96% | 23.08% | 30.49% | 12.50% | 26.33% | 14.28% |
PDMs-diffpool | 50.31% | 37.50% | 26.33% | 14.28% | 39.78% | 30.77% |
PDMt-diffpool | 43.77% | 31.25% | 43.77% | 33.30% | 36.53% | 23.07% |
PDMs | 57.71% | 46.15% | 34.96% | 23.08% | 41.59% | 28.57% |
PDMt | 54.13% | 44.44% | 47.42% | 42.07% | 30.49% | 16.15% |
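The ranking metrics reported in the tables above can be read through the following minimal sketch. It uses the common textbook definitions of mean reciprocal rank (MRR) and of precision, recall, and F1 at a cutoff k. The paper's exact definitions are not given in this section and may differ, so treat the code as an assumption; all function names and the toy data are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """Return 1 / rank of the first relevant item, or 0.0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (ranked_candidates, relevant_set) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def precision_recall_f1_at_k(ranked, relevant, k):
    """Compute precision, recall, and F1 over the top-k ranked candidates."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy query: the true match "f1" is ranked second among three candidates.
ranked = ["f3", "f1", "f2"]
relevant = {"f1"}
rr = reciprocal_rank(ranked, relevant)
p, r, f1 = precision_recall_f1_at_k(ranked, relevant, k=2)
```

For the toy query, the first relevant function sits at rank 2, so its reciprocal rank is 0.5. With k = 2 the precision is 0.5 and the recall is 1.0.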
Busybox | Usage | Index | Version | Compiler | Optimization | Architecture |
---|---|---|---|---|---|---|
busybox-1.33.1-clang-i386-O2_unstripped | Search | 1 | 1.33.1 | clang | O2 | i386 |
busybox-1.33.1-gcc-aarch64-O0_unstripped | Search | 2 | 1.33.1 | gcc | O0 | aarch64 |
busybox-1.33.1-gcc-arm-O0_unstripped | Search | 3 | 1.33.1 | gcc | O0 | arm |
busybox-1.33.1-gcc-i386-O1_unstripped | Search | 4 | 1.33.1 | gcc | O1 | i386 |
busybox-1.33.1-gcc-i386-O3_unstripped | Search | 5 | 1.33.1 | gcc | O3 | i386 |
busybox-1.33.1-gcc-mips64-O0_unstripped | Search | 6 | 1.33.1 | gcc | O0 | mips64 |
busybox-1.33.1-gcc-mips-O0_unstripped | Search | 7 | 1.33.1 | gcc | O0 | mips |
busybox-1.34.1-clang-i386-O1_unstripped | Search | 8 | 1.34.1 | clang | O1 | i386 |
busybox-1.33.1-gcc-x86_64-O0_unstripped | Target | 9 | 1.33.1 | gcc | O0 | x86_64 |
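The binary names in the table above appear to follow a `project-version-compiler-architecture-optimization_suffix` pattern. Assuming that pattern holds, a small parser such as the hypothetical `parse_binary_name` below can recover the table columns from a file name. The regular expression is an illustration inferred from the listed names and is not part of PDM.

```python
import re

# Assumed naming scheme, inferred from the names listed in the table.
BINARY_NAME = re.compile(
    r"^(?P<project>[^-]+)"
    r"-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<compiler>[^-]+)"
    r"-(?P<architecture>[^-]+)"
    r"-(?P<optimization>O[0-3sz])"
    r"_(?P<suffix>\w+)$"
)


def parse_binary_name(name):
    """Split a binary file name into its fields, or return None if it does not match."""
    match = BINARY_NAME.match(name)
    return match.groupdict() if match else None


info = parse_binary_name("busybox-1.33.1-gcc-aarch64-O0_unstripped")
```

Names that do not fit the assumed scheme yield `None`, so a caller can skip or report them instead of silently mislabeling a binary.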
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pan, Z.; Wang, T.; Yu, L.; Yan, Y. Position Distribution Matters: A Graph-Based Binary Function Similarity Analysis Method. Electronics 2022, 11, 2446. https://doi.org/10.3390/electronics11152446