1. Introduction
Binary code similarity detection (BCSD) plays an important role in software security. It is widely used in tasks such as 1-day vulnerability discovery [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16], malware detection [17,18,19], third-party library detection [20,21], software plagiarism detection [22,23], and patch analysis [24,25,26]. In vulnerability discovery, for example, BCSD determines whether a function under test contains a known vulnerability by computing its similarity to the vulnerable function. The success of deep learning in BCSD can be attributed to its powerful representation learning ability, which has proven effective at capturing complex patterns and relationships in data and at learning meaningful representations of assembly code. Despite their improved performance, existing methods have several limitations.
First, semantic-based approaches regard the code of a function as text, focusing only on code semantics while neglecting the structural information of the function. These methods typically take the machine code, assembly code, pseudo-C code, or other textual information of a function as its features, and learn with Word2Vec [27], Transformers, or other natural language models; representative examples include SAFE [28], Asm2Vec [29], jTrans [30], and CLAP [31]. They have achieved excellent results. However, when variations in the compilation environment or architecture cause significant differences in semantic information, detection accuracy suffers. As shown in Figure 1, the two functions use different variable names, which lowers their semantic similarity.
Second, graph-based methods focus solely on the graph structure of the function while ignoring its semantic information. These methods [8,9,16,32,33] usually take structural representations of functions as features, such as control flow graphs (CFGs), data flow graphs (DFGs), and abstract syntax trees (ASTs). They have performed well in some scenarios. However, differences in the compilation environment or architecture can cause significant structural variations, which degrade detection accuracy. As shown in Figure 2, the two functions have reduced structural similarity because the "if-else" statement is implemented differently; a comparison of their control flow graphs is shown in Figure 3.
Finally, methods based on both semantics and graph structure usually fuse the textual semantic features with the graph structure features to perform detection [34,35,36].
Existing methods have low robustness in diverse real-world scenarios. On the semantic side, most existing methods take assembly code as the function feature, focusing excessively on low-level implementation details, such as stack instructions and register operations, while ignoring the actual semantics of the function. We wrote two functions in C to illustrate this. As shown in Figure 4, the function on the left adds two numbers and the function on the right subtracts two numbers. The only difference between the two lies in the assembly instruction highlighted in yellow; the remaining instructions are highly similar, which introduces considerable noise when a semantic model judges their similarity. On the structural side, the CFG is often used as the feature. In complex scenarios such as cross-architecture comparison, the graph structure is not robust enough, and the accuracy of the method declines. As shown in Figure 3, functions compiled from the same source code for different architectures have different CFGs. Because existing methods learn semantic and structural features insufficiently, they have not achieved satisfactory results.
To address these limitations, we propose Fus, a method that integrates semantic information learned from pseudo-C code with graph structure information learned from the AST to obtain robust vector representations of functions. By considering both semantic and graph structure information, this integrated approach alleviates the inability of single-feature methods to handle changes in the compilation environment and architecture. We choose pseudo-C code as the semantic feature and the AST as the structural feature to fully represent the function. The pseudo-C code is obtained through decompilation, which filters out low-level instruction details while retaining sufficient semantic information, providing a cleaner semantic representation for model learning. As shown in Figure 5, the pseudo-C code contains no call-stack manipulation or register operations; the function body contains only variable definitions and the statements that actually call functions, which greatly reduces interference in the similarity judgment. Figure 4 shows the corresponding assembly instructions for Figure 5: the function body is dominated by stack and register operations, which interfere heavily with the similarity judgment. The AST is generated from the pseudo-C code, which in turn is obtained through disassembly and decompilation; it is independent of the specific hardware architecture and therefore more stable across architectures.
To acquire semantic information effectively, we use a Siamese network that couples two CodeBERT models and lets them learn from the pseudo-C code. To acquire structural information, we train a Tree LSTM in the same manner to learn from the AST. We use these two models to compute two similarity scores for a pair of functions and take their sum as the final function similarity. We select seven scenarios: XA, XO, XC, XA + XO, XA + XC, XO + XC, and XA + XO + XC, where XA denotes cross-architecture, XO denotes cross-optimization, and XC denotes cross-compiler; the remaining scenarios are combinations of these three. These scenarios evaluate the robustness of the method in various situations. We construct function pools of sizes 2, 10, 32, 128, 512, 1000, and 10,000 to evaluate performance at different data scales. The results show that Fus outperforms the baselines in terms of MRR and Recall@1 in most scenarios, and its performance degrades only slowly as the function pool grows. We also evaluate Fus on real vulnerability search tasks, where it is superior to the baseline methods.
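To make the score-level fusion concrete, the following is a minimal sketch, assuming PyTorch; the helpers encode_pseudo_c and encode_ast are hypothetical stand-ins for the CodeBERT-based semantic encoder and the Tree LSTM structural encoder described above, each returning a 1-D embedding.

```python
import torch
import torch.nn.functional as F

def fused_similarity(func_a, func_b, encode_pseudo_c, encode_ast):
    """Score-level fusion: sum of semantic and structural cosine similarities.

    encode_pseudo_c / encode_ast are assumed to return 1-D embedding tensors
    for a function's pseudo-C code and AST, respectively.
    """
    # Semantic similarity from the CodeBERT-based encoder (pseudo-C code).
    sem_a = encode_pseudo_c(func_a["pseudo_c"])
    sem_b = encode_pseudo_c(func_b["pseudo_c"])
    sem_sim = F.cosine_similarity(sem_a, sem_b, dim=0)

    # Structural similarity from the Tree LSTM encoder (AST).
    str_a = encode_ast(func_a["ast"])
    str_b = encode_ast(func_b["ast"])
    str_sim = F.cosine_similarity(str_a, str_b, dim=0)

    # The final similarity is the sum of the two scores.
    return (sem_sim + str_sim).item()
```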
In summary, our contributions are as follows:
We present a novel method named Fus, which integrates semantic information and graph structure information to encode functions. We use a Siamese neural network to train the CodeBERT model to learn semantic information from the pseudo-C code, and use the same training scheme to train a Tree LSTM to learn structural information from the AST. The similarity obtained by integrating the two models serves as the final similarity of the function.
We implement a prototype of Fus. For model training, we construct a cross-optimization dataset for the semantic model and a cross-architecture dataset for the graph model, each containing 20,000 pairs of functions. We compare our model against the baselines, and the evaluation results show that Fus outperforms all of them in most scenarios.
We evaluate the vulnerability search application. We collect 140,000 firmware functions as the pool of functions to be tested and 3 CVEs as query functions. Experimental results show that Fus achieves the best recall compared with the baselines.
6. Evaluation
Our evaluation aims to answer the following questions.
RQ1. How does the performance of Fus compare with the baselines?
RQ2. How do Fus and the baselines perform at different function pool sizes?
RQ3. How does Fus perform in real vulnerability detection?
RQ4. How effective is the integration strategy?
6.1. BCSD Performance (RQ1)
We randomly select 32 pairs and 10,000 pairs of functions from each of the XA, XO, XC, XA + XO, XA + XC, XO + XC, and XA + XO + XC scenarios in Dataset-3 as evaluation datasets. This enables our method and the baselines to be evaluated at different levels of difficulty.
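As a reference for how the reported metrics are computed in this pool-based protocol, below is a minimal sketch, assuming NumPy, that each query has exactly one ground-truth match in the pool, and that the similarity scores come from any of the compared methods.

```python
import numpy as np

def mrr_and_recall_at_1(sim_matrix):
    """Compute MRR and Recall@1 from a query-by-pool similarity matrix.

    sim_matrix[i, j] is the similarity between query i and pool function j;
    the ground-truth match of query i is assumed to sit at pool index i.
    """
    num_queries = sim_matrix.shape[0]
    reciprocal_ranks, hits_at_1 = [], 0
    for i in range(num_queries):
        # Rank pool functions by similarity, highest first.
        order = np.argsort(-sim_matrix[i])
        rank = int(np.where(order == i)[0][0]) + 1  # 1-based rank of the true match
        reciprocal_ranks.append(1.0 / rank)
        hits_at_1 += int(rank == 1)
    return float(np.mean(reciprocal_ranks)), hits_at_1 / num_queries
```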
Small Pool (Pool Size = 32). As shown in Table 3 and Table 4, Fus achieves the highest MRR and Recall@1 in all seven scenarios compared with the baselines, which shows that Fus adapts to various complex environments. Compared with our individual methods, HLSEn and Tree LSTM, Fus is slightly lower than HLSEn only in the XC scenario, which is an acceptable result.
Large Pool (Pool Size = 1000). As shown in Table 5 and Table 6, Fus achieves the highest MRR and Recall@1 in all seven scenarios compared with the baselines, which shows that Fus also adapts to various complex environments at a larger scale. Compared with our single methods, HLSEn and Tree LSTM, Fus achieves the best MRR and Recall@1, showing that the integrated method has higher accuracy and recall than the individual methods.
The experiments in this section show that Fus achieves high accuracy on both small-scale and large-scale test sets, and its strong performance across scenarios demonstrates that it can adapt to various complex situations.
6.2. The Effects of Pool Size on Performance (RQ2)
The previous section shows that the size of the function pool affects the accuracy of a method. To explore this impact further, we construct function pools of sizes 2, 10, 32, 128, 512, 1000, and 10,000, and evaluate the methods on each.
The results are shown in Figure 8. In most scenarios, Fus outperforms all other methods as the pool size increases, indicating that our method adapts to different testing scales and is more robust in practical applications. As the function pool grows, the MRR of the other methods drops sharply, whereas our method declines much more slowly, which further demonstrates its superiority.
6.3. Real-World Vulnerability Search (RQ3)
The detection of known vulnerabilities is an important application scenario of BCSD. We use Dataset-4 as the evaluation dataset. Each time, one variant of a CVE is selected as the query function and the other nine variants of that CVE serve as ground-truth functions. The remaining CVEs, together with 70,000 pairs of functions from Dataset-3, form the function pool, which therefore contains 140,029 functions. Such a large-scale function pool is extremely challenging.
Because each query has nine ground-truth variants in the pool, we use Recall@9 as the metric for each query, and we average the Recall@9 over the 10 queries of a CVE to obtain the Ave Recall@9 for that CVE. The Ave Recall@9 of each CVE is shown in Figure 9. Our method, Fus, outperforms the baselines and achieves the highest Ave Recall@9. Compared with the individual methods HLSEn and Tree LSTM, the integrated method performs best, indicating the effectiveness of our approach.
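For clarity, a minimal sketch of this metric follows, assuming NumPy; the similarity scores and ground-truth index sets are hypothetical inputs, with nine ground-truth variants per query as described above.

```python
import numpy as np

def recall_at_k(scores, truth_indices, k=9):
    """Fraction of the ground-truth variants ranked in the top k of the pool."""
    top_k = set(np.argsort(-scores)[:k])
    return len(top_k & set(truth_indices)) / len(truth_indices)

def ave_recall_at_9(per_query_scores, per_query_truths):
    """Average Recall@9 over the queries of one CVE (10 queries in our setup)."""
    recalls = [recall_at_k(s, t, k=9)
               for s, t in zip(per_query_scores, per_query_truths)]
    return float(np.mean(recalls))
```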
6.4. The Effectiveness of the Integration Method (RQ4)
In this section, we conduct ablation experiments on the integration method. There are many ways to integrate the semantic-based method HLSEn and the graph-structure-based method Tree LSTM. The proposed Fus fusion adds the similarities computed by HLSEn and Tree LSTM to obtain the final similarity. An alternative integration method, which we call Concat, encodes a function with HLSEn and Tree LSTM separately, concatenates the two vectors to form the final embedding of the function, and uses the cosine similarity of these embeddings as the similarity between functions.
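For comparison with the score-level fusion sketched earlier, the following is a minimal sketch of the Concat variant, assuming PyTorch; encode_pseudo_c and encode_ast are the same hypothetical encoder helpers as before.

```python
import torch
import torch.nn.functional as F

def concat_similarity(func_a, func_b, encode_pseudo_c, encode_ast):
    """Concat variant: concatenate the two embeddings, then take one cosine similarity."""
    emb_a = torch.cat([encode_pseudo_c(func_a["pseudo_c"]),
                       encode_ast(func_a["ast"])], dim=0)
    emb_b = torch.cat([encode_pseudo_c(func_b["pseudo_c"]),
                       encode_ast(func_b["ast"])], dim=0)
    return F.cosine_similarity(emb_a, emb_b, dim=0).item()
```

Note that, unlike the score-level sum, which weights the semantic and structural views equally, a single cosine over the concatenated vectors lets the view with the larger embedding norm dominate the score.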
We comparatively evaluate the two integration methods, Fus and Concat, by computing MRR and Recall@1 on the function pool of size 10,000. The experimental results are shown in Figure 10. Across the scenarios, the MRR and Recall@1 of Fus are superior to those of Concat, demonstrating the effectiveness of our integration method.
7. Conclusions
In this paper, we propose Fus, which integrates semantic and graph structure information to encode functions. By not relying on a single type of function information, the integrated approach improves both the accuracy and the practicality of the method. The experimental results show that our method outperforms the baselines in most BCSD tasks. Fus is also superior to our single methods based on semantics and graph structure alone, indicating that the integrated method offers higher accuracy and greater stability. Meanwhile, our single methods, HLSEn and Tree LSTM, also perform excellently compared with the baselines, indicating that pseudo-C code and the AST are stable and expressive features for function representation. However, the integration method in this paper is static and fixed: it lacks dynamic decision making and self-optimization, and cannot perform multi-angle reasoning, verification, and iterative analysis as security experts do, which limits the automation and scalability of vulnerability detection. With the development of large language models, we will explore more intelligent integration methods in the future. For instance, we could design a collaborative decision-making mechanism based on a multi-agent system, introducing four role-based agents for analysis, comparison, reasoning, and verification, so that function similarity is assessed from multiple perspectives. Each agent would assess independently according to its own capability, collaborate with the others, and dynamically adjust its judgment strategy through a feedback mechanism, simulating the multi-angle reasoning and knowledge integration of human analysis to achieve a more flexible and intelligent vulnerability detection process.