3.1. Classification and Normalization of Assembly Instructions
The assembly instructions of different CPU architectures have significant differences in syntax structure, register and operand definitions, etc. These differences are one of the main obstacles to cross-architecture binary code analysis. In order to reduce the interference of these differences on the semantic consistency of model learning, this section designs a unified assembly instruction classification and normalization method to improve the efficiency of cross-architecture semantic modeling.
(1) Instruction classification
The semantic diversity of assembly instructions stems from the instruction set design principles of different architectures. To resolve this difference, a set of general instruction-classification rules is proposed based on the functional characteristics of instructions, and complex assembly instructions are classified into ten general categories. By reasonably classifying instruction semantics, a consistent input representation can be provided for subsequent instruction normalization and feature extraction. The specific classification categories are shown in Table 1.
Based on an analysis of the instruction set characteristics of the x86 and ARM architectures, complex assembly instructions are divided into the following general categories:
Conditional Jump: includes conditional jump instructions, such as JZ, JNZ of x86 and B.EQ, B.NE of ARM, which are used to change the program-execution flow according to specific conditions.
Unconditional Jump: includes unconditional jump instructions, such as x86’s JMP and ARM’s B, which are used to jump to the specified address unconditionally.
Data Transfer: includes data transfer instructions, such as MOV, LEA of x86 and LDR, STR of ARM, which are used to transfer data between registers, memory and immediate values.
Arithmetic: includes numerical calculation instructions such as addition, subtraction, multiplication and division, such as ADD, SUB of x86 and ADD, SUB of ARM.
Logical: includes bitwise logical operation instructions, such as AND, OR of x86 and AND, ORR of ARM.
Shift and Rotate: includes displacement and bit rotation instructions, such as SHL, SAR of x86 and LSL, LSR of ARM.
Bit Operation: includes bit operation instructions, such as BT, BTS of x86 and TST, REV of ARM.
CPU and System: includes control registers, memory barriers and other instructions, such as HLT, CPUID of x86 and DMB, DSB of ARM.
Compare: includes instructions for numerical or conditional comparison, such as CMP of x86 and CMP, CMN of ARM.
Conditional Set: includes instructions for setting flags or register values according to conditions, such as SETZ of x86 and CSET of ARM.
However, the instruction classification here only considers the macro-function of an instruction and does not model the details of how that function is implemented. For example, on the x86 architecture the CMP instruction actually performs a subtraction and updates the relevant bits of the flag register (EFLAGS) according to the result, but its macro-function is to compare two operands (registers or immediate values).
Through the instruction-classification method, instructions with the same semantics under different architectures are aligned, which can reduce the macro-function differences of binary codes under different architectures and lay the foundation for instruction normalization.
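As an illustration of the classification step, the following minimal sketch maps mnemonics to the ten categories using only the example instructions listed above. The table name `CATEGORY_BY_MNEMONIC`, the function `classify` and the "Unknown" fallback are assumptions made for this example; a complete table would have to cover the full instruction sets.

```python
# Hypothetical lookup table built only from the examples in the category list;
# it is not the complete mapping of Table 1.
CATEGORY_BY_MNEMONIC = {
    "x86": {
        "JZ": "Conditional Jump", "JNZ": "Conditional Jump",
        "JMP": "Unconditional Jump",
        "MOV": "Data Transfer", "LEA": "Data Transfer",
        "ADD": "Arithmetic", "SUB": "Arithmetic",
        "AND": "Logical", "OR": "Logical", "XOR": "Logical",
        "SHL": "Shift and Rotate", "SAR": "Shift and Rotate",
        "BT": "Bit Operation", "BTS": "Bit Operation",
        "HLT": "CPU and System", "CPUID": "CPU and System",
        "CMP": "Compare",
        "SETZ": "Conditional Set",
    },
    "arm": {
        "B.EQ": "Conditional Jump", "B.NE": "Conditional Jump",
        "B": "Unconditional Jump",
        "LDR": "Data Transfer", "STR": "Data Transfer",
        "ADD": "Arithmetic", "SUB": "Arithmetic",
        "AND": "Logical", "ORR": "Logical", "EOR": "Logical",
        "LSL": "Shift and Rotate", "LSR": "Shift and Rotate",
        "TST": "Bit Operation", "REV": "Bit Operation",
        "DMB": "CPU and System", "DSB": "CPU and System",
        "CMP": "Compare", "CMN": "Compare",
        "CSET": "Conditional Set",
    },
}


def classify(arch: str, mnemonic: str) -> str:
    """Return the general category of a mnemonic, or "Unknown" (assumed fallback)."""
    return CATEGORY_BY_MNEMONIC[arch].get(mnemonic.upper(), "Unknown")


if __name__ == "__main__":
    # MOV (x86) and LDR (ARM) land in the same category.
    print(classify("x86", "mov"), "|", classify("arm", "ldr"))
```

Because both architectures share one category vocabulary, the category name can later serve as the normalized opcode token.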
(2) Instruction normalization
After instruction classification, we further propose a normalization method that eliminates instruction differences between CPU architectures to the greatest extent possible. It consists of two parts: opcode normalization and operand normalization.
The key to opcode normalization is to abstract the functional semantics of instructions and map opcodes with the same semantics but different representations in different architectures to common opcodes. For example, instructions with the same semantics such as MOV, LEA in the x86 architecture and LDR, STR in the ARM architecture are unified as “Data Transfer” opcodes; instructions such as OR, XOR in x86 and ORR, EOR in ARM are unified as “Logical” opcodes.
According to the expression and format of operands in different architectures, operand normalization is divided into symbol normalization, string normalization, immediate normalization and memory address normalization. Symbol normalization replaces operands that refer to entries in the symbol table with “symbol”; string normalization replaces operands that refer to entries in the string table with “string”. Immediate values may carry different prefixes (such as 0x for hexadecimal) or suffixes in different architectures; these prefixes and suffixes are removed and the value is normalized to “immval”. Addresses in assembly code usually start with “0x”, and their length can be used to distinguish them from ordinary immediate values: in this article, a string starting with 0x whose length is greater than 6 is considered a memory address and normalized to “address”.
Through the above normalization processing, the differences between assembly instructions of different architectures are significantly reduced, so that instructions with the same semantics, together with their opcodes and operands, are input into the model in a consistent form. The model can therefore focus on the semantics of the assembly instruction itself rather than on architecture-specific details, which improves both the training efficiency and the cross-architecture detection capability of the model.
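The operand rules described above can be sketched as follows. This is only an illustration under stated assumptions: the symbol and string tables are passed in as plain sets, the length test counts the whole token including the "0x" prefix, the immediate pattern covers `#`/`$` prefixes and an `h` suffix, and operands that match no rule (for example registers) are returned unchanged. None of these details are specified in the text.

```python
import re

# Assumed immediate syntaxes: optional '#' or '$' prefix, optional sign, then
# 0x-hex, Intel-style hex with an 'h' suffix (must start with a digit), or decimal.
IMM_RE = re.compile(r"^[#$]?-?(0x[0-9a-f]+|\d[0-9a-f]*h|\d+)$")
ADDR_RE = re.compile(r"^0x[0-9a-f]+$")


def normalize_operand(op, symbols=frozenset(), strings=frozenset()):
    """Map one operand to "symbol", "string", "address", "immval" or itself."""
    if op in symbols:
        return "symbol"
    if op in strings:
        return "string"
    tok = op.lower()
    # Rule from the text: a token starting with 0x and longer than 6 characters
    # is treated as a memory address (length taken over the whole token here).
    if ADDR_RE.match(tok) and len(tok) > 6:
        return "address"
    if IMM_RE.match(tok):
        return "immval"
    return tok  # assumption: other operands such as registers are kept as-is


if __name__ == "__main__":
    for operand in ["0x401000", "0x10", "#4", "10h", "eax", "printf"]:
        print(operand, "->", normalize_operand(operand, symbols={"printf"}))
```

The address test is placed before the immediate test because every address token would otherwise also match the hexadecimal immediate pattern.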
3.2. Model Design
3.2.1. Overview
Figure 2 shows the overall framework of the cross-architecture binary code similarity-detection model proposed in this study. The system mainly consists of three key modules: semantic context information-extraction module, structural context information-extraction module and similarity score calculation module.
First, in the semantic context information-extraction module, the system disassembles the input binary file and extracts the assembly instruction sequence of each function. In view of the large syntactic differences and diverse expressions between architectures, this study designs a cross-architecture instruction alignment and normalization scheme. By uniformly mapping the opcodes and operands of binary functions, heterogeneous instructions with the same semantics are mapped to the same space, minimizing the impact of instruction differences caused by different architectures. A BERT-based language model is then pre-trained on the normalized instructions through Masked Language Modeling (MLM), Next Sentence Prediction (NSP) and Contrastive Learning (CL) tasks, thereby obtaining semantic embeddings of the instructions. This process not only realizes architecture-independent instruction semantic modeling, but also provides a unified representation space for subsequent similarity learning.
Secondly, in the structural context information-extraction module, the program control flow graph (CFG) of the binary function is constructed, and the graph structure is encoded in combination with the graph attention neural network (GAT). Considering that the semantics of binary functions not only depends on a single instruction, but is also closely related to the control path in which it is located, this module uses graphs as basic units to model the control dependencies between basic blocks, which can effectively capture the structural context information of basic blocks; at the same time, the larger the binary function, the more basic blocks it contains, and there may be noisy basic blocks containing fewer instructions or irrelevant instructions. Therefore, GAT is used to make the model pay more attention to important nodes containing rich semantic information and reduce the interference of noise nodes on graph representation learning.
Finally, in the similarity score-calculation module, the function-level vector representations that integrate semantic and structural context are obtained for the two functions, and the Euclidean distance between them is calculated. A preset threshold is then used to determine whether the two given binary functions are similar.
3.2.2. Semantic Context Information Extraction Module
The semantic context information-extraction module mainly performs semantic embedding through three BERT model pre-training tasks, including the masked language model task, the next sentence-prediction task and the contrastive learning task. The specific design ideas and implementation details are as follows:
(1) Masked Language Model
The Masked Language Model (MLM) task is one of the basic tasks of BERT. Its core idea is to randomly mask some tokens in the input sequence so that the model must use the learned contextual semantic information to predict the masked content. In this module, the MLM task is applied to the normalized instruction sequence so that the model can learn the semantic information within each instruction.
Specifically, let the input instruction sequence be $S = \{I_1, I_2, \ldots, I_n\}$. Each instruction $I_i$ consists of an opcode and zero or more operands, and these opcodes and operands are regarded as tokens, that is, $I_i = \{t_1, t_2, \ldots, t_k\}$. A certain proportion (15%) of the tokens is then randomly selected for masking. Among the selected tokens, 80% are replaced with the special mark “[MASK]”, 10% are replaced with a random token and the remaining 10% remain unchanged. The rationale for this design is that the model cannot know which tokens have been replaced, which forces it to learn the context information of all tokens. Through self-supervised training, the model can use the context of each token to accurately predict the original content of the masked tokens. The training loss function is as follows.
$$L_{MLM} = -\sum_{t_j \in M} \log p(t_j) \quad (1)$$
Here $M$ represents the set of masked tokens in the instruction, and $p(t_j)$ is the probability that the model outputs for the original token at each masked position. Through repeated masking and prediction, the model gradually learns to use global context information to restore the masked tokens, and thereby captures the semantic relationships within the instruction.
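The 15% / 80-10-10 masking procedure can be sketched as below. The function name `mask_tokens`, the example tokens and the use of an independent 15% draw per token (rather than selecting exactly 15% of the positions) are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = {}
    for pos, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[pos] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    # Hypothetical normalized instructions flattened into one token list.
    tokens = ["Data Transfer", "eax", "immval", "Arithmetic", "eax", "ebx"]
    vocab = sorted(set(tokens))
    print(mask_tokens(tokens, vocab, rng=random.Random(0)))
```

Only the positions recorded in `labels` contribute to the MLM loss; all other positions serve as context.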
(2) Next sentence-prediction model
In assembly code, the order of instructions is crucial, and the execution logic of the program often relies on the synergy of a series of assembly instructions. For example, a conditional jump instruction is followed by a data transfer instruction or an arithmetic operation instruction. The combination not only affects the operation result, but also determines the choice of the program-execution path. By capturing the order information between instructions, the accuracy of model similarity detection can be further improved.
The Next Sentence Prediction (NSP) model is another basic task in BERT pre-training. Its main purpose is to train the model to determine whether two input sequences are adjacent in the original text. In this model, the NSP task is applied to the normalized binary instruction sequence to capture the contextual continuity and logical relationship between instructions.
Specifically, given instructions $I_a$ and $I_b$, if these two instructions are adjacent in the instruction sequence, the pair is marked with the positive label 1; otherwise it is marked with the negative label 0. The instruction pair $(I_a, I_b)$ is input into the model, and the predicted probability is output through a binary classification layer. The cross-entropy function is selected as the training loss function, and the formula is as follows.
$$L_{NSP} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \quad (2)$$
Here $\hat{y}$ represents the probability predicted by the model that $I_a$ and $I_b$ are continuous, $1 - \hat{y}$ represents the predicted probability that they are discontinuous, and $y$ is the actual label, where 1 represents continuous and 0 represents discontinuous. Through the NSP task, the model learns the sequential relationships and logical dependencies between binary instructions during training, which improves its ability to understand complex binary code structures and its detection accuracy.
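A minimal sketch of how the NSP training pairs and the cross-entropy loss could be computed is given below. The sampling strategy (one random non-adjacent negative per positive pair) and the helper names `make_nsp_pairs` and `nsp_loss` are assumptions; the text does not specify how negatives are drawn.

```python
import math
import random


def make_nsp_pairs(instructions, rng=None):
    """Build (first, second, label) triples: label 1 if adjacent, else 0."""
    rng = rng or random.Random()
    n = len(instructions)
    pairs = []
    for i in range(n - 1):
        pairs.append((instructions[i], instructions[i + 1], 1))
        # Assumed negative sampling: any instruction that is not a neighbour of i.
        candidates = [j for j in range(n) if abs(j - i) > 1]
        if candidates:
            pairs.append((instructions[i], instructions[rng.choice(candidates)], 0))
    return pairs


def nsp_loss(p_adjacent, y):
    """Binary cross-entropy for one pair; p_adjacent is the predicted P(adjacent)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_adjacent, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


if __name__ == "__main__":
    seq = ["Compare eax immval", "Conditional Jump address",
           "Data Transfer eax ebx", "Arithmetic eax immval"]
    for pair in make_nsp_pairs(seq, rng=random.Random(0)):
        print(pair)
    print(nsp_loss(0.9, 1), nsp_loss(0.9, 0))
```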
(3) Contrastive learning task
The contrastive learning (CL) task is one of the core modules for cross-architecture binary code similarity detection. Its basic idea is to construct positive and negative sample pairs under different architectures so that the model can learn the semantic similarities and differences between different instructions in a unified embedding space, thereby achieving cross-architecture semantic alignment.
In the data preprocessing module, after the assembly instructions are normalized, the instructions that implement the same function in different CPU architectures (such as MOV in the x86 architecture and LDR in the ARM architecture) are classified into the same category. These instructions constitute positive sample pairs, while instructions with obvious functional differences (such as ADD in x86 and AND in ARM) are used as negative sample pairs. Through the construction of such positive and negative sample pairs, the model can automatically adjust the spatial distribution of the embedding vectors during training, so that semantically similar instruction pairs are as close as possible in the vector space, while the distance between semantically inconsistent instruction pairs is enlarged. That is, for the $i$-th sample pair, let the embedding vectors generated after normalization be $e_i^{(1)}$ and $e_i^{(2)}$ respectively; the contrastive loss function can then be expressed as follows.
$$L_{CL} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \, d_i^2 + (1 - y_i) \max(0, m - d_i)^2 \right], \quad d_i = \lVert e_i^{(1)} - e_i^{(2)} \rVert_2 \quad (3)$$
Here $y_i$ represents the label of the $i$-th sample pair (1 for a positive sample pair, 0 for a negative sample pair), $e_i^{(1)}$ and $e_i^{(2)}$ respectively represent the embedding vectors of the two instructions in the $i$-th sample pair, $d_i$ represents the Euclidean distance between them, $m$ is the margin parameter, which ensures that the embedding distance of a negative sample pair is not less than the threshold, and $N$ is the total number of sample pairs.
During the training process, the model continuously optimizes the distribution of the embedding space by minimizing the above loss function. The vector distance of the positive sample pair is gradually reduced during training, while the vector distance of the negative sample pair is expanded. In this way, the model can learn to map instructions from different architectures but with similar functions to a unified semantic representation, thereby improving the model’s ability and accuracy in cross-architecture detection.
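The behaviour of the contrastive loss can be checked with a small numerical sketch. It assumes the common margin-based form in which a positive pair contributes its squared distance and a negative pair contributes the squared shortfall below the margin $m$, averaged over the $N$ pairs; the function names and toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pairs, margin=1.0):
    """pairs: iterable of (vec1, vec2, y) with y = 1 (positive) or 0 (negative)."""
    pairs = list(pairs)
    total = 0.0
    for u, v, y in pairs:
        d = euclidean(u, v)
        # Positive pairs are pulled together; negatives are pushed beyond the margin.
        total += y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2
    return total / len(pairs)


if __name__ == "__main__":
    mov_x86, ldr_arm = [0.9, 0.1], [0.8, 0.2]  # toy "Data Transfer" embeddings
    add_x86, and_arm = [0.1, 0.9], [0.2, 0.1]  # toy embeddings of unrelated opcodes
    print(contrastive_loss([(mov_x86, ldr_arm, 1), (add_x86, and_arm, 0)]))
```

A negative pair that is already farther apart than the margin contributes zero, so training effort concentrates on negatives that are still too close.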
The loss function of the entire model is the sum of the losses of the above three pre-training tasks, as shown in Equation (4).
$$L = L_{MLM} + L_{NSP} + L_{CL} \quad (4)$$
Through the semantic embedding module, the model can fuse the internal semantic information of a single instruction with the contextual information between instructions, and combine both with contrastive learning to effectively learn the common features of semantically similar instructions under different architectures, thereby converting the normalized instruction sequence into a unified semantic vector representation.
3.2.3. Structural Information Extraction Module
The purpose of the structural information-extraction module is to extract feature vectors containing contextual structural information between basic blocks from the program control flow graph. In order to reduce the interference of noise nodes in the program control flow graph, this section bases its design on the attention mechanism of the graph attention network (GAT). The module consists of an embedding layer, a multi-layer graph attention encoding layer and a feature-aggregation output layer.
Since the input is a graph whose nodes contain continuous instruction sequences, the instructions in each basic block are first passed through the embedding layer, so that each node carries the semantic information of its internal instructions, which facilitates the subsequent extraction of structural features. The resulting vector is then used as the initial vector of the basic block node and input into the multi-layer graph attention encoding layer, which produces for each node a vector containing both its own weighted features and the structural information between nodes. In order to make the model pay more attention to nodes with rich semantics, the multi-head graph attention mechanism of the graph attention network is used in the multi-layer graph attention encoding layer. Finally, in order to obtain the embedding vector of the entire program control flow graph, state concatenation and a fully connected layer are used as the feature-aggregation output layer. Its purpose is to output a single vector as the final embedding vector of the binary function, which is convenient for the subsequent calculation of the similarity between functions.
I. Embedding layer
Since each basic block contains multiple instructions, the embedding module is used to generate a semantic embedding vector for each instruction, and the basic block is passed through the embedding layer to obtain the basic block-level embedding vector, which is convenient for the subsequent multi-layer attention encoding layers to process. The embedding process is as follows. Given a basic block node containing $N$ assembly instructions $\{I_1, I_2, \ldots, I_N\}$, the semantic embedding module is first used to convert each instruction $I_i$ into an embedding vector of dimension $d$, as follows:
$$e_i = \mathrm{Embed}(I_i), \quad e_i \in \mathbb{R}^d \quad (5)$$
Here $e_i$ represents the embedding vector corresponding to each instruction, and $\mathrm{Embed}(\cdot)$ represents the semantic embedding process. After obtaining the embedding vectors of all instructions in the basic block, the embedding vectors are average-pooled and merged into a basic block-level embedding vector, which is used as the initial embedding vector of the basic block node, as shown in Equation (6):
$$h_B = \frac{1}{N} \sum_{i=1}^{N} e_i \quad (6)$$
Here $h_B$ is the final embedding vector of the basic block, which represents the average semantic information of all instructions in the basic block. The above operation is performed on each basic block in the program control flow graph, and the result is combined with the control flow information between basic blocks so that the subsequent multi-layer graph attention encoding layer can extract structural features.
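The average pooling that turns instruction embeddings into a basic block embedding is simple enough to write directly. The sketch below uses plain Python lists; the function name `mean_pool` is an assumption for illustration.

```python
def mean_pool(instruction_vectors):
    """Average a non-empty list of d-dimensional vectors into one d-dimensional vector."""
    n = len(instruction_vectors)
    d = len(instruction_vectors[0])
    return [sum(vec[k] for vec in instruction_vectors) / n for k in range(d)]


if __name__ == "__main__":
    # Toy basic block with two instructions embedded in d = 2 dimensions.
    block = [[1.0, 2.0], [3.0, 4.0]]
    print(mean_pool(block))
```

Average pooling keeps the block embedding at dimension $d$ regardless of how many instructions the block contains.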
II. Multi-layer graph attention encoding layer
The purpose of the multi-layer graph attention encoding layer is to use the graph attention neural network to extract the structural information between basic blocks from the program control flow graph, and to make the model pay more attention to important basic blocks containing rich features through the attention mechanism.
Assume that the program control flow graph is represented as $G = (V, E)$, where $V$ is the set of basic block nodes and $E$ is the set of control flow edges. For each basic block node in the graph, the information of neighboring nodes is aggregated through message passing to learn the structural information of the node context. During message passing, the attention weights between adjacent nodes are calculated to adaptively adjust the attention the model pays to each basic block and reduce the impact of noise nodes. For any two adjacent basic blocks $v_i$ and $v_j$ with features $h_i$ and $h_j$, a learnable linear transformation $W$ is first used to project the features of both basic blocks; the formula is as follows:
$$z_i = W h_i, \quad z_j = W h_j \quad (7)$$
To calculate the attention weights between $v_i$ and $v_j$, we use LeakyReLU as the activation function. LeakyReLU is an improved activation function designed to address the ‘dying ReLU’ problem. Unlike the standard ReLU, which outputs 0 for all negative inputs (potentially causing neurons to stop learning), LeakyReLU introduces a small slope for negative values. This modification allows gradients to flow through negative inputs, ensuring that parameters can continue to update during training. The formula is as follows:
$$e_{ij} = \mathrm{LeakyReLU}\left( a^{T} \left[ W h_i \,\Vert\, W h_j \right] \right) \quad (8)$$
Here $a$ is a learnable attention vector, “$\Vert$” represents the vector concatenation operation, and $\mathrm{LeakyReLU}$ is the Leaky ReLU activation function. The purpose of this operation is to calculate the attention of the basic block $v_i$ to its neighbor basic block $v_j$, that is, the degree of influence between basic blocks. In order to ensure that the weights of the different neighbors of the basic block $v_i$ sum to 1, the attention weights of all its neighbor basic blocks $v_j$ are then normalized by Softmax; the formula is as follows:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})} \quad (9)$$
Here $\mathcal{N}(i)$ represents the set of all neighbor nodes of the basic block $v_i$. Then, the features of the neighbor nodes are aggregated by attention-weighted summation:
$$h_i' = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j \right) \quad (10)$$
Here $\sigma$ is a nonlinear activation function (this paper uses ReLU), and $\alpha_{ij}$ is the influence weight of the neighbor node $v_j$ on the node $v_i$. Through the entire message passing process and attention mechanism, the features of each basic block not only contain its own information, but also integrate the information of its neighbor nodes. The attention of each basic block to its neighbor nodes is adjusted adaptively, which reduces the influence of noise nodes and enhances the model’s understanding of the contextual structure information between basic blocks.
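One single-head attention step (projection, LeakyReLU scoring, Softmax normalization and ReLU-activated weighted aggregation) can be sketched in plain Python as follows. The data layout (dictionaries keyed by node id), the LeakyReLU slope of 0.2 and the fallback for nodes without neighbors are assumptions of this example, not details taken from the text.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention step.

    features: {node: vector}, neighbors: {node: [neighbor nodes]},
    W: projection matrix, a: attention vector of length 2 * len(W).
    Returns (new_features, attention) with attention[i][j] = alpha_ij.
    """
    z = {v: matvec(W, h) for v, h in features.items()}
    out, attention = {}, {}
    for i in features:
        nbrs = neighbors.get(i, [])
        if not nbrs:
            # Assumed fallback: a node without neighbors keeps its own projection.
            out[i], attention[i] = [max(0.0, x) for x in z[i]], {}
            continue
        # Score each neighbor on the concatenation [z_i || z_j].
        scores = [leaky_relu(sum(w * x for w, x in zip(a, z[i] + z[j]))) for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        alphas = [e / sum(exps) for e in exps]
        agg = [sum(al * z[j][k] for al, j in zip(alphas, nbrs)) for k in range(len(z[i]))]
        out[i] = [max(0.0, x) for x in agg]  # ReLU as the output activation
        attention[i] = dict(zip(nbrs, alphas))
    return out, attention


if __name__ == "__main__":
    feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
    nbrs = {0: [1, 2], 1: [2], 2: []}
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gat_layer(feats, nbrs, identity, a=[0.0, 0.0, 0.0, 0.0]))
```

With an all-zero attention vector every neighbor receives the same weight, so the layer reduces to plain neighbor averaging; learned values of `a` are what let the model down-weight noisy basic blocks.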
In order to improve the expressiveness of the model and enhance the stability of feature extraction, a $K$-head attention mechanism is introduced. The formula is as follows:
$$h_i' = \Big\Vert_{k=1}^{K} \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k} W^{k} h_j \right) \quad (11)$$
Here $K$ represents the number of attention heads, each head independently calculates the attention weight $\alpha_{ij}^{k}$, $W^{k}$ is the parameter matrix of the $k$-th attention head and “$\Vert$” represents the concatenation of the outputs of the attention heads. Multi-head attention not only increases the stability of the model and avoids the information loss caused by a single attention distribution, but also captures a variety of feature patterns, so that different attention heads can focus on different types of basic block relationships. We use GATLayer to aggregate information from multiple levels. GATLayer is a core component of graph attention networks (GATs) and is designed specifically for graph-structured data. It calculates node embeddings by dynamically assigning attention weights to neighboring nodes, enabling the model to focus on the more relevant neighbors.
After $L$ layers of forward propagation, the final embedding vector of each basic block is aggregated from multiple levels of information:
$$H^{(l+1)} = \mathrm{GATLayer}\left( H^{(l)}, A \right), \quad l = 0, 1, \ldots, L-1 \quad (12)$$
Here $H^{(l)}$ represents the node feature matrix of the $l$-th layer, $A$ is the adjacency matrix of the entire program control flow graph, and $\mathrm{GATLayer}(\cdot)$ represents the calculation process of one GAT layer. By stacking multiple GAT layers, the information of a basic block can be propagated over a longer range on the control flow graph: with only one GAT layer, a basic block can only receive information from its direct neighbors; with two GAT layers, it can also receive information from its two-hop neighbors; with deeper stacks, the embedding vector of the basic block can integrate the context information of the entire program control flow graph, thereby gradually generating an embedding vector containing the structural information of the entire graph.
After completing the calculation of multiple GAT layers, the structural feature vector $h_i^{(L)}$ of each basic block is finally obtained. However, because the final similarity detection operates at the binary function level, that is, at the level of the entire graph, it is still necessary to integrate the information of the entire control flow graph into a global embedding vector. A multi-layer perceptron (MLP) is used for this purpose. An MLP is a fully connected neural network consisting of multiple layers of neurons; it maps input features to outputs through a series of linear transformations and nonlinear activations. This paper uses state concatenation and a fully connected layer for feature aggregation. The specific implementation method is as follows:
$$h_G = \mathrm{MLP}\left( h_{start} \,\Vert\, h_{end} \right) \quad (13)$$
Here $h_{start}$ is the feature of the starting basic block, $h_{end}$ is the feature of the final basic block and MLP represents a small neural network. This paper uses a fully connected layer combined with a summation method to integrate global information and obtain the embedding vector of the entire program control flow graph. The aggregation operation strengthens the global information, ensuring that the embedding vector of the binary function contains the features of the entire control flow graph, and also helps to prevent information loss. By connecting the starting and final basic block information, the structural information of the control flow graph is represented more completely.
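The state concatenation followed by a fully connected layer can be illustrated with the sketch below. It assumes the simplest reading of the description: the features of the starting and final basic blocks are concatenated and passed through one linear layer with weights `W` and bias `b`. The function name `readout` and the toy parameters are hypothetical, and a real MLP would typically add further layers and nonlinearities.

```python
def readout(h_start, h_end, W, b):
    """Concatenate two node features and apply one fully connected layer."""
    x = h_start + h_end  # list concatenation: [h_start || h_end]
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


if __name__ == "__main__":
    h_start, h_end = [1.0, 2.0], [3.0, 4.0]
    # Toy 2 x 4 weight matrix: picks the first and the last input component.
    W = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
    b = [0.5, 0.0]
    print(readout(h_start, h_end, W, b))
```

The output dimension is fixed by the number of rows of `W`, so functions with different numbers of basic blocks still yield embeddings of the same size.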
3.2.4. Similarity Score Calculation Module
After obtaining the function-level embedding vectors under different architectures, this module first performs a distance measurement on the two embedding vectors. Specifically, the Euclidean distance is used to measure the difference between the two vectors. Let the two function-embedding vectors from architectures $A$ and $B$ be $h^{A}$ and $h^{B}$, with dimension $N$; their Euclidean distance $d$ is defined as follows.
$$d = \sqrt{ \sum_{i=1}^{N} \left( h_i^{A} - h_i^{B} \right)^2 } \quad (14)$$
The Euclidean distance is essentially an unbounded measure of difference, and using it directly fails to reflect similarity intuitively. Therefore, an exponential mapping is employed to transform it into the [0,1] interval. During the design process, we explored various mapping methods. Linear mapping relies on normalization using maximum and minimum values, and because the Euclidean distance has no upper bound, this may lead to training instability. Functions such as sigmoid are bounded, but they tend to saturate in large-distance regions, causing the similarity of distant samples to approach 0, which weakens the model’s detection capability. Exponential mapping is monotonically decreasing, has a value range restricted to the [0,1] interval and is sensitive to small-distance differences. It can maintain numerical stability while enhancing the distinguishability between highly similar samples, better aligning with the intuitive meaning of a similarity measure. This paper therefore selects the exponential mapping to convert the distance metric into a similarity score between 0 and 1. The specific conversion is shown in Equation (15).
$$s = e^{-d} \quad (15)$$
Here $s$ is the final similarity score. The smaller the distance, the closer $s$ is to 1, indicating that the two binary functions are semantically consistent; conversely, the larger the distance, the closer $s$ is to 0, indicating a greater semantic difference.
After obtaining the similarity score $s$, the detection result is determined by comparing it with a preset threshold $\tau$. If $s$ is higher than $\tau$, the two binary functions are considered semantically consistent; otherwise, they are considered inconsistent. The threshold $\tau$ is determined by experimental parameter tuning to achieve the best detection effect.
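The scoring and decision step can be sketched as follows. The mapping is assumed here to be the plain negative exponential of the Euclidean distance, which matches the properties described above (monotonically decreasing, equal to 1 at distance 0, approaching 0 for large distances); the default threshold of 0.5 is a placeholder, since the text states only that the threshold is tuned experimentally.

```python
import math


def similarity(u, v):
    """Map the Euclidean distance between two embeddings to a score in (0, 1]."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return math.exp(-d)  # assumed form of the exponential mapping


def is_similar(u, v, threshold=0.5):
    """Decide similarity; the threshold value is a hypothetical placeholder."""
    return similarity(u, v) > threshold


if __name__ == "__main__":
    func_x86 = [0.20, 0.70, 0.10]  # toy function embeddings
    func_arm = [0.25, 0.65, 0.10]
    print(similarity(func_x86, func_arm), is_similar(func_x86, func_arm))
```

Under this assumed mapping, a threshold on $s$ is equivalent to a threshold of $-\ln \tau$ on the distance itself, so the mapping changes the scale of the score rather than the ranking of function pairs.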
Through the above steps, the similarity score-calculation module converts the function-level embedding vectors into a quantitative similarity score and uses it to determine whether two binary functions from different architectures are similar. This module not only provides an intuitive scoring basis for the detection results, but also plays the key decision-making role in the overall model, which helps to improve the accuracy of cross-architecture detection.