TTDAT: Two-Step Training Dual Attention Transformer for Malware Classification Based on API Call Sequences

Peng Wang; Tongcan Lin; Di Wu; Jiacheng Zhu; Junfeng Wang

doi:10.3390/app14010092


¹ School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
² College of Computer Science, Sichuan University, Chengdu 610065, China
³ College of Software Engineering, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(1), 92; https://doi.org/10.3390/app14010092

This article belongs to the Section Computing and Artificial Intelligence


Abstract

The surge in malware threats propelled by the rapid evolution of the internet and smart device technology necessitates effective automatic malware classification for robust system security. While existing research has primarily relied on some feature extraction techniques, issues such as information loss and computational overhead persist, especially in instruction-level tracking. To address these issues, this paper focuses on the nuanced analysis of API (Application Programming Interface) call sequences between the malware and system and introduces TTDAT (Two-step Training Dual Attention Transformer) for malware classification. TTDAT utilizes Transformer architecture with original multi-head attention and an integrated local attention module, streamlining the encoding of API sequences and extracting both global and local patterns. To expedite detection, we introduce a two-step training strategy: ensemble Transformer models to generate class representation vectors, thereby bolstering efficiency and adaptability. Our extensive experiments demonstrate TTDAT’s effectiveness, showcasing state-of-the-art results with an average F1 score of 0.90 and an accuracy of 0.96.

Keywords:

two-step training; dual attention; Transformer; malware classification; API call sequences

1. Introduction

Malware, or malicious software, is crafted to infiltrate computers and mobile devices, aiming to manipulate authoritative systems, gather sensitive information, display unwanted ads, or extort users [1,2]. The surge in smart devices like laptops and phones has greatly expanded the threat landscape, jeopardizing user security and system integrity [3,4]. Malware classification assigns specific labels to identify its family, which is a crucial step in addressing security challenges [5].

Malware classification approaches can be divided by technique into signature-based, machine learning-based, and deep learning-based methods, or by feature source into static analysis and dynamic analysis. Signature-based approaches may encounter challenges when dealing with the rapid evolution of malware [6]. In response, traditional machine learning methods, including Support Vector Machines (SVM), Random Forests (RF), and Naïve Bayes (NB), have been utilized for malware detection and classification [7,8]. However, these approaches require manual feature extraction that relies on expert knowledge, which adds complexity to the process.

Contemporary malware classification methods leverage malware features, encompassing both static and dynamic attributes, to build machine learning or deep learning models. Static analysis extracts features such as hex values and opcodes [9] from malware binary executable files through reverse engineering and examination of the original binary code. While static analysis is efficient, it is susceptible to evasion and obfuscation techniques. In contrast, dynamic analysis techniques capture malware behaviors, including file access, API (Application Programming Interface) calls, data flow, and other behavior traces, by executing and monitoring malware within a virtual sandbox. Dynamic analysis offers a more accurate representation of malware’s actual objectives and actions, resulting in lower false-positive rates and higher accuracy [10,11]. Building on deep learning’s strength in image representation, many studies treat malware as an image by converting the features from static and dynamic analysis into a matrix [12,13].

Despite the success of feature analysis and deep learning, especially in image representation, we posit that API call sequences can be regarded as a form of language through which programs establish communication with operating systems, analogous to how individuals employ languages for interpersonal interaction, which can better reflect the nature of the malware. Tran et al. point out that every type of malware has its own specific API call patterns or unique order of API calls [14]. In contrast to dynamic instruction features, the extraction of API call features necessitates only a coarse-grained dynamic analysis. Consequently, this approach incurs a relatively modest computational cost, rendering it highly effective for a broad spectrum of software codes.

In this work, we propose a Two-step Training Dual Attention Transformer (TTDAT) using API call sequences for malware classification that takes the API call sequence as input and outputs its corresponding category. We employ local attention with an original encoder to form a dual Transformer to capture global and local information. To facilitate the efficient expansion of new categories of malicious code with minimal computational overhead and without the necessity of retraining the entire model, we design a two-step training strategy and transform the multi-classification problem into a classification mapping problem. During the training phase, the matching model is tasked with learning a scoring mechanism that quantifies the likelihood of an API call sequence belonging to a particular category. Subsequently, during the inference stage, we select the category with the highest score from the pool of candidate categories. In step 2, to mitigate the influence of imprecise annotations and enhance the inference speed, we additionally conduct supplementary training to generate a normalized vector representing each category. Experimental results demonstrate the overall effectiveness of the TTDAT method. To summarize, the main contributions of this work are listed below:

We present a lightweight local attention mechanism as a complementary component to the multi-head attention in the Transformer, and propose a new encoder to model the short-term relationships within API call sequences;
We provide a two-step training method that improves accuracy. Unlike adding cumbersome components that require large computational resources, the training method adds no computational cost at inference time;
Extensive experimental results show that the proposed method outperforms state-of-the-art malware classifiers on two datasets, and we carry out an ablation study to demonstrate the effectiveness of our module and two-step training strategy.

2. Related Work

This section presents a concise overview of NLP- and API-sequence-based malware detection and classification methodologies, as well as an exploration of Transformer-based approaches and training strategies to delineate the foundation of our research and highlight the distinctions therein.

2.1. Deep Learning-Based or API-Call-Related Malware Classification

There is a line of work focused on building malware classification systems based on extracted features. Nagano et al. [15] have proposed an innovative static analysis approach, integrating Natural Language Processing (NLP) with machine learning classifiers to discriminate between malicious and benign software. Their methodology entails the utilization of a PV-DBOW model for the extraction of features from diverse sources, including DLL imports, assembly code, and hex dumps, all derived from static analysis. Subsequently, these extracted features, or vectors, are input into Support Vector Machines (SVM) and k-nearest neighbor (KNN) classifiers for predictive inference. Another study proposed by Tran et al. [14] used NLP techniques such as N-gram, Doc2Vec (or paragraph vectors), and TF-IDF to convert API call sequences to numeric vectors before feeding them to the classifiers, including SVM, KNN, MLP, and RF. Schofield [16] also uses N-gram and TF-IDF to encode the API call sequences and employs a CNN to classify, which utilizes the ability of image representation. Chandrasekar Ravi et al. [17] employ a third-order Markov chain to model the Windows API call sequences. Nakazato J et al. [18] classify malware into some clusters using characteristics of the behavior, which are derived from Windows API calls in parallel threads with N-gram and TF-IDF.

Deep learning-based methodologies have exhibited remarkable potential for delivering more efficacious and adaptable features, yielding superior outcomes in malware classification. Kolosnjaji et al. [19] pioneered the application of convolutional and recurrent network layers for the extraction of features from comprehensive API sequences. Their pioneering work underscores the substantial accomplishments attained through the integration of deep learning techniques within API-sequence-based malware classification. In the same way, C Li’s work [20] also demonstrates the RNN’s ability to classify the API call sequences alone. In a subsequent development, Li et al. [21] have further refined the network architecture, introducing the extraction of inherent features from API sequences. Especially, their approach incorporates embedding layers to represent API phrases and semantic chains, along with the utilization of Bidirectional Long Short-Term Memory (Bi-LSTM) units to capture interrelationships among APIs. The results of their endeavors demonstrate significant performance enhancements when compared to baseline methodologies, highlighting the efficacy of introducing additional intrinsic features associated with APIs. Some works consider the similarity among the features, especially API call sequences, and employ similarity to do the encoder, followed by some advanced models such as GNN [22], Random Forest, LSTM [23], and F-RCNN [24].

2.2. Transformer Models and Local Attention

Transformer is the first sequence transduction model that relies entirely on the attention mechanism. Unlike RNN [25] and LSTM [26], Transformer [27] uses multi-headed self-attention instead of recurrent layers in its encoder-decoder architecture. Without recurrent layers, the Transformer avoids the vanishing and exploding gradient risks of recurrent networks, and it can process the entire sequence at once and learn the relationships between API calls. The Transformer Encoder–Decoder model takes less time to train than the LSTM model, and it is more stable [28]. MalBERT [29] first utilizes the pre-trained Transformer to process and detect malware, and experiments demonstrate that the BERT-based model can achieve high accuracy for malware classification.

Transformer architecture delivers a good design of attention mechanisms; some work employs another attention module to capture the information. Yang [30] proposes to capture features from binary files using stacked CNNs and assembly files via triangular attention and then fuse all features via cross-attention. Their experimental results show that the method can extract both global and local features to improve the detection of malware variants effectively. Moreover, the local attention mechanism is very popular and effective in processing local features. Ma [31] points out that the mutual result of both global and local attention is useful to capture semantics and generate the most informative and discriminative features for text classification. Inspired by the success of local attention in text classification, this paper employs local attention as a complement to global attention to process short-term information in the classification of malware API call sequences.

2.3. Training Strategies

Generally, benefiting from sufficient data, convolutional networks are always trained offline. Thus, researchers favor taking advantage of and developing better training methods that can not only promote the performance of the model but also have no inference cost increase. Inspired by [32], we call this kind of method a “bag of freebies”. Strategies like data augmentation [33], hard negative example mining [34], online hard example mining [35], two-stage object detectors, and objective function designing [36], to name a few, are commonly used in computer vision and natural language processing (NLP).

In malware classification, Hwang [37] designs a two-stage detection method to protect the victims by employing random forest to control false negative error rates in the second stage under low false positive rates delivered by the first stage using the Markov chain model. Baek [38] employs static analysis and dynamic analysis in different stages; static analysis in the first stage is used to classify malware and benign files. After that, they further employ dynamic analysis in the second stage to classify malware from the benign files in stage one to lower the false detection rate and reduce the malware misclassification in stage one. The results show that a two-stage scheme can perform better than a single static analysis or dynamic analysis. Although these strategies can better improve the detection rate, current research lacks consideration of the representation of malware and detection speed performance. Motivated by this situation, we propose a two-step training method and apply it to our model.

2.4. Related-Work Summary and Comparison

From Table 1, we can see that the existing works focus on feature encoding and the construction of classification models for API sequences or other features of malware. For feature encoding, the techniques of NLP have been continuously utilized, from the N-gram model at the beginning to embedding to the latest Transformer architecture. Similarity metrics-based models have been used to encode and characterize features, which can be fully exploited by leveraging the capabilities of CNN, RNN, and GNN models in deep learning. Despite their success, they ignored local information about the API sequence and were computationally heavy; we added a local attention module to the Transformer for better results. Without extracting API sequence characteristics, we can retain more information about the association between malware and its API sequence.

Table 1. Summary and comparison of related works.

At the same time, there are some research works focusing on two-stage phases to further optimize the classification effect of the first stage through the second stage, thus improving the overall performance. In our work, we believe that the Transformer architecture has further room for improvement in efficiency, so we want to speed up the characterization process by saving the category vectors.

3. Methodology

This section presents our Two-step Training Dual Attention Transformer (TTDAT). Section 3.1 describes the general process and applied design principles. Section 3.2 describes the proposed network, including the dual attention Transformer encoder and its local attention operation. Section 3.3 describes the two-step training strategy and illustrates the model updates in different steps.

3.1. Overview and Design Principles

Figure 1 illustrates the overall process of our two-step methodology incorporating a dual attention Transformer. The approach takes API calls as its input and produces predictions for the respective categories. In the first step, a multi-head attention mechanism and local attention are employed within multiple dual attention Transformer encoder layers to capture and represent samples as vectors. Following activation, the model yields probabilities indicating the likelihood of each sample belonging to specific categories for predictive purposes. In the second step, the model treats API call classification as a binary classification task to train individual models, leveraging the pre-trained model from the initial step. Subsequently, the methodology stores the final weights of these models as Normal Vectors, serving as representations for the respective classes and facilitating future predictions. This two-step strategy enables us to accomplish the malware classification task effectively, with each step addressing a distinct optimization objective.

Figure 1. Overview of our two-step methods.

To meet the security design requirements, the method applies several principles [39], including economy of mechanism, open design, and input validation. We keep the overall architecture consistent with the Transformer and introduce only the local attention module, avoiding the complexity caused by excessive modifications. We then design a two-step process that optimizes different objectives independently, thus maintaining the clarity and openness of the algorithm. Because we make assumptions about the input, validation is required, including checking whether each API call is a legal call from the system library and validating the API pair input to the model. In addition, privilege validation and fail-safe default design principles need to be considered when the algorithm becomes part of a secure detection system in the future.

3.2. Dual Attention Transformer Encoder

The original encoder in Transformer, as shown in Figure 2a, is actually a stack of multi-head attention modules and feed-forward modules that are used for long-term relationship modeling and feature extraction, respectively. Inevitably, modeling long-term relationships between API call sequences requires attending to all API call sequences, thus somewhat suppressing the expression of short-term dependencies. Especially in the malware classification area, long- and short-term relationships matter equally, i.e., some malware can be classified by several distant or only several adjacent API calls. Consequently, we propose to use lightweight local attention and incorporate it within a dual attention Transformer encoder, which is illustrated in Figure 2b.

Figure 2. (a) Transformer encoder and (b) our dual attention Transformer encoder.

Unlike global attention, i.e., the multi-head attention in the Transformer, local attention attempts to learn the context information in a sliding window of size $2D + 1$. Given the embedded API call sequence matrix $X \in R^{N \times M}$, where $N$ denotes the number of API calls and $M$ denotes the feature size of a single API call $x \in R^{1 \times M}$, the local attention aggregates the context information $c_{t}$ of the current API call $x_{t}$ using Equation (1).

$$c_{t} = \sum_{i = t - D}^{t + D} \alpha_{t i} x_{i} \tag{1}$$

where $\alpha_{t i}$ denotes the weight of $x_{i}$ with respect to $x_{t}$ and can be expressed as Equation (2).

$$\alpha_{t i} = \frac{\exp(score(x_{t}, x_{i}))}{\sum_{m = t - D}^{t + D} \exp(score(x_{t}, x_{m}))} \tag{2}$$

where $score(\cdot, \cdot)$ is the sum of the inner product of the variables in this paper. Finally, the context information $c_{t}$ is regarded as the local attention feature and replaces $x_{t}$.
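Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `local_attention` is hypothetical, the score is taken to be a plain inner product, and window positions that fall outside the sequence are simply skipped, which is an assumption because the boundary handling is not stated in the text.

```python
import numpy as np

def local_attention(X, D):
    """Replace each row x_t of X by its local context c_t (Equations (1)-(2)).

    X: (N, M) array of embedded API calls.
    D: half-width of the sliding window, so the window covers 2D + 1 rows.
    Assumption: positions outside [0, N) are dropped from the window.
    """
    N = X.shape[0]
    C = np.empty_like(X, dtype=float)
    for t in range(N):
        lo, hi = max(0, t - D), min(N, t + D + 1)
        window = X[lo:hi]                 # neighbours x_i with |i - t| <= D
        scores = window @ X[t]            # score(x_t, x_i) as an inner product
        scores = scores - scores.max()    # shift for a numerically stable softmax
        alpha = np.exp(scores) / np.exp(scores).sum()   # Equation (2)
        C[t] = alpha @ window             # Equation (1)
    return C

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))               # toy input: N = 8 calls, M = 4 features
C = local_attention(X, D=2)
print(C.shape)
```

With `D=0` the window contains only the current call, so the function returns its input unchanged, which is a quick sanity check on the indexing.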

Based on the local attention proposed above, we further propose the new structure as a dual attention Transformer (DAT). Similar to the Transformer encoder, the DAT takes the embedded features as inputs and outputs in a fixed size. What makes the difference is that there are two sub-branches in each encoder layer, which are in charge of the long- and short-term dependence modeling, respectively. The long-term part stays the same as the Transformer encoder, while the short-term part replaces the multi-head attention with the local attention proposed above. Input-embedded features are passed into the two sub-branches, and the successive operations can be formulated as follows:

$$Y_{MA} = \emptyset(\varphi_{MA}(X; \theta_{1}) + X; \theta_{2}), \quad Z_{MA} = \emptyset(f_{FF}(Y_{MA}; \theta_{3}) + Y_{MA}; \theta_{4}) \tag{3}$$

$$Y_{LA} = \emptyset(\varphi_{LA}(X; \theta_{1}^{'}) + X; \theta_{2}^{'}), \quad Z_{LA} = \emptyset(f_{FF}(Y_{LA}; \theta_{3}^{'}) + Y_{LA}; \theta_{4}^{'}) \tag{4}$$

where $X$ denotes the input matrix, and $\varphi_{MA}$ and $\varphi_{LA}$ refer to the multi-head and local attention layers, respectively. $f_{FF}$ is the feed-forward layer, and $\emptyset$ denotes the normalization applied after the residual addition (the Add & Norm layer). The $\theta$s represent the parameters in the (multi-head/local) attention layers, feed-forward layers, and residual and normalization layers. The $Y$s are the outputs of the attention layer followed by the Add & Norm layer, while the $Z$s are the outputs of the feed-forward layer followed by the Add & Norm layer.

After that, outputs from the two sub-branches are concatenated, and then a max-pooling layer is applied, generating the final output of the encoder layer, which can be expressed as follows:

$$\delta = \mathrm{MaxPool}(Z_{MA} \,||\, Z_{LA}) \tag{5}$$

where $||$ denotes the concatenation operation and $\delta$ is the final output of the encoder layer.

Thus, we can stack the DAT (dual attention Transformer) encoder layers to form a powerful feature extractor for malware classification.
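The data flow of Equations (3)-(5) can be sketched as follows. Several simplifications are assumptions made only for illustration: single-head, parameter-free dot-product attention stands in for both attention layers, the normalization has no learned parameters, the feed-forward layer uses a ReLU, and "concatenate then max-pool" is read as stacking the two branch outputs and taking the element-wise maximum so that the output keeps the input shape. All function names are hypothetical.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Parameter-free normalization over the feature axis (assumption).
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(X, D=None):
    """Dot-product self-attention; D=None is global, else limited to |i - t| <= D."""
    n = X.shape[0]
    scores = X @ X.T
    if D is not None:
        idx = np.arange(n)
        in_window = np.abs(idx[:, None] - idx[None, :]) <= D
        scores = np.where(in_window, scores, -np.inf)   # mask distant positions
    return softmax(scores) @ X

def branch(X, W1, W2, D=None):
    Y = layer_norm(attention(X, D) + X)                 # attention + Add & Norm
    Z = layer_norm(np.maximum(Y @ W1, 0.0) @ W2 + Y)    # feed-forward + Add & Norm
    return Z

def dat_layer(X, params_ma, params_la, D=2):
    Z_ma = branch(X, *params_ma)          # long-term branch, Equation (3)
    Z_la = branch(X, *params_la, D=D)     # short-term branch, Equation (4)
    return np.stack([Z_ma, Z_la]).max(axis=0)   # one reading of Equation (5)

rng = np.random.default_rng(0)
N, M, H = 8, 4, 16
X = rng.normal(size=(N, M))
params_ma = (rng.normal(scale=0.1, size=(M, H)), rng.normal(scale=0.1, size=(H, M)))
params_la = (rng.normal(scale=0.1, size=(M, H)), rng.normal(scale=0.1, size=(H, M)))
out = dat_layer(X, params_ma, params_la, D=2)
print(out.shape)
```

Because the output has the same shape as the input under this reading, layers of this kind can be applied one after another, which matches the stacking described above.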

3.3. Two-Step Training

Dedicated training methods can be regarded as a “bag of freebies”: they add training cost only, yet can substantially boost classification accuracy. Our training method is divided into two steps. Training step #1 generates a basic model for malware classification. Most parameters of this model are then fixed in step #2, which focuses on training a Normal Vector for each malware category to improve the class representation and promote detection performance.

3.3.1. Training Step #1

Siamese networks [40] are widely used in deep learning to learn discriminative features and predict feature similarity. In this work, we take advantage of this idea: the architecture employed in training step #1, as shown in Figure 2b, is also a Siamese-like model with a Y shape.

The network in step #1 takes two API call sequences from different categories as inputs. These sequences are sent into two feature extractors that share parameters with each other, as in a Siamese network. Note that each feature extractor consists of 6 DAT encoder layers proposed in Section 3.2, responsible for converting API sequences into a new feature space. After that, subtraction and multiplication are applied to the generated features from both extractors for feature similarity evaluation. The derived features and the original features are concatenated and passed into two linear layers and a softmax layer to obtain the final output probability that the two malware samples are in the same class. It is worth mentioning that the multi-classification task is turned into a binary classification task so that the model can scale to a new malware class.
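The matching head described above (subtract, multiply, concatenate, two linear layers, softmax) can be sketched as below. The weights are random and untrained, the ReLU between the two linear layers and the choice of output index 1 for "same class" are assumptions, and `match_probability` is a hypothetical name; the sketch only shows the shapes and the order of operations.

```python
import numpy as np

def match_probability(u, v, W1, b1, W2, b2):
    """Probability that feature vectors u and v belong to the same class."""
    feats = np.concatenate([u, v, u - v, u * v])   # original + derived features
    h = np.maximum(feats @ W1 + b1, 0.0)           # first linear layer (ReLU assumed)
    logits = h @ W2 + b2                           # second linear layer, 2 outputs
    logits = logits - logits.max()                 # stable softmax
    p = np.exp(logits) / np.exp(logits).sum()
    return float(p[1])                             # index 1 = "same class" (assumed)

rng = np.random.default_rng(0)
d, hidden = 4, 8
W1, b1 = rng.normal(size=(4 * d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 2)), np.zeros(2)
u, v = rng.normal(size=d), rng.normal(size=d)      # stand-ins for extractor outputs
p_same = match_probability(u, v, W1, b1, W2, b2)
print(0.0 <= p_same <= 1.0)
```

Since the head only ever answers "same class or not", adding a new malware family does not change its output dimension, which is the scalability argument made above.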

The inference process is illustrated in Algorithm 1. When an API call sequence $x$ needs to be classified, $N$ sequences from each malware category are selected and sent to the Siamese network with $x$ for similarity prediction. The category with the highest mean output probability is taken as the label of $x$.

Algorithm 1 Siamese network inference

Input: The API call sequence x of the malware to be classified.
Output: The label of x.
Init: classes: all the classes of malware; model: the trained model
function predict(x):
    probability ← []
    for i in classes do:
        probability[i] ← average([model.predict(x, j) for j ← RandSelect(i, N)])
    end for
    return argmax(probability)
end function
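Algorithm 1 translates almost directly into Python. In the sketch below, `predict_siamese`, `samples_by_class`, and `jaccard` are hypothetical names, the API sequences are made-up toy data, and a Jaccard overlap of API names stands in for the trained Siamese model purely so that the example runs on its own.

```python
import random
from statistics import mean

def predict_siamese(x, samples_by_class, pair_score, n, rng=random):
    """Pick the class whose n randomly selected references match x best on average."""
    probability = {}
    for label, samples in samples_by_class.items():
        refs = rng.sample(samples, min(n, len(samples)))   # RandSelect(i, N)
        probability[label] = mean(pair_score(x, r) for r in refs)
    return max(probability, key=probability.get)           # argmax over classes

def jaccard(a, b):
    # Toy stand-in for model.predict(x, j): overlap of the API names used.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

samples_by_class = {
    "family_a": [["CreateFile", "WriteFile"], ["CreateFile", "WriteFile", "CloseHandle"]],
    "family_b": [["connect", "send"], ["socket", "connect", "send"]],
}
x = ["CreateFile", "WriteFile", "CloseHandle"]
label = predict_siamese(x, samples_by_class, jaccard, n=2)
print(label)  # family_a
```

Each call to `predict_siamese` evaluates the pair scorer once per selected reference in every class, which is the cost that training step #2 is designed to remove.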

3.3.2. Training Step #2

Benefitting from our DAT encoder, training step #1 alone can give a satisfactory classification result, but it suffers from two severe drawbacks: (1) The speed of the network can be seriously encumbered by the large number of comparisons. For instance, given a new sample and $M$ categories, it takes $MN$ comparisons to produce the final label, according to Algorithm 1. (2) The number of samples selected from each category at test time is hard to trade off. Fewer samples may introduce noise because the randomly selected samples cannot represent the whole set, while more samples degrade inference speed, as stated in drawback (1).

Inspired by the Prototypical Network [41], we propose training step #2 to solve these issues. Instead of using several random samples, we utilize a Normal Vector to denote the characteristic of each category, which is more time-saving and robust. During training step #2, one of the feature extractors is replaced with a Normal Vector layer. Note that all parameters of the network except the Normal Vector layer are frozen, since what we want from training step #2 is a Normal Vector.

Training step #2 can be illustrated using Algorithm 2. Given the category $i$ for which the Normal Vector $V_{i}$ is to be generated and the trained model, the algorithm first initializes the training dataset for category $i$. Specifically, each sample in the training set is labeled 1 if it belongs to category $i$ and 0 otherwise. During training, every sample $x$ in the training set is sent to the feature extractor, i.e., the dual attention Transformer, to obtain the embedded feature vector $V_{x}$. After that, subtraction and multiplication are applied to $V_{x}$ and the Normal Vector $V_{i}$. As in training step #1, the outputs of subtraction and multiplication, together with $V_{x}$ and $V_{i}$, are concatenated and passed through two linear layers and a softmax activation function to obtain the final prediction. The Normal Vector is then trained to minimize the gap between the prediction and the label. Once the training is finished, the weight in the Normal Vector layer is extracted to serve as the Normal Vector of the category.

Algorithm 2 Normal Vector optimization

Input: The malware category i for which the Normal Vector is to be generated.
Output: The generated Normal Vector V_i.
Init: trainSet: the training set in this work; for a training sample x, x.x denotes its API call sequence and x.y is the corresponding label; model: the trained model whose parameters are fixed except for the NormalVector layer.
function generateNormalVector(i):
    input ← []
    label ← []
    for x in trainSet do:
        input.append(x.x)
        if x.y = i then
            label.append(1)
        else
            label.append(0)
        end if
    end for
    model.fit(input, label)
    return model.getLayer(‘NormalVector’).getWeight()
end function
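The idea behind Algorithm 2, updating only the Normal Vector while every other parameter stays frozen, can be sketched with a deliberately simplified head. The assumptions are substantial and should be read as such: precomputed feature vectors replace the DAT feature extractor, a single frozen linear layer with a sigmoid replaces the two linear layers and softmax, the frozen weights are random rather than taken from step #1, and the gradient is written by hand instead of using a deep learning framework. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def score(F, v, w, b):
    """Frozen head applied to [f, v, f - v, f * v] for every row f of F."""
    V = np.broadcast_to(v, F.shape)
    feats = np.concatenate([F, V, F - V, F * V], axis=1)
    return sigmoid(feats @ w + b)

def generate_normal_vector(F, labels, target, w, b, lr=0.1, steps=200):
    y = (labels == target).astype(float)      # 1 for the target category, else 0
    d = F.shape[1]
    w_v, w_s, w_m = w[d:2 * d], w[2 * d:3 * d], w[3 * d:]
    v = np.zeros(d)                            # the only trainable parameter
    for _ in range(steps):
        p = score(F, v, w, b)
        dz_dv = w_v - w_s + w_m * F            # derivative of the logit w.r.t. v
        v -= lr * ((p - y)[:, None] * dz_dv).mean(axis=0)
    return v

rng = np.random.default_rng(0)
F = np.vstack([rng.normal(1.0, 0.3, size=(20, 4)), rng.normal(-1.0, 0.3, size=(20, 4))])
labels = np.array([0] * 20 + [1] * 20)
w, b = rng.normal(scale=0.5, size=16), 0.0     # frozen, untrained head (assumption)
y0 = (labels == 0).astype(float)
loss_before = bce(score(F, np.zeros(4), w, b), y0)
v0 = generate_normal_vector(F, labels, target=0, w=w, b=b)
loss_after = bce(score(F, v0, w, b), y0)
print(loss_after < loss_before)
```

In a framework such as TensorFlow, the same effect would normally be obtained by marking every layer except the Normal Vector layer as non-trainable and calling the usual fit routine, as Algorithm 2 describes.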

During inference time, the API call sequence is sent to the feature extractor, and the network predicts similarities between the extracted feature and all Normal Vectors. If the highest probability is produced by the Normal Vector $V_{i}$, the new sample is classified into category $i$. This procedure can be formulated using Algorithm 3.

Algorithm 3 Feature extraction and classification

Input: The API call sequence x of the malware to be classified.
Output: The category of x.
Init: classes: all the classes of malware; model: the trained model
function getNormalVector(m):
    if m does not exist in NormalVectorSet then
        NormalVectorSet[m] ← generateNormalVector(m)
    end if
    return NormalVectorSet[m]
end function
function predict(x):
    probability ← []
    for i in classes do:
        probability[i] ← model.predict(x, getNormalVector(i))
    end for
    return argmax(probability)
end function
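A compact Python rendering of Algorithm 3 makes the caching behaviour explicit. The names `make_predictor`, `toy_generate`, and `toy_score` are hypothetical, the two-dimensional vectors are made-up data, and an exponential of the negative Euclidean distance stands in for the trained model's similarity output.

```python
import math

def make_predictor(classes, generate_normal_vector, score):
    normal_vectors = {}                       # NormalVectorSet in Algorithm 3

    def get_normal_vector(label):
        if label not in normal_vectors:       # generate once, then reuse
            normal_vectors[label] = generate_normal_vector(label)
        return normal_vectors[label]

    def predict(x):
        probability = {c: score(x, get_normal_vector(c)) for c in classes}
        return max(probability, key=probability.get)

    return predict

calls = []                                    # records how often vectors are generated
toy_vectors = {"family_a": (1.0, 0.0), "family_b": (0.0, 1.0)}

def toy_generate(label):
    calls.append(label)
    return toy_vectors[label]

def toy_score(x, v):
    return math.exp(-math.dist(x, v))         # stand-in for model.predict(x, V_i)

predict = make_predictor(list(toy_vectors), toy_generate, toy_score)
first = predict((0.9, 0.1))
second = predict((0.2, 0.8))
print(first, second, len(calls))  # family_a family_b 2
```

Each prediction here costs one score evaluation per category, compared with the $MN$ comparisons that Algorithm 1 needs for $M$ categories and $N$ references per category, and the Normal Vectors are generated only on first use.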

4. Experiments and Discussion

4.1. Dataset and Implementation Details

We implement our network with TensorFlow. Model training and testing are performed on Ubuntu 18.04 with an Intel Xeon Platinum 8255C with eight cores and an NVIDIA Tesla T4 with 16 GB of memory. Moreover, the network was trained by an adaptive moment estimation (Adam) solver with mini-batch stochastic gradient descent.

We evaluated our model’s performance on two datasets. The first dataset, used for training and testing, is provided by [42] and includes the categories, hashes, and API call sequences of malware. This dataset was built from malware samples randomly selected from the Malicia project and VirusTotal, and it was shared online. We chose it for its rich variety of sample categories and the high quality of its samples and labels, and because it has been widely used and recognized by the academic community. To explore the robustness of the model, we formed the second dataset from malware that our lab collected from online resources, running the Cuckoo sandbox to collect dynamic analysis results. We pre-process the suffixes of the called APIs as in [43]. Details about the two datasets are given in Table 2 and Table 3.

Table 2. Dataset One description.

Table 3. Dataset Two description.

4.2. Comparison with Previous Methods

The comparison on Dataset One between our method and previous studies is given in Table 4 and Table 5. In Table 4, we chose two different kinds of methods to report the results. The first five methods are classic methods [14,44,45,46,47] for malware family classification, and we report the results from their papers. The following five methods [16,20,21,23,48] are the latest effective works on classification based on API calls, so we reproduce them to offer a convincing comparison. The method of [21] adopts a two-way feature extraction architecture for API calls, but its core module is a multi-layer CNN, and the correlation analysis is performed through Bi-LSTM. Our architecture is unified around the Transformer and attention mechanisms, so comparing with this method reflects the role of the backbone network. The study [48] further adopted a pre-training mechanism and integrated multiple Transformer architecture models through Random Forest, which has similarities with our backbone network. Moreover, its Random Forest integration step resembles our Step 2, which we instead implement through the Normal Vector. Comparing with this work therefore reflects the role of our mechanisms and strategies. We chose these two works as the latest and most effective API call classification models, which also allows a comparison against models similar in detail to ours. In Table 5, we also compare these five methods with a new method [49], since we can reproduce this work with our Cuckoo analysis results.

Table 4. Comparison with previous methods on Dataset One.

Table 5. Comparison with previous methods on Dataset Two.

As Table 4 shows, the first five methods give a baseline for classification tasks, and models based on malware images or traditional NLP methods achieve an accuracy of around 0.90. The model of [21] achieved an accuracy of 0.90, the model of [48] achieved an accuracy of 0.93, and our model achieved an accuracy of 0.96. From a baseline perspective, all three models surpass the other basic or advanced methods, demonstrating the effectiveness of API calls and Encoder/Transformer architecture models. From the perspective of SOTA, our model outperforms the two currently best models by 0.06 and 0.03 in accuracy, respectively, indicating the superiority of our model.

In Table 5, we further compared the performance of the latest seven methods on the second dataset to demonstrate the robustness and broad performance of the model. From Table 5, the performance of all three models has decreased on the second dataset, but our proposed method still performs the best.

4.3. Ablation Studies

In this section, we use an incremental method to conduct an ablation study on Dataset One, verifying the effectiveness of every component of our method. The experimental details and results are described in the following subsections.

4.3.1. Ablation Study on Local Attention Mechanism

To verify the effectiveness of our local attention mechanism, we use the simplified networks shown in Figure 3, with the network in Figure 3a as our baseline. The only difference between networks (a) and (b) is that (a) uses the original Transformer encoder, whereas (b) employs our dual attention Transformer encoder (DAT encoder). The networks take the API sequence as input. Features are extracted via six encoder layers, as in the Transformer, and then passed through two linear layers and a softmax activation to output the final probabilities. Moreover, to find the best value of $D$ in the local attention module proposed in Section 3.1, we vary $D$ from 1 to 6 and compare the resulting accuracies.
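To make the role of a window parameter such as $D$ concrete, the following is a minimal pure-Python sketch of one common form of local attention: a banded mask in which position i may attend only to positions j with |i - j| <= D, followed by a masked softmax. This band-mask reading of $D$ is an assumption made for illustration, and the paper's local attention module may be defined differently. The names `local_mask` and `masked_softmax` are hypothetical.

```python
import math

def local_mask(seq_len, d):
    """Return a seq_len x seq_len boolean band mask: True where |i - j| <= d.

    Assumption: D is treated as the half-width of the attention window.
    """
    return [[abs(i - j) <= d for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out positions."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores for the first position of a six-call sequence (invented values).
mask = local_mask(seq_len=6, d=2)
weights = masked_softmax([0.5, 1.0, 0.2, 0.9, 0.1, 0.3], mask[0])
```

With this mask, the first position distributes all of its attention weight over itself and its two right-hand neighbours, while a position in the middle of the sequence sees up to 2D + 1 calls.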

Figure 3. Network structure comparison. (a) The original Transformer network structure; (b) dual attention Transformer encoder.

Table 6 compares the effect and the best setting of our local attention module. The baseline, the original Transformer encoder (global attention only), reaches an accuracy of 0.7719 ± 0.0049. Introducing the DAT encoder (with local attention) improves accuracy notably, with all settings above 0.80. The best performance is achieved at D = 2, with an accuracy of 0.8368 ± 0.0038, an absolute gain of about 6.5 percentage points over the Transformer encoder. At D = 2 the deviation is also smaller than the baseline's (0.0038 versus 0.0049), although this does not hold for every setting (e.g., 0.0084 at D = 4). Even when D is increased beyond 2, accuracy remains above the baseline, demonstrating the effectiveness of the local attention mechanism. These results indicate that incorporating a local attention module is beneficial for this task. We therefore use D = 2 in all subsequent experiments.
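The comparison can be reproduced directly from the numbers in Table 6. The short sketch below picks the best DAT setting, computes its absolute and relative gain over the Transformer encoder, and lists the settings whose deviation is smaller than the baseline's. The dictionary keys and variable names are illustrative only. The values are copied from Table 6.

```python
# (mean accuracy, deviation) copied from Table 6.
table6 = {
    "Transformer": (0.7719, 0.0049),
    "DAT D=1": (0.8073, 0.0047),
    "DAT D=2": (0.8368, 0.0038),
    "DAT D=3": (0.8191, 0.0067),
    "DAT D=4": (0.8075, 0.0084),
    "DAT D=5": (0.8079, 0.0045),
    "DAT D=6": (0.8032, 0.0030),
}

baseline_acc, baseline_dev = table6["Transformer"]
dat_settings = [name for name in table6 if name != "Transformer"]

# Best DAT setting by mean accuracy.
best = max(dat_settings, key=lambda name: table6[name][0])

# Absolute gain (accuracy points) and relative gain (fraction of the baseline).
abs_gain = table6[best][0] - baseline_acc
rel_gain = abs_gain / baseline_acc

# Settings whose deviation is smaller than the baseline's.
tighter = [name for name in dat_settings if table6[name][1] < baseline_dev]
```

Under these numbers the best setting is D = 2, with an absolute gain of 0.0649 in accuracy, which corresponds to a relative gain of about 8.4% over the baseline.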

Table 6. Quantitative comparison of different encoders.

Based on the DAT encoder (D = 2), we test against 10 kinds of malware. As Table 7 shows, the F1-scores vary considerably across categories. The model achieves high F1-scores for Trojan-FakeAV (0.96) and Net-Worm (0.92), but still struggles with P2P-Worm, Trojan-Downloader, and Trojan-Ransom, which have lower F1-scores (0.12, 0.45, and 0.58, respectively). The same pattern appears in the confusion matrix in Figure 4, where two categories are significantly lighter in color.

Table 7. Quantitative results of DAT (D = 2).

Figure 4. The confusion matrix (with data on accuracy) of the results of DAT (D = 2). From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.

4.3.2. Ablation Study on Training Step #1

To investigate the contribution of training step #1, we conduct a test after the training and compare it with the network (b) in Figure 3. Different from network (b), the network used in training step #1 employs a Siamese network to evaluate the similarity between the new sample and $N$ randomly selected samples from the dataset. The value of $N$ is empirically set to 10.

We tested our model using 10 kinds of malware. As Table 8 shows, the accuracy of our model reaches 0.87, which is 0.0332 higher than the network with the DAT encoder (see Table 7), demonstrating the contribution of training step #1. At the same time, Trojan-Ransom’s F1-score improves from 0.58 to 0.72. Despite the higher accuracy, the F1-scores of some categories remain unsatisfactory, such as P2P-Worm (0.20), Misc (0.47), and Trojan-Downloader (0.55), which is unacceptable in practice. Moreover, the F1-scores of Net-Worm and Misc are lower than with the DAT encoder alone (0.84 versus 0.92 and 0.47 versus 0.80), while P2P-Worm improves only from 0.12 to 0.20. The confusion matrix in Figure 5 shows the same picture: the second category, P2P-Worm, is very light and is often predicted as Trojan-Spy, Packed, and other categories. We attribute this gap to the scarcity of data for these kinds of malware.

Table 8. Quantitative results of training step #1.

Figure 5. The confusion matrix (with data on accuracy) of the results of training step #1. From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.

4.3.3. Ablation Study on Training Step #2

After training step #2, we obtain 15 Normal Vectors for 15 kinds of malware. To demonstrate the effectiveness of training step #2 and to investigate whether the Normal Vectors can represent their malware categories, we test the model after this training procedure. Comparing Table 8 and Table 9, the F1-scores of nine of the 10 categories increase (Net-Worm is the exception, dropping from 0.84 to 0.78), and the overall accuracy improves from 0.87 to 0.92, indicating that the Normal Vectors obtained in training step #2 represent the characteristics of these categories better than randomly selected samples. However, the F1-scores of P2P-Worm and Misc are still low (0.29 and 0.55), and these two categories are also relatively light in the confusion matrix in Figure 6. We believe the scarce data in these two categories cannot cover most of their features, leading to unsatisfactory results.
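The per-category change between the two training steps can be checked from the reported values. The sketch below copies the F1-scores from Table 8 and Table 9 and computes the difference for each category. The variable names are illustrative only.

```python
categories = [
    "Worm", "P2P-Worm", "Trojan-Spy", "Net-Worm", "Packed",
    "Trojan-Downloader", "Trojan-PSW", "Trojan-Ransom", "Misc", "Trojan-FakeAV",
]
# F1-scores copied from Table 8 (training step #1) and Table 9 (training step #2).
step1_f1 = [0.81, 0.20, 0.79, 0.84, 0.89, 0.55, 0.93, 0.72, 0.47, 0.98]
step2_f1 = [0.89, 0.29, 0.88, 0.78, 0.93, 0.71, 0.96, 0.85, 0.55, 0.99]

# Change in F1 per category, rounded to the two decimals reported in the tables.
delta = {
    name: round(after - before, 2)
    for name, before, after in zip(categories, step1_f1, step2_f1)
}

improved = [name for name, change in delta.items() if change > 0]
dropped = [name for name, change in delta.items() if change < 0]
largest_gain = max(delta, key=delta.get)
```

On these values, nine categories gain in F1 and Net-Worm is the only one that drops. The largest gain is for Trojan-Downloader (0.55 to 0.71).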

Table 9. Quantitative results of training step #2.

Figure 6. The confusion matrix (with data on accuracy) of the results of training step #2. From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.

4.3.4. Two-Step Inference

Having constructed a two-step training network, we also devise a two-step inference method. As can be seen in Figure 7, P2P-Worm samples are frequently classified as Trojan-Downloader, even though the two belong to entirely different parent categories. We believe this is caused by the small amount of P2P-Worm data, which makes its Normal Vector unstable. However, although the data for P2P-Worm are rare, its parent category, Worm, has sufficient samples. Consequently, we run training step #2 again to obtain Normal Vectors for all the parent classes, i.e., Backdoor, Worm, Trojan, and Misc., to make full use of these samples. We thus have Normal Vectors for both the categories and their subcategories.

Figure 7. The confusion matrix (with data on accuracy) of the results of two-step inference. From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.

At inference time, given a new sample $x$, we first compare $x$ with the Normal Vectors of the parent categories and select the parent category $i_{p}$ with the maximum similarity. Then, we compare $x$ with the Normal Vectors of the subcategories in $i_{p}$. Finally, the subcategory $i_{s}$ with the maximum similarity is taken as the label of $x$.
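The procedure above can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes the sample and the Normal Vectors are plain embedding vectors and uses cosine similarity as the similarity measure, which the text does not specify. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def two_step_infer(x, parent_vectors, sub_vectors):
    """Pick the parent category first, then the subcategory inside it.

    parent_vectors: {parent_name: vector}
    sub_vectors:    {parent_name: {subcategory_name: vector}}
    """
    # Step 1: parent category i_p with maximum similarity to x.
    i_p = max(parent_vectors, key=lambda p: cosine(x, parent_vectors[p]))
    # Step 2: subcategory i_s, searched only among the children of i_p.
    candidates = sub_vectors[i_p]
    i_s = max(candidates, key=lambda s: cosine(x, candidates[s]))
    return i_p, i_s

# Toy vectors invented purely for illustration.
parents = {"Worm": [1.0, 0.0], "Trojan": [0.0, 1.0]}
subs = {
    "Worm": {"P2P-Worm": [0.9, 0.3], "Net-Worm": [0.9, -0.3]},
    "Trojan": {"Trojan-Spy": [0.3, 0.9], "Trojan-Downloader": [-0.3, 0.9]},
}
label = two_step_infer([0.8, 0.4], parents, subs)
```

Because the second step searches only within the chosen parent, a sample assigned to Worm in the first step can no longer be labeled Trojan-Downloader, which is the confusion the two-step design is meant to reduce.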

The results of two-step inference are shown in Table 10 and Figure 7. The F1-scores of most categories improve substantially, especially P2P-Worm (from 0.29 to 0.62), and the P2P-Worm cell in the confusion matrix is much darker than in Figure 6, demonstrating the effectiveness of two-step inference.
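As a summary of the ablation, the unweighted mean of the per-category F1-scores can be computed for each stage from Tables 7 to 10. The sketch below assumes that an average F1 is taken as a macro average over the ten listed categories. The stage labels and variable names are illustrative only.

```python
# Per-category F1-scores copied from Tables 7-10, in the order
# Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader,
# Trojan-PSW, Trojan-Ransom, Misc, Trojan-FakeAV.
f1_by_stage = {
    "DAT (D = 2)": [0.78, 0.12, 0.72, 0.92, 0.84, 0.45, 0.82, 0.58, 0.80, 0.96],
    "Training step #1": [0.81, 0.20, 0.79, 0.84, 0.89, 0.55, 0.93, 0.72, 0.47, 0.98],
    "Training step #2": [0.89, 0.29, 0.88, 0.78, 0.93, 0.71, 0.96, 0.85, 0.55, 0.99],
    "Two-step inference": [0.95, 0.62, 0.93, 0.95, 0.99, 0.90, 0.95, 0.91, 0.83, 0.99],
}

# Macro F1: unweighted mean over categories, so rare classes count as much as common ones.
macro_f1 = {
    stage: sum(scores) / len(scores) for stage, scores in f1_by_stage.items()
}

for stage, value in macro_f1.items():
    print(f"{stage}: {value:.3f}")
```

Under this assumption the macro F1 rises from about 0.70 for the DAT encoder to about 0.72, 0.78, and 0.90 across the three later stages, and the last value agrees with the average F1 score of 0.90 reported for the full model.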

Table 10. Quantitative results of two-step inference.

4.4. Findings and Limitations

We evaluate and discuss our method through extensive experiments, covering (1) the performance comparison between the proposed method and various machine learning and deep learning models; (2) the role of the local attention mechanism; and (3) the role of the two-step training strategy. From the experimental results, we can see the following:

(1): Our proposed method achieves an accuracy of 0.9606, whereas most classical methods do not exceed 0.90. In addition, the method is validated on two datasets: although the performance of the models generally degrades on the second dataset, our method still outperforms the others, which suggests that it is effective and achieves state-of-the-art results;
(2): We compare against two recent works based on Encoder/Transformer architectures. Overall, all three methods perform well, which illustrates how well natural language modeling, especially the Transformer architecture, handles API call sequences. We further fuse in a local attention mechanism based on the local feature relationships between API calls, which enhances the feature-capturing ability even further. The ablation experiments examine this more closely: the DAT encoder with the local attention module achieves better results than the original Transformer encoder;
(3): Our ablation analysis for Step 1 verifies that introducing the Siamese network improves performance, raising accuracy from 0.837 to 0.87. In the second step, we use the Normal Vector to simplify the process of malware classification and obtain a more generalizable category representation vector, which raises the accuracy further from 0.87 to 0.92. This suggests that, compared to randomly selected samples, the Normal Vector obtained in the second training step is better at capturing malware category features;
(4): Finally, we adapt to the hierarchical relationship between the samples by performing a two-step inference, classifying the parent category first and then the subcategory. Experiments show that this strategy is effective and improves the accuracy from 0.92 to 0.96. It uses the richer data in the parent category to compensate for the unstable Normal Vectors of subcategories with limited data, alleviating the problem of data scarcity for specific malware types (e.g., P2P worms).

However, our results also reveal areas where the method underperforms. Accordingly, we point out the following limitations and future directions of this work.

(1): The challenge of data scarcity: Despite the efforts to address the data scarcity issue through the training steps, in multiple comparison experiments, relatively high variance in F1-scores can be found across different categories, e.g., P2P Worms and Misc, which still exhibit low F1-scores. The performance of the model is limited by the availability of representative data, and the strategy to deal with rare data needs to be further explored;
(2): Generalization of new samples: The model’s ability to generalize to new, unseen samples is not explicitly discussed currently. Evaluating its performance on brand new malware samples, especially those that do not exist in the training data, is crucial to assessing real-world applicability;
(3): Interpretability issues: The interpretability of the model is not adequately discussed. Understanding how decisions are made, especially in security-related applications, is important for building trust in the model’s predictions.

5. Conclusions

In this work, we treat malware classification as a sequence classification problem, which involves taking a complete API call sequence as input and producing its corresponding category as output. First, we employ the Transformer as the encoder and introduce a local attention module to adapt it to this task. Second, to boost performance and increase adaptability to new categories, we transform the multi-classification challenge into a classification mapping problem and design a two-step training strategy, in which we train a Normal Vector for each category to boost the classification speed and performance of the base model. Third, comprehensive experiments demonstrate that our top-performing model achieves state-of-the-art results, attaining an average F1 score of 0.90 and an accuracy of 0.96. Finally, we conduct extensive ablation studies to show the gain from each module or step. They show that our local attention performs better than the original Transformer encoder and that the Siamese network and the Normal Vector each improve the results step by step. In future work, we will explore further uses of sequence modeling for malware analysis and the ability to detect samples from new categories.

Author Contributions

Conceptualization, P.W. and J.W.; methodology, P.W. and D.W.; validation, T.L. and D.W.; formal analysis, P.W. and T.L.; investigation, T.L. and J.Z.; data curation, P.W.; writing—original draft preparation, J.Z.; writing—review and editing, P.W. and T.L.; visualization, J.Z.; funding acquisition, P.W. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Key R&D projects of the Sichuan Science and technology plan (2022YFG0323) and in part by the Key R&D projects of the Chengdu Science and technology plan (2022-YF05-00451-SN).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Due to restrictions of privacy, the data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Aboaoja, F.A.; Zainal, A.; Ghaleb, F.A.; Al-rimy, B.A.S.; Eisa, T.A.E.; Elnour, A.A.H. Malware Detection Issues, Challenges, and Future Directions: A Survey. Appl. Sci. 2022, 12, 8482.
2. Begovic, K.; Al-Ali, A.; Malluhi, Q. Cryptographic Ransomware Encryption Detection: Survey. Comput. Secur. 2023, 132, 103349.
3. Molloy, C.; Banks, J.; Ding, H.S.; Charland, P.; Walenstein, A.; Li, L. Adversarial Variational Modality Reconstruction and Regularization for Zero-Day Malware Variants Similarity Detection. In Proceedings of the 2022 IEEE International Conference on Data Mining (ICDM), Orlando, FL, USA, 28 November–1 December 2022; pp. 1131–1136.
4. Ling, X.; Wu, L.; Zhang, J.; Qu, Z.; Deng, W.; Chen, X.; Qian, Y.; Wu, C.; Ji, S.; Luo, T. Adversarial Attacks against Windows PE Malware Detection: A Survey of the State-of-the-Art. Comput. Secur. 2023, 128, 103134.
5. Gržinić, T.; González, E.B. Methods for Automatic Malware Analysis and Classification: A Survey. Int. J. Inf. Comput. Secur. 2022, 17, 179–203.
6. Aslan, Ö.A.; Samet, R. A Comprehensive Review on Malware Detection Approaches. IEEE Access 2020, 8, 6249–6271.
7. Muzaffar, A.; Hassen, H.R.; Lones, M.A.; Zantout, H. An In-Depth Review of Machine Learning Based Android Malware Detection. Comput. Secur. 2022, 121, 102833.
8. Firdausi, I.; Erwin, A.; Nugroho, A.S. Analysis of Machine Learning Techniques Used in Behavior-Based Malware Detection. In Proceedings of the 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies, Jakarta, Indonesia, 2–3 December 2010; pp. 201–203.
9. Fuyong, Z.; Tiezhu, Z. Malware Detection and Classification Based on N-Grams Attribute Similarity. In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), Guangzhou, China, 21–24 July 2017; Volume 1, pp. 793–796.
10. Taheri, L.; Kadir, A.F.A.; Lashkari, A.H. Extensible Android Malware Detection and Family Classification Using Network-Flows and API-Calls. In Proceedings of the 2019 International Carnahan Conference on Security Technology (ICCST), Chennai, India, 1–3 October 2019; pp. 1–8.
11. Mu, T.; Chen, H.; Du, J.; Xu, A. An Android Malware Detection Method Using Deep Learning Based on Api Calls. In Proceedings of the 2019 IEEE 3rd Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 11–13 October 2019; pp. 2001–2004.
12. Tran, T.K.; Sato, H.; Kubo, M. Image-Based Unknown Malware Classification with Few-Shot Learning Models. In Proceedings of the 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW), Nagasaki, Japan, 26–29 November 2019; pp. 401–407.
13. Makandar, A.; Patrot, A. Malware Class Recognition Using Image Processing Techniques. In Proceedings of the 2017 International Conference on Data Management, Analytics and Innovation (ICDMAI), Pune, India, 24–26 February 2017; pp. 76–80.
14. Tran, T.K.; Sato, H. NLP-Based Approaches for Malware Classification from API Sequences. In Proceedings of the 2017 21st Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES), Hanoi, Vietnam, 15–17 November 2017; pp. 101–105.
15. Nagano, Y.; Uda, R. Static Analysis with Paragraph Vector for Malware Detection. In Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, Beppu, Japan, 5–7 January 2017; pp. 1–7.
16. Schofield, M.; Alicioglu, G.; Binaco, R.; Turner, P.; Thatcher, C.; Lam, A.; Sun, B. Convolutional Neural Network for Malware Classification Based on API Call Sequence. In Proceedings of the 8th International Conference on Artificial Intelligence and Applications (AIAP 2021), Zurich, Switzerland, 23–24 January 2021; pp. 23–24.
17. Ravi, C.; Manoharan, R. Malware Detection Using Windows Api Sequence and Machine Learning. Int. J. Comput. Appl. 2012, 43, 12–16.
18. Nakazato, J.; Song, J.; Eto, M.; Inoue, D.; Nakao, K. A Novel Malware Clustering Method Using Frequency of Function Call Traces in Parallel Threads. IEICE Trans. Inf. Syst. 2011, 94, 2150–2158.
19. Kolosnjaji, B.; Zarras, A.; Webster, G.; Eckert, C. Deep Learning for Classification of Malware System Call Sequences. In Proceedings of the AI 2016: Advances in Artificial Intelligence: 29th Australasian Joint Conference, Hobart, TAS, Australia, 5–8 December 2016; Proceedings 29. Springer: Berlin/Heidelberg, Germany, 2016; pp. 137–149.
20. Li, C.; Zheng, J. API Call-Based Malware Classification Using Recurrent Neural Networks. J. Cyber Secur. Mobil. 2021, 10, 617–640.
21. Li, C.; Lv, Q.; Li, N.; Wang, Y.; Sun, D.; Qiao, Y. A Novel Deep Framework for Dynamic Malware Detection Based on API Sequence Intrinsic Features. Comput. Secur. 2022, 116, 102686.
22. Li, C.; Cheng, Z.; Zhu, H.; Wang, L.; Lv, Q.; Wang, Y.; Li, N.; Sun, D. DMalNet: Dynamic Malware Analysis Based on API Feature Engineering and Graph Learning. Comput. Secur. 2022, 122, 102872.
23. Daeef, A.Y.; Al-Naji, A.; Chahl, J. Features Engineering for Malware Family Classification Based API Call. Computers 2022, 11, 160.
24. Deore, M.; Kulkarni, U. Mdfrcnn: Malware Detection Using Faster Region Proposals Convolution Neural Network. Int. J. Interact. Multimedia Artif. Intell. 2022, 7, 146–162.
25. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
26. Staudemeyer, R.C.; Morris, E.R. Understanding LSTM—A Tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586.
27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30.
28. Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A Comparison of Transformer and Lstm Encoder Decoder Models for Asr. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 8–15.
29. Rahali, A.; Akhloufi, M.A. MalBERT: Malware Detection Using Bidirectional Encoder Representations from Transformers. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–20 October 2021; pp. 3226–3231.
30. Yang, X.; Yang, D.; Li, Y. A Hybrid Attention Network for Malware Detection Based on Multi-Feature Aligned and Fusion. Electronics 2023, 12, 713.
31. Ma, Q.; Yu, L.; Tian, S.; Chen, E.; Ng, W.W. Global-Local Mutual Attention Model for Text Classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 2127–2139.
32. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
33. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552.
34. Sung, K.-K.; Poggio, T. Example-Based Learning for View-Based Human Face Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 39–51.
35. Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
36. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520.
37. Hwang, J.; Kim, J.; Lee, S.; Kim, K. Two-Stage Ransomware Detection Using Dynamic Analysis and Machine Learning Techniques. Wirel. Pers. Commun. 2020, 112, 2597–2609.
38. Baek, S.; Jeon, J.; Jeong, B.; Jeong, Y.-S. Two-Stage Hybrid Malware Detection Using Deep Learning. Hum. Centric Comput. Inf. Sci. 2021, 11, 10–22967.
39. Ebad, S.A. Exploring How to Apply Secure Software Design Principles. IEEE Access 2022, 10, 128983–128993.
40. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese Neural Networks for One-Shot Image Recognition. In ICML Deep Learning Workshop; University of Toronto: Lille, France, 2015; Volume 2.
41. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-Shot Learning. Adv. Neural Inf. Process. Syst. 2017, 30.
42. Ki, Y.; Kim, E.; Kim, H.K. A Novel Approach to Detect Malware Based on API Call Sequence Analysis. Int. J. Distrib. Sens. Netw. 2015, 11, 659101.
43. Gupta, S.; Sharma, H.; Kaur, S. Malware Characterization Using Windows API Call Sequences. In Proceedings of the Security, Privacy, and Applied Cryptography Engineering: 6th International Conference, SPACE 2016, Hyderabad, India, 14–18 December 2016; Proceedings 6. Springer: Berlin/Heidelberg, Germany, 2016; pp. 271–280.
44. Nataraj, L.; Yegneswaran, V.; Porras, P.; Zhang, J. A Comparative Assessment of Malware Classification Using Binary Texture Analysis and Dynamic Analysis. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, Chicago, IL, USA, 21 October 2011; pp. 21–30.
45. Kim, H.-J. Image-Based Malware Classification Using Convolutional Neural Network. In Advances in Computer Science and Ubiquitous Computing: CSA-CUTE 17; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1352–1357.
46. Agarap, A.F. Towards Building an Intelligent Anti-Malware System: A Deep Learning Approach Using Support Vector Machine (SVM) for Malware Classification. arXiv 2017, arXiv:1801.00318.
47. Qiao, Y.; Yang, Y.; He, J.; Tang, C.; Liu, Z. CBM: Free, Automatic Malware Analysis Framework Using API Call Sequences. In Knowledge Engineering and Management, Proceedings of the Seventh International Conference on Intelligent Systems and Knowledge Engineering, Beijing, China, 15–17 December 2012 (ISKE 2012); Springer: Berlin/Heidelberg, Germany, 2014; pp. 225–236.
48. Demirkıran, F.; Çayır, A.; Ünal, U.; Dağ, H. An Ensemble of Pre-Trained Transformer Models for Imbalanced Multiclass Malware Classification. Comput. Secur. 2022, 121, 102846.
49. Pektaş, A.; Acarman, T. Malware Classification Based on API Calls and Behaviour Analysis. IET Inf. Secur. 2018, 12, 107–117.

Figure 1. Overview of our two-step methods.

Figure 2. (a) Transformer encoder and (b) our dual attention Transformer encoder.

Table 1. Summary and comparison of related works.

Research Paper	Features	Feature Vector and Models
Nagano et al. (2017) [15]	DLL Imports, Assembly Code, and Hex Dumps	PV-DBOW + SVM, KNN
Tran et al. (2017) [14]	API Call Sequences	N-gram, Doc2Vec, TF-IDF + SVM, KNN, MLP and RF
Hwang et al. (2020) [37]	API Call Sequences	Markov Chain + RF
C Li et al. (2021) [20]	API Call Sequences	RNN
Schofield et al. (2021) [16]	API Call Sequences	N-gram, TF-IDF + CNN
Rahali et al. (2021) [29]	API Call Sequences	Transformer
Baek et al. (2021) [38]	Process Memory, API Category, API Calls	Bi-LSTM, EfficientNet-B3
Li et al. (2022) [21]	API Call Sequences	Embedding Layer + Bi-LSTM
Li et al. (2022) [22]	API Call Sequences	Similarity Encoding + GNN
Daeef et al. (2022) [23]	API Call Sequences	Frequency Encoding + RF, LSTM
Deore et al. (2022) [24]	Hex Features, Disassembled File Features	Similarity Statistical + F-RCNN
Yang et al. (2023) [30]	Binary File, Assembly File	Stacked CNN + Regular Attention + Cross Attention

Table 2. Dataset One description.

Category	Subcategory	Ratio (%)
Backdoor		3.37
Worm	Worm	3.32
	Email-Worm	0.55
	Net-Worm	0.79
	P2P-Worm	0.30
Packed		5.57
PUP	Adware	13.63
	Downloader	2.94
	WebToolbar	1.22
Trojan	Trojan (Generic)	29.3
	Trojan-Banker	0.14
	Trojan-Clicker	0.12
	Trojan-Downloader	2.29
	Trojan-Dropper	1.91
	Trojan-FakeAV	18.8
	Trojan-GameThief	0.63
	Trojan-PSW	3.79
	Trojan-Ransom	2.58
	Trojan-Spy	3.12
Misc.		5.52

Table 3. Dataset Two description.

Category	Subcategory	Ratio (%)
Backdoor		27.30
Worm	Email-Worm	1.71
Worm	Net-Worm	1.00
Trojan	Trojan (Generic)	29.3
	Trojan-Banker	1.61
	Trojan-Clicker	1.90
	Trojan-Downloader	18.46
	Trojan-Dropper	7.18
	Trojan-GameThief	18.11
	Trojan-PSW	8.47
	Trojan-Proxy	1.23
	Trojan-Spy	8.40
Virus		1.77
Exploit		1.07
Rootkit		0.50
HackTool		1.28

Table 4. Comparison with previous methods on Dataset One.

Methods	Features	Samples	Families	Accuracy
Malware Image + GIST [44]	File content	63,002	531	0.7280
Malware Image + CNN [45]	File content	10,868	9	0.9176
Malware Image + GRU-SVM [46]	File content	9339	25	0.8492
BBIS + CARL [47]	API calls	3131	28	0.8840 (F1)
NLP(TF-IDF) + SVM [14]	API calls	23,080	10	0.8654
Category Vector + CNN [16]	API calls	23,080	10	0.8797
Frequency Vector + RF [23]	API calls	23,080	10	0.8005
Embedding + RNN [20]	API calls	23,080	10	0.8690
Encoder (Embedding + CNN) + Bi_LSTM [21]	API calls	23,080	10	0.9021
Random Transformer Forest [48]	API calls	23,080	10	0.9330
Ours	API calls	23,080	10	0.9606

Table 5. Comparison with previous methods on Dataset Two.

Methods	Features	Samples	Families	Accuracy
Category Vector + CNN [16]	API calls	33,240	16	0.7782
Frequency Vector + RF [23]	API calls	33,240	16	0.6685
Embedding + RNN [20]	API calls	33,240	16	0.8690
Voting Experts + Confidence Weighted [49]	API calls, Actions	33,240	16	0.8570
Encoder (Embedding + CNN) + Bi_LSTM [21]	API calls	33,240	16	0.8051
Random Transformer Forest [48]	API calls	33,240	16	0.8703
Ours	API calls	33,240	10	0.8859

Table 6. Quantitative comparison of different encoders.

Encoder	Accuracy
Transformer encoder	0.7719 ± 0.0049
DAT encoder with $D = 1$	0.8073 ± 0.0047
DAT encoder with $D = 2$	0.8368 ± 0.0038
DAT encoder with $D = 3$	0.8191 ± 0.0067
DAT encoder with $D = 4$	0.8075 ± 0.0084
DAT encoder with $D = 5$	0.8079 ± 0.0045
DAT encoder with $D = 6$	0.8032 ± 0.0030

Table 7. Quantitative results of DAT (D = 2).

Category	F1-Score	Support	Overall Accuracy
Worm	0.78	81	0.8368
P2P-Worm	0.12	8
Trojan-Spy	0.72	81
Net-Worm	0.92	20
Packed	0.84	145
Trojan-Downloader	0.45	57
Trojan-PSW	0.82	99
Trojan-Ransom	0.58	67
Misc	0.80	20
Trojan-FakeAV	0.96	487

Table 8. Quantitative results of training step #1.

Category	F1-Score	Overall Accuracy
Worm	0.81	0.8700
P2P-Worm	0.20
Trojan-Spy	0.79
Net-Worm	0.84
Packed	0.89
Trojan-Downloader	0.55
Trojan-PSW	0.93
Trojan-Ransom	0.72
Misc	0.47
Trojan-FakeAV	0.98

Table 9. Quantitative results of training step #2.

Category	F1-Score	Overall Accuracy
Worm	0.89	0.92
P2P-Worm	0.29
Trojan-Spy	0.88
Net-Worm	0.78
Packed	0.93
Trojan-Downloader	0.71
Trojan-PSW	0.96
Trojan-Ransom	0.85
Misc	0.55
Trojan-FakeAV	0.99

Table 10. Quantitative results of two-step inference.

Category	F1-Score	Overall Accuracy
Worm	0.95	0.96
P2P-Worm	0.62
Trojan-Spy	0.93
Net-Worm	0.95
Packed	0.99
Trojan-Downloader	0.90
Trojan-PSW	0.95
Trojan-Ransom	0.91
Misc	0.83
Trojan-FakeAV	0.99

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
