TTDAT: Two-Step Training Dual Attention Transformer for Malware Classification Based on API Call Sequences

Abstract: The surge in malware threats propelled by the rapid evolution of the internet and smart device technology necessitates effective automatic malware classification for robust system security. While existing research has primarily relied on some feature extraction techniques, issues such as information loss and computational overhead persist, especially in instruction-level tracking. To address these issues, this paper focuses on the nuanced analysis of API (Application Programming Interface) call sequences between the malware and system and introduces TTDAT (Two-step Training Dual Attention Transformer) for malware classification. TTDAT utilizes Transformer architecture with original multi-head attention and an integrated local attention module, streamlining the encoding of API sequences and extracting both global and local patterns. To expedite detection, we introduce a two-step training strategy: ensemble Transformer models to generate class representation vectors, thereby bolstering efficiency and adaptability. Our extensive experiments demonstrate TTDAT's effectiveness, showcasing state-of-the-art results with an average F1 score of 0.90 and an accuracy of 0.96.


Introduction
Malware, or malicious software, is crafted to infiltrate computers and mobile devices, aiming to manipulate authoritative systems, gather sensitive information, display unwanted ads, or extort users [1,2]. The surge in smart devices like laptops and phones has greatly expanded the threat landscape, jeopardizing user security and system integrity [3,4]. Malware classification assigns specific labels to identify its family, which is a crucial step in addressing security challenges [5].
Malware classification can be divided into signature-based, machine learning-based, and deep learning-based methods in the method view or static analysis and dynamic analysis in the feature view. Signature-based approaches may encounter challenges when dealing with the rapid evolution of malware [6]. In response, traditional machine learning methods, including Support Vector Machines (SVM), Random Forests (RF), and Naïve Bayes (NB), have been utilized for malware detection and classification [7,8]. However, these approaches necessitate the manual extraction of features, relying on expert knowledge, which can introduce complexity to the process.
Contemporary malware classification methods effectively leverage malware features, encompassing both static and dynamic attributes, to build machine learning or deep learning models. Static analysis involves the extraction of features as hex values and opcodes [9] from malware binary executable files through reverse engineering and examination of the original binary code. While static analysis is efficient, it is susceptible to evasion and obfuscation techniques. In contrast, dynamic analysis techniques capture malware behaviors, including file access, API (Application Programming Interface) calls, data flow, and other behavior traces, by executing and monitoring malware within a virtual sandbox. Dynamic analysis offers a more accurate representation of malware's actual objectives and actions, resulting in lower false-positive rates and higher accuracy [10,11]. Combined with deep learning's image representation, many research works would treat the malware as an image by converting the feature from static and dynamic analysis into a matrix [12,13].
Despite the success of feature analysis and deep learning, especially in image representation, we posit that API call sequences can be regarded as a form of language through which programs establish communication with operating systems, analogous to how individuals employ languages for interpersonal interaction, which can better reflect the nature of the malware. Tran et al. point out that every type of malware has its own specific API call patterns or unique order of API calls [14]. In contrast to dynamic instruction features, the extraction of API call features necessitates only a coarse-grained dynamic analysis. Consequently, this approach incurs a relatively modest computational cost, rendering it highly effective for a broad spectrum of software codes.
In this work, we propose a Two-step Training Dual Attention Transformer (TTDAT) for malware classification that takes an API call sequence as input and outputs its corresponding category. We combine local attention with the original encoder to form a dual Transformer that captures both global and local information. To facilitate the efficient addition of new categories of malicious code with minimal computational overhead and without retraining the entire model, we design a two-step training strategy and transform the multi-classification problem into a classification mapping problem. During the training phase, the matching model learns a scoring mechanism that quantifies the likelihood of an API call sequence belonging to a particular category. Subsequently, during the inference stage, we select the category with the highest score from the pool of candidate categories. In step 2, to mitigate the influence of imprecise annotations and enhance the inference speed, we conduct supplementary training to generate a normalized vector representing each category. Experimental results demonstrate the overall effectiveness of the TTDAT method. To summarize, the main contributions of this work are listed below:
• We present a tiny local attention mechanism as a complementary component to the multi-head attention in the Transformer, and propose a new encoder to model the short-term relationships within API call sequences;
• We provide a two-step training method to improve accuracy. Unlike adding cumbersome components that require large computational resources, the training method adds no computational cost at inference time;
• Extensive experimental results show that the proposed method outperforms the state-of-the-art malware classifiers on two datasets, and we carry out an ablation study to demonstrate the effectiveness of our module and two-step training strategy.

Related Work
This section presents a concise overview of NLP- and API-sequence-based malware detection and classification methodologies, as well as an exploration of Transformer-based approaches and training strategies to delineate the foundation of our research and highlight the distinctions therein.

Deep Learning-Based or API-Call-Related Malware Classification
There is a line of work focused on building malware classification systems based on extracted features. Nagano et al. [15] have proposed an innovative static analysis approach, integrating Natural Language Processing (NLP) with machine learning classifiers to discriminate between malicious and benign software. Their methodology entails the utilization of a PV-DBOW model for the extraction of features from diverse sources, including DLL imports, assembly code, and hex dumps, all derived from static analysis. Subsequently, these extracted features, or vectors, are input into Support Vector Machines (SVM) and k-nearest neighbor (KNN) classifiers for predictive inference. Another study by Tran et al. [14] used NLP techniques such as N-gram, Doc2Vec (or paragraph vectors), and TF-IDF to convert API call sequences to numeric vectors before feeding them to classifiers, including SVM, KNN, MLP, and RF. Schofield [16] also uses N-gram and TF-IDF to encode the API call sequences and employs a CNN for classification, which utilizes the ability of image representation. Chandrasekar Ravi et al. [17] employ a third-order Markov chain to model the Windows API call sequences. Nakazato J et al. [18] group malware into clusters using behavioral characteristics, which are derived from Windows API calls in parallel threads with N-gram and TF-IDF.
Deep learning-based methodologies have exhibited remarkable potential for delivering more efficacious and adaptable features, yielding superior outcomes in malware classification. Kolosnjaji et al. [19] pioneered the application of convolutional and recurrent network layers for the extraction of features from comprehensive API sequences. Their work underscores the substantial accomplishments attained through the integration of deep learning techniques within API-sequence-based malware classification. In the same way, C Li's work [20] also demonstrates the RNN's ability to classify the API call sequences alone. In a subsequent development, Li et al. [21] have further refined the network architecture, introducing the extraction of inherent features from API sequences. In particular, their approach incorporates embedding layers to represent API phrases and semantic chains, along with Bidirectional Long Short-Term Memory (Bi-LSTM) units to capture interrelationships among APIs. Their results demonstrate significant performance enhancements when compared to baseline methodologies, highlighting the efficacy of introducing additional intrinsic features associated with APIs. Some works consider the similarity among the features, especially API call sequences, and use similarity for encoding, followed by advanced models such as GNN [22], Random Forest, LSTM [23], and F-RCNN [24].

Transformer Models and Local Attention
The Transformer is the first sequence transduction model that relies entirely on the attention mechanism. Unlike RNN [25] and LSTM [26], the Transformer [27] uses multi-headed self-attention instead of recurrent layers in an encoder-decoder architecture. Thanks to the absence of recurrent layers, the Transformer does not face the risk of vanishing and exploding gradients, and it can process the entire sequence and learn the relationship between API calls. The Transformer encoder-decoder model takes less time to train than the LSTM model, and it is more stable [28]. MalBERT [29] was the first to utilize a pre-trained Transformer to process and detect malware, and experiments demonstrate that the BERT-based model can achieve high accuracy for malware classification.
The Transformer architecture delivers a good design of attention mechanisms, and some work employs an additional attention module to capture further information. Yang [30] proposes to capture features from binary files using stacked CNNs and from assembly files via triangular attention and then fuse all features via cross-attention. Their experimental results show that the method can extract both global and local features to improve the detection of malware variants effectively. Moreover, the local attention mechanism is very popular and effective in processing local features. Ma [31] points out that the mutual result of both global and local attention is useful to capture semantics and generate the most informative and discriminative features for text classification. Inspired by the success of local attention in text classification, this paper employs local attention as a complement to global attention to process short-term information in the classification of malware API call sequences.

Training Strategies
Generally, benefiting from sufficient data, convolutional networks are always trained offline. Thus, researchers favor developing better training methods that improve the performance of the model without increasing the inference cost. Inspired by [32], we call this kind of method a "bag of freebies". Strategies like data augmentation [33], hard negative example mining [34], online hard example mining [35], two-stage object detectors, and objective function designing [36], to name a few, are commonly used in computer vision and natural language processing (NLP).
In malware classification, Hwang [37] designs a two-stage detection method to protect victims: a Markov chain model in the first stage delivers a low false positive rate, and a random forest in the second stage controls the false negative error rate. Baek [38] employs static analysis and dynamic analysis in different stages; static analysis in the first stage is used to separate malware from benign files. After that, they employ dynamic analysis in the second stage to identify malware among the files classified as benign in stage one, lowering the false detection rate and reducing the malware misclassification of stage one. The results show that a two-stage scheme can perform better than static analysis or dynamic analysis alone. Although these strategies can improve the detection rate, current research lacks consideration of the representation of malware and detection speed performance. Motivated by this situation, we propose a two-step training method and apply it to our model.

Related-Work Summary and Comparison
From Table 1, we can see that the existing works focus on feature encoding and the construction of classification models for API sequences or other features of malware. For feature encoding, NLP techniques have been continuously utilized, from the N-gram model at the beginning to embedding to the latest Transformer architecture. Similarity metrics-based models have been used to encode and characterize features, which can be fully exploited by leveraging the capabilities of CNN, RNN, and GNN models in deep learning. Despite their success, these works ignore local information about the API sequence and are computationally heavy; we add a local attention module to the Transformer for better results. Without extracting API sequence characteristics, we can retain more information about the association between malware and its API sequence. At the same time, some research works focus on two-stage schemes, in which the second stage further optimizes the classification effect of the first stage, thus improving the overall performance. In our work, we believe that the Transformer architecture has further room for improvement in efficiency, so we speed up the characterization process by saving the category vectors.

Methodology
This section presents our Two-step Training Dual Attention Transformer (TTDAT). Section 3.1 describes the general process and applied design principles. Section 3.2 describes the proposed network, including the dual attention Transformer encoder and its local attention operation. Section 3.3 describes the two-step training strategy and illustrates the model updates in different steps.

Overview and Design Principles
Figure 1 illustrates the overall process of our two-step methodology incorporating a dual attention Transformer. The approach takes API calls as its input and produces predictions for the respective categories. In the first step, a multi-head attention mechanism and local attention are employed within multiple dual Transformer encoders to capture and represent samples as vectors. Following activation, the model yields probabilities indicating the likelihood of each sample belonging to specific categories for predictive purposes. In the second step, the model treats API call classification as a binary classification task to train individual models, leveraging the pre-trained model from the first step. Subsequently, the methodology stores the final weights of these models as Normal Vectors, which serve as representations for the respective classes and facilitate future predictions. This two-step strategy enables us to accomplish the malware classification task effectively, with each step addressing a distinct optimization objective.
To meet the security design requirements, the method applies several principles [39], including economy of mechanism, open design, and input validation. We keep the overall architecture consistent with the Transformer and introduce only the local attention module, avoiding the complexity caused by excessive modifications. Then, we design a two-step process to optimize different purposes independently, thus maintaining the clarity and openness of the algorithm. Because we make assumptions about the input, we apply validation, including checking whether each API call is a legal call from the system library and validating the API pair given to the model. In addition, privilege validation and fail-safe default design principles need to be considered when the algorithm becomes part of a secure detection system in the future.
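The input validation mentioned above can be illustrated with a minimal sketch. It assumes that the set of legal API names is available as a plain Python set; the function name `validate_api_sequence` and the example API names are hypothetical and are not taken from the paper's implementation.

```python
def validate_api_sequence(calls, known_apis):
    """Return the sequence as a list if every call is a known API name.

    calls: iterable of API names recorded for one sample.
    known_apis: set of legal API names (assumed to be exported from the
        system libraries beforehand; how it is built is outside this sketch).
    """
    calls = list(calls)
    if not calls:
        raise ValueError("empty API call sequence")
    unknown = sorted({c for c in calls if c not in known_apis})
    if unknown:
        # Fail closed: reject the sample instead of silently dropping calls.
        raise ValueError(f"unknown API calls: {unknown}")
    return calls


# Hypothetical vocabulary, for illustration only.
KNOWN_APIS = {"CreateFileW", "WriteFile", "RegSetValueExW"}
checked = validate_api_sequence(["CreateFileW", "WriteFile"], KNOWN_APIS)
```

Rejecting the whole sample on an unknown name is one possible policy; mapping unknown names to a reserved token would be an equally reasonable alternative.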

Dual Attention Transformer Encoder
The original encoder in the Transformer, as shown in Figure 2a, is a stack of multi-head attention modules and feed-forward modules that are used for long-term relationship modeling and feature extraction, respectively. Inevitably, modeling long-term relationships between API call sequences requires attending to all API call sequences, thus somewhat suppressing the expression of short-term dependencies. Especially in the malware classification area, long- and short-term relationships matter equally, i.e., some malware can be classified by several distant or only several adjacent API calls. Consequently, we propose to use lightweight local attention and incorporate it within a dual attention Transformer encoder, which is illustrated in Figure 2b.

Unlike global attention, i.e., the multi-head attention in the Transformer, local attention attempts to learn the context information in a sliding window of size $2D + 1$. Given the embedded API call sequence matrix $X \in \mathbb{R}^{N \times M}$, where $N$ denotes the number of API calls and $M$ denotes the feature size of a single API call $x \in \mathbb{R}^{1 \times M}$, the local attention aggregates the context information $c_t$ of the current API call $x_t$ using Equation (1):

$$c_t = \sum_{i=t-D}^{t+D} \alpha_{ti}\, x_i \tag{1}$$

where $\alpha_{ti}$ denotes the weight of $x_i$ with respect to $x_t$ and can be expressed as Equation (2):

$$\alpha_{ti} = \frac{\exp\big(\mathrm{score}(x_t, x_i)\big)}{\sum_{j=t-D}^{t+D} \exp\big(\mathrm{score}(x_t, x_j)\big)} \tag{2}$$

where $\mathrm{score}(\cdot, \cdot)$ is the sum of the inner product of the variables in this paper. Finally, the context information $c_t$ is regarded as the local attention feature and replaces $x_t$.
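The windowed aggregation described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' TensorFlow code: it assumes that the score is a plain dot product, that the weights are softmax-normalized within the window, and that the window is clipped at the sequence boundaries. The function name `local_attention` is hypothetical.

```python
import math


def local_attention(X, D):
    """Replace each row x_t of X with a context vector c_t.

    X: list of N feature vectors (each a list of M floats).
    D: half-width of the sliding window, so up to 2D + 1 rows are attended.
    """
    n = len(X)
    out = []
    for t in range(n):
        # Window [t - D, t + D], clipped at the sequence edges (assumption).
        window = range(max(0, t - D), min(n, t + D + 1))
        scores = [sum(a * b for a, b in zip(X[t], X[i])) for i in window]
        # Numerically stable softmax over the window.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the neighbouring rows gives the context vector.
        context = [
            sum(w * X[i][k] for w, i in zip(weights, window))
            for k in range(len(X[t]))
        ]
        out.append(context)
    return out
```

With D = 0 the window contains only the current call, so the output equals the input; larger D mixes in more neighbours at a cost that grows linearly with the window size rather than with the full sequence length.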
Based on the local attention proposed above, we further propose a new structure, the dual attention Transformer (DAT). Similar to the Transformer encoder, the DAT takes the embedded features as inputs and produces outputs of a fixed size. The difference is that there are two sub-branches in each encoder layer, which are in charge of long- and short-term dependence modeling, respectively. The long-term part stays the same as the Transformer encoder, while the short-term part replaces the multi-head attention with the local attention proposed above. The input embedded features are passed into the two sub-branches; in each sub-branch, an attention layer and a feed-forward layer are applied in succession, each followed by an Add & Norm layer. Here, $x$ denotes the input matrix, $\varphi_{MA}$ and $\varphi_{LA}$ refer to the multi-head and local attention layers, respectively, and $f_{FF}$ is the feed-forward layer. The $\theta$s represent the parameters of the (multi-head/local) attention layers, feed-forward layers, and residual and normalization layers. The $y$s are the outputs of the attention layer followed by the Add & Norm layer, while the $z$s are the outputs of the feed-forward layer followed by the Add & Norm layer.
After that, the outputs from the two sub-branches are concatenated, and a max-pooling layer is applied to generate the final output of the encoder layer, $\delta$, where $\|$ denotes the concatenation operation. Thus, we can stack DAT (dual attention Transformer) encoder layers to form a powerful feature extractor for malware classification.
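One way to write the encoder layer described above in symbols is given below. This is a reconstruction from the prose definitions, not the authors' typeset equations: the branch subscripts on $y$ and $z$, the helper $\mathrm{AddNorm}$, and the use of a single generic $\theta$ for each layer's parameters are assumptions.

```latex
\begin{aligned}
y_{MA} &= \mathrm{AddNorm}\bigl(x,\ \varphi_{MA}(x;\theta)\bigr), &
z_{MA} &= \mathrm{AddNorm}\bigl(y_{MA},\ f_{FF}(y_{MA};\theta)\bigr),\\
y_{LA} &= \mathrm{AddNorm}\bigl(x,\ \varphi_{LA}(x;\theta)\bigr), &
z_{LA} &= \mathrm{AddNorm}\bigl(y_{LA},\ f_{FF}(y_{LA};\theta)\bigr),\\
\delta &= \mathrm{MaxPool}\bigl(z_{MA} \,\|\, z_{LA}\bigr), &
\mathrm{AddNorm}(a, b) &= \mathrm{LayerNorm}(a + b).
\end{aligned}
```

Under this reading, the max-pooling step presumably brings the concatenated features back to the size of a single branch, which is what allows the layers to be stacked.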

Two-Step Training
Dedicated training methods can be deemed a "bag of freebies": they add training cost only, yet can substantially boost classification accuracy. Our training method is divided into two steps. Training step #1 is designed to generate a basic model for malware classification, and most parameters of the model are fixed in the next step, #2, which focuses on training a Normal Vector for each malware category to improve the class representation and promote detection performance.

Training Step #1
Siamese networks [40] are widely used in deep learning to learn discriminative features and predict feature similarity. In this work, we take advantage of this idea: the architecture employed in training step #1, as shown in Figure 2b, is a Siamese-like model with a Y shape.
The network in step #1 takes two API call sequences from different categories as inputs. These sequences are sent into two feature extractors that share parameters with each other, like the Siamese network. Note that each feature extractor consists of 6 DAT encoder layers proposed in Section 3.2, responsible for converting API sequences into a new feature space. After that, subtraction and multiplication are applied to the generated features from both extractors for feature similarity evaluation. The derived features and the original features are concatenated and passed into two linear layers and a softmax layer to obtain the final output probability that the two malware samples are in the same class. It is worth mentioning that the multi-classification task is turned into a binary classification task for the sake of scalability to new malware classes.
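The feature combination fed to the classification head can be sketched as follows. This is a minimal illustration under stated assumptions: the subtraction and multiplication are taken to be element-wise, and the concatenation order is chosen arbitrarily, since the text does not fix it. The function name `pair_features` is hypothetical.

```python
def pair_features(u, v):
    """Build the matching features for two extracted feature vectors.

    u, v: equal-length lists of floats produced by the shared feature extractor.
    Returns [u, v, u - v, u * v] flattened into one list (order is an assumption).
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    diff = [a - b for a, b in zip(u, v)]   # element-wise subtraction
    prod = [a * b for a, b in zip(u, v)]   # element-wise multiplication
    return list(u) + list(v) + diff + prod


features = pair_features([1.0, 2.0], [3.0, 4.0])
```

For feature vectors of size M, the head therefore sees 4M inputs, which is independent of how many malware categories exist.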
The inference process is illustrated in Algorithm 1. When an API call sequence x needs to be classified, N sequences from each malware category are selected and sent to the Siamese network with x for similarity prediction. The category with the highest mean output probability is taken as the label of x. Benefitting from our DAT encoder, training step #1 alone can give a satisfactory classification result, but it suffers from two severe drawbacks: (1) The speed of the network can be seriously hampered by the large number of comparisons. For instance, given a new sample and M categories, it takes MN comparisons to give the final label, according to Algorithm 1. (2) The number of samples selected from each category at test time is hard to trade off. Fewer samples may cause noise because the randomly selected samples cannot represent the whole set, while more samples lead to inference speed degradation, as stated in drawback (1).
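The inference procedure described above can be sketched as follows. The sketch is not Algorithm 1 itself: `same_class_prob` is a hypothetical stand-in for the trained Siamese network, `toy_prob` is a toy similarity used only to make the example runnable, and the family names are placeholders.

```python
from statistics import mean


def classify_by_references(x, references, same_class_prob):
    """Label x with the category whose reference samples match it best on average.

    references: dict mapping category -> list of N reference API call sequences.
    same_class_prob: callable returning the probability that two sequences
        share a category (stand-in for the trained network).
    """
    scores = {
        category: mean(same_class_prob(x, ref) for ref in refs)
        for category, refs in references.items()
    }
    # M categories with N references each cost M * N network evaluations.
    return max(scores, key=scores.get), scores


def toy_prob(a, b):
    """Toy similarity (Jaccard overlap of API names), NOT the trained model."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


references = {
    "family_a": [["CreateFileW", "WriteFile"], ["CreateFileW", "RegSetValueExW"]],
    "family_b": [["connect", "send"], ["socket", "connect"]],
}
label, scores = classify_by_references(
    ["socket", "connect", "send"], references, toy_prob
)
```

Here M = 2 and N = 2, so four comparisons are made for one query, which illustrates why the cost grows quickly with the number of categories and references.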
Inspired by the Prototypical Network [41], we propose training step #2 to solve these issues. Instead of using several random samples, we utilize a Normal Vector to denote the characteristics of each category, which is more time-saving and robust. During training step #2, one of the feature extractors is replaced with a Normal Vector layer. Note that all parameters of the network except the Normal Vector layer are frozen, since what we want from training step #2 is a Normal Vector.
Training step #2 is illustrated in Algorithm 2. Given the trained model and a category i for which the Normal Vector $V_i$ is to be generated, the algorithm first initializes the training dataset for category i. Specifically, each sample in the training set is labeled 1 if it belongs to category i and 0 otherwise. During training, every sample x in the training set is sent to the feature extractor, i.e., the dual attention Transformer, to obtain the embedded feature vector $V_x$. After that, subtraction and multiplication are applied to $V_x$ and the Normal Vector $V_i$. As in training step #1, the outputs of the subtraction and multiplication are concatenated with $V_x$ and $V_i$ and passed through two linear layers and a softmax activation function to obtain the final prediction. The Normal Vector is then trained to minimize the gap between the prediction and the label. Once the training is finished, the weight in the Normal Vector layer is extracted to serve as the Normal Vector of the category. At inference time, the API call sequence is sent to the feature extractor, and the network predicts similarities between the extracted feature and all Normal Vectors. If the highest probability is produced by the Normal Vector $V_i$, the new sample is classified into category i. This procedure is formulated in Algorithm 3.
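The labeling rule and the Normal Vector inference described above can be sketched as follows. This is an illustration rather than Algorithms 2 and 3 themselves: `match_prob` is a hypothetical stand-in for the frozen matching head, `toy_match_prob` is a toy scoring function used only to make the example runnable, and the category names are placeholders.

```python
import math


def binary_labels(sample_categories, target):
    """One-vs-rest labels for training the Normal Vector of `target`."""
    return [1 if c == target else 0 for c in sample_categories]


def classify_by_normal_vectors(v_x, normal_vectors, match_prob):
    """Pick the category whose Normal Vector gives the highest match probability.

    v_x: feature vector of the query sequence from the feature extractor.
    normal_vectors: dict mapping category -> stored Normal Vector.
    match_prob: callable scoring (v_x, V_i); stand-in for the frozen head.
    """
    scores = {cat: match_prob(v_x, v_i) for cat, v_i in normal_vectors.items()}
    # One comparison per category: M evaluations instead of M * N.
    return max(scores, key=scores.get)


def toy_match_prob(v_x, v_i):
    """Toy score (sigmoid of the dot product), NOT the trained head."""
    dot = sum(a * b for a, b in zip(v_x, v_i))
    return 1.0 / (1.0 + math.exp(-dot))


labels = binary_labels(["A", "B", "A"], "A")
predicted = classify_by_normal_vectors(
    [0.9, 0.1], {"A": [1.0, 0.0], "B": [0.0, 1.0]}, toy_match_prob
)
```

Because each category is represented by one stored vector, adding a new category only requires training and storing one more Normal Vector, which matches the scalability argument made for the binary formulation.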

Dataset and Implementation Details
We implement our network with TensorFlow. Model training and testing are performed on Ubuntu 18.04 with an eight-core Intel Xeon Platinum 8255C and an NVIDIA Tesla T4 with 16 GB of memory. Moreover, the network was trained by an adaptive moment estimation (Adam) solver with mini-batch stochastic gradient descent.
We evaluated our model's performance on two datasets. The first dataset we used for training and testing is provided by [42], including the categories, hashes, and API call sequences of malware. This dataset was built from malware samples randomly selected from the Malicia project and VirusTotal, and it was shared online. We chose it because of its rich variety of sample categories and the high quality of its samples and labels, and because it has been widely used and recognized by the academic community. To explore the robustness of the model, we took malware collected by our lab from online resources and ran the Cuckoo sandbox to collect dynamic analysis results, forming the second dataset. We pre-process the suffixes of the called APIs as in [43]. Details about the two datasets are given in Tables 2 and 3.

Comparison with Previous Methods
The comparison on Dataset One between our method and previous studies is given in Tables 4 and 5. In Table 4, we chose two different kinds of methods to report the results. The first five methods are classic methods [14,44-47] for malware family classification, and we report the results from their papers. The following five methods [16,20,21,23,48] are the latest effective works on classification based on API calls, so we reproduce the methods and offer a convincing comparison result. The method of [21] adopts a two-way feature extraction architecture for API calls, but its core module is a multi-layer CNN, and the correlation analysis is performed through Bi-LSTM. Our architecture is unified around the Transformer and attention mechanisms, so a comparison with this method reflects the role of the backbone network. The study [48] further adopted a pre-training mechanism and integrated multiple Transformer architecture models through Random Forest, which has similarities with our backbone network. Moreover, its Random Forest integration step is similar to our Step 2; however, we implemented this process through the Normal Vector. So, a comparison with this work reflects the role of our mechanisms and strategies. We chose these two works as the latest and most effective API call classification models, which also allows a comparison with models similar in detail to ours. In Table 5, we also compare these five methods with a new method [49], since we can reproduce this work with our Cuckoo analysis results.
As can be observed from Table 4, the first five methods give a baseline for classification tasks, and models based on malware images or traditional NLP methods can achieve accuracy around 0.90. The model of [21] achieved an accuracy of 0.90, the model of [48] achieved an accuracy of 0.93, and our model had an accuracy of 0.96. From a baseline perspective, all three models go beyond the other basic or advanced methods and achieve a better result, demonstrating the effectiveness of API calls and Encoder/Transformer architecture models. From the perspective of SOTA, our model outperforms the two strongest existing models by 0.03 to 0.06 in accuracy, indicating the superiority of our model.

Table 4. Comparison with previous methods.

| Method | Features | Samples | Families | Accuracy |
| --- | --- | --- | --- | --- |
| Malware Image + GIST [44] | File content | 63,002 | 531 | 0.7280 |
| Malware Image + CNN [45] | File content | 10,868 | 9 | 0.9176 |
| Malware Image + GRU-SVM [46] | File content | 9339 | 25 | 0.8492 |
| BBIS + CARL [47] | API calls | 3131 | 28 | 0.8840 (F1) |
| NLP(TF-IDF) + SVM [14] | API calls | 23,080 | 10 | 0.8654 |
| Category Vector + CNN [16] | API calls | 23,080 | 10 | 0.8797 |
| Frequency Vector + RF [23] | API | | | |

In Table 5, we further compared the performance of the latest seven methods on the second dataset to demonstrate the robustness and broad performance of the model. From Table 5, the performance of all three models has decreased on the second dataset, but our proposed method still performs the best.

Ablation Studies
In this section, we use an incremental method to conduct an ablation study on Dataset One, verifying the effectiveness of every component of our method. The experimental details and results are described in the following subsections.

Ablation Study on Local Attention Mechanism
To verify the effectiveness of our local attention mechanism, we use the simplified networks shown in Figure 3, with the network in Figure 3a as our baseline. The only difference between networks (a) and (b) is that (a) uses the original Transformer encoder, whereas (b) employs our dual attention Transformer encoder (DAT encoder). Both networks take the API sequence as input. Features are extracted by six encoder layers, as in the Transformer, and then passed through two linear layers and one softmax activation, which outputs the final probabilities. Moreover, to find the best value of D in the local attention module proposed in Section 3.1, we vary D from 1 to 6 and compare the resulting accuracy.
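To make the role of D concrete, the following is a minimal sketch of attention restricted to a sliding window of 2D + 1 positions. It is an illustration under simplifying assumptions rather than the DAT encoder itself: it uses a single head, no learned projections (queries, keys, and values are all the input x), and it does not show how the local branch is fused with global attention. All function names and the toy input are hypothetical.

```python
import math


def dot_product_scores(x):
    """Scaled dot-product scores between all pairs of rows of x (Q = K = x)."""
    scale = math.sqrt(len(x[0]))
    return [[sum(a * b for a, b in zip(q, k)) / scale for k in x] for q in x]


def local_attention_weights(scores, d):
    """Row-wise softmax restricted to the window [i - d, i + d] around position i."""
    n = len(scores)
    weights = []
    for i in range(n):
        lo, hi = max(0, i - d), min(n, i + d + 1)
        window = scores[i][lo:hi]
        peak = max(window)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in window]
        total = sum(exps)
        row = [0.0] * n  # positions outside the window get zero weight
        for offset, e in enumerate(exps):
            row[lo + offset] = e / total
        weights.append(row)
    return weights


def local_attention(x, d):
    """Apply windowed attention to x, using x itself as the values."""
    w = local_attention_weights(dot_product_scores(x), d)
    dim = len(x[0])
    return [[sum(w[i][j] * x[j][c] for j in range(len(x))) for c in range(dim)]
            for i in range(len(x))]


# Toy embedded sequence of five API calls (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
weights = local_attention_weights(dot_product_scores(x), d=2)
# With D = 2, the first position can only attend to positions 0..2.
print([round(v, 3) for v in weights[0]])
```

A small D keeps each API call focused on its immediate neighbors, while a larger D approaches global attention, which is one plausible reading of why the accuracy stays competitive as D grows.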
We compare the effect and best setting of our local attention module in Table 6. Our baseline, the Transformer encoder (global attention only), achieves an accuracy of 0.7719 ± 0.0049. Introducing the DAT encoder (with local attention) leads to notable improvements, with all results above 0.80. The best performance is achieved at D = 2, with an accuracy of 0.8368 ± 0.0038, a gain of 0.0649 (about 6.5 percentage points) over the Transformer encoder. Moreover, the DAT encoder is more robust than the Transformer encoder: its results show a smaller deviation than the baseline's 0.0049. Although the best setting is D = 2, the accuracy remains competitive when D is increased, demonstrating the effectiveness of the local attention mechanism. These results strongly suggest that incorporating a local attention module (DAT encoder) is beneficial for this task. The setting D = 2 was therefore chosen for subsequent experiments.
Based on the DAT encoder (D = 2), we test against 10 kinds of malware. As can be seen in Table 7, the F1-scores vary across malware categories. The model achieves high F1-scores in categories such as Trojan-FakeAV (0.96) and Net-Worm (0.92), indicating its proficiency in classifying instances from these categories. The results can also be seen in the confusion matrix in Figure 4, where two categories are significantly lighter in color. The model still struggles to classify the P2P-Worm, Trojan-Downloader, and Trojan-Ransom categories accurately, as reflected by their lower F1-scores (0.12, 0.45, and 0.58, respectively).
After training step #1, we tested our model using the 10 kinds of malware. As can be seen in Table 8, the accuracy of our model reaches 0.87, which is 0.0332 higher than that of the network with the DAT encoder (see Table 7), demonstrating the contribution of training step #1. At the same time, the F1-score of Trojan-Ransom improves from 0.58 to 0.72. Although the accuracy improves considerably, the F1-scores of some kinds of malware remain unsatisfactory, such as P2P-Worm (0.20), Misc (0.47), and Trojan-Downloader (0.55), which is unacceptable in practice. The F1-score of P2P-Worm, Net-Worm, and Misc is lower than that of the DAT encoder. The same conclusion can be drawn from the confusion matrix in Figure 5, in which the second category, i.e., P2P-Worm, is very light and is prone to being predicted as Trojan-Spy, Packed, and so on. We attribute this gap to the scarcity of data for these kinds of malware.
Comparing Tables 8 and 9, we can see that the F1-scores of all 10 categories are boosted, and the mean accuracy improves from 0.87 to 0.92, indicating that the Normal Vectors obtained in training step #2 represent the characteristics of these categories better than randomly selected samples. However, the F1-scores of P2P-Worm and Misc are still low, and these two categories are also relatively light in the confusion matrix in Figure 6. We believe the scarce data in these two categories cannot cover most of their features, leading to unsatisfactory results.
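The gap between overall accuracy and the F1-score of rare categories discussed above follows directly from how the two metrics are computed from a confusion matrix. The sketch below illustrates this with a three-class toy matrix; the counts are invented for illustration and are not taken from Tables 7 to 9, and the function names are hypothetical.

```python
def per_class_f1(confusion):
    """Per-class F1, where confusion[t][p] counts true class t predicted as p."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # others predicted as c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Toy matrix (invented counts): the middle class has only 10 samples.
toy = [
    [90, 5, 5],
    [8, 2, 0],
    [4, 0, 86],
]
print(round(accuracy(toy), 3))
print([round(f, 3) for f in per_class_f1(toy)])
```

In this toy case the accuracy stays high because the two large classes dominate the total, while the rare class receives a low F1-score, which mirrors the behavior reported for P2P-Worm and Misc.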

Two-Step Inference
Since we have constructed a two-step training network, it is also interesting to devise a two-step inference method. Moreover, as can be seen in Figure 7, P2P-Worm samples are frequently classified as Trojan-Downloader, even though the two belong to entirely different parent categories. We believe this is caused by the small amount of P2P-Worm data, which makes its Normal Vector unstable. However, although the data for P2P-Worm are rare, its parent category, i.e., Worm, has sufficient samples. Consequently, we run training step #2 again and obtain the Normal Vectors for all the parent classes, i.e., Backdoor, Worm, Trojan, and Misc, to make full use of these samples. Hence, we have Normal Vectors for both the categories and their sub-categories. At inference time, given a new sample x, we first compare x with the Normal Vectors of the parent categories and find the optimal parent category i_p by choosing the maximum similarity. Then, we compare x with the Normal Vectors of the subcategories in i_p. Finally, the subcategory i_s with the maximum similarity is taken as the label of x.
The results of two-step inference are shown in Table 10 and Figure 7. The F1-scores of these categories improve considerably, especially for P2P-Worm, whose cell in the confusion matrix is much darker than in Figure 6, demonstrating the effectiveness of two-step inference.
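The two-step inference procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: cosine similarity stands in for the similarity measure, the sample is assumed to be already encoded as a vector, the hierarchy covers only two parent categories, and all vectors are hypothetical toy values chosen so that flat matching picks the wrong parent branch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def flat_inference(embedding, child_vectors):
    """Pick the subcategory whose Normal Vector is most similar, ignoring parents."""
    return max(child_vectors, key=lambda c: cosine(embedding, child_vectors[c]))


def two_step_inference(embedding, parent_vectors, child_vectors, hierarchy):
    """Pick the best parent category first, then the best subcategory inside it."""
    best_parent = max(parent_vectors,
                      key=lambda p: cosine(embedding, parent_vectors[p]))
    candidates = hierarchy[best_parent]
    return max(candidates, key=lambda c: cosine(embedding, child_vectors[c]))


# Hypothetical Normal Vectors; real ones would come from training step #2.
parent_vectors = {"Worm": [1.0, 0.2, 0.1], "Trojan": [0.1, 1.0, 0.1]}
child_vectors = {
    "P2P-Worm": [0.9, 0.1, 0.6],   # imagined as unstable due to scarce data
    "Net-Worm": [0.2, 0.0, 1.0],
    "Trojan-Downloader": [0.7, 0.6, 0.1],
    "Trojan-Spy": [0.0, 1.0, 0.3],
}
hierarchy = {
    "Worm": ["P2P-Worm", "Net-Worm"],
    "Trojan": ["Trojan-Downloader", "Trojan-Spy"],
}

sample = [0.8, 0.55, 0.2]  # hypothetical encoding of a P2P-Worm sample
print(flat_inference(sample, child_vectors))
print(two_step_inference(sample, parent_vectors, child_vectors, hierarchy))
```

In this toy setting the flat comparison is drawn to Trojan-Downloader, whereas restricting the second comparison to the subcategories of the selected parent recovers P2P-Worm, which is the effect the parent-level Normal Vectors are meant to provide.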

Findings and Limitations
We compare and discuss our method through extensive experiments, covering (1) the performance comparison between the proposed method and various machine learning and deep learning models; (2) the role of the local attention mechanism; and (3) the role of the two-step training strategy. From the experimental results, we can see the following:
(1) Our proposed method achieves an accuracy of 0.9606, whereas most classical methods cannot exceed 0.90. In addition, the method is validated on two datasets; although the performance of all the models generally degrades on the second dataset, our method still outperforms the others, which suggests that it is effective and capable of achieving SOTA results;
(2) We compare against two works based on the Transformer architecture. Overall, all three methods perform well, which illustrates the benefit of natural language modeling, especially the Transformer architecture, for processing API call sequences. We further fuse in a local attention mechanism based on the local feature relationships between API calls, which enhances the feature-capturing ability even further. This is examined more carefully in the ablation experiments: by exploring the DAT encoder, we find that the local attention module indeed achieves better results than the original Transformer encoder;
(3) Our ablation analysis for step 1 further verifies that introducing the Siamese network improves performance, raising the accuracy from 0.837 to 0.87. In the second step, we use the Normal Vector to simplify the process of malware classification and obtain a more generalizable category representation vector, which further raises the average accuracy from 0.87 to 0.92. This suggests that, compared to randomly selected samples, the Normal Vector obtained in the second training step is better at capturing malware category features;
(4) Finally, we adapt to the hierarchical relationship between the samples by performing a two-step inference, classifying the parent category first and then the subcategory. Experiments show that this strategy is effective and improves the accuracy from 0.92 to 0.96. It uses the richer data in the parent category to improve the stability of the Normal Vectors of subcategories with limited data and to alleviate the problem of data scarcity for specific malware types (e.g., P2P worms).
However, our results also reveal cases of underperformance; accordingly, we point out the limitations and future directions of this work.
(1) The challenge of data scarcity: Despite the efforts to address data scarcity through the training steps, the F1-scores still vary considerably across categories in multiple comparison experiments, e.g., P2P-Worm and Misc still exhibit low F1-scores. The performance of the model is limited by the availability of representative data, and strategies for dealing with rare data need to be explored further;
(2) Generalization to new samples: The model's ability to generalize to new, unseen samples is not explicitly evaluated here. Evaluating its performance on brand new malware samples, especially those that do not exist in the training data, is crucial to assessing real-world applicability;
(3) Interpretability issues: The interpretability of the model is not adequately discussed. Understanding how decisions are made, especially in security-related applications, is important for building trust in the model's predictions.

Conclusions
In this work, we treat malware classification as a sequence classification problem, which takes a complete API call sequence as input and produces its corresponding category as output. To accomplish this, we first employ the Transformer as the encoder and further introduce a local attention module to adapt it to this task. Secondly, to boost performance and increase adaptability to new categories, we transform the multi-classification challenge into a classification mapping problem and design a two-step classification strategy. We design and train a Normal Vector for each category to boost the classification speed and performance of the base model. Thirdly, comprehensive experiments demonstrate that our top-performing model achieves state-of-the-art results, attaining an average F1 score of 0.90 and an accuracy of 0.96. Finally, we conduct extensive ablation studies to show the gain from each module or step. It shows that our local atten-

Figure 1. Overview of our two-step methods.

Figure 2. (a) Transformer encoder and (b) our dual attention Transformer encoder.
Unlike global attention, i.e., the multi-head attention in the Transformer, local attention attempts to learn the context information in a sliding window of 2D + 1. Given the embedded API call sequence matrix ∈

Algorithm 1. Siamese network inference
Input: The API call sequence x of the malware to be classified.
Output: The label of x.
Init: classes: all the classes of malware; model: the trained model
function predict(x):
    probability ← []
    for i in classes do:
        probability[i] ← average([model.predict(x, j) for j ← RandSelect(i, N)])
    end for
    return argmax(probability)
end function
3.3.2. Training Step #2

Figure 3. Network structure comparison. (a) The original Transformer network structure; (b) dual attention Transformer encoder.

4.3.2. Ablation Study on Training Step #1
To investigate the contribution of training step #1, we test the model after this training step and compare it with network (b) in Figure 3. Unlike network (b), the network used in training step #1 employs a Siamese network to evaluate the similarity between the new sample and N samples randomly selected from the dataset. The value of N is empirically set to 10.
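The inference procedure of training step #1 can be sketched as follows: for each class, the new sample is compared with up to N randomly selected reference samples, the similarities are averaged, and the class with the highest average is returned. This is a minimal sketch under stated assumptions. The trained Siamese network is replaced by a simple Jaccard overlap between the sets of API names, which is only a stand-in, and the function names, the per-class sampling, and the toy API sequences are hypothetical.

```python
import random


def jaccard(seq_a, seq_b):
    """Stand-in similarity: overlap of the API names used by two sequences."""
    a, b = set(seq_a), set(seq_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def siamese_predict(x, samples_by_class, similarity, n=10, rng=None):
    """Average the similarity of x to at most n random samples of each class."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    scores = {}
    for label, samples in samples_by_class.items():
        if not samples:
            continue  # a class without reference samples cannot be scored
        refs = rng.sample(samples, min(n, len(samples)))
        scores[label] = sum(similarity(x, r) for r in refs) / len(refs)
    return max(scores, key=scores.get)


# Toy reference set (hypothetical API call sequences).
samples_by_class = {
    "Net-Worm": [
        ["socket", "connect", "send"],
        ["socket", "connect", "send", "recv"],
    ],
    "Trojan-Ransom": [
        ["CreateFileW", "ReadFile", "WriteFile"],
        ["CreateFileW", "WriteFile", "DeleteFileW"],
    ],
}

new_sample = ["socket", "connect", "recv"]
print(siamese_predict(new_sample, samples_by_class, jaccard, n=10))
```

Because every prediction requires N comparisons per class, the cost of this scheme grows with both N and the number of classes, which is consistent with the motivation for replacing the random samples with one Normal Vector per class in training step #2.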

Figure 5. The confusion matrix (with data on accuracy) of the results of training step #1. From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.
4.3.3. Ablation Study on Training Step #2
After training step #2, we obtain 15 Normal Vectors for 15 kinds of malware. To demonstrate the effectiveness of training step #2 and investigate whether the Normal Vectors can represent their malware categories, we test the model after this training procedure.

Figure 6. The confusion matrix (with data on accuracy) of the results of training step #2. From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.

Figure 7. The confusion matrix (with data on accuracy) of the results of two-step inference. From top to bottom, from left to right, there are Worm, P2P-Worm, Trojan-Spy, Net-Worm, Packed, Trojan-Downloader, Trojan-PSW, Trojan-Ransom, Misc, and Trojan-FakeAV.

Table 1. Summary and comparison of related works.

Input: The malware category i for which the Normal Vector is to be generated.
Output: The generated Normal Vector V_i.
Init: trainSet: the training set in this work; for a training sample x, x.x denotes its API call sequence and x.y is the corresponding label; model: the trained model whose parameters are fixed except for the NormalVector layer.

Feature extraction and classification
Input: The API call sequence x of the malware to be classified.
Output: The category of x.
Init: classes: all the classes of malware; model:

Table 2. Dataset One description.

Table 4. Comparison with previous methods on Dataset One.

Table 5. Comparison with previous methods on Dataset Two.

Table 6. Quantitative comparison with different encoders.

Table 8. Quantitative results of training step #1.

Table 9. Quantitative results of training step #2.

Table 10. Quantitative results of two-step inference.