An Efficient Deep Unsupervised Domain Adaptation for Unknown Malware Detection

Abstract: As an innovative way of communicating information, the Internet has become an indispensable part of our lives. However, it also facilitates a more widespread attack of malware. With the assistance of modern cryptanalysis, emerging malware having symmetric properties, such as encryption and decryption, pack and unpack, presents new challenges to effective malware detection. Currently, numerous malware detection approaches are based on supervised learning. The biggest challenge is that the existing systems rely on a large amount of labeled data, which is usually difficult to gain. Moreover, since the newly emerging malware has a different data distribution from the original training samples, the detection performance of these systems will degrade along with the emergence of new malware. To solve these problems, we propose an Unsupervised Domain Adaptation (UDA)-based malware detection method by jointly aligning the distribution of known and unknown malware. Firstly, the distribution divergence between the source and target domain is minimized with the help of symmetric adversarial learning to learn shared feature representations. Secondly, to further obtain semantic information of unlabeled target domain data, this paper reduces the class-level distribution divergence by aligning the class centers of labeled source and pseudo-labeled target domain data. Finally, we mainly use a residual network with a self-attention mechanism to extract more accurate feature information. A series of experiments are performed on two public datasets. Experimental results illustrate that the proposed approach outperforms the existing detection methods, with an accuracy of 95.63% and 95.04% in detecting unknown malware on the two datasets, respectively.


Introduction
With the rapid development of Internet technologies, the Internet economy is booming along with the emerging Internet industry. In the meantime, however, the problem of information security is becoming more and more serious. The Internet industry is closely related to users' data, privacy, and property; thus, security threats need to be addressed urgently. Numerous security problems are caused by malware or malicious code. In recent years, formjacking, ransomware, and cryptojacking have been rampant. Against this background, accurately detecting malware is not only necessary but also urgent.
Malware is one of the most common security risks to the Internet infrastructure and may cause data loss or data theft, so malware detection plays an essential and urgent role in network security. Zscaler reported that more than 300,000 specific malware attacks were detected in December 2020, with attack targets mainly including printers, digital signage, smart TVs, and so on. More seriously, malware combined with modern cryptanalysis can change the form of each instance of the software to evade "pattern matching" detection during the detection and investigation process, which increases the detection difficulty.

The main contributions of this paper are summarized as follows:
• A deep residual network with a self-attention module is used to extract features from multiple channels.
• We adopt a joint distribution alignment approach to reduce the distribution discrepancy. Firstly, the inter-domain distribution discrepancy is reduced by adversarial learning. After that, class-level alignment is achieved by optimizing the semantic alignment loss function. Eventually, we achieve intra-class sample compactness and inter-class sample separation.
• With the proposed model, extensive experiments are conducted on two public malware datasets. The experimental findings show that the model can correctly classify unknown malware and achieves better accuracy than existing detection models.
The remainder of this paper is organized as follows. Section 2 surveys the work related to malware detection and classification. A detailed description of our detection system is presented in Section 3. Section 4 demonstrates the experimental details and the related comparative results of our model and other known detection systems. Section 5 concludes this paper.

Related Work
Visualizing malware has the advantages of intuitiveness and effectiveness, and has attracted extensive attention in the field of cyberspace security from both academia and industry. In this section, we discuss malware detection methods based on feature visualization, which are mainly divided into two categories: machine learning-based and transfer learning-based methods.

Machine Learning-Based Malware Detection
Machine learning (ML) algorithms have widespread applications, e.g., natural language processing, computer vision, automatic transmission, and cybersecurity. ML algorithms are mainly divided into two categories: supervised and unsupervised learning. The former requires massive labeled data to train the model. Nataraj et al. [13] demonstrated that analyzing the texture features of images could detect malware more accurately than existing malware analysis techniques; as a result, this method is widely used for malware detection. Generally, during malware detection, the raw malware files are converted into gray-scale images, which are then used to train neural network models. Nataraj et al. [13] extracted GIST features of gray-scale images and classified malware by K-Nearest Neighbor (KNN). Instead of extracting specific features, Hamad et al. [14] proposed a novel fine-grained Malware Image Classification Framework (MICS), which extracted hybrid features of malware and classified malicious family samples with the help of an SVM (Support Vector Machine) classifier. Firstly, they converted malicious programs into gray-scale images; then, they captured local and global features of the images to classify the malicious software. Jinpei et al. [15] presented MalNet, a novel Deep Neural Network (DNN)-based malware detection framework, in which CNNs (Convolutional Neural Networks) and LSTM (Long Short-Term Memory) networks were adopted to automatically extract features in order to reduce the expense of feature engineering. All these supervised learning methods rely on labeled data to train the model.
For unknown samples, the detection capability of such models decreases. Different from supervised learning in malware classification, unsupervised learning optimizes cluster quality based on sample similarity [16]. Pitolli et al. [17] adopted an online clustering algorithm, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), to identify malware families. The algorithm can efficiently update the clusters as new samples emerge, classify malware into an existing family, and also identify malware of unknown families. All these unsupervised detection methods depend on a large amount of data to reach high accuracy. Additionally, some unsupervised algorithms extend the dataset based on the original malware files to detect new malware. Zahra et al. [18] generated unknown malware samples with deep generative adversarial networks; together with the original samples, these generated samples were used to train a more robust classifier for detecting new malware variants.
The aforementioned ML methods have achieved good performance. However, most approaches depend on expert knowledge to extract features. Meanwhile, with new malware increasing rapidly, feature extraction faces the challenge of constantly updating the features, which is time-consuming. This paper converts the raw files into gray-scale images and uses deep neural networks to extract features automatically and quickly. Moreover, to extract features more effectively, this paper introduces a self-attention module.

Transfer Learning-Based Malware Detection
Transfer learning (TL) can extract useful knowledge from one or more tasks in the source domain and apply the knowledge to new target tasks. Its essence is the transfer and reuse of knowledge. Currently, TL methods have been extensively applied to many fields [19][20][21][22][23]: for example, image classification [24], semantic segmentation [25], robot recognition [26], and medical areas [20]. Vasan et al. [9] improved the accuracy of malware detection and classification with the help of fine-tuning the parameters of the neural network, and they used data augmentation to address the data imbalance problem.
In addition to fine-tuning, domain adaptation is another subfield of TL. Compared with fine-tuning, the advantage of domain adaptation is that it makes full use of the feature similarity between the source and target domains; therefore, the knowledge gained from the source domain can be transferred to the target domain. Bartos et al. [27] constructed domain-invariant feature representations of network traffic generated by malware. Nonetheless, they focused on designing a transformation that reduced the discrepancy of the cross-domain feature distribution without considering the conditional distribution problem. Additionally, to detect unknown malware variants, Li et al. [28] proposed a framework named DART that takes advantage of adaptation regularization transfer learning. They detected malware variants by aligning the feature distributions of different domains. This method can reduce the difference of the marginal distributions of the source and target domains; however, it does not take the difference in class distribution into account. Rong et al. [29] proposed TransNet to detect unknown malware. Firstly, they converted malware traffic data into RGB images and replaced the batch normalization layer with a transfer batch normalization layer to solve the domain shift problem. Then, they used the RGB images as inputs of a DNN to solve the problem of data distribution discrepancy among multiple domains. However, they did not consider the class-level alignment of different domains. To achieve better accuracy in malware detection, this paper proposes a novel approach based on unsupervised domain adaptation, which helps reduce domain distribution differences and achieve class-level distribution alignment.

Overview
To achieve better accuracy in malware detection, we propose a distribution joint alignment unsupervised domain adaptation method to detect unknown malware, which solves the difficulty of obtaining labels and overcomes the distribution discrepancy between the test samples and training samples. The whole architecture is shown in Figure 1. In Figure 1, features are extracted from the unlabeled samples, which are then classified, where the target feature extractor F and classifier C are trained as shown in Figure 2. In Figure 2, the architecture mainly contains three components: a feature extractor F_ϕ for extracting domain-invariant features, a classifier C_φ for malware classification, and a domain discriminator D_ω for domain adversarial learning, where ϕ, φ, and ω are learnable parameters. Firstly, we train the model using the source domain data to obtain a pre-trained model; then, we assign labels to the target domain samples using the trained model to obtain pseudo-labeled samples. Secondly, the source domain samples, target domain samples, and pseudo-labeled samples are input into the pre-trained model to achieve distribution alignment by adversarial training and semantic alignment. Finally, the trained feature extractor F_ϕ and classifier C_φ are used to classify the target domain samples. Moreover, the self-attention module is introduced to extract local and long-term features more effectively. The labeled malware samples are treated as the source domain, and the unlabeled malware is taken as the target domain. A transferable domain classifier is trained to predict the labels of target domain samples through the distribution joint alignment of samples from the source and target domains. In this paper, let D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} be the source domain with labels and D_t = {x_j^t}_{j=1}^{n_t} be the target domain without labels, where n_s denotes the number of malware samples in the source domain, y_i^s denotes the label of the sample x_i^s, and D_s and D_t have different distributions.
We classify malware by the following process. Firstly, we train the feature extractor F_ϕ and binary classifier C_φ on the source domain D_s by supervised learning; thus, the pre-trained model is established. Then, the classifier assigns a label to each sample in the target domain using the pre-trained model. These target domain samples thereby obtain pseudo-labels and are denoted as D_t. Secondly, global alignment is achieved by adversarial learning between the feature extractor and the domain discriminator on the source domain D_s and target domain D_t. Finally, to make intra-class samples more compact and inter-class samples more separate, we propose class-center semantic alignment; that is, the class centers of the pseudo-labeled samples D_t are aligned with the class centers of the labeled samples D_s in the source domain. Therefore, our model needs to jointly optimize the supervised classification loss L_cls, the global domain adversarial loss L_fea, and the class-level semantic alignment loss function L_sa. Hence, the overall optimization objective is

L_total = L_cls + α L_fea + β L_sa,

where the hyperparameters α and β are the influence factors of global alignment and semantic alignment, respectively. In the remaining subsections, we will detail the self-attention module, global domain alignment, semantic alignment, and model training.
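The pseudo-labeling step of this pipeline can be sketched as follows; the function name `assign_pseudo_labels` and the batch layout are illustrative assumptions, not the paper's actual implementation:

```python
import torch


@torch.no_grad()
def assign_pseudo_labels(F_phi, C_phi, target_batches):
    """Sketch of the pseudo-labeling step: the model pre-trained on the
    labeled source domain predicts a pseudo-label for every unlabeled
    target sample as y_t = argmax_c p_c(x_t)."""
    pseudo = []
    for x_t in target_batches:
        logits = C_phi(F_phi(x_t))          # feature extractor, then classifier
        probs = torch.softmax(logits, dim=1)
        pseudo.append(probs.argmax(dim=1))  # class with maximum probability
    return torch.cat(pseudo)
```

These pseudo-labels are then treated as ground truth for the target samples during class-center semantic alignment.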

Self-Attention Module
To better extract image features, we insert a self-attention module into the feature extractor [30]. This module enables each pixel to be associated with all the others; therefore, it can alleviate the long-distance dependence problem of common convolutional structures and achieves a better balance between enlarging the receptive field and reducing the number of parameters. Consequently, as a complement to convolutional neural networks, we integrate a self-attention mechanism to capture long-term, multi-level dependencies across the image region.
We place the self-attention mechanism at the fourth block of the residual network for two reasons. Firstly, the self-attention module can effectively extract local features and reduce the resolution of the convolution. Moreover, it can aggregate the global information of the features. Therefore, the ability of the model to extract information is improved. Figure 3 illustrates the workflow of the self-attention module. In the self-attention module, the feature map x is first obtained by convolution from the front layers of the residual network. Then, three feature maps q(x), k(x), and v(x) are obtained by three 1 × 1 convolutions, respectively. During this process, the spatial dimensions of q(x) and k(x) remain identical to those of x; only the number of channels changes, while v(x) keeps both the dimensions and the number of output channels unchanged. Then, we transpose q(x) and multiply it by k(x). The attention map ρ_{j,i} of size [H × W, H × W] is obtained by normalizing each row with a Softmax layer. Multiplying the attention map ρ_{j,i} by v(x), we obtain a feature map of size [H × W, C]. Processing it with a 1 × 1 convolution, the output h(x) is reshaped to [H × W × C]; then, we obtain the feature map O. To make the model learn local information faster in the initial stage and then gradually use the self-attention mechanism as training proceeds, this paper introduces the parameter θ, which is learned in the self-attention layer. We initialize the parameter to 0, indicating that the self-attention module does not contribute at the beginning. The network gradually learns more long-range features through the self-attention module as training proceeds. Therefore, the final output f is given by

f = θ · O + x,

where O is the output of the attention branch and x is the input feature map. In our model, we use the self-attention mechanism in the feature extractor. The output features f of the attention layer are used as the input of the next residual block. Eventually, each pixel is associated with other pixels.
In this way, we solve the long-distance dependence problem that exists in ordinary convolutional structures.
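A minimal PyTorch sketch of such a self-attention block, assuming a SAGAN-style design with 1 × 1 convolutions; the class name and the channel-reduction factor of 8 are our assumptions, since the paper gives no code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Sketch of the self-attention block described above: q, k, v are
    1x1 convolutions; theta (initialized to 0) gates the attention output."""

    def __init__(self, in_channels):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // 8, 1)  # q(x): fewer channels
        self.key = nn.Conv2d(in_channels, in_channels // 8, 1)    # k(x): fewer channels
        self.value = nn.Conv2d(in_channels, in_channels, 1)       # v(x): channels unchanged
        self.out = nn.Conv2d(in_channels, in_channels, 1)         # h(.): produces O
        self.theta = nn.Parameter(torch.zeros(1))                 # learned gate, starts at 0

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).view(B, -1, H * W).permute(0, 2, 1)     # [B, HW, C']
        k = self.key(x).view(B, -1, H * W)                        # [B, C', HW]
        attn = F.softmax(torch.bmm(q, k), dim=-1)                 # [B, HW, HW] attention map
        v = self.value(x).view(B, -1, H * W)                      # [B, C, HW]
        o = torch.bmm(v, attn.permute(0, 2, 1)).view(B, C, H, W)  # weighted features
        return self.theta * self.out(o) + x                       # f = theta * O + x
```

Because θ is initialized to 0, the block is an identity mapping at the start of training and only gradually mixes in long-range information as θ is learned.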

Global Domain Alignment
Considering the distribution discrepancy between emerging unknown malware (the target domain) and known labeled malware (the source domain), we propose a global domain alignment approach to reduce the disparity of the cross-domain feature distributions. Approaches to global domain alignment in computer vision mainly fall into two types: non-adversarial and adversarial domain alignment. Non-adversarial domain alignment minimizes the global distribution discrepancy between domains with different metrics, e.g., Maximum Mean Discrepancy (MMD) [31], KL divergence [32], CORAL [33], and the Wasserstein distance [34]. In contrast, adversarial domain adaptation methods are inspired by Generative Adversarial Networks (GAN) and learn domain-invariant features [35].
This study utilizes adversarial domain alignment to align the feature distributions of the two domains by optimizing the global domain adversarial loss function L_fea. That is, the feature extractor F_ϕ and the domain discriminator D_ω interact during training. To make the distribution of D_t closer to the distribution of D_s, we initialize the target feature extractor F_ϕ with the pre-trained source model. Then, the feature extractor F_ϕ is refined by adversarial training. In this mapping, we modify the target model to match the source distribution, which is most similar to original generative adversarial learning. The discriminator D_ω is trained to minimize the domain loss L_fea so as to distinguish the features of the two domains, while the feature extractor F_ϕ confuses the domain discriminator D_ω by maximizing the domain loss L_fea so as to acquire a domain-invariant feature representation. When training ends, the network can produce domain-invariant feature representations. The global domain alignment adversarial training loss can be expressed as follows:

L_fea = −E_{x^s ∼ D_s}[log D_ω(F_ϕ(x^s))] − E_{x^t ∼ D_t}[log(1 − D_ω(F_ϕ(x^t)))]
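A hedged sketch of how the two sides of this adversarial loss can be computed with binary cross-entropy; the helper name and the source=1 / target=0 labeling convention are assumptions:

```python
import torch
import torch.nn.functional as F


def adversarial_losses(D_omega, feat_s, feat_t):
    """Sketch of the global alignment losses: the discriminator is trained
    to label source features 1 and target features 0 (minimizing L_fea),
    while the target feature extractor is trained with inverted labels
    (equivalent to maximizing L_fea)."""
    d_s, d_t = D_omega(feat_s), D_omega(feat_t)
    # Discriminator objective: tell the two domains apart.
    loss_D = (F.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s))
              + F.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t)))
    # Feature-extractor objective: make target features look like source.
    loss_F = F.binary_cross_entropy_with_logits(d_t, torch.ones_like(d_t))
    return loss_D, loss_F
```

In practice the two losses drive alternating updates: D_ω steps on loss_D, then F_ϕ steps on loss_F.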

Semantic Alignment
Currently, most existing domain adaptation-based malware classification methods focus only on global distribution alignment. However, global alignment alone does not achieve precise alignment. As shown in Figure 4, after global alignment the source and target domains are globally aligned, but some samples are still misclassified.
To make similar samples more compact, Weston et al. [36] calculated the distance among samples in the manifold embedding space and minimized this distance, but this requires a high computational cost. To reduce the computational cost, Wen et al. [37] calculated the absolute distance between each sample and its corresponding class center. In this paper, to give the model higher classification ability, we consider that not only should similar samples be compact, but the centers of different classes should also be separated as much as possible. Based on this idea, we propose class center semantic alignment. The class center semantic alignment loss function L_sa can be computed as follows:

L_sa = Σ_{i=1}^{n} max(‖x_i − c_{y_i}‖²₂ − r₁, 0) + γ Σ_{i=1}^{m} Σ_{j=i+1}^{m} max(r₂ − ‖c_i − c_j‖²₂, 0),

where γ is a trade-off parameter, and n and m denote the number of malware samples in a batch and the number of classes, respectively. c_{y_i} (y_i ∈ {1, 2, · · · , m}) is the class center of the source domain and target domain alternately, and r₁, r₂ are thresholds. ‖x_i − c_{y_i}‖²₂ is the distance from each sample to its class center, and ‖c_i − c_j‖²₂ is the distance between the centers of different classes. The class centers are updated on a mini-batch rather than on all samples in each epoch. Therefore, in each iteration, the class center is updated according to the following equation:

c_j ← c_j − ε Δc_j,  Δc_j = (Σ_{i=1}^{n} δ(y_i = j)(c_j − x_i)) / (1 + Σ_{i=1}^{n} δ(y_i = j)),

where n denotes the number of samples in each batch, δ(·) is the indicator function, and ε represents a learning rate. In class center semantic alignment, we need the pseudo-label of each target domain sample to perform center alignment. We define {p_c(x_i^t)}_{c=1}^{m} as the probability that x_i^t belongs to the c-th class. We obtain the pseudo-labels by the following steps. Firstly, the model is trained with the labeled data in the source domain. Secondly, we predict the labels of the target domain with the trained model. The pseudo-label corresponding to sample x_i^t is calculated by y_i^t = argmax_c p_c(x_i^t); that is, the class label of x_i^t is determined by the label with the maximum probability.
After performing semantic alignment, each sample belonging to the identical class is aligned to its class center, respectively. In this way, we can make similar data more compact and dissimilar data more discriminable in the feature space. Finally, all samples with identical labels will be aligned to the neighborhood of the shared class centers.
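The semantic alignment loss above can be sketched as follows; the function name, the explicit double loop over class centers, and the default parameter values are illustrative choices:

```python
import torch


def semantic_alignment_loss(feats, labels, centers, gamma=1.0, r1=0.0, r2=100.0):
    """Sketch of the class-center semantic alignment loss L_sa: the first
    term pulls each sample toward its class center (beyond margin r1); the
    second pushes different class centers at least r2 apart."""
    # Intra-class compactness: squared distance of each sample to its center.
    intra = ((feats - centers[labels]) ** 2).sum(dim=1)
    intra = torch.clamp(intra - r1, min=0.0).mean()
    # Inter-class separation: penalize center pairs closer than r2.
    m = centers.size(0)
    inter = feats.new_zeros(())
    for i in range(m):
        for j in range(i + 1, m):
            d = ((centers[i] - centers[j]) ** 2).sum()
            inter = inter + torch.clamp(r2 - d, min=0.0)
    return intra + gamma * inter
```

When every sample sits exactly on its class center and all centers are farther apart than r₂, the loss is zero, which matches the intuition of compact classes and separated centers.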

Model Training
In this paper, to achieve high detection and classification accuracy, we need to align the distributions of the source and target domains. We achieve this goal in two steps: global alignment and semantic alignment. Firstly, global domain alignment is implemented with the help of adversarial learning. Then, semantic alignment is performed by minimizing the class center semantic alignment loss function L_sa using Equation (4). Therefore, we need to jointly optimize the supervised classification loss L_cls (cross-entropy loss), the global domain adversarial loss L_fea using Equation (3), and the semantic alignment loss L_sa. The whole loss function is represented in Equation (7):

L_total = L_cls + α L_fea + β L_sa. (7)

The specific training process is as follows. Firstly, we minimize the classification loss L_cls by standard supervised learning on D_s; thus, we obtain a pre-trained feature extractor F_ϕ and classifier C_φ. Secondly, every sample in the target domain is assigned a pseudo-label by the pre-trained model according to y_i^t = argmax_c p_c(x_i^t). Then, we initialize the feature extractor F_ϕ and classifier C_φ of the target model with the pre-trained model. Global alignment of D_s and D_t is achieved by minimizing the global domain adversarial loss L_fea. Meanwhile, class center semantic alignment is implemented by minimizing the loss function L_sa. Algorithm 1 gives the whole training process of the proposed method.

Algorithm 1: Training of the proposed method
Input: labeled source domain D_s; unlabeled target domain D_t
Output: target feature extractor F_ϕ and classifier C_φ
1: Pre-train F_ϕ and C_φ on D_s by minimizing L_cls
2: Pseudo-label: assign y_i^t = argmax_c p_c(x_i^t) to every target sample
3: Use D_s to compute the source domain class centers c_s and set c_t ← c_s
4: while not converged do
5:   for each mini-batch (X_S, Y_S, X_T) do
6:     Update the class centers on the current mini-batch
7:     Compute the joint loss function L_total(X_S, Y_S, X_T; ϕ, φ, ω)
8:     Back-propagate L_total to get the gradient of each parameter
9:     Update the parameter set Ω by gradient descent with the Adam optimizer
10:  end for
11:  Calculate the mean loss and mean accuracy
12: end while
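The inner loop of the training procedure can be sketched as follows. This is a simplified single-update sketch under stated assumptions: a real adversarial scheme alternates discriminator and extractor updates, and the semantic term is reduced here to a crude mean-feature proxy for brevity; all helper names are ours:

```python
import torch


def train_adaptation(F_phi, C_phi, D_omega, src_batches, tgt_batches,
                     alpha=0.1, beta=0.1, lr=1e-4):
    """Sketch of one pass over the data for the joint objective
    L_total = L_cls + alpha * L_fea + beta * L_sa."""
    params = (list(F_phi.parameters()) + list(C_phi.parameters())
              + list(D_omega.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    ce = torch.nn.functional.cross_entropy
    for (x_s, y_s), x_t in zip(src_batches, tgt_batches):
        f_s, f_t = F_phi(x_s), F_phi(x_t)
        l_cls = ce(C_phi(f_s), y_s)                      # supervised loss
        d_s, d_t = D_omega(f_s), D_omega(f_t)            # domain logits
        l_fea = (bce(d_s, torch.ones_like(d_s))
                 + bce(d_t, torch.zeros_like(d_t)))      # global alignment
        l_sa = ((f_s.mean(0) - f_t.mean(0)) ** 2).sum()  # crude semantic proxy
        loss = l_cls + alpha * l_fea + beta * l_sa       # joint objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return float(loss)
```

The full method additionally maintains per-class centers for l_sa and alternates the adversarial updates, as described in the text.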

Dataset
Our approach is evaluated on two public Windows malware datasets (BIG-2015 and Malimg) and a benign dataset selected from Playdrone [38]. BIG-2015 is a publicly available malware dataset released by Microsoft on the Kaggle platform. The BIG-2015 dataset contains 21,741 samples belonging to nine categories; among them, 10,868 samples are used for training, and 10,873 for testing. This experiment only utilizes the training malware. We use the byte files to generate malware gray-scale images and normalize them to a fixed size. The Malimg dataset includes 9,339 malware samples from 25 malware families. Tables 1 and 2 give the details of the two datasets. In addition to malware, our experiment also uses 2280 benign samples. We randomly select a family in the malware dataset as the target domain, and the remainder is treated as the source domain. In addition, 1140 benign samples are included in the source and target domain, respectively. For example, in the BIG-2015 dataset, the Ramnit family is used as the unlabeled target domain, and the remaining families form the source domain. The benign samples are different in the two domains.

Implementation Details
In our experiment, all malware in BIG-2015 and the benign software are converted into gray-scale images and normalized to a size of 196 × 196 pixels. The original samples of the Malimg dataset are already gray-scale images, so we only resize them to the fixed size of 196 × 196 pixels. On each dataset, one family is selected as the target domain, and the remaining families are used as the source domain. In this way, there are nine tasks for the BIG-2015 dataset and 25 tasks for the Malimg dataset. This study uses a residual network as our feature extractor and classifier. We insert the self-attention module before the fourth residual block. The adversarial discriminator has the same structure as the classifier, which consists of three fully connected layers whose sizes are x-2048-4096-1 (where x is the size of the input feature). A ReLU activation function is used in each layer, and we use the dropout mechanism to reduce overfitting by randomly ignoring neurons with a certain probability in each training batch. The entire training process uses the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0. The pre-training on the source domain uses the cross-entropy loss function L_cls. During training, we simultaneously optimize L_cls, the domain adversarial loss function L_fea, and the semantic alignment loss function L_sa. The parameters are α = 0.1 and β = 0.1, respectively. The learning rate ε of the local class center update is set to 0.5. The domain adversarial loss is scaled by 0.1. The two thresholds r₁ and r₂ used in the semantic alignment loss function are 0 and 100, respectively. The batch size is 32. Experiments are performed in the PyTorch framework on a personal computer configured with an Intel(R) Core(TM) i5-4590 CPU @ 3.30 GHz, 4 GB of RAM, and Windows 7.
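The byte-to-image preprocessing can be sketched as follows; the row width of 256 bytes and the nearest-neighbour resize are assumptions, since the paper does not state these choices:

```python
import numpy as np


def bytes_to_image(raw, size=196):
    """Sketch: convert a raw binary into a fixed-size gray-scale image.
    Each byte becomes one pixel intensity (0-255)."""
    buf = np.frombuffer(raw, dtype=np.uint8)
    width = 256                                # assumed row width in bytes
    if len(buf) < width:                       # pad very small files
        buf = np.pad(buf, (0, width - len(buf)))
    rows = len(buf) // width
    img = buf[: rows * width].reshape(rows, width)
    # Nearest-neighbour resize to a size x size image.
    ys = (np.arange(size) * rows) // size
    xs = (np.arange(size) * width) // size
    return img[ys[:, None], xs]
```

In a real pipeline one would typically use an image library's resize (e.g., bilinear interpolation), but the nearest-neighbour form keeps the sketch dependency-free.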

Evaluation Metrics
The study uses the following four metrics to evaluate each model: Accuracy, Precision, Recall, and F1-score. All samples can be categorized according to their true and predicted labels, which introduces the following factors: True Positive (TP) means that the true label is positive, and the predicted label is positive.
True Negative (TN) means that the true label is negative, and the predicted label is negative.
False Positive (FP) means that the true label is negative, while the predicted label is positive.
False Negative (FN) means that the true label is positive, while the predicted label is negative.
Accuracy denotes the ratio of correctly classified samples to the total number of samples: Accuracy = (TP + TN)/(TP + TN + FP + FN).
Precision is the ratio of true positives among all samples predicted as positive: Precision = TP/(TP + FP).
Recall is the ratio of true positives among all actually positive samples: Recall = TP/(TP + FN).
F1-score denotes the harmonic mean of Precision and Recall: F1 = 2 × Precision × Recall/(Precision + Recall). As a comprehensive metric, it is introduced to balance the effects of precision and recall and to evaluate a classifier more comprehensively.
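These four definitions translate directly into code (a direct transcription of the formulas above, with a hypothetical helper name):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, and F1-score from the
    confusion-matrix counts TP, TN, FP, FN."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, a classifier with TP = 8, TN = 8, FP = 2, FN = 2 scores 0.8 on all four metrics.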

Performance Comparison of Different Models
In this paper, we have run extensive experiments and computed the four evaluation metrics mentioned above when each family acts as the unknown malware (target domain) on the two datasets. We also consider the setting in which one dataset (e.g., BIG-2015) is treated as the source data and the other (e.g., Malimg) as the target data. In the experiments, the benign software used in the source and target domains is distinct as well. Figures 5 and 6 illustrate the experimental results. Our approach achieves good results on each subtask. Averaging the results over several tests, we obtain an average accuracy and recall of 95.04% and 94.25%, respectively, on the BIG-2015 dataset, and an average accuracy and recall of 95.63% and 95.30%, respectively, on the Malimg dataset. In addition, we compare our work with some existing malware detection methods; Table 3 shows the performance comparison. We also compare our work with some existing domain adaptation-based methods such as BIRCH [17], DART [28], GAA-ADS [39], and RCNN + transfer learning [40], which are shown in Table 3. Our method obtains higher accuracy and recall than GAA-ADS and RCNN + transfer learning. Our method also outperforms DART, which uses distribution alignment: DART mitigates the domain discrepancy by optimizing the marginal and manifold distributions of the two domains but does not take semantic alignment into account. In this paper, to facilitate feature extraction, we transform the raw PE files into gray-scale images before feeding them into the neural network, and a self-attention module is introduced to capture long-distance dependencies. Our model obtains higher accuracy and recall by jointly performing global and semantic alignment.
From the above data, the proposed method has a better performance compared with some approaches only considering global domain adaptation, and it also confirms that semantic information can improve the classification accuracy.

Conclusions
This paper studies the detection of unknown Windows malware. To solve the difficulty of obtaining labeled samples and the problem of the distribution discrepancy between unknown and source samples, we propose an efficient deep unsupervised domain adaptation method for unknown malware detection. Firstly, we adopt a joint distribution alignment approach to reduce the distribution discrepancy. We minimize the distribution discrepancy between the source and target domains through adversarial learning to learn a shared feature representation. To further obtain semantic information about the unlabeled samples, we minimize the distances from the labeled source domain and pseudo-labeled target domain samples to their class centers. Then, to enhance the feature extraction ability, we adopt a residual network with a self-attention mechanism as the pre-trained model. Finally, extensive experiments are conducted on two datasets, and the results illustrate that the proposed method outperforms the state-of-the-art domain adaptation-based detection methods in detecting unknown malware. In future work, we will investigate a more advanced fine-grained domain adaptation approach for malware family classification and conduct extensive experiments on different datasets (e.g., Android malware, IoT malware).