MalCaps: A Capsule Network Based Model for the Malware Classification

Abstract: Research on malware detection enabled by deep learning has become a hot topic in the field of network security. Existing deep learning-based malware detection methods suffer from several issues, such as weak deep feature extraction, relatively complex models, and insufficient generalization ability. Traditional deep learning architectures, such as convolutional neural network (CNN) variants, do not consider the spatial hierarchies between features and lose some information about the precise position of a feature within the feature region, which is crucial for a malware file that has specific sections. In this paper, we draw on the idea of image classification in the field of computer vision and propose a novel malware detection method based on a capsule network architecture with hyper-parameter optimized convolutional layers (MalCaps), which overcomes the limitations of CNNs by removing the need for a pooling layer and introducing capsule layers. First, the malware is transformed into a grayscale image. Then, the dynamic routing-based capsule network is used to detect and classify the image. Without advanced feature extraction and with only a small number of labeled samples, the presented method is tested on the unbalanced Microsoft Malware Classification Challenge (MMCC) dataset, and the experimental results show a testing accuracy of 99.34%, improving on a number of traditional deep learning models proposed in recent malware classification literature.


Introduction
Malware (a portmanteau of malicious software) is any software intentionally designed to cause damage to a computer, server, or computer network. It refers to malicious programs that are made or used by an attacker, spread through removable storage media or networks, and destroy the availability of information systems or steal users' private information without authorization [1]. Two criteria are used to identify malicious code: it is unauthorized and it is malicious. Malicious code includes viruses, worms, Trojan horses, botnets, backdoors, rogue software, and other types of malicious programs. With the vigorous development of the Internet, malware has become one of the key threats. Judging from actual cases in recent years, such as outbreaks of botnets, advanced persistent threats (APTs), ransomware, and other major network security incidents, malware acts as the core component and causes substantial damage. In 2020, the AV-TEST system detected more than 1.1 billion malware samples, as Figure 1 shows [2]. According to the 2020 China Internet Network Security Report, more than 62 million samples of computer malicious programs were captured throughout the year, with an average of more than 8.24 million transmissions per day, involving more than 660,000 types of malicious programs and including more than 731,000 ransomware viruses, among which the economic loss caused by the GandCrab virus alone reached 2 billion US dollars [3]. At present, malware is increasingly prevalent. In the process of fighting malware, analysis and detection technologies have also been developing constantly. Malware has been studied since the 1990s [4]: from rule matching and feature code extraction in the early stage, to dynamic, static, and heuristic detection in the middle stage, and then to the current machine learning and multi-engine joint learning, malware detection technology has developed continuously.
However, anti-detection technology is also constantly being upgraded. Malware uses packing (shelling), obfuscation, virtual machine protection, and similar techniques to evade detection and removal by anti-virus tools [5].
With the development of deep learning, it has replaced traditional methods to become a research hotspot of malware detection. Deep learning extracts the characteristics of malicious code from a large number of malware samples and classifies these features to obtain a malware recognition model, which has the obvious advantages of high automation and low resource consumption [6]. However, existing detection models based on deep learning still have some problems, such as weak ability of deep feature extraction, relatively complex model, and insufficient ability of model generalization, which need to be further explored and studied.
To address these limitations of existing approaches, we propose a new idea for malware detection based on the capsule network. The capsule network, which is very different from other deep networks, was proposed by Hinton in 2017 [7]. Capsule networks are made up of capsules; a capsule is a set of neurons. Unlike in a traditional neural network, the set of neurons in a capsule forms a vector or matrix. Each capsule represents various attributes of a specific entity in the image, such as the direction, position, and color of the object, and the probability that the entity exists is represented by the length (modulus) of the capsule vector. Low-level capsules transmit information to high-level capsules through a dynamic routing mechanism. Most prominently, the network can learn and store information about the spatial relationship between the whole object and its parts. Therefore, compared with a traditional neural network, the capsule network has a stronger feature extraction ability, especially for details. Because of this ability, the number of samples needed to train a capsule network is much smaller than that needed for CNNs, and good results can be obtained without data augmentation.
Inspired by the works mentioned above, in this paper we propose a novel solution to detect malware based on the capsule network, without any feature engineering. The approach borrows the ideas and techniques of malware visualization. Using malware image generation technology, we first convert the specially processed executable files into grayscale images. Next, the generated uniform images are fed into the capsule network to extract effective features and train the detection model. The trained capsule network model is then used for the classification of malware. Experiments are conducted on the Microsoft Malware Classification Challenge (MMCC) dataset to evaluate the proposed approach.
In summary, the main contributions of this work are as follows: 1. We introduce malware image generation technology into the field of malware detection, which can completely preserve the feature information of malware; 2. We propose a new architecture in the domain of malware classification by implementing a novel modified capsule neural network architecture; 3. We compare the capsule neural network with a convolutional neural network and demonstrate its effectiveness in malware detection.
The remainder of this article is organized as follows: some related works are discussed in Section 2. Section 3 describes the processes of the proposed approach. In Section 4, the experiments and results are given to validate the scheme. Finally, in Section 5 we describe the conclusion and future work.

Related Work
In this section, prior works are introduced from three perspectives: 1. traditional malware detection approaches; 2. shallow machine learning-based malware detection approaches; 3. deep learning-based malware detection approaches.

Traditional Malware Detection Approaches
The evolution of malware detection approaches was reviewed in detail by Caviglione et al. [8]. The literature shows five main approaches to malware detection: the early detection systems relied on signatures, while more recent ones use behavior-based, heuristic-based, energy-based, or bioinspired techniques. Signature-based methods rely on signatures such as checksums, behavioral patterns, connection attributes, and metadata. They can match the malware content [9], protocols, and payloads [10], and are effective for detecting well-known malware, but they are often susceptible to code obfuscation techniques such as packers, encryption, deformation, and polymorphism, and are helpless against new malware. Behavior-based methods detect malware by recognizing malicious or unwanted behaviors, which are compared with clean templates [11]. Common behaviors include modification of key system registry entries and use of network communication resources, files, and mutex resources. Many automated tools support behavior analysis, such as Cuckoo [12], CWSandbox [13], and Ether [14]. Behavior-based methods can identify the type of malware and unrecognized malicious processes at runtime, and can therefore detect unknown malware. However, they are prone to false alarms, are time-intensive and resource-consuming, and can be evaded by mimicry attacks. Heuristic-based methods, including specification-based techniques, learn the behaviors and characteristics of executable files, such as API calls, byte n-grams, opcodes, control flow graphs, or their combinations [15,16]. They are able to detect both known and unknown malware; however, they are time-consuming and often exhibit low accuracy. Energy-based methods exploit information about the energy footprint of hardware and software as an indicator of ongoing attacks [17].
Bioinspired methods, such as Genetic Algorithms [18], Particle Swarm Optimization [19], and Ant Colony Optimization [20] are now used for malware features selection and optimization.

Shallow Machine Learning Based Malware Detection Approaches
Machine learning offers another way to detect malware effectively, and reduces the manual work required by traditional methods. It mainly includes four steps: dataset construction, feature engineering, model training, and model evaluation [21]. Typically, the features used to characterize malware include the requested permissions, components, and intent filters. Advanced static features include n-gram features, control-flow graphs [22], and API call graphs obtained by decompiling malware files [23]. Dynamic behavior features, such as file operations, network operations, encryption operations, service startup, and system calls, can be obtained by executing applications in an isolated environment. Combining these static and dynamic features to detect malicious software can achieve higher performance [24]. Models used for training include logistic regression [25], SVM [26], k-nearest neighbor (k-NN) [27], decision tree [28], random forest [29], and so on. The newest research on machine learning-based malware detection focuses on innovative feature spaces. Mohanasruthi et al. proposed an Application Program Interface Call Transition Matrix (API-CTM) [30], and Farrokhmanesh et al. proposed converting malware data bytes into audio signals [31]. However, the shortcoming is that detection methods based on machine learning still rely on feature engineering and on complex or expert features to complete the learning task. The final performance of a machine learning-based detection model depends on the choice of features, which is very subjective. Furthermore, these approaches are not conducive to efficient real-time malware classification and monitoring. Therefore, models that can be automated are intuitively more useful and time-efficient in comparison.

Deep Learning Based Malware Detection Approaches
With the successful application of convolutional neural networks (CNNs) to the ImageNet dataset, computer vision has entered a new era [32]. CNNs can classify images using raw pixel values without the need for complex feature engineering. The authors of [33] were the first to treat malware classification as an image-based classification task. They represented the malware file as a binary string of zeros and ones. The resulting vector is then reshaped into a matrix, so that the malware file can be viewed as a grayscale image. The authors observed that, after interpreting the original bytecode as grayscale images, images belonging to the same malware family looked very similar in layout and texture. To process malware as an image classification task, all the malware executable binaries must first be reshaped as images. In this field, most of the previous work has used various CNN structures to classify malware images [34,35]. Rezende et al. applied transfer learning with the ResNet-50 deep CNN to the MalIMG dataset by first converting each byte plot to an RGB image. Compared with the KNN and GIST classification models, the proposed model achieved a classification accuracy of 98.62% without feature engineering [36]. Li et al. proposed a method combining a self-attention mechanism with ResNet50 to detect malware variants. The method transformed malicious code into grayscale images; the data was then augmented by overlapping. The method was tested on the MalIMG dataset and achieved 95.79% accuracy [37]. Priyadarsini et al. proposed a malware recognition procedure based on RNN-LSTM for the Internet of Things. Their model takes static, dynamic, and hybrid features as inputs, which were selected using an information gain algorithm. Compared with KNN, LR, NB, DT, SVM, and DNN, their model achieved an accuracy of 98%, which is better than the others [38].
Maryam et al., proposed a feature fusion method to combine the features extracted from pre-trained AlexNet and Inception-v3 deep neural networks with features attained using segmentation-based fractal texture analysis (SFTA) of images representing the malware code. The presented method is evaluated on the Malimg dataset, achieving an accuracy of 99.3% [39]. Robertas et al., proposed an ensemble learning-based architecture and explored 14 machine learning algorithms as meta-learners, with neural networks used as base learners. They conducted experiments on a dataset that included malware and benign files from Windows with PE headers. The results showed that Extra Trees algorithm as a meta-learner and an ensemble of dense ANN and 1-D CNN models obtained the best accuracy value on their dataset [40].
Despite the impressive accuracy reported in these papers, the models can be further improved by adding more layers to extract more complex features. The first requirement for any CNN, regardless of the architecture, is a large amount of training data, which is not readily available in the field of malware detection. Another limitation is the loss of local detail information, such as position and pose, during the CNN pooling operation. This loss may not have a significant impact on ordinary image processing, but the information may be critical for malware data [41].
The above analysis shows that there are problems to be solved in traditional malware detection algorithms, detection methods based on machine learning, and convolutional neural networks: 1. the detection rate of traditional static detection algorithms is clearly reduced in the face of code obfuscation, packing, signature forgery, and other camouflage technologies; 2. dynamic detection methods place higher demands on system resources; 3. feature extraction and screening based on machine learning is too complicated; 4. convolutional neural network detection methods lose information in the process of feature extraction and require a large number of training samples.
A study by Hinton et al. attempted to overcome the limitations of CNNs. They depicted human visual perception as a deconstruction of images in the brain and sought to simulate its hierarchy [42]. In 2017, Hinton proposed the capsule network model for the first time, and this model is considered to be an important neural network model of the next generation. Capsule networks are made up of capsules. A capsule is a set of neurons; unlike in a traditional neural network, the set of neurons in a capsule forms a vector or matrix. Each capsule represents various attributes of a specific entity in the image, such as the direction, position, and color of the object, and the probability that the entity exists is represented by the length (modulus) of the capsule vector. In a capsule network, low-level capsules transmit information to high-level capsules through a dynamic routing mechanism [7,42]. Therefore, compared with a traditional neural network, the capsule network has a stronger feature extraction ability, especially for details. Due to its excellent performance, the capsule network model has been successfully applied to small-sample classification tasks containing only several thousand samples, such as text classification [43], visual reconstruction [44], brain tumor classification [45], diagnosis of rotator cuff tears, capsule GAN [46], and deep reinforcement learning [47]. As the relevant literature shows, the capsule network has potential applications in various fields.
In the cybersecurity field, there are only very few papers in the literature about capsule networks for malware classification. Cayir et al. proposed a Random CapsNet Forest (RCNF) for imbalanced malware classification based on bootstrap aggregating [48]. Their RCNF achieves 99.56% accuracy on the dataset. Although this is an impressive result, the paper arguably has two limitations. The first limitation is the number of estimators in the RCNF model: in their implementation, an RCNF model can contain only up to 10 capsules. The second limitation is the training time: training an RCNF with 10 capsules on the MMCC dataset takes five hours, because their models are more complex and require more parameters. Wang et al. proposed a novel malware detection and classification method based on the capsule network [49]. Their capsule network structure contains only one convolutional layer. Phaye et al. argued that the feature maps learned by the first convolutional layer in the baseline CapsNet model capture only basic features, which may lack flexibility for deeper feature extraction [50]. Furthermore, Wang et al. performed only binary classification, and their model was not verified on a public test set. This paper discusses our proposed MalCaps and the role and significance of this technology.

Method Descriptions
In this section, we describe our capsule network-based approach for malware classification in detail, including the dataset we used for our experiments, malware data preprocessing, capsule neural network architectures, and our proposed capsule network. The overall flow of the proposed methodology is shown in Figure 2.

Microsoft Malware Classification Challenge Dataset
The malware dataset used in this paper is from a Microsoft project on Kaggle, the Microsoft Malware Classification Challenge (MMCC) [51]. The dataset has become the standard benchmark for evaluating machine learning techniques for the task of malware classification. It consists of a set of known malware files representing a mix of 9 different families: (1) Ramnit, (2) Lollipop, (3) Kelihos_ver3, (4) Vundo, (5) Simda, (6) Tracur, (7) Kelihos_ver1, (8) Obfuscator.ACY, (9) Gatak. Each sample is uniquely identified by a 20-character hash value and carries a class label. For each sample there is a corresponding assembly file and a raw hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). The dataset has 21,741 samples, with 10,868 for training and the other 10,873 for testing, and occupies almost half a terabyte uncompressed. Table 1 shows the distribution of the classes present in the training dataset. As the focus of this paper is the application of a novel capsule network-based technique to classify malware based on the raw binary file content, we only consider the raw hexadecimal file representations and convert them to their binary representation.

Transforming Malware Byte Files into Images
At present, there is a great deal of research on methods of malicious code visualization. Nataraj et al. [33] first proposed visualizing malware binaries as grayscale images and classifying malware according to the similarity of image textures. Some researchers borrow the methods used in the image field for handling images of different sizes, and use deformation processing such as cropping or scaling to convert malicious code images to a fixed size [52]; however, these methods cause the loss of malicious code data and the destruction of code structure. This paper visualizes malicious code by converting binary code into a three-channel image, as Wang et al. did in their paper [49]. Since a three-channel image (24 bits per pixel) can represent 16,777,216 distinct values per pixel, whereas a single-channel image (8 bits per pixel) can represent only 256, three-channel images have stronger feature representation capability. For a given malicious code binary, the method reads the content in groups of three 8-bit binary numbers, converts each into a decimal integer (in the range 0 to 255), reshapes those integers into vectors, and finally generates a two-dimensional array whose width and height vary depending on the size of the file. Finally, the array is visualized as a three-channel image. So as not to affect the final analysis results, 0 is used as padding when the last read contains fewer than 24 bits. The mapped image is saved as an uncompressed PNG image. The process of mapping malicious code to images is shown in Figure 3. Figure 4 shows images of some samples from the MMCC dataset. It is clear from the figure that image samples from different malware families have different characteristic texture features, and samples from the same malware family are very similar.
This is because a large amount of malicious code now relies on code reuse: key code blocks are reused, so malware from the same family often contains the same modules. Similar code yields similar images, and different code yields different images. The image texture features can reflect this similarity and difference effectively. Therefore, this method provides effective sample inputs for capsule network-based malware classification.
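The byte-to-image mapping described above can be sketched as follows. This is a minimal illustration assuming NumPy, not the authors' implementation: the function name `bytes_to_rgb_image` and the fixed `width` parameter are hypothetical, since the text does not state how the image width is chosen, and padding the whole last row with zeros (rather than only the last pixel) is an added assumption. Parsing the MMCC hexadecimal dump into raw bytes is left out.

```python
import numpy as np


def bytes_to_rgb_image(data: bytes, width: int = 64) -> np.ndarray:
    """Map raw bytes to a (height, width, 3) uint8 array.

    Every three consecutive bytes become one pixel (R, G, B).
    `width` is a hypothetical parameter: the paper only says that the
    image dimensions depend on the file size.
    """
    if len(data) == 0:
        raise ValueError("empty input")
    arr = np.frombuffer(data, dtype=np.uint8)
    row_bytes = width * 3
    # Zero-pad so the last pixel and (assumption) the last row are complete.
    pad = (-arr.size) % row_bytes
    arr = np.concatenate([arr, np.zeros(pad, dtype=np.uint8)])
    return arr.reshape(-1, width, 3)


if __name__ == "__main__":
    img = bytes_to_rgb_image(bytes(range(10)), width=2)
    print(img.shape)  # height x width x channels
```

The resulting array could then be written to an uncompressed PNG with an image library of choice; that step is omitted here to keep the sketch dependency-free.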

Capsule Neural Networks
The capsule neural network is a recent development in the field of deep learning architectures, introduced by Hinton et al. in 2017 [7] to overcome limitations of the traditional CNN approach. A capsule is a carrier that contains a group of organized neurons, each of which represents a property of a particular entity that appears in the image. These properties can include many different types of instantiation parameters, such as pose (position, size, direction), deformation, speed, hue, texture, and so on. A very special property in the capsule is the presence of an instance of a category in the image; its output value is the probability that the entity exists. The capsule network is designed to mimic the way our brains process vision, called reverse rendering. Humans parse an image into many hierarchical subparts and construct the relationships between these subparts in order to recognize familiar objects; this is the process the capsule network tries to mimic. Capsules are placed at each part of the image, and the capsule output indicates whether a given subpart of the image is located at that position.
The capsule network consists of three kinds of hidden layers: a convolutional layer, a primary capsule layer, and a digit capsule layer. The convolutional layer extracts features from images with 9 × 9 convolutional filters, 256 channels, and a stride of 1, followed by ReLU activation, and creates 20 × 20 local feature maps. The primary capsule layer converts the feature maps from scalars into vectors: a total of 32 distinct 6 × 6 grids of 8-dimensional capsules transform the scalar quantities into vectors with directional information. The digit capsule layer outputs capsules of length 16. These capsules are calculated from the capsules of the primary capsule layer through the dynamic routing algorithm, known as routing-by-agreement [7].
The general calculation process of the capsule network is as follows:
1. $u_i \in \mathbb{R}^{k \times 1}$, $i = 1, 2, \ldots, n$, is the input of the lower capsules, where $n$ represents the number of capsules and $k$ represents the number of neurons in each capsule (the vector length).
2. A transformation matrix $W_{ij} \in \mathbb{R}^{p \times k}$, which represents important spatial information between low-level and high-level features, is applied to the lower layer's vector by matrix multiplication, where $p$ represents the number of neurons in the output capsule. This converts the input $u_i \in \mathbb{R}^{k \times 1}$ into a "prediction vector" $\hat{u}_{j|i} \in \mathbb{R}^{p \times 1}$:
$$\hat{u}_{j|i} = W_{ij} u_i \tag{1}$$
The weight matrix $W_{ij}$ is learned during the back propagation procedure.
3. Then the weighted sum of all the obtained prediction vectors is computed:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \tag{2}$$
where $s_j$ is called the total input of higher capsule $j$, $c_{ij}$ is the coupling coefficient calculated by dynamic routing, and
$$\sum_j c_{ij} = 1 \tag{3}$$
Conceptually, $c_{ij}$ represents the probability distribution of capsule $i$ activating capsule $j$.
4. Finally, the output $v_j$ of higher capsule $j$ is obtained by applying a non-linear "squashing" function, which shrinks short vectors to almost zero length and long vectors to a length close to 1 while leaving the direction of the vector unchanged:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \tag{4}$$
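To make step 4 concrete, the squashing function from [7] can be sketched in NumPy as below. The function name `squash` and the small `eps` term (added only to avoid division by zero) are illustrative choices, not details taken from the paper.

```python
import numpy as np


def squash(s: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Squash vectors along `axis`: direction is kept, length is mapped into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # ||s||^2 / (1 + ||s||^2)
    return scale * s / np.sqrt(sq_norm + eps)  # times the unit vector s / ||s||


if __name__ == "__main__":
    long_vec = squash(np.array([3.0, 4.0]))    # input length 5
    short_vec = squash(np.array([0.01, 0.0]))  # input length 0.01
    print(np.linalg.norm(long_vec), np.linalg.norm(short_vec))
```

For an input of length 5 the output length is 25/26, close to 1, whereas an input of length 0.01 is shrunk to roughly 0.0001, which matches the behavior described in step 4.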
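The coupling coefficient $c_{ij}$ is calculated by the softmax function:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{5}$$
where $b_{ij}$ represents the degree of correlation between capsule $i$ in layer $L$ and capsule $j$ in layer $L+1$, and the initial value of $b_{ij}$ is 0. Following the routing-by-agreement rule of [7], $b_{ij}$ is updated by the agreement between the prediction vector $\hat{u}_{j|i}$ and the squashed output $v_j$ of higher capsule $j$:
$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j \tag{6}$$
The update in Equation (6) is repeated until the iteration requirements are met.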
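The routing procedure can be sketched end to end in NumPy as follows. This is a minimal illustration of routing-by-agreement as described in [7], not the authors' code; the function names, the tensor layouts (`W` of shape `(n_in, n_out, p, k)`), and the random inputs in the demo are assumptions made for the example.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-8):
    """Non-linear squashing: keeps direction, maps length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def prediction_vectors(W, u):
    """u_hat[i, j] = W[i, j] @ u[i]; W: (n_in, n_out, p, k), u: (n_in, k)."""
    return np.einsum("ijpk,ik->ijp", W, u)


def dynamic_routing(u_hat, num_iters=3):
    """Route lower capsules to higher capsules; u_hat: (n_in, n_out, p)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                       # routing logits start at 0
    for _ in range(num_iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))  # numerically stable softmax over j
        c = e / e.sum(axis=1, keepdims=True)          # coupling coefficients
        s = np.einsum("ij,ijp->jp", c, u_hat)         # weighted sum per higher capsule
        v = squash(s)                                 # higher-capsule outputs
        b = b + np.einsum("ijp,jp->ij", u_hat, v)     # agreement update
    return v, c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 3, 4, 8))   # 6 lower capsules (k=8), 3 higher capsules (p=4)
    u = rng.normal(size=(6, 8))
    v, c = dynamic_routing(prediction_vectors(W, u), num_iters=3)
    print(v.shape, c.sum(axis=1))
```

The returned `c` holds the coupling coefficients used to compute the final `v`; each row sums to 1, as the constraint on $c_{ij}$ requires.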
The loss function of the capsule network is the margin loss, which is computed for each class $c$ by Formula (7):
$$L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda \, (1 - T_c) \max(0, \|v_c\| - m^-)^2 \tag{7}$$
where $\|v_c\|$ is the length of the output capsule for class $c$, and $T_c = 1$ if an object of class $c$ is present and $T_c = 0$ otherwise [53]. The original paper sets the parameters on the basis of experiments as $m^+ = 0.9$ and $m^- = 0.1$, and these values are kept unchanged in this paper. The coefficient $\lambda$, set to 0.5, down-weights the loss for absent classes so that initial learning does not shrink the lengths of the activity vectors of all classes. The total loss is the sum of the losses of all classes.
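A small NumPy sketch of the margin loss with the parameter values stated above ($m^+ = 0.9$, $m^- = 0.1$, $\lambda = 0.5$) is given below. The function name `margin_loss` and the example capsule lengths are illustrative assumptions; the form of the loss follows [7].

```python
import numpy as np


def margin_loss(v_norm, T, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over classes.

    v_norm: lengths of the class capsules, shape (..., n_classes)
    T:      one-hot targets of the same shape (1 if the class is present)
    """
    v_norm = np.asarray(v_norm, dtype=float)
    T = np.asarray(T, dtype=float)
    present = T * np.maximum(0.0, m_plus - v_norm) ** 2
    absent = lam * (1.0 - T) * np.maximum(0.0, v_norm - m_minus) ** 2
    return np.sum(present + absent, axis=-1)


if __name__ == "__main__":
    # Confident, correct prediction: no penalty.
    print(margin_loss([0.95, 0.05, 0.05], [1, 0, 0]))
    # Under-confident true class (0.5) and over-active wrong class (0.3).
    print(margin_loss([0.5, 0.3], [1, 0]))
```

In the second call the loss is $0.4^2 + 0.5 \cdot 0.2^2 = 0.18$: the true class is penalized for falling short of $m^+$, and the absent class for exceeding $m^-$, with the latter term halved by $\lambda$.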
One of the primary justifications for why a CapsNet should theoretically perform better than a CNN on the task of malware classification is that malware binaries represented as images have clear sections. In other words, the proposed model should capture hierarchical relationships better, which is important for a dataset composed of images with fixed sections.

Our Proposed Capsule Network
Convolutional neural network can extract image feature information very well, while capsule network has a better ability to identify the spatial position relationship, direction, and other attributes of the image, and the transmission loss of image feature information is very small. In this paper, we propose a modified capsule network architecture (MalCaps) for malware classification based on malware visualization and capsule network.
Two strategies were used to increase the accuracy of the proposed MalCaps. Firstly, the architecture proposed in this paper increases the number of convolution layers and tries to improve the learning of discriminative features. On the basis of experiments, the proposed model chooses to increase the number of convolution layers to two, which could also improve accuracy by creating more complex features before feeding to the primary capsule layer. As suggested in the literature, the more layers of the model, the more features the model can extract. Mathematically speaking, the structure of capsule neural network is much more complex, because it can input and output vectors rather than scalars. The advantage of not adding more layers is that the training time should be minimized.
Additionally, a random search is used to optimize the hyper-parameters of the convolutional part: the number of neurons in the first and second convolutional layers, the number of epochs, and the batch size. There are many algorithms that can iterate over experiments with different hyper-parameter configurations. Through their empirical experiments, Bergstra and Bengio showed that random search was more effective than grid search or manual search in optimizing hyper-parameters [38].
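The random search idea can be illustrated with the following standard-library sketch. Everything specific in it is hypothetical: the candidate values in `SEARCH_SPACE` are not the ranges used in the paper, and `toy_evaluate` is a placeholder score standing in for training a model and measuring its validation accuracy.

```python
import random

# Hypothetical candidate values; the paper does not list its search ranges.
SEARCH_SPACE = {
    "conv1_neurons": [3, 5, 8, 16],
    "conv2_neurons": [3, 5, 8, 16],
    "epochs": [10, 25, 50],
    "batch_size": [32, 64, 100],
}


def sample_config(rng, space):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(evaluate, space, n_trials, seed=0):
    """Evaluate `n_trials` random configurations and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng, space)
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_evaluate(cfg):
    """Placeholder objective; a real run would train and validate a model."""
    return -abs(cfg["conv1_neurons"] - 5) - abs(cfg["conv2_neurons"] - 5)


if __name__ == "__main__":
    print(random_search(toy_evaluate, SEARCH_SPACE, n_trials=20))
```

Unlike grid search, the number of trials is chosen independently of the size of the search space, which is the practical advantage of the approach.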
The network architecture of our proposed MalCaps model consists of the convolutional part, the primary capsule part, and the classification part, as Figure 5 shows, and the calculation process is shown in Formula (8). The input of the network is a 64 × 64 × 3 three-channel image, which is converted from the malware training samples according to the method described previously. A down-sampling method is used to resize the original images to this fixed format as the input of the model. The convolutional part uses 2 convolutional layers instead of 1 for feature extraction. The extra convolutional layer can provide more specific feature maps to be fed into the primary capsules. The first convolutional layer of MalCaps contains 5 kernels of size 3 × 3 with a stride of 1, without padding, and uses the ReLU function as a nonlinear activation function. The output of the first convolutional layer is given to the second convolutional layer as input to extract useful low-level features. The second convolutional layer retains the same structure as the first. The next layer is a primary capsule layer, which contains 32 different capsules that organize the processed feature maps into capsule nodes. Each capsule node uses 3 × 3 convolutional kernels with a stride of 3, generating 32 distinct 20 × 20 grids of 8-dimensional capsules that are fed into the classification capsules. The final layer is a high-level capsule layer, referred to as "Class Capsules", that performs the classification. The number of routing iterations in the capsule layer is 3 and the dimension of each primary capsule is 8, as suggested to be optimal in [7]. Finally, the 9 × 16 output (9 is the number of categories and 16 is the dimension of each capsule) is normalized by L2 normalization as the output of the network, which represents the probability of the existence of each malware category. The decoder unit consists of three fully connected layers with 512, 1024, and 12,288 neurons, respectively.
The number of neurons in the last fully connected layer is the same as the number of pixel values in the input layer, because the goal is to minimize the sum of the squared differences between the input image and the reconstructed image. Another difference from the original paper is that the implementation uses a decay factor of 0.9 (for the learning rate decay) and a step of 1 epoch, which were shown to be optimal in experiments run by Guo [54].
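The layer sizes quoted above can be checked with a short calculation. The sketch below uses the standard output-size formula for an unpadded convolution; the helper names are illustrative, and the total capsule count is simply derived from the stated 32 grids of 20 × 20 capsules.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def malcaps_shapes(input_size: int = 64, channels: int = 3) -> dict:
    """Trace the spatial sizes through the layers described in the text."""
    conv1 = conv_output_size(input_size, kernel=3, stride=1)  # first 3x3 conv, no padding
    conv2 = conv_output_size(conv1, kernel=3, stride=1)       # second 3x3 conv, same structure
    primary = conv_output_size(conv2, kernel=3, stride=3)     # primary capsules, stride 3
    return {
        "conv1": conv1,
        "conv2": conv2,
        "primary_grid": primary,
        "primary_capsules": 32 * primary * primary,           # 32 grids of 8-D capsules
        "decoder_output": input_size * input_size * channels, # pixel values to reconstruct
    }


if __name__ == "__main__":
    print(malcaps_shapes())
```

The trace gives 64 → 62 → 60 → 20, which agrees with the 20 × 20 primary capsule grids, and 64 × 64 × 3 = 12,288 agrees with the size of the last decoder layer.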

Evaluation Metrics
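In this section, we verify the validity of the proposed method on the Microsoft Malware Classification Challenge dataset. Metrics including precision, accuracy, recall, and F1-score are used for the quantitative assessment of performance as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.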
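As a worked example of these definitions, the following sketch computes the four metrics from raw counts. The function name and the counts in the demo are made up for illustration and are not results from the paper; for a multi-class problem, the counts would be taken per class in a one-vs-rest fashion.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, accuracy, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "accuracy": accuracy,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only.
    print(classification_metrics(tp=8, fp=2, tn=85, fn=5))
```

With these counts the accuracy is 0.93 while the recall is only 8/13, which shows why accuracy alone can look favorable for a minority class in an imbalanced dataset.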

Results
Our experiments were implemented on a server with 2 Intel Xeon E5-2680 processors, with no GPU acceleration, and 64 GB of RAM.
After image preprocessing, the malware images are divided into two parts: the MMCC dataset was split into 70% for training and 30% for testing. From Table 1 we can see that the distribution of the classes in the dataset is highly imbalanced, with the number of samples per class ranging from 42 for the class Simda to 2942 for the class Kelihos_ver3, which makes this an imbalanced classification task. To reduce the impact of data imbalance on model accuracy, the `class_weight` parameter in the scikit-learn library is used to give higher weight to minority classes and lower weight to majority classes.
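One common way to derive such weights is the "balanced" heuristic documented in scikit-learn, where the weight of a class is `n_samples / (n_classes * count)`. The sketch below reimplements that formula in NumPy on toy labels; whether the authors used exactly this heuristic is an assumption, and the function name is illustrative.

```python
import numpy as np


def balanced_class_weights(labels) -> dict:
    """Class weights inversely proportional to class frequency.

    Uses n_samples / (n_classes * count), the formula behind
    scikit-learn's class_weight="balanced" option.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


if __name__ == "__main__":
    # Toy labels: class 0 is the majority, class 2 the minority.
    toy_labels = [0] * 6 + [1] * 3 + [2] * 1
    print(balanced_class_weights(toy_labels))
```

The resulting dictionary has the form accepted by `class_weight` arguments in common training APIs, so that errors on minority classes contribute more to the loss.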
The progress of CNNs in the image recognition field has been remarkable. For performance comparison, we first trained the popular CNN architecture LeNet-5 on the dataset, which resulted in an evaluation accuracy of 93.71%. The accuracy and loss curves for training and testing are shown in Figure 6. Training becomes stable after approximately 25 epochs, but the model loss is slightly higher in the testing phase. Table 2 shows the training accuracy, precision, recall, F1, and MCC for each malware class. These statistics show promising results for the LeNet-5 model, despite its simplicity. The per-class precision shows that most malware samples were correctly classified, with only classes 0, 6, 7, and 8 below 90%. The per-class recall shows that for class 6 only 28.57% of the samples were identified correctly, because its share of the dataset is much smaller than that of the other classes. The per-class accuracy scores are above 98% for all 9 classes.
To optimize the hyper-parameters of the convolutional layers within the proposed MalCaps, a random search was then conducted on the LeNet-5 comparison model. In each training run, a different hyper-parameter configuration was sampled at random. As Table 3 shows, the best accuracy, 99.03%, was obtained with the following configuration: convolutional layer 1 neurons = 5, convolutional layer 2 neurons = 5, epochs = 50, batch size = 100; this configuration is also used in the proposed MalCaps. The hyper-parameter-optimized LeNet-5 was trained on the MMCC dataset, and the summary statistics in Table 4 show better and more realistic results, with improvements in several per-class accuracies. The best accuracy was obtained on class 7, with 99.86% accuracy and 100% per-class precision. The average accuracy fluctuated between 95.79% and 98.7% during training and between 92.27% and 94.85% during testing, as shown in Figure 7.
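A random search over these four hyper-parameters can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the candidate values in SEARCH_SPACE are hypothetical (the search ranges are not listed in the text), and toy_evaluate is an arbitrary stand-in for training LeNet-5 and returning its validation accuracy.

```python
import random

# Hypothetical candidate values; the paper does not list its search ranges.
SEARCH_SPACE = {
    "conv1_neurons": [3, 5, 8, 16],
    "conv2_neurons": [3, 5, 8, 16],
    "epochs": [25, 50, 100],
    "batch_size": [50, 100, 200],
}


def sample_config(rng):
    """Draw one value per hyper-parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials, seed=0):
    """Return the best (config, score) pair over n_trials random samples."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Arbitrary toy score so the loop runs without a deep learning library.

    In a real experiment this would train the network with `config`
    and return its validation accuracy.
    """
    gap = config["conv1_neurons"] - config["conv2_neurons"]
    return -(gap ** 2) - config["batch_size"] / 1000


best_config, best_score = random_search(toy_evaluate, n_trials=20, seed=0)
print(best_config, best_score)
```

Fixing the seed makes the sampled configurations reproducible, which helps when the search results are reported in a table such as Table 3.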
We then compared the MalCaps model with the hyper-parameter-optimized LeNet-5 CNN on the malware discrimination task. MalCaps was trained for 50 epochs. Figure 8 shows that the training loss decreases and the training accuracy increases over the epochs, reaching an accuracy of 99.89%. These promising results suggest that MalCaps can be used for malware classification. The final implementation achieved an overall precision of 99.34%, which is better than that of the hyper-parameter-optimized CNN model in the previous experiment. Finally, the confusion matrix of the proposed model, shown in Figure 9, reveals that MalCaps classifies many malware families correctly. Its errors mainly involve samples from Kelihos_ver1, Lollipop, Obfuscator.ACY, Ramnit, and Tracur, some of which are misclassified as belonging to the Gatak family. In particular, most misclassifications come from samples of the Lollipop family.
A number of alternative deep learning architectures exist in the field of malware classification. To further evaluate the performance of our approach, we compared MalCaps with state-of-the-art methods in the literature that were evaluated on the dataset provided for Kaggle's Microsoft Malware Classification Challenge. These methods take as input either a grayscale image representing the malware's binary content or a set of features extracted from it with a feature extraction technique. The results are shown in Table 5. The approaches include traditional classification approaches [55][56][57], CNNs [58,59], ensemble models [48,60], a deep forest model [61], and our proposed approach. As observed in Table 5, MalCaps achieved results comparable to the others, together with a higher detection rate and macro F1-score.
Thus, we demonstrate that a capsule network based method can be successfully combined with grayscale image features to achieve good results in the malware classification task. In addition, Narayanan et al. reported the highest accuracy for the MMCC dataset, but their network has two complex phases and they do not report the F1-score even though the dataset is highly imbalanced.

Discussion
Although MalCaps achieves a commendable accuracy, these results are not without limitations. The per-class accuracy varies, and the confusion matrices presented in the previous section show that the models failed to classify some samples of the minority families. Additionally, the training time is considerably higher: the optimized CNN model took around 24 min to train for 200 epochs (about 0.12 min per epoch), whereas the capsule network took 289 min for 50 epochs (about 5.8 min per epoch).
Capsule neural networks are a recent development in the field of deep learning, and much research remains to be done before they can be deployed in industry to protect against malware. For future work, there are a number of ways to improve the proposed MalCaps in terms of effectiveness, efficiency, and training time. From the algorithmic point of view, one option is to increase the depth of the network, as is done with CNNs. The trade-off is that training such a deep capsule network requires much more computational power. This may not be the ideal route, because it assumes that the mapping learned from the data will remain valid in the future, and this assumption does not hold in the malware domain. Malware evolves over time, new malware is created daily, and training may not currently be fast enough to keep up with malware development. Additionally, the similarity between previous and future versions degrades slowly over time; this is known as the problem of concept drift.
From the data point of view, taking more information about the malware as input should help. One approach is to process the images in color rather than grayscale, which allows more feature details, such as character information, byte stream information, and PE structure information, to be encoded in the RGB channels. These features, which represent the malware byte files, are not available in a purely grayscale image. Another approach is to preserve the semantic meaning of each byte of the raw binary file in the preprocessing step, although this requires a suitable way to compress a large binary file without losing that meaning in the final representation.

Conclusions
Distinguishing and classifying different types of malware is an important task, as it provides information for understanding how the malware has infected computers or devices, what threat level it poses, and how to protect against it. In this paper, we proposed a modified capsule network, MalCaps, for the classification of malware. The method first converts the software samples into grayscale images and then trains the capsule network model on them. To the best of our knowledge, there are at present only a few studies using capsule networks in malware analysis, probably because this is a very recent technique in the deep learning field. Experiments show that the proposed MalCaps architecture is successful, with results quite similar to those of an expert and a highest accuracy of 99.37%. One of the key benefits of the capsule network model over traditional deep learning models (e.g., CNNs) is that it takes into account the spatial information that is usually lost in the pooling layers of CNNs.
This paper describes an attempt to apply the capsule network in the field of malware classification. The capsule network may be considered one of the most innovative and successful models in the field and should be evaluated in further studies. For future work, if appropriate computational power becomes available, deeper capsule network architectures and different data pre-processing techniques are potential ways to increase generalization performance. Furthermore, we plan to train the method on real-time data.