Image-Based Malware Classification Using VGG19 Network and Spatial Convolutional Attention

In recent years, the amount of malware spreading through the internet and infecting computers and other communication devices has increased tremendously. To date, countless techniques and methodologies have been proposed to detect and neutralize these malicious agents. However, as new and automated malware generation techniques emerge, large volumes of malware continue to be produced that can bypass some state-of-the-art malware detection methods. Therefore, there is a need for the classification and detection of these adversarial agents that can compromise the security of people, organizations, and countless other forms of digital assets. In this paper, we propose a deep learning framework based on spatial attention and a convolutional neural network (SACNN) for image-based classification of 25 well-known malware families with and without class balancing. Performance was evaluated on the Malimg benchmark dataset using precision, recall, specificity, precision, and F1 score, on which our proposed model with class balancing reached 97.42%, 97.95%, 97.33%, 97.11%, and 97.32%. We also conducted experiments on SACNN with class balancing and an additional benign class, which also produced results above 97%. The results indicate that our proposed model can be used for image-based malware detection with high performance, despite being simpler than other available solutions.


Introduction
Malware (also known as a malicious program) is software developed intentionally to cause harm to computer systems. It is used to attack, infiltrate, or gain access to digital assets that may be very sensitive in nature, or to cause damage or unwanted results in a system [1]. The primary purpose is to cause harm and invade assets whose access is not public. On average, 360,000 new malware files were detected every day in 2020, and the number of files found daily has increased by 5.2%. This rapid growth in malware production and distribution became possible due to the use of intelligent and automatic malware generation software such as SpyEye of Zeus and denial of service [2,3]. Newer dangers are evolving as blended threats continue to combine various types of assault into one with more deadly payloads. Despite the emergence of ground-breaking security defensive technologies, hacking, spoofing, phishing, and spyware are rising at an alarming rate. Moreover, phishing attacks are on the rise in many cases, leading to tricking the

• To the best of our knowledge, this study is the first attention-based malware detection method covering the classification of 25 malware families and a benign class.
• The study proposes a transfer learning-based architecture that uses spatial convolutional attention to classify malware from multiple families, both with class weighting and without class balancing techniques.
• Lastly, we have performed extensive experiments using metrics such as precision, recall, F1-measure, confusion matrix, and ROC-AUC curves, and obtained results above 97% with and without class balancing.
The remaining sections are as follows: Section 2 presents an extensive literature review; Section 3 explains in detail the malware benchmark dataset used along with data pre-processing, and highlights the technologies used and their inherent limitations due to which we pivot towards more reliable methods (this section also includes the architecture of our proposed model); Section 4 describes experimental details such as setup, parametrization, and then discusses the results along with important visualizations; and Sections 5 and 6 present the discussion, limitations, and conclusions, where we describe the processes and highlight important details.

Related Work
Traditional signature and heuristic approaches to identifying malicious software do not provide a sufficiently high degree of detection for new and previously undiscovered malware types. This raises the question of whether ML approaches can be used to solve this problem. Sophisticated deep learning methods combined with transfer learning techniques are used to improve the resilience and accuracy of malware detection without requiring advanced security understanding.
Rezende et al. [29] proposed a neural network architecture with transfer learning using ResNet-50. They used RGB images of size 224 × 224 with 10 folds, the Glorot uniform approach for weight initialization, and Adam optimization. The model was trained for 750 epochs with a final accuracy of 98.62%. They also used GIST features with kNN using k = 4, resulting in an accuracy of 97.48%, with bottleneck features producing an accuracy of 98.0%. Khan et al. [30] conducted an extensive analysis of transfer learning for malware classification using ResNet and GoogleNet with their data preparation pipeline and the top model. ResNet-18, -34, -50, -101, and -152 achieved accuracies of 83%, 86.51%, 86.62%, 85.94%, and 87.98%, respectively. The accuracy for GoogleNet was 84%.
Vasan et al. [31] used an ensemble model with VGG16 and ResNet-50, both of which were fine-tuned, along with PCA to reduce 90% of the features from the dataset before feeding them into a one-vs-all multiclass SVM. They fine-tuned their model for 50 epochs and trained their CNN model for 100 to 200 epochs, achieving an accuracy of 99.50%. Yosinski et al. [32] proposed a model with 15 classes and 7087 examples, using different types of feature extraction techniques and producing a highest accuracy of 97.47%. Nataraj et al. [33] used feature extraction techniques such as GIST descriptors together with ML algorithms such as kNN to produce an accuracy of 97%. Their algorithm uses static feature classification and computed bi-gram distributions. This technique has the basic flaw that if adversaries know which features are used, they can take countermeasures and avoid detection completely. Agarap et al. [34] proposed hybrid architectures combining CNN or LSTM networks, as well as other deep learning models, with an SVM. Their CNN-SVM model reached 77.22% accuracy, the GRU-SVM model 84.92%, and the MLP-SVM hybrid model 80.46%.
Akarsh et al. [35] proposed a CNN-LSTM hybrid model with a special transformation of the images. Their two-layer CNN was connected to an LSTM layer with 70 memory blocks and an FCN layer with 25 units, using softmax and categorical cross-entropy. The final accuracy on different splits ranged from 96.64% to 96.68%. In another paper, Akarsh et al. [36] used two layers of 1D CNN along with an LSTM for feature extraction, with 0.1% dropout, 70 LSTM memory blocks, and a cost-sensitive algorithm added to their model. They reported a highest accuracy of 95.5%. Sudhakar and Kumar [37] upgraded the ResNet50 model by replacing the last layer of the model pretrained on ImageNet with a fully connected dense layer. The softmax layer receives the output of the fully connected dense layer for malware classification. Vinayakumar et al. [38] proposed Ember, which used domain-level knowledge, different features from parsed portable executable (PE) files, and format-agnostic features such as a raw byte histogram.
Xiao et al. [39] proposed a malware classification framework (MalFCS) that visualized malware binaries as entropy graphs based on structural entropy. Then, deep CNNs were used to extract patterns shared by a family from entropy graphs. Finally, SVM was used to classify malware based on extracted features. Cui et al. [40] proposed to use the 'Bat Algorithm' for dynamic image resampling. Their purpose was to fight the imbalance in the dataset. Along with data augmentation, using this algorithm they were able to create a CNN with 94.5% accuracy.
In another study, Cui et al. [41] proposed a method for data equalization based on the NSGA-II genetic algorithm. Without equalization, the method produced 92.1% accuracy; with a single-objective algorithm, it produced 96.1% accuracy; and with the multi-objective algorithm, the highest accuracy of 97.1% was achieved. Jain et al. [42] used extreme learning machines (ELMs) with CNNs and proposed an ensemble model. They produced an accuracy of 96.30% with a single CNN layer and 95.7% with two CNN layers.
Naeem et al. [43] used an IoT-based hybrid visualization technique with deep learning. By using different image ratios, they were able to develop models with accuracies of up to 98.47% and 98.79%, but these were dependent on dynamic image features. Venkatraman et al. [44] proposed a hybrid architecture with a self-learning system. They proposed hybrid CNN BiLSTM and CNN BiGRU models and trained them with both cost-sensitive and cost-insensitive methods. Their models, with different types of parameters and settings, range in accuracy from 94.48% to 96.3%.
Vu et al. [45] proposed a CNN-based architecture with transformations on the input images such as byte class, gradient, Hilbert, entropy, and hybrid image transformation, with GIST and CNN-based models. Their GIST model with grayscale images produced 94.27% accuracy, and the CNN performed best with the hybrid image transformation (HIT) technique. El-Shafai et al. [46] proposed a malware multi-classification framework that uses pre-trained, fine-tuned CNN models (AlexNet, DenseNet-201, DarkNet-53, ResNet-50, Inception-V3, VGG16, MobileNet-V2, and Places365-GoogleNet) with transfer learning, with VGG16 achieving the best performance for the malware recognition task.
Moussas and Andreatos [47] proposed a malware detection system based on a two-level ANN which used both file and image features. File features were used by the first-level ANN to classify malware, while the malware families that caused confusion were classified by a second level of ANNs using malware image features. Roseline et al. [48] used an ensemble deep forest method for malware identification and classification. Instead of relying on hand-crafted feature descriptors, the proposed method is data-independent and learns the discriminative representation from the data itself. Deep ensemble stacking and low model complexity distinguish the proposed method, which beats deep neural networks in identifying malware.
Verma et al. [49] suggested using a combination of first-order and grey-level co-occurrence matrix (GLCM)-based second-order statistical texture features, which are classified using ensemble learning. A kernel-based ELM classifier was used for malware classification, achieving 94.25% accuracy on the Malimg dataset. Çayır et al. [50] adopted an ensemble model of capsule networks (CapsNet). Instead of complex CNN architectures and domain-specific feature engineering techniques, the CapsNet model employs simple architecture engineering. Furthermore, CapsNet does not need transfer learning, and the model can be easily trained from scratch. Wozniak et al. [51] suggested using an RNN-LSTM classifier with the NAdam optimization algorithm for Android malware recognition. The performance evaluation on two benchmark datasets showed a 99% accuracy. Nisa et al. [52] proposed a feature fusion technique for combining features derived from pre-trained AlexNet and Inception-v3 deep neural networks with features extracted from images depicting malware code using segmentation-based fractal texture analysis (SFTA). SVM, KNN, decision tree (DT), and other classifiers are used to classify the features retrieved from malware images. Hemalatha et al. [53] adopted the DenseNet model with a reweighted class-balanced loss function to gain substantial performance improvements in classifying malware images by dealing with the difficulties of unbalanced data.
Finally, Toldinas et al. [54] used a multistage deep learning image recognition approach for network intrusion detection. The network characteristics are converted into four-channel images (Red, Green, Blue, and Alpha). The images are then used to train and evaluate the pre-trained deep learning model ResNet50. UNSW-NB15 and BOUN DDoS, two publicly available benchmark datasets, are used to test the technique.
In summary, the adoption of deep learning methods to recognize malware and network intrusions from features converted to images is currently on the rise, and a wide variety of neural network models and architectures are being explored, modified, and adopted. Nevertheless, considering the plethora of deep learning architectures available, with a great number of hyperparameters, more research is still needed to find the best solutions suitable for the cybersecurity domain.

Materials and Methods
In this section, we describe all the technologies used and all the essential background information necessary to understand the proposed methodology.

Dataset
The malware image dataset, also called the Malimg dataset, was provided by Nataraj et al. [33]. The dataset contains 9389 grayscale images from 25 malware families. The Malimg dataset has previously been used as a benchmark in numerous papers to evaluate malware detection methods, including those intended for IoT environments [43,55]. It covers some of the well-known malware families and their different variants. The malware images were created from malware binaries as described by Conti et al. [25]. The conversion from binaries to images is done by first converting the binaries into vectors of 8-bit values, which are then converted to grayscale images by treating each 8-bit value as a pixel intensity. Figure 1 illustrates the construction of the malware image from a malware binary.
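As a minimal sketch of the binary-to-image conversion described above, the following Python function arranges raw bytes into rows of grayscale pixels. The function name `bytes_to_grayscale`, the fixed `width` parameter, and the choice to drop a trailing partial row are illustrative assumptions, not details taken from the dataset authors.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int) -> List[List[int]]:
    """Arrange raw bytes into rows of `width` pixels (intensities 0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Each byte is already an 8-bit unsigned value, i.e. one pixel intensity.
    pixels = list(data)
    # Assumption: drop a trailing partial row so that all rows have equal length.
    usable = len(pixels) - (len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, usable, width)]


# Hypothetical 7-byte "binary"; the last byte does not fill a row and is dropped.
image = bytes_to_grayscale(bytes([0, 255, 16, 32, 64, 128, 7]), width=3)
print(image)  # [[0, 255, 16], [32, 64, 128]]
```

In practice the row width would be chosen as a function of the file size, and the resulting array would be saved with an imaging library; both steps are omitted here.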
It can be visually verified that images created from the binaries of related classes share strong visual similarities and, conversely, differ from the images of unrelated classes, which indicates that image-based classification is well suited to this dataset. Figure 2 shows representative grayscale images of malware.
Figure 2. Grayscale images of malware: (a) unrelated classes of malware, and (b) related classes from the same malware family [33].
Another point of interest is that the malware images are very representative of the code layout in the source binaries of the malware; by aligning the code file with the grayscale image, we can observe a high degree of correlation between the structure of the code file and the generated images in the dataset. Figure 3 shows an image generated from a binary together with this correlation.
The details of the dataset, containing 9389 images from 25 classes of malware along with their families, are given together with a frequency visualization. From the table and histogram, it can be seen that Allaple.A and its variant Allaple.L, from the worm family of malware, have the highest frequencies in the dataset, with a few thousand examples. We also have Swizzor.gen!E from the Trojan family and Wintrim.BX from the Trojan downloader family, with fewer than a hundred examples. This shows that the dataset is highly unbalanced, with wide gaps in the number of examples in each of the classes. We can further observe the dataset by using a scatter plot to visualize the different classes present in the dataset.
Figure 4 shows the number of samples of malware families in the dataset.
To conclude, we have seen a few issues (namely the imbalance of the dataset and the structural correlation between the binary and the image), on which we do not comment further since the classification can still produce very good results.
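One common way to counter such an imbalance is class weighting. The sketch below uses the widely used "balanced" heuristic, in which each class receives the weight n_samples / (n_classes × class_count); this particular formula and the toy label counts are assumptions for illustration and are not stated in the text.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class that is inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rare classes get weights above 1, frequent classes get weights below 1.
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy label list; these are NOT the real Malimg class frequencies.
labels = ["Allaple.A"] * 6 + ["Wintrim.BX"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this kind can be passed to a training loop so that errors on rare families contribute more to the loss than errors on frequent ones.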

Data Pre-Processing
The dataset was imported in its general image form, which had been obtained using a wavelet approach for transforming binaries to images. We used the Malimg dataset as described in the previous section. After importing the dataset from Kaggle, we used the TensorFlow implementation of Keras to load the images and train our network in batches using the Keras flow, with the image size set to 224 × 224. Using the RGB color mode, we increased the number of channels of the images; this also creates batches suitable for training, so no complex preprocessing of the simple images is required. Other than the conversion of the images from grayscale to RGB, no other transformation was applied to the dataset.
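To make the effect of these loading settings concrete, the following library-free sketch resizes a grayscale image to a target size and replicates the single channel three times. The helper names and the use of nearest-neighbor interpolation are assumptions made for illustration; the actual pipeline relies on the Keras loader rather than on code like this.

```python
def resize_nearest(gray, height, width):
    """Resize a 2D list of intensities with nearest-neighbor sampling."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def gray_to_rgb(gray):
    """Replicate each intensity into an (R, G, B) triple with equal channels."""
    return [[(p, p, p) for p in row] for row in gray]


# Tiny 2 x 2 stand-in for a malware image, resized to 4 x 4 instead of 224 x 224.
tiny = [[0, 255], [128, 64]]
resized = resize_nearest(tiny, 4, 4)
rgb = gray_to_rgb(resized)
print(len(rgb), len(rgb[0]), rgb[0][2])  # 4 4 (255, 255, 255)
```

Replicating the channel adds no new information; it only matches the three-channel input shape expected by networks pre-trained on color images.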

Convolutional Neural Networks
Convolutional neural networks (CNNs) are a type of deep learning model inspired by the human visual cortex. They are feed-forward neural networks that have proven to be immensely successful in areas such as image processing, digital signal processing, and other fields. Their use of parameter sharing makes them well suited to tasks related to image processing and classification. A well-suited non-linearity for CNNs is the rectified linear unit (ReLU), which has proven to produce better results since it mitigates problems such as vanishing gradients. However, there is a problem with the CNN architecture. CNNs contain max-pooling layers, which downsample an image: a fixed-size window is moved over the image and each window is reduced to a single value. The point to note is that this reduction is different from convolution, since the values are not merged to compute an output but are simply ignored. This discarding of information is a major challenge for CNNs, since important information can be lost due to pooling [56][57][58]. Figure 5 shows the process of downsampling in the CNN model.
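The information loss caused by max pooling can be seen in a small sketch. The function below applies non-overlapping max pooling to a 2D feature map; the window size of 2 and the example values are illustrative assumptions.

```python
def max_pool_2d(image, size=2):
    """Non-overlapping max pooling: keep only the largest value per window."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            # All values in the window except the maximum are discarded.
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[4, 5], [6, 7]]
```

Of the 16 input values only 4 survive, which is the discarding of information that attention mechanisms aim to compensate for.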

Figure 5. Downsampling inside a convolutional neural network with a pooling layer.

Spatial Attention Mechanism
The attention mechanism first became popular as an enhancement for encoder-decoder-based neural machine translation systems. It was introduced by Bahdanau et al. [59] in 2015 for neural machine translation, in an approach they described as jointly learning to align and translate. In that paper, they called it soft alignment and were able to produce state-of-the-art results. Figure 6 shows the attention mechanism in the LSTM model. Later, Vaswani et al. [60] demonstrated how using local or global single-head and multi-head attention can improve the results using very basic models. Since then, attention mechanisms have been a big breakthrough in deep learning. These breakthroughs demonstrated how simply built mechanisms such as these can be used to refine and enhance the results of already state-of-the-art models.
The idea behind attention is simple. It states that for tasks such as machine translation, what matters is not only that all the input words are converted to context vectors, but also that their relative importance is considered; each word should be weighted according to its relative importance. Attention can be classified into a few different types, two major ones being global and local attention. Global attention, also known as soft attention, assigns some weight to all of the patches of a sequence or an image. Hard attention, also known as local attention, on the other hand gives weight to only one patch of interest.
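The difference between soft and hard attention can be illustrated with a short sketch. Here the relevance scores are given as plain numbers and the softmax normalization is an assumption about how the weights are obtained; in a real model the scores would be produced by a learned scoring function.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(scores, values):
    """Soft (global) attention: every patch contributes, weighted by relevance."""
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


def hard_attention(scores, values):
    """Hard (local) attention: only the single best-scoring patch is kept."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, values[best]


# Three hypothetical patches with 2-dimensional feature vectors.
patch_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patch_scores = [2.0, 0.5, 0.1]
soft_weights, soft_context = soft_attention(patch_scores, patch_values)
hard_index, hard_value = hard_attention(patch_scores, patch_values)
print(soft_weights, soft_context, hard_index)
```

Soft attention is differentiable and can be trained with ordinary backpropagation, whereas the discrete selection in hard attention generally needs other training strategies.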
From this perspective, and to build on the previous discussion, we also explain how attention is used outside of sequential models. Attention has various types and forms that have been created over the span of a few years, ranging from simple to far more complex uses. Here, we use a version of attention similar to the one used by Bahdanau et al. [59]. Another form is the spatial transformer network, which uses a localization network along with parameterized grid sampling and image sampling. This is a very strong application of attention in the field of image processing. More modern applications use residually connected architectures for tasks in this domain. Figure 7 shows the spatial attention process used to enhance features before the CNN.

Figure 7. Spatial attention is added to enhance features before they are fed into the CNN.
The importance of and need for attention arise from the fact that it can help overcome information bottlenecks in deep learning models. Information bottlenecks are created when too much information has to pass through a very narrow window, making it increasingly difficult for the network to retain useful information.

Our Proposed Architecture
The proposed model architecture consists of three main parts. The first part is VGG19 [61], a transfer learning model pre-trained on ImageNet. The complete architecture of that model is described in greater detail elsewhere [61]. Here we used only the base of the VGG19 model, and the top part was removed. The base model has 20,025,920 non-trainable weights. To emphasize the benefit of attention over other approaches, we froze all the layers of the base VGG19 model, so that its only purpose was to act as a feature extractor using convolution layers with the pre-trained weights. With this architecture it would be possible to fine-tune the base model and further improve accuracy, but we refrain from doing so because we aim to show the improvement that attention brings to our model and how it can perform much better than other models despite using a very simple form of attention mechanism. For the comparative analysis, we compared only against the results of VGG19. After analyzing the models, we observed that the VGG19 model performs better on the malware dataset. Therefore, we use VGG19 as part of the proposed malware detection framework through transfer learning [62].
The next part of our model is the CNN model enhanced by attention. For this purpose, we used a very simple form of attention. As mentioned earlier, the model is intended to show how a simple variant of the attention mechanism can capture a lot of information. The technique that we used to generate attention is called dynamic spatial convolution. It is a type of spatial attention that works well for image processing and vision tasks. Dynamic spatial convolution uses a global average pooling (GAP) mechanism, which is easy to motivate: in any given image, not all areas are equally important. Some specific regions are better suited for the task and are more useful [63].
The spatial attention is generated from these normalized vectors using 2D convolutional layers. In the end, they are combined using a lambda layer, which is used here to rescale the GAP. The attention-enhanced feature maps generated by this layer are fed into a dropout layer with a 25% dropout rate attached to a dense layer with 256 units. The output is passed through another 25% dropout layer for regularization and then through a fully connected layer with 25 units and SoftMax activation. Finally, Figure 8 visualizes our proposed neural network architecture.
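The rescaling step can be illustrated with a small, framework-free sketch. It assumes that the lambda layer divides the GAP of the attention-masked features by the GAP of the attention mask, and it replaces the stack of 2D convolutional layers with a single 1x1 convolution followed by a sigmoid. Both are simplifying assumptions made for illustration; the function name, weights, and toy feature map are hypothetical and are not taken from our implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_rescaled_gap(features, channel_weights, bias=0.0):
    """Pool an H x W x C feature map (nested lists) into a length-C vector.

    A single 1x1 convolution followed by a sigmoid gives one attention score
    per spatial location; the result is GAP(mask * features) / GAP(mask).
    """
    height, width = len(features), len(features[0])
    channels = len(features[0][0])
    mask = [
        [sigmoid(sum(w * v for w, v in zip(channel_weights, features[i][j])) + bias)
         for j in range(width)]
        for i in range(height)
    ]
    mask_total = sum(sum(row) for row in mask)
    pooled = []
    for c in range(channels):
        weighted_total = sum(
            mask[i][j] * features[i][j][c]
            for i in range(height) for j in range(width)
        )
        # The 1 / (H * W) factors of the two averages cancel in the ratio.
        pooled.append(weighted_total / mask_total)
    return pooled


# Hypothetical 1 x 2 feature map with a single channel.
toy_features = [[[0.0], [2.0]]]
plain_average = attention_rescaled_gap(toy_features, channel_weights=[0.0])
attended = attention_rescaled_gap(toy_features, channel_weights=[1.0])
```

With zero weights the mask is uniform and the result reduces to ordinary global average pooling; with a positive weight the location with the stronger activation contributes more to the pooled value.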
There is a clear class imbalance among the classes, as shown in Figure 4. Several class balancing techniques exist, including random oversampling, random undersampling, hybrid sampling, and class weighting. We applied class weighting in our models, which balances the classes by assigning a higher weight to the classes with fewer samples and thereby penalizes errors in the minority classes more heavily [64,65]. After class weighting, all classes are equally balanced.
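As an illustration of class weighting, the following sketch computes weights that are inversely proportional to the class frequencies. The specific formula, n_samples / (n_classes * class_count), is a common heuristic and is an assumption here, since the exact weighting scheme is not stated above; the class names and counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels; the real per-family counts are not reproduced here.
toy_labels = ["family_a"] * 6 + ["family_b"] * 2 + ["family_c"] * 4
weights = balanced_class_weights(toy_labels)
```

Under this scheme the rarest class receives the largest weight, and the weighted sample counts of all classes become equal. In Keras, a dictionary of this kind, keyed by integer class index, can be passed to the `class_weight` argument of `Model.fit`.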

Experimental Results
We have proposed a relatively simple architecture that requires next to no preprocessing and works directly on the simplest available benchmark dataset of grayscale images. Our methodology requires no special transformation of the dataset, and we do not need to generate any static or dynamic features, as many other approaches do. Using a very simple architecture, we can outperform almost all other models. Many models that are far more complex and more tedious to train and use achieve high accuracy scores, but they are severely limited, as we observed in the literature review.
We ran our experiments on Google Colab using Python 3.5, TensorFlow, scikit-learn, and the Keras library. The labels were inferred from the dataset directory structure by the TensorFlow Keras function image_dataset_from_directory [66].
In the following subsections, we describe our approach and the framework functionalities that we used to implement our proposed malware recognition architecture. As described in the introduction, many of the architectures used to date rely on special processing of the input. In this paper, we aim to replace these tedious approaches with something simpler, namely attention.

Experimental Setup
The dataset was split in the commonly used ratio of 70:30 (70% training and 30% testing sets). In total, the training set consisted of 8404 files belonging to 25 classes, and the testing set consisted of 935 files belonging to 25 classes. The shape of the training images was (8404, 224, 224, 3) and the shape of the labels was (8404, 25).
The attention generation process uses 2D convolutional layers with the ReLU activation function, "same" padding, and 2D kernels of size one. The locally connected layer uses sigmoid activation. The upscaling layer that spreads the attention to the other layers simply uses linear activation. The first dense layer uses the exponential linear unit (ELU) activation function, and the last top layer uses SoftMax activation. The model was compiled using the Adam optimizer with categorical cross-entropy as the loss function and accuracy as the performance metric.
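For reference, the ELU and SoftMax activations and the categorical cross-entropy loss named above can be written out as follows. This is a minimal sketch of the standard definitions, not our training code; the logits and the one-hot target are hypothetical.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hypothetical three-class example; the true class is the first one.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)
```

The loss shrinks towards zero as the probability assigned to the true class approaches one, which is the quantity the optimizer minimizes during training.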
The complexity of the proposed models is evaluated using the number of parameters as a proxy. Across the base and top parts of the network, the total number of parameters is 20,199,402, which includes the VGG19 weights, the trainable weights in the top CNN layers, and the fully connected layers. Since we froze the VGG19 layers, the number of trainable parameters was reduced to only 173,482. The number of non-trainable parameters is 20,025,920, as shown in Table 1.
To construct a model with good generalizability and to avoid overfitting, the model was trained using Keras callbacks, which we explain here together with their parameters. The most important callbacks we used are early stopping and reduce LR on plateau. The early stopping callback monitors the validation loss of the model and, according to its parameterization, decides when to stop the training process. The most important parameter is the patience, which was set to five after a few experiments. A patience of five means that training waits for five epochs for the validation loss to decrease by at least the min delta parameter. By setting min delta to a low value such as 1 × 10⁻³, we accept even a very small decrease in validation loss as an improvement. This callback is also responsible for restoring the weights learnt at the best epoch as the final weights of the model.
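The stopping rule can be made concrete with a short simulation. This is a simplified sketch of the logic, assuming that an epoch counts as an improvement only when the validation loss drops below the best value by more than min delta; it is not the Keras implementation, and the validation losses below are hypothetical.

```python
def early_stopping(val_losses, patience=5, min_delta=1e-3):
    """Return (stop_epoch, best_epoch) as 0-based indices into val_losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses that flatten out after the fifth epoch.
toy_losses = [0.90, 0.50, 0.30, 0.20, 0.15,
              0.1495, 0.1499, 0.1501, 0.1496, 0.1498, 0.14, 0.13]
stop_epoch, best_epoch = early_stopping(toy_losses)
```

In this toy run the changes after the fifth epoch are smaller than min delta, so training stops five epochs later and the weights of the fifth epoch would be restored. In Keras, the corresponding `EarlyStopping` callback exposes `patience`, `min_delta`, and `restore_best_weights` parameters.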
Another important callback was the learning rate scheduler. Since a poorly chosen learning rate can cause problems, it makes sense to change it as the number of epochs grows and the improvement in validation loss becomes negligible. The learning rate scheduler used here is reduce LR on plateau. Its important parameters are patience, cooldown, and factor. The scheduler keeps observing the validation loss. If the validation loss stops decreasing, it waits for the patience number of epochs and then multiplies the current learning rate by the factor parameter, reducing it to a smaller value such as 0.01 or 0.001. This continues until reducing the learning rate no longer improves the validation loss. The maximum number of epochs was set to 100. However, in all of the experiments that we conducted, we never had to train a model for more than 10-15 epochs, and we reached our threshold accuracies in as few as four to five epochs. This again shows the high potential of our proposed method.
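The interplay of patience, cooldown, and factor can be sketched in the same way. The simulation below is a simplified approximation of the scheduler's behavior, not the Keras implementation; the initial learning rate, the parameter values, and the validation losses are hypothetical, since the values used in our experiments are not listed above.

```python
def reduce_lr_on_plateau(val_losses, initial_lr=0.01, factor=0.1, patience=2,
                         cooldown=1, min_delta=1e-4, min_lr=1e-6):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    cooldown_left = 0
    schedule = []
    for loss in val_losses:
        if cooldown_left > 0:
            # During cooldown, epochs without improvement are not counted.
            cooldown_left -= 1
            wait = 0
        if loss < best - min_delta:
            best = loss
            wait = 0
        elif cooldown_left == 0:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                cooldown_left = cooldown
                wait = 0
        schedule.append(lr)
    return schedule


# Hypothetical validation losses that plateau after the second epoch.
toy_schedule = reduce_lr_on_plateau([0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4])
```

In this toy run the learning rate is reduced from 0.01 to 0.001 after two stagnant epochs and reduced again once the cooldown has passed and the loss has stagnated for two more epochs.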

Evaluation
We evaluated our results using accuracy, precision, recall, F1-measure, and the ROC-AUC curve. We tested not only on the malware classes but on a benign class as well, which attests that our model can classify both types of files reasonably. For this experiment, we added a benign class as an extra class in our dataset, which now also contains around three thousand images of benign code binaries. As a result, we have 25 malware classes and 1 benign class in our dataset, making it a 26-class classification problem. The results are reported for our spatial attention CNN with VGG19 with class balancing, without class balancing, and with class balancing including the benign class, as shown in Table 2. We ran multiple experiments with this setup, and for the benign class we obtained a precision of 100%, a recall of 100%, and an F1 score of 100%. The AUC of this class is also 100%. Table 3 shows the results for the extra benign class. For our model, we report the average accuracies for the models using VGG19. The average accuracy of SACNN with VGG19 with class balancing is higher with fewer epochs of training. The training and validation accuracy are shown in Figure 9a for SACNN VGG19 with class balancing on the 25 malware classes, and in Figure 9b for SACNN VGG19 with class balancing including the benign class.
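The per-class metrics named above can all be derived from a confusion matrix. The following sketch uses the standard one-vs-rest definitions; macro averaging over classes is an assumption made for illustration, and the confusion matrix is a hypothetical three-class example rather than one of our results.

```python
def per_class_metrics(confusion):
    """confusion[i][j] counts samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for k in range(n_classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n_classes)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results


def accuracy(confusion):
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


def macro_average(results, name):
    return sum(r[name] for r in results) / len(results)


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_confusion = [[8, 2, 0],
                 [1, 9, 0],
                 [0, 0, 10]]
toy_results = per_class_metrics(toy_confusion)
toy_accuracy = accuracy(toy_confusion)
toy_macro_f1 = macro_average(toy_results, "f1")
```

A class that is never confused with any other class, such as the third class in this toy matrix, obtains a precision, recall, and F1 score of 100%.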
The models perform very well in both training and testing. The best loss is reached around the fifth epoch, and the corresponding weights are restored by the early stopping callback described earlier.
Secondly, we also calculated the loss of our proposed model. The training and validation loss are shown in Figure 10a for SACNN VGG19 with class balancing on the 25 malware classes and in Figure 10b for SACNN VGG19 with class balancing including the benign class. Figure 11 shows the normalized confusion matrix obtained after classifying the malware samples into the 25 malware classes. Figure 12 shows the normalized confusion matrix for classification with class balancing and the benign class. The Receiver Operating Characteristic (ROC) curves with the calculated area under the curve (AUC) values for classification into the 25 malware classes are shown in Figure 13. Figure 14 shows the ROC curves with AUC values for classification with class balancing and the benign class.
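For a single class treated one-vs-rest, the AUC equals the probability that a randomly chosen sample of that class receives a higher score than a randomly chosen sample of any other class. The sketch below computes it by direct pairwise comparison, which is only practical for small sets; the scores and labels are hypothetical and are not taken from our experiments.

```python
def roc_auc(scores, is_positive):
    """Pairwise (rank-based) estimate of the area under the ROC curve."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for one class and the true membership.
perfect_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
partial_auc = roc_auc([0.9, 0.4, 0.6, 0.1], [True, True, False, False])
```

An AUC of 100% for a class, as reported for the benign class, means that every sample of that class is scored above every sample of the other classes.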

Discussion
The spatial attention-based deep learning model proposed in this paper has some advantages over other machine learning and deep learning models proposed for the malware recognition task. Most of these models use some type of feature engineering and require specialized algorithms, while others depend too heavily on feature generation and image transformations with data augmentation. All of this requires too much human intervention to pre-process the data and to create or select the malware features. Such pipelines have become increasingly impractical, since the speed of new malware generation has grown exponentially. In the face of this complexity, we have proposed a simple model that produces better results with next to no complexity in its working. It uses spatial attention generated by CNNs together with feedforward and dropout layers and has fewer trainable parameters, yet it can outperform most of the other deep learning models for malware recognition. We also compared our results with previous studies, as shown in Table 4.
To compare our method with state-of-the-art models, we first need to understand that most state-of-the-art models rely on many types of data pre-processing and image augmentation in order to produce good results. Our approach uses the most basic self-learning capabilities of neural networks and combines them with a simple spatial attention mechanism to avoid this pre-processing, which keeps the attention generation lightweight in terms of the operations it performs. Multi-head attention-based models such as transformers have wide application in computer vision, but such an intricate and extensive approach is not what we are aiming for. The overall model, however, is not lightweight, since it holds millions of parameters from the convolution-based VGG19 base model alone, and our top model adds further weights to the system. As shown in Table 4, our claims hold true, since our model outperforms systems that use state-of-the-art techniques.
The results presented here have a few limitations as well, the first and biggest of which arises from the dataset used for performance evaluation. The dataset has a huge imbalance in the number of examples per malware class: some classes have more than two thousand examples, whereas others have fewer than a hundred, and the malware classes with fewer examples unfortunately have higher misclassification rates. The other limitation of this work is the lack of exploration of data augmentation and feature engineering.
In the future, we could use image augmentation to improve the malware recognition results. Spatial attention is a recently developed concept, and more state-of-the-art models and methods are being developed on this topic that we have not yet explored. We hope to compare and enhance our methods with vision transformers and other types of visual attention mechanisms that do not rely on a convolutional pipeline [67,68].
Recently, big data approaches using Spark ML and BigDL have produced very good results owing to the Spark framework, so our work could also be applied to IoT malware prediction in a big data setting [69-75].

Conclusions
In this paper, we have proposed using attention enhancement via CNNs to solve the malware recognition problem without the need for feature engineering or handcrafted feature design. We have aimed to show how the attention mechanism can be used with CNNs and how this type of spatial attention, which has demonstrated enormous advantages in computer vision problems, can be applied to image-based malware detection as well.
We have proposed a novel approach based on convolutional neural networks (CNN) that uses spatial convolutional attention to classify malware into 25 different malware families. The performance was evaluated on the Malimg benchmark dataset, achieving an accuracy of 97.68%. The experimental results indicate that our proposed model can be used for image-based malware detection with state-of-the-art results despite being simpler than other solutions. Attention itself is a wide topic, and many variants of attention have emerged in recent years. This provides an opportunity to further enhance this solution for use in smart homes and other Internet-of-Things (IoT) environments. By building on our proposed attention-based model and improving it with various deep learning techniques, it is possible to develop a generic pipeline that uses attention-enhanced deep learning models to solve malware recognition problems more effectively.