An Efficient DenseNet-Based Deep Learning Model for Malware Detection
Entropy 2021, 23, 344

Malware has grown rapidly in recent years, creating a significant security threat to organizations and individuals. Despite the incessant efforts of cybersecurity research to defend against malware threats, malware developers discover new ways to evade these defense techniques. Traditional static and dynamic analysis methods are ineffective in identifying new malware and incur high overhead in terms of memory and time. Typical machine learning approaches that train a classifier on handcrafted features are also not sufficiently robust against these evasive techniques and require more effort because of feature engineering. Recent malware detectors also exhibit performance degradation due to class imbalance in malware datasets. To resolve these challenges, this work adopts a visualization-based method, where malware binaries are depicted as two-dimensional images and classified by a deep learning model. We propose an efficient malware detection system based on deep learning. The system uses a reweighted class-balanced loss function in the final classification layer of the DenseNet model to achieve significant performance improvements in classifying malware by handling imbalanced data issues. Comprehensive experiments performed on four benchmark malware datasets show that the proposed approach can detect new malware samples with higher accuracy (98.23% for the Malimg dataset, 98.46% for the BIG 2015 dataset, 98.21% for the MaleVis dataset, and 89.48% for the unseen Malicia dataset) and reduced false-positive rates when compared with conventional malware mitigation techniques while maintaining low computational time. The proposed malware detection solution is also reliable and effective against obfuscation attacks.


Introduction
The increasing number and complexity of malware have become one of the most serious cybersecurity threats [1,2]. Although the cybersecurity industry is constantly working to monitor and counter malware in several ways, cyber attackers show no indication of stopping or slowing down their attacks. Malicious hacker groups develop sophisticated evasion techniques such as polymorphism [3], metamorphism [4], and code obfuscation [5] that defeat many traditional malware mitigation systems. The malware most widely used by attackers targeting businesses includes backdoors, miners, spyware, and information stealers. Emotet [6] and TrickBot [7] are information stealers that commonly use malicious spam (malspam) to infect systems. Malspam contains infected [...]
Motivated by the work of Nataraj et al. [10], we view malware detection as a multi-class image classification problem by visualizing binary code as a two-dimensional (2D) grayscale image. The structure of the PE binary file (cleanware or malware) is studied by converting it into an image, which provides more information about it. The binary images corresponding to the same class appear quite similar in structure and texture, whereas they are distinct across classes. The various subsections of a PE binary are visualized with different textures. The small modifications made to the binary by malware writers are recognizable in new variants, but the overall structure of the image remains unaffected.
Because preserving this structure is crucial for detecting malware and avoiding information loss, we do not consider other visualization approaches effective for this task.
Deep learning is a subfield of machine learning that learns representations of the input at multiple levels of abstraction. Advances in computer vision with deep learning have been driven mainly by Convolutional Neural Networks (CNNs). Deep learning models learn complex features, and training a complex model with many convolutional layers requires millions of parameters. This can lead to overfitting within only a few epochs, so the model does not generalize well and performs poorly. The knowledge of CNNs thoroughly trained on a massive, well-annotated dataset such as ImageNet [13] can also be transferred to make the detection and classification of malware images more effective. The key idea of transfer learning is that the knowledge gained in learning one task can help improve learning on a different task. As CNNs grow increasingly deep, the input passes through many layers, and the input information may vanish before it reaches the final layer of the network. ResNet and other CNNs address this problem by creating shorter paths from preceding layers to subsequent layers.
In this paper, a novel method is presented to classify malware variants based on the deep learning DenseNet model [14] enhanced with a class-balanced loss for reweighting the categorical cross-entropy loss. The proposed modification of the DenseNet model ensures information flow by directly connecting all the layers with their feature maps in the network. The feedforward nature is maintained: each layer acquires additional inputs from all previous layers and passes its own feature maps on to all succeeding layers. The proposed model was evaluated in practice using the TensorFlow Python library [15] and obtained promising results on the Malimg [10], Microsoft BIG 2015 [11], MaleVis [12], and Malicia [16] datasets.
The contributions of this work are as follows:
• An effective and expeditious deep learning-based malware detection and classification system is provided. It uses raw binary images and requires no binary execution (behavioral analysis), reverse engineering, or code disassembly language skills.
• The proposed methodology employs pretrained Densely Connected Convolutional Networks (DenseNet) to achieve faster preprocessing and training of binary samples. The DenseNet model allows for concatenation of features and utilizes fewer parameters compared to other CNN models. The implicit deep supervision mechanism of the DenseNet model contributes to effective malware detection. Additionally, the dense connections with their regularizing power help reduce overfitting with smaller malware training datasets.
• The data imbalance problem in classifying malware is tackled using reweighting of the class-balanced categorical cross-entropy loss function in the softmax layer.
• We conduct an extensive evaluation on four different malware datasets, of which three datasets are used for training and one dataset is used for testing the proposed model. The results show that the proposed system is very efficient and effective. It is also resilient against sophisticated malware evolution over time and against anti-malware evasion tactics.
• Without the need for complex feature engineering tasks, the proposed deep learning-based malware detection model achieves higher accuracy rates of 98.23%, 98.46%, and 98.21% for the three datasets and of 89.48% for the unseen (Malicia) dataset. The model has high computational performance, achieving an efficient malware detection system.
The paper is organized as follows. Section 2 presents a literature survey on malware recognition and classification. Section 3 details the proposed DenseNet-based malware detection system. Section 4 describes the malware datasets used to assess the performance of the proposed system. The experimental results of the proposed model and performance analysis with other known malware detection systems are also discussed. The conclusion of the paper is presented in Section 5.

Literature Survey
Significant malware analysis and detection research surveys have been conducted based on static, dynamic, and machine learning methods [17,18]. This section provides a survey of the different methods used to classify malware. Static features such as byte, string, and opcode sequences [19]; function length distribution [20]; functional call-graph [21]; and PE file features [22] are extracted using static analysis methods. Schultz et al. [23] obtained various static features from binary files and analyzed their performance by training with different machine learning techniques. Roseline and Geetha [24] used static features to classify malware using an oblique random forest approach. Common signature-based malware detection approaches include malicious code analysis, signature generation, and signature database storage. These approaches are inefficient since malware attackers execute malicious activities and constantly create zero-day malware. Static analysis is not resilient to code obfuscation and does not enable automated processing.
Behavioral features such as network activities, instruction sequences, and system calls [25] are extracted using dynamic analysis methods. Imran et al. [26] proposed a malware classification approach based on similarity. The API call sequences were modeled using Hidden Markov Models (HMMs), and similarity scores were computed for malware classification. Their approach works well with less data but incurs high computation overhead. Dynamic analysis is inefficient, as malware may modify its behavior in virtual environments during execution. Hybrid methods use features derived from static, dynamic, or machine learning methods [27] to classify malware. Rieck et al. [28] extracted dynamic API call features and used a Support Vector Machine (SVM) for detecting malware. Islam et al. [8] showed that the hybrid approach is more efficient than static or dynamic approaches alone.
Recently, significant research efforts in malware analysis have made use of the vision-based approach [29][30][31][32][33]. Features such as opcode sequences and system calls were visualized as images [34,35]. Han et al. proposed an effective system for identifying packed and encrypted malware. Malware binaries were converted into images [36] for classifying their variants. Conti et al. [37] first reported that visual methods help researchers efficiently classify binary files, analyze new file structures, and obtain perspectives that impart knowledge and enhance the existing set of commonly used methods. The byteview visualization enabled the researchers to easily identify the presence of significant sections in the file. Nataraj et al. [10] extracted GIST texture features from visualized grayscale images and classified malware using K-Nearest Neighbors (KNN) with Euclidean distance. Their system required less computational cost than the n-gram method for malware classification. Han et al. [35] proposed an automatic analysis method for generating entropy graphs from grayscale images. Their method did not identify packed malware since the entropy measure was high and patterns were not visualized. Kancherla et al. [38] extracted Gabor, intensity, and wavelet features from binary images. Their approach was robust to code obfuscations. Liu et al. [39] presented an approach based on grayscale images, and the image size was reduced using the local mean method to achieve better ensembling. Fu et al. [40] visualized malware as RGB (Red, Green, Blue) color images and extracted global texture and color features from them. The code and data segments were also extracted as local features. Their method combined global and local features, achieving effective malware classification. Nisa et al. [41] converted malware to images and applied segmentation-based fractal texture analysis (SFTA) to obtain features, which were fused with features obtained from pretrained AlexNet and Inception-v3 deep neural networks. Finally, machine learning classifiers were used for malware detection. Azab et al. [42] proposed a malware classification framework in which spectrogram images are classified by a CNN for malware detection. Ding et al. [43] extracted bytecode from the Android package (APK) file and transformed it into a 2D bytecode matrix. Then, a CNN model was trained and used for malware recognition. Mahdavifar and Ghorbani [44] proposed a deep learning expert system that extracts refined rules from a trained deep neural network (DNN) model for malware detection. Naeem et al. [45] converted APK files to color images and used a convolutional DNN to extract dynamic image features. Then, the DNN was trained to detect malware attacks. Singh et al. [46] proposed a methodology to convert malware features into fingerprint images. Then, a CNN model was used to extract features from visualized malware for malware recognition. Sun and Qian [47] generated malware feature images by aggregating static analysis of malware using recurrent neural network (RNN) and CNN models. Feature images were obtained by fusing original codes with predictive codes obtained from the RNN. Finally, a CNN was trained to recognize malware.
Previous works based on traditional methods are time-consuming and inefficient given the growing amount of malware. The visualization method is effective in terms of time and processing efficiency. Conventional machine learning methods are not able to handle raw pixel information from images and do not enable incremental learning. The transformation of raw data into feature vectors needs extensive engineering and technical knowledge, and the classification model is then trained on the transformed form of the input images. Deep learning techniques provide the representation learning ability needed to use raw input data directly and allow for automated learning.
Deep learning techniques [48] are focused on multiple layers of abstraction, with higher layers representing more abstract data information. Neural networks replace typical machine learning techniques as an alternative in detecting malware. The advantages of neural network models include incremental learning ability, training layers as required, etc. Deep learning contributes to the development of automated and generalized models for the detection and classification of known and unknown malware [49]. CNNs are feed-forward neural networks specifically used for image classification problems. Considering the ability of robust feature learning, state-of-the-art malware detection systems use CNN models [50] for learning binary patterns in malware images. Ensemble models [51][52][53] can combine multiple machine learning and deep learning models using stacking, boosting, or bagging architecture. Cui et al. [54] proposed a CNN model for malware detection. Their system worked for input images of fixed sizes. Agarap et al. [55] trained the hybrid combination of deep learning models and SVM on the Malimg dataset. Their approach provided insights into designing an intelligent malware detection system. The proposed model analyzes malware based on the vision-based technique. The advantages of the CNN model are considered to train the malware images and to effectively classify them using the proposed modification of the DenseNet model.

Proposed Methodology
The overall design of the proposed malware detection approach is illustrated in Figure 2. The flow of the proposed modification of DenseNet model is shown in Figure 3. The input binary images are fed into the DenseNet model for feature extraction and classification. The model is trained by providing the input image directly into the initial convolution (Conv) layer. CNNs have a great potential to extract distinctive features that comprehensively articulate the image and learn task-specific features. They automatically learn features at various levels of abstraction, allowing them to learn complex functions by modeling raw input data into the desired output. The proposed model uses DenseNet to extract all features from malware datasets and trains the DenseNet on top of the extracted features. Every dense layer can extract fine details from binary images.

The proposed model was built with an initial convolutional layer, a max-pooling layer, four dense convolution (Dense Conv) blocks, and four transition layers (1 × 1 Conv and 2 × 2 average pooling). Each Dense Conv block consists of a sequence of 1 × 1 Conv and 3 × 3 Conv layers. These convolutions (1 × 1 and 3 × 3) are repeated 6, 12, 48, and 32 times within the successive Dense Conv blocks, which alternate with the transition layers. The output feature maps obtained after passing through these layers are given as input to the Global Average Pooling (GAP) block. A fully connected (FC) layer follows GAP and classifies the malware samples into their corresponding classes.
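To make the layer layout above concrete, the following sketch traces the spatial size and channel count of the feature maps through such a network for a 64 × 64 input. It is only an illustration under stated assumptions: the growth rate (32), the compression factor (0.5), the 64 initial channels, and the stride-2 max pooling are the usual DenseNet defaults and are not given in the text, and the sketch places a transition layer only between adjacent Dense Conv blocks. The function name `trace_densenet` is hypothetical.

```python
def trace_densenet(input_size=64, block_layers=(6, 12, 48, 32),
                   growth_rate=32, compression=0.5):
    """Return (stage, spatial size, channels) tuples for a DenseNet-style layout.

    Assumed values (not from the paper): growth_rate, compression,
    2 * growth_rate initial channels, and stride-2 max pooling.
    """
    trace = []
    size = input_size // 2                  # initial 7x7 Conv with stride 2
    channels = 2 * growth_rate
    trace.append(("initial conv", size, channels))
    size //= 2                              # max pooling with stride 2
    trace.append(("max pool", size, channels))
    for i, n_layers in enumerate(block_layers, start=1):
        # Every 1x1 + 3x3 pair appends growth_rate new feature maps.
        channels += n_layers * growth_rate
        trace.append((f"dense block {i}", size, channels))
        if i < len(block_layers):
            # Transition: 1x1 Conv compresses channels, 2x2 average pooling halves size.
            channels = int(channels * compression)
            size //= 2
            trace.append((f"transition {i}", size, channels))
    trace.append(("global average pooling", 1, channels))
    return trace


for stage, size, channels in trace_densenet():
    print(f"{stage:24s} {size:3d} x {size:<3d} channels={channels}")
```

Under these assumptions the vector that reaches the FC layer has one entry per final feature map, which is why the classifier size does not depend on the spatial resolution left after the last block.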

Preprocessing of Input Binaries
The PE binary files are read as bytes in the range 0 to 255 and stored as a one-dimensional (1D) vector of 8-bit unsigned integers. Each byte represents a pixel intensity level (0 denotes black, 255 denotes white, and intermediate values denote various gray shades). These byte values are organized into a two-dimensional array (pixels), which is visualized as a grayscale image. The binary files originally vary in size. CNNs do not accept images of different resolutions, since they include FC layers with a fixed number of trained weights. Therefore, the input images of various dimensions are resized into square images of 64 × 64 pixels. The images are resampled using the nearest interpolation method since it does not alter the actual image data. This method locates the closest pixel in the original input image for each pixel in the resulting image and copies its value. The nearest interpolation approach is preferable to other interpolation methods, such as bilinear and bicubic interpolation, in terms of its simplicity, its ability to retain the original pixel values unaltered, and its computational time. This approach is used in our work since the byte values of malware images should not be changed and critical information should not be lost during resampling.
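The preprocessing steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the text does not state how the row width of the 2D array is chosen, so the near-square width and the zero padding of the last row are assumptions, and the function names are hypothetical.

```python
import math


def bytes_to_image(data):
    """Arrange raw bytes (0-255) row by row into a 2D list of pixel intensities."""
    if not data:
        raise ValueError("binary is empty")
    # Assumption: near-square layout; the text does not specify the row width.
    width = math.ceil(math.sqrt(len(data)))
    height = math.ceil(len(data) / width)
    # Zero-pad so that the last row is complete (0 renders as black).
    padded = bytes(data) + bytes(width * height - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def resize_nearest(image, out_h=64, out_w=64):
    """Nearest-neighbor resampling with floor-based index mapping.

    Each output pixel copies exactly one source pixel, so no new
    intensity values are introduced.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]
```

Because every output pixel is a copy of an input pixel, the set of values in the resized image is always a subset of the byte values of the original binary (plus the padding value), which is the property the text relies on when preferring nearest interpolation over bilinear or bicubic resampling.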

DenseNet
DenseNet [14] is a deep learning architecture in which all layers are directly connected, thereby achieving effective information flow between them. Each layer acquires additional inputs from all previous layers and transfers its feature maps to all subsequent layers. The output feature maps obtained from the current layer are combined with those of the previous layers using concatenation. Because every layer is linked with all the succeeding layers of the network, such networks are referred to as DenseNets. This model requires fewer parameters compared to traditional CNNs. It also reduces the overfitting problem that occurs with smaller malware training sets.
Consider an input image $x_0$ that is passed through the proposed convolutional network. The network contains $N$ layers, and each layer executes a nonlinear transformation $F_n(\cdot)$. Layer $n$ receives the feature maps of all preceding convolutional layers: the feature maps of layers 0 to $n-1$ are concatenated and represented as $[x_0, \ldots, x_{n-1}]$. Hence, this model has $N(N+1)/2$ connections in an $N$-layer network. The output of the $n$th layer is given by
$$x_n = F_n([x_0, x_1, \ldots, x_{n-1}])$$
where $x_n$ is the output of the current $n$th layer, $[x_0, \ldots, x_{n-1}]$ is the concatenation of the feature maps obtained from layers 0 to $n-1$, and $F_n(\cdot)$ is the composite function. The consecutive operations in this composite function include Batch Normalization (BN), Rectified Linear Units (ReLU), and 3 × 3 convolution (Conv). The concatenation operation is not feasible when the sizes of the feature maps differ. Therefore, feature maps are downsampled between blocks: transition layers consisting of 1 × 1 Conv and 2 × 2 average pooling operations are placed between two adjacent Dense Conv blocks. The initial Conv layer is a 7 × 7 convolution with stride 2. After the final Dense Conv block, the classification layer consisting of global average pooling and the softmax classifier is connected. The prediction is made using all feature maps in the neural network. The output layer has $K$ neurons, one for each of the $K$ malware families.
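The dense connectivity pattern can be illustrated with a toy example in which each "feature map" is a plain list of numbers and concatenation is list concatenation. This is a sketch of the connectivity only, assuming arbitrary callables in place of the BN-ReLU-Conv composite function; the names `dense_forward` and `count_connections` are hypothetical.

```python
def dense_forward(x0, layers):
    """Apply x_n = F_n([x_0, ..., x_{n-1}]) and return every feature map."""
    features = [x0]
    for f in layers:
        # Concatenate the outputs of all preceding layers (channel-wise in a real CNN).
        concatenated = [value for fmap in features for value in fmap]
        features.append(f(concatenated))
    return features


def count_connections(num_layers):
    """Layer n receives n inputs (x_0 .. x_{n-1}); summing over n = 1..N gives N(N+1)/2."""
    return sum(range(1, num_layers + 1))


# Toy composite function: each layer emits a single value, the sum of its inputs.
toy_layers = [lambda inputs: [sum(inputs)]] * 3
print(dense_forward([1.0], toy_layers))
print(count_connections(5), 5 * (5 + 1) // 2)
```

The input to each successive layer grows by the size of every earlier output, which is why a real DenseNet keeps the number of new feature maps per layer small.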
The convolution operation learns the image features and preserves the spatial relationship among the pixels. Mathematically, a convolution function operates on an image matrix and a filter. Each convolution layer corresponds to the BN-ReLU-Conv sequence. ReLU is applied to the feature maps to introduce nonlinearity in CNNs. The ReLU function is given by
$$f(x) = \max(0, x)$$
Pooling is performed to reduce the dimensionality of the output feature maps, using either max pooling or average pooling. Max pooling takes the largest element from each pooling region of the rectified feature map. Average pooling divides the input into pooling regions and computes the average value of each region. GAP computes the average of each feature map, and the resulting vector is passed to the softmax layer.
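As a small worked illustration of these operations, the sketch below applies ReLU, 2 × 2 max and average pooling, and global average pooling to feature maps stored as nested lists. Non-overlapping pooling windows (stride equal to the window size) are an assumption made for simplicity, and the function names are hypothetical.

```python
def relu(fmap):
    """Element-wise max(0, x) on a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]


def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping size x size pooling (stride equal to size, an assumption)."""
    reduce = max if mode == "max" else (lambda window: sum(window) / len(window))
    pooled = []
    for r in range(0, len(fmap) - size + 1, size):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def global_average_pool(fmaps):
    """Reduce each 2D feature map to its mean, giving one value per map."""
    return [sum(sum(row) for row in f) / (len(f) * len(f[0])) for f in fmaps]
```

For a 4 × 4 input, `pool2d` returns a 2 × 2 map, while `global_average_pool` collapses each map to a single number regardless of its size.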
The operations of the proposed network are summarized in Algorithm 1.

Classification
The classification layer is composed of a fully connected softmax layer. In the FC layer, the number of neurons is set according to the number of malware classes available in the dataset. The softmax function is used for multi-class classification problems. This function calculates the probability of each class $i$ over all possible classes. The softmax activation function is given by
$$f(y_i) = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)}$$
where $y_i$ is the input value and $y_j$ ranges over all input values of $I$. The formula calculates the ratio of the exponential of the input element to the sum of the exponential values of all input data.
The class imbalance problem is a classification challenge in which the distribution of classes in the training dataset is uneven. The degree of class imbalance varies, but a significant imbalance is more difficult to model and demands advanced techniques to tackle the issue. The Malimg dataset and the Microsoft BIG 2015 dataset are imbalanced, long-tailed malware datasets that contain many samples for a few classes and very few samples for some classes. Models trained on these varied sample sizes are biased toward the dominant classes. Data-level techniques for resolving data imbalance, such as oversampling of minority classes or downsampling of majority classes, are not appropriate for malware detection problems. It is not possible to generate images corresponding to realistic malware binaries by oversampling, and many representative malware variants might be overlooked by downsampling.
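Returning to the softmax function described at the start of this section, a minimal sketch is given below. Subtracting the maximum score before exponentiation is a standard numerical-stability step that is not mentioned in the text; it leaves the resulting probabilities unchanged.

```python
import math


def softmax(scores):
    """Return exp(y_i) / sum_j exp(y_j) for every score y_i."""
    shift = max(scores)  # stability shift; cancels out in the ratio
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```

The outputs are positive and sum to one, so the largest score always corresponds to the most probable class.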
Reweighting losses by inverse class frequency typically results in low performance on real-world data with high class imbalance. The proposed malware detection model instead uses the class-balanced loss [56] with a weighting factor $W_i$ that is inversely related to the effective number of samples for class $i$. It is given by
$$W_i = \frac{1}{S_{n_i}}$$
where $S_{n_i}$ is the effective number of samples for class $i$, which has $n_i$ training samples. It is given by
$$S_{n_i} = \frac{1 - B^{n_i}}{1 - B}$$
where $B = (I - 1)/I$ and $I$ is the size of the set of all possible instances in a class.
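The weighting scheme can be illustrated numerically. The sketch below follows the effective-number formulation of the class-balanced loss [56]; the value of B used here (0.999) and the class counts are hypothetical choices for illustration and are not taken from the paper.

```python
def effective_number(n, beta):
    """Effective number of samples S_n = (1 - B^n) / (1 - B)."""
    return (1.0 - beta ** n) / (1.0 - beta)


def class_balanced_weights(samples_per_class, beta=0.999):
    """Weight W_i = 1 / S_{n_i} for every class (beta is an assumed hyperparameter)."""
    return [1.0 / effective_number(n, beta) for n in samples_per_class]


# Hypothetical long-tailed class counts: one dominant class and one rare class.
counts = [3000, 100]
for n, w in zip(counts, class_balanced_weights(counts)):
    print(n, round(effective_number(n, 0.999), 2), w)
```

With B = 0 every class receives the same weight (no reweighting), and as B approaches 1 the weights approach inverse class frequency, so B interpolates between the two regimes.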

Training
The adaptive learning rate optimization algorithm Adam [57] is used to update the weights based on the malware training data. It determines individual learning rates for distinct parameters. Adam uses estimates of the first and second moments of the gradient to adjust the learning rate for the individual weights of the neural network; hence its name, adaptive moment estimation. The optimizer estimates the moments using exponentially decaying moving averages, which are computed from the gradient on the current mini-batch. The moving average estimates of the first and second moments of the gradient are given by
$$a^{(1)}_t = \beta_1 a^{(1)}_{t-1} + (1 - \beta_1) g_t, \qquad a^{(2)}_t = \beta_2 a^{(2)}_{t-1} + (1 - \beta_2) g_t^2$$
where $a^{(1)}$ and $a^{(2)}$ are the moving averages of the first and second moments, $\beta_1$ and $\beta_2$ are the decay rates, and $g$ is the gradient on the current mini-batch.
Cross-entropy (CE) loss, or log loss, measures the performance of a classifier whose output is a probability between 0 and 1. The CE loss increases as the predicted probability deviates from the real class label. The cross-entropy loss is given by
$$CE = -\sum_{i \in C} t_i \log(s_i)$$
where $C$ is the set of all classes in each dataset, $t_i$ is the ground truth, and $s_i$ is the CNN score for each class $i$ in $C$.
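A single Adam update for one scalar weight can be sketched as follows. The learning rate default matches the value reported in the experiments (0.0001); the decay rates, the epsilon term, and the bias-correction step come from the standard description of Adam [57] and are assumptions here, since the text does not list them. The function name `adam_step` is hypothetical.

```python
import math


def adam_step(w, g, a1, a2, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight.

    a1, a2: moving averages of the gradient and the squared gradient.
    t: 1-based step counter used for bias correction.
    """
    a1 = beta1 * a1 + (1.0 - beta1) * g          # first-moment estimate
    a2 = beta2 * a2 + (1.0 - beta2) * g * g      # second-moment estimate
    a1_hat = a1 / (1.0 - beta1 ** t)             # bias correction (from [57])
    a2_hat = a2 / (1.0 - beta2 ** t)
    w = w - lr * a1_hat / (math.sqrt(a2_hat) + eps)
    return w, a1, a2


# Usage: a few steps on f(w) = w^2, whose gradient is 2w.
w, a1, a2 = 1.0, 0.0, 0.0
for t in range(1, 11):
    w, a1, a2 = adam_step(w, 2.0 * w, a1, a2, t, lr=0.01)
print(w)
```

Because the update divides by the square root of the second-moment estimate, the step size is roughly bounded by the learning rate regardless of the gradient's scale, which is what makes the per-parameter learning rates adaptive.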
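Categorical CE loss, also known as softmax loss, is a combination of the softmax activation function and the CE loss and is used for multi-class classification. It outputs a probability value for each input binary image over $C$.
The Categorical Cross-Entropy (CCE) loss for a sample $s$ corresponding to class label $y$ is given by
$$CCE(s, y) = -\log\left(\frac{\exp(s_y)}{\sum_{j \in C} \exp(s_j)}\right)$$
The Class-Balanced Cross-Entropy (CBCE) loss for class $y$ with $n_y$ training samples is given by
$$CBCE(s, y) = -\frac{1 - B}{1 - B^{n_y}} \log\left(\frac{\exp(s_y)}{\sum_{j \in C} \exp(s_j)}\right)$$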
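The two losses can be compared directly in code. The sketch below is an illustration following the class-balanced softmax loss of [56]; the value of B (0.999) and the class counts in the usage example are hypothetical, and the log-sum-exp shift is a numerical-stability detail not discussed in the text.

```python
import math


def categorical_cross_entropy(scores, label):
    """CCE(s, y) = -log(exp(s_y) / sum_j exp(s_j)), via a stable log-sum-exp."""
    shift = max(scores)
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - scores[label]


def class_balanced_cross_entropy(scores, label, samples_per_class, beta=0.999):
    """CCE scaled by (1 - B) / (1 - B^{n_y}), the inverse effective number of class y."""
    n_y = samples_per_class[label]
    weight = (1.0 - beta) / (1.0 - beta ** n_y)
    return weight * categorical_cross_entropy(scores, label)


# Usage: identical scores, but the rare class (index 1) is penalized more heavily.
scores = [0.0, 0.0]
counts = [1000, 10]   # hypothetical class sizes
print(class_balanced_cross_entropy(scores, 0, counts))
print(class_balanced_cross_entropy(scores, 1, counts))
```

Misclassifying a sample from a rare family therefore contributes more to the total loss than misclassifying one from a dominant family, which counteracts the bias toward dominant classes.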

Datasets
The proposed model was evaluated with four malware datasets: Malimg [10], Microsoft's BIG 2015 [11], MaleVis [12], and Malicia [16]. The first three datasets were used for training, and the fourth (Malicia) dataset was used for testing. The experiments also included 1043 cleanware samples. These samples were collected from executable files (.exe) of the Windows operating system and checked using the VirusTotal portal. The various families of the malware datasets used for evaluation of the proposed malware detection method are given in Table 1. The number of samples per malware class varies across the datasets. The Malimg dataset contains 9339 malicious samples presented as grayscale images. Each of the malware samples in the dataset corresponds to one of the 25 malware classes. The BIG 2015 dataset contains 21,741 malware samples, among which the training set includes 10,868 samples and the remaining 10,873 samples are test samples. In our experiments, the training set samples are used for evaluation. Each malware file has an identifier and a class. The identifier is a hash value that uniquely identifies the file, while the class label denotes one of nine distinct malware families. Each malware sample has two files, namely, .bytes and .asm. We use the .bytes files, which contain the raw hexadecimal code of the file, to generate malware images.
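Turning a .bytes file into the byte sequence used for image generation can be sketched as follows. The line layout assumed here (an address token followed by two-digit hexadecimal byte tokens, with `??` marking bytes that could not be read) is an assumption about the file format and is not described in the text; mapping `??` to 0 is likewise an illustrative choice. The function name `parse_bytes_dump` is hypothetical.

```python
def parse_bytes_dump(text):
    """Convert a hexadecimal dump into a bytes object of pixel intensities.

    Assumed line format: '<address> <hex byte> <hex byte> ...'.
    """
    values = []
    for line in text.splitlines():
        tokens = line.split()
        for token in tokens[1:]:          # tokens[0] is taken to be the address
            if token == "??":
                values.append(0)          # assumed: unreadable byte rendered as black
            else:
                values.append(int(token, 16))
    return bytes(values)


sample = "00401000 56 8D 44 24\n00401004 FF ?? 00 1A\n"
print(list(parse_bytes_dump(sample)))
```

The resulting byte sequence can then be reshaped into a 2D array and resized as described in the preprocessing section.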

Results and Discussion
The dataset was randomly divided into 70% training and 30% validation sets. The results were taken with 1043 cleanware samples and each of the three malware datasets. Train and test files were divided such that 30% of the overall samples were considered for testing purposes. The proposed malware detection system was trained on 7268 samples and tested on 3115 samples for the Malimg dataset with cleanware samples (9339 + 1043). Then, the model was trained on 8338 samples and tested on 3573 samples from the BIG 2015 dataset along with cleanware samples (10,868 + 1043). On the MaleVis dataset, 9958 samples were training samples and 4268 were testing samples.
The experiments were implemented on a Linux system with an Intel® Xeon(R) CPU E3-1226 v3 at 3.30 GHz × 4, 32 GB RAM, and an NVIDIA GM107GL Quadro K2200/PCIe/SSE2 GPU. The performance evaluations were carried out with the following hyperparameter settings: 100 epochs, learning rate 0.0001, and batch size 32. The proposed deep neural network model was implemented in Python using the Keras v0.1.1 deep learning library. The experiments were performed for various input binary image sizes, such as 32 × 32 and 64 × 64. We observed that images reshaped to 64 × 64 retain more information and yield better predictive accuracy.
Four types of outcomes are counted to assess the classification predictions.
True Positive (TP): the prediction that an observation belongs to a class and it actually does belong to that class, i.e., a binary image that is classified as malware and is actually malware.
True Negative (TN): the prediction that an observation does not belong to a class and it actually does not belong to that class, i.e., a binary image that is classified as not malware (negative) and is actually not malware (negative).
False Positive (FP): the prediction that an observation belongs to a class and it actually does not belong to that class, i.e., a binary image that is classified as malware and is actually not malware (negative).
False Negative (FN): the prediction that an observation does not belong to a class and it actually does belong to that class, i.e., a binary image that is classified as not malware (negative) and is actually malware.
These four outcomes are presented in a confusion matrix to better describe the results of the proposed model. If there are N classes, the confusion matrix is an N × N matrix, with the true class on the left axis and the predicted class on the top axis. Each entry (a, b) of the matrix is the number of elements with actual class a that are classified as belonging to class b.
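The four outcomes for a given class can be read off a multi-class confusion matrix as sketched below, using the row (actual class) and column (predicted class) convention just described. The precision and recall helpers use the standard definitions TP / (TP + FP) and TP / (TP + FN); the matrix values are hypothetical and serve only as a worked example.

```python
def per_class_counts(matrix, k):
    """Return (TP, TN, FP, FN) for class k; matrix[a][b] counts actual a predicted as b."""
    n = len(matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k][b] for b in range(n) if b != k)   # class k predicted as another class
    fp = sum(matrix[a][k] for a in range(n) if a != k)   # another class predicted as k
    total = sum(sum(row) for row in matrix)
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def precision_recall(matrix, k):
    tp, _, fp, fn = per_class_counts(matrix, k)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical 3-class confusion matrix.
cm = [[5, 1, 0],
      [2, 7, 1],
      [0, 0, 4]]
print(per_class_counts(cm, 0), precision_recall(cm, 0))
```

Treating each class in turn as the positive class in this way is also the one-versus-rest view used when plotting a separate ROC curve per class.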
The elements of confusion matrix for each class are defined by  The generalization ability of the proposed method is assessed using unseen dataset. The dataset is untrained by the proposed DenseNet model to evaluate how well it performs under different samples. The three trained malware datasets contain completely different classes from the Malicia dataset classes. The comparison of the proposed methods with the ML and DL methods over the unseen Malicia dataset is given in Table 4. The results on the unseen Malicia dataset show an accuracy of 89.48%, which is less than the performances of the ML and DL methods over the trained datasets.  Table 5 provides details about the time taken for the proposed model to train and test the binary samples. The comparison of the proposed model and the malware detectors based on various DL methods are studied in terms of computational efficiency. The results indicate that the proposed DenseNet-based malware detection model takes less time to train and test the samples when compared to other deep learning-based malware detection systems.  Table 6            For instance, the Malimg dataset includes 26 classes. The graph shows 26 ROC curves, with the first curve representing the first class that is classified against the other 25 classes, the next ROC curve representing the second class that is classified against the rest of the classes, and so on. TPR is approximately one and FPR is close to zero on the curves for each class against every other class. The area under the curve is higher for all the classes on the Malimg and BIG2015 malware datasets compared to the area under the curve for the MaleVis dataset. This indicates the outperforming efficiency of the proposed DenseNet-based malware detection model. the classes, and so on. TPR is approximately one and FPR is close to zero on the curves for each class against every other class. 
The proposed malware detection system would be effective and can produce advanced results, as shown in Table 7.
Any new malware that resembles these malware families is expected to be detected with comparable accuracy because of the generalization property of the proposed model. If the new malware is completely unseen, i.e., a zero-day malware attack, the proposed system may fail to detect it. If such zero-day attacks accumulate, the performance of the proposed model could fall, but false alarms may indicate that the model needs to be retrained. The model would then be retuned with new samples so that it detects both the malware it has already been trained on and newly seen malware, similar to a top-up of the training set. As a result, the proposed model would be able to keep up with malware evolution over time and to learn anti-malware evasion techniques.
The experiments were also conducted for binary classification (malware or cleanware) with the Malimg, BIG2015, and MaleVis datasets. For each of the three datasets, 1000 samples were picked and included in the malware class, while the other class contained 1043 cleanware samples. The results were used to assess the performance of the proposed DenseNet-based malware detection system on the three binary datasets. The BIG2015 binary dataset shows the highest detection accuracy, 97.72%. The accuracy for the Malimg binary dataset is 97.55%, and the accuracy for the MaleVis binary dataset is 96.81%. The other metrics, such as precision, recall, and F1-score, are similarly higher for BIG2015 than for the other two binary datasets.
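For the binary (malware versus cleanware) setting, the reported metrics follow from the four confusion-matrix outcomes. The sketch below uses the standard definitions of accuracy, precision, recall, and F1-score; the function name `binary_metrics` and the counts passed to it are hypothetical and do not correspond to any result reported in the tables.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary metrics, with malware treated as the positive class.

    Assumption: tp + fp > 0 and tp + fn > 0, so no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # fraction of flagged samples that are malware
    recall = tp / (tp + fn)      # fraction of malware samples that are flagged
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for illustration only.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall depends directly on the number of false negatives, that is, malware samples classified as cleanware.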

Conclusions
We proposed an efficient malware detection and classification technique that combines malware visualization and a pretrained DenseNet model with a reweighted categorical cross-entropy loss criterion. The performance of the proposed DenseNet-based malware detection approach was evaluated on four malware datasets, and its superiority over other models was analyzed.
The proposed model achieved classification accuracies of 98.23% for the Malimg dataset, 98.46% for the BIG 2015 dataset, and 98.21% for the MaleVis dataset, which are higher than those of the other methods explored. On the unseen Malicia dataset, on which the proposed model was not trained, it achieves an accuracy of 89.48%.
The proposed model correctly identified most of the obfuscated malware samples, demonstrating its resilience to obfuscation techniques. The proposed solution does not require execution or unpacking of the packed executables. The experimental results demonstrate that, even though the training set is imbalanced, our technique can effectively and efficiently classify malware samples into their corresponding families. The proposed detection system shows high accuracy and time performance that is comparable with conventional solutions based on machine learning while eliminating the manual feature engineering stage.
In the future, we will concentrate on the reduction of false negatives to achieve an optimal solution.
Author Contributions: J.H., S.A.R., S.G., S.K. and R.D. contributed equally to this study. All authors have read and agreed to the published version of the manuscript.