Hybrid Malware Classification Method Using Segmentation-Based Fractal Texture Analysis and Deep Convolution Neural Network Features

Abstract: As the number of internet users increases, so does the number of malicious attacks using malware. The detection of malicious code is becoming critical, and the existing approaches need to be improved. Here, we propose a feature fusion method that combines the features extracted from the pre-trained AlexNet and Inception-v3 deep neural networks with features obtained using segmentation-based fractal texture analysis (SFTA) of images representing the malware code. Deep convolutional neural network (CNN) features are extracted from two models to improve the accuracy of the malware classifier, because each model has characteristics that allow it to extract different features. This technique produces a fusion of features to build a multimodal representation of malicious code that can be used to classify the grayscale images, separating the malware into 25 malware classes. The features extracted from malware images are then classified using different variants of support vector machine (SVM), k-nearest neighbor (KNN), decision tree (DT), and other classifiers. To improve the classification results, we also adopted data augmentation based on affine image transforms. The presented method is evaluated on the Malimg malware image dataset, achieving an accuracy of 99.3%, which makes it the best among the competing approaches.


Introduction
Malware programs are harmful threats that are intended to damage the security of a computer system. Malware detection has become an essential concern in the cybersecurity community, since malware is capable of causing excessive loss and harm to computer security. Every day, a huge amount of malware is generated intentionally. A 2019 Symantec report [1] stated that malware is growing by 36% annually and that the total number of malware samples is estimated to exceed 430 million. The rapid growth of malware poses an extensive threat to our daily life. Ransomware locks or encrypts the files on a user's device and asks for a payment to restore them, which may lead to large productivity losses [2]. Data breaches due to malware activity often incur huge financial losses for major corporations [3]. Trojans and spyware are used in cyber espionage, resulting in damage to geopolitical and international relations [4]. Malware has become a primary concern for computer network [5] and smartphone users [6].
To counter these threats, deep learning, a technique based on artificial neural networks (ANN), can be employed successfully [7,8]. With its multilayer architecture, deep learning has a great ability to learn features from labeled and unlabeled data. However, a deep learning model must be trained on a large dataset every time, which takes a lot of time and computational resources. To overcome this difficulty, we can use pre-trained deep learning network architectures and extract features to build the classification model [9]. This motivates us to develop a deep learning architecture based on pre-trained networks for malware class identification and classification.
In this paper, we represent malware binary files as grayscale images, which are generated from the byte code information and later used for malware classification. Visualizing malware in this form [10,11] has proved to be an efficient way of obtaining a complete view of its structure.
The conversion of textual data into image form is one of the major problems of malware detection [12]. In parallel, most malware detection methods adopt machine learning techniques that analyze different features of the malicious code. These techniques generally face two major problems: (1) extracting robust and efficient features for malware detection [13] and (2) data imbalance [14], which prevents accurate malware detection and classification. With these problems in mind, the ultimate challenge is to develop a malware identification model that can address the enormous variations in malware families.
In this paper, we propose a feature fusion-based solution to classify malware into different classes using features extracted from images as well as from pre-trained deep convolutional networks. The contributions of our proposed method are as follows:
• The conversion of malware binaries into grayscale images.
• Data augmentation performed on malware images to overcome the data imbalance within the malware datasets, enabling robust feature extraction and enhancing the classifier performance.
• An optimized multimodal feature representation that combines the segmentation-based fractal texture analysis (SFTA) features [15] and deep convolutional neural network (DCNN) features into a single feature vector to obtain a robust malware classification model.
The remainder of the paper is organized as follows: Section 2 discusses the related work, and Section 3 describes the proposed methodology in detail. The experimental results, their analysis, and their significance are given in Section 4. Section 5 concludes the paper.

Related Work
In recent times, deep learning has shown its strength in many computer vision and machine learning applications such as action recognition [16], gait recognition [17,18], object detection [19,20], and many more [21-23]. For malware detection and classification, researchers have applied deep learning and image processing techniques to achieve high accuracy because of their capacity to learn the best features.
For example, Cui et al. [12] presented a novel deep learning-based method for malware variant detection. They generated grayscale images from malicious code and addressed the data imbalance problem by using the bat algorithm. Then, a convolutional neural network (CNN) was used for identification and classification of the malware images, and the results show that the model achieved a good accuracy rate. Vinayakumar et al. [7] presented a more scalable, hybrid deep learning framework for effective visual detection of malware with the same setup. Namavar Jahromi et al. [24] used an extreme learning machine with two hidden layers to detect malware threats in safety-critical systems. Zhu et al. [25] proposed DeepFlow, a novel deep learning-based methodology for detecting malware directly from the data flows in an Android application. The results show that DeepFlow can achieve a high detection F1 score of 95.05%, outperforming conventional deep learning-based approaches, which shows the advantage of the deep learning technique in malware detection. Jeon and Moon [26] extracted the opcode sequences from a malware binary file and adopted a deep recurrent neural network (DRNN) as a classifier to detect malware. Sung et al. [27] used word embedding techniques to embed the malware code into a lower-dimensional feature space. Then, they applied a bi-directional long short-term memory (Bi-LSTM) network model to categorize the malware files by attack family. Gibert et al. [28] combined hand-engineered features and neural networks for malware classification. Each of the subnetworks can be trained separately, while the learned features are fused into a common representation.
To increase classifier efficiency, researchers have adopted deep learning as a tool for feature extraction to detect malware code. Venkatraman et al. [29] presented a hybrid deep learning visualization approach, and the results showed that the deep learning-based model effectively differentiates the behavioral patterns of different malware families. Zhong et al. [30] presented a technique that robustly handles the complexities of malware datasets. This technique used a multi-level deep model, which organizes multiple deep models in a tree structure to enhance the scalability of anonymous malware detection. Ye et al. [31] presented a heterogeneous deep model, which is composed of different layers of associative memory and a weighted autoencoder with multilayer restricted Boltzmann machines to identify malware. For unsupervised feature extraction, the deep learning model is trained in a greedy layer-wise manner, followed by supervised fine-tuning of the parameters to efficiently detect malware. Yuxin et al. [32] used a deep belief network (DBN) to train a multilayer generative model using unlabeled data, which better represents the characteristics of the data samples. DBNs are used as an autoencoder for feature extraction to detect malware. Vasan et al. [33] used handcrafted features as well as those of VGG16 and ResNet-50 CNNs to perform image-based malware classification. Čeponis and Goranin [34] analyzed the use of dual-flow deep learning methods, such as the gated recurrent unit fully convolutional network (GRU-FCN), versus single-flow convolutional neural network (CNN) models for detection of malware signatures.
Over the last few years, the Android operating system (OS) has experienced tremendous popularity. However, this popularity comes at the expense of security, because it is an alluring target for malicious apps. Billah et al. [35] used deep learning as sequences for classification and proposed MalDozer, which is a family attribution and automatic Android malware detection model. MalDozer detects and learns the patterns of malicious and benign data from the actual dataset to expose Android malware. Pektaş and Acarman [36] used deep learning to recognize malware based on the application programming interface (API) call graphs transformed into a numeric feature set as a representation of the malware's execution paths. D'Angelo et al. [37] encoded the sequences of API calls invoked by apps as sparse image-like matrices (API images). Then, they used autoencoders to get the most informative features from these images, which were provided to an ANN-based classifier for malware detection. Naeem et al. [38] converted a raw Android file into a color image and then submitted it to a DCNN model, achieving 97.81% accuracy on a Leopard Mobile malware dataset and 98.47% accuracy on a Windows dataset. Vidal et al. [39] used genetic sequence alignment methods and statistical hypothesis testing for Android malware recognition, achieving an average hit rate of 98.61% on several public datasets.
The related works are summarized in Table 1. Here, the static analysis uses only static resources, which are available before the installation and execution of malicious applications [40].
Unlike the earlier methods, the proposed hybrid method performs malware detection using a multimodal feature representation that fuses features from pre-trained DCNNs with SFTA features. The foremost objective of the proposed methodology is to extract robust features from malware images to improve the classification accuracy after malware detection.

Outline of Methodology
The proposed methodology for malware classification has four major steps: (a) visualization of binary to grayscale image, (b) image augmentation, (c) feature extraction and fusion, and (d) feature selection and classification. The flow diagram of the proposed methodology is illustrated in Figure 1.

The proposed method uses a DCNN architecture in combination with segmentation-based fractal texture analysis (SFTA) features [15] for malware classification. First, we convert the binary dataset files into grayscale images. To overcome the data imbalance between different malware classes, we use a data augmentation method [46]. In the feature extraction step, the SFTA features are extracted from the grayscale images, and the pre-trained DCNN models AlexNet [47] and Inception-v3 [48] are used for deep feature extraction. The SFTA and DCNN features are combined into a single feature vector using a serial-based feature fusion method, and the most informative and robust features are selected using a principal component analysis (PCA)-based feature selection method for accurate malware classification.
The complete details of each step are listed in further subsections below.

Visualization of Binary to a Grayscale Image
When dealing with malware, it is difficult to determine the type of attack by inspecting the raw binary data. We therefore read the malware executable as a sequence of 8-bit values in the range 0-255, which is saved as a vector of decimal numbers representing the malware sample. This vector is reshaped into a 2D matrix whose dimensions depend on the size of the malware binary file. We then convert the grayscale image into a three-channel (red (R), green (G), blue (B)) image by duplicating the grayscale channel three times. A well-known technique for converting RGB (red, green, blue) images to grayscale is given in Equation (1).
The process of visualizing binaries as grayscale images is shown in Figure 2.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 23
The width of each grayscale malware image was 256 pixels, while the height depended on the file size.
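The byte-to-image step can be illustrated with a short Python sketch. It is not the authors' implementation (the experiments in this paper were run in MATLAB); the function names, the zero padding of the last row, and the toy input are assumptions made only for illustration, while the fixed width of 256 pixels and the channel duplication follow the text.

```python
import math

def bytes_to_grayscale(data, width=256):
    """Reshape raw bytes into rows of `width` pixel values in 0-255."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Assumption: pad the final row with zeros so every row is complete.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]

def to_three_channels(gray):
    """Duplicate the gray value into (R, G, B) for every pixel."""
    return [[(p, p, p) for p in row] for row in gray]

# Toy input standing in for an executable: 514 harmless bytes.
sample = bytes(range(256)) * 2 + b"\x4d\x5a"
image = bytes_to_grayscale(sample)
rgb = to_three_channels(image)
print(len(image), len(image[0]))  # rows and columns of the grayscale image
```

Because the width is fixed, the number of rows grows with the file size, which matches the statement that the image height depends on the size of the binary.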

Image Augmentation
Imbalanced datasets are a common problem in computer vision and harm classification performance. A shortage of images in some classes may cause under-fitting and over-fitting [10], which has a large influence on the performance of DCNNs. To overcome this problem, we apply a data augmentation method to the malware dataset to enhance the classifier performance. Image augmentation is a method that increases the size of the dataset used to train the neural network [49]. In the augmentation phase, we generate new data for the classes that are under-represented in the dataset [50], which reduces the uneven representation of classes. Proper data augmentation schemes also help to avoid over-fitting and improve the robustness of the model.
Let $D_1 = \{a_1, \ldots, a_n\}$ be a given malware dataset, where $a_k$ represents the $k$th image sample in the dataset. Assume that $a_k$ has $N$ pixels. The pixel coordinate matrix $M_k$ for $a_k$ is given in Equation (2):
$$M_k = \begin{bmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ \vdots & \vdots & \vdots \\ x_N & y_N & 1 \end{bmatrix} \quad (2)$$
where each row holds the homogeneous coordinates of one pixel. For data augmentation on image $a_k$, we apply an affine transform matrix, $p$, to the coordinate matrix, $M_k$, and obtain a transformed coordinate matrix, $M_k^t$, for the image. The operation is given in Equation (3):
$$M_k^t = M_k \, p \quad (3)$$
where each row of $M_k^t$ is the transformed coordinate of one pixel. The transform matrices are listed in Table 2.
Table 2. Affine transformation matrices for image augmentation (operations: flipping, scaling, and rotation).
There are different ways to choose the affine transformation matrix $p$. We use three kinds of random perturbations to create new augmentation data. The flipping operation flips the image along the horizontal dimension. The rotation operation rotates the image by an angle sampled from 0° to 180°. The scaling operation moves the image in both the x and y dimensions. The respective affine transform matrices are shown in Table 2, where β is the angle of rotation, and $T_x$ and $T_y$ are the offsets along the coordinate axes.
Let $a_k$ be a given image, let $A_k$ denote its augmented versions, and let $D_1$ denote the dataset. The augmentation process is formulated in terms of three operations, where $T_1$ is flipping, $T_2$ is rotation, and $T_3$ is scaling. The expanded dataset, $D_1$, along with the corresponding class labels, is then used to train a deep CNN for malware recognition and classification. Note that we use flips, rotation, and scaling as the data augmentation operations because they do not modify the texture topologies in malware classes, as shown in Figure 3.
Augmentation operations such as flipping, rotation, and scaling increase the variety of the training data and improve the classification accuracy of deep CNNs. The data augmentation method in Algorithm 1 describes how the malware images are rotated and flipped to deal with the problem of class imbalance.
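As a rough illustration of how an affine matrix acts on pixel coordinates, the following Python sketch applies three transforms to rows of the form [x, y, 1]. The entries of Table 2 are not reproduced in the text, so the matrices below are the standard textbook forms, not necessarily the authors' exact ones. The row-vector convention (coordinates multiplied on the left of the matrix), the helper names, and the use of a shift by $T_x$ and $T_y$ for the operation that the text calls scaling are all assumptions.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x 3) by matrix b (3 x 3), both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def flip_horizontal():
    # Mirrors x about the vertical axis through the origin.
    return [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]

def rotation(beta_deg):
    # Counterclockwise rotation by beta degrees (row-vector convention).
    b = math.radians(beta_deg)
    return [[math.cos(b), math.sin(b), 0],
            [-math.sin(b), math.cos(b), 0],
            [0, 0, 1]]

def shift(tx, ty):
    # Moves every pixel by the offsets (tx, ty).
    return [[1, 0, 0], [0, 1, 0], [tx, ty, 1]]

coords = [[1, 0, 1], [0, 2, 1]]            # two pixels as [x, y, 1] rows
flipped = matmul(coords, flip_horizontal())
rotated = matmul(coords, rotation(90))
shifted = matmul(coords, shift(3, -1))
print(flipped, shifted)
```

In a full augmentation pipeline the transformed coordinates would still have to be resampled onto the pixel grid; that step is omitted here.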

Feature Extraction
Feature extraction is an important step in malware classification. When the data is large and difficult to manage, it is transformed into a reduced set of feature representations. A feature holds data about the size, texture, color, and shape of an image. Here, we used local features, global features, and texture features. The proposed framework adopts two stages for feature extraction, which are performed in parallel. In the first stage, SFTA is used to extract texture features from the grayscale images because of its robustness and low computation cost. In the second stage, pre-trained CNNs (AlexNet and Inception-v3) are used to extract deep features to obtain a robust classification model. Each feature extraction method is discussed in the following sections.

Texture Feature (SFTA)
Segmentation-based fractal texture analysis (SFTA) is a texture-based feature extraction method [15]. Texture is the most significant trait of an image; it is used to identify and characterize malware images and to discover the similarities between images of different malware families. SFTA is used for texture feature extraction because of its robustness and low computation cost. The SFTA extraction method can be divided into two parts, which are discussed below.
First, the input grayscale image $i(v, u)$ is decomposed into a set of binary images using two-threshold binary decomposition (TTBD). The binary images are given by Equation (6):
$$I_b(v, u) = \begin{cases} 1, & \text{if } t_{low} < i(v, u) \le t_{up} \\ 0, & \text{otherwise} \end{cases} \quad (6)$$
where $t_{low}$ is the lower threshold and $t_{up}$ is the upper threshold. After applying TTBD to the gray-level input image, the SFTA feature vector is built from the fractal dimension, the mean gray level, and the size (pixel count) of each binary image. The fractal measurements describe the complexity of the malware image structures segmented in the input image. The fractal dimension $D_i$ is computed from the boundary of each region. The boundary image, $\Delta(v, u)$, is computed as follows:
$$\Delta(v, u) = \begin{cases} 1, & \text{if } \exists\, (v', u') \in N_8[(v, u)] : I_b(v', u') = 0 \wedge I_b(v, u) = 1 \\ 0, & \text{otherwise} \end{cases} \quad (7)$$
where $N_8[(v, u)]$ is the set of pixels that are 8-connected to $(v, u)$. $\Delta(v, u)$ takes the value 1 if the pixel at $(v, u)$ in the binary image $I_b$ has the value 1 and has at least one neighboring pixel with the value 0; otherwise, $\Delta(v, u)$ takes the value 0. Consequently, the resulting borders are one pixel wide. The mean gray level and the size (pixel count) complement the information extracted from each binary image without significantly increasing the computation time. Algorithm 2 and Figure 4 show the SFTA feature extraction process.

Algorithm 2. SFTA Feature Extraction Algorithm
Require: Grayscale image $i(v, u)$ and two thresholds $t_{low}$ and $t_{up}$.
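The decomposition and border steps described above can be sketched in a few lines of Python. This is a simplified illustration rather than Algorithm 2 itself: the function names are hypothetical, pixels outside the image are ignored when checking neighbors (an assumption), and the box-counting estimate of the fractal dimension is left out.

```python
def two_threshold_binary(image, t_low, t_up):
    """Set a pixel to 1 when t_low < gray value <= t_up, else 0."""
    return [[1 if t_low < p <= t_up else 0 for p in row] for row in image]

def border_image(binary):
    """Mark region pixels (value 1) that have at least one 8-connected 0 neighbor."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            if binary[v][u] != 1:
                continue
            for dv in (-1, 0, 1):
                for du in (-1, 0, 1):
                    nv, nu = v + dv, u + du
                    inside = 0 <= nv < h and 0 <= nu < w
                    if (dv or du) and inside and binary[nv][nu] == 0:
                        out[v][u] = 1
    return out

def region_stats(image, binary):
    """Return the pixel count and mean gray level of the region."""
    values = [image[v][u] for v in range(len(image))
              for u in range(len(image[0])) if binary[v][u]]
    size = len(values)
    return size, (sum(values) / size if size else 0.0)

# Toy 5 x 5 image: a 3 x 3 block of gray level 100 on a black background.
gray = [[100 if 1 <= v <= 3 and 1 <= u <= 3 else 0 for u in range(5)]
        for v in range(5)]
binary = two_threshold_binary(gray, t_low=50, t_up=150)
border = border_image(binary)
size, mean_gray = region_stats(gray, binary)
print(sum(map(sum, border)), size, mean_gray)
```

In the toy example only the center pixel of the block has no background neighbor, so the border is the one-pixel-wide ring around it.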

Deep Convolution Neural Network (DCNN)
Here we use a deep convolutional neural network (DCNN) model for malware classification and recognition. The network has a multilayer structure comprising convolutional layers, pooling layers, and fully connected (FC) layers. Each neuron computes a dot product between its weights and the small region of the input volume to which it is connected. The activation is then applied by the ReLu layer, which does not change the size of the image. Thereafter, a pooling layer reduces the effect of noise in the extracted features. Lastly, high-level features are computed by the FC layer. The architecture of the neural network is shown in Figure 5.
In this paper we utilize two pre-trained CNN models (AlexNet [47] and Inception-V3 [48]) for feature extraction. These models combine convolution, pooling, normalization, ReLu, and FC layers. Local feature extraction from an image by a deep convolution layer is expressed in the following equations.
Let $M_i^N$ denote the output of layer $N$, $b_i^N$ the bias value, $\psi_{i,j}^N$ the filter for the $j$th feature map, and $h_j$ the output of layer $N-1$. Then, the deep convolutional layers have the formulation stated in Equation (8):
$$M_i^N = b_i^N + \sum_{j} \psi_{i,j}^N \times h_j \quad (8)$$
The pooling layer extracts the maximum responses from the preceding convolutional layer with the goal of removing unnecessary features. Max pooling also mitigates over-fitting, and a 2 × 2 window is applied to the extracted feature map. Max pooling is described in Equations (9)-(11).
Here, $S^k$ denotes the stride and $f_1^k$, $f_2^k$, and $f_3^k$ are the 2 × 2 and 3 × 3 filters for feature maps. In the definitions of the ReLu and FC (fully connected) layers, $R_i^l$ represents the ReLu layer and $FC_i^l$ represents the FC layer. The FC layers follow the convolution and pooling layers. The FC layer is similar to a convolution layer, and most researchers take the activations of the FC layer as deep features.
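To make the ReLu and max pooling steps concrete, the following Python sketch applies both to a small feature map. It is a toy illustration under assumed values: the 4 × 4 input, the function names, and the choice of a 2 × 2 window with stride 2 are not taken from the pre-trained networks used in this paper.

```python
def relu(fmap):
    """Replace negative activations with zero; the map size is unchanged."""
    return [[max(0, x) for x in row] for row in fmap]

def max_pool(fmap, size=2, stride=2):
    """Keep the maximum of each size x size window, moving by `stride`."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, w - size + 1, stride)]
            for r in range(0, h - size + 1, stride)]

feature_map = [[1, -2, 3, 0],
               [-1, 5, -6, 2],
               [0, 1, 2, -3],
               [4, -1, 0, 1]]
activated = relu(feature_map)
pooled = max_pool(activated)
print(pooled)
```

The ReLu output has the same 4 × 4 size as the input, while pooling halves each dimension, which is how the pooling layer reduces the amount of data passed to later layers.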

DCNN Feature Extraction and Fusion
Here we use two distinctive pre-trained models (AlexNet, Inception-V3) for feature extraction. These models have different structures, but they have been trained with the same dataset of malware.
AlexNet [47] contains five successive convolutional layers (Conv1 to Conv5), followed by three max-pooling layers and three FC layers, including FC7, with a final softmax classifier. The AlexNet CNN has been trained on 1.2 million images of 1000 classes, with around 60 million network parameters. The network only accepts RGB images with a dimension size of 224 × 224 × 3.
The Inception-V3 [48] model is composed of well-structured convolution modules that can both produce discriminative features and decrease the number of parameters. Every Inception module is made of several convolutional layers and pooling layers arranged in parallel. Convolutional layers of size 3 × 3, 1 × 3, 3 × 1, and 1 × 1 are used in the Inception modules to reduce the number of parameters.
The purpose of DCNN feature extraction from two network models is to improve the classifier accuracy, since the two models have different characteristics and extract different features. Therefore, three kinds of features are extracted and used: AlexNet features, Inception-V3 features, and texture (SFTA) features. We extract M × 87 features from SFTA, and deep features by taking the activations of the FC7 layer and applying max pooling to remove noise. Each feature vector has the form $f_{1\times n} = \{a_{1\times 1}, a_{1\times 2}, a_{1\times 3}, a_{1\times 4}, \ldots, a_{1\times n}\}$. The resulting feature vectors from SFTA, AlexNet, and Inception-v3 are then concatenated into the resultant feature vector $f^{Resu}_{1\times q}$, where $1 \times q$ represents its dimension and $q = (m, n, o)$ combines the dimensions of the three individual vectors.
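Serial fusion amounts to placing the three per-image feature vectors end to end. The Python sketch below shows this with made-up values. The vector lengths m = 3, n = 4, and o = 2 and the function name are illustrative assumptions, and reading the fused length as q = m + n + o is an interpretation of serial concatenation rather than a formula stated in the text.

```python
def serial_fusion(*vectors):
    """Concatenate feature vectors end to end into one fused vector."""
    fused = []
    for vec in vectors:
        fused.extend(vec)
    return fused

# Hypothetical per-image features (real vectors are much longer).
sfta_features = [0.12, 0.80, 0.33]            # m = 3
alexnet_features = [1.5, 0.0, 2.25, 0.7]      # n = 4
inception_features = [0.9, 0.1]               # o = 2

fused = serial_fusion(sfta_features, alexnet_features, inception_features)
print(len(fused))  # fused length equals m + n + o
```

Because the three sources can have very different value ranges, a real pipeline would usually normalize each block before or after fusion; the text does not specify this, so it is not shown.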

Feature Selection
The motivation for feature selection is to keep the most prominent features and eliminate unnecessary ones, which improves the accuracy rate and makes the system faster in terms of execution time. In this paper, principal component analysis (PCA) is used for optimal feature selection from the fused feature vector (FV). PCA has the following steps: (a) calculating the mean of the fused feature vector, (b) subtracting the mean from each feature, (c) calculating the covariance matrix, and (d) calculating the eigenvalues and eigenvectors of the covariance matrix. Finally, PCA returns a score value and a principal components score.
Before applying PCA, the FV is normalized so that PCA works adequately. This is done by subtracting the mean of each column from the corresponding column of the FV. Let x denote the FV values and N the total number of extracted feature values in the FV; the normalized FV, $K_0$, is then used to compute the covariance matrix.
The eigenvectors and their corresponding eigenvalues are computed from the covariance matrix cov(x, y). The selected optimal features from both models after applying PCA are fused using a parallel-based method, and a final fused vector is obtained.
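The PCA steps listed above (mean, centering, covariance, eigen-decomposition) can be sketched in plain Python. This is a minimal illustration, not the MATLAB routine used in the experiments: the function names and the toy data are assumptions, only the leading eigenpair is computed, and power iteration is swapped in for a full eigen-decomposition to keep the example short. It assumes the data is not constant and that the start vector is not orthogonal to the leading eigenvector.

```python
import math

def center(rows):
    """Subtract the column mean from every column (steps a and b)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of centered data (step c)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigenpair(cov, iters=200):
    """Largest eigenvalue and its eigenvector by power iteration (step d)."""
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec

# Toy fused features: four samples with two features each.
features = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
centered = center(features)
cov = covariance(centered)
eigenvalue, component = leading_eigenpair(cov)
# Scores: projection of each centered sample onto the leading component.
scores = [sum(r[j] * component[j] for j in range(len(r))) for r in centered]
print(round(eigenvalue, 4), [round(s, 4) for s in scores])
```

Keeping only the components with the largest eigenvalues is what reduces the dimension of the fused vector; how many components the authors retained is not stated in this section.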

Dataset
The Malimg [51] dataset (available for download from https://www.dropbox.com/s/ep8qjakfwh1rzk4/malimg_dataset.zip) consists of malware images that were generated from malware binaries, and we trained the deep learning models to classify each malware family. The dataset comprises 25 malware families and 9348 grayscale images. Each family of the Malimg dataset contains a different number of samples, so the dataset is imbalanced (see Figure 6). For example, 2949 images represent the Allaple malware family, while only 80 images are present in the Skintrim family. Image augmentation is used to balance the number of training samples and improve the quality of the data. After applying data augmentation, the data imbalance problem is resolved, and each malware family contains 1000 samples. Hence, the total number of images becomes 25,000.

Settings
For classification, we used a support vector machine (SVM) with linear, quadratic, polynomial (cubic), and Gaussian kernels; k-nearest neighbor (KNN) with cosine similarity as the distance metric (cosine KNN), KNN with the number of neighbors set to 1 (fine KNN), 10 (medium KNN), and 100 (coarse KNN), and KNN with distance weighting (weighted KNN); decision tree (DT), DT with the maximum number of splits set to four (fine DT), 20 (medium DT), and 100 (coarse DT), boosted DT, and cosine DT; the Adaboost classifier; and linear discriminant (LD) classifiers.
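The KNN variants differ only in the neighbor count, the distance function, and the voting rule, which the following plain-Python sketch makes explicit. It is not the MATLAB implementation used in the experiments: the function names are hypothetical, and inverse-squared-distance weighting for the weighted variant is an assumption, since the text does not give the weighting formula.

```python
import math
from collections import defaultdict

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def knn_predict(train_x, train_y, query, k, distance=euclidean_distance, weighted=False):
    """Predict the label of query by a vote among its k nearest training samples."""
    neighbors = sorted(
        ((distance(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbors:
        # assumed weighting: inverse squared distance, else one vote each
        votes[label] += 1.0 / (dist * dist + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Neighbor counts from the text: fine = 1, medium = 10, coarse = 100
KNN_PRESETS = {"fine": 1, "medium": 10, "coarse": 100}
```

For example, `knn_predict(X, y, q, KNN_PRESETS["fine"], distance=cosine_distance)` corresponds to a one-neighbor cosine KNN; the neighbor count used for the cosine and weighted variants is not stated in the text.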
The performance of each classifier is evaluated using the recall, error rate (ER), area under the curve (AUC), and accuracy metrics (Equations (23)-(25)). All experiments were performed with MATLAB 2018b (MathWorks, Inc., Natick, MA, USA) on a machine with a 1.00 GHz CPU and 8 GB of RAM.
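Equations (23)-(25) are not reproduced in this section, so the sketch below assumes the standard definitions: accuracy is the fraction of correctly classified samples, ER is its complement, and recall is averaged over classes (macro recall). The function name `evaluate` and the toy confusion matrix are illustrative only.

```python
def evaluate(confusion):
    """Compute accuracy, error rate, and macro recall.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total
    # per-class recall = TP / (TP + FN); empty classes are skipped
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row) > 0]
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "macro_recall": sum(recalls) / len(recalls),
    }

# Toy two-class example with a 90:10 imbalance
metrics = evaluate([[90, 0], [8, 2]])
```

In this toy case the accuracy is 92% while the macro recall is only 60%, which shows how a rare class can depress recall on imbalanced data even when accuracy is high.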

Results
To verify the efficiency of the proposed method, we combine several feature representations and selection methods and compare the resulting classifier outputs. The results fall into three groups: (a) features obtained using the traditional method (SFTA), (b) features obtained using the traditional method (SFTA) and the pre-trained networks (AlexNet, Inception-v3), and (c) fusion of deep CNN and SFTA features together with the PCA selection method. All experiments are carried out on the malware dataset by applying PCA and the proposed feature selection method.
We claim that a high level of accuracy in distinguishing malware from trusted samples can be achieved by combining deep features derived from the pre-trained networks with handcrafted features. These features are continuous, which allows more flexibility in classification. Using these continuous data, we develop a novel classification method based on pre-trained networks and a handcrafted method that reduces over-fitting during training. We compare our method with a set of machine classification methods applied in previous research and demonstrate an increase in classification accuracy on an unseen dataset.
Four experiments are performed using the balanced and imbalanced Malimg datasets to check the effectiveness of the method, and the results are compared with those of existing methods.

Experiment 1: With Imbalanced Data and SFTA Features without Feature Fusion and Feature Optimization
Experiment 1 is performed using the features obtained by applying SFTA to the Malimg dataset [51] and state-of-the-art classifiers. The imbalanced dataset contains images of widely differing sizes in 25 classes, which causes over-fitting and has a negative influence on the results. The results based on imbalanced data using SFTA, without feature fusion and feature optimization, are given in Table 3. The best accuracy is achieved using the quadratic SVM, which obtained a classification accuracy of 95.2%, recall of 69.0%, AUC of 90.0%, and ER of 4.8% with 10-fold cross-validation. The quadratic SVM obtained the minimum ER and maximum AUC compared to the other classifiers in Table 3, such as weighted KNN and linear discriminant. The recall, ER, accuracy, and AUC values of certain classifiers are also plotted in Figure 7.

Experiment 2: With Balanced (Augmented) Data and SFTA Features
Experiment 2 was performed on balanced (augmented) data, using the SFTA features but without any feature fusion and optimization. The classification results are given in Table 4. The linear discriminant (LD) classifier obtained the maximum classification accuracy on the balanced dataset: 99.3%, with recall of 97.60%, AUC of 60.0%, and ER of 0.7% with 10-fold cross-validation. On the balanced data, accuracy is higher and ER is lower than on the imbalanced dataset. Image augmentation to balance the dataset has a positive influence on the results and mitigates the problem of overfitting, so accuracy is improved. The recall, ER, accuracy, and AUC values of the classifiers are also plotted in Figure 8.

Experiment 3: With Imbalanced Data and Using Features Obtained from a Pre-Trained Network Model
Two pre-trained neural networks (AlexNet and Inception-v3) are used as deep feature extractors, and the classification results are presented in Table 5. The quadratic SVM performed best on the imbalanced dataset, obtaining a classification accuracy of 98.3%, recall of 97.0%, AUC of 88.0%, and ER of 1.7% with 10-fold cross-validation. The quadratic SVM obtained the minimum ER and maximum AUC compared to the other classifiers in Table 5, such as weighted KNN and linear discriminant. The recall, ER, accuracy, and AUC of certain classifiers are also plotted in Figure 9.

Experiment 4
The classification results are presented in Table 6, and their visualization is shown in Figure 10. Table 6 shows a maximum accuracy of 99.3%, recall of 99.0%, AUC of 100.0%, and ER of 1.3% for the cubic SVM compared to the other classifiers listed in Table 6, which demonstrates the effectiveness of balanced data and the robustness of the proposed technique. The pre-trained network works well with balanced datasets and obtained maximum accuracy on certain classifiers. The recall, accuracy, and AUC values on the balanced dataset are higher than on the imbalanced dataset.

Statistical Analysis
To assess the statistical significance of the results, we adopted the non-parametric Friedman test for the compared methods on all cross-validation folds. As the classification results are not Gaussian distributed, the Friedman test was applied to detect differences in the classification results of the various methods across multiple runs. The results of the Friedman test regarding accuracy show that the differences between the methods are statistically significant (p < 0.01). The critical difference (CD) is the smallest difference in mean ranks that is statistically significant; methods whose mean ranks differ by less than the CD are not significantly different.
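The ranking procedure behind the Friedman test and the Nemenyi critical difference can be sketched in plain Python as follows. This is a simplified illustration, not the authors' procedure: the function names are hypothetical, the statistic omits the correction for ties, and the critical value `q_alpha` must be taken from a table of the Studentized range statistic for the chosen significance level and number of methods.

```python
import math

def average_ranks(scores):
    """Rank one fold's scores so the highest gets rank 1; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def friedman_statistic(table):
    """table[f][m] is the accuracy of method m on fold f.

    Returns (chi-square statistic without tie correction, mean ranks).
    """
    n, k = len(table), len(table[0])
    mean_ranks = [0.0] * k
    for row in table:
        for m, r in enumerate(average_ranks(row)):
            mean_ranks[m] += r / n
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return chi2, mean_ranks

def nemenyi_cd(q_alpha, k, n):
    """Critical difference in mean ranks for k methods compared over n folds."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; two methods are then considered significantly different when their mean ranks differ by at least the value returned by `nemenyi_cd`.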

Figure 11. Critical difference diagram of the Nemenyi test results for malware classification using SFTA features (imbalanced dataset). CD is critical difference.
The confusion matrix for the best classifier (cubic SVM with a fusion of SFTA and pre-trained neural network features, applied to the balanced dataset) is presented in Figure 15. Only a few misclassifications occur, mostly between the two most numerous classes.
The best classification results (based on accuracy) for all experiments are summarized in Figure 16. Image augmentation yielded a statistically significant improvement for the SFTA features (by 4.1%, p < 0.05, using the non-parametric Wilcoxon rank-sum test). However, for the fused feature set, the improvement was not statistically significant. On the other hand, feature fusion on the original (imbalanced) dataset also yielded a statistically significant improvement (by 3.5%, p < 0.05).

Comparison of the Results of the Proposed Technique with Other Existing Methods
The results on the Malimg dataset are compared with the results of other authors using the same dataset in Table 7. Our method achieved the best results on this dataset.

Complexity: As shown in Figure 1, our proposed method consists of five key steps: data augmentation, feature extraction, fusion, selection, and classification. Of these, the fusion and selection steps are the most important, and the efficiency of our system depends on them. We express the complexity of each step in Big-O notation. For data augmentation, feature extraction, and classification, the complexity is O(1), while for feature fusion and selection, the complexity is O(n) + 3 and O(n) + k, respectively, where k denotes the number of selected parameters. Hence, our method runs under one major loop, so the overall complexity is O(n) = C.

Conclusions
In this paper, we have demonstrated a hybrid method that uses image augmentation and a fusion of segmentation-based fractal texture analysis (SFTA) and pre-trained deep neural network (AlexNet and Inception-v3) features to classify malware images into classes using different state-of-the-art classifiers. In the first step, grayscale images are generated from the bytecodes of the malicious programs. In the second step, image augmentation is performed to balance the malware classes, and the images are resized to 227 × 227 × 3. In the last phase, the best pre-trained network features and SFTA features are selected and fused. The selected features are then forwarded to several versions of KNN, SVM, DT, and LD classifiers for malware image identification. The key advantage of the proposed method is its ability to achieve a high level of classification accuracy compared to the existing methods. There is no one-stop solution for improving the accuracy of a prediction model faced with imbalanced datasets. To address the imbalance, we augment and balance the dataset, which improves the performance of the proposed model. Malware classification based on deep learning brought a considerable improvement over the existing strategies, and the experimental results confirm the efficiency of the proposed technique, as the accuracy was improved and the model loss was decreased.
The proposed method achieved an accuracy of 99.3% with the cubic SVM classifier, which compares favorably with other current malware classification approaches.