Deep Feature Extraction and Classification of Android Malware Images

The Android operating system has gained popularity and evolved rapidly over the past decade. Traditional approaches such as static and dynamic malware identification techniques require considerable human intervention and resources to design the malware classification model. The real challenge is that inspecting all files of the application structure leads to high processing time, more storage, and manual effort. To solve these problems, optimization algorithms and deep learning have recently been tested for mitigating malware attacks. This manuscript proposes Summing of neurAl aRchitecture and VisualizatiOn Technology for Android Malware identification (SARVOTAM). The system converts the non-intuitive malware features into fingerprint images to extract quality information. A fine-tuned Convolutional Neural Network (CNN) is used to automatically extract rich features from visualized malware, thus eliminating the cost of feature engineering and domain experts. The experiments were done using the DREBIN dataset. A total of fifteen different combinations of the Android malware image sections were used to identify and classify Android malware. The softmax layer of the CNN was substituted with machine learning algorithms such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest (RF) to analyze the grayscale malware images. The CNN-SVM model outperformed the original CNN as well as CNN-KNN and CNN-RF. The classification results showed that our method achieves an accuracy of 92.59% using Android certificate and manifest malware images. This paper thus offers a lightweight and more precise option for malware identification.


Introduction
Any software with mala fide intention is malware (malicious software). Malware generally exhibits mischievous behaviour and is developed to interrupt normal functioning, steal sensitive information, display unwanted advertising, or gain control of users' devices without their knowledge. Moreover, malware and unintentionally harmful software are collectively termed badware. The main categories into which malware can be grouped are viruses, worms, Trojans, ransomware, rootkits, and botnets [1]. Like computer systems, malware has evolved to be more intelligent, smart, and decisive. Malware can adopt polymorphic and metamorphic techniques to evade traditional methods of malware identification [2][3][4][5]. Newly developed malware is sophisticated enough to obstruct emulators and avoid deep static analysis. Malware also propagates by deploying metamorphism methods such as multi-packer, code transformation, encryption, registry modification, virtual machines, anti-debugging, and instruction permutation. Malware is smart enough to detect the best moment to launch its [...]
• We propose a novel system called SARVOTAM, defined as Summing of neurAl aRchitecture and VisualizatiOn Technology for Android Malware classification.
• It works on the raw bytes and eliminates the need for decryption, disassembly, reverse engineering, and execution of code for malware identification. The system converts the non-intuitive malware features into fingerprint images to extract quality information.
• By seeing through the malware binary, the proposed system can discover and extract insights necessary for malware analysis, and it paves the path for the development of effective malware classification systems.
• A CNN was fine-tuned to automatically extract rich features from visualized malware, thus eliminating the cost of feature engineering and domain experts.
• SARVOTAM was augmented with traditional classifiers such as K-Nearest Neighbour (KNN), Support Vector Machine (SVM), and Random Forest (RF) to recommend prominent Android file structure features for malware identification and classification. The CNN-SVM model outperformed the original CNN as well as CNN-KNN and CNN-RF.
• To the best of our knowledge, the generation and classification of malware images using fifteen unique combinations of the Android malware file structure have been explored for the first time.
• Malware images formed using the Certificate and Android Manifest files (CR+AM) were observed to offer a lightweight and more precise option for malware identification. One need not inspect all files in the APK for malware identification and classification.
• The proposed system was evaluated against the DREBIN dataset [47]. This dataset consists of 179 different malware families containing 5560 applications.
A simplified depiction of the proposed SARVOTAM methodology is shown in Figure 1.
• Computer System: A machine with an Intel Core i5 processor, 8 GB RAM, and 2.7 GHz clock speed was used for the experiments and results.
• Transformation of malware applications into images: The proposed SARVOTAM system sees through the malware binary to discover and extract insights necessary for malware analysis by converting the malware binary into grayscale images. Fifteen unique malware images were created using different files of an APK for every malware family sample. Section 3.1 discusses in detail the methodology adopted to transform malware applications into images.
• Feature Extraction: Accurate feature engineering is an important task for any classification model. In this study, a fine-tuned CNN was used to automatically extract rich features from visualized malware images, thus eliminating the cost of feature engineering and domain experts.
The rest of this paper is organized as follows: Section 2 offers a discussion on related work; Section 3 elaborates the adopted methodology; Section 4 interprets the experimental results; and Section 5 concludes the findings.

Related Work
Visualization-based analysis of malware has been conducted by several researchers [10,48,49,50]. Visualization-based approaches tend to work directly on the malware image structure [11,51,52,53]. Unlike static and dynamic techniques, visualization-based analysis supports faster classification of malware samples, as it does not require an application to be disassembled or executed. Therefore, it outperforms conventional techniques when the task is to classify a large number of malware samples. In [54], the authors converted an APK file structure into four different image formats: Grayscale, Red-Green-Blue (RGB), Cyan-Magenta-Yellow-Black (CMYK), and Hue Saturation Lightness (HSL). Three different machine learning classifiers, namely Decision Trees, Random Forest, and K-Nearest Neighbour, were trained using Global Image Descriptor (GIST) features on each image representation to classify whether an application is benign or malware. The authors achieved a high accuracy of 91% with the random forest classifier on the grayscale image representation. The authors in [11] performed fine-grained classification of Portable Executable (PE) files using the visualization-based approach. They visualized the malware as an RGB-coloured image. The dataset was composed of 15 families containing 7087 malware samples. They built their model by combining global and local features for malware classification. The data and code sections of the file were processed as feature vectors to constitute local features, and global features were extracted from the RGB-coloured image. To train the model they used three classifiers, namely Random Forest, Support Vector Machine, and K-Nearest Neighbour. The results of the malware classification experiments showed that the Random Forest classifier achieved a high accuracy of 97.47%. Their approach did not work with a non-PE file structure, e.g., an APK file structure.
Hence, their method cannot be used directly for classification Android malware families. Authors in [55], consider only the code section of an APK file. For this task, they first converted the dex file into a jar file using dex2jar tool. Further jar file was converted into java file using jad tool. For each APK file, they put the code part in separate text files. To identify the important words in text file, authors employed the technique called as Term Frequency-Inverse Document Frequency (TF-IDF) in their work. TF-IDF weight is a statistical measure that helps to interpret that how important a term is to a text file in a collection of large text files. TF computes the normalized term frequency, which is calculated as the number of times a term appears in a document, divided by the total number of terms in that document. IDF measures how important a term is. It is computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears. It helps to weight down the frequent terms while scaling up the rare ones. After mining the important terms from the text files, they arranged these files into several groups. These groups were further processed to generate pictures by using simhash [56] and djb2 algorithm [57]. The authors deployed a convolutional neural network for learning and classification and achieved an accuracy of 92%. Authors ignored other building blocks of APK file, such as META-INF, Resources, AndroidManifest.XML in their work. Authors in [58], demonstrated the experiment over 32 malware families constituting 12,000 images of malware. They studied the performance comparisons on various classifiers such as a Convolutional Neural Network, K-Nearest Neighbour, and Support Vector Machine with different image descriptors such as Local Binary Pattern (LBP) and GIST. 
Convolutional neural network model trained with 6 layers using LBP features achieved a high accuracy of 93.92% against the dataset chosen. They visualized the malware as grayscale and Red Green Blue Alpha (RGBA) images. Researchers analysed the performance of both image formats using the CNN model, which was trained with LBP features. Authors also concluded that visualizing malware as a colour image might lose some important features. In machine learning, deciding the subset of features that can potentially be used for critical malware analysis is a challenging task. A proper feature set should be generated to build an accurate malware analysis or detection model. Authors in [59] developed the visualization method in C language to study the internal structure (patterns/anomalies) of Android malware executable files. Researchers also claimed that their method has the potential to disclose feature set for classification of malware families. They only considered the .dex file in their work. Bytes in .dex file were mapped to a pixel on the image. Numerous varieties of obfuscation tools have been available in the market, being used by legitimate developers to protect their intellectual property of Android applications. The tools and techniques which were originally designed to protect intellectual property are now widely exploited and abused among malware authors to create Android malware variants more resilient. Authors in [60] utilize the visualization-based approach to fingerprint the obfuscation tools used in the development of the Android application cycle. Malware binary visualized as an image. They calculated two types of statistical features from an image. These features are synthesized to extract information to uncover the type of obfuscation tool employed by an application developer. Researchers claimed accuracy of 73% and 86% for fingerprinting the obfuscation tool and classification of obfuscated and original applications respectively.
The literature review indicates that, although an APK file is a sequence of bits and can therefore be viewed as a binary image, there is no clear consensus among researchers on the type of analysis and the prominent APK parameters suitable for the classification of malware entities. The traditional malware classification approaches rely on extracting static and dynamic features. These approaches tend to use code analysis to solve a malware classification problem. Existing malware classification approaches used signature-based and feature-based approaches. Unfortunately, these approaches are hampered by code disassembly, code obfuscation, and high consumption of resources. Researchers have also realized that these approaches are heavy on time and space. Moving towards the infusion of deep learning with visualization approaches is the beginning of a new era in Android security. The proposed solution leverages the strengths of visualization and deep learning techniques to solve the multiclass malware classification problem. A deep learning architecture eliminates the need to capture features such as API calls, permissions, and meta-data information, as well as dynamic features such as system calls and network activity, to generate a high-quality malware classification model. Solutions leveraging the combination of visualization-based analysis and deep learning [61,62] have lately shown impact in research related to security and privacy. Most of the proposed solutions [10,11,16] have attained good accuracy in Windows malware classification. Researchers worked with PE files because their experiments were restricted to Windows environments. The Windows platform is most popular on desktop personal computers, whose hardware architecture is very different from that of lightweight mobile devices running Android. Therefore, solutions for Windows platform applications such as PE files cannot be directly applied to Android malware family classification.
The cited literature was published in 2020, and its authors have probably not tested their approaches on APKs. This study validates the use of feature extraction for Android malware images.

Materials and Methods
This section discusses the fundamental concepts involved in the experiment design. The DREBIN dataset of Android malware applications has been used for this experiment. The dataset contains 5560 files from 179 different malware families. Most of the research literature from 2014 to 2020 has used the DREBIN dataset as the standard dataset for malware-related experiments. The dataset includes popular Android malware families such as FakeInstaller, GoldDream [24], GingerMaster [23], and DroidKungFu [25]. A summary of the malware datasets used by the research community is given in Figure 2 [63]. Further, the prime objective of this manuscript is to validate the proposed method for malware identification rather than to study the malware itself. Furthermore, the most recent malware dataset available for research is from 2017 [12], which also may not have sufficient samples of contemporary malware types.
Experiment design, adopted methodology and fundamental contributory concepts are detailed next.

Transforming Malware APK into Images
As per established research standards, the classes.dex, resource, manifest, and certificate files are primarily considered for visualization of an APK [55]. In this manuscript, the authors generated malware images using these four files from each malware APK. The malware binaries are converted into 8-bit vectors and subsequently converted into grayscale images. There are a few fundamental steps involved in transforming any malware sample into a digital image. The entire malware bit string can be seen as a sequence of substrings. Each substring is 8 bits long and is termed a pixel. This 8-bit substring is then mapped to an unsigned decimal number in the range 0 to 255. For example, if a bit string is 0011101110111111, the process is 0011101110111111 → 00111011, 10111111 → 59, 191. Any 8-bit number can be represented as $(\mathit{bin}_7, \mathit{bin}_6, \mathit{bin}_5, \mathit{bin}_4, \mathit{bin}_3, \mathit{bin}_2, \mathit{bin}_1, \mathit{bin}_0)$ and can be converted into a decimal number $D$ as $D = \mathit{bin}_7 \cdot 2^7 + \mathit{bin}_6 \cdot 2^6 + \mathit{bin}_5 \cdot 2^5 + \mathit{bin}_4 \cdot 2^4 + \mathit{bin}_3 \cdot 2^3 + \mathit{bin}_2 \cdot 2^2 + \mathit{bin}_1 \cdot 2^1 + \mathit{bin}_0 \cdot 2^0$. The next step is to create a malicious code matrix. For this purpose, all malware substrings are transformed into a one-dimensional vector of decimal numbers. Subsequently, the one-dimensional vector is transformed into a two-dimensional matrix of a certain width. The resultant two-dimensional matrix is then interpreted as a two-dimensional grayscale image. The transformation process is depicted graphically in Figure 3. Based on empirical observations, we have fixed the image widths according to the file size, as shown in Table 1 [16]. Note that the height of the malware image varies with the file size. Grayscale image visualization of Android families from the DREBIN dataset is shown in Figure 4. The overall structure of the grayscale images corresponds to the various sections of an APK.
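The transformation described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bits_to_pixels` and `bytes_to_image` are hypothetical, the image is held as a plain list of rows rather than an image object, and trailing bytes that do not fill a complete row are simply dropped (the source does not say how such a remainder is handled).

```python
def bits_to_pixels(bit_string):
    """Split a bit string into 8-bit substrings and map each to 0..255."""
    # Assumption: an incomplete trailing substring (fewer than 8 bits) is ignored.
    usable = len(bit_string) - len(bit_string) % 8
    return [int(bit_string[i:i + 8], 2) for i in range(0, usable, 8)]


def bytes_to_image(data, width):
    """Arrange raw bytes as rows of `width` pixels (a 2D grayscale matrix)."""
    pixels = list(data)            # each byte is already an integer in 0..255
    height = len(pixels) // width  # the height follows from the file size
    # Assumption: bytes that do not fill a whole row are dropped.
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


# The worked example from the text: 0011101110111111 -> 59, 191
print(bits_to_pixels("0011101110111111"))  # [59, 191]

# A 12-byte "file" laid out with a width of 4 gives a 3 x 4 matrix.
image = bytes_to_image(bytes(range(12)), width=4)
print(len(image), len(image[0]))  # 3 4
```

In practice the width would be chosen from the file size, as in Table 1, and the resulting matrix would be saved or fed to the network as a grayscale image.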
Android malware images for twenty distinct families in the DREBIN dataset have been generated using fifteen different file structure combinations. The files used are certificate (CR), Android manifest (AM), classes.dex (CL), and resource (RS). The combinations and the associated samples of each class are listed in Table 2. For example, instances of malware images from various families for the CR+AM+RS+CL combination are shown in Figure 4.
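The count of fifteen is consistent with taking every non-empty subset of the four files, since 2^4 - 1 = 15. The sketch below enumerates these subsets. It assumes that this is how the combinations in Table 2 were formed; the helper name `file_combinations` and the ordering of the labels within each combination are illustrative only.

```python
from itertools import combinations

# Abbreviations as used in the text: certificate, manifest, classes.dex, resource.
FILES = ["CR", "AM", "CL", "RS"]


def file_combinations(files):
    """Return every non-empty subset of `files`, joined with '+'."""
    combos = []
    for size in range(1, len(files) + 1):
        for subset in combinations(files, size):
            combos.append("+".join(subset))
    return combos


combos = file_combinations(FILES)
print(len(combos))  # 15
print(combos[:5])   # ['CR', 'AM', 'CL', 'RS', 'CR+AM']
```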

CNN Architectures
The proposed approach sees through binary information to discover and extract necessary insights for malware analysis. It paves the path for developing an effective malware classification system. CNN can attain high accuracy over challenging problems such as object detection, object classification and object recognition. They are a kind of special neural network for processing data that is known to have a grid-like topology. This could either be a one-dimensional time series data which is a grid of samples over time or two-dimensional image data. Every filter in CNN does some kind of operation to extract quality information from images. Filters in CNN play a very important role in extracting information from images. The detailed configuration of CNN architecture deployed during this experiment is briefed in Table 3. The description of each layer has been discussed below: (a) Convolutional Layer: This is the first layer for CNN. At this layer, we convolve image or data using filters or kernels. Filters are small units that are to be applied through a sliding window. The depth of the filter is the same as that of input. For instance, a coloured image would have RGB values hence its depth would be set to three. In other words, a filter of depth 3 would be applied to it. The convolution operation involves taking the element-wise product of filters in the image and then summing those values for every sliding action. The output of the convolution of a 3D filter with a color image is a 2D matrix. It is important to note that convolution is not only applicable to images but can also convolve one-dimensional time-series data. In this experiment, the convolution layers are composed of 32, 128, and 256 with filters of size 7 × 7, 5 × 5, and 3 × 3 for the first, second, and third convolutional layer respectively. (b) Activation Function Layer: An activation function is used to activate the neurons and send the signals further within the model. 
Weights and activation functions are important for transferring the signals through neurons. The Rectified Linear Unit (ReLU) activation function mitigates the vanishing gradient problem. It supports faster computation with less overhead, as it does not compute exponentials or divisions. ReLU has been used to remove all the negative values from the output matrix produced by the convolution layer. It activates a node only if the input is above zero: while the input is below zero the output is also zero, and when the input rises above zero the output has a linear relationship with the input. The output of the ReLU activation function is fed to the pooling layer. (c) Pooling Layer: It involves the downsampling of features to reduce the number of parameters during training. Typically, there are two hyperparameters introduced with the pooling layer. The first is the dimension of the spatial extent, defined as the value of N for which an N × N feature representation is mapped to a single value. The second is the stride, defined as how many features the sliding window should skip along the width and height of the malware image. In this experiment, the pooling layer uses a max filter of size 3 × 3, 3 × 3, and 2 × 2 for the first, second, and third convolutional layers respectively. The filter was moved across the entire matrix produced by the ReLU layer, and the maximum pixel value was taken from each window to shrink the malware image. All these layers were stacked up by adding more layers of convolution, ReLU, and pooling. (d) Batch Normalization Layer: Batch normalization is used for stable learning of a deep neural network. Stable convergence is a significant problem in deep networks. This problem is caused by the vanishing and exploding gradient problems [64,65] and by the different variances of activations within layers. The varying scale of different parameters causes bouncing in gradient descent.
In forward propagation, the output depends multiplicatively on each weight and activation function evaluation. The key point is that in backward propagation, the partial derivative gets multiplied by the weights and the activation function derivatives. Unless the product of the weight and the activation function derivative is exactly one, the gradients will tend either to increase or to decrease. This is partially caused by the fact that the activations in different layers have different variances. The distribution of the input at each layer changes over training. Batch normalization addresses this issue by adding a batch normalization layer between the layers of the neural network. It ensures that the variances of the outputs of each layer are similar. Batch normalization normalizes not only the input features but also the features in each layer. This principle of normalizing the input features is carried through to all layers to ensure the most stable behaviour and faster convergence of the underlying algorithm. (e) Dropout Layer: In a multilayer neural network, we often face an overfitting problem, also known as the high variance problem. The dropout layer in a neural network is used to address the overfitting problem. Only a subset of the features from the preceding layer is used: dropout randomly selects neurons and deactivates them during the learning process. In a nutshell, deactivated neurons do not participate in the learning process. For every layer, a dropout ratio of 0.5 is selected. (f) Flatten Layer: Flatten is an operation that converts the 2D feature maps into a 1D vector.
The flatten layer in the network takes the output from the previous layer and flattens it into a one-dimensional tensor. Basically, it takes the shrunk malware images and puts them in a single list or vector. (g) Fully Connected/Dense Layer: The output from the convolutional layers represents high-level features in the data. Essentially, the convolutional layers provide a meaningful, low-dimensional, and somewhat invariant feature space, whereas the fully connected layer learns a possibly nonlinear function in that space. The output of a pooling layer has to be converted to a suitable input for the fully connected layers. The output of the pooling layer is a 3D feature map (a 3D volume of features). However, the input to a simple fully connected feed-forward neural network is a one-dimensional feature vector. The features are usually very deep at this point because of the increased number of kernels introduced at every convolutional layer. Convolution, activation, and pooling layers can occur many times before the fully connected layers, which is the reason for the increased depth. To convert the 3D feature map into one dimension, the output width and height have to be 1. This is done by flattening the 3D layer into a 1D vector. For classification problems, this involves introducing hidden dense layers and applying a softmax activation at the final dense layer of neurons. In this paper, hidden dense layers D1, D2, and D3, which have 50, 100, and 200 neurons respectively, have been added to the CNN architecture. Finally, one more dense layer D4 with 20 neurons is used as the output layer. It classifies the malware images with respect to their families. Softmax is used as the activation function at the last layer.
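To make the layer descriptions above concrete, the following sketch runs one tiny single-channel image through a convolution, ReLU, max pooling, and flattening step, and shows a softmax over a score vector. It is a didactic pure-Python sketch, not the network of Table 3: the 4 × 4 image, the 2 × 2 kernel, the pooling size, and all function names are made-up illustrations, and the convolution is the unflipped sliding-window product commonly used in CNN libraries.

```python
import math


def conv2d(image, kernel):
    """Slide the kernel over the image; sum the element-wise products per window."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Replace every negative value in the feature map with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size, stride):
    """Keep the maximum of each size x size window, moving by `stride`."""
    rows = range(0, len(fmap) - size + 1, stride)
    cols = range(0, len(fmap[0]) - size + 1, stride)
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]


def flatten(fmap):
    """Turn a 2D feature map into a 1D feature vector."""
    return [v for row in fmap for v in row]


def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


image = [[5, 2, 0, 1],
         [3, 1, 2, 0],
         [0, 4, 1, 3],
         [2, 0, 5, 1]]
kernel = [[1, 0],
          [0, -1]]

features = flatten(max_pool(relu(conv2d(image, kernel)), size=2, stride=1))
print(features)  # [4, 0, 0, 0]
print(softmax([2.0, 1.0, 0.1]))
```

In the architecture described in the text, the flattened vector would pass through the dense layers D1 to D3 before the softmax output layer with 20 neurons.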

Machine Learning Algorithms
The machine learning algorithms such as KNN, SVM, and RF are applied to analyze the grayscale malware images using CNN features. The stated algorithms are discussed as follows: (a) KNN (K-Nearest Neighbors): KNN or K-Nearest Neighbor is a supervised classification algorithm. It identifies data points which are separated into several classes and predicts the class label for a new sample data point. It is a renowned method to classify data objects based on the closest training samples in a feature space. K in KNN refers to the number of nearest neighbors that the classifier will use to make its prediction. The unknown data points are classified by majority votes from chosen 'K' nearest neighbors. KNN uses the least distance measures such as Euclidean and Manhattan to find out the nearest neighbors. We have used Euclidean distance measure in this study. (b) SVM (Support Vector machine): SVM is specific to supervised machine learning. The model based on supervised learning learns from the past input data and makes future predictions as output. SVM is primarily used for classification purposes, though it can also solve regression problem statements. In the SVM algorithm, support vectors are the extreme points in the dataset. The distance between the hyperplane and the support vectors should be as far as possible.
The optimal hyperplane has the maximum distance to the support vectors of any class. The distance between the support vectors of different classes is defined as the distance margin. The distance margin is calculated as the sum of D− and D+, where D− is the shortest distance from the hyperplane to the closest negative point and D+ is the shortest distance from the hyperplane to the closest positive point. SVM aims to find the largest distance margin, which yields the optimal hyperplane. An optimal hyperplane produces good classification results. For non-linear data, or where the hyperplane has a low or no margin, there is a high chance of misclassification of data points. In such scenarios, kernel functions are used to transform the data so that it becomes easier to split and classify: kernel functions take the low-dimensional feature space as input and transform it into a high-dimensional feature space as output. Support vector machines are commonly applied to face detection, text and hypertext categorization, classification of images, and bioinformatics. (c) Random Forests: The random forests algorithm is one of the most popular and powerful supervised machine learning algorithms and is capable of performing both regression and classification tasks. Random forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy. In general, the more trees in the forest, the more robust the prediction. The use of multiple trees in random forests reduces the risk of overfitting. The algorithm runs efficiently and produces highly accurate predictions on large databases. Random forests can maintain accuracy even when a large proportion of the data is missing. To classify a new object based on its attributes, each tree gives a classification result according to its defined rules. In other words, each tree casts its vote for the classification.
The random forest chooses the class that receives the most votes over all the trees in the forest.
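The majority-vote principle shared by KNN and random forests can be sketched as follows. This is a minimal pure-Python illustration with made-up two-dimensional points and hypothetical labels "A" and "B"; in the proposed system the inputs would be CNN feature vectors, and a library implementation would normally be used instead. The function names are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def majority_vote(labels):
    """Return the label that occurs most often (as in KNN or a forest of trees)."""
    return Counter(labels).most_common(1)[0][0]


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    return majority_vote([label for _, label in ranked[:k]])


# Toy training set: two well-separated clusters (illustrative values only).
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_x, train_y, [0.5, 0.5]))  # A
print(knn_predict(train_x, train_y, [5.5, 5.5]))  # B

# A random forest aggregates per-tree predictions in the same way.
print(majority_vote(["A", "B", "A"]))  # A
```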

Results
Experiments were conducted on the DREBIN dataset. As a preprocessing step, the DREBIN dataset was transformed into malware images (discussed in previous sections). We worked on the top 20 classes of the dataset (see Table 2). The detailed algorithm of the proposed work is given in Algorithms 1-4. A machine with an Intel Core i5 processor, 8 GB RAM, 2.7 GHz clock speed, and a GPU was used for experimentation. The proposed SARVOTAM implementation includes the following steps. First, a deep convolutional neural network is trained. This network acts as a coding network that extracts rich features from the malware images. These features represent high-level concepts for the identification and classification of malware. Finally, we design an efficient model to fuse the CNN features with machine learning algorithms. The results obtained are shown in Table 4. The support vector machine (SVM) is popular for classification, particularly in medical signal processing, image detection, face detection, geo and environmental sciences, and bioinformatics. For classification and recognition, great attention has been paid to the fusion of neural networks and SVM [66][67][68][69]. The benefits of their combination have been confirmed by many researchers for pedestrian detection [70], face recognition [71], and handwritten digit recognition [67].
For classifier boosting, SVM, KNN, and RF are used as alternatives to the softmax layer to enhance the generalization ability of the CNN. Beyond the stand-alone CNN architecture, machine learning algorithms such as SVM, KNN, and RF were fused with the CNN to augment the performance of the proposed system on various combinations of malware images. As can be seen in Table 4, CR+AM were found to be the most precise features for the identification and classification of Android malware. In the case of the generic CNN, an accuracy of 91.48% was recorded for the classification of Android malware based on binary images. To further improve the classification accuracy of the CNN, its softmax layer was substituted with SVM, KNN, and RF. The results observed when substituting the softmax layer with SVM, KNN, and RF are shown in Figures 6-8 respectively.

Algorithm 1: Classification of Android malware families
Input: Malicious applications from DREBIN dataset
Result: Classification of Android malware families
Step 1. Import all the necessary libraries.
Step 2. An empty list is created for storing the training data.
Step 5. Create the object of the file for further processing.
    obj = pickle.load(fw)
Step 6. For every unique combination as stated in Step 3.
    [alldata, label, flist] = Fimg(obj, comb)
    TRAINDATA = numpy.array(alldata)
    train_L = numpy.array(label)
    model_cnn, train_all, test_label, pred_prob = cnn_model(TRAINDATA, train_L)
Step 7. Split the testing and training data and set up the features and labels.
    [X_train, X_test, train_label, test_label] = train_test_split(train_all, train_L, test_size=0.…

Step 4. In the convolution layer, each feature will move throughout the entire image and the pixel value of the image gets multiplied with that of the corresponding pixel value of the filter, adding them up and dividing by the total number of pixels to get the output.

Algorithm 3: Cont.
Step 5. The ReLU activation function is applied to remove all negative values from the output matrix of the convolution layer.
Step 7. Batch normalization is applied for stable learning of the network.
model.add(BatchNormalization())
Step 8. The Dropout layer is used to counter overfitting. The dropout rate is set to 0.5.
model.add(Dropout(0.5))
Step 9. More layers of convolution, ReLU, pooling, batch normalization, and dropout are stacked up.
Step 10. The flatten layer takes the output of the previous layers and flattens it into a one-dimensional tensor. The shrunk malware images are thus put into a single list or vector.
model.add(Flatten())
Step 11. Further, the malware images are fed into fully connected (dense) layers. Three dense layers D1, D2, and D3, with 50, 100, and 200 neurons respectively, have been added to the CNN architecture.

img = numpy.reshape(a, (int(len(a) / width), width))
img = numpy.uint8(img)
return img
End procedure

It was observed that the CNN-SVM fusion outperformed the other softmax layer substitutes. An improvement in classification accuracy was observed for all fifteen combinations of malware image sections. For thirteen combinations, CNN-SVM achieved an accuracy in the range of 90% to 93%, as shown in Figure 6. The highest accuracy of 92.59% was observed using the CR+AM combination of malware images. The increase in accuracy ranges from 0.50% to 3%.

Using KNN in place of the softmax layer resulted in only a marginal increase in CNN accuracy, and only for a few image sections. A decrease in accuracy was also observed for the combination of CR and AM. The average classification accuracies of CNN and CNN-KNN were 88.66% and 88.76%, respectively. Detailed performance of the CNN-KNN fusion is depicted in Figure 7. Integrating RF with CNN resulted in the poorest performance in comparison to SVM and KNN, as shown in Figure 8.

Table 5 compares the proposed work with state-of-the-art proposals. The detailed runtime performance metrics, such as memory consumption, total execution time, and the time spent to identify a possible APK as malware using different combinations of malware images, are shown in Table 6. In our work, CNN-SVM performed well in comparison to the generic CNN architecture and the other substitutes of the softmax layer for 100 epochs. The detailed confusion matrix and other performance metrics are presented in Table 7 and Figure 9, respectively.
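As a rough illustration of Steps 4, 5, and 10 of Algorithm 3, the following NumPy sketch applies one filter to a toy image, clips negative values with ReLU, and flattens the result. The 4x4 checkerboard input, the 2x2 filter values, and the function names `conv2d_mean` and `relu` are assumptions made for this example. The division by the number of filter pixels follows the wording of Step 4 and is not necessarily what the layers of the actual model compute.

```python
import numpy as np

def conv2d_mean(image, kernel):
    """Slide the filter over the image; at each position multiply the window
    by the filter element-wise, sum, and divide by the number of filter pixels."""
    image = image.astype(float)  # avoid uint8 overflow in the products
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel) / kernel.size
    return out

def relu(x):
    """Replace every negative entry with zero (Step 5)."""
    return np.maximum(x, 0.0)

# Toy byte sequence reshaped into a 4x4 grayscale image, mirroring the
# reshape fragment of the listing (values are made up).
a = np.array([0, 50, 0, 50, 50, 0, 50, 0, 0, 50, 0, 50, 50, 0, 50, 0])
width = 4
img = np.uint8(np.reshape(a, (int(len(a) / width), width)))

kernel = np.array([[1.0, -1.0], [-1.0, 1.0]])  # hypothetical 2x2 filter

feature_map = conv2d_mean(img, kernel)   # 3x3 map alternating between -25 and 25
activated = relu(feature_map)            # negatives become 0
flat = activated.flatten()               # one-dimensional vector (Step 10)

print(feature_map)
print(activated)
print(flat.shape)
```

The flattened vector is what the dense layers of Step 11 would receive as input.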
Among all classifier combinations, the fusion of CNN with SVM performed well and showed particularly high precision and recall for the Android malware families Kmin, GoldDream, FakeDoc, Iconosys, Opfake, and FakeInstaller. CNN-SVM enhanced the performance of malware classification and attained an accuracy of 92.59% using CR+AM images, as discussed earlier. CNN-SVM showed low precision and recall for malware families such as ExploitLinuxLotoor, MobileTx, Gappusin, and BaseBridge, mainly because these families contain fewer samples than the others. The malware family SendPay attained equal precision and recall of 0.94. The error rate for the families Kmin and Iconosys is 0, which indicates that the model learned the actual behavior of these families. The highest error rates were observed for the families ExploitLinuxLotoor, MobileTx, Imlog, SMSreg, DroidDream, and Gappusin. The probable reason for the low performance on malware like ExploitLinuxLotoor is the small number of samples in the training dataset. Such malware families mostly target rooted Android devices (where the admin rights of the device are with the user and not with the stock Android provider or proprietor). The malware alters its signature after attaining root access to the device; until then, the malware file tries to look as legitimate as possible. Evaluating the proposed method on rooted and non-rooted devices opens a new horizon for this research. It is to be noted that samples of the families Imlog and SMSreg are frequently misclassified into other families, yet these families achieved precision as high as 100%. This indicates that images of these families are highly different from those of other malware families. The classification achieved a low error rate, ranging from 2% to 6%, for the families Opfake, Plankton, FakeInstaller, GoldDream, Fakedream, SendPay, and Geinimi.
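The observation that a family can be frequently misclassified and still reach 100% precision follows from how per-family metrics are read off a confusion matrix. The sketch below uses a made-up three-family matrix: the family names, the counts, and the `per_family_metrics` helper are hypothetical, and the per-family error rate is assumed here to be 1 - recall, which the text does not define.

```python
# Hypothetical confusion matrix: rows = true family, columns = predicted family.
families = ["FamilyA", "FamilyB", "FamilyC"]
confusion = [
    [48, 2, 0],    # FamilyA: 48 of 50 samples recognised
    [0, 50, 0],    # FamilyB: all 50 samples recognised
    [6, 4, 10],    # FamilyC: half of its 20 samples go to other families
]

def per_family_metrics(cm):
    """Return a (precision, recall, error_rate) tuple for each family."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics.append((precision, recall, 1.0 - recall))
    return metrics

metrics = per_family_metrics(confusion)
for name, (p, r, e) in zip(families, metrics):
    print(f"{name}: precision={p:.2f} recall={r:.2f} error_rate={e:.2f}")
```

FamilyC plays the role described for Imlog and SMSreg: no sample of another family is ever predicted as FamilyC, so its precision is 1.0, even though half of its own samples are assigned elsewhere.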
A root mean square analysis was done to measure the error rate of the proposed method. It was calculated for every malware family, as shown in Figure 10. The value lies between 0 and 0.45.

The proposed model was also compared with the Visual Geometry Group (VGG16) network. VGG16 is a typical convolutional neural network from the VGG family. The VGG16 architecture has previously been used to solve the multi-class malware family classification problem [78,79]. A comparison of the classification accuracy of SARVOTAM and VGG16 on different malware image combinations is presented in Table 8. As per the recorded observations, the proposed CNN structures attained better accuracy than VGG16. The average accuracy of VGG16 is visibly lower than that of SARVOTAM: VGG16 attained an average accuracy of 86.02%, whereas CNN-SVM, CNN, CNN-KNN, and CNN-RF recorded 89.96%, 88.66%, 87.50%, and 86.78%, respectively. The classification execution time and RAM usage for different malware image combinations using the VGG16 network and SARVOTAM are given in Table 9.

Table 9 reveals that VGG16 is heavy on time and memory. Its average classification time over all malware image combinations is 1720.72 s, whereas the SARVOTAM model attained an average classification time as low as 972.78 s. The average RAM usage is 59.67% for VGG16 and 53.09% for SARVOTAM. SARVOTAM performed best for the malware image combination CR+AM: it utilized 37.33% of the total RAM available and took 840.22 s to classify Android malware applications. This combination attained a classification accuracy of 92.59% using CNN-SVM.
The malware images generated using only CR or only AM files took less time and RAM than CR+AM, but their highest accuracies were 83.58% (using CNN) and 90.18% (using CNN-SVM), respectively, both lower than that of CR+AM. CR+AM thus proved to be a lightweight combination for classifying applications. VGG16 also attained a high accuracy of 90.57% on CR+AM malware images, but it consumed more memory and time: it took almost double the time and 4.33% more memory than CNN-SVM.
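The averages quoted from Table 9 can be converted into relative figures with a few lines of arithmetic. The snippet below uses only the four averages stated in the text; the variable names are illustrative.

```python
# Averages over all malware image combinations, as quoted from Table 9.
vgg16_time_s, sarvotam_time_s = 1720.72, 972.78
vgg16_ram_pct, sarvotam_ram_pct = 59.67, 53.09

time_ratio = vgg16_time_s / sarvotam_time_s      # VGG16 time relative to SARVOTAM
time_saved = 1 - sarvotam_time_s / vgg16_time_s  # fraction of time saved by SARVOTAM
ram_gap = vgg16_ram_pct - sarvotam_ram_pct       # difference in percentage points

print(f"VGG16 / SARVOTAM time: {time_ratio:.2f}x")
print(f"Time saved by SARVOTAM: {time_saved:.1%}")
print(f"RAM gap: {ram_gap:.2f} percentage points")
```

On these averages VGG16 takes roughly 1.77 times as long as SARVOTAM, with a gap of 6.58 percentage points in RAM usage.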

Conclusions and Future Scope
This manuscript concludes that the certificate and Android manifest (CR+AM) are the most suitable features for malware identification and classification. The generic CNN attained a maximum accuracy of 91.48%. For classification purposes, the softmax layer of the CNN was substituted with SVM, KNN, and RF. The combination of CNN and SVM was found to be the most suitable and even surpassed the generic CNN in identification and classification of Android malware families, achieving a classification accuracy of 92.59%. Following common sense, one may try to identify and classify malware using all of the features for malware images. This may demand additional hardware resources, time, and complex comparisons for identification of malware features. On the other hand, CR+AM offers a lightweight and more precise option for malware identification. The proposed methodology is primarily focused on identification and classification of malware images using feature extraction techniques instead of static and dynamic analysis of malware applications. Malware authors employ automation tools to generate dynamic payloads and inject them into applications. It was noticed that malware families hard-coded with dynamic payloads or some obfuscated code tend to generate similar malware images. Therefore, a visual similarity between malware images from the same malware family is anticipated. The scope of this experiment was limited to evaluating the performance of the proposed model using malware images. Obfuscated malware images may look legitimate, but they differ with respect to access rights, resource utilization, and other attributes related to APKs; this is why they do not look completely similar to legitimate Android applications and can be classified using the proposed method. As future work, we plan to adapt the proposed methodology for use alongside static and dynamic analysis.
We also intend to investigate the effect of data augmentation and feature fusion strategies. The transformation of malware images into color images and the fine-tuning of pre-trained typical CNNs also need to be explored further for the classification of Android malware images.