Ensemble Malware Classification System Using Deep Neural Networks

Abstract: With the advancement of technology, there is a growing need to classify malware programs that could potentially harm computer systems and/or smaller devices. In this research, an ensemble classification system comprising convolutional and recurrent neural networks is proposed to distinguish malware programs. Microsoft's Malware Classification Challenge (BIG 2015) dataset with nine distinct classes is utilized for this study. This dataset contains an assembly file and a compiled file for each malware program. Compiled files are visualized as images and are classified using Convolutional Neural Networks (CNNs). Assembly files consist of machine language opcodes, which are converted into sequences and distinguished among classes using Long Short-Term Memory (LSTM) networks. In addition, features are extracted from these architectures (CNNs and LSTM) and are classified using a support vector machine or logistic regression. An accuracy of 97.2% is achieved using the LSTM network for distinguishing assembly files, 99.4% using the CNN architecture for classifying compiled files and an overall accuracy of 99.8% using the proposed ensemble approach, thereby setting a new benchmark. An independent and automated classification system for assembly and/or compiled files gives anti-malware industry experts the flexibility to choose the type of system depending on their available computational resources.


Introduction
Classifying malware programs into different categories based on their pattern has been a research area attracting great interest for several years [1]. Malware programs can either be present in the form of assembly files or binary files or even both in a computer or any other electronic device such as mobile phones and laptops. Anti-malware industries have remedial measures after associating a given malware program with a particular category. However, identifying a particular category of malware program can be difficult and extensive due to polymorphism and huge file size. Hackers introduce the concept of polymorphism to represent malware programs in different forms and sizes to make it difficult for the anti-malware industry to classify or identify such files. Computationally efficient Machine Learning (ML) methods to identify such patterns and malware programs are currently needed. Identifying features and providing insights into both assembly level and compiled files would be of valuable help for the anti-malware industry experts.
Several approaches have been studied in the literature to distinguish malware programs. The most commonly used techniques include static [2][3][4], dynamic [5,6] and signature-based analysis. In addition, [7][8][9][10] presented various classification approaches for classifying malware programs after visualizing them as images using only compiled files. In [7], after engineering features using the visualization technique, the authors study the performance of traditional classification approaches using Support Vector Machine (SVM), k-Nearest Neighbors (kNN) and Artificial Neural Networks (ANNs). In [8], the authors study the performance of autoencoders after visualizing malware programs as images. In [9], a multi-level architecture using both a traditional ML approach and an autoencoder-based approach is presented: the traditional ML approach identifies the minority class and the autoencoder subcategorizes all other classes. In [10], the authors of this paper presented a Convolutional Neural Network (CNN) approach to classify malware patterns after visualizing them as images. The performance of a CNN is studied there both as a feature extractor and as a classification tool, and the best performance is obtained by utilizing the CNN as a feature extractor and an SVM for classification. One of the striking advantages of visualizing a malware bytes file is that patterns can be identified with minimal memory consumption. In [12], computer viruses are visualized at an early stage utilizing Windows executable files with the help of self-organizing maps, without using virus-specific signature information. In [13], dynamic analysis of program execution is presented using the Ether hypervisor framework to monitor, process and then perform reverse engineering for compiled executables.
In [14], results of various software analysis tools are brought together into a visual environment to support triage and the exploration of code vulnerabilities, thereby reducing the false positives in the voluminous data. In [15], quick analysis with visualization through treemaps and thread graphs is presented, based on parameterized abstraction of detailed behavioral reports, for detecting and classifying the maliciousness of software. In [16], an automated means to map large binary objects is studied to classify regions using a multi-dimensional, information-theoretic approach. In [18], limitations of static and dynamic detection of malicious code are studied, and a CNN-based detection model is employed. In [19], a CNN is used for compiled files and a Long Short-Term Memory (LSTM) network is utilized for assembly files, and malware files are classified using a stacking approach. The accuracies obtained using these existing approaches are compared with those of our approaches in the experimental results section. In [20], deep learning models are trained on malware binaries to classify each malware family. In [21], one-class classification is implemented that takes privileged information into account during the training phase to detect anomalies. In [22], the authors present a gene sequence method of classification to distinguish malware programs.
Even though so many algorithms have been proposed in the literature, very few research papers [19] have worked on combining both assembly level and compiled files. We extend our work from [10] for malware classification. In this paper, we present multiple approaches to classify malware programs. We present independent architectures for compiled and assembly files so that users can pick their choice based on the availability of files and computational resources. Another contribution of this paper is a novel approach of fusing both Natural Language Processing (NLP)-based approaches and image-based approaches into a single simple architecture. In this paper, we present a novel approach to utilizing CNNs and LSTMs as feature extractors instead of classification tools to combat class imbalance and limited training images present in the dataset. Representing such huge malware programs in a simple suite of features helps in the reduction of memory and computational time.
In this paper, three different approaches are presented and studied to classify malware programs based on different file formats: (a) a CNN-based approach for classifying malware compiled files after visualizing them as images, as implemented in [10]; (b) a Recurrent Neural Network (RNN)-based approach for classifying malware assembly files; and (c) a novel ensemble approach that combines the features extracted using (a) and (b) and then classifies them using either logistic regression or SVM. This allows the end-user (an anti-malware industry expert) to choose the type of network based on the files available, as a separate network is presented for each domain. Some relevant information could be missing in either the compiled or the assembly level file for a given malware program; this could be addressed using our proposed approach. Our proposed approach helps in overcoming the class imbalance issue present in the dataset without any data augmentation. An ensemble approach takes advantage of both domains and would help in providing valuable insights to the industry expert on the correlation between the compiled and assembly level files. Not much research has been done on utilizing recurrent-based networks for assembly files, which we believe is an area of great interest. In addition, we compare the performance of these models with N-gram techniques using SVM for classification of assembly files. Furthermore, the ability to combine both sequential and visualization techniques using logistic regression or SVM helps in achieving reliable performance with minimal memory consumption.
Electronics 2020, 9, 721
In addition, the results are presented for publicly available BIG 2015 dataset [17] thereby serving as a benchmark for future research efforts. The remainder of this paper is organized as follows. Section 2 provides a brief description of the database that is employed for this research. Section 3 presents the training, testing and validation dataset distribution. Section 4 elucidates the CNN approach adopted for classification of compiled files along with the proposed RNN-based approach for classification of assembly files adopted in this paper. Section 4 also presents the proposed ensemble classification approach. Section 5 presents the experimental results obtained using the proposed methods. Finally, discussions are offered in Section 6.

Materials and Methods
As mentioned, a publicly available dataset provided on Kaggle by Microsoft for the Malware Classification Challenge (BIG 2015) [17] is utilized for this research. This dataset is categorized into 9 different groups by anti-malware industry experts. On Kaggle, separate training and testing datasets are provided as a part of the competition. However, ground truth is not provided for the testing dataset and, since the competition is closed, we cannot upload results to estimate our performance. The performance of our technique is therefore analyzed by performing hold-out validation solely on the training dataset provided by Kaggle. The training and testing distribution utilized for this research is provided in Section 3. Figure 1 shows the distribution of the training dataset of malware programs [10]. Table 1 displays their corresponding IDs and malware categories [10].
For each malware program, an assembly file and a compiled file are provided. The two sets of files are preprocessed differently. First, the compiled file is converted to an image by converting every 2-digit hexadecimal code to its equivalent decimal number, which yields values in the range of a grayscale image (0 to 255). These images are later resized as necessary for the CNN architectures implemented, as done in [10]. More details regarding these preprocessing techniques are provided in [7,11]. Figure 2 presents the visualization patterns obtained for each category [10].
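The hexadecimal-to-grayscale conversion described above can be illustrated with a minimal sketch. It assumes that each line of the compiled file's text dump starts with an address token followed by 2-digit hexadecimal codes, and that unreadable placeholders (such as "??") are mapped to 0; the function names and the row width are hypothetical and are not taken from [7,10,11].

```python
from typing import List


def bytes_text_to_pixels(text: str) -> List[int]:
    """Convert a hex-dump text into a flat list of grayscale values (0-255)."""
    pixels = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on every line is an address, not data.
        for token in tokens[1:]:
            if len(token) != 2:
                continue
            try:
                pixels.append(int(token, 16))  # "FF" -> 255, "10" -> 16
            except ValueError:
                # Assumption: non-hex placeholders such as "??" become 0.
                pixels.append(0)
    return pixels


def pixels_to_rows(pixels: List[int], width: int) -> List[List[int]]:
    """Arrange the flat pixel list into image rows, dropping a partial last row."""
    usable = len(pixels) - len(pixels) % width
    return [pixels[i:i + width] for i in range(0, usable, width)]


sample = "00401000 00 FF 10 ??\n00401004 7F 80 01 02\n"
image = pixels_to_rows(bytes_text_to_pixels(sample), width=4)
print(image)
```

The resulting rows can then be resized to whatever input size a given CNN architecture expects; the choice of row width here is purely illustrative.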
The assembly files are preprocessed by extracting only the list of opcodes present in every file and ignoring other commands present in them. The order in which these opcodes appear in the file is maintained to preserve the sequential pattern present in these assembly files. There are about 483 unique opcodes present in all assembly files in the training dataset of Kaggle. Figure 3 presents the word cloud for each malware category solely based on assembly opcodes. A larger font indicates a higher frequency of the opcode. These word clouds clearly indicate that different sets of opcodes are utilized at different frequencies for each malware category. This type of visualization would also assist anti-malware experts in looking for key opcodes for each malware category.
Later, the opcodes in the entire training set are encoded to numeric indices. Figure 4 presents the lengths of documents present in the Kaggle training dataset after encoding. We study the performance of both the traditional N-gram approach with SVM and a custom RNN architecture.
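The encoding of opcodes to numeric indices can be sketched as follows. This is only an illustration under stated assumptions: each document is taken to be an already extracted, ordered list of opcode strings, index 0 is reserved for padding, and opcodes not seen in the training set also map to 0. The helper names are hypothetical.

```python
from typing import Dict, List


def build_vocabulary(documents: List[List[str]]) -> Dict[str, int]:
    """Assign each unique opcode an integer index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for doc in documents:
        for opcode in doc:
            if opcode not in vocab:
                # Assumption: index 0 is reserved for padding, so start at 1.
                vocab[opcode] = len(vocab) + 1
    return vocab


def encode(doc: List[str], vocab: Dict[str, int], unknown_index: int = 0) -> List[int]:
    """Map an opcode sequence to indices, preserving the original order."""
    return [vocab.get(opcode, unknown_index) for opcode in doc]


train_docs = [["push", "mov", "call"], ["mov", "xor", "ret"]]
vocab = build_vocabulary(train_docs)
print(vocab)
print(encode(["mov", "ret", "jmp"], vocab))
```

Building the vocabulary from the training set only keeps the validation and testing documents from influencing the index assignment.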

Dataset Distribution
As mentioned earlier, the training dataset provided on Kaggle as a part of this challenge is split into groups of 72%, 8% and 20% for training, validation and testing purposes, respectively, as done in [10]. The same set of cases is utilized throughout this research. Figure 5 and Table 2 show the distribution of the training, validation and testing datasets [10].
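A 72%/8%/20% hold-out split can be sketched as below. The sketch assumes a simple seeded random shuffle of sample indices; whether the split used in [10] was random, stratified by class or otherwise constructed is not stated here, so the function and its `seed` parameter are illustrative only.

```python
import random
from typing import List, Tuple


def split_indices(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, validation, test) index lists in a 72/8/20 proportion."""
    indices = list(range(n_samples))
    # Assumption: a plain shuffled split; no stratification is applied.
    random.Random(seed).shuffle(indices)
    n_test = round(0.20 * n_samples)
    n_val = round(0.08 * n_samples)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed and reusing the returned index lists is one way to keep the same set of cases across all experiments.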
All these documents are converted into sequences of 2000 words. Sequences with less than 2000 words are padded on the left with a value of 0 and the longer sequences are truncated. We believe that a sequential list of 2000 opcodes could be a sweet spot in recognizing the malware category with minimal memory consumption. The proposed algorithm achieves an accuracy of 97.2% for classification of assembly files, 99.4% for classification of binary files and the ensemble approach achieves an overall accuracy of 99.8%.
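The fixed-length conversion described above (left-padding with 0 and truncating longer sequences to 2000 entries) can be sketched as follows. Which end of a long sequence is discarded is not specified in the text; this sketch assumes the first 2000 opcodes are kept, and the function name is hypothetical.

```python
from typing import List


def pad_or_truncate(sequence: List[int], max_len: int = 2000, pad_value: int = 0) -> List[int]:
    """Return a sequence of exactly max_len indices."""
    if len(sequence) >= max_len:
        # Assumption: keep the first max_len opcodes and drop the rest.
        return list(sequence[:max_len])
    # Shorter sequences are padded on the left with pad_value.
    return [pad_value] * (max_len - len(sequence)) + list(sequence)


print(pad_or_truncate([1, 2, 3], max_len=5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))
```

Left padding places the real opcodes at the end of the sequence, closest to the final time step that a recurrent network reads.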

CNN Architecture
For classifying the malware compiled files after visualizing them as images, the architecture presented in [10] is adopted. In [10], several CNN architectures were presented and compared. Using the CNN as a feature extractor and an SVM with a linear kernel as the classifier achieved the highest accuracy, so that configuration is employed in this study. Features are extracted from the last fully-connected layer of several well-established CNN-based architectures, including AlexNet [23], ResNet [24] and VGG-16 [25], after training on our dataset. In addition, a set of features is extracted from a simple CNN architecture. In total, 36 features are extracted (9 from each of the four architectures) and are later classified using an SVM with a linear kernel. This type of architecture helps in addressing the class imbalance problem associated with the BIG 2015 dataset. Because the CNN weights are already learned, there is no need to retrain the networks, which makes deployment easier. We adopt the same architecture for classifying compiled files in this research.

RNN Architecture
To classify assembly files, an LSTM network, a type of RNN, is proposed. After the assembly files are preprocessed as described in Section 2, they serve as the input to the LSTM architecture [26]. LSTM neural networks have been highly effective for natural language processing classification problems and help in identifying patterns in sequences; hence, they are adopted in this research. Figure 6 presents the architecture adopted in this paper for classification of assembly level files.

Ensemble Architecture
In this subsection, we present our novel approach of fusing both RNN- and CNN-based architectures for the classification of malware programs. This type of architecture is implemented to have a single representation for a given malware program. In addition, some malware programs do not contain sufficient information/data in either the assembly or the compiled file; this could be tackled using our approach. For the ensemble approach, we extract features from the last fully connected layer of our trained CNN and RNN models. We extract 9 features from each of those architectures, compiling a suite of 45 features in total. Feature extraction is adopted as it has been shown to be an effective way to overcome limited training data and the class imbalance issue for malware classification [10]. These features represent the malware programs in terms of both assembly and compiled files. The extracted features are classified using logistic regression and SVM in our study. Figure 7 presents the block diagram of this proposed approach. This type of architecture not only assists in representing a malware program containing assembly level and compiled files, but also assists in overcoming the class imbalance issue present in the BIG 2015 dataset. It can be adapted to various applications in which different file types represent a training sample.
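The fusion step can be illustrated with a minimal sketch that concatenates nine features from each network into a single vector. It assumes that the 45 features come from the four CNN-based architectures described earlier plus the LSTM (36 + 9); the dictionary keys, the fixed ordering and the function name are hypothetical, and the feature values are placeholders rather than real network outputs.

```python
from typing import Dict, List, Sequence

# Assumption: four CNN-based extractors plus one LSTM, nine features each.
MODEL_ORDER = ("alexnet", "resnet", "vgg16", "simple_cnn", "lstm")
FEATURES_PER_MODEL = 9


def fuse_features(per_model: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate per-model feature vectors in a fixed order."""
    fused: List[float] = []
    for name in MODEL_ORDER:
        vector = per_model[name]
        if len(vector) != FEATURES_PER_MODEL:
            raise ValueError(f"{name}: expected {FEATURES_PER_MODEL} features, got {len(vector)}")
        fused.extend(vector)
    return fused


# Placeholder features for one malware program.
example = {name: [float(i)] * FEATURES_PER_MODEL for i, name in enumerate(MODEL_ORDER)}
fused = fuse_features(example)
print(len(fused))
```

A matrix of such fused vectors, one row per malware program, is what a downstream classifier such as logistic regression or a linear SVM would be trained on. Keeping the model order fixed ensures that each column always refers to the same network output.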

Results
In this section, results obtained for assembly files, compiled files and their combination using our proposed approaches are presented. First, results obtained after visualizing the compiled files as images, extracting features using CNNs and later classifying them using SVM [10] are presented. Figure 8 presents the confusion matrix obtained using this approach. An overall accuracy of 99.4% is achieved for classifying malware programs solely based on compiled files.
Figure 9 presents the confusion matrix obtained using the N-gram technique (N = 4) with a linear-kernel SVM for classification. Table 3 presents the results obtained using different values of N with this approach. The best results are obtained with the 4-gram technique for classification of assembly files.
Figure 10 presents the confusion matrix obtained using the proposed LSTM architecture for classifying malware programs solely based on assembly files. An overall accuracy of 97.2% is achieved using the proposed approach. Figures 9 and 10 clearly indicate that the proposed LSTM-based architecture outperforms the N-gram approach.
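For reference, the N-gram representation used as the baseline can be sketched as counting every window of N consecutive opcodes in a sequence. This is an illustration only: the function name is hypothetical, the opcode sequence is a toy example, and the subsequent vectorization and SVM training are not shown.

```python
from collections import Counter
from typing import List, Tuple


def opcode_ngrams(opcodes: List[str], n: int = 4) -> Counter:
    """Count each run of n consecutive opcodes in the sequence."""
    windows: List[Tuple[str, ...]] = [
        tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)
    ]
    return Counter(windows)


sequence = ["push", "mov", "call", "ret", "push", "mov", "call", "ret"]
counts = opcode_ngrams(sequence, n=4)
print(counts[("push", "mov", "call", "ret")])
```

Each distinct N-gram becomes one feature dimension, so the feature space grows quickly with N and with the number of unique opcodes.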
Results obtained using our ensemble approach are presented for both logistic regression and SVM. A combination of 45 features is obtained (as mentioned in Figure 6) and the performance of logistic regression and SVM for classification is studied. Our proposed ensemble approach significantly outperforms the independent architectures for assembly level and compiled files. Representing such huge malware programs in terms of 45 features significantly reduces the computational time and complexity. In addition, representing both assembly level and compiled files in terms of 45 features helps in overcoming missing information in either of the file types. Figures 11 and 12 present the results obtained using logistic regression and SVM, respectively.
Table 4 presents the performance of various classification approaches presented for the BIG 2015 dataset [22]. This comparison includes principal component analysis (PCA) using traditional classification approaches [7], autoencoders [8], CNN for classifying compiled files [10], the random forest approach [20], the one-class SVM approach, strand gene sequence [21] and our proposed approaches.
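The overall accuracies quoted in this section are summaries of confusion matrices. A minimal sketch of that computation is given below, assuming rows are true classes and columns are predicted classes; the matrix is a toy three-class example and not data from this study.

```python
from typing import List


def overall_accuracy(confusion: List[List[int]]) -> float:
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


def per_class_recall(confusion: List[List[int]]) -> List[float]:
    """Recall of each class: diagonal entry divided by its row sum."""
    return [
        (row[i] / sum(row)) if sum(row) else 0.0
        for i, row in enumerate(confusion)
    ]


# Toy matrix: rows are true classes, columns are predicted classes.
toy = [[8, 1, 1], [0, 9, 1], [0, 0, 10]]
print(overall_accuracy(toy))
print(per_class_recall(toy))
```

Per-class recall is worth reading alongside overall accuracy on an imbalanced dataset, since a small class contributes little to the overall figure.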

Discussion
In this paper, multiple classification approaches are presented for distinguishing malware programs depending on the type of data available. Results clearly indicate that extracting features using CNN and classification using SVM provides the best performance for the classification of compiled files. An accuracy of 99.4% is achieved for classifying malware programs solely utilizing compiled files. In this research, a novel approach for classifying assembly files using a simple LSTM network is presented. An accuracy of 97.2% is achieved for the classification of malware programs solely based on assembly files, thereby setting a new benchmark. A simple LSTM approach outperforms the N-gram approach with SVM.
Our proposed ensemble approach provided a further boost in performance. Extracting a total of 45 features from our proposed CNN and RNN architectures and later classifying them using logistic regression and SVM helped in distinguishing the malware programs effectively. Overall accuracies of 99.5% and 99.8% are achieved using logistic regression and SVM, respectively. Table 4 clearly indicates that our proposed approach outperforms several other existing approaches. SVM coupled with the 45 features extracted using CNN and RNN provided the best performance. Representing malware programs in terms of both compiled and assembly level files helped in overcoming a lack of information in either of those file types. Representing such huge malware programs in terms of a simple suite of 45 features helps in reducing data complexity and computational resources. Extracting features from each of these architectures takes about 50 milliseconds for each malware program after training. This computational speed test was conducted on an NVIDIA GeForce GTX-1070. The study of algorithms in terms of performance and type of files helps anti-malware industry experts choose an algorithm based on their needs. Having an independent architecture for each file type also allows any particular architecture to be retrained when a new set of data is collected. Any network can be easily modified without affecting the other architectures utilized for feature extraction. This type of automated detection of malware programs would be valuable for the anti-malware industry.