4.3. Investigation of Relevant Feature Type-Dataset-1
We extracted mnemonics from 2000 samples. Figure 4 shows the experimental results obtained after feature reduction with mRMR (MID and MIQ) and ANOVA, using SVM, AdaBoost, Random Forest, and J48 as classifiers. Five mnemonic-based models were constructed with feature lengths from 40 to 120 in steps of 20. Among these five models, ANOVA provides the best result, with a strong positive likelihood ratio of 16.38 at a feature length of 120 mnemonics using AdaBoostM1 (J48 as base classifier). The main advantages of this model are its low error rate and speed. However, mnemonic-based features can be easily modified using code obfuscation techniques.
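As an illustration of ANOVA-based feature ranking, the following minimal sketch computes the one-way ANOVA F statistic of each feature against the class label and keeps the top-ranked ones. The function names, the dictionary layout of the feature columns, and the toy values are assumptions made for this example; it is not the implementation used in the experiments.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature (assumes >= 2 classes)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_k_features(columns, labels, k):
    """columns: dict mapping a feature name to its list of values (hypothetical layout)."""
    ranked = sorted(columns, key=lambda f: anova_f(columns[f], labels), reverse=True)
    return ranked[:k]

# Toy mnemonic counts for four samples (labels: 0 = benign, 1 = malware).
labels = [0, 0, 1, 1]
columns = {"mov": [1, 2, 3, 4], "push": [2, 1, 2, 1], "nop": [1, 1, 1, 1]}
print(top_k_features(columns, labels, 1))
```

In this toy data only `mov` separates the two classes, so it is ranked first; varying `k` from 40 to 120 would reproduce the feature lengths studied above.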
The dynamic API features initially had a feature length of 4480. We used mRMR (MID), mRMR (MIQ), and ANOVA to obtain reduced feature lengths of 40, 60, 80, and 120, as illustrated in Figure 5. Comparing the different feature lengths, we observe that the positive likelihood ratio is highest for mRMR (MIQ) at a feature length of 120 prominent APIs using Random Forest.
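The two mRMR criteria differ only in how relevance and redundancy are combined: MID takes their difference and MIQ their quotient. A minimal greedy sketch is given below, assuming discrete (or already discretized) feature values and k of at least 1; the function names, the small epsilon guarding the division, and the toy data are assumptions for illustration only.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def mrmr(columns, labels, k, scheme="MID"):
    """Greedy mRMR selection; columns maps feature name -> list of discrete values."""
    relevance = {f: mutual_information(col, labels) for f, col in columns.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant feature
    while len(selected) < min(k, len(columns)):
        best, best_score = None, float("-inf")
        for f in columns:
            if f in selected:
                continue
            # Mean mutual information with the features chosen so far.
            redundancy = sum(mutual_information(columns[f], columns[s])
                             for s in selected) / len(selected)
            if scheme == "MID":
                score = relevance[f] - redundancy            # difference criterion
            else:
                score = relevance[f] / (redundancy + 1e-12)  # quotient criterion (MIQ)
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected

# Toy data: "b" duplicates "a"; "c" is uninformative but almost independent of "a".
y = [0, 0, 0, 0, 1, 1, 1, 1]
cols = {"a": [0, 0, 0, 1, 1, 1, 1, 1],
        "b": [0, 0, 0, 1, 1, 1, 1, 1],
        "c": [0, 1, 0, 1, 0, 1, 0, 1]}
print(mrmr(cols, y, 2, "MID"), mrmr(cols, y, 2, "MIQ"))
```

On this toy input the two criteria pick different second features, which shows why MID and MIQ can yield different reduced feature sets from the same data.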
Next, we derived 4-grams, giving a total feature space of 1249 features. These features were reduced with mRMR (MID), mRMR (MIQ), and ANOVA. The performance of the classification models was estimated over feature lengths from 40 to 120 in steps of 20, as shown in Figure 6. Across these five feature lengths, mRMR (MIQ) produces the best result, with over 96% accuracy at a feature length of 120 with Random Forest. However, the limitations stated in [56] apply in the current scenario: 4-grams are computationally expensive to generate, exhibit diminishing returns with more data, are prone to over-fitting, and do not seem to carry vital information for discriminating samples. At the same time, 4-grams do have some merits, as they partially capture a behavioral snapshot of a program and sometimes produce results comparable to other approaches.
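For readers unfamiliar with n-gram features, the sketch below shows how overlapping 4-grams can be extracted from a token sequence and counted. The token type (opcodes here) and the sample sequence are assumptions for illustration, since the text does not state which token stream the 4-grams were drawn from.

```python
from collections import Counter

def ngrams(tokens, n=4):
    """All overlapping n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, n=4):
    """Frequency of each n-gram; these counts can serve as feature values."""
    return Counter(ngrams(tokens, n))

# Hypothetical token sequence of one sample.
sequence = ["push", "mov", "call", "pop", "ret"]
for gram, count in ngram_counts(sequence).items():
    print(gram, count)
```

A sequence of m tokens yields m - 3 overlapping 4-grams, which is one reason the feature space grows quickly and generation becomes expensive.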
Finally, we derived the opcode-based feature set and reduced it with mRMR (MID), mRMR (MIQ), and ANOVA. The performance of the model was evaluated over feature lengths between 40 and 120 in increments of 20, as shown in Figure 7. Among these five feature lengths, ANOVA attains the highest performance, with a positive likelihood ratio of 19 at a feature length of 100 using Random Forest. However, the results obtained with mRMR are very close to those of the ANOVA features.
Hence, from the results we obtained, we observe that API features achieve a higher detection rate of 97.4% with a fallout of only 1.7%, compared with the 93.15% accuracy of 4-grams. When we compare the API features with the mnemonic and opcode features in terms of likelihood ratio, opcode features have the highest ratio of 19, against 16.38 for mnemonic features and 15.69 for API features.
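The positive likelihood ratio used in these comparisons is the detection rate (true positive rate) divided by the fallout (false positive rate). A minimal sketch follows; the confusion-matrix counts are hypothetical and are not taken from the experiments.

```python
def positive_likelihood_ratio(tp, fn, fp, tn):
    """LR+ = TPR / FPR, computed from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # detection rate (sensitivity)
    fpr = fp / (fp + tn)  # fallout
    return float("inf") if fpr == 0 else tpr / fpr

# Hypothetical counts for 100 malware and 100 benign test samples.
print(positive_likelihood_ratio(tp=90, fn=10, fp=5, tn=95))
```

A larger ratio means that a positive prediction is much more likely to come from a malware sample than from a benign one.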
To summarize the experiments on Dataset-1, which consider each feature type independently with four classification algorithms, namely J48, Support Vector Machine (SVM), AdaBoostM1 (with J48 as a base classifier), and Random Forest (RF), we observe that Random Forest and AdaBoost produce the best results. We attribute the accuracy of Random Forest to its being an ensemble technique that derives its output for a sample by accumulating votes from multiple decision trees. The accuracy of AdaBoost can be credited to boosting: it combines multiple weak classifiers into a strong learner, ensuring a high degree of precision. The J48 classifier comes next; its output is close to that of the best classifiers in some cases but is consistently inferior to these two ensemble classifiers. SVM produces the poorest results among the four classifiers, which may be explained by SVM’s tendency to over-fit when the number of features is higher than the number of samples.
We further evaluate the performance of machine learning models generated by combining different feature categories. We consider such a feature space a multimodal attribute set. The term modality means the particular mode in which something is expressed; in this context, it refers to the various feature types obtained with feature extraction, as shown in Figure 2. In the unimodal architecture, classification is based on a single modality, so the framework is limited to operating on a single attribute type. To investigate whether blending features from diverse feature categories could improve classification accuracy, we extended our experiment to a multimodal architecture.
Multimodal architecture involves learning based on multiple modalities. This solution exploits the relationships existing between the various features of the available data. Such a network can be used to convert data from one modality to another, or to use one attribute set to assist the learning of another. We achieved multimodal fusion in our experiment by carrying out feature selection (as shown in Figure 3) on the relevant attributes from diverse categories (4-gram, mnemonics, API, and opcodes) and then fusing them as shown in Figure 8.
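One simple way to realize such a fusion is to concatenate, for every sample, the values of the features selected in each modality into a single vector. The sketch below assumes this concatenation scheme; the dictionary layout, the function name, and the feature names are hypothetical and serve only to illustrate the idea.

```python
def fuse(selected_by_modality, sample):
    """Concatenate the selected features of every modality into one vector.

    selected_by_modality: dict modality -> list of selected feature names
    sample: dict modality -> dict feature name -> value (absent features count as 0)
    """
    vector = []
    for modality in sorted(selected_by_modality):  # fixed order keeps columns aligned
        for name in selected_by_modality[modality]:
            vector.append(sample.get(modality, {}).get(name, 0))
    return vector

# Hypothetical selections and one hypothetical sample.
selected = {"api": ["CreateFileA", "WriteFile"], "opcode": ["mov"]}
sample = {"api": {"CreateFileA": 3}, "opcode": {"mov": 12, "nop": 4}}
print(fuse(selected, sample))
```

Iterating over the modalities in a fixed order ensures that the same column always refers to the same feature across samples.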
As each feature type has a different representation and correlation structure, fusing all these relevant features helps to extract maximum performance. After fusing these features, we obtained a new feature space comprising promising attributes, which we then used to create diverse classification models.
The presence of irrelevant or redundant features in the data set might degrade the performance of multimodal classification. Since we pass the feature sets through various feature selection methods before performing feature fusion, our classifier is less susceptible to problems induced by redundant and extraneous features.
The ensemble classifier demonstrated the maximum accuracy of 97.98% at a feature length of 240 using Random Forest, as shown in Figure 9. Among the unimodal classifiers, the API features demonstrated the highest detection rate of 97.4% with a false positive rate (FPR) of 1.7%, while the opcode features displayed a detection rate of 91.6% with an FPR of 0.48%. Comparing both architectures, the results obtained using the multimodal architecture show a significant improvement over those of the unimodal classifiers (as shown in Figure 4, Figure 5, Figure 6, and Figure 7). Since the ensemble classifier was developed by concatenating prominent features from various feature sets, it is evident from the results that each modality considered for fusion has contributed to the overall performance of the classifier. Furthermore, this demonstrates that multimodal learning is promising for increasing the detection rate in the malware detection task.
Summary: Experiments on VX-Dataset demonstrate that combining prominent mRMR features improves results compared with individual features. The highest detection rates are obtained with the Random Forest and AdaBoost models, owing to their ensemble strategies (bagging and boosting). APIs play a significant role in predicting examples, whereas opcodes yield poorer outcomes. Another important trend is that the results of the multimodal feature space and the API unimodal classifier differ only marginally. This is because the opcode attributes in the combined attribute space do not contribute to classification, as they introduce more sparsity in the feature vectors. Hence, we conclude that the dynamic feature, i.e., API calls, plays a critical role in discriminating malware from benign files.
4.5. Evaluation on Android Applications Dataset-3
In this experiment, we identify malicious Android applications (apps) using machine learning and deep learning techniques. We use system calls as features to represent each application. First, we create an Android virtual device and install the applications to be inspected. A total of 2000 malware applications are randomly chosen from the Drebin dataset [37], and 2000 legitimate applications are downloaded from the Google Play Store. While the applications run, system calls are recorded using the strace utility. During this time, we employ Android Monkey (a utility in the Android SDK for fuzz testing applications) to simulate a collection of events (e.g., changing the location, battery charging status, sending SMS, dialling a number, swipes, and clicking on widgets of an app). In particular, in this work we execute each application with 1500 random events for one minute; however, the analysis could also be performed with a varying number of events.
Relevant system calls are selected using the mRMR feature selection approach, and each app is then represented as a numerical vector using Term Frequency-Inverse Document Frequency (TF-IDF). The performance of machine learning classifiers on sequences of system calls (two calls considered in a sliding-window fashion) is shown in Table 4. We observed that distinguishing feature vectors were obtained by considering two consecutive system calls. Some examples of system call sequences are shown in Figure 10.
We considered the top 40% of system calls from the list of unique calls extracted from the entire training set.
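The representation described above can be sketched as follows: each trace is turned into pairs of consecutive system calls with a sliding window of size two, and each pair is weighted by TF-IDF. The exact TF-IDF variant is not specified in the text, so the plain form tf × log(N / df) used here is an assumption, as are the function names and the toy traces.

```python
from collections import Counter
from math import log

def bigrams(calls):
    """Pairs of consecutive system calls (sliding window of size two)."""
    return [f"{a} {b}" for a, b in zip(calls, calls[1:])]

def tfidf(traces):
    """Return the bigram vocabulary and one TF-IDF vector per trace."""
    docs = [Counter(bigrams(t)) for t in traces]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # number of traces containing each bigram
    vocab = sorted(df)
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1  # avoid division by zero for very short traces
        vectors.append([(d[t] / total) * log(n / df[t]) for t in vocab])
    return vocab, vectors

# Hypothetical strace output reduced to system call names.
traces = [["open", "read", "close"], ["open", "read", "read"]]
vocab, vectors = tfidf(traces)
print(vocab)
print(vectors)
```

A bigram that occurs in every trace receives a weight of zero under this variant, so only pairs that distinguish applications contribute to the vector.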
Table 4 shows that the XGBoost classifier achieves the best outcome. However, this result requires extra effort in feature engineering, which is a critical task in the machine learning pipeline. To eliminate the task of feature engineering, we use a deep neural network, which is a stack of layers, each consisting of several neurons. A neuron acts as a processing unit that collects multiple inputs, multiplies them by weights, sums them, and finally applies an activation function. Our network has an input layer of 500 neurons and a second layer of 250 neurons. These layers use the Rectified Linear Unit (ReLU) activation function, while the output layer uses the sigmoid activation function, since malware identification is a binary classification problem. For faster convergence and to avoid overfitting, we use the Adam optimizer and the cross-entropy loss function. Table 5 reports the results obtained at varying dropout rates; the best results are obtained with a dropout rate of 0.1.
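To make the architecture concrete, the sketch below implements only the forward pass and the loss of such a network in plain Python. It assumes that the two layers of 500 and 250 neurons are fully connected ReLU layers followed by a single sigmoid output unit, that dropout is applied after each ReLU layer, and that the input has 20 features; the input size, the weight initialization, and all function names are assumptions. Training with the Adam optimizer is omitted.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (He-style scaling, an assumption) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    # Numerically stable form for large negative z.
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def dense(x, layer, activation):
    """Each neuron: weighted sum of inputs plus bias, then activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def build_network(n_features, rng):
    return [init_layer(n_features, 500, rng), init_layer(500, 250, rng),
            init_layer(250, 1, rng)]

def forward(x, layers, rate=0.0, rng=None):
    """Probability that the sample is malware; rate > 0 enables dropout (training mode)."""
    h = x
    for layer in layers[:-1]:
        h = dense(h, layer, relu)
        if rate > 0.0:
            h = dropout(h, rate, rng)
    return dense(h, layers[-1], sigmoid)[0]

def binary_cross_entropy(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

rng = random.Random(0)
layers = build_network(20, rng)   # 20 input features is a hypothetical size
x = [0.5] * 20                    # hypothetical feature vector
p = forward(x, layers, rate=0.1, rng=rng)
print(p, binary_cross_entropy(1, p))
```

Dropout is active only when `rate` is greater than zero, which corresponds to training; at prediction time the network is evaluated with `rate=0.0`.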
4.6. Evaluation on Synthetic Samples Dataset-4
Malware constructors generate variants from a base virus using code obfuscation techniques such as equivalent-instruction insertion, instruction reordering, and subroutine permutation. The code segments mutate from one generation to another, as the metamorphic engine transforms the mutant code to evade antivirus (AV) signature detection. This motivates the use of machine learning techniques to explore metamorphism among variants and across different families of synthetic samples, and to understand the extent of obfuscation induced by the virus kits. A malware data set comprising 800 NGVCK viruses was used. A prior study [57] reported that NGVCK samples could easily bypass strong statistical detectors based on HMMs over opcode sequences. Likewise, 1200 benign executables were downloaded from different sources, including games, web browsers, media players, and executables from the System32 folder of a fresh installation of the Windows XP operating system. As in previous experiments, we scanned all benign samples with VirusTotal to ensure that none of them is infected. The complete data set was divided such that 80% of the samples are used for training and the remaining 20% are used as a test set. In this experiment, executables based on API calls were analyzed.
We extracted unique opcode bigrams from the training set and found 733 of them. Prominent bigrams are selected using the mRMR approach. We also studied the impact of varying the feature length from 50 to 250 opcode bigrams, extending the feature space in increments of 50 at a time. We found that an increase in bigrams had a marginal influence on classifier performance. As we progressively extend the feature vector, informative attributes begin to appear, which eventually improves the results. However, if we increase the number of features beyond a certain limit, accuracy drops, primarily due to the addition of noise. We developed classification models using different algorithms, namely J48, AdaBoostM1 with J48 as a base classifier, and Random Forest.
Table 6 compares the best outcome of classifiers attained at a feature length of 150 bigrams.
To understand the extent of metamorphism in virus generation kits, 677 viruses were created using different infection mechanisms to form malware families. In particular, we generated samples using virus kits (NGVCK, MPCGEN, G2, and PSMPC) and also downloaded real malware samples from VX Heavens. The data set is described in Table 7.
Mnemonics are extracted from each malware sample and aligned using global and local sequence alignment methods. Sequence alignment places one opcode sequence over another to determine how similar the sequences are. During alignment, gaps may be inserted into either of the two opcode sequences. We adopt a simple scoring scheme in which a match is assigned a value of +1 and every mismatch or gap is assigned −1. A similarity matrix is constructed from the pairwise alignment of malware samples within a family. We record the minimum, average, and highest similarity distance over all malware samples. Likewise, the similarity distance between the base malware of different families is computed.
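A global alignment score under this scoring scheme can be computed with the standard Needleman-Wunsch dynamic program. The text does not name the algorithm used, so this choice is an assumption, and the sketch below (function name and toy sequences included) is meant only to illustrate the +1/−1/−1 scoring.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of two opcode sequences (two-row dynamic program)."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning a prefix of b against nothing
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two opcodes, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

# Hypothetical mnemonic sequences of two variants.
v1 = ["mov", "push", "pop"]
v2 = ["mov", "pop"]
print(global_alignment_score(v1, v2))
```

Here the best alignment matches `mov` and `pop` and places a gap opposite `push`. Filling a matrix with such pairwise scores gives the within-family similarity matrix described above.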
Two families i and j are said to overlap if the similarity distance computed for their base malware samples is within the range of the minimum and average similarity distance determined for family i or j. This means that the greater the distance of a sample from the base malware, the lesser the similarity. Conversely, a high score indicates a higher similarity between any two samples.
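The overlap criterion can be expressed as a small predicate. The sketch assumes that the range is the closed interval between the minimum and the average within-family similarity distance; the function name, the dictionary keys, and the numeric values are hypothetical.

```python
def families_overlap(base_distance, stats_i, stats_j):
    """True if the base-to-base distance falls in [min, avg] of family i or family j.

    stats_i, stats_j: dicts with the keys "min" and "avg" (hypothetical layout).
    """
    def within(stats):
        return stats["min"] <= base_distance <= stats["avg"]
    return within(stats_i) or within(stats_j)

# Hypothetical within-family statistics for two families.
family_i = {"min": 10, "avg": 15}
family_j = {"min": 20, "avg": 30}
print(families_overlap(12, family_i, family_j))
print(families_overlap(18, family_i, family_j))
```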
Table 8 depicts a segment of the pairwise alignment of two samples generated using the NGVCK constructor. Each row preceded by a hash symbol represents a gap, and an asterisk designates an opcode mismatch between the two malware samples.
The local alignment technique is employed to identify conserved code regions, i.e., code that remains common among obfuscated samples even as the code varies from one generation to the next. We found that variants generated by MPCGEN are similar to those of G2 and PSMPC. In Figure 11, MPCGEN-F1 and MPCGEN-F3 have high similarity with the base malware of G2 and PSMPC (G2-F1, G2-F3, PSMPC-F1, and PSMPC-F3).
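Local alignment is commonly computed with the Smith-Waterman algorithm, which differs from global alignment in that cell scores are never allowed to drop below zero. The sketch below assumes this algorithm and the same +1/−1/−1 scoring; the text confirms neither for the local case, and the function name and sequences are hypothetical.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best local score and the matched opcodes of that region."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    region = []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            if a[i - 1] == b[j - 1]:
                region.append(a[i - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, region[::-1]

# Hypothetical variants sharing a conserved block of three opcodes.
x = ["nop", "mov", "push", "call", "pop"]
y = ["xor", "mov", "push", "call", "ret"]
print(local_alignment(x, y))
```

The returned region lists only the opcodes that match inside the best-scoring local alignment, which corresponds to the conserved code shared by the two variants.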
To examine the obfuscation techniques used by malware constructors, we computed sequence alignments and recorded the mismatches among mnemonics. Instruction replacement was more visible for NGVCK samples than for the other synthetic generators. Table 9 shows the prominent mismatched opcodes for four generators, as the rest show a similar trend. mov, push, lea, pop, and jmp are primarily used as replacement instructions.
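Given two already aligned sequences, the replaced instructions can be tallied by counting the positions where both sequences hold an opcode and the opcodes differ. In the sketch below, gaps are represented by `None`; this convention, the function name, and the aligned example are assumptions for illustration.

```python
from collections import Counter

def replacement_counts(aligned_a, aligned_b, gap=None):
    """Count (original, replacement) opcode pairs at mismatch positions; gaps are skipped."""
    counts = Counter()
    for x, y in zip(aligned_a, aligned_b):
        if x != gap and y != gap and x != y:
            counts[(x, y)] += 1
    return counts

# Hypothetical aligned mnemonic sequences of two variants (None marks a gap).
a = ["mov", "push", None, "lea", "push"]
b = ["mov", "pop", "jmp", "mov", "pop"]
print(replacement_counts(a, b).most_common())
```

Summing such counts over all aligned pairs of a generator would give the most frequent replacement instructions for that generator.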
To ascertain the overlap between real malware samples from VX-Heavens and the obfuscated families, we studied how the opcode sequences of real malware samples overlap with those of synthetic ones. Initially, we determine the base malware of each family by alignment (the sample that is closest to all other samples in the family). Figure 12 shows the overlap of Win32.Agent with NGVCK, indicating that real samples also use code modifications similar to those of synthetic constructors. Win32.Bot and Win32.Downloader overlap with the Win32.Autorun, Win32.Downloader, Win32.Mydoom, and Win32.Xorer families, indicating that worm families preserve a common base code but differ in syntactic structure due to obfuscation or an extension of malevolence.
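The base malware of a family, described above as the sample closest to all others, can be chosen as the sample with the highest total alignment similarity to the rest of its family. This reading of "closest" is an assumption, as are the function name, the sample names, and the pairwise scores in the sketch below.

```python
def base_sample(names, similarity):
    """Sample with the highest summed similarity to all other samples of the family."""
    return max(names, key=lambda s: sum(similarity(s, t) for t in names if t != s))

# Hypothetical pairwise alignment scores within one family.
scores = {frozenset(("v1", "v2")): 5,
          frozenset(("v1", "v3")): 1,
          frozenset(("v2", "v3")): 4}

def lookup(s, t):
    return scores[frozenset((s, t))]

print(base_sample(["v1", "v2", "v3"], lookup))
```

Here `v2` is selected because its summed similarity to the other two samples is the largest; the base samples of two families can then be aligned with each other to test for overlap.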