A Novel Monte-Carlo Simulation-Based Model for Malware Detection (eRBCM)

The use of innovative and sophisticated malware definitions poses a serious threat to computer-based information systems. Such malware adapts to existing security solutions and often operates undetected. Once the malware completes its malicious activity, it self-destructs and leaves no obvious signature for detection or forensic purposes. Detecting such sophisticated malware is a challenging, non-trivial task because of the malware’s new patterns of exploiting vulnerabilities. Any security solution requires an equal level of sophistication to counter such attacks. In this paper, a novel reinforcement model based on Monte-Carlo simulation, called eRBCM, is explored to develop a security solution that can detect new and sophisticated network malware definitions. The new model is trained on several kinds of malware and can generalize the malware detection functionality. The model is evaluated using a benchmark set of malware. The results indicate that eRBCM can identify a variety of malware with high accuracy.


Introduction
As the Internet has become essential in our lives, the number of users of internet services such as e-commerce and e-banking has increased rapidly. Unfortunately, this increase is accompanied by a growing number of cyber-criminals who use malware (malicious programs) to achieve their malicious intentions [1].
Cyber-criminals launch new malware/attacks every year that are more sophisticated and harmful than those of previous years. Malware can adapt to its environment according to the security barriers set in an IT environment. Millions of new definitions are generated daily to exploit vulnerabilities and compromise commercial information systems [2].
To overcome this severe threat, security companies such as Kaspersky and Symantec have introduced several anti-malware products to protect individuals and companies [2]. These products are for known malware definitions. While such solutions can detect known malware with high accuracy, they often lack the ability to detect unknown malware. Moreover, referencing all the different malware has become a complex task because of the enormous increase in the number of malware programs, making it difficult to find lasting solutions. These limitations have made it necessary to explore intelligent approaches that are flexible and adaptable in detecting unknown malware.
Most of the new intelligent approaches to malware detection are trained using selected features of known malware that best represent it. These representations are then used as training instances for a suitable machine-learning algorithm that generalizes or maps such feature-based malware detection mechanisms [3][4][5][6][7][8][9][10][11][12][13]. This work extends a previously explored approach called RBCM, which is also based on reinforcement learning [3]. The extension, called eRBCM, merges the most beneficial features of Monte-Carlo-based real-time learning (MOCART) [4] and random forest [5][6][7] to make it more scalable for higher-order training datasets.
The rest of this paper is organized as follows: Section 2 presents the various approaches adopted to detect and analyze network malware; Section 3 describes our motivations and contributions; Section 4 provides a short introduction to MOCART; Section 5 illustrates the enhancements to our previous approach (RBCM) [14], made to avoid converging to local minima in the search spaces with a narrow range of values in an observation dataset; Section 6 shows the experimental set-up and compares the performance of eRBCM with its state-of-the-art rivals; and Section 7 presents our conclusions and future work.

Related Work
According to the malware detection taxonomy outlined by [8], machine-learning approaches can be classified based on three major dimensions: malware targets, malware features, and the AI model used to generalize malware detection. This section focuses on the third dimension, the machine-learning algorithm, since our study evaluates algorithm performance in the malware detection task. Machine-learning algorithms are scalable to generalize non-linear problem spaces, which is the main motivation for exploring such approaches to optimize malware detection.
The random forest (RF) machine-learning technique has been applied to several malware classification problems in the literature [5][6][7] because of its competitive performance compared to other algorithms. In an original approach proposed in [19], where the malware features were modelled as grayscale images, a comparison of three machine-learning techniques revealed that RF outperformed the naïve Bayes and KNN algorithms.
RF was explored in [6] to generalize the malware detection and classification. The authors presented a machine-learning technique called AMICO [6], which was trained using the network-traffic-based selection parameters. The main purpose was to evaluate the payload information in network traffic. Parameters such as IP address, source URL, target URL, file contents, etc. were analyzed to identify the malware patterns. In the sandbox environment, a download reconstruction module was used to generate the network traffic in real-time. The traffic was based on executable files and the malware detection technique was evaluated using real-time generated data. The training data constructed to generalize the AI model was based on both malware-based traffic data and normal traffic.
To distinguish between malware and benign files, Vadrevu et al. [7] opted for a supervised learning approach based on a RF algorithm, where the training set of labelled malicious instances was evaluated over a period of one to two months. The simulated data contained a fair distribution of both kinds of samples. The model trained on this data was tested using an academic network, and the test results showed that AMICO could detect 90% of the malicious content that travelled over the network during the testing phase.
A classifier based on a decision tree was used in [7] to detect malicious contents. The "Malware Target Recognition (MaTR)" model is a hybrid of a decision tree classifier and is optimized by using a sophisticated heuristic-based feature search to keep the rules exploration-focused towards the promising area of the search space. In their work, the heuristics are built using the structural information of malicious contents and structural anomalies. Examples of malicious content structure include file path, attributes, and size, while examples of structural anomalies include entry point and section names. The classifier was trained using a benchmark dataset called VX-Heaven. The heuristics were built in the pre-processing stages and remained part of the training instances to extract quality rules. The classifier was tested on malicious contents that were not used during the training phase. The test data showed an accuracy of 99% for the decision-tree-based classifier's malware detection.
Neural-network-based approaches in malware detection were introduced in [9-11], while a recurrent neural network (RNN)-based model was explored by Andrade et al. [9].
The model was trained using a benchmark dataset that is publicly available for exploring new security solutions. Their neural network model creates new connections among the neurons based on cycles to increase the memory-based connectivity. The model also balances the trade-off between the long-term and short-term memory approaches. Short-term memory emphasizes the exploration of the solution space, while long-term memory exploits the already known best regions of the solution space. The experimental results showed that the RNN detected malicious content with 67% accuracy.
Other approaches for app malware detection, such as API-Graph [27], DroidEvolver [28], and DroidSpan [29], are oriented towards solving the problem of sustainability (performance over time). However, it is unclear how these approaches address this problem in the case of a network attack or malware.
This paper focuses mainly on exploring a machine-learning model that can generalize the patterns of a variety of malware.

Motivations and Contributions
Our motivations are summarized below. Because of the enormous increase in new malware samples, traditional approaches are not scalable to the sophistication of new attacks and lack the capability to detect and analyze these attacks. Intelligent, self-adaptive approaches for efficient network malware detection and analysis are required [2].
There is a need for an approach that can be easily scaled for large and high-dimensional malware datasets to avoid extensive training episodes. This is essential in order to generalize the characteristics of different kinds of malware. The security solution can be trained on different datasets without changes in the learning structures.
Our contributions are summarized as follows. We improve RBCM [3] to avoid being trapped in local minima. The current version combines the best features of MOCART [4] and RF [5][6][7]. Monte-Carlo simulations are optimized to dynamically select the region and scale of the samples used by the learning model. This dynamic sampling technique enhances the performance of the RBCM learning model, which selects a sampling region of lower error but fixed size. The fixed size is a drawback that decreases RBCM's performance when the sample space is limited, or when there are large areas of low-quality samples that reduce the error with respect to the current surroundings without the model learning new knowledge.
We test eRBCM using three datasets: Microsoft Malware [30], ARP attack, and ICMP attack [31]. Furthermore, we provide a comparison of eRBCM with four state-of-the-art, best-performing prediction algorithms.

Monte-Carlo-Based Real-Time Learning (MOCART)
MOCART [4] is a Monte-Carlo (MC)-simulation-based machine-learning algorithm that applies Monte-Carlo tree search to obtain estimates from one node of the solution space to another in order to reach the goal node. The MC simulations explore the solutions using a sample space and build a learning structure. In MOCART, MC simulations build a value function that can predict the outcomes of each action in an uncertain or unknown environment. The simulations use a model of the system that can predict outcomes for a deterministic or non-deterministic problem space. As a result of these characteristics, MOCART has been used in several domains, especially non-deterministic ones. These capabilities make MOCART particularly suitable for malware detection, as the behavior of a sophisticated piece of malware can be non-deterministic, and it might behave differently in the same state of a problem space. This is particularly true for a new set of malware that is sensitive to sandboxing and Trojans. However, MOCART underperforms in domains where the number of possible samples generated during simulations is limited or where the simulation model is biased towards exploitation over exploration.
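The idea of building a value function from Monte-Carlo simulations can be illustrated with a minimal sketch. It is not the MOCART algorithm of [4]; it only shows the basic principle of estimating the value of each action by averaging the outcomes of sampled simulations in a non-deterministic environment. The function names, the toy environment, and its success probabilities are assumptions made for illustration.

```python
import random

def simulate(state, action, rng):
    """Hypothetical stochastic model of the system (an assumption).

    Action 1 yields a reward 70% of the time, action 0 only 30% of the time.
    The state is accepted for generality but unused in this toy model.
    """
    p_success = 0.7 if action == 1 else 0.3
    return 1.0 if rng.random() < p_success else 0.0

def mc_action_values(state, actions, n_rollouts, rng):
    """Estimate each action's value as the mean reward over sampled simulations."""
    values = {}
    for action in actions:
        total = sum(simulate(state, action, rng) for _ in range(n_rollouts))
        values[action] = total / n_rollouts
    return values

rng = random.Random(0)  # fixed seed so the sketch is repeatable
values = mc_action_values(state="s0", actions=[0, 1], n_rollouts=2000, rng=rng)
best_action = max(values, key=values.get)
print(values, best_action)
```

With enough simulations the estimates approach the underlying success probabilities, so the action with the higher expected outcome is selected even though individual outcomes vary from run to run.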

Reinforcement Learning Model RBCM
To generalize the pattern recognition of various malware attacks, an updated version of reinforcement learning called eRBCM is explored in this work. eRBCM combines the best features of MOCART and RF.
The sampling techniques are modified in RBCM to keep finding new samples until the error rate remains below a threshold θ.
RBCM suffers from local minima when the space dimensions are of a small scale or the data has few variations with respect to class labels [4]. eRBCM increases the number of samples in the simulation model if class labels are not equally distributed. The generative model of eRBCM is shown in Figure 1 for a sample S, simulation length d, and extension n. The generative model extends the simulation length by n, as shown in step 6 of Figure 1. The update decision is made using epsilon e, which depends on the current root mean square error (RMSE) of the learning structure (a Q function). This is the validation RMSE of the Q function on unseen data. The decision parameter is dynamic and keeps decreasing with the RMSE, which makes the sample exploration self-adaptive and keeps the trade-off between exploration and exploitation in balance. Figure 2 explains the sampling process with respect to the depth of the search for samples in the direction of solutions. The search space at S1 was simulated with three neighboring states, and only S3 was extended as it met the decision criteria. State S3 produced the smallest RMSE among its siblings and therefore provided a more reliable and stronger heuristic for selecting the direction to explore more deeply. This also helped update the Q function with weight values that reduce the error of the network. The epsilon value was updated on each extension of the search process; for example, epsilon at S4 will be 0.002. If no neighbor of S4 produces a lower RMSE than S4, the search stops at this state. This process also helps to bring the search out of local minima, as the RMSE will never be reduced below the best local minimum and the generative model will explore more spaces.
Because epsilon changes dynamically in the generative model, the model can learn strategies biased towards exploring the space rather than exploiting the best results. However, because the number of extensions is fixed, the search policy remains balanced between exploiting the best states and searching for new ones. The adaptive use of epsilon also avoids visiting the same sample more than once. This reduces the time spent searching for the best solution in the space and gives eRBCM the advantage of quicker convergence than a convolutional neural network (CNN).
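The extension step described in this section can be sketched as a greedy look-ahead loop. This is a simplified illustration under stated assumptions, not the generative model of Figure 1: epsilon is taken to be the RMSE of the current state, a state is extended only if its best neighbor has a lower RMSE, and the search stops after a fixed number of extensions. The names lookahead_search, toy_rmse, and toy_neighbors, and the integer toy problem, are hypothetical.

```python
def lookahead_search(start, neighbors, rmse, max_extensions):
    """Greedy look-ahead: repeatedly extend to the neighbor with the lowest RMSE.

    Stops when no neighbor improves on the current threshold (epsilon)
    or when `max_extensions` extensions have been made.
    """
    path = [start]
    current = start
    epsilon = rmse(current)  # assumption: threshold equals the current RMSE
    for _ in range(max_extensions):
        candidates = neighbors(current)
        if not candidates:
            break
        best = min(candidates, key=rmse)
        if rmse(best) >= epsilon:
            break  # no neighbor is better, so the search stops at this state
        current = best
        epsilon = rmse(current)  # epsilon is updated on each extension
        path.append(current)
    return path, epsilon

# Toy problem (hypothetical): states are integers and the error is the
# scaled distance to an arbitrary target state 7.
def toy_rmse(state):
    return abs(state - 7) / 10

def toy_neighbors(state):
    # Three neighboring states per step, echoing the three siblings in Figure 2.
    return [state - 1, state + 1, state + 2]

path, final_epsilon = lookahead_search(1, toy_neighbors, toy_rmse, max_extensions=10)
print(path, final_epsilon)  # prints: [1, 3, 5, 7] 0.0
```

Limiting max_extensions bounds how far the search can move away from the starting state, which corresponds to the fixed number of extensions discussed above.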

Experimentation
All experiments were performed on Windows 10 Enterprise with 16 GB of RAM and dual Intel Core(TM) i7-4702MQ CPUs, each running at 2.20 GHz. The benchmark malware files were analyzed using different programs for deep visibility of the attack data. The tools used were Wireshark and Network Miner. All tools were run in a special operating system called Security Onion. The attack files were further processed to generate a training dataset. The benchmark malware datasets analyzed were: Microsoft Malware dataset [30], ARP attack dataset [31], and ICMP attack dataset [31].

Microsoft Malware Data
The dataset in [30] is organized by machine and has several features, including 'machineidentifier' and 'hasdetected'; the latter indicates whether malware was detected on the machine. This column is used as the actual output for training the machine-learning algorithms. The dataset contains the system details for each observation, including default browser, current OS version, firewall, processor, primary disk type, volume capacity, total physical RAM, casing details, and gaming systems. This dataset is used for training machine-learning algorithms to detect malware on end systems running Windows OS.

ARP Attack Dataset
The ARP dataset [31] is taken from the Contagion malware dataset. ARP attacks exploit the vulnerabilities related to Address Resolution Protocol. ARP vulnerabilities can lead to attacks such as ARP spoofing. These types of attacks require careful analysis of the network characteristics for detection. The dataset for this malware is given in pcap files, which contain the network characteristics of the malware attack. Wireshark is used to extract the pattern of the malware. The data in pcap files is exported to csv which is then used as a training dataset.

ICMP Attack Dataset
The ICMP malware dataset [31] is also in the form of pcap files containing network data of ICMP-related attacks (ICMP smurf, ping of death, etc.). This malware exploits the vulnerabilities of network traffic based on ICMP messages or echo messages. Such messages can penetrate a network without being flagged because most security solutions are used to filter TCP/UDP-based messages. The pattern of such attacks is extracted using Wireshark and exported to a csv file, which is then used to train machine-learning algorithms.
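As an illustration of how a Wireshark csv export might be turned into training rows, the following sketch parses a small inline sample that uses Wireshark's default column set (No., Time, Source, Destination, Protocol, Length, Info). The sample packets and the chosen features (packet length and protocol flags) are assumptions for illustration; the features actually extracted for the ARP and ICMP datasets are not specified here.

```python
import csv
import io

# Hypothetical packets in the layout of a default Wireshark csv export.
SAMPLE_EXPORT = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","10.0.0.5","10.0.0.255","ICMP","98","Echo (ping) request"
"2","0.000210","10.0.0.7","10.0.0.5","ICMP","98","Echo (ping) reply"
"3","0.001002","00:11:22:33:44:55","Broadcast","ARP","42","Who has 10.0.0.1? Tell 10.0.0.5"
'''

def packets_to_features(csv_text):
    """Convert exported packet rows into simple numeric feature dictionaries."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "time": float(record["Time"]),
            "length": int(record["Length"]),
            "is_icmp": int(record["Protocol"] == "ICMP"),
            "is_arp": int(record["Protocol"] == "ARP"),
        })
    return rows

features = packets_to_features(SAMPLE_EXPORT)
for row in features:
    print(row)
```

In practice the text would be read from the exported file rather than an inline string, and each row would also need a class label (attack or normal traffic) before it could be used for supervised training.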
eRBCM was trained using the benchmark datasets. The model testing also included malware definitions not used in the training. The malware categories of ICMP and ARP include several patterns (definitions) of network malware that are part of the benchmark dataset [28]. The models were trained using 200 malware samples in both categories. eRBCM's performance was tested on 150 malware samples that were different from the 200 used in training. The eRBCM performance was compared with the following state-of-the-art machine-learning techniques: RF, J48, CNN, and FNN.
The model performance was measured by applying the correlation coefficient (CC), RMSE, and accuracy. Higher correlation coefficient and accuracy values indicate a better performance, while a model with a lower RMSE is considered superior to those with a higher RMSE.
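For reference, the three measures can be computed as in the sketch below. It assumes that the CC is the Pearson correlation coefficient between true labels and predicted scores, and that accuracy is obtained by thresholding the scores at 0.5; both choices, as well as the example labels and scores, are assumptions for illustration rather than details of the experimental set-up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between true labels and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def correlation_coefficient(y_true, y_pred):
    """Pearson correlation coefficient (assumed definition of the CC)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    hits = sum(int(p >= threshold) == t for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Hypothetical labels (1 = malware, 0 = benign) and predicted scores.
y_true = [1, 0, 1, 1, 0]
y_pred = [0.9, 0.2, 0.6, 0.4, 0.1]

print(correlation_coefficient(y_true, y_pred), rmse(y_true, y_pred), accuracy(y_true, y_pred))
```

In this example four of the five thresholded predictions match the labels, so the accuracy is 0.8, while the CC and RMSE reflect how far the raw scores lie from the labels.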

Results
The results of each model were averaged over ten runs, with the averages shown in Figure 3. The results show that RF established better correlation-based rules and outperformed the other models with respect to the CC. RF extracts the best possible rules because it is an ensemble of decision trees and identifies the best tree structure. Figure 4 shows each model's accuracy. The accuracy profile indicates that eRBCM performed better than its competitors. Because of variations in sample size in each run, the error rate fluctuated greatly in each episode of testing. Applying a convolutional neural network (CNN) to extract the attack behaviors of the different malware was a promising strategy, but the CNN took more training episodes to converge than eRBCM. While RF had a higher CC than the other models, its accuracy lacked consistency due to the complex nature of malware patterns. When comparing RF and J48, RF performed better with respect to CC and accuracy because it is an ensemble model.
The performances of CNN and FNN were comparable in terms of accuracy, indicating the capability of neural network structures to generalize malware patterns. However, FNN identified fewer similar rules and produced low correlation-based outcomes.
The main reason for eRBCM's performance is its self-adaptability in exploring and then balancing the trade-off between exploration and exploitation. eRBCM can guide its search towards the promising area of a solution space using epsilon. The generative model explores regions of lower RMSE more than regions of higher RMSE. Figure 5 displays the RMSE results of each machine-learning technique. The results show that eRBCM produced a lower RMSE than most of its rivals. eRBCM performed better than its predecessor, RBCM, and had a consistently better performance than other models because of its adaptive approach in simulations to keep the sample size and space suitable for model learning. The samples were selected based on the quality of the search for a solution during Monte-Carlo simulations. eRBCM's selection mechanism depends on a threshold based on the error rate. It selects a threshold value that minimizes the RMSE. This is the main reason for the successful generalization of attack patterns by eRBCM. The self-adaptivity of epsilon enables eRBCM to explore a larger but more focused area of the search space than RBCM and CNN. eRBCM converges faster than its rivals because of the self-tuning of epsilon.
The look-ahead search of the generative model also benefits eRBCM by finding high-quality regions in fewer iterations. The regions of lower RMSE are explored in more depth than the regions of higher RMSE. This can lead to local minima, but due to the dynamic value of epsilon, the generative model leaves such regions within a few iterations. Figure 7 shows the RMSE results for the self-adaptive look-ahead search. The results show that the extended search produced quality solutions with low RMSE. The enhanced performance of the look-ahead search during simulations is explained by the guided exploration of the generative model. The higher the n-value of the simulation model (as given in Figure 1), the more promising states of the solution space eRBCM explores. With shallow searches in the region of quality solutions, eRBCM remains biased towards exploiting the best solutions found and converges to suboptimal solutions, as shown in Figure 7. With extensions to the look-ahead search, the deep search provides an optimal balance between exploration and exploitation of the current best-found solutions. This also explains the phenomenon shown in Figure 2 relating to the look-ahead search of the generative model of eRBCM. In a deeper search, the eRBCM generative model mirrors the natural selection mechanism of evolutionary techniques. It provides a new solution as a mutation of the existing best solution, as shown in Figure 2. At state S3, for example, the generative model generates a new state S4, which is a mutation of S3.

Conclusions
In this paper, we presented a new approach called eRBCM to detect malware. The new model was designed using the reinforcement learning approach, which utilizes the strength of Monte-Carlo simulations and builds a strong machine-learning model to detect complex malware patterns. It combines the most beneficial elements of MOCART's reinforcement learning and RF's exploration capabilities. A large number of experiments were conducted using different malware benchmarks, including ARP attack, ICMP attack, and Microsoft Malware. eRBCM was consistently better than its competitors in terms of learning the new malware patterns and detecting unknown malware. This was mainly explained by eRBCM's self-adaptability to exploration and intelligent tuning of the balance for the trade-off between exploration and exploitation.
For future work, we plan to test our approach with various attacks to measure its scalability and accuracy. Furthermore, eRBCM will be explored for mobile malware using benchmark datasets. The mobile malware will be analyzed using sophisticated forensics tools, identifying key patterns via an innovative pre-processing stage. The malware will be scanned and categorized based on its malicious agenda. In each category, the common parameters will be explored using clustering, with these clusters used to generate a training dataset.