An Investigation to Detect Banking Malware Network Communication Traffic Using Machine Learning Techniques

Abstract: Banking malware are malicious programs that attempt to steal confidential information, such as banking authentication credentials, from users. Zeus is one of the most widespread banking malware variants ever discovered. Since the Zeus source code was leaked, many other variants of Zeus have emerged, and tools such as anti-malware programs exist that can detect Zeus; however, these have limitations. Anti-malware programs need to be regularly updated to recognise Zeus, and the signatures or patterns can only be made available once the malware has been seen. This limits the capability of these anti-malware products because they are unable to detect unseen malware variants; furthermore, malicious users are developing malware that seeks to evade signature-based anti-malware programs. In this paper, a methodology is proposed for detecting Zeus malware network traffic flows by using machine learning (ML) binary classification algorithms. This research explores and compares several ML algorithms to determine the algorithm best suited for this problem and then uses these algorithms to conduct further experiments to determine the minimum number of features that could be used for detecting the Zeus malware. This research also explores the suitability of these features when used to detect both older and newer versions of Zeus as well as when used to detect additional variants of the Zeus malware. This will help researchers understand which network flow features could be used for detecting Zeus and whether these features will work across multiple versions and variants of the Zeus malware.


Introduction
Cybercrime is a major threat to cybersecurity [1,2], and it is estimated that the yearly cost of cybercrime could rise to USD 10.5 trillion by the year 2025, a significant proportion of which is related to malware such as banking malware. Banking malware attacks have also been increasing on a yearly basis and, according to [3], increased by 80% in 2021 alone. One of these banking malware variants, the Zeus malware (hereafter referred to as Zeus), has become one of the most prevalent banking malware variants ever discovered [4]. Furthermore, in 2011, the Zeus program code was made public [5], allowing malware developers to create additional variants of Zeus and to develop additional modules for the Zeus malware [6]. Since the Zeus code was leaked, many variants of Zeus have emerged, including ZeusPanda, Ramnit and Citadel.

Need for Malware Detection
As the number of malware programs and their variants is increasing rapidly and they are becoming more sophisticated and prevalent [7], additional modern techniques need to be developed to detect these malware variants, and [7] highlights the importance of using AI to detect malware. The authors of [8] also discuss the limitations of other malware detection approaches, such as detecting malicious patterns in executables and using heuristic-based and statistical approaches, and recommend that researchers use machine learning and deep learning approaches to address these limitations. Signature-based malware detection systems also exist, but these systems have limitations as well; for example, they can only detect known malware [9].
This paper proposes a framework and methodology to detect malware and benign traffic using machine learning and deep learning algorithms. The main contribution of this paper is a methodology to detect the Zeus banking malware and differentiate it from benign traffic using binary classification machine learning algorithms. This paper compares three binary classification algorithms to determine which provides the best results when used to distinguish Zeus from benign traffic. This paper also determines the minimum number of features that could be used to detect Zeus and benign traffic. Researchers [10][11][12][13] have discussed and proposed several supervised machine learning (ML) algorithms that could be used for analysing this type of problem, and this paper uses three of them: the random forest ML algorithm, the decision tree ML algorithm and the Keras neural network (KNN) deep learning algorithm. This paper aims to:
• Determine a methodology that can be used by deep learning and machine learning algorithms for detecting the Zeus malware.
• Determine which ML algorithm produces the best detection results.
• Determine whether the features that produce the best detection results on one dataset will work on other datasets from other sources.
• Determine a minimum set of features that could be used for detecting Zeus.
• Determine whether the features that produce the best detection results work across newer and older versions of Zeus.
• Determine whether the features that produce the best detection results when detecting Zeus also work on additional variants of the Zeus malware.

Zeus Malware Architecture
An important feature of the Zeus malware is the way that it communicates, as it uses command and control (C&C) channels for this purpose. The author of [14] has discussed the various phases of the C&C communication, which can be seen in Figure 1. This communication can occur using either a centralised or a peer-to-peer (P2P) architecture, with the P2P architecture being more robust and resilient [15]. This is because, in a centralised architecture, if the central C&C server becomes unreachable or is taken down, the Zeus bots will not be able to communicate with the C&C server, preventing the bots from receiving commands, updating themselves and downloading new configuration files [16]. Newer variants of Zeus use the P2P C&C architecture. These are more resilient to takedown efforts because the configuration file does not point to a static C&C server [17]. Instead, the C&C server information is obtained from a peer (proxy bot), which can be updated if the C&C server is taken down or becomes unreachable [18]. Stolen data is routed through the C&C network to the malware authors' C&C server, where the stolen data is decrypted and saved to a database [19].
As discussed by [20], Zeus propagates like a virus, mainly infecting Windows systems, and the predominant infection vector is phishing emails, which are a significant distribution mechanism for malware. Research by [21] has discussed this in detail and states that around 90 percent of data breaches are caused by phishing. Once the Zeus binary executes on a Windows system, it performs several actions. One of these is to create two files called local.ds and user.ds. Local.ds contains the dynamic configuration file downloaded from the C&C server, while user.ds stores stolen credentials and other information that needs to be transmitted back to the C&C server [22]. Additional code is injected into svchost and is responsible for network communications. Svchost is also responsible for injecting malicious code into many Windows processes, which provides Zeus with the ability to steal credentials and launch financial attacks.

Related Studies
Bothunter [23] is a perimeter scanning system that uses three sensors and a correlation engine to identify malicious traffic flows between an infected host and a malicious entity. Bothunter is built on top of the open-source platform SNORT; it tracks the various stages of the malware communication flow and can correlate both inbound and outbound traffic to identify malware traffic. Bothunter uses two plugins called SLADE and SCADE. SCADE's role is to analyse the communication flows to identify traffic patterns that can be considered harmful. These traffic patterns include:
• Hosts that frequently scan external IP addresses.
• An evenly distributed communication pattern, which is likely to indicate that the communication is malicious.
SLADE's role is to analyse network packets and alert the administrator if a packet deviates from an established profile. SLADE was developed using PAYL [24], which allows SLADE to examine 256 features of the packet and then use this information to determine whether the packet is malicious or not.
Botminer [25] is a tool that was designed to detect groups of compromised computers. It does this by monitoring network communication flows using two modules, a C-plane module and an A-plane module. The C-plane's role is to log network traffic to identify all the hosts that are communicating, and the A-plane's role is to identify what these hosts are doing. Features extracted from both modules can be used to identify communication patterns that are similar between hosts; if these communication patterns are malicious, this indicates that a particular group of hosts is communicating maliciously. The A-plane module is based on Bothunter's [23] SCADE module and can analyse communications to determine malicious communication patterns [25].
CONIFA [26] uses machine learning to detect malware communication traffic. It does this by training and testing on Zeus malware traffic using the correlation-based feature selection (CFS) algorithm with the C4.5 classification algorithm. To improve CONIFA's accuracy and prediction results, the authors of [26] created a cost-sensitive variant of the C4.5 classification algorithm, which uses a lenient and a strict classifier, and compared the prediction results with those of a standard machine learning framework that uses a cost-insensitive version of the C4.5 algorithm. The standard framework's detection rate was good when evaluating the training dataset; however, when evaluating the test data, the recall rate dropped to 56%. CONIFA's results demonstrated an improvement in the detection accuracy, with the recall rate increasing to 67%.
The RCC Detector (RCC) [27] analyses network traffic flowing from a host to identify any malware communication traffic. To do this, the RCC [27] uses a multi-layer perceptron (MLP) and a temporal persistence (TP) classifier. The MLP classifier is made up of an input layer, an output layer and one hidden layer [27], and these are used to classify botnets using several characteristics, including the flow count, session length, uniformity score and the Kolmogorov-Smirnov test.
The multi-layer feed forward network (MLFFN) [28] is a tool that extracts TCP features from the TCP connections originating from a host computer and uses these to identify botnet communication traffic. MLFFN [28] consists of an input layer made up of six neurons and an output layer made up of four neurons. MLFFN was tested on four datasets, namely, Zeus-1, Zeus-2, Spyeye-1 and Spyeye-2, and it is worth noting that these are all older versions of the malware.
Genetic programming (GP) [29] used the Symbiotic Bid-Based (SBB) algorithm and the C4.5 machine learning algorithm to identify unique botnet communication patterns. To do this, features were extracted from the communication flows of three malware variants, namely Zeus, Conficker and Torpig. The features were extracted using Softflowd [30], and the authors of [29] were able to categorise these three malware variants. It is worth noting that the results are based on the use of older versions of the malware variants.
MOCA [31] uses a two-stage monitoring and classification system to detect and classify malicious attacks. The stage one classifier identifies behaviours within the network flows that are outside of the normal range (abnormal). These abnormal behaviours are then sent to the stage two classifier, which attempts to classify the attacks into a class such as a DDoS attack in an IoT network or a botnet attack. Two datasets were used for testing, CICIDS2017 and CICDDOS2019, and the accuracy achieved was 99.84% for CICIDS2017 and 93% for CICDDOS2019. The algorithms used in this research include the decision tree, random forest and XGBoost ML algorithms.

Problem Statement
This paper intends to develop a framework and methodology that uses machine learning techniques to detect malware. Other methodologies exist and have been used by many researchers to detect malware. These include anomaly-based detection approaches such as those discussed by [32,33], and signature-based approaches such as those discussed by [34,35]; however, these have drawbacks, which are highlighted in [36]. For example, signature-based systems need to be updated regularly to cater for newly emerging malware variants, and they are not able to detect unknown malware variants or zero-day malware.
Machine learning can help address many of these issues [37], and this paper develops a framework and approach using machine learning that is intended to detect several banking malware variants. Although other researchers [26][27][28] have done some experimental work on detecting malware, there is little to no research that aims to detect a range of malware variants by training on only one dataset, i.e., one malware variant. This research paper uses only one dataset for training and then uses this to build a machine learning model. This model is then used to detect multiple banking malware variants and to distinguish between benign and malware communication traffic. This research also analyses both banking malware variants that have emerged recently and those that have been around for longer, which should ensure that both the older and newer versions of the banking malware are detectable by the machine learning algorithms.

Research Methodology
This research paper aims to classify network traffic flows as either Zeus (malware) or benign (good). For this research, the raw network traffic samples were collected as pcap files. Each pcap file is made up of network flows, where a flow refers to a sequence of packets flowing between a source and a destination host. In this paper, the flows are referred to as ML samples, and the features are extracted from these samples.
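To make the notion of a flow concrete, the following minimal sketch groups packet records into bidirectional flows. The packet tuples, the `flow_key` helper and the choice of keying on endpoints plus protocol are illustrative assumptions; this is not how the feature-extraction tool used in this research is implemented.

```python
from collections import defaultdict


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return a direction-independent key so both directions share one flow."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


def group_into_flows(packets):
    """Group (src_ip, src_port, dst_ip, dst_port, proto, length) records by flow."""
    flows = defaultdict(list)
    for src_ip, src_port, dst_ip, dst_port, proto, length in packets:
        key = flow_key(src_ip, src_port, dst_ip, dst_port, proto)
        flows[key].append(length)
    return dict(flows)


# Hypothetical packet records (addresses are documentation-range examples).
packets = [
    ("192.0.2.10", 49152, "198.51.100.5", 80, "tcp", 60),
    ("198.51.100.5", 80, "192.0.2.10", 49152, "tcp", 1500),
    ("192.0.2.10", 49153, "198.51.100.5", 443, "tcp", 60),
]
flows = group_into_flows(packets)
```

Here the first two packets travel in opposite directions between the same endpoints and therefore fall into the same flow, while the third packet starts a second flow.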

Data Collection and Preparation
Figure 2 depicts the data collection and preparation steps, which are discussed further in this section. To prepare the data for the ML algorithms, the features were extracted from the samples using Netmate-flowcalc (NF), a tool developed by [38], and were then exported into a CSV file. NF was used because it is an open-source tool that can extract the features required by the ML algorithms and has also been used by other researchers [39][40][41][42]. A total of 44 features were extracted by NF (see Appendix A for a brief description of the features), and the features from the benign and Zeus flows were extracted into separate CSV files and labelled. A label of '1' was applied to the Zeus samples and a label of '0' was applied to the benign samples. The two files were then combined into one CSV file, and this was used for the empirical analysis conducted during this research.
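The labelling and combining step can be sketched as follows. This is a minimal illustration using Python's standard `csv` module; the helper names, the in-memory files and the two-column extracts are assumptions made for brevity and stand in for the separate 44-feature CSV files produced by NF.

```python
import csv
import io


def read_labelled(csv_file, label):
    """Read flow features from a CSV file object and append a class label."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["label"] = label
        rows.append(row)
    return rows


def combine(benign_file, zeus_file):
    """Label benign flows 0 and Zeus flows 1, then merge them into one list."""
    return read_labelled(benign_file, 0) + read_labelled(zeus_file, 1)


# Hypothetical two-feature extracts standing in for the real CSV files.
benign_csv = io.StringIO("total_fpackets,duration\n12,3500\n4,900\n")
zeus_csv = io.StringIO("total_fpackets,duration\n7,120000\n")
combined = combine(benign_csv, zeus_csv)

# Write the combined, labelled samples to a single CSV (in memory here).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["total_fpackets", "duration", "label"])
writer.writeheader()
writer.writerows(combined)
```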


Feature Selection
One of the main issues in ML is selecting the appropriate features for the ML algorithm, and the criticality of this has been discussed by many researchers such as [43,44]. Selecting the right features has the following benefits:
• Variance (overfitting) is reduced.
• The computational cost and the time for running the algorithm are reduced.
• The ML algorithm is able to learn faster.
There are several techniques that can be used for selecting the best features, and [45,46] discuss these in detail. For example, two of these techniques are:
• Filter method: feature selection is independent of the ML algorithm.
• Wrapper method: a subset of the features is selected and used to train the ML algorithm. Based on the results, features are either removed or added until the best features are determined.

For this research, the features were studied [47][48][49][50] and, based on this, were divided into two groups called Feature set1 and Feature set2. Only the features from Feature set1 were used during this research; the features in Feature set2 were excluded. This was because these features could potentially be related to the characteristics of the network from which the packets were extracted, resulting in the ML algorithm making false correlations. For example, if the benign and malware traffic came from a particular IP address range, the ML algorithm might use the IP address information to make predictions. Table 1 shows the features that were excluded (Feature set2). All the remaining features were included in Feature set1 and were used during this research. These features are: total_fpackets, total_fvolume, total_bpackets, total_bvolume, min_fpktl, mean_fpktl, max_fpktl, std_fpktl, min_bpktl, mean_bpktl, max_bpktl, std_bpktl, sflow_fpackets, sflow_fbytes, sflow_bpackets, sflow_bbytes, fpsh_cnt, bpsh_cnt, furg_cnt, burg_cnt, total_fhlen, total_bhlen, duration, min_active, mean_active, max_active, std_active, min_idle, mean_idle, max_idle and std_idle.
This paper analyses and compares the performance of the ML algorithms using nine datasets obtained from four locations. One location was Zeustracker [51], a website that monitors Zeus C&C activities, and these samples were downloaded on 4 February 2019. The other datasets were obtained from Stratosphere, Abuse.ch and Dalhousie University; these datasets are a combination of older and newer versions of the Zeus malware and three other variants of the Zeus malware, namely ZeusPanda, Ramnit and Citadel. Stratosphere [52] specialises in collecting malware and benign traffic captures and has made multiple datasets available for research purposes. Abuse.ch is a research project that identifies and tracks malware and botnets and is a platform integrated with many commercial and open-source platforms, including VirusTotal, ClamAV, Kaspersky and Avast [53]. Dalhousie University has botnet samples that are available for download; these samples are part of the NIMS botnet research dataset and have been used by other researchers [54]. Table 2 describes the datasets that were used for the research reported in this paper.
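The restriction to Feature set1 described in this section can be sketched as a simple column filter. The feature names below are those listed above; the `select_feature_set1` helper and the two network-specific columns in the example record (`srcip`, `dstip`) are hypothetical stand-ins for excluded Feature set2 columns, since the exact contents of Table 1 are not reproduced here.

```python
# Feature set1 names as listed in the text.
FEATURE_SET1 = [
    "total_fpackets", "total_fvolume", "total_bpackets", "total_bvolume",
    "min_fpktl", "mean_fpktl", "max_fpktl", "std_fpktl",
    "min_bpktl", "mean_bpktl", "max_bpktl", "std_bpktl",
    "sflow_fpackets", "sflow_fbytes", "sflow_bpackets", "sflow_bbytes",
    "fpsh_cnt", "bpsh_cnt", "furg_cnt", "burg_cnt",
    "total_fhlen", "total_bhlen", "duration",
    "min_active", "mean_active", "max_active", "std_active",
    "min_idle", "mean_idle", "max_idle", "std_idle",
]


def select_feature_set1(sample):
    """Keep only the Feature set1 columns of one flow record."""
    return {name: sample[name] for name in FEATURE_SET1}


# Hypothetical flow record: all Feature set1 values are set to 0.0, and two
# illustrative network-specific columns (assumed names) are added.
sample = {name: 0.0 for name in FEATURE_SET1}
sample.update({"srcip": "192.0.2.10", "dstip": "198.51.100.5"})
selected = select_feature_set1(sample)
```

Selecting by an explicit allow-list, rather than dropping a deny-list, means that any unexpected network-specific column is excluded by default.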

Machine Learning Algorithms
The ML algorithms used for this research are discussed in this section. They are supervised machine learning algorithms, as these are the most suitable for classification problems, as discussed by [55]. The machine learning algorithms used during this research are the decision tree (DT) algorithm, the random forest (RF) algorithm and the Keras neural network (KNN) deep learning algorithm. In this paper, KNN always refers to the Keras neural network and not to the k-nearest neighbours algorithm.
The decision tree algorithm is a common machine learning algorithm that can be used for classification problems and is especially useful for binary classification problems [56]. For this reason, the decision tree algorithm is well suited to this prediction problem, because this analysis is trying to determine whether a network flow is malicious (Zeus banking malware traffic) or benign. The authors of [57] also state that the decision tree algorithm can produce good prediction results.
The random forest (RF) algorithm works by building and combining multiple decision trees [58]. It can be more efficient and provide better prediction results than the decision tree algorithm [59], and it reduces the possibility of overfitting [60]. It is important to tune the parameters to try to increase the prediction accuracy when using the RF algorithm; however, it is difficult to predict the best parameters ahead of time, as the parameters are selected based on trial and error. One of these parameters is the number of trees built during the training and testing of the data. The authors of [61] state that building more than 128 trees provides no significant gain in accuracy and can increase costs, and that the optimum number of trees for the random forest classifier was found to be between 64 and 128. For this empirical analysis, the random forest algorithm was coded to build between 64 and 128 decision trees, and once the training was complete, the optimal number of trees was selected based on the best prediction results.
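The search over the number of trees can be sketched as follows. This is a minimal illustration and not the code used in this research: the use of scikit-learn, the synthetic data, the held-out validation split, the f1-score as the selection criterion and the step of 16 trees are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real inputs would be the Feature set1 flows.
X, y = make_classification(n_samples=400, n_features=31, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

best_trees, best_f1 = None, -1.0
# Try forest sizes between 64 and 128 trees (step of 16 is an assumption).
for n_trees in range(64, 129, 16):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    score = f1_score(y_val, model.predict(X_val))
    # Keep the forest size with the best validation f1-score seen so far.
    if score > best_f1:
        best_trees, best_f1 = n_trees, score
```

After the loop, `best_trees` holds the forest size that gave the best validation score, which mirrors the idea of selecting the number of trees from the best prediction results.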
Keras is a popular neural network library implemented in Python [62] and can be used for classification problems such as the one examined during this research [63]. The Keras neural network (KNN) deep learning algorithm was used for training and testing the datasets. For this empirical analysis, a sequential KNN model [64] was used, which means that the output of one layer is input into the next layer. The deep learning model consisted of one input layer, three hidden layers and one output layer, and a graphical representation of this can be seen in Figure 3. It is important to note that only one of the datasets was used for training and the remaining datasets were used for testing.

System Architecture and Methodology
The system architecture is depicted in Figure 4 and shows the steps that are completed to prepare the samples for the ML algorithms. These include:
• The datasets are identified and collected.
• Features are extracted from these datasets.
• The extracted features are transferred to a CSV file and prepared.
• The features are selected for training and testing.
• The algorithm is trained and tested, and a model is created. Only one dataset is used for the training.
• The model is tuned and trained and tested again if required.
• The model is used to test and evaluate the remaining datasets.
• The final model is deployed, all the data samples are tested and a report highlighting the evaluation metrics is created.
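The train-on-one, test-on-the-rest protocol in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: scikit-learn is assumed as the library, the `make_dataset` helper generates synthetic stand-in data, and the dataset names are placeholders. Because each synthetic dataset is generated independently, the resulting scores are not meaningful and only the workflow is of interest.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier


def make_dataset(seed):
    """Synthetic stand-in for one labelled dataset (1 = Zeus, 0 = benign)."""
    return make_classification(n_samples=200, n_features=31, random_state=seed)


datasets = {f"dataset{i}": make_dataset(i) for i in range(1, 4)}

# Train on one dataset only.
X_train, y_train = datasets["dataset1"]
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate the single trained model on every remaining dataset.
report = {}
for name, (X, y) in datasets.items():
    if name == "dataset1":
        continue
    predicted = model.predict(X)
    report[name] = {
        "precision": precision_score(y, predicted, zero_division=0),
        "recall": recall_score(y, predicted, zero_division=0),
        "f1": f1_score(y, predicted, zero_division=0),
    }
```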



Evaluation
Precision, recall and f1-score evaluation metrics [65] are used to determine the accuracy of the ML algorithms. Precision is the percentage of correctly identified positive cases out of all the samples predicted as positive, which in this case can include both malware and benign samples [65]. Recall is the percentage of correctly identified positive cases out of the actual positive samples only [66], which in this case are the malware samples. The formulas to calculate precision and recall are:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. The f1-score is another measure used for evaluation, and it takes both false positives and false negatives into account. The author of [67] states that the precision and recall are both combined during the evaluation of the ML algorithm. The formula to calculate the f1-score is set out below:

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

A confusion matrix [67] will also be generated, and an example of this is shown in Table 3. The confusion matrix will be used to measure the performance and prediction accuracy of the algorithm when tested and evaluated on the unseen datasets, and it will identify how many Zeus and benign samples were correctly identified.

Table 3. An example of the confusion matrix used to measure the detection accuracy.

                         Predicted Benign    Predicted Zeus
Actual Benign (Total)    TN                  FP
Actual Zeus (Total)      FN                  TP
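The three metrics can be computed directly from confusion-matrix counts using the standard definitions, precision = TP / (TP + FP), recall = TP / (TP + FN) and f1-score = 2 × precision × recall / (precision + recall). The following is a minimal sketch; the `evaluation_metrics` helper and the counts are hypothetical and are not results from this research.

```python
def evaluation_metrics(tp, fp, fn):
    """Compute precision, recall and f1-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 Zeus flows detected, 10 Zeus flows missed and
# 5 benign flows wrongly flagged as Zeus.
precision, recall, f1 = evaluation_metrics(tp=90, fp=5, fn=10)
```

The zero checks guard against division by zero when a dataset yields no positive predictions or contains no positive samples.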

Results
This section presents the training and testing results of the three algorithms and compares the prediction results.

Training and Testing the Machine Learning Algorithms Using the Data Sets
The DT and RF algorithms were trained on Dataset1, using all the features from Feature set1, and a model was created and used for testing all the remaining datasets. The precision, recall and f1-score results for the DT algorithm can be seen in Table 4, and those for the RF algorithm can be seen in Table 5. A confusion matrix was generated for each dataset tested, and the results can be seen in Tables 6 and 7. These tables show the number of Zeus samples tested, how many of the Zeus samples were correctly classified (true positives) and how many of the Zeus samples were misclassified (false negatives). They also show the number of benign samples tested, how many of these were classified correctly (true negatives) and how many of these were misclassified (false positives).

Training and Testing the Deep Learning Algorithm Using the Data Sets
The deep learning (DL) algorithm was trained in a similar manner. The precision, recall and f1-score results can be seen in Table 8, and the confusion matrices can be seen in Table 9.

Comparing the Prediction Results of the Three Algorithms Tested
The results obtained from testing the three algorithms are all compared in this section. Figure 5 shows the true positive results of all the algorithms when tested against all the datasets, and Figure 6 shows the true negative results when tested against all the datasets.
The Zeus malware prediction accuracies for dataset1, dataset2, dataset3, dataset7, dataset8 and dataset9 were all above 90%, with the random forest algorithm performing the best with an average prediction accuracy of 97% across these datasets. The three 2014 Zeus datasets (dataset4, dataset5 and dataset6) produced mixed results. For dataset6, the deep learning algorithm performed better than the other two, with a detection result of 86%. For dataset4 and dataset5, the random forest algorithm performed the best, with results of 71% and 74%, respectively.
For the benign traffic, the prediction accuracies of all the algorithms were above 90% for dataset1, dataset2, dataset3, dataset5, dataset6, dataset7, dataset8 and dataset9, with the random forest algorithm performing the best with an average prediction accuracy of 98% across these datasets. For dataset4 and dataset8, the deep learning algorithm performed best with results of 95% and 94%, respectively, and the decision tree algorithm had the lowest prediction accuracy, with a result of 87% for both of these datasets.
This paper has demonstrated a methodology that could be used to detect the Zeus malware and its variants and has demonstrated that the methodology works across multiple datasets and three other variants of the Zeus malware. The next section (Section 5.4) investigates the impact on the prediction accuracy when the number of features used during training and testing is reduced.

Reducing the Features to the Minimum Number of Possible Features
Multiple experiments were conducted by reducing the features from Feature set1, and this section investigates how the prediction accuracy for both the malware and benign traffic changes as the number of features is reduced. To do this, the ML algorithms were trained and tested using dataset1, and the impact rating of each feature was determined and then used to establish which features have the highest impact ratings and which have the lowest. Some of these impact ratings can be seen in Figure 7. For example, Figure 7 shows that the mean_active feature has an impact rating of 13.103% and that max_bpktl has an impact rating of 6.025%. Analysing the features in this way supported the systematic removal of the features; this process can be seen in Figure 8 and is described below.

• Remove the one feature that has the lowest impact score.
• Train on the dataset with this feature removed.
• Test the remaining datasets.
• Calculate the prediction accuracy and record the results.
• Remove another feature and re-train on the dataset.
• Test the remaining datasets.
• Calculate the prediction accuracy and record these results.
• Repeat this process until the accuracy of two of the datasets falls below 50% during testing, as this would mean that more than half of the Zeus samples were misclassified for two or more of the datasets.
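The steps above can be read as a backward elimination loop, sketched here in plain Python. This is an illustrative reading rather than the code used in this research. The function name `reduce_features`, the `evaluate` callback and the toy data are hypothetical. Two assumptions are also made: the impact ratings are computed once and reused, and the feature set returned is the last one tested before the stopping rule was triggered.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback: trains on the training dataset using only the given
# features and returns the Zeus detection rate (0.0 to 1.0) per test dataset.
EvaluateFn = Callable[[Sequence[str]], List[float]]
History = List[Tuple[Tuple[str, ...], List[float]]]


def reduce_features(
    impact: Dict[str, float],
    evaluate: EvaluateFn,
    threshold: float = 0.5,
    max_failing: int = 2,
) -> Tuple[List[str], History]:
    """Drop the lowest-impact feature one at a time until accuracy collapses."""
    current = list(impact)
    history: History = [(tuple(current), evaluate(current))]
    while len(current) > 1:
        # Remove the remaining feature with the lowest impact rating.
        weakest = min(current, key=lambda name: impact[name])
        candidate = [name for name in current if name != weakest]
        # Re-train and test with the reduced set, then record the results.
        rates = evaluate(candidate)
        failing = sum(1 for rate in rates if rate < threshold)
        if failing >= max_failing:
            break  # stopping rule hit: keep the previous feature set
        current = candidate
        history.append((tuple(current), rates))
    return current, history


# mean_active and max_bpktl use the ratings quoted from Figure 7; the other
# two ratings are hypothetical, added only to give the loop work to do.
impact = {"mean_active": 13.103, "max_bpktl": 6.025, "duration": 3.0, "furg_cnt": 0.5}
total_impact = sum(impact.values())


def toy_evaluate(features: Sequence[str]) -> List[float]:
    """Stand-in for real training and testing: returns made-up detection rates."""
    share = sum(impact[name] for name in features) / total_impact
    return [share, share - 0.3, share - 0.45]


minimal, history = reduce_features(impact, toy_evaluate)
print(minimal)
```

Passing the training and testing step in as a callback keeps the loop independent of any particular algorithm, so the same sketch would apply to the DT, RF and DL models.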

Training and Testing with the Minimum Number of Features with the DL Algorithm
Following the process discussed in Section 5.4, it was determined that the minimum set of features that could be used by the DL algorithm is as follows: total_fvolume, total_bpackets, total_bvolume, min_fpktl, mean_fpktl, max_fpktl, std_fpktl, min_bpktl, mean_bpktl, max_bpktl, std_bpktl, sflow_fbytes, sflow_bbytes, bpsh_cnt, duration, min_active, mean_active, max_active, min_idle and max_idle. The precision, recall and f1-score results can be seen in Table 10, and the confusion matrices can be seen in Table 11.
Figure 9 compares the results of detecting the Zeus malware when using all the features and when using the minimum number of features. The prediction results of dataset1, dataset2, dataset3, dataset4, dataset7, dataset8 and dataset9 were all within 5% of each other, dataset5 was within 9% and dataset6 was within 12%. Figure 10 compares the results of detecting the benign samples when using all the features and when using the minimum number of features with the deep learning algorithm, and shows that the prediction results of all the datasets were within 1% of each other.
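The comparison between the two feature sets can be expressed as a per-dataset difference, as in the short sketch below. It assumes that "within 5%" refers to a difference in percentage points. The function names are hypothetical, and only the dataset6 figures (86% with all the features and 74% with the minimum features) are taken from the text; the other values are made up for illustration.

```python
from typing import Dict, List


def accuracy_gaps(full: Dict[str, float], reduced: Dict[str, float]) -> Dict[str, float]:
    """Absolute difference, in percentage points, for datasets present in both runs."""
    return {name: abs(full[name] - reduced[name]) for name in full if name in reduced}


def within(gaps: Dict[str, float], tolerance: float) -> List[str]:
    """Names of the datasets whose gap does not exceed the tolerance."""
    return sorted(name for name, gap in gaps.items() if gap <= tolerance)


# Zeus detection rates in percent. dataset6 uses the DL results quoted in the
# text; dataset1 and dataset5 are hypothetical placeholders.
all_features = {"dataset1": 99.0, "dataset5": 70.0, "dataset6": 86.0}
min_features = {"dataset1": 98.0, "dataset5": 62.0, "dataset6": 74.0}

gaps = accuracy_gaps(all_features, min_features)
print(gaps)
print(within(gaps, 5.0))
```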

Training and Testing Using the Minimum Number of Features with the DT Algorithm
Similar experiments were conducted using the DT algorithm, and it was determined that the minimum set of features that could be used by the DT algorithm is: total_fvolume, total_bpackets, total_bvolume, min_fpktl, mean_fpktl, max_fpktl, min_bpktl, mean_bpktl, max_bpktl, std_bpktl, sflow_fbytes, sflow_bbytes, furg_cnt, burg_cnt, duration, min_active, mean_active, max_active, min_idle and max_idle.
The precision, recall and f1-score results can be seen in Table 12, and the confusion matrices can be seen in Table 13. Figure 11 compares the results of detecting the Zeus malware when using all the features and when using the minimum number of features, and shows that the prediction results of dataset1, dataset2, dataset3, dataset4, dataset6, dataset7, dataset8 and dataset9 were all within 5% of each other, and that dataset5 was within 8%. Figure 12 compares the results of detecting the benign samples when using all the features and when using the minimum number of features, and shows that the prediction results of all the datasets were within 1% of each other.

Training and Testing Using the Minimum Number of Features with the RF Algorithm
Multiple experiments were conducted using the RF algorithm, and the features were manually reduced by following the process described above (Section 5.4). This process was repeated until two of the dataset prediction results fell below 50%, and it was determined that the minimum set of features that could be used is as follows: total_fvolume, total_bvolume, min_fpktl, mean_fpktl, max_fpktl, min_bpktl, mean_bpktl, max_bpktl, std_bpktl, sflow_fbytes, sflow_bbytes, bpsh_cnt, duration, min_active, mean_active, max_active and min_idle. The precision, recall and f1-score results for testing all the datasets using the minimum number of features with the RF algorithm can be seen in Table 14, and the confusion matrices can be seen in Table 15. Figure 13 compares the results of detecting the Zeus malware when using all the features and when using the minimum number of features with the RF algorithm, and shows that the prediction results of dataset1, dataset2, dataset3, dataset7, dataset8 and dataset9 were all within 5% of each other and that dataset4, dataset5 and dataset6 were within 16% of each other. Figure 12 compares the results of detecting the benign samples when using all the features and when using the minimum number of features, and shows that the prediction results of all the datasets were within 1% of each other.
Figure 14 compares the true positive results of all three algorithms when tested using the minimum number of features. The malware prediction results for all the datasets apart from dataset6 were within 10% of each other. Dataset6 was an outlier with a difference of 36%; in this case, the DL algorithm performed the best with a prediction result of 74% and the DT algorithm performed the worst with a prediction result of 38%. Figure 15 compares the results of detecting the benign samples between using all the features and the minimum number of features and shows that the prediction results of all the datasets were within 2% of each other.
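The dataset6 outlier can be checked with a small calculation of the spread between the best and worst algorithm, shown in the sketch below. The helper name `spread` is hypothetical. Only the DL and DT rates are quoted in the text, so the RF rate is omitted rather than guessed.

```python
from typing import Dict, Tuple


def spread(rates: Dict[str, float]) -> Tuple[str, str, float]:
    """Return (best algorithm, worst algorithm, best minus worst) for one dataset."""
    best = max(rates, key=lambda name: rates[name])
    worst = min(rates, key=lambda name: rates[name])
    return best, worst, rates[best] - rates[worst]


# Dataset6 true positive rates (percent) with the minimum number of features,
# as reported in the text.
dataset6 = {"DL": 74.0, "DT": 38.0}
best, worst, gap = spread(dataset6)
print(f"best={best} worst={worst} gap={gap:.0f} percentage points")
```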

Conclusions
The empirical analysis has shown that the framework and methodology adopted for this research can detect both older and newer versions of the Zeus banking malware, which demonstrates the potential of the framework to detect banking malware that evolves over time. The framework and methodology can also predict other banking malware variants, which demonstrates the potential to detect a wide range of banking malware variants without the need to analyse each variant to learn about its features.

For future work, this research could be extended by enhancing the methodology to incorporate additional banking malware variants. Moreover, further research can be conducted to detect other malware variants and improve the prediction accuracy when detecting them. Researchers can also further this research by designing and building an IDS solution that could detect a wide range of malware; the findings from this research could be used for this purpose and by anti-malware vendors when they design malware detection tools. Action on the malicious traffic could also be taken once the malware has been detected. The findings from this research can be used by other researchers to develop their own malware prediction tools to enhance their research.

| Feature | Description |
|---|---|
| sflow_fbytes | The average number of bytes in a sub flow in the forward direction |
| sflow_bpackets | The average number of packets in a sub flow in the backward direction |
| sflow_bbytes | The average number of bytes in a sub flow in the backward direction |
| fpsh_cnt | The number of times the PSH flag was set in packets travelling in the forward direction (0 for UDP) |
| bpsh_cnt | The number of times the PSH flag was set in packets travelling in the backward direction (0 for UDP) |
| furg_cnt | The number of times the URG flag was set in packets travelling in the forward direction (0 for UDP) |
| burg_cnt | The number of times the URG flag was set in packets travelling in the backward direction (0 for UDP) |
| total_fhlen | The total bytes used for headers in the forward direction |
| total_bhlen | The total bytes used for headers in the backward direction |

Figure 2. Methodology for collecting and preparing the data.

Figure 5. Comparison of the Zeus prediction results for all three ML algorithms.

Figure 6. Comparison of the benign prediction results for all three ML algorithms.

Figure 8. Flow diagram showing the feature redaction process.

Figure 9. Zeus prediction results when tested using the minimum number of features.

Figure 10. Benign communication prediction results when tested using the minimum number of features.

Figure 11. DT Zeus prediction results compared between using the minimum number of features and all the features in Feature set1.

Figure 12. DT benign communication prediction results compared between using the minimum number of features and all the features in Feature set1.

Figure 13. RF Zeus prediction results compared between using the minimum number of features and all the features in Feature set1.

Figure 14. True positive rates compared for all three algorithms when using the minimum number of features.

Figure 15. True negative rates compared for all three algorithms when using the minimum number of features.

J. Cybersecur. Priv. 2023, 3

Table 1. The features that were not used during this research.

| Feature | Description |
|---|---|
| mean_fiat | The mean amount of time between two packets sent in the forward direction (in microseconds), which was removed to negate any correlation with a network characteristic |
| max_fiat | The maximum time between two packets sent in the forward direction (in microseconds), which was removed to negate any correlation with a network characteristic |
| std_fiat | The standard deviation from the mean time between two packets sent in the forward direction (in microseconds) |

Table 2. Datasets used in this research.

Table 4. Test results when using the decision tree algorithm.

Table 5. Test results when using the random forest algorithm.

Table 6. Confusion matrix for testing with the decision tree algorithm.

Table 7. Confusion matrix for testing with the random forest algorithm.

Table 8. Test results when using the deep learning algorithm.

Table 9. Confusion matrix for using the deep learning algorithm.

Table 10. Prediction results when using the DL algorithm with minimum features.

Table 11. Confusion matrix for testing the DL algorithm with minimum features.

Table 12. Prediction results when using the DT algorithm with minimum features.

Table 13. Confusion matrix for testing the DT algorithm with minimum features.

Table 14. Prediction results when using the RF algorithm with minimum features.

Table 15. Confusion matrix for testing the RF algorithm with minimum features.