DBoTPM: A Deep Neural Network-Based Botnet Prediction Model

Abstract: The evolution and growth of Internet of Things (IoT) devices have boosted system efficiency, reduced human labour, and improved operational efficiency; however, IoT devices pose substantial security and privacy risks, making them highly vulnerable to botnet attacks. Botnet attacks are capable of degrading the performance of an IoT system in a way that makes it difficult for IoT network users to identify them. Earlier studies mainly focused on the detection of IoT botnets, and there was a gap in predicting botnet attacks due to their complex behaviour, repetitive nature, uncertainty, and almost invisible presence in the compromised system. Given these gaps, there is a strong need for efficient and stable AI models that can reliably predict botnet attacks. The current study developed and implemented DBoTPM, a novel deep-neural-network-based model for botnet prediction. The DBoTPM was optimized for performance and low computational overhead through rigorous hyperparameter tuning. The consequences of overfitting and underfitting were mitigated through dropouts. The evaluation of the DBoTPM demonstrated that it is one of the most accurate and efficient models for botnet prediction. This investigation is unique in that it makes use of two real datasets to detect and predict botnet attacks with efficient performance and faster response. The results achieved through the DBoTPM model were assessed against prior research and found to be highly effective at predicting botnet attacks with a real dataset.


Introduction
IoT infrastructure and its implementations have grown exponentially in recent years as communication efficiency has improved, costs have decreased, and demand has increased. IoT devices have been developed and deployed in a variety of industries, including smart living, intelligent security, real-time data acquisition, logistics, and smart agriculture [1,2]. The number of smart sensory gadgets capable of automated communication on the internet has increased significantly over recent years. By 2050, it is expected to reach approximately 30 billion [2]. However, the tremendous growth in the number of IoT devices and their global presence has created an opportunity for hackers to compromise the security and privacy of the IoT infrastructure by utilizing irregular entities, such as botnets. The biggest threat to infrastructure security is botnet-based attacks, in which unauthorized users insert harmful code into the IoT infrastructure.
Botnets can be categorized into two categories: firstly, traditional botnets, and secondly, IoT botnets. A traditional botnet consists of infected servers, computers, or personal computers, whereas an IoT botnet consists of malware-infected IoT devices, such as smart TVs, smartwatches, wearables, and cameras. This malware enables an attacker to take control of the devices and perform tasks similar to those performed by a traditional botnet. Botnet operators or herders can control infected machines through a covert channel or peer-to-peer communication [2]. There are various types of botnet attacks, including scan, DDoS, flooding (TCP, UDP and acknowledgement), and hybrid attacks [2].

Datasets
The present investigation used two datasets to model and predict botnet attacks. The first was the Context Computer Network Traffic (CCNT) dataset, which contains network traffic data with 20,803 records for ten local workstation IP addresses. The dataset spans three months, from 2006-07-01 to 2006-09-30. During this time, half of these local IPs were hacked and became members of various botnets. The first dataset has four features: date, I_ipn, r_asn, and f. The date feature covers records from 2006-07-01 to 2006-09-30. I_ipn stands for the local IP address, coded as an integer from 0 to 9. The feature r_asn represents the autonomous system number (ASN) of the remote endpoint; it identifies the remote internet service provider (ISP) by an integer value, e.g., 701 or 714. The feature f (flows) is the count of connections for the day. This dataset was acquired from Stanford data and can be accessed at https://www.kaggle.com/datasets/crawford/computer-network-traffic (accessed on 1 August 2022).
The dataset was grouped by date to observe the net flow of the day based on all logs for the same day (Table 1). In Table 1, yday and wday represent the day of the year and the day of the week, respectively.
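As a small illustration of how the yday and wday columns and the daily grouping could be derived, the following sketch uses only the Python standard library. The helper names (`calendar_features`, `daily_flows`) and the sample records are hypothetical; the column names follow the dataset description, and wday is assumed to follow Python's convention (Monday = 0), which is consistent with the values listed in Table 1.

```python
from collections import defaultdict
from datetime import date

def calendar_features(day_string):
    """Return (yday, wday) for an ISO date string; wday uses Monday = 0."""
    day = date.fromisoformat(day_string)
    return day.timetuple().tm_yday, day.weekday()

def daily_flows(records):
    """Sum the flow count f per (date, local IP) pair."""
    totals = defaultdict(int)
    for day_string, local_ip, _remote_asn, flows in records:
        totals[(day_string, local_ip)] += flows
    return dict(totals)

# Hypothetical log entries: (date, I_ipn, r_asn, f)
sample = [
    ("2006-07-01", 0, 701, 1),
    ("2006-07-01", 0, 714, 1),
    ("2006-07-02", 0, 701, 3),
]

print(calendar_features("2006-07-01"))  # (182, 5)
print(daily_flows(sample))
```

In practice the same grouping would more likely be done with a pandas group-by; the plain-Python version is used here only to keep the sketch dependency-free.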

Features
      Date        I_ipn  r_asn    f    yday  wday
0     2006-07-01  0      436,704  106  182   5
10    2006-07-02  0      460,025  920  183   6

The CCNT dataset dates from 2006; it was chosen for the present investigation because it contains a significant number of records that have been little explored and could be analysed with respect to botnet attacks. In addition, the more recent N-BaIoT dataset was also used in this investigation. N-BaIoT contains traffic data from nine industrial IoT devices [2]. The data comprise benign traffic and a variety of malicious attacks such as scan, TCP, UDP, and SYN [2]. There were 1,486,418 instances in the second dataset, with 115 features for each file, comprising a total of 7.58 GB of data for benign and attack occurrences. Raw network traffic data were recorded using Wireshark before and after the infection on the switch [2]. The components of the two botnet attacks, MIRAI and BASHLITE, were implemented in a private lab setting [2]. The MIRAI and BASHLITE botnet attacks were classified into ten attack categories and one non-attack category (see Figure 1). The behaviours of the considered attacks fall into three categories: (1) scan instructions used to identify weak IoT devices; (2) ACK, SYN, UDP, and TCP flooding; and (3) combined attacks used to initiate a connection and send spam to it [2].

Data Pre-Processing
In this section, the datasets are explored before the DBoTPM model is deployed. A class imbalance was found in the CCNT dataset: IP 0 had the most records (3980) over the 3 months, while IP 3 had only 1186 records for the same period (Figure 2). Previous research has concentrated on obtaining better accuracy values, with less emphasis on the illusory accuracy caused by class imbalance in the dataset [23]. Classification on an imbalanced dataset tends to yield lower accuracy for minority classes and higher accuracy for majority classes. In the current investigation, the synthetic minority oversampling technique (SMOTE) was applied before the data were fed into the DBoTPM model to address the class imbalance. SMOTE generates synthetic data using the k-nearest neighbour method: it selects a random sample from the minority class and identifies that sample's k nearest neighbours. The selected sample is then combined with a randomly chosen one of its k nearest neighbours to create a synthetic sample (see Figure 2) [23]. Using SMOTE, the number of records for every minority IP was increased with synthetic values to match that of the majority class.
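The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the study (which is not specified) and not the imbalanced-learn API; the function name `smote_like_oversample`, the neighbour count `k=3`, and the sample points are all assumptions.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority-class samples
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.4)]
new_points = smote_like_oversample(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new records stay inside the region already occupied by the minority class.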

There are various RNN algorithms, and the rationale of choosing LSTM is that it offers more memory capability, which remembers the output results of each node for a longer period in order to efficiently produce the output for the next node. LSTM networks also address the issue of vanishing gradient in RNNs. LSTMs work better with normalized data, so the transformation was applied [24]. Min-max scaling was performed, which transforms the time series data between -1 and 1. After the prediction stage, this transformation was inverted to return the values to their original scale, allowing the uncertainty calculation to be properly assessed.
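The min-max transformation to the range -1 to 1 and its inversion after prediction can be illustrated as follows. This is a dependency-free sketch; the study used sklearn for feature scaling, and the helper name `fit_minmax` and the sample flow values are hypothetical.

```python
def fit_minmax(values, low=-1.0, high=1.0):
    """Return (transform, inverse) functions mapping values onto [low, high]."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("cannot scale a constant series")

    def transform(x):
        return low + (x - v_min) * (high - low) / span

    def inverse(z):
        return v_min + (z - low) * span / (high - low)

    return transform, inverse

flows = [106, 920, 430, 15, 610]  # hypothetical daily flow counts
to_scaled, to_original = fit_minmax(flows)
scaled = [to_scaled(x) for x in flows]          # values now lie in [-1, 1]
restored = [to_original(z) for z in scaled]     # back on the original scale
print(scaled)
print(restored)
```

The scaler is fitted once and its inverse is reused on the predictions, so the error metrics are computed on the original flow scale.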
To understand and visualise the day-wise traffic flow for all 10 IPs, Figure 3 was generated, with the x-axis showing the day of the year from 181 to 271. The red line shows the mean combined with the standard deviation over the length of days. Figure 3 revealed that IP 4 showed much less traffic from day 180 to day 259; however, there was a sudden spike on days 259 and 260, indicating abnormal traffic behaviour. All four features from the first dataset (date, I_ipn, r_asn, and f) were used to feed the ML models. For the N-BaIoT dataset, columns containing NaN were dropped, and only columns containing more than one unique value were kept. The N-BaIoT dataset has 115 features in total. Earlier work [2] applied principal component analysis (PCA) to reduce the dimension of the data, with the number of principal components (PCs) selected based on the cumulative explained variance ratio (CEVR). In the present investigation, the first PC was selected for the N-BaIoT dataset and fed into the ML models as a trade-off between dimensionality and efficiency. Another study, which introduced the N-BaIoT dataset [2], extracted 23 features [25].
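The column filtering and the CEVR criterion described above can be sketched as follows. The helper names, the toy table, the eigenvalues, and the 0.9 threshold are all hypothetical; in a real pipeline the eigenvalues would come from a PCA of the scaled feature matrix.

```python
import math

def usable_columns(table):
    """Keep columns with no NaN and more than one unique value."""
    kept = []
    for name, column in table.items():
        has_nan = any(isinstance(v, float) and math.isnan(v) for v in column)
        if not has_nan and len(set(column)) > 1:
            kept.append(name)
    return kept

def cumulative_explained_variance(eigenvalues):
    """CEVR: running share of total variance, largest component first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, ratios = 0.0, []
    for value in ordered:
        running += value
        ratios.append(running / total)
    return ratios

def components_needed(eigenvalues, threshold):
    """Smallest number of PCs whose CEVR reaches the threshold."""
    ratios = cumulative_explained_variance(eigenvalues)
    for count, ratio in enumerate(ratios, start=1):
        if ratio >= threshold:
            return count
    return len(eigenvalues)

table = {"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0], "c": [1.0, float("nan"), 2.0]}
eigenvalues = [8.0, 1.5, 0.5]  # hypothetical covariance eigenvalues

print(usable_columns(table))
print(cumulative_explained_variance(eigenvalues))
print(components_needed(eigenvalues, threshold=0.9))
```

Keeping only the first PC, as in the present investigation, corresponds to accepting whatever share of variance the first entry of the CEVR list reports in exchange for a single input feature.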
Electronics 2023, 12, 1159

Botnet Detection using Approximate Entropy (AE)
AE was developed to analyse the randomness in time series data [26]. Network anomaly detection and monitoring can be effectively understood using flow entropy. The present investigation used AE to evaluate the time series. Consider a time series of K equally spaced measurements a(1), a(2), . . . , a(K), and two parameters b and r, where b is a positive integer giving the length of the runs of consecutive measurements that must be compared and r is a positive real number giving the tolerance level. Another study [27] suggested that the value of b should be 2 and that r should be 20% of the standard deviation (σ) of the time series. AE allocates a positive number to the time series; smaller values indicate regularity in the data, while higher values correspond to irregularity. AE is calculated using a four-step procedure:
Step 1: Calculate a sequence of real b-dimensional vectors s(1), s(2), . . .
Step 3: The values of parameters b and r were selected as 2 and 3, respectively. AE measures the logarithmic empirical likelihood that runs of observations that are close, within tolerance r, for b consecutive observations remain close at the following incremental comparison [26]. AE has been in use for a long time and, owing to its utility, has been employed more recently in applications including the volatility of bitcoin [28], electrochemical noise measurement [29], and neuronal spike sorting [30].
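For illustration, the widely used textbook formulation of approximate entropy can be sketched in plain Python. It is assumed, not confirmed, that the study's procedure follows this standard definition; the function name, the default tolerance of 20% of the standard deviation (the suggestion attributed to [27]), and the three test series are choices made for this sketch.

```python
import math
import random
from statistics import pstdev

def approximate_entropy(series, b=2, r=None):
    """Approximate entropy of a sequence with window length b and tolerance r."""
    K = len(series)
    if K <= b + 1:
        raise ValueError("series too short for the chosen b")
    if r is None:
        r = 0.2 * pstdev(series)  # 20% of the standard deviation, as suggested in [27]

    def phi(m):
        # All m-dimensional vectors of consecutive measurements
        vectors = [series[i:i + m] for i in range(K - m + 1)]
        total = 0.0
        for v in vectors:
            # Fraction of vectors within tolerance r (maximum component distance)
            close = sum(
                1 for w in vectors
                if max(abs(x - y) for x, y in zip(v, w)) <= r
            )
            total += math.log(close / len(vectors))
        # Average log-frequency of similar vectors
        return total / len(vectors)

    # Loss of regularity when the window grows from b to b + 1
    return phi(b) - phi(b + 1)

rng = random.Random(0)
flat = [5.0] * 100                           # constant traffic
regular = [1.0, 2.0] * 50                    # perfectly periodic traffic
noisy = [rng.random() for _ in range(100)]   # irregular traffic

print(approximate_entropy(flat))
print(approximate_entropy(regular))
print(approximate_entropy(noisy))
```

A constant series yields exactly zero, a periodic series yields a value close to zero, and an irregular series yields a clearly larger value, which mirrors the interpretation of low AE as regular traffic and high AE as irregular traffic.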

Development and Implementation of the DBoTPM Model
The DBoTPM model was developed using the TensorFlow 2.0 backend, the Keras 2.3.0 API, and Python 3.8 with the pandas, NumPy, matplotlib, sklearn, and seaborn (SNS) libraries. Sklearn was used for feature scaling and estimation of errors, while Keras was used to import models, including LSTM.
Three auxiliary methods were coded in the current study to develop and implement the DBoTPM model. The first method (create_fdataset) was used to create the list of features and targets for our model. The method create_fdataset created the feature dataset, with each row containing the sequence of the previous observations (lookback) represented as a day of the week. The target value (y_train) was the traffic flow that needed to be predicted for the current day. The second method (tModel) was used to train, compile, and fit the developed LSTM model on the feature dataset. The model architecture is given in Figure 4.
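The lookback construction performed by create_fdataset can be illustrated with a sliding window. The sketch below is not the study's code: the function names `make_lookback_dataset` and `to_3d`, the lookback of 3, and the sample flows are hypothetical.

```python
def make_lookback_dataset(flows, lookback):
    """Split a flow series into (features, targets) using a sliding window."""
    features, targets = [], []
    for start in range(len(flows) - lookback):
        features.append(flows[start:start + lookback])  # previous observations
        targets.append(flows[start + lookback])          # value to predict
    return features, targets

def to_3d(features):
    """Reshape to (samples, timesteps, 1 feature), the layout LSTM layers expect."""
    return [[[value] for value in window] for window in features]

daily_flow = [10, 12, 9, 14, 20, 18]  # hypothetical daily flows for one IP
X, y = make_lookback_dataset(daily_flow, lookback=3)
X3d = to_3d(X)

print(X)  # [[10, 12, 9], [12, 9, 14], [9, 14, 20]]
print(y)  # [14, 20, 18]
```

Each row holds the previous observations and the matching target is the flow on the following day, so a series of length n yields n minus lookback training samples.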
The hyperparameters, including the batch size, number of units, and activation functions, must be chosen first. Activation functions map the input value to an output signal, which is then used as an input to the following layer. A dropout layer was added between the two LSTM layers to avoid overfitting; dropout randomly drops inputs to a layer, which may be the outputs of a preceding layer. For the two dropout layers, a rate of 0.5 was chosen. After the architecture was defined, the next step was to compile the network. Compilation requires parameters such as an optimization algorithm for network training and a loss function for network evaluation. After compilation, the LSTM model was fitted. Fitting adjusts the weights on the basis of the training set and requires both inputs and outputs to be specified in the training data. The number of epochs was initially set to 50, with a validation split of 0.2, which means that 20% of the values were used for internal model validation.
The third method (Flow_pred) was used to perform predictions with the trained model. The output of tModel for each IP's traffic by wday was copied to separate variables (t0, t1, . . . , t9). The predicted flow for each IP from the developed model and the original dataset is shown in Figure 5. The SGD optimizer was used with mean squared error (MSE) as the loss measure. The hyperparameters of any ML model must be tuned before the model can be evaluated and appraised. The primary LSTM hyperparameters optimized in this study were (1) the batch size, (2) the number of nodes and epochs, and (3) the choice of optimizer. A walk-forward approach was used to evaluate the hyperparameters for various setup values [21]. The optimization algorithms assessed in the present investigation were Adam, SGD, AdaDelta, and RMSProp. The optimized DBoTPM model, containing seven layers and trained for 100 epochs with a batch size of 16 and the SGD optimizer, exhibited encouraging results in terms of RMSE and computational efficiency. The external evaluation of the model predictions was conducted using a separate dataset of 30 time steps that was kept apart from the training and testing traffic data.
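A walk-forward evaluation of the kind mentioned above can be sketched with an expanding training window. The sketch assumes a simple fold layout, which the study does not specify; the function names, the fold sizes, the sample series, and the last-value forecaster (a stand-in for a trained LSTM configuration) are hypothetical.

```python
def walk_forward_splits(n_samples, initial_train, test_size):
    """Yield (train_indices, test_indices) with an expanding training window."""
    train_end = initial_train
    while train_end + test_size <= n_samples:
        yield list(range(train_end)), list(range(train_end, train_end + test_size))
        train_end += test_size

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def evaluate(series, forecaster, initial_train, test_size):
    """Average RMSE of a forecaster over all walk-forward folds."""
    scores = []
    for train_idx, test_idx in walk_forward_splits(len(series), initial_train, test_size):
        history = [series[i] for i in train_idx]
        actual = [series[i] for i in test_idx]
        predicted = forecaster(history, len(actual))
        scores.append(rmse(actual, predicted))
    return sum(scores) / len(scores)

def last_value_forecaster(history, horizon):
    # Stand-in for a trained model: repeat the last observed value.
    return [history[-1]] * horizon

series = [3, 4, 6, 5, 7, 8, 9, 11, 10, 12]
splits = list(walk_forward_splits(len(series), initial_train=6, test_size=2))
score = evaluate(series, last_value_forecaster, initial_train=6, test_size=2)
print(splits)
print(score)
```

Each candidate setting of batch size, epochs, and optimizer would be scored in the same way, and the setting with the lowest average RMSE retained; the training data always precede the test data, so no future values leak into training.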
A seven-layer deep neural network operating on the traffic of all ten IPs was employed in the present investigation: two LSTM layers, two dropout layers, and three dense layers. Defining the LSTM network architecture layer by layer is important. LSTM requires input data in a 3D form comprising the batch size, the timesteps, and the number of features. The batch size specifies the number of samples in each batch during training and testing. The term "timesteps" refers to the number of values in a sequence, and the features are the number of dimensions used to represent the data at a single time step. The reshape function converts the data into a three-dimensional shape composed of training samples, time steps, and features. In the input layer, PC1 was fed as a feature with 33 timesteps. The first LSTM layer had 1 input and 128 outputs with 33 timesteps. The dropout layer was used to prevent overfitting; it works by dropping connections between neurons. The second LSTM layer had 128 inputs and 64 outputs with 33 timesteps. A rectifier activation function was used. The architecture of the developed DBoTPM is shown in Figure 4.
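A layer stack consistent with this description could be written in Keras roughly as follows. This is a sketch under stated assumptions, not the published architecture: the widths of the three dense layers (32, 16, 1), the position of the second dropout layer, and the use of ReLU on the hidden dense layers are assumptions, and the code presumes a recent TensorFlow 2.x release. Only the layer counts, the LSTM sizes (128 and 64), the dropout rate of 0.5, the 33 timesteps, the SGD optimizer, and the MSE loss come from the text.

```python
TIMESTEPS, FEATURES = 33, 1

# (layer type, units or rate). Dense widths and the placement of the
# second dropout are assumptions; the text only gives the layer counts.
LAYER_PLAN = [
    ("lstm", 128),
    ("dropout", 0.5),
    ("lstm", 64),
    ("dropout", 0.5),
    ("dense", 32),
    ("dense", 16),
    ("dense", 1),
]

def build_model(plan=LAYER_PLAN):
    """Assemble and compile the network; requires TensorFlow to be installed."""
    from tensorflow import keras  # lazy import keeps the plan inspectable without TF

    model = keras.Sequential([keras.Input(shape=(TIMESTEPS, FEATURES))])
    last_lstm = max(i for i, (kind, _) in enumerate(plan) if kind == "lstm")
    last_dense = max(i for i, (kind, _) in enumerate(plan) if kind == "dense")
    for i, (kind, value) in enumerate(plan):
        if kind == "lstm":
            # Only the final LSTM collapses the sequence to a single vector.
            model.add(keras.layers.LSTM(value, return_sequences=i < last_lstm))
        elif kind == "dropout":
            model.add(keras.layers.Dropout(value))
        else:
            # ReLU on hidden dense layers, linear output for the flow value.
            activation = None if i == last_dense else "relu"
            model.add(keras.layers.Dense(value, activation=activation))
    model.compile(optimizer="sgd", loss="mse")
    return model
```

With training arrays of shape (samples, 33, 1), a call such as `build_model().fit(X, y, epochs=100, batch_size=16, validation_split=0.2)` would correspond to the epoch count, batch size, and validation split reported in the text.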

Uncertainty Assessment
The moment correlation coefficient (MCC), root mean square error (RMSE), mean absolute percentage error (MAPE), and Nash-Sutcliffe coefficient (NSE) were utilized to evaluate the uncertainty of the model's output [1]; see Table 2.

Table 2. Uncertainty assessment metrics.

Metrics   Formula   Description
X, y, and n are the original values, the predicted values, and the number of samples, respectively.
(A_i − P_i)/P_i: P_i, A_i, and n represent the predicted value, the actual value, and the number of samples, respectively.
a, â, and y represent the actual time series, the mean of the actual time series, and the predicted series, respectively.
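The four metrics can be computed as in the following sketch, which assumes the standard textbook definitions (Pearson's correlation for MCC, and MAPE expressed relative to the actual value); these may differ in detail from the forms listed in Table 2. The function names and the sample values are hypothetical.

```python
import math

def pearson_r(actual, predicted):
    """Product-moment correlation between actual and predicted values."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error relative to the actual value (no zero actuals)."""
    return 100 / len(actual) * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

def nse(actual, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean predictor."""
    mean_a = sum(actual) / len(actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    spread = sum((a - mean_a) ** 2 for a in actual)
    return 1 - residual / spread

actual = [100.0, 200.0, 300.0, 400.0]      # hypothetical observed flows
predicted = [110.0, 190.0, 310.0, 390.0]   # hypothetical model output

print(pearson_r(actual, predicted))
print(rmse(actual, predicted))
print(mape(actual, predicted))
print(nse(actual, predicted))
```

RMSE and MAPE measure the size of the error in absolute and relative terms, whereas the correlation coefficient and NSE measure how well the predicted series tracks the shape and variance of the actual series.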

Results and Discussions
The DBoTPM model was developed to predict the expected flow of each host given the flows of the preceding days. To assess the botnet prediction performance, the predicted and actual data were compared to understand whether the host's traffic belonged to a botnet or was normal. In this section, the results are discussed in terms of performance and computational complexity. For proper verification and validation of the DBoTPM model, data kept separate from the training phase were utilized.
The major aim of the present investigation was botnet prediction using the DBoTPM model based on the real CCNT dataset, which comprised botnet attack traffic data for 10 IPs (Figure 5). Figure 5 shows the actual flow and predicted flow for all 10 IPs. The performance of the developed DBoTPM model is given in Table 3 in terms of R2, MCC, RMSE, MAPE, and NSE for all 10 IPs. The comparison of the DBoTPM model with other well-known studies is presented in Table 4. Owing to its significant statistical utility, AE was implemented to detect botnet attacks (Table 5). AE calculates the logarithmic empirical likelihood that observations that are close, within the range r, for b consecutive observations remain close at the following incremental comparison (Figure 5). The comparison of botnet detection using AE in the present investigation with other studies is shown in Table 6. Table 6 shows that AE performed better than the earlier investigations in terms of accuracy and computational cost. The time span of the datasets used in the present investigation was long compared to other datasets, such as SysNet and ISCX, which have durations between 8 h and 3 days, as used by [28,29]. Table 4 shows that the developed DBoTPM achieves better accuracy than existing studies for botnet prediction. The AE values for all 10 IPs are shown in Table 5 and Figure 6. AE was utilized to investigate the randomness of the network traffic dataset. It assigns lower values to indicate regularity in the traffic and higher values to indicate anomaly or irregularity in the traffic. Higher AE values of 0.36, 0.17, 0.29, and 0.26 were observed for IP3, IP5, IP6, and IP9, respectively. Hence, these IPs joined botnets during the observational period. Figure 6 shows that the AE values for IPs 0, 1, 2, 4, and 8 appear as straight lines, which indicates normal activity.
However, IPs 3, 5, 6, and 9 showed changes in values with time and were probably associated with botnets. IP 7 also looks abnormal; however, based on its AE value of 0.0, it was predicted as normal. The AE method can only detect whether a specific node has been affected by a botnet attack, whereas the DBoTPM model can classify the attacks into three types: scan instructions used to identify weak IoT devices; ACK, SYN, UDP, and TCP flooding; and combined attacks used to initiate a connection and send spam to it.
Figure 6. AE values for all IP addresses during the observation period.
The computation speed and accuracy of the model were compared to those of previous studies. The current study made use of a Tensor Processing Unit (TPU) version 2-8, which had eight cores and 64 GB of memory. The optimal DBoTPM model was typically observed to take around 1 s per epoch in the current investigation (see Figure 7), whereas the LSTM model developed by [38] took 85 s per epoch and the LSTM model developed by [24] took 3 s per epoch. The major reason for the higher computational speed was the reduced number of features obtained using PCA.

Conclusions
A botnet is a severe cybersecurity issue, since its operator controls many infected hosts and can use them to carry out numerous attacks. Despite numerous earlier attempts at botnet detection, botnet attack prediction has received little attention. The major contribution of the present investigation was botnet attack prediction using the developed DBoTPM model based on a real botnet attack dataset covering 3 months of traffic for 10 hosts. The R2 value for botnet prediction across the 10 hosts was between 0.1 and 0.72, with an average R2 value of 70% and an efficient computation time of 119 s for 100 epochs. Botnet detection was also performed using the AE model in the present investigation with 100% accuracy. The unbiased assessment of the DBoTPM model's performance was based on unseen data that were kept separate from the training procedure. Both models demonstrated superior performance and efficiency in terms of time. The future scope of the current investigation is to implement existing models using publicly available datasets.

Conflicts of Interest:
The author declares that they have no conflict of interest to report regarding the present study.