1. Introduction
The ubiquity of the internet and computer networks has revolutionized the way we interact with each other, enabling information sharing and collaboration at an unprecedented scale. However, this pervasive connectivity has also created new opportunities for malicious actors to exploit vulnerabilities and gain unauthorized access to sensitive information [1]. As a result, the importance of effective intrusion detection systems (IDSs) cannot be overstated, and the need for proactive notification is emerging [2]. An IDS is a hardware device or software application that monitors network traffic and flags suspicious activity and potential security breaches. This monitoring takes place at the packet level and, thus, such a system can distinguish malicious from benign packets. Traditionally, this component was implemented as a firewall and later as a rule-based expert system. Due to the rise of machine learning (ML) in recent years [3,4,5,6], state-of-the-art approaches apply ML technologies to data logs from IDSs to classify packets as suspicious or not [7,8,9,10,11].
Moreover, multivariate time series prediction [12,13,14] is a sophisticated analytical approach that involves forecasting future values of multiple interrelated variables over time. Unlike univariate time series analysis, which focuses on a single variable, multivariate time series prediction considers the dynamic interactions and dependencies among several variables simultaneously. This method is particularly relevant in fields such as finance [15,16,17], weather forecasting [18], and industrial processes [19], where various factors influence the outcome of interest. The complexity lies in capturing the intricate relationships among different variables and understanding how changes in one can impact the others. Advanced machine learning techniques, including recurrent neural networks (RNNs) [20], long short-term memory networks (LSTMs), and autoregressive integrated moving average (ARIMA) models [21], are commonly employed to handle the complexity of multivariate time series data. Accurate predictions in this context can provide valuable insights for informed decision-making, risk management, and process optimization in diverse domains.
Machine learning IDSs can be broadly categorized into two types: (i) classification-based [22,23,24] and (ii) anomaly-based [25]. Classification-based IDSs use machine learning algorithms to classify incoming data into different categories based on a set of features. Even though classification-based IDSs are effective in detecting known attacks, they can be less effective in identifying new and unknown attacks that correlate weakly with the training dataset. On the other hand, anomaly-based approaches use statistical models and machine learning algorithms to establish a baseline of normal behavior and identify deviations from that baseline. Unlike classification-based IDSs, anomaly-based IDSs can detect unknown or novel attacks that have not been previously seen. Despite this advantage, however, these models cannot easily specify the type of attack and perform worse than classification approaches on known data types [26,27,28].
The IDS methodologies discussed above suffer from a significant drawback, namely that the process of identifying and categorizing anomalies takes place in real time (i.e., after the attack has taken place). To address this issue, the present study suggests a proactive intrusion detection system that can detect and isolate malevolent packets prior to their entry into the system. This model is a novel combination of a convolutional neural network (CNN), LSTM, and attention (ATT) that can handle complex data and identify strong patterns. To achieve this, the proposed model utilizes a window of W preceding packets to predict the existence of an attack in the subsequent T packets (prediction window), allowing for the classification of the upcoming behavior in advance and enabling the application of security measures. To evaluate the effectiveness of this approach, we conducted experiments on the UNSW-NB15 dataset [29], a widely used benchmark for evaluating intrusion detection systems. Our model achieved an F1 score of 83% for the next-packet case, which is comparable to real-time IDS classification. Moreover, it attained an F1 score of 91% when predicting the existence of an attack over a longer prediction horizon.
Our main contributions can be summarized as follows:
We extend and improve existing intrusion detection systems by implementing proactive prediction.
We compare our approach with current state-of-the-art methodologies.
We experiment with the configuration parameters (e.g., size of the input window) of our method thoroughly to provide insights.
We propose a novel methodology combining deep learning components.
The current manuscript is an extension of the previous work presented at the 2023 International Symposium on Distributed Computing and Artificial Intelligence (DCAI) [30], providing a more complex and robust architecture with more experiments, as well as an ablation study to validate the proposed architecture.
The structure of the remaining sections is as follows: Section 2 provides an overview of previous approaches to ML-based IDSs as well as state-of-the-art multivariate time series prediction methodologies. Section 3 provides an overview of the IDS dataset with the relevant prediction features. In Section 4, we delve into the methodology behind the proactive IDS, the pre-processing steps we conducted, and the architecture of the model. Section 5 presents the results of our experiments, which evaluate the performance of our approach in different scenarios with many alternative hyperparameter settings. In Section 6, we perform an ablation study to validate the architectural choices of the model. Lastly, in Section 7, conclusions are drawn, and potential future directions for further research are outlined.
3. Dataset
The dataset we used to validate our approach is UNSW-NB15, which was created by researchers at the University of New South Wales in Australia. This dataset was produced using the IXIA PerfectStorm tool, where raw packets were gathered for common attack types such as fuzzers, analysis, backdoors, and denial-of-service (DoS). We selected this dataset primarily due to its extensive use in research studies, serving as a standard benchmark for comparing various network intrusion detection systems. Additionally, it includes a time feature, enabling the conversion of the problem into a time series context.
The UNSW-NB15 dataset contains approximately 2.5 million samples of network packets, each with 49 attributes (48 features and 1 label). These features can be viewed in Table 1. Of these features, we employ 44 after the pre-processing steps. The features include basic information such as source and destination IP addresses, as well as more advanced attributes such as the packet size, time to live, and protocol type. The dataset also contains time-based features, namely the starting and ending times of a packet, which help us sort and convert the samples to a time series format. Concerning the (binary) labels, the dataset consists of 2,218,760 benign packets and 321,283 malicious ones. The mean number of packets involved in the interactions between specific entities (IPs) in this dataset is 8167, and the mean time between two consecutive packets is 1 second.
4. Methodology
In this section, we describe the proposed methodology for proactive notification based on IDS logs. We describe the pre-processing steps as well as the model we used for the prediction. An overview of the whole system can be viewed in Figure 1.
The architecture consists of three main components: packet input, pre-processing, and ML model.
4.1. Preprocessing
Before feeding the dataset to the model, we first convert it into the right format. We initially clean the data, discarding any corrupted values. These could be unrealistic attribute values, such as a negative packet length, or outliers lying too far from the feature mean. Such errors could be due to feature extraction tool errors or transmission losses.
The second pre-processing step is data scaling, using the min-max scaler. Scaling is the process of transforming the values of numeric variables in a dataset to a common scale. Since neural networks are sensitive to the relative magnitudes of their inputs, we do not want features with larger scales to dominate those with smaller scales.
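As an illustration, a minimal sketch of this step with scikit-learn follows; the toy column names are placeholders rather than the exact UNSW-NB15 schema.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the cleaned packet table; real feature names differ.
df = pd.DataFrame({"pkt_len": [60, 1500, 576], "ttl": [64, 128, 255]})

# Min-max scaling maps every numeric column to the [0, 1] range so that
# large-magnitude features do not dominate smaller ones.
scaler = MinMaxScaler()
df[["pkt_len", "ttl"]] = scaler.fit_transform(df[["pkt_len", "ttl"]])
print(df)
```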
The third pre-processing step is feature selection, which is performed using the ANOVA (analysis of variance) F-value. This is a statistical measure that assesses whether the means of two or more groups are significantly different from each other. The mathematical formula is as follows:

$$F = \frac{\mathrm{MSB}}{\mathrm{MSW}},$$

where for MSB (mean squares between):

$$\mathrm{MSB} = \frac{\sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2}{k - 1},$$

and for MSW (mean squares within):

$$\mathrm{MSW} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}{n - k}.$$

In these equations, $n_i$ is the number of observations in the $i$-th group, $\bar{x}_i$ is the mean of the $i$-th group, $\bar{x}$ is the overall mean, $k$ is the number of groups, and $n$ is the total number of observations. This statistic is computed between each feature and the label.
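A minimal sketch of this ranking using scikit-learn's `f_classif`, which implements the ANOVA F-value above; the data here are random stand-ins for the real feature matrix and labels.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 44))    # stand-in for the 44 pre-processed features
y = rng.integers(0, 2, size=1000)  # stand-in binary attack labels

# F-value of each feature against the label; higher means more relevant.
f_values, _ = f_classif(X, y)

# Indices of the top 15 features by F-value, mirroring the subsets in Section 5.2.
top15 = np.argsort(f_values)[::-1][:15]
print(top15)
```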
The fourth pre-processing step is one-hot encoding, since we have many categorical features that can take multiple values. This technique takes a column with N different values and creates N binary columns, where 1 indicates the presence of the corresponding value in the current instance and 0 indicates its absence. An example is the categorical variable 'service', which indicates the type of packet and can be 'dns', 'http', 'smtp', etc.
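For instance, with pandas the 'service' column can be expanded as follows (a sketch; the real column carries more values):

```python
import pandas as pd

df = pd.DataFrame({"service": ["dns", "http", "smtp", "dns"]})

# One binary indicator column per distinct value of 'service':
# service_dns, service_http, service_smtp.
encoded = pd.get_dummies(df, columns=["service"])
print(encoded)
```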
The final stage of pre-processing involves formatting the data into a time series format. To accomplish this, we begin by sorting the dataset based on the time (starting time) feature. During this phase, there is an option to either generate time series objects for the entire sorted dataset or initially group the dataset by (sender IP, receiver IP) pairs, subsequently creating time series objects within each specific pair of IPs. The former analyzes the overall network traffic, while the latter specifically monitors the communication flow between two designated entities. Next, using the sorted dataset, we generate windows comprising W time points and a label, where W is the size of the input window. Here, the initial W points serve as the input to the model, and the label is the target that must be predicted. For each of the W input points, we keep all the features (not only the labels) to formulate the problem as multivariate time series forecasting. The label is calculated based on a value T and indicates the existence of an attack in the forthcoming T packets. For example, if W = 10 and T = 5, and the labels of the T packets are, say, [0, 0, 1, 0, 0], the label of the time series instance would be 1 because there is a 1 in the T vector; otherwise, it would be 0. In such a case, we would use ten consecutive instances of the dataset as input for the model.
We also incorporate overlapping between the different windows to utilize the data to their fullest potential.
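The windowing just described can be sketched as follows; this is our own minimal NumPy implementation, with stride-1 overlap assumed.

```python
import numpy as np

def make_windows(features: np.ndarray, labels: np.ndarray, W: int, T: int):
    """features: (N, D) rows sorted by start time; labels: (N,) binary.

    Returns inputs of shape (num_windows, W, D) and one binary target per
    window: 1 if any of the T packets after the window is an attack.
    """
    X, y = [], []
    N = len(features)
    # Stride 1: consecutive windows overlap in W - 1 time steps.
    for i in range(N - W - T + 1):
        X.append(features[i : i + W])
        y.append(int(labels[i + W : i + W + T].any()))
    return np.stack(X), np.array(y)

# Example: W = 10 past packets predict attacks in the next T = 5 packets.
feats = np.random.rand(100, 44)
labs = np.random.randint(0, 2, size=100)
X, y = make_windows(feats, labs, W=10, T=5)
print(X.shape, y.shape)  # (86, 10, 44) (86,)
```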
4.2. ML Model
In this subsection, we describe the architectural choices for the machine learning intrusion detection module. This model consists of a convolutional part followed by the time series processing part. A complete overview of the architectural choices is displayed in Figure 2.
First, our model uses two convolutional layers that work on the feature axis. These layers have a dual purpose: they compress the feature space and merge information. This is crucial because the original dataset contains 44 features, some of which may be less useful or relevant. To address this, we employ a convolutional layer to identify and enhance important information during training. The output then passes through a max pooling layer to further emphasize significant feature aspects. It is worth noting that the time steps remain unchanged, and the convolution only affects the feature axis. As a result, we obtain a time series object with fewer features while preserving the same number of time steps.
Let $X \in \mathbb{R}^{N \times D}$ be the input time series with $N$ time steps and $D$ features. The output $X'$ after the convolutional layers and max pooling can be represented as follows:

$$X' = \mathrm{MaxPool}\big(\mathrm{Conv}_2(\mathrm{Conv}_1(X))\big) \in \mathbb{R}^{N \times D'}, \qquad D' < D,$$

where the convolutions operate along the feature axis only, so the number of time steps $N$ is preserved.
Then, the modified time series is passed into the time module, which comprises two LSTM layers with attention incorporated in between. The initial LSTM layer processes the time series object and accumulates information. However, because the input remains dense and information-rich, we opt not to use merely the final output of the LSTM; instead, we collect the outputs of each time step (LSTM cell). The output of the first LSTM is a time series with the same number of time steps as its input, where each time step has accumulated information from previous states. Consequently, we feed the time series produced by the first LSTM into an attention layer. This attention layer operates as a self-attention module, assessing the correlation between all LSTM cell outputs, in other words, between all the time steps within the time series. This approach aids the model in concentrating on important information from previous time steps, which proves valuable in predicting potential attacks. The resultant attention map, denoted as $A$, in conjunction with the output of the first LSTM layer (all cell outputs $H$), is fused and presented as input to the second LSTM layer, as follows:

$$H_2 = \mathrm{LSTM}_2(A \cdot H),$$

where the product $A \cdot H$ fuses the attention weights with the cell outputs.
With knowledge of the attention map at each time step, this second layer can effectively interconnect information, further enhancing its ability to spotlight significant past events within the time series object.
Lastly, the output of the second LSTM (only the last accumulated state of $H_2$, denoted $h_N$) is fed to a fully connected layer, which produces a binary label indicating the presence of an attack:

$$\hat{y} = \sigma(W_o\, h_N + b_o),$$

where $\sigma$ is the sigmoid function and $W_o$, $b_o$ are the weights and bias of the fully connected layer.
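A minimal PyTorch sketch of the full stack follows; the layer sizes, kernel sizes, and exact attention formulation are our assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class ProactiveIDS(nn.Module):
    """Sketch: CNN (feature axis) -> LSTM -> self-attention -> LSTM -> FC."""

    def __init__(self, n_features: int = 44, hidden: int = 64):
        super().__init__()
        # (1, k) kernels convolve along the feature axis only, so the number
        # of time steps is preserved, as described in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),    # halves the feature axis
        )
        d_conv = 16 * (n_features // 2)          # channels folded into features
        self.lstm1 = nn.LSTM(d_conv, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (B, W, D)
        B, W, _ = x.shape
        z = self.conv(x.unsqueeze(1))            # (B, 16, W, D // 2)
        z = z.permute(0, 2, 1, 3).reshape(B, W, -1)
        h, _ = self.lstm1(z)                     # all cell outputs H: (B, W, h)
        # Self-attention over time steps: A = softmax(H H^T / sqrt(h)).
        a = torch.softmax(h @ h.transpose(1, 2) / h.size(-1) ** 0.5, dim=-1)
        fused = a @ h                            # the fusion A . H
        h2, _ = self.lstm2(fused)                # (B, W, h)
        return torch.sigmoid(self.fc(h2[:, -1])).squeeze(-1)  # last state -> FC

model = ProactiveIDS()
x = torch.rand(4, 50, 44)   # a batch of 4 windows with W = 50 and 44 features
print(model(x).shape)       # torch.Size([4])
```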
5. Experiments
This section outlines the experiments conducted to validate the approach presented in Section 4. We describe the experimental setup, present the metrics used, and demonstrate the results. We then conduct individual experiments to determine the best time series grouping approach, the optimal window size W, and the prediction horizon T.
5.1. Experimental Set-Up
Prior to conducting experiments on our model, we engage in feature selection and present outcomes for various feature sets.
To utilize our dataset, we employ 5-fold cross-validation and average the results across all 5 folds. We note that after the time series formatting, we have windows of size W instead of individual samples; thus, the shuffling and splitting into folds are performed on such independent windows. To demonstrate the robustness of our proposed approach, we compare it with the standard binary IDS classification approach. We also show the effect of the input window size on the results by reporting how the metrics change as this value increases.
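A sketch of the fold construction over windows (random stand-in data; the model training call is elided):

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-ins for the windowed data: 1000 windows of W = 50 steps x 44 features.
X = np.random.rand(1000, 50, 44)
y = np.random.randint(0, 2, size=1000)

# Shuffling and splitting act on whole windows, never on individual packets.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... train on the fold's training windows, evaluate on its test windows ...
    print(fold, X_train.shape, X_test.shape)
```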
Apart from experiments concerning the input window size, we also experiment with the prediction horizon. Based on the aforementioned methodology, we still predict a single label, which is not always the next packet. For each window, we create a new label that indicates the existence of an attack in the next T time steps.
To evaluate our predictive models, we use the F1 score, precision, and recall. These metrics are necessary due to the imbalanced nature of our dataset, where simple accuracy could be misleading. The F1 score, precision, and recall provide a more faithful measure of the model's performance by accounting for the class imbalance and ensuring a fair evaluation.
We train our models for 40 rounds, applying early stopping. All models are trained using the Adam optimizer; the learning rate, weight decay, batch size, and early stopping patience are summarized in Table 2.
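A sketch of the training loop with Adam and early stopping; the numeric hyperparameters below are placeholders, since the actual values are those reported in Table 2.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 256 windows of W = 50 steps x 44 features.
X = torch.rand(256, 50, 44)
y = torch.randint(0, 2, (256,)).float()
train_loader = DataLoader(TensorDataset(X[:200], y[:200]), batch_size=32)
val_loader = DataLoader(TensorDataset(X[200:], y[200:]), batch_size=32)

model = ProactiveIDS()  # the Section 4.2 sketch above
# Placeholder hyperparameters; the paper's actual settings are in Table 2.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.BCELoss()

best_val, patience, bad = float("inf"), 5, 0
for epoch in range(40):                       # at most 40 training rounds
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
    if val < best_val:                        # early stopping bookkeeping
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:
            break
```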
5.2. Feature Selection
In this subsection, we present the statistical outcomes of the applied feature selection technique and perform tests using various feature sets, measuring both predictive performance and training time in seconds. The results of the ANOVA F-test are displayed in Figure 3.
The features are arranged in order of relevance, and upon examining the figure, we note that 15 features exhibit significantly greater importance than the others. As a result, we generate three subsets: one comprising all the features of the dataset, one encompassing the top 15 features determined by the statistical test, and one containing the top 10 features. Using a fixed window size W and predictive horizon T, we present the results in Table 3.
We notice a reduction in training time with a smaller subset of features, as expected, given that the overall model has fewer parameters. This also leads to a more memory-efficient model. However, the decrease in features corresponds to a decline in accuracy, which is reasonable since patterns are discarded along with the removed features. Consequently, there exists a natural trade-off between predictive robustness and model complexity, and this trade-off should be customized for each specific use case. For the following experiments, the full feature set (all 44 pre-processed features) is used.
5.3. Flow-Based vs. Network Grouping
Within this subsection, we examine the impact of diverse grouping approaches on the dataset to determine whether broader network information plays a crucial role in predicting forthcoming attacks or whether it suffices to monitor the traffic between specific entities within the network. The grouping process is executed exclusively for the training set, leaving the testing set unchanged (i.e., the time series objects of the test set are created with network grouping). The input window W is set to 50, while the prediction horizon T is configured to 1. Specifically, we generate time series objects for the training set through the following two strategies (a code sketch follows the list):
Network grouping (NG): dataset sorting based on time features and subsequent creation of time series objects.
Flow grouping (FG): grouping the dataset based on IP pairs, followed by sorting each group and subsequently creating time series objects within each group.
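The sketch below illustrates the two strategies with pandas; column names such as `srcip`, `dstip`, and `stime` are assumptions about the schema.

```python
import pandas as pd

# Tiny stand-in for the packet table.
df = pd.DataFrame({
    "srcip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "dstip": ["10.0.0.9", "10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "stime": [3, 1, 4, 2],
    "label": [0, 0, 1, 0],
})

# Network grouping (NG): one global, time-sorted stream.
ng = df.sort_values("stime")

# Flow grouping (FG): one time-sorted stream per (sender IP, receiver IP) pair.
fg = {pair: g.sort_values("stime")
      for pair, g in df.groupby(["srcip", "dstip"])}

# Windows are then built within each stream, e.g., with make_windows above.
```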
The results of these experiments can be viewed in Figure 4.
In our analysis, we observe significant differences in the results obtained from the two approaches. Initially, the flow-based method exhibits a higher F1 score, with a disparity of up to 10% at certain points during approximately the first 10 epochs. This discrepancy arises because the pattern identification task is less intricate; correlations are more readily discernible in individual flows between two IPs.

However, the F1 score of the flow-based approach stabilizes after epoch 20 relative to the network-based approach; after 40 epochs, the difference between the two is ∼7%. Analyzing the evolution of the gradients reveals that during the initial 20 epochs, both approaches exhibit steeper gradients, signifying rapid changes. Beyond epoch 20, a gradual decrease in gradients is observed, indicating that the model is making increasingly smaller updates to its parameters as training progresses. Notably, the gradient plot for NG (network grouping) consistently maintains higher values compared to FG (flow grouping), suggesting a greater learning capacity for the former strategy.
To further compare the two grouping approaches, we also present the receiver operating characteristic (ROC) curve for the test set predictions at epoch 40 in Figure 5.

The AUC score of the network grouping is higher than that of the flow grouping, suggesting a better fit of the model. We also observe that the network-based curve is closer to the upper-left corner of the plot, which generally indicates better overall performance. This implies a higher true positive rate and a lower false positive rate across different threshold settings.
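For reference, the ROC curve and AUC score can be computed from the test-set scores as follows (random stand-ins shown):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.random.randint(0, 2, size=500)   # test labels (stand-in)
y_score = np.random.rand(500)                # model sigmoid outputs (stand-in)

fpr, tpr, _ = roc_curve(y_true, y_score)     # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))
```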
Overall, the network-based grouping appears superior, which we attribute to the fact that the dataset was collected within a single network, and information regarding impending attacks may manifest in locations beyond the attack target itself. For instance, an attacker might probe the defenses of various entities within a network before deciding to target a specific node. Therefore, this experiment suggests that valuable information about impending attacks is embedded in the overall traffic of a network, and monitoring it can contribute to a model that is more adept at predicting the presence of attacks.
5.4. Impact of W in Model Performance
Firstly, we want to test the effect of the window size on the model's accuracy. For this reason, we predict the label of the packet immediately following the window, using windows (W) of ascending size. For this experiment, we use network-based grouping. We also test these cases against the classic, real-time binary classification to compare the efficiency of the two approaches. Here, the value of T is fixed at 1. With this value fixed, after the time series formatting, we end up with 2,540,033 time series objects from 2,540,043 initial instances. This is expected, since consecutive windows overlap with stride 1 and the first window consumes the initial instances of the stream before a label can be formed. The label distribution is the same as in the original dataset. The results can be viewed in Table 4.
A simple (real-time) classification scenario yields marginally better results compared to the time series case, which is unsurprising, given that classifying using present variables is substantially easier than predicting future events. Specifically, the F1 score of the classification method is only 1% higher than that of the time series model at the largest window size, which is deemed acceptable. Additionally, the results indicate that an increase in the input window size enhances model accuracy, since utilizing more samples generally leads to improved predictions by providing more data for analysis. Nevertheless, this also entails a greater computational burden, as a larger window entails more parameters for the model. More specifically, the metrics improve as W is increased across the smaller window sizes, with a 5–6% increase at each step. No further significant enhancement in the metrics is observed between the two largest window sizes, implying that all relevant information for the prediction has already been exploited (considering that the lifetime of any interaction between an actor and the IDS is briefer than such windows). Consequently, the methodology described herein involves a natural compromise between computational complexity and accuracy, necessitating customization for each unique application.
We also present ROC curves for the real-time IDS classification and the time series approach with a fixed window size in Figure 6.
In general, we observe slightly improved outcomes for the classification scenario, as anticipated, since our approach lacks information about the packet whose label is predicted. Nevertheless, upon examining both the ROC curve and the AUC score, we note that the disparity is minimal. Consequently, our method remains viable, supporting proactive security measures.
5.5. Impact of T in Model Performance
The results related to predictions that cover a longer time frame (different values of T) are displayed in Table 5 for a fixed window size W.
The table reveals that the metrics improve as the predictive horizon expands. This trend is expected, since a larger value of T leads to more general predictions, the focus being on detecting the presence of an attack in the upcoming T packets rather than predicting the label of a particular packet. The most significant increase is observed between the two smallest values of T, where the F1 score improves by 6%, while precision and recall increase by 3% and 9%, respectively. As the value of T increases further, the improvements in the metrics diminish, with only a 2% enhancement observed between subsequent T values. We also display the label distributions for each of these cases. It is evident that with an increase in the value of T, the model achieves greater performance, as one would anticipate due to the higher probability of an attack occurring within a larger window frame. Nevertheless, anticipating an attack using a larger value of T yields less information and results in sparser identification of attacks.
6. Model Architecture Ablation Study
In this section, we conduct a sequence of experiments to affirm the architecture outlined in Section 4.2. The primary objective of this investigation is to validate the model's architecture and substantiate that the chosen design elements are accurate and well-considered. To achieve this, we utilize the experimental configuration outlined in Section 5.1, wherein time series objects are generated from the entire sorted dataset, and the metrics are computed through cross-validation. The outcomes of this study are presented in Table 6.
Here, we observe that the simple models consisting of a standalone LSTM or CNN perform the worst, since they are too shallow and cannot handle complex patterns within the dataset. This is apparent for all three metrics displayed in the table. When it comes to the combination of LSTM and CNN, we observe an overall improvement in the metrics, owing to the fact that the synergistic integration of the two architectures leverages the strengths of each model; the gains over the simple LSTM are visible across the F1 score, precision, and recall alike. Finally, we see that the proposed model, which adds an attention layer between the two LSTM layers, performs the best, because the attention mechanism allows the model to focus on the most relevant parts of the input sequence, enhancing its ability to capture intricate patterns and dependencies. This targeted attention mitigates information loss and facilitates more effective learning of long-range dependencies, contributing to superior performance across all evaluated metrics.
7. Conclusions
In this study, we re-examined the conventional approach of machine-learning-based intrusion detection systems and redefined the problem as a time series prediction task. Our results demonstrate that the proposed methodology is capable of proactive operation rather than merely reacting to security breaches, achieving an F1 score within roughly 1% of the real-time approach and an AUC score that is only marginally lower. By embracing the time series prediction approach, we acknowledge the dynamic nature of network threats and the need for a proactive defense strategy. This shift in perspective allows us to anticipate and counteract emerging attack techniques, thereby staying one step ahead of malicious actors. Through our research, we aim to contribute to the continuous improvement of intrusion detection systems, ensuring the safety and security of our interconnected digital world. The results indicate that there is a natural trade-off between computational complexity and window size W, so the choice must be tailored to the specific use case. Based on our experiments, there is an additional trade-off between feature set size and performance, where fewer features result in faster training and inference but also lower accuracy. Furthermore, we show that analyzing the overall traffic of a network (network grouping) can uncover patterns not directly linked to specific nodes, such as those associated with a single IP, providing valuable insights for intrusion detection. We also test several architectural variants and conclude with a complex time series prediction model that achieves robust metric scores.
In the future, we aim to validate our approach on additional IDS datasets to further support our claims. We also plan to evaluate the performance of a model trained on one dataset and tested on another with the same features, to assess cross-dataset generalization. Lastly, we plan to test our methodology in both centralized and federated learning paradigms, since such a model would benefit from diverse data gathered from multiple devices.