Experiments on generalizability consist of two phases. In the first phase, the multi-class case is considered. The obtained results suggest limited but non-negligible generalizability, which is then investigated more deeply with a focus on the Benign class, shown to be the main driver of the multi-class results. In the second phase of experiments, the co-occurring classes, i.e., classes present in more than one original set, are investigated to check how generalizability manifests in their case. The results reveal that the issue of generalizability is more complex than one may expect: it differs depending on the type of traffic class (benign/harmful).
5.1. Multi-Class Classification and Binary Benign-Class Case
In the first phase of experiments, we tested the overall quality of cross-validated classifiers, measuring the multi-class metrics for all combinations of training and test sets. In addition, a fifth case was introduced, in which the classifier was trained on the union of all four training sets. In each case, the final measure is the mean value of two tests (2-fold cross-validation).
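The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `train_fn`, `metric_fn`, and the dataset structure are placeholders that a caller supplies.

```python
import random

def fold_split(samples, k=2, seed=0):
    """Shuffle a dataset and split it into k folds (the paper uses 2-fold CV)."""
    rng = random.Random(seed)
    s = list(samples)
    rng.shuffle(s)
    return [s[i::k] for i in range(k)]

def cross_eval(datasets, train_fn, metric_fn):
    """Evaluate every train-test dataset combination, averaging the metric
    over the folds of the training set. `train_fn` fits a model on a list of
    (x, y) pairs and returns a predict callable; `metric_fn` scores labels."""
    results = {}
    for tr_name, tr_data in datasets.items():
        folds = fold_split(tr_data)
        for te_name, te_data in datasets.items():
            scores = []
            for fold in folds:
                model = train_fn(fold)
                xs, ys = zip(*te_data)
                scores.append(metric_fn(ys, [model(x) for x in xs]))
            # final measure: mean over the two training folds
            results[(tr_name, te_name)] = sum(scores) / len(scores)
    return results
```

With four component sets plus the merged "All" set, this loop produces the full grid of train-test scores reported in the tables.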
Table 4 presents the results.
In this table, some combinations of train-test sets yield higher metric values than others. Based on F1 and accuracy (which are highly correlated; their correlation equals 0.99), one may observe that classifiers trained on ToN perform well when tested on IDS18 and NB15 (and, obviously, obtain the highest score on the ToN-origin test set). In contrast, their performance on the BoT test set is dramatically low. Similar patterns appear when training on NB15 and IDS18: again, testing on BoT yields measures close to 0. The high metric values imply a relatively high similarity of the network traffic between the NB15 and IDS18 sets and a lower similarity between each of them and ToN, suggesting that these sets may be considered generalizable. The BoT set is, however, a different case, as indicated by the low values of the measures. When, in turn, this set is used for training, the measures obtained on the remaining test sets are remarkably lower or even close to 0. This behavior confirms that BoT differs from the other sets and is poorly generalizable.
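For reference, the two multi-class measures discussed here can be computed as in the following plain-Python sketch of the standard definitions (not the exact implementation used in the experiments):

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """F1 averaged over all classes present in the ground truth, so rare
    classes (e.g. attacks) weigh as much as the dominant Benign class."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The two agree closely when one class dominates and is classified well, which is one reason F1 and accuracy correlate so strongly in these tables.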
To dive deeper into this issue, one must look into all the sets, performing quantitative and qualitative investigations. Looking at
Table 1, one may see that the principal difference between BoT and the remaining sets, apart from the origin of data and classes, lies in the class balance. In all datasets but BoT, the Benign traffic class dominates: in IDS18 and NB15, it dominates totally (88.05% and 96.02%, respectively), while in ToN it is still the most frequent class, gathering 36.01% of samples. This is because the share of classes in a dataset tries to reflect that of typical network traffic, where intrusions and attacks are relatively rare and benign traffic dominates. From this point of view, BoT looks completely different: the share of the Benign class is less than 1%, and the set is almost equally divided into two harmful classes, DoS and DDoS. All of the above raises the suspicion that the high metric values obtained in the cross-validation of IDS18, ToN, and NB15 are caused by the significant share of benign traffic in the datasets.
To validate this hypothesis, a binary classification setup is applied, focusing on the Benign traffic class and measuring the ability of classifiers to correctly classify samples of this class (true case), treating the rest of the classes as the false case.
Table 5 shows the classification metrics in this setup. Following the previous considerations, when analyzing the Benign traffic class, we focus on the precision and false positive rate (FPR) measures, which are the most adequate in this case. The F1 measure is used as a supplementary one.
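A minimal sketch of the one-vs-rest precision and FPR computation for this setup (standard definitions; the label names are illustrative):

```python
def precision_fpr(y_true, y_pred, positive="Benign"):
    """One-vs-rest precision and false positive rate for the `positive`
    class; all remaining labels are collapsed into the negative case."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, fpr
```

High precision with low FPR means harmful samples are rarely mistaken for benign traffic, which is exactly the failure mode this phase probes.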
The investigation confirms the influence of the Benign class on the multi-class classification results. High precision (P) and low false positive rate (FPR) values are present for the same train and test set combinations as in the multi-class case. In particular, the IDS18 and NB15 datasets are very close to one another regarding cross-validated classification results, which implies the similarity of the Benign traffic samples from both sets. The ToN dataset performs worse; moreover, its results for this class are lower than in the multi-class case. This property may come from differences in the size of the Benign class: the number of samples for IDS18 and NB15 is approximately 2.5 times higher than in the ToN case (see
Table 1). Looking, in turn, at the BoT dataset, one observes dramatically weak results.
However, not only quantities make the difference. In the case of NB15 and IDS18, the captured benign traffic consists of natural transaction data based on standard application protocols, like HTTP, HTTPS, FTP, SSH, and email; in both cases, a normal user behavior profile was selected. The BoT dataset, however, was constructed to mimic the adversarial behavior of IoT networks compromised by Botnet malware; therefore, the regular traffic is substantially reduced. One may notice that the Benign class network traffic in BoT comes only from a realistic smart-home network of five IoT devices. Its instances are relatively few and do not include the normal user behavior present in the NB15 and IDS18 datasets [6].
The t-SNE visualization of the Benign class is shown in
Figure 2. The areas covered by samples belonging to particular datasets partially overlap. Observation of the distributions of points in the reduced 2D space confirms the previous conclusions: the areas covered by IDS18 and NB15 are very close to one another (practically, they fully overlap), while the ToN dataset only partially overlaps both. Observing, in turn, the distribution of the BoT 2D points, one immediately finds that they lie far apart from the others. The less intense point cloud reflects the difference in the number of samples (approx. 100 times fewer than ToN and 250 times fewer than NB15 and IDS18).
5.2. Harmful Traffic Classes Detection
The first phase of experiments revealed that the relatively good results, which may confirm the generalizability of three of the four datasets, are mostly driven by the good classification of the Benign class, which dominates over the other classes and consists of samples that exhibit some inter-dataset resemblance. In the case of the BoT dataset, even the multi-class measures do not confirm generalizability.
A Benign class in a network traffic dataset is necessary to properly train ML models to detect the remaining classes referring to harmful traffic, finally allowing them to distinguish between regular (benign) traffic and various dangerous kinds. From this point of view, correctly detecting Benign traffic by models trained on datasets other than the one on which the model is tested would eventually allow for separating benign from harmful traffic. One needs, however, to consider two additional issues. First, this holds only for the NB15 and IDS18 datasets, containing enterprise network traffic; in the case of IoT traffic, the results are not that optimistic. Secondly, in most cases, the primary goal of a NIDS exceeds simple differentiation between benign and harmful classes: the system should also detect the type of danger. In other words, it should correctly classify harmful data samples into one of the pre-defined classes.
To investigate this issue, we conducted experiments similar to the Benign class binary classification but on classes representing harmful traffic. Such experiments should be conducted on co-occurring classes, i.e., classes present in more than one dataset. Looking at
Table 1, one may easily observe that the collection of such classes is relatively small. It consists of just four classes: DoS, DDoS, Reconnaissance, and Backdoor.
All the co-occurring classes are analyzed using the previously mentioned measures. However, following our previous considerations, in this case we focus on the recall measure, which should converge to 1 as the performance of the classifiers grows, while the false negative rate (FNR) should converge to 0. The measures obtained for all co-occurring harmful classes are shown in
Table 6.
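The per-class recall and FNR used here follow the standard one-vs-rest definitions, which can be sketched as:

```python
def recall_fnr(y_true, y_pred, target):
    """One-vs-rest recall and false negative rate for one harmful class.
    As detection improves, recall tends to 1 and FNR to 0 (they sum to 1)."""
    tp = sum(t == target and p == target for t, p in zip(y_true, y_pred))
    fn = sum(t == target and p != target for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0
    return recall, 1.0 - recall
```

For an intrusion class, a false negative (an attack flow labeled as anything else) is the costly error, which is why recall and FNR are the right lens here.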
The only harmful traffic class present in all four sets is Denial of Service (DoS). It is one of the two principal classes of the BoT dataset, so it is no wonder that the number of its samples in this set far exceeds that in the remainder of the sets: it is more than ten times higher than in the second most populous set, ToN. This situation resembles the case of the Benign class: some datasets contain more samples of a particular class than others. Nevertheless, contrary to the Benign class case, this does not directly imply that a classifier trained on these sets performs well on the others, which may be observed in the metric values. Here, despite the high values of precision, recall, and F1 and the low value of FNR for the BoT-BoT (train-test) case, the model trained on BoT and tested on the three other datasets performs poorly: the metric values are close to 0, while the error value (FNR) is close to 1. This is because samples of the DoS class from different original datasets occupy disjoint regions in the feature space, which is visible in the t-SNE visualization shown in
Figure 3 (left). For the same reason, other cross-validated combinations of train-test sets perform poorly. To understand why this happens, one needs to look carefully at the characteristics of the DoS class, which differ between particular sets. The datasets differ mainly in the types of systems targeted (enterprise networks vs. IoT) and the techniques employed in the attacks. In NB15, the DoS attacks involve traditional network-based denial-of-service attempts targeting system resources, typically from a single host. In IDS18, the attacks are more varied, including different types of flooding (e.g., HTTP, SYN flood) that aim to disrupt services in more modern network environments. BoT attacks focus on disrupting IoT devices’ functionality and exploiting IoT ecosystems’ limited resources, with traffic aimed at overwhelming low-power devices. Finally, ToN attacks target both IoT devices and traditional networks, showcasing resource exhaustion in IoT environments while illustrating cloud-based and industrial infrastructure vulnerabilities.
Another frequent class is Distributed Denial of Service (DDoS), present in three datasets: BoT, ToN, and IDS18. Here also, BoT holds most of the samples belonging to this class. A careful analysis, following that performed for the DoS class (see
Figure 3, right, for the t-SNE visualization), leads to the conclusion that the level of generalization of this class is extremely low: the measures for all but a set’s own train-test combination are close to 0, while the FNR error rates are close to 1. The main differences lie in the attack targets: IDS18 focuses on traditional networks, while BoT and ToN deal with IoT environments, with ToN also incorporating industrial and cloud systems in its scope. In IDS18, attacks involve large-scale, distributed attacks using botnets targeting traditional network services. In BoT, attacks specifically exploit IoT devices, using compromised IoT devices (botnets) to generate high volumes of traffic aimed at overloading the network or devices. In ToN, attacks target both IoT and traditional systems, reflecting complex, mixed environments, including cloud and edge infrastructure.
It is worth noting that the detection of DoS and DDoS attacks is based on a filtered set of traffic features; this final set does not include the source IP address. In many attacks, high repeatability of the source IP is the principal indicator of this type of attack. Fortunately, it is usually not the only one: other determining factors are typically present in the higher layers of the protocol stack and influence other traffic features in our datasets.
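Such filtering can be sketched as a simple column drop applied before training. The column names below are assumptions for illustration only; the actual feature lists of the datasets differ.

```python
# Hypothetical identifier-like columns excluded from the feature set.
EXCLUDED = {"src_ip", "dst_ip", "flow_id"}

def filter_features(record):
    """Drop identifier-like fields so the classifier must rely on
    behavioural flow features rather than memorizing addresses."""
    return {k: v for k, v in record.items() if k not in EXCLUDED}
```

Excluding address identifiers trades a strong but dataset-specific signal for features that have a chance of transferring between environments.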
The last two co-occurring classes are Backdoor and Reconnaissance. Both are present in just two sets and are much less frequent than Benign, DoS, and DDoS. For each of the two classes, analyzing the metrics and visualizations (
Figure 4) and applying the same train of thought, one quickly notices that, although samples from different sets are assigned to a class of the same name, their internal characteristics exhibit significant differences between traffic samples of the same class originating from different datasets. The main difference in the Backdoor class between ToN and NB15 is the environment being targeted. ToN focuses on IoT devices and industrial systems, where attackers gain persistent, unauthorized access to critical infrastructure, including smart devices and edge systems. In NB15, Backdoor attacks target traditional network systems and servers, aiming to maintain covert access to corporate or personal computing environments. Similarly, the difference in the Reconnaissance class between BoT and NB15 lies in the type of targets. BoT focuses on probing and scanning IoT devices to find vulnerabilities specific to resource-constrained, internet-connected devices. In contrast, in NB15, the traditional network infrastructure is scanned, gathering information such as open ports or system configurations in more conventional IT environments.
Apart from models trained on particular datasets, a model trained on the union of the datasets was also investigated. The training set was created by merging all four component datasets into a single superset of 4 million samples. The performance of this classifier was measured in the same way as in the individual cases; the resulting measures are shown in the last section of each table (the train set name is ’All’ in this case). Contrary to the models trained on individual sets, the model trained on the superset performs well on all test sets. The reason for such good results is the increase in intra-class variability achieved thanks to merging. This observation also indicates that extending the training set during the use of the model, followed by re-training, allows the NIDS to adapt to its real working environment.
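The construction of the ’All’ superset can be sketched as a merge-and-shuffle over the component sets; the sample structure here is illustrative, not the actual data format.

```python
import random

def build_superset(datasets, seed=0):
    """Merge all component datasets into a single 'All' training set,
    shuffling so that each CV fold mixes samples from every source."""
    merged = [sample for data in datasets.values() for sample in data]
    random.Random(seed).shuffle(merged)
    return merged
```

The same 2-fold protocol is then applied to the merged set, so the folds expose the classifier to the class variants of every source at once.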