5.1. Class Discovery
In fact, raw IoT data are unlabeled and may contain various types of behavior, normal or abnormal, and the abnormal traffic can itself span many attack types. Therefore, to build a robust supervised IoT-based IDS, the model must be trained on all behaviors. As previously discussed, active learning is regarded as a solid technique for selecting the most representative and informative instances, ones that capture the full range of system behaviors. The fundamental challenge is to discover or select all attack types. Existing active learning methods rely on random initialization of the initial seed, which tends to draw instances only from the major, dominant behaviors (attack types). Thus, minority classes, which may represent new or zero-day attacks, can be missed in the discovery process. In this evaluation, we compare CLAIRE with the baselines on class label coverage (CCR). CCR serves as the principal indicator of whether the actively selected set spans the full label space. In active learning, coverage is critical because the learner should encounter at least one instance from every class to establish reliable decision boundaries. When coverage is high, the model gains broad access to training evidence and tends to be more robust across classes. If any label is absent from the selected set, the model cannot learn that class, since it has no training examples for it.
In this context, ICL refers to the count of unique labels observed among the selected instances during active learning, whereas TCL denotes the total number of unique labels present in the source dataset as a whole. CCR treats all labels equally and focuses on coverage rather than per-class accuracy. The metric is defined as follows:
CCR = ICL / TCL

By definition, 0 ≤ CCR ≤ 1. A value of 1 means complete coverage, with at least one instance from every class. A CCR below 1 implies missing class labels, which makes the selected subset less representative of the complete dataset.
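As a concrete illustration, the metric can be computed directly from label sets. The sketch below is ours, not part of the CLAIRE implementation; the function name and the sample labels are illustrative only.

```python
def class_coverage_ratio(selected_labels, all_labels):
    """CCR = ICL / TCL: the number of unique labels among the selected
    instances divided by the number of unique labels in the full dataset."""
    icl = len(set(selected_labels))  # ICL: labels seen in the selection
    tcl = len(set(all_labels))       # TCL: labels in the whole dataset
    return icl / tcl

# Hypothetical example: 4 classes exist, the selection covers 3 of them.
full = ["Benign", "DDoS", "DoS", "Mirai", "DDoS", "Benign", "Mirai"]
picked = ["Benign", "DDoS", "DoS"]
print(class_coverage_ratio(picked, full))  # 0.75
```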
As shown in
Table 3, DDoS and DoS dominate the CIC variants CIC-V1 and CIC-V2 at 86% and 92%, respectively, while minority categories such as Web-based and Brute Force attacks each account for less than 1% of the data. Even under this pronounced skew, CLAIRE achieves a CCR of 1.00 at almost every labelling budget in both CIC-V1 and CIC-V2, with a single exception of 0.88 at 30 labeled instances in CIC-V2. In contrast, the baseline methods struggle significantly: BE achieves 1.00 only at higher budgets, starting from 200 instances in CIC-V1 and 250 in CIC-V2; LHCEIII performs poorly, with maximum scores of 0.75 in CIC-V1 and 0.88 in CIC-V2; MARGIN reaches 1.00 only inconsistently; and USAP is highly volatile, fluctuating between 0.38 and 1.00. Similarly, on the CIC-V3 dataset, the proposed CLAIRE approach performs strongly across all budgets except 30 and 50 instances, where it still achieves a satisfactory 0.88. The baseline models, on the other hand, behave less predictably. RAND achieves a CCR of 1.00 when sufficient labeled instances are available. MARGIN performs well overall, although its results fluctuate. LHCEIII also reaches 1.00 at higher budgets, except at 300 instances, where it scores 0.88. In contrast, USAP and BE perform worse than all other methods.
In a similar manner,
Table 4 shows the evaluation on the NB dataset variants. The results further demonstrate that CLAIRE remains promising on all variants: NB-V1, with a relatively balanced distribution; NB-V2, where Normal and TCP attacks dominate at a combined 89% while Combo, Junk, and UDP attacks represent only 0.19% each; and NB-V3, where Combo and Junk attacks constitute 84% while the other classes become minorities. CLAIRE achieves perfect CCR scores of 1.00 across all NB variants and labelling budgets without exception. The baseline methods struggle to varying degrees: while LHCEIII and some others achieve 1.00 in the balanced NB-V1 scenario, they fail significantly in the imbalanced scenarios, with BE, MARGIN, RAND, and USAP showing poor coverage in NB-V2 and NB-V3, rarely exceeding 0.83 and often dropping as low as 0.50.
Overall, CLAIRE maintains strong and consistent coverage across datasets despite challenging class distributions. Although these CCR results demonstrate excellent class label coverage, this metric alone does not fully validate the practical effectiveness of the active learning approach. High coverage ensures that all classes are represented in the selected instances, but it does not guarantee that these instances are the most representative and informative for model training. The quality and representativeness of the selected instances are equally important for achieving good classification performance. Therefore, the following section evaluates the effectiveness of the selected instances.
5.2. Classification Performance
Although the results of CCR indicate that class labels are well covered, coverage by itself does not ensure that the selected instances are sufficiently representative and informative for effective training. To build a more rounded view of performance, the classification capability of CLAIRE was examined using three widely adopted metrics, namely macro-Precision, macro-Recall, and macro-F-measure. These measures were computed using macro-averaging, meaning that each class was evaluated individually and then averaged so that both frequent and infrequent classes exert the same influence on the final score.
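To make the averaging explicit, the following minimal sketch computes the three macro metrics from scratch (equivalent functionality is available in libraries such as scikit-learn; the function name here is ours):

```python
def macro_scores(y_true, y_pred):
    """Per-class precision, recall, and F-measure, then an unweighted mean,
    so minority classes influence the result as much as majority ones."""
    classes = sorted(set(y_true) | set(y_pred))
    precs, recs, fs = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); fs.append(f)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(fs) / n
```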
In fact, the literature describes a wide range of attack types, each with different characteristics and severity levels. However, in this evaluation, we adopt a simplified approach where we treat all attack types equally with the same impact factor. This is because we aim to evaluate whether the proposed approach is capable of learning diverse behaviors in the unlabelled dataset. In other words, the goal is not to learn specific behaviors or attack types, but rather to capture the full range of behavioural diversity. For this purpose, we evaluate the classification performance of the proposed CLAIRE approach against the baselines using four well-known classifiers: J48, Naive Bayes, k-Nearest Neighbors, and Random Forest. Each classifier is trained on the labeled instances generated by each method. This evaluation investigates the quality of the labeled instances generated by each competing method to assess their representativeness and informativeness in capturing the characteristics of the original dataset. To ensure fairness and generalization of the trained models, we test them on unseen data.
For simplicity and due to space limitations, we report performance as averages across the four classifiers rather than listing separate results for each model. This provides an overall assessment of how well each set of labeled instances captured useful information that enables various models with different learning perspectives to generalize for classifying unseen data. Additionally, we include the standard deviation alongside the averaged performance to provide readers with insight into the stability and variability of each compared method.
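For example, the reported numbers can be reproduced from four per-classifier scores as a mean with a standard deviation. The scores below are made up for illustration, and the use of the sample (rather than population) standard deviation is our assumption.

```python
import statistics

# Illustrative (made-up) macro F-measure scores, one per classifier.
f_measure = {"J48": 0.88, "NaiveBayes": 0.80, "kNN": 0.85, "RandomForest": 0.91}

avg = statistics.mean(f_measure.values())
sd = statistics.stdev(f_measure.values())  # sample SD (assumed)
print(f"{avg:.2f} ± {sd:.2f}")  # 0.86 ± 0.05
```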
Table 5,
Table 6 and
Table 7 show classification results for the three CIC variants. As shown in
Table 5, CLAIRE demonstrates promising and stable results at most labelling budgets, and its performance improves markedly as the budget increases. CLAIRE achieves the highest macro-F, macro-Recall, and macro-Precision for most budget sizes, except at 50 instances, where MARGIN is competitive and slightly better. In contrast, the baseline methods show fluctuating results. Among the competitors, MARGIN comes closest, with scores from 0.69 to 0.89, yet it trails by about two to three percentage points at higher labelling budgets. USAP produces occasional gains at certain points, for example 0.72 at 100 labeled instances and 0.77 at 250, but lacks overall stability. Meanwhile, BE and RAND remain below 0.73, and LHCEIII consistently records lower values, staying at or under 0.55. These results indicate that CLAIRE not only achieves stronger overall performance but also learns minority classes more effectively than the competing methods. Interestingly, the proposed approach exhibits relatively low standard deviations, indicating that it selects representative and informative instances that capture most of the characteristics of the learning data.
Table 6 shows the results for CIC-V2, a far more imbalanced configuration in which DDoS and DoS together represent approximately 92% of the samples. The remaining four categories appear only in very small proportions, each below 0.3%. Across labelling budgets, CLAIRE is the most consistent performer among the compared methods. As the labelling budget increases, CLAIRE's scores climb steadily: macro F-measure rises from 0.69 to 0.88, macro Recall moves from 0.48 to 0.89, and macro Precision improves along the same path, reaching 0.88. USAP looks strong early, posting a macro F-score of 0.81 with 200 labeled samples, but its precision falls as the budget grows. BE, RAND, and MARGIN produce low results, not exceeding 0.70, and LHCEIII records the lowest, never exceeding 0.60 at any budget. Taken together, these results illustrate how class imbalance can limit learning effectiveness, particularly when rare classes must be learned under tight labelling budgets. Despite these challenges, CLAIRE keeps recall and precision in reasonable balance, even when minority classes have only a few labels. With more labeled data, its performance stabilizes further, indicating stronger and more reliable generalization than the competing approaches. Furthermore, CLAIRE exhibits relatively low standard deviations across most budget sizes, revealing that the selected labeled instances are informative and contain clear patterns for building generalized and efficient classification models.
As detailed in
Table 2, CIC-V3 has a different class distribution: 91% of the samples come from Benign traffic and from three attack types, namely reconnaissance, spoofing, and web-based attacks. Attack types such as DDoS, DoS, and Mirai appear much less often, while brute-force activity shows a modest increase. Even with this shift, CLAIRE still performs best among the compared methods.
Table 7 shows that its macro F-measure grows steadily from 0.44 with 30 labeled instances to 0.91 at 300, with peak scores of 0.93 for macro Precision and 0.90 for macro Recall. MARGIN performs well in the early stages: with only 50 labeled examples it already reaches a macro F-score of 0.74, and at 200 labels it achieves the highest macro Precision of 0.92. However, its recall still falls short of CLAIRE's. RAND shows a broadly similar pattern but tends to stay slightly behind MARGIN at most labelling budgets. USAP behaves far less predictably, with its macro F-score ranging anywhere between 0.38 and 0.74, showing that it reacts strongly to changes in the dataset. BE is almost the opposite: its performance stays fairly steady at around 0.59 across budgets, although it never reaches the levels achieved by CLAIRE or MARGIN.
Similarly,
Table 8,
Table 9 and
Table 10 present the results for the NB dataset variants, where each one has different distributions of the traffic types, as summarized in
Table 1. In the NB-V1, traffic types are moderately distributed. This gives each model balanced exposure to learn from all classes. As shown in
Table 8, RAND shows promising results under this distribution, with macro F-measure scores increasing from 0.89 to 0.92. MARGIN reaches macro F-measure values between 0.47 and 0.91 and attains its highest macro Precision of 0.92 with 150 labeled instances. CLAIRE demonstrates competitive results, improving gradually from 0.59 to 0.89 as the labelling budget increases. USAP is effective at low labelling budgets but improves little at higher ones, indicating unstable behavior. Both RAND and MARGIN behave consistently across the different classifiers, while CLAIRE makes gradual and reliable progress at all labelling budgets. As for stability, CLAIRE shows competitive standard deviations ranging from 0.05 to 0.15, comparable to RAND and MARGIN, indicating stable performance across classifiers. USAP, however, shows higher variability at larger budgets (SD up to 0.18 at 250 instances), suggesting less consistent behavior.
For the second variant, NB-V2, the data distribution becomes highly uneven, a clear case of class imbalance. Normal and TCP traffic account for almost 90% of all samples in the dataset, while Combo, Junk, and UDP attacks contribute only about 0.19% each, as shown in
Table 1. Despite this distribution, CLAIRE demonstrates reliable performance. As presented in
Table 9, CLAIRE's macro F-measure improves from 0.58 to 0.83, while its macro Recall increases evenly to 0.88 at 300 instances. This promising result demonstrates that CLAIRE efficiently selects representative and informative instances even from highly imbalanced data. Most baseline methods, on the other hand, produce much lower scores, rarely exceeding a macro F-measure of 0.70. MARGIN and USAP show moderate improvement at mid-range budgets but are unstable at larger budget sizes. BE and RAND remain mostly below 0.60 across all metrics, and LHCEIII shows little variation throughout. With respect to stability, CLAIRE shows relatively higher standard deviations (ranging from 0.06 to 0.20) than the other methods, reflecting some variability across classifiers. Overall, this trend shows that severe class imbalance can weaken traditional active learning methods, which is especially problematic for IoT intrusion detection systems, where detecting rare attacks is crucial.
Similar to NB-V2, NB-V3 represents highly imbalanced data, but with different types of behavior. As can be seen in
Table 1, attack traffic represents the largest portion of the dataset, compared with only a few normal samples. The Combo and Junk categories jointly comprise more than 80% of all samples, while TCP and Scan traffic appear only occasionally. Despite this asymmetric distribution, CLAIRE performs reliably and improves as more labeled data become available: its macro F-measure increases from 0.58 to 0.93, macro Recall rises from 0.61 to 0.92, and macro Precision reaches 0.95, as shown in
Table 10. These results show that CLAIRE adapts gracefully even when most of the data are concentrated in a few dominant attack types. By contrast, the baseline approaches struggle to match this level of consistency. At a labelling budget of 300, LHCEIII reaches a macro F-measure of 0.84, followed by BE at approximately 0.72, while MARGIN and RAND average around 0.73 and 0.67, respectively. USAP performs below average at all budget levels, reaching only 0.46 at best. These results demonstrate CLAIRE's adaptive effectiveness under pronounced class imbalance: it consistently maintains both precision and recall even when the dataset is dominated by a few prevalent attack types. This balanced performance is highly valuable in real-world network environments, where missing rare but critical threats can have serious consequences.
Overall, for every variant of the CIC and NB datasets, CLAIRE maintains consistently high performance compared to baseline methods. The results highlight the capability of CLAIRE to learn effective decision boundaries by identifying and leveraging the most informative and representative instances from the training dataset, thereby demonstrating the robustness of active learning with highly imbalanced data. By contrast, the baseline methods exhibit marked volatility, finding it challenging to learn informative and representative instances in such imbalanced scenarios.
Computational Efficiency
In this section, we analyze the computational cost of the proposed approach. We focus only on the labelling process; the classification and monitoring phases are assumed to take comparable time across methods, since they operate on the same number of instances. As depicted in
Figure 2, CLAIRE consists of four layers. Layer 1 adapts the VWC algorithm to partition the unlabeled data into hundreds of micro-clusters. This layer performs a single scan to cluster the data using a fixed width learned from the data. Any cluster exceeding the maximum cluster size threshold is recursively partitioned using a new fixed width learned from that cluster, and this recursion continues until no cluster exceeds the threshold. As shown in
Table 11, smaller values of this threshold increase the total execution time. Layer 2 applies EM clustering, which is computationally expensive in general. However, as illustrated in
Figure 6, the computational cost of Layer 2 (L2) remains manageable because this layer operates only on the medoids produced by Layer 1, not on the original large dataset. Layer 3 requires a similar amount of time, and Layer 4 operates on very few instances, so its cost is negligible. Overall, as shown in
Figure 6, the proposed approach incurs a larger computational time than the baseline methods. For example, on the CIC-IoT dataset it takes about 189 s, while the worst and best baseline methods take approximately 118 s and 2 s, respectively. RAND is the fastest because it has no intelligent selection component, but it also performs poorly on imbalanced classes (see
Section 5.1). The methods show the same relative behavior and ordering of computational time on the much larger N-BaIoT dataset. Most of the proposed approach's time is consumed in the first layer, which involves the micro-clustering technique and the medoid calculation for each cluster.
Table 11 shows that VWC with a maximum cluster size of 1000 takes only about 121 s, which means that most of the Layer 1 time of 451 s is consumed by the medoid calculation. However, this time can be greatly reduced, since each cluster can be processed independently in parallel.
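The parallelization opportunity can be sketched as follows. This is our simplification, not the paper's implementation: VWC is only approximated by a leader-style single scan, and the width-learning heuristic (half the mean distance to the centroid), the size guard, and the function names are all assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fixed_width_pass(X, width):
    """Single scan: assign each point to the first centre within `width`,
    otherwise open a new cluster (leader-style approximation of VWC)."""
    centres, members = [], []
    for i, x in enumerate(X):
        for c, idx in zip(centres, members):
            if np.linalg.norm(x - c) <= width:
                idx.append(i)
                break
        else:
            centres.append(x)
            members.append([i])
    return members

def learn_width(X):
    # Assumed heuristic: half the mean distance to the centroid.
    centroid = X.mean(axis=0)
    return 0.5 * np.mean(np.linalg.norm(X - centroid, axis=1))

def micro_cluster(X, max_size, indices=None):
    """Recursively split any cluster larger than `max_size`, relearning
    the width inside each oversized cluster."""
    indices = np.arange(len(X)) if indices is None else indices
    clusters = []
    for local in fixed_width_pass(X[indices], learn_width(X[indices])):
        group = indices[local]
        # The `< len(indices)` guard stops recursion when a cluster
        # refuses to split further, preventing infinite descent.
        if len(group) > max_size and len(group) < len(indices):
            clusters.extend(micro_cluster(X, max_size, group))
        else:
            clusters.append(group)
    return clusters

def medoid(X, group):
    """Index of the cluster member minimizing total distance to the rest."""
    D = np.linalg.norm(X[group][:, None] - X[group][None, :], axis=-1)
    return group[int(np.argmin(D.sum(axis=0)))]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
clusters = micro_cluster(X, max_size=50)
# Each cluster is independent, so the medoids can be computed in parallel.
with ThreadPoolExecutor() as ex:
    medoids = list(ex.map(lambda g: medoid(X, g), clusters))
```

Because each medoid computation touches only its own cluster, the per-cluster work distributes over workers with no coordination, which is why this step, rather than the single-scan clustering, is the natural target for parallelization.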