Consistent with the fault diagnosis method, we implemented several incremental learning algorithms for data streams in conjunction with a rebalancing and drift detection method. We then compared their performance with that of the basic implementations of these algorithms.
3.2. Evaluation Method and Imbalanced Stream Evaluation Metrics
The diagnosis method performs online classification of network failures, making it a learning model that must be evaluated in a streaming context. Prequential evaluation is the most commonly used method for this task [28], so we adopted this approach.
This evaluation method uses each incoming instance first to test and then to train the model, ensuring that the evaluation is always performed on previously unseen instances. Each evaluation step updates the confusion matrix, so we can read the matrix every n incoming instances to compute a set of five performance metrics. This yields a performance history over time, that is, a record of how the fault classification has adapted. In this experiment, n is equal to 200 samples.
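The test-then-train loop described above can be sketched as follows. This is a minimal illustration, not the MOA or RebalanceStream implementation: the `MajorityClassLearner` class, the `prequential` function, and the `(features, label)` stream format are hypothetical names introduced here, and label 1 is assumed to denote the failure (minority) class.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical stand-in for a real stream classifier).

    It predicts the most frequent label seen so far.
    """

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training, fall back to the healthy class (0).
        return self.counts.most_common(1)[0][0] if self.counts else 0

    def learn(self, x, y):
        self.counts[y] += 1


def prequential(stream, model, n=200):
    """Test-then-train evaluation.

    Returns a list of (instances_seen, confusion_matrix) snapshots,
    one for every n incoming instances.
    """
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    history = []
    for seen, (x, y) in enumerate(stream, start=1):
        y_hat = model.predict(x)  # test first, on an instance not yet used for training
        if y == 1:
            cm["tp" if y_hat == 1 else "fn"] += 1
        else:
            cm["fp" if y_hat == 1 else "tn"] += 1
        model.learn(x, y)  # then train on the same instance
        if seen % n == 0:
            history.append((seen, dict(cm)))  # copy, so later updates do not alter it
    return history


# Synthetic imbalanced stream (assumed data): one failure in every ten instances.
stream = [((i,), 1 if i % 10 == 0 else 0) for i in range(400)]
history = prequential(stream, MajorityClassLearner(), n=200)
```

Each snapshot in `history` holds the cumulative confusion matrix at that point, from which the performance metrics can be computed to build the performance curve over time.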
For comparison purposes, this evaluation method was applied to the proposed diagnosis method, which performs incremental rebalancing learning, and to the online base algorithm without any rebalancing strategy. Although no separate section is explicitly labeled as an ablation study, this experimental design inherently performs a component-level ablation analysis by evaluating each incremental learning algorithm under two controlled configurations: (i) operating independently on the original data stream, and (ii) operating within the proposed diagnosis pipeline that incorporates incremental rebalancing and explicit drift detection. By keeping the learning algorithm, data stream, and evaluation protocol identical across both configurations, this comparison isolates the contribution of the rebalancing and drift-handling components to the observed performance differences.
We deal with an imbalanced data stream, so traditional metrics are inadequate for evaluating the performance of the fault diagnosis method. Prequential accuracy is the most commonly used measure [53], but accuracy can be misleading in an imbalanced context. Thus, ref. [53] suggests using the prequential AUC [54], G-mean, and recall (sensitivity) metrics for imbalanced data streams.
In addition, RebalanceStream uses the standard kappa statistic within its rebalancing process, so, to remain faithful to the base code, we included the kappa statistic as one of the performance measures. Further, the kappa statistic is widely used when dealing with imbalanced data [55]. Nevertheless, some studies prefer the Matthews Correlation Coefficient (MCC) over the kappa statistic [56], so that measure was also used.
The above reasons led us to select these five metrics (recall or sensitivity, G-mean, kappa, MCC, and prequential AUC) as the most informative to evaluate the fault diagnosis method. Prequential AUC is a new metric suitable for data stream scenarios proposed by [54], and MOA provides its calculation; thus, our experiment uses this implementation. The other four metrics applied in a streaming context are measured following the prequential evaluation process.
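For reference, the four confusion-matrix metrics can be derived from the cumulative counts as sketched below, using the standard textbook definitions for the binary case. The function name `imbalance_metrics` and the choice to return NaN when a denominator is zero are assumptions made for illustration; this is not the code used in the experiment.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Recall, G-mean, kappa and MCC from binary confusion-matrix counts.

    A metric whose denominator is zero is reported as NaN (an assumption).
    """
    nan = float("nan")
    total = tp + fp + tn + fn

    recall = tp / (tp + fn) if (tp + fn) else nan       # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else nan  # true negative rate
    g_mean = math.sqrt(recall * specificity)            # NaN propagates if either is NaN

    if total:
        p_observed = (tp + tn) / total
        # Agreement expected by chance, from the row and column marginals.
        p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
        kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else nan
    else:
        kappa = nan

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else nan

    return {"recall": recall, "g_mean": g_mean, "kappa": kappa, "mcc": mcc}


# Example with assumed counts: a classifier that only ever predicts "healthy".
majority_only = imbalance_metrics(tp=0, fp=0, tn=95, fn=5)
```

In this example the accuracy would be 0.95, yet recall and G-mean are 0, kappa is 0, and MCC is undefined, which illustrates why accuracy alone is misleading under class imbalance.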
3.3. Experimental Results and Discussion
The prequential evaluation provides a set of five values, one for each metric, representing the fault diagnosis performance as the data arrive. Each prequential metric is plotted in a line chart, resulting in 250 performance graphs in the experiment (five metrics for each of the twenty-five algorithms and both dataset files).
Each graph represents the performance curve of the tested algorithm according to the corresponding metric.
Figure 3a illustrates the format of the obtained curves. The vertical axis indicates the metric values, and the horizontal axis shows the number of instances that have arrived up to the measurement time. The black series shows the results of the diagnosis method, that is, with the rebalance stream strategy. The red series shows the performance of the online base algorithm without rebalancing or concept drift treatment.
Due to the large number of graphs obtained in the experiment, it was necessary to condense the results and represent them visually to facilitate comparison of the algorithms’ performance according to each metric and to determine whether the rebalancing strategy for fault classification represents an improvement over the base algorithm. Therefore, heat maps were created for each of the metrics.
Three measurements were obtained to create the heatmaps and their calculation was also incorporated into the code of RebalanceStream:
The mean of the rebalanced curve (black curve),
The mean of the base curve, without balancing (red curve),
The weighted difference between them.
The weighted difference is a measure between 0 and 1, indicating the difference between the two performance lines (black and red curve), considering a weighting function that assigns some importance to the performance obtained at each measurement.
The authors of [31] argue that each instance becomes increasingly less significant to the overall average. Accordingly, our weighting function is a convex exponential function that assigns a high weight to differences in the initial intervals and a progressively lower weight afterwards. For example, Figure 3b shows the weighting function for a prequential evaluation of 20 measurements. The weighted difference is the summation of the weighted differences at each point, and the result is normalized to obtain a value between 0 and 1.
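The weighted difference can be sketched as follows. The text does not give the exact weighting function, so the form exp(-decay * k), the `decay` value, and the normalization by the sum of the weights are all assumptions chosen only to match the description (convex, exponential, decreasing). The sign is kept so that a negative result marks a degradation; its magnitude stays between 0 and 1 when the metric values lie in [0, 1].

```python
import math


def exponential_weights(m, decay=0.2):
    """Convex, decreasing exponential weights for m measurements, summing to 1.

    Both the functional form and the decay rate are hypothetical.
    """
    if m <= 0:
        raise ValueError("at least one measurement is required")
    raw = [math.exp(-decay * k) for k in range(m)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_difference(rebalanced, base, decay=0.2):
    """Weighted sum of pointwise differences between two performance curves."""
    if len(rebalanced) != len(base):
        raise ValueError("both curves need the same number of measurements")
    weights = exponential_weights(len(base), decay)
    return sum(w * (r - b) for w, r, b in zip(weights, rebalanced, base))


# Assumed curves with 20 measurements: the same gain placed early or late.
flat_base = [0.4] * 20
early_gain = weighted_difference([0.9] * 10 + [0.4] * 10, flat_base)
late_gain = weighted_difference([0.4] * 10 + [0.9] * 10, flat_base)
```

Under these assumptions, `early_gain` is larger than `late_gain`, because differences in the initial intervals receive more weight.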
For each performance metric (recall or sensitivity, G-mean, kappa, MCC, and prequential AUC), three heatmaps were obtained representing the three measures mentioned above (mean of the rebalanced curve, mean of the base curve, and weighted difference) for the 25 algorithms in Table 1.
Figure 4a shows the color settings used in the heatmaps. In the color configuration for the mean measurements (rebalanced and base), a redder color indicates a better measurement as it is closer to 1. Conversely, a lighter color signifies that the value is closer to 0, indicating worse performance. In the color configuration for the weighted difference, a positive difference is colored green; the more saturated the green, the more significant the improvement. Gray indicates no difference, while blue signifies a negative difference.
Figure 4b–f present the resulting heatmaps. As can be seen, some fields are set to “NaN”, which means that the metric could not be calculated at some measurement points (due to very poor classification behavior), so the developed tool was unable to calculate the mean of the corresponding metric.
The sensitivity during fault diagnosis is higher when using the rebalancing and concept drift approach than when using only the base incremental learning algorithm. Thus, the diagnosis method learns from the minority class despite the severe class imbalance. This behavior is attributed to the fact that SMOTE was able to engage in online learning, positively affecting failure classification.
The prequential measurements of G-Mean, kappa, and MCC also confirm that the diagnosis method does not neglect the learning of either of the two states of the network (failure and healthy).
The intensity of the color in the weighted difference heatmap for the Sensitivity, G-mean, kappa, and MCC metrics suggests that using an incremental algorithm without the components of the proposed diagnosis method is not enough. On the other hand, most of the base algorithms incorporate concept drift (as indicated in Table 1 of the experiment setup); however, they perform poorly, highlighting the need for a rebalancing and concept drift detection procedure external to the algorithm.
Meanwhile, the proposed diagnosis method and the online base classifier had similar prequential AUC. If this is contrasted with the results mentioned above, it is safe to say that this metric does not provide reliable information to compare the two scenarios, nor to evaluate the classification performance of imbalanced data streams.
If an online base algorithm were to be selected for the diagnosis method, it is evident that some do not perform well, even with the rebalancing process. However, this occurs with only a few of them. Twenty-one algorithms out of twenty-five yield excellent results when coupled with the concept drift detector and the rebalancing method, as the proposed diagnosis method suggests. Hence, this proposal offers a valuable approach to network fault diagnosis and is particularly suitable for network scenarios where monitoring data is received in real time.
To numerically summarize the performance of the 21 algorithms with good results in terms of ranks obtained for each metric, it can be observed that for the diagnosis method, a Recall range of 0.8033 to 0.95426 was obtained, while for the base algorithms, the range is 0.22496 to 0.73097. In the case of G-mean, a range of 0.86902 to 0.97121 was achieved for our method, while for the base algorithms, the range is 0.41878 to 0.8527. For the Kappa metric, a range of 0.77976 to 0.94918 was obtained with the proposed diagnosis method, while the Kappa range for the base algorithms is 0.22223 to 0.75529. In the case of MCC, a range of 0.79648 to 0.94929 was obtained for the diagnosis method and a range of 0.2591 to 0.76025 for the base algorithms.
When considering the ranges, it becomes evident that the diagnosis method, as presented in this article, outperforms the individual use of the learning algorithms it incorporates.
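The reported ranges can be checked for overlap with a short calculation. The sketch below uses only the range endpoints stated above for the 21 well-performing algorithms; the dictionary layout and the `separation_margin` helper are illustrative names introduced here.

```python
# Range endpoints (min, max) as reported in the text for the 21 algorithms.
reported_ranges = {
    "Recall": {"diagnosis": (0.8033, 0.95426), "base": (0.22496, 0.73097)},
    "G-mean": {"diagnosis": (0.86902, 0.97121), "base": (0.41878, 0.8527)},
    "Kappa": {"diagnosis": (0.77976, 0.94918), "base": (0.22223, 0.75529)},
    "MCC": {"diagnosis": (0.79648, 0.94929), "base": (0.2591, 0.76025)},
}


def separation_margin(ranges):
    """Lowest diagnosis-method value minus highest base-algorithm value.

    A positive margin means the two ranges do not overlap.
    """
    return ranges["diagnosis"][0] - ranges["base"][1]


margins = {metric: round(separation_margin(r), 5) for metric, r in reported_ranges.items()}
all_separated = all(margin > 0 for margin in margins.values())
```

For every metric the margin is positive (about 0.072 for Recall, 0.016 for G-mean, 0.024 for Kappa, and 0.036 for MCC), so even the lowest value obtained with the diagnosis method exceeds the highest value obtained by a base algorithm.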
When highlighting the best algorithms, we found that for each individual metric there is a specific algorithm with the best performance. The diagnosis method achieved the highest recall value with the BOLE algorithm, the highest G-mean value with the ADOB algorithm, the highest Kappa value with SAM-kNN, and the highest MCC value with the SAM-kNN algorithm.
When comparing the metric values for the algorithms implemented within the diagnosis method and those for the individual algorithms, the SAM-kNN algorithm consistently showed the most significant difference. The models built based on this algorithm differed by 0.73097 for Recall, 0.8527 for G-mean, 0.75529 for Kappa, and 0.76025 for MCC.
The experimental results show that integrating drift detection with incremental rebalancing significantly improves minority-class detection, particularly in terms of recall and G-mean. This behavior can be explained by the interaction between class imbalance and concept drift in streaming environments. Under severe imbalance, incremental learners tend to favor the majority class, causing minority patterns to be underrepresented or forgotten over time.
When concept drift occurs, this bias is further amplified, as newly emerging minority patterns are often misclassified during adaptation phases. The application of incremental rebalancing mitigates this effect by increasing the representation of minority-class instances during training, enabling the classifier to better adapt to evolving concepts. Drift detection further supports this process by triggering timely model updates, preventing performance degradation caused by outdated decision boundaries.
Although some incremental classifiers incorporate internal drift-handling mechanisms, the results indicate that these mechanisms alone are insufficient under extreme imbalance conditions. The proposed method complements such classifiers by addressing imbalanced data explicitly, leading to more stable and robust performance across different learning algorithms.
For clarity, Table 2, Table 3, Table 4, Table 5 and Table 6 present a representative subset of 10 algorithms, selected based on performance ranking, sensitivity to rebalancing, and algorithmic diversity. Although Figure 4 reports the metric results for all evaluated algorithms, the tables focus on this subset to facilitate interpretation. The selected algorithms cover the main algorithmic families and illustrate both positive and negative effects of the rebalancing and concept drift approach. In all tables, results summarize the average prequential performance over both datasets, comparing the base configuration and the rebalance stream strategy. The reported qualitative effects are derived from the weighted performance differences illustrated in Figure 4, averaged over both datasets.
Table 2 reports the comparative Sensitivity/Recall performance of the selected algorithms under the base configuration and the proposed method. The results show that incorporating incremental rebalancing consistently improves minority-class detection for most learning paradigms, confirming that the proposed approach effectively mitigates the bias toward the majority class commonly observed in incremental learners. Ensemble-based methods benefit the most from rebalancing, whereas some margin-based and instance-based algorithms exhibit limited gains or slight degradation. Overall, the table highlights that rebalancing plays a critical role in improving sensitivity in highly imbalanced data streams, without contradicting the broader trends observed in the remaining evaluation metrics.
Table 3 reports the comparative G-Mean performance of the selected algorithms. Unlike Recall, G-Mean reveals that not all methods benefiting from rebalancing achieve balanced classification. Ensemble-based approaches such as ARF, AWE, and Online Bagging maintain or improve G-Mean, indicating robustness to class imbalance. In contrast, boosting and margin-based methods show performance degradation, suggesting that incremental rebalancing may introduce noise that negatively affects majority-class accuracy.
Table 4 reports the Kappa Statistic comparison. Unlike Recall and G-Mean, Kappa reveals that the rebalancing and concept drift approach does not consistently improve chance-corrected agreement. Ensemble-based methods such as ARF and AWE maintain stable Kappa values, indicating robustness to class imbalance. In contrast, boosting and margin-based learners exhibit noticeable degradation, suggesting that incremental rebalancing may introduce inconsistencies that negatively affect global agreement.
Table 5 reports the MCC comparison. In contrast to Recall and G-Mean, MCC reveals that rebalancing rarely leads to substantial improvements, as the metric penalizes both false positives and false negatives simultaneously. Only robust ensemble-based methods such as ARF and AWE maintain stable or slightly improved MCC values, whereas boosting, margin-based, and instance-based methods experience noticeable degradation.
Table 6 reports the AUC comparison. Unlike Recall, G-Mean, Kappa, and MCC, AUC shows limited sensitivity to the stream rebalancing and concept drift strategy, with most methods exhibiting nearly identical performance under both configurations. This behavior confirms that AUC mainly captures ranking capability rather than class-specific performance, reinforcing the need to jointly analyze imbalance-aware metrics when evaluating fault diagnosis in data streams.
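The distinction between ranking capability and class-specific performance can be illustrated with a toy example. The scores below are invented for illustration and are not experimental data; the AUC here is the plain batch AUC computed from pairwise comparisons, not the prequential AUC provided by MOA, and the 0.5 decision threshold is an assumption.

```python
def auc(scores, labels):
    """Batch AUC: probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def recall_at(scores, labels, threshold=0.5):
    """Recall when instances scoring at or above the threshold are flagged as failures."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return tp / (tp + fn)


# Eight healthy instances (0) and two failures (1); scores are made up.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Two hypothetical models that rank the instances identically but differ in
# how high they score the failures relative to the threshold.
conservative = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.45, 0.30, 0.40]
shifted = [0.30, 0.35, 0.35, 0.40, 0.45, 0.45, 0.50, 0.70, 0.55, 0.65]

auc_conservative = auc(conservative, labels)
auc_shifted = auc(shifted, labels)
recall_conservative = recall_at(conservative, labels)
recall_shifted = recall_at(shifted, labels)
```

Both score sets have the same AUC because the ordering of instances is identical, while recall at the fixed threshold goes from 0 to 1. This is one way in which AUC can remain unchanged while threshold-dependent metrics differ.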
The comparative tables summarizing Sensitivity/Recall, G-Mean, Kappa Statistic, MCC, and AUC provide a consolidated view of the qualitative and quantitative effects of integrating incremental rebalancing with concept drift detection across representative learning algorithms. The results indicate that the proposed diagnosis method consistently improves or preserves performance in imbalance-aware metrics, particularly Sensitivity and G-Mean, while maintaining stable behavior in Kappa and MCC, which account for agreement beyond chance and balanced error distribution. In contrast, the use of base incremental algorithms without explicit rebalancing generally results in weaker performance across these metrics, even when internal drift-handling mechanisms are present. The tables further show that AUC remains largely unchanged between configurations, supporting the observation that ranking-based metrics alone are insufficient to capture the impact of imbalance and drift in streaming scenarios. Overall, the tabular analysis complements the heatmap-based visualization by clarifying which learning paradigms benefit from the proposed approach and by highlighting its consistent advantages under severe class imbalance.