FADES: Adaptive Drift Estimation via Conformal Signals for Streaming Intrusion Detection

Barrett, Seth; Dorai, Gokila; Li, Lin; Rajaganapathy, Swarnamugi

doi:10.3390/electronics15102114


¹ School of Computer and Cyber Sciences, Augusta University, Augusta, GA 30912, USA

² DFAIR Lab, Augusta, GA 30912, USA

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2114; https://doi.org/10.3390/electronics15102114

Submission received: 15 April 2026 / Revised: 8 May 2026 / Accepted: 12 May 2026 / Published: 14 May 2026

(This article belongs to the Special Issue Security and Privacy Challenges in Integrated IoT and Edge Systems)


Abstract

Machine learning-based intrusion detection systems (IDS) deployed in real-world environments frequently degrade due to concept drift, where evolving traffic patterns invalidate assumptions learned during training. This challenge is especially pronounced in Internet of Things (IoT) environments, where device behavior changes over time due to user interaction, firmware updates, and emerging attack strategies. Prior work introduced FIRCE, a framework that integrates conformal evaluation into streaming IDS pipelines to enable uncertainty-aware drift detection and adaptive retraining. In this journal extension, we present FADES, a framework for adaptive drift estimation that generalizes drift monitoring beyond prediction-space uncertainty by supporting both conformal evaluation and representation-space detectors within a unified streaming architecture. FADES incorporates multiple conformal evaluation variants, including Approximate Cross-Conformal Evaluation, which preserves the statistical structure of cross-conformal evaluation while eliminating repeated model training, as well as an Adaptive Chunking Controller that dynamically balances detection responsiveness and computational cost. We extend prior work through three major contributions: (i) a variance-aware evaluation protocol comprising 375 simulations across multiple seeds and runs, (ii) integration of a contrastive autoencoder-based detector to enable direct comparison between prediction-space and representation-space drift detection, and (iii) expanded evaluation across in-domain and cross-dataset transfer settings using UNSW-NB15, CICIDS2018, and a real-world IoT testbed. 
Approx-CCE achieves performance comparable to standard cross-conformal evaluation across hundreds of simulations, providing empirical evidence that the statistical benefits of CCE derive primarily from its disjoint calibration partition structure rather than fold-specific model diversity, a finding with implications for conformal evaluation in repeated recalibration settings more broadly. In contrast, representation-space drift detection via CADE incurs substantial computational cost under repeated retraining, limiting its practicality in streaming settings. These findings demonstrate that conformal evaluation provides a statistically grounded and computationally efficient foundation for real-time drift-aware intrusion detection, and that FADES enables flexible, unified evaluation of drift detection strategies under realistic deployment conditions.

Keywords:

conformal evaluation; concept drift; Intrusion Detection Systems; IoT security; streaming machine learning; adaptive retraining; cross-conformal evaluation; adaptive chunking

1. Introduction

As Internet of Things (IoT) devices become increasingly integrated into critical infrastructure and everyday environments, their exposure to cyber threats grows proportionally. Intrusion Detection Systems (IDS) play a vital role in identifying unauthorized or anomalous behavior within these networks. However, most IDS models are trained in static environments and then deployed in highly dynamic settings. As traffic patterns evolve due to firmware updates, user behavior, environmental changes, and emerging attacks, deployed models experience concept drift, where the underlying data distribution shifts in ways that invalidate earlier training assumptions. This results in degraded predictive performance and can increase false negatives for malicious activity [1].

To address this challenge, recent research has explored uncertainty-aware and adaptive learning approaches that trigger retraining or rejection of unreliable predictions. Among these, conformal evaluation (CE) techniques are particularly attractive because they quantify model confidence through calibrated nonconformity scores and p-values without requiring labels at test time. This enables statistically grounded rejection of low-confidence predictions and provides a natural signal for detecting violations of the i.i.d. assumption in streaming environments [2,3,4,5,6]. Complementary approaches include representation-space drift detection, pseudo-labeling pipelines, continual learning systems, and adaptive windowing strategies [7,8,9,10,11,12,13]. Despite these advances, practical deployment remains challenging. Recalibration can be computationally expensive, drift signals may be noisy or delayed, and fixed processing strategies often fail to balance responsiveness and efficiency under dynamic conditions.

In prior conference work (FIRCE) [14], we introduced a framework for integrating CE into streaming intrusion detection pipelines. FIRCE demonstrated that prediction-space uncertainty signals can effectively drive drift detection and adaptive retraining in IoT IDS settings, while remaining computationally efficient through mechanisms such as Approximate Cross-Conformal Evaluation (Approx-CCE) and adaptive chunking. However, FIRCE focuses exclusively on CE as the source of drift signals, limiting its ability to incorporate alternative detection paradigms.

In this work, we present FADES, a Framework for Adaptive Drift Estimation via Conformal Signals, which extends FIRCE by generalizing the drift monitoring layer. Rather than proposing a single new drift detection method, FADES provides a unified streaming architecture that supports both prediction-space uncertainty signals and representation-space drift detectors within a consistent pipeline (Figure 1). This design enables controlled, apples-to-apples comparison between fundamentally different approaches to drift detection, including CE and contrastive autoencoder-based methods such as CADE, under identical retraining and evaluation conditions.

This journal extension builds on FIRCE in three key directions. First, we expand the empirical evaluation through a multi-run, multi-seed protocol, enabling variance-aware analysis across 375 total simulations. Second, we integrate CADE as a representation-space drift detector within the same streaming pipeline, allowing direct comparison with CE in terms of detection performance and runtime feasibility. Third, we broaden the evaluation scope to include cross-dataset transfer settings using UNSW-NB15 and CICIDS2018, alongside a real-world IoT testbed, improving the realism and generality of the study. Together, these extensions enable a more comprehensive and operationally grounded evaluation of drift detection strategies in streaming intrusion detection.

The main contributions of this work are as follows:

We present FADES, a modular framework that generalizes drift detection in streaming IoT intrusion detection by unifying conformal evaluation and representation-space methods within a single pipeline.
We integrate CADE into the streaming pipeline, enabling controlled comparison between prediction-space and representation-space drift detection approaches under identical conditions.
We conduct a large-scale empirical evaluation consisting of 375 simulations across multiple seeds, runs, and dataset transfer settings, providing variance-aware performance analysis.
We demonstrate that conformal-evaluation-based drift detection achieves comparable performance to more complex methods while maintaining significantly lower runtime overhead.

Research Questions

To guide our evaluation, we investigate the following research questions:

RQ1: Can conformal-evaluation-based drift detection maintain detection performance while reducing computational overhead compared with alternative approaches?

RQ2: How does representation-space drift detection (e.g., CADE) compare with prediction-space uncertainty signals in terms of runtime feasibility and operational practicality?

RQ3: How stable are the observed performance characteristics across multiple runs, random seeds, and dataset transfer settings?

RQ4: Does a unified drift monitoring framework enable consistent evaluation and comparison across fundamentally different drift detection paradigms?

The remainder of this paper is organized as follows: Section 2 reviews prior work on CE, drift adaptation, and adaptive processing in intrusion detection. Section 3 presents the dataset collection process, baseline models, conformal evaluators, and the streaming simulation pipeline. Section 4 reports the empirical findings and evaluates the proposed research questions. Section 5 discusses novelty, operational implications, limitations, and future directions. Section 6 concludes the paper.

2. Related Work

The challenge of adapting machine learning models to dynamic, non-stationary environments has generated extensive research across concept drift detection, adaptive retraining, uncertainty quantification, and streaming control. In this section, we review the most relevant work for FADES.

2.1. Conformal Evaluation and Drift Detection

Conformal evaluation is derived from conformal prediction theory and provides a way to assess the reliability of model predictions through nonconformity scores and associated p-values [2,3]. In contrast with conformal predictors, which produce prediction sets, conformal evaluators quantify the plausibility of individual predictions and can be used to reject low-confidence outputs. This makes CE especially attractive for streaming settings where prediction reliability may degrade before labels become available.

In security, Transcend introduced conformal prediction as a mechanism for rejecting suspicious malware classifications under distribution shift [5]. Transcending Transcend later extended this line of work by formalizing additional evaluators, including inductive and cross-conformal approaches, and improving computational practicality [6]. These works are foundational for FADES because they show that conformal p-values can serve as a statistically grounded signal of i.i.d. assumption violations, which often correspond to drift in deployed security systems.

Compared with unsupervised anomaly detectors or heuristic confidence filters (Table 1), CE offers an appealing separation between prediction and uncertainty estimation. This separation enables drift detection to remain classifier agnostic and interpretable while retaining formal calibration under the usual exchangeability assumptions [4].

2.2. Model Retraining and Adaptation Under Concept Drift

Adaptation under drift in intrusion detection broadly follows three directions. First, score-based approaches use uncertainty, nonconformity, or p-value signals to detect drift and trigger recalibration or retraining [5,6]. Second, confidence-driven and label-efficient pipelines attempt to minimize annotation cost by using pseudo-labeling, active learning, or confidence filtering to determine which samples should be incorporated into model updates [8,11]. Third, representation-centric methods seek to learn robust feature spaces so that downstream adaptation can remain lightweight [9,15,16,17].

CADE is particularly relevant here because it uses a contrastive autoencoder to identify drifting samples in representation space [7]. This makes it an appealing comparison point because it represents a conceptually different family of drift detection from conformal evaluation. However, it also introduces different computational and modeling requirements, which become especially important in streaming deployment settings.

FADES primarily follows the first direction, but it remains compatible with ideas from the latter two. Its rolling-buffer design can support confidence-based retraining policies, and its architecture can in principle be paired with representation-space detectors.

2.3. Adaptability in Streaming IDS

Adaptive processing mechanisms are frequently used to balance responsiveness and efficiency under drift. Methods such as ADWIN dynamically adjust window sizes based on detected changes in the data stream [18], while other works employ exponential smoothing or early drift indicators to improve sensitivity without becoming overly reactive to noise [13,19,20,21,22,23].

In IoT intrusion detection, adaptive strategies have also appeared through lightweight aggregation and chunked fog-processing pipelines [24,25]. These works motivate FADES’s Adaptive Chunking Controller, which does not directly implement adaptive windowing in the classical sense, but instead applies the same principle to CE evaluation granularity. By dynamically adjusting chunk size based on observed drift behavior, FADES reduces unnecessary recalibrations during stable periods while preserving responsiveness during volatile periods.

2.4. Models for Nonconformity Measures

Nonconformity measures quantify how atypical a sample appears relative to calibration data and are often computed from either calibrated class probabilities or decision margins. The model used to produce these scores affects calibration efficiency, retraining cost, score stability, and suitability for streaming use [4,26,27,28,29,30].

Table 2 summarizes common model families for nonconformity measure computation and highlights the practical reasons we center FADES on compact multilayer perceptrons for conformal evaluation. In particular, multilayer perceptrons naturally provide probability-based nonconformity measures, support lightweight temperature scaling, retrain quickly, and remain well suited to tabular network flow data [31,32,33,34].

3. Materials and Methods

This section outlines the data collection process, model configuration, CE framework, and streaming simulation pipeline used to evaluate FADES.

3.1. IoT Traffic Capture, Labeling, and Flow Preprocessing

To develop and evaluate FADES, we collected labeled IoT network traffic in a controlled laboratory environment. Our experimental network included a TP-Link TL-WR541N router running OpenWrt firmware and ten commercial off-the-shelf IoT devices spanning smart home, security, and entertainment domains. Each device was assigned a static IP address. Device interactions were performed through the corresponding mobile applications or voice assistants to ensure consistent and repeatable behavioral profiling as seen in Table 3.

We conducted two sets of data captures: a training dataset containing benign traffic and attack simulations for model training and a drift dataset containing additional captures under concept drift conditions to evaluate adaptation. All traffic captures were performed using tcpdump on a Debian Linux laptop directly connected to the IoT subnet. For each device, an 8-hour capture was collected. During these captures, a series of network-layer attacks was launched from the same Debian laptop, including TCP SYN flood, XMAS tree flood, UDP flood, HTTP flood, and, in the drift captures, HULK HTTP flood [35,36,37].

To ensure consistency across devices and maintain balanced exposure to each attack type, attacks were executed according to predefined schedules. Each attack type was launched three times across the 8-hour window for each device. A cooldown period was preserved in the final portion of each capture to avoid trailing attack packets contaminating benign traffic.

Following raw PCAP collection, flows were extracted using CICFlowMeter and labeled according to attack logs. For binary classification, the label was derived from the Label or BinLabel column, mapping benign traffic to 0 and all attack traffic to 1. All flows with missing or undefined feature values (e.g., NaNs or infinities) were removed prior to training; no imputation was performed due to the low frequency of such values. Before model training, all numeric flow features were standardized using z-score normalization (StandardScaler).
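The preprocessing steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `preprocess`, the `benign_value` string, and the example feature names are assumptions, and the population standard deviation is used because that is what StandardScaler computes.

```python
import math
from statistics import fmean, pstdev

def preprocess(rows, feature_names, label_key="Label", benign_value="Benign"):
    """Drop flows with NaN/inf features, binarize labels, z-score the features."""
    clean, labels = [], []
    for row in rows:
        values = [float(row[name]) for name in feature_names]
        if not all(math.isfinite(v) for v in values):
            continue  # no imputation: the flow is simply removed
        clean.append(values)
        labels.append(0 if row[label_key] == benign_value else 1)

    # Column-wise mean and population standard deviation.
    columns = list(zip(*clean))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant column: scale of 1
    scaled = [[(v - m) / s for v, m, s in zip(vals, means, stds)] for vals in clean]
    return scaled, labels, means, stds

# Hypothetical flow records; the third one is dropped because of the infinity.
rows = [
    {"Flow Duration": 10.0, "Tot Fwd Pkts": 2.0, "Label": "Benign"},
    {"Flow Duration": 30.0, "Tot Fwd Pkts": 4.0, "Label": "SYN Flood"},
    {"Flow Duration": float("inf"), "Tot Fwd Pkts": 1.0, "Label": "Benign"},
]
X, y, means, stds = preprocess(rows, ["Flow Duration", "Tot Fwd Pkts"])
```

Returning `means` and `stds` allows the same transformation to be reapplied to later data; whether the scaler is refit during retraining is not specified in the text and is left open here.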

3.2. Benchmark Datasets and Collection Provenance

In addition to our in-lab DFAIR and DFAIR Drift IoT datasets (Section 3.1), we evaluate cross-dataset transfer using two widely used [38] network intrusion detection benchmarks: CSE-CIC-IDS2018 (CICIDS2018) [39] and UNSW-NB15 [40]. This allows us to examine how FADES behaves when trained on datasets originating from substantially different environments, including enterprise network emulation and cyber-range traffic generation, while the target stream reflects IoT network behavior.

3.2.1. CSE-CIC-IDS2018 (CICIDS2018)

CSE-CIC-IDS2018 is a large-scale enterprise-network emulation dataset generated by the Canadian Institute for Cybersecurity [39]. The dataset simulates a multi-department corporate network with hundreds of hosts and multiple attacker machines executing scenario-driven attacks across several capture days. Attack scenarios include brute-force attacks, Heartbleed exploitation, botnet activity, denial-of-service attacks, web attacks, and infiltration scenarios.

Raw PCAP traffic was collected across the simulated network and processed using CICFlowMeter-V3 to produce bidirectional flow records containing more than eighty statistical features [39,41]. Ground-truth labels are derived from the attack execution schedule combined with network identifiers such as IP address, port, and protocol.

3.2.2. UNSW-NB15

UNSW-NB15 is a cyber-range dataset generated using the IXIA PerfectStorm traffic generator to produce modern benign traffic mixed with synthetic attack scenarios [40]. Traffic was captured as PCAP files using tcpdump and subsequently processed into tabular datasets containing labeled network flows.

The original UNSW-NB15 release derives features using Argus and Bro (Zeek) to produce 49 flow-based attributes. However, many downstream studies use CICFlowMeter-derived variants of UNSW-NB15 to enable feature alignment with CIC-style datasets. In our experiments, we use a CICFlowMeter-compatible variant so that feature representations remain consistent across datasets.

3.2.3. Flow-Export Harmonization for Transfer

Flow-based intrusion detection experiments are sensitive to the feature-extraction tool used to derive flows from PCAP data. Even when feature names match, different exporters may compute statistics using slightly different aggregation rules or timeouts.

To reduce exporter-induced differences, our experiments use datasets that expose a common CICFlowMeter-style feature representation. Both CICIDS2018 and our DFAIR datasets are derived directly from CICFlowMeter-style flow extraction, while the UNSW-NB15 dataset is used in a CICFlowMeter-compatible form when available.

Across all datasets used in this work, the same subset of CIC-style flow attributes is retained to ensure consistent model input dimensionality. These include temporal statistics, packet-length distributions, TCP flags, throughput statistics, and inter-arrival time measurements derived from bidirectional flows.
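As a rough sketch of how a shared feature subset keeps the model input dimensionality fixed across datasets, each flow record can be projected onto one ordered feature list. The helper `align_features` and the column names below are hypothetical and only stand in for the actual CIC-style attributes.

```python
def align_features(rows, shared_features):
    """Project each flow record onto the shared feature list, in a fixed order."""
    missing = [name for name in shared_features if rows and name not in rows[0]]
    if missing:
        raise KeyError(f"dataset lacks shared features: {missing}")
    return [[row[name] for name in shared_features] for row in rows]

# Hypothetical shared subset and two records from differently exported datasets.
shared = ["Flow Duration", "Flow IAT Mean", "SYN Flag Cnt"]
source_rows = [{"Flow Duration": 12.0, "Flow IAT Mean": 3.5, "SYN Flag Cnt": 1, "Extra": 9}]
target_rows = [{"SYN Flag Cnt": 0, "Flow IAT Mean": 0.8, "Flow Duration": 40.0}]

source_X = align_features(source_rows, shared)
target_X = align_features(target_rows, shared)
```

Fixing the column order in one place avoids silent misalignment when a source dataset and the target stream list the same attributes in different orders.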

3.2.4. Comparison with the DFAIR IoT Dataset

Unlike the benchmark datasets above, the DFAIR dataset was collected on a physical IoT testbed consisting of ten commercial devices connected through an OpenWrt router. Traffic captures were performed using tcpdump on a Debian laptop connected to the IoT subnet. Captured PCAP files were converted into bidirectional flow records using a Python v3.11 implementation of CICFlowMeter [42].

During each capture, a human participant interacted with the devices through mobile applications, voice assistants, and on-device controls while simultaneously generating normal network activity from a laptop and smartphone connected to the same network. This was intended to approximate realistic household network conditions.

Each device was monitored for approximately 8 h while scripted attack traffic was periodically injected. Baseline attacks included TCP SYN floods, UDP floods, HTTP floods, and XMAS scans. In the drift dataset, additional HULK-style HTTP flooding attacks were introduced to induce distribution shift.

Table 4 summarizes the collection characteristics of the datasets used in our experiments.

3.3. Baseline Classifier and CE Model Design

Our codebase supports multiple classifier families, including decision trees, random forests, linear support vector machines, XGBoost, and feedforward neural networks. In this paper, we report classifier results using a feedforward neural network because it provides a strong and efficient baseline for tabular flow data while keeping the manuscript focused on the CE study [31,32,43,44,45].

The feedforward neural network classifier uses two hidden layers with widths (64, 32), ReLU activations, dropout with probability 0.3 after each hidden layer, and a single-logit output trained using BCEWithLogitsLoss with Adam. Sigmoid is applied to logits at inference to obtain attack probabilities.

We use the term FNN for the classifier and MLP for the model used within the CE pipeline. This distinction is purely terminological and is intended to separate the classifier reported in the main results from the internal probability model used for nonconformity scoring.

Within the CE pipeline, FADES uses a compact multilayer perceptron consisting of three hidden layers with widths (256, 128, 64), GELU activations, LayerNorm, dropout with probability 0.2, BCEWithLogitsLoss, and Adam [46,47,48,49]. This model was selected because it natively produces probability-based nonconformity scores, supports lightweight temperature scaling, and retrains efficiently under streaming updates [31,33,34].

While CE is classifier agnostic, model choice affects operational efficiency in streaming environments. Prior CE-based security systems frequently employ support vector machines because their margins provide natural nonconformity scores. However, SVM-based pipelines often require repeated probability calibration (e.g., Platt scaling or isotonic regression) after each retraining step. In streaming settings where recalibration may occur frequently, this additional optimization overhead can become substantial.

In contrast, multilayer perceptrons naturally produce probability outputs through softmax or sigmoid activations. This allows probability-based nonconformity measures of the form $1 - p_{\hat{y}}$ to be computed directly and calibrated through lightweight temperature scaling. Empirically, this design enables significantly faster retraining cycles while maintaining stable calibration behavior on tabular flow features.
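The probability-based nonconformity measure above can be illustrated for a single-logit binary model. This is a sketch under stated assumptions: the function names are illustrative, the decision threshold of 0.5 is assumed, and the temperature is taken as already fitted (fitting it on calibration data is not shown).

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nonconformity(logit, temperature=1.0):
    """Return (predicted label, 1 - p_yhat) for a single-logit binary model."""
    p_attack = sigmoid(logit / temperature)   # temperature-scaled attack probability
    y_hat = 1 if p_attack >= 0.5 else 0       # assumed decision threshold
    p_yhat = p_attack if y_hat == 1 else 1.0 - p_attack
    return y_hat, 1.0 - p_yhat

label_raw, score_raw = nonconformity(2.0)                     # unscaled logit
label_soft, score_soft = nonconformity(2.0, temperature=2.0)  # softened probability
```

A temperature above 1 flattens the predicted probability, so the same logit yields a larger nonconformity score; the predicted label is unchanged because temperature scaling does not move the sign of the logit.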

While we focus on a feedforward neural network for clarity and consistency, evaluating the framework across a broader set of classifiers remains important future work to fully validate classifier-agnostic behavior.

3.4. Why Conformal Evaluation for Drift Detection

Conformal evaluation is particularly attractive for drift detection in IDS for the following three reasons:

It does not require ground-truth labels at test time, allowing drift to be signaled before delayed labels are available.
It supports per-class calibration through class-conditional score thresholds, improving sensitivity to distributional changes that affect different traffic classes differently.
It is classifier agnostic because it relies on model outputs rather than internal architecture details.

These properties make CE well suited for streaming IoT security environments, where fast reaction to changes is important and full supervision may lag behind live deployment [2,3,4].

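Under standard exchangeability assumptions, FADES computes class-conditional thresholds as follows:

$$\tau_{c} = \mathrm{Quantile}_{1 - \alpha}(S_{c})$$

and smoothed conformal p-values as follows:

$$p = \frac{1 + \sum_{s' \in S_{\hat{y}}} \mathbb{1}[s' \geq s]}{|S_{\hat{y}}| + 1},$$

where $S_{c}$ denotes the calibration nonconformity scores for class $c$, $s$ is the nonconformity score of the test sample, and $\hat{y}$ is its predicted class. These quantities provide the statistical validity foundation for the CE component of the framework.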
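The two quantities above can be computed directly from calibration scores grouped by class. The sketch below is illustrative only: the function names are hypothetical, and the empirical quantile convention (the smallest score with at least a 1 - α share of the class scores at or below it) is an assumption, since the text does not fix one.

```python
import math

def class_thresholds(cal_scores, cal_labels, alpha):
    """tau_c: the (1 - alpha) empirical quantile of calibration scores per class."""
    by_class = {}
    for s, c in zip(cal_scores, cal_labels):
        by_class.setdefault(c, []).append(s)
    thresholds = {}
    for c, scores in by_class.items():
        scores.sort()
        # Assumed convention: index of the smallest score covering (1 - alpha) of the class.
        k = max(0, math.ceil((1.0 - alpha) * len(scores)) - 1)
        thresholds[c] = scores[k]
    return by_class, thresholds

def smoothed_p_value(score, class_scores):
    """p = (1 + #{s' in S_yhat : s' >= s}) / (|S_yhat| + 1)."""
    at_least = sum(1 for s in class_scores if s >= score)
    return (1 + at_least) / (len(class_scores) + 1)

# Toy calibration set: four benign (class 0) scores and two attack (class 1) scores.
pools, taus = class_thresholds([0.1, 0.2, 0.3, 0.4, 0.05, 0.5], [0, 0, 0, 0, 1, 1], alpha=0.25)
p = smoothed_p_value(0.25, pools[0])  # two of four class-0 scores are >= 0.25
```

In this toy case the p-value is (1 + 2) / (4 + 1) = 0.6; the added 1 in numerator and denominator keeps the p-value strictly positive even when the test score exceeds every calibration score.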

3.5. Baseline Conformal Evaluators in FADES

FADES supports four conformal evaluation strategies:

ICE: a single-model inductive evaluator using a held-out calibration split.
CCE: a cross-conformal evaluator that aggregates nonconformity information across multiple folds.
Approx-TCE: an approximate transductive evaluator that reduces the cost of full transductive conformal evaluation.
Approx-CCE: our proposed lightweight approximation of CCE.

These evaluators differ mainly in calibration structure and runtime overhead (Table 5). ICE is simple but can become conservative. CCE is statistically attractive but expensive under repeated recalibration. Approx-TCE reduces the cost of transductive evaluation but still inherits multi-fold overhead. Approx-CCE is designed to retain the main benefits of cross-conformal calibration while avoiding repeated model training [5,6].

3.6. Approx-CCE

Standard CCE couples two distinct mechanisms: the use of fold-specific models (model diversity) and the use of disjoint calibration partitions (held-out scoring structure). These two properties are typically introduced together, but their individual contributions to CCE’s statistical behavior have not been empirically separated. Approx-CCE isolates the partition structure by fixing a single shared model, thereby providing a controlled test of the hypothesis that calibration partition design, rather than model diversity, is the primary driver of CCE’s nonconformity score quality. The empirical results in Section 4 support this hypothesis across multiple datasets and transfer conditions.

This design preserves the main statistical structure of cross-conformal evaluation while eliminating repeated training across folds. As a result, Approx-CCE is better suited to real-time IDS environments, where drift response must remain computationally feasible [5,6]. The calibration and test-time prediction procedures for Approx-CCE are summarized in Algorithms 1 and 2, respectively.

Algorithm 1: Approx-CCE: Calibration [14]

Algorithm 2: Approx-CCE Test-Time Prediction [14]
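The single-model, disjoint-partition idea described in this section can be illustrated as follows. This is a hedged sketch and not a reproduction of Algorithms 1 and 2: `score_fn` stands in for the shared model's nonconformity function, the function names and the fold count `k` are hypothetical, and aggregating the per-partition p-values by their median is an assumption of this example.

```python
import random
from statistics import median

def calibrate_partitions(score_fn, cal_x, cal_y, k=5, seed=0):
    """Score calibration data with one shared model, then split the scores
    into k disjoint partitions, each grouped by class label."""
    idx = list(range(len(cal_x)))
    random.Random(seed).shuffle(idx)
    partitions = []
    for fold in range(k):
        pools = {}
        for i in idx[fold::k]:  # strided slices are disjoint and cover all indices
            pools.setdefault(cal_y[i], []).append(score_fn(cal_x[i], cal_y[i]))
        partitions.append(pools)
    return partitions

def approx_cce_p_value(score, y_hat, partitions):
    """One smoothed p-value per partition, aggregated by the median
    (the aggregation rule is an assumption of this sketch)."""
    p_values = []
    for pools in partitions:
        pool = pools.get(y_hat, [])
        at_least = sum(1 for s in pool if s >= score)
        p_values.append((1 + at_least) / (len(pool) + 1))
    return median(p_values)

# Toy setup: the "model" score is just the input value, all labels are class 0.
cal_x = [i / 10 for i in range(20)]
cal_y = [0] * 20
parts = calibrate_partitions(lambda x, y: x, cal_x, cal_y, k=4)
p_typical = approx_cce_p_value(-1.0, 0, parts)   # below every calibration score
p_atypical = approx_cce_p_value(100.0, 0, parts) # above every calibration score
```

No model is retrained per fold; only the partitioning of already computed scores changes, which is where the runtime saving relative to standard CCE would come from.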

3.7. Streaming Simulation with Rolling Calibration Buffer

FADES operates in a streaming setting using a fixed-size rolling calibration buffer. The buffer is seeded from the training data and updated continuously with recent samples so that retraining and recalibration reflect current traffic behavior rather than the full historical stream.

Each simulation run proceeds in three stages. First, the baseline classifier and conformal evaluator are trained and calibrated. Second, incoming flows are processed in chunks, with predictions, p-values, and drift statistics logged for each chunk. Third, when drift is detected, the classifier is retrained and the conformal evaluator is recalibrated using the most recent rolling history. The complete streaming simulation procedure with CE-based drift detection is summarized in Algorithm 3.

We note that this retraining strategy assumes access to sufficiently reliable labels within the rolling buffer. In practical deployments, labels may be delayed, noisy, or partially unavailable, which can affect retraining quality. While this work focuses on evaluating drift-triggered retraining under controlled conditions, more conservative retraining policies, including delayed updates, confidence filtering, or label validation, represent important directions for future work.

Algorithm 3: Streaming Simulation with CE Drift Detection
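The three stages described above can be sketched as a small loop around a fixed-size buffer. This is an illustrative outline rather than Algorithm 3 itself: `p_value_fn`, `drift_fn`, and `retrain_fn` are hypothetical callbacks, and adding each chunk to the buffer before any retraining is an assumption.

```python
from collections import deque

def run_stream(stream, seed_samples, chunk_size, buffer_size,
               p_value_fn, drift_fn, retrain_fn):
    """Chunked streaming loop with a fixed-size rolling buffer."""
    buffer = deque(seed_samples, maxlen=buffer_size)  # oldest samples are evicted
    retrain_fn(list(buffer))                          # stage 1: initial train + calibrate
    log = []
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        p_values = [p_value_fn(x) for x in chunk]     # stage 2: score the chunk
        drift = drift_fn(p_values)
        buffer.extend(chunk)                          # keep only the most recent history
        if drift:
            retrain_fn(list(buffer))                  # stage 3: retrain + recalibrate
        log.append({"start": start, "n": len(chunk), "drift": drift})
    return log
```

Using `deque(maxlen=...)` gives the rolling behavior without explicit eviction logic: once the buffer is full, every new sample displaces the oldest one, so retraining never sees more than `buffer_size` samples.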

3.8. Drift Detection Criterion

Drift detection in FADES is based on deviations in conformal p-values from their expected distribution under exchangeability. Under the i.i.d. assumption, conformal p-values are approximately uniformly distributed on $[0, 1]$. Persistent deviation from this behavior is used as a signal of distribution shift.

For each processed chunk $c$ of size $n$, we compute the fraction of low-confidence predictions as follows:

$$D(c) = \frac{1}{n} \sum_{i = 1}^{n} \mathbb{1}[p_{i} \leq \alpha],$$

where $p_{i}$ is the conformal p-value for sample $i$ and $\alpha$ is the significance level.

A drift event is triggered if

$$D(c) > \tau,$$

where $\tau$ is a predefined threshold. In our experiments, $\tau$ is set based on the expected false rejection rate under calibration (i.e., $\tau \approx \alpha$), and deviations beyond this level indicate violation of exchangeability.

This formulation ensures that drift detection is aligned with conformal validity: under no drift, $D(c)$ remains near $\alpha$, while sustained increases signal distribution shift.
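The criterion above amounts to a few lines of code. The sketch below uses hypothetical function names and illustrative default values for the significance level and the threshold; the actual values used in the experiments are not stated in this section.

```python
def low_confidence_fraction(p_values, alpha):
    """D(c): share of samples in the chunk whose p-value is at most alpha."""
    if not p_values:
        raise ValueError("chunk must contain at least one p-value")
    return sum(1 for p in p_values if p <= alpha) / len(p_values)

def is_drift(p_values, alpha=0.05, tau=0.05):
    """Trigger when D(c) exceeds tau; the defaults here are illustrative only."""
    return low_confidence_fraction(p_values, alpha) > tau

stable_chunk = [0.5] * 20                  # no low-confidence predictions
shifted_chunk = [0.01] * 6 + [0.5] * 14    # 6 of 20 p-values fall at or below alpha

d_shifted = low_confidence_fraction(shifted_chunk, alpha=0.05)
```

For the shifted chunk, D(c) = 6 / 20 = 0.3, which is well above a threshold near 0.05, whereas the stable chunk yields D(c) = 0 and does not trigger.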

While this criterion is theoretically motivated by the uniformity of conformal p-values under exchangeability, it is important to note that deviations may arise from both distributional shift and the presence of adversarial traffic. In intrusion detection settings, attack traffic itself can induce non-uniform p-value behavior even in the absence of underlying concept drift. To better understand this distinction, we analyze retraining behavior across simulation logs and examine whether drift triggers occur disproportionately during attack injection periods versus sustained distributional changes. To evaluate this concern empirically, Section 4.3 reports a chunk-level association analysis comparing retraining rates for chunks with and without attack samples. Across 53,947 completed chunks, attack-containing chunks were not more likely to trigger retraining than no-attack chunks, indicating that the retraining policy is not simply an attack-presence detector.

3.9. Framework Design Justification

FADES is designed for streaming intrusion detection environments where retraining and recalibration may occur repeatedly. The framework follows four design principles.

First, flows are processed in controllable chunks rather than individually to ensure sufficient statistical signal for CE.

Second, prediction and uncertainty estimation are separated through a CE layer. This allows drift to be detected using model confidence signals before delayed labels become available.

Third, a fixed-size rolling calibration buffer is maintained so that retraining and recalibration reflect the most recent traffic behavior without requiring access to the full historical dataset.

Fourth, lightweight neural models are used inside the CE components to maintain fast retraining cycles under streaming conditions. The MLP probability interface used to support CE-based nonconformity scoring is summarized in Algorithm 4.

Algorithm 4: MLP predict_proba [14]

3.10. Adaptive Chunking Controller

FIRCE introduced an Adaptive Chunking Controller (ACC) to balance detection sensitivity and computational efficiency, and FADES retains it. Small chunks can improve responsiveness but lead to excessive recalibration, while large chunks reduce overhead but may delay drift detection. ACC addresses this trade-off by adjusting chunk size according to an exponential moving average of recent drift activity [20,21,22,23].

When drift is frequent, ACC decreases chunk size to improve temporal resolution. When the stream is stable, ACC increases chunk size to reduce unnecessary computation. This adaptive strategy is motivated by the same general principle that underlies adaptive windowing methods in streaming data analysis, but it is applied here to CE granularity. The ACC update procedure is summarized in Algorithm 5.

Algorithm 5: Adaptive Chunk Size Controller [14]

The ACC design introduces heuristic thresholds and update rules based on exponential smoothing of drift frequency. While these choices are empirically effective across our evaluated settings, they are not theoretically optimal and may require tuning under different data characteristics. A more principled formulation of adaptive chunking, including data-driven or theoretically grounded update strategies, remains an open research direction.
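One possible shape of such an update rule is sketched below. It is not Algorithm 5: the smoothing factor `beta`, the `high` and `low` thresholds, the resize `factor`, and the size bounds are all hypothetical values chosen only to make the behavior visible.

```python
def update_chunk_size(chunk_size, ema, drift_detected, *, beta=0.3,
                      high=0.5, low=0.1, factor=2, min_size=100, max_size=5000):
    """One ACC-style step: smooth the drift indicator, then resize the chunk."""
    ema = beta * (1.0 if drift_detected else 0.0) + (1.0 - beta) * ema
    if ema > high:
        chunk_size = max(min_size, chunk_size // factor)  # frequent drift: finer chunks
    elif ema < low:
        chunk_size = min(max_size, chunk_size * factor)   # stable stream: coarser chunks
    return chunk_size, ema

# Three consecutive drift events followed by a long stable period.
size, ema = 1000, 0.0
for _ in range(3):
    size, ema = update_chunk_size(size, ema, drift_detected=True)
size_after_drift = size
for _ in range(20):
    size, ema = update_chunk_size(size, ema, drift_detected=False)
size_after_stable = size
```

Because the moving average must cross a threshold before the size changes, a single isolated drift event leaves the chunk size untouched, while a run of events shrinks it and a sustained quiet period grows it back up to the cap.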

3.11. Experimental Protocol and Transfer Settings

FADES is evaluated in one in-domain setting and two cross-dataset transfer settings. In the in-domain setting, both the classifier and conformal evaluator are trained and calibrated on the DFAIR training dataset and evaluated on the corresponding drift dataset. In the transfer settings, the classifier and conformal evaluator are trained and calibrated on UNSW-NB15 or CICIDS2018 and then evaluated on the DFAIR drift dataset [38,39,40,50,51,52,53,54].

This design allows us to examine both local drift response and broader cross-dataset generalization. It also reflects reviewer feedback that the framework should not be presented as tied only to a single dataset or lab environment.

3.11.1. Dataset Characteristics and Transfer Behavior

Differences in recalibration frequency across datasets are primarily driven by dataset characteristics rather than the feature subset used in FADES. CICIDS2018 is organized into scenario-based capture days (e.g., DoS/DDoS day, web-attack day), producing long semi-stationary segments that allow the Adaptive Chunking Controller to operate with larger chunks.

UNSW-NB15, in contrast, mixes live network traffic with synthetic attack streams generated using IXIA PerfectStorm. These blended streams tend to produce more abrupt distribution changes, which can trigger recalibration more frequently during transfer experiments.

Finally, even when feature names match, CICIDS2018 flows are produced using CICFlowMeter while UNSW-NB15 features originate from Argus and Bro. These differences alter the distributions of identically named statistics such as flow duration or inter-arrival times, which can affect conformal calibration stability under cross-dataset transfer.

3.11.2. Experimental Hardware and Runtime Environment

The experiments reported in the conference version (FIRCE) were executed on a 16-inch Apple MacBook Pro equipped with an Apple M1 Max processor (Apple Inc., Cupertino, CA, USA). The new experiments introduced in this journal extension, including multi-seed evaluations and the CADE runtime comparison, were conducted on a dedicated workstation-class system.

The workstation consists of an AMD Threadripper processor (Advanced Micro Devices, Inc., Santa Clara, CA, USA) with 128 CPU cores, 512 GB of system memory, and three NVIDIA A6000 GPUs (NVIDIA Corporation, Santa Clara, CA, USA). To ensure reproducibility and to emulate realistic resource constraints, all experiments were scheduled using Slurm [55] and restricted to a single GPU, 10 CPU cores, and 64 GB of RAM per run.

This configuration ensures that runtime comparisons reflect realistic deployment conditions rather than unconstrained hardware utilization.

3.12. Journal-Extension Additions

This journal extension adds two major empirical enhancements beyond the conference version.

3.12.1. Multi-Run, Multi-Seed Analysis

Reviewer feedback indicated that the framework should be evaluated across multiple runs and random seeds, with variability reflected in the reported results. To address this, all major experiments in the journal extension are repeated across multiple random seeds and independent runs, and results are summarized using mean and standard deviation.

Specifically, each experimental configuration is evaluated using five random seeds, {17, 42, 67, 92, 117}, and five independent runs per seed. This results in 25 repetitions per configuration. Experiments are performed across three dataset-transfer settings: (i) in-domain evaluation using DFAIR → DFAIR Drift, (ii) cross-dataset transfer from UNSW-NB15 to the DFAIR drift stream, and (iii) cross-dataset transfer from CICIDS2018 to the DFAIR drift stream.

For each dataset-transfer configuration, we evaluate five CE strategies: ICE, Approx-CCE, CCE, Approx-TCE, and a baseline configuration without CE. All experiments use the same classifier architecture (FNN) and the same internal CE model (a lightweight multilayer perceptron used to compute probability-based nonconformity scores).

In total, this experimental protocol produces 5 × 5 × 5 × 3 = 375 simulation runs (five seeds, five runs per seed, five CE strategies, and three dataset-transfer settings) across the full evaluation suite. All simulations are executed using the hardware configuration described in Section 3.11.2, with each run scheduled independently through Slurm.
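The run count can be checked by enumerating the evaluation grid directly. The snippet below is a small sketch that uses the seeds, CE strategies, and transfer settings named in the text; the variable names and the label `"No-CE"` for the baseline without CE are our own shorthand.

```python
from itertools import product

SEEDS = (17, 42, 67, 92, 117)
RUNS_PER_SEED = 5
# "No-CE" is our label for the baseline configuration without CE
CE_STRATEGIES = ("ICE", "Approx-CCE", "CCE", "Approx-TCE", "No-CE")
SETTINGS = (
    "DFAIR -> DFAIR Drift",
    "UNSW-NB15 -> DFAIR Drift",
    "CICIDS2018 -> DFAIR Drift",
)

# One tuple per simulation: (setting, CE strategy, seed, run index)
grid = list(product(SETTINGS, CE_STRATEGIES, SEEDS, range(RUNS_PER_SEED)))

# Repetitions for a single configuration (one setting, one CE strategy)
per_config = sum(
    1 for setting, ce, _, _ in grid
    if setting == SETTINGS[0] and ce == "Approx-CCE"
)

# Simulations that actually use a conformal evaluator
ce_enabled = sum(1 for _, ce, _, _ in grid if ce != "No-CE")
```

Each configuration receives 25 repetitions, and the four CE-enabled strategies account for 300 of the 375 runs.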

Aggregated results reported in the following sections therefore reflect both central tendency (mean performance) and run-to-run variability (standard deviation), allowing us to assess the stability of the FADES pipeline under repeated stochastic initialization and streaming drift conditions.

3.12.2. CADE Integration and Runtime Comparison

To expand the empirical comparison beyond prediction-space CE, we integrated CADE [7] into the FADES simulation pipeline by exposing the contrastive autoencoder as a local runtime component. This allowed CADE to act as a drop-in replacement for the conformal evaluator, enabling drift detection to trigger retraining under the same streaming pipeline used by FADES.

In this configuration, CADE used the same experimental setup as the CE runs, including the same retraining policy, ACC settings, and rolling log size of 10^5 flows. This ensured that runtime differences reflect behavior under a unified streaming retraining policy, rather than differences in surrounding pipeline design. We note, however, that this setting does not necessarily reflect the optimal deployment configuration for CADE, which was originally proposed for offline or semi-offline scenarios. CADE training was executed using TensorFlow with GPU acceleration on a single NVIDIA A6000 GPU, verified via logging, nvidia-smi, and nvtop during runtime.

We primarily evaluated CADE using the DFAIR → DFAIR Drift configuration, which represents the smallest dataset used in our evaluation. Even under this relatively lightweight setting, CADE incurred substantially higher computational cost than FADES’s CE-based components.

Using the original CADE configuration, a single simulation run failed to complete within the Slurm walltime limit of 7 days. This configuration used hidden dimensions (512, 128, 32) after the input layer, margin 10.0, MAD threshold 3.5, minimum drift ratio 0.05, minimum drift count 1, batch size 64, and 250 training epochs per CADE update.

Because this original setting was computationally infeasible under the FADES streaming retraining loop, we next evaluated a reduced-epoch CADE configuration. This second configuration preserved the original CADE architecture and drift-detection parameters, but reduced the number of training epochs from 250 to 50. Even under this reduced training schedule, CADE again failed to complete a single simulation within the 7-day walltime limit. Under the 50-epoch configuration, CADE initiated 175 retraining cycles before the job reached the walltime limit.

In response to concerns that CADE was originally designed for offline or semi-offline drift analysis rather than high-frequency streaming retraining, we additionally evaluated a lightened CADE configuration. This third configuration reduced the hidden dimensions from (512, 128, 32) to (128, 64, 32), reduced the number of epochs from 50 to 10, reduced the contrastive margin from 10.0 to 5.0, increased the MAD threshold from 3.5 to 4.0, increased the minimum drift ratio from 0.05 to 0.10, and increased the minimum drift count from 1 to 3. The learning rate, batch size, regularization coefficient, similar-sample ratio, display interval, and GPU device assignment were kept unchanged. Table 6 summarizes the three CADE configurations evaluated in the streaming simulation loop.
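The three CADE configurations described above can be written down compactly as configuration objects, which makes the differences between them explicit. The `CadeConfig` class and its field names are hypothetical and do not correspond to the CADE code base; the values are those stated in the text, and hyperparameters that the text reports as unchanged but does not quantify (for example, the learning rate) are omitted.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CadeConfig:
    """Hypothetical container for the CADE settings reported in the text."""

    hidden_dims: tuple
    epochs: int
    margin: float
    mad_threshold: float
    min_drift_ratio: float
    min_drift_count: int
    batch_size: int = 64  # kept unchanged across all three configurations


ORIGINAL = CadeConfig(
    hidden_dims=(512, 128, 32),
    epochs=250,
    margin=10.0,
    mad_threshold=3.5,
    min_drift_ratio=0.05,
    min_drift_count=1,
)

# Same architecture and drift parameters, shorter training schedule
REDUCED_EPOCH = replace(ORIGINAL, epochs=50)

# Narrower network, fewer epochs, and less aggressive drift triggering
LIGHTENED = replace(
    REDUCED_EPOCH,
    hidden_dims=(128, 64, 32),
    epochs=10,
    margin=5.0,
    mad_threshold=4.0,
    min_drift_ratio=0.10,
    min_drift_count=3,
)
```

Deriving each variant from the previous one mirrors the progression in the experiments: only the epoch count changes in the second configuration, while the third changes width, schedule, margin, and trigger sensitivity together.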

Despite these reductions, the lightened CADE configuration still failed to complete a single DFAIR → DFAIR Drift simulation within the 7-day walltime limit. This result suggests that the bottleneck is not solely the number of CADE training epochs, but the interaction between representation-space retraining and the repeated online adaptation policy used in the FADES streaming loop.

The primary bottleneck across CADE configurations was repeated retraining of the contrastive autoencoder following drift-triggered updates. This frequent retraining behavior caused the representation model to be repeatedly re-optimized as new drift signals were detected, making CADE substantially more expensive than prediction-space CE methods under the specific high-frequency adaptation regime evaluated here.

For comparison, FADES using Approx-CCE with ACC completed the same simulation configuration in 785.36 s (13.1 min), triggering two retraining events over the entire run. Within the same 7-day window in which a single CADE run failed to complete, we executed over 225 FADES simulations across multiple datasets, seeds, and CE configurations.

We also attempted CADE runs on the cross-dataset transfer settings (UNSW-NB15 → DFAIR Drift and CICIDS2018 → DFAIR Drift). However, these datasets are substantially larger than the DFAIR dataset, and the CADE runs again exceeded practical runtime limits, running for more than 1 day per simulation without showing signs of completing. These runs were therefore terminated early once it became clear that runtime behavior did not improve relative to the in-domain experiment.

Taken together, these results highlight a key operational distinction between representation-space drift detection and prediction-space uncertainty signals. While CADE provides rich explanations for drifting samples, the computational cost of repeatedly retraining a contrastive autoencoder makes it difficult to deploy in streaming intrusion detection pipelines where drift-triggered retraining must occur frequently. In contrast, CE-based drift signaling operates directly on classifier prediction probabilities and requires only lightweight recalibration, enabling rapid retraining cycles suitable for real-time deployment.

It is important to emphasize that this comparison evaluates CADE under the same high-frequency retraining regime used by CE-based methods in FADES. While this provides a controlled comparison within a unified streaming pipeline, it may not reflect configurations in which CADE is applied with less frequent retraining or alternative update strategies. As such, our results should be interpreted as demonstrating the operational cost of directly integrating representation-space drift detection into a fully online retraining loop, rather than a definitive statement about the intrinsic efficiency of CADE across all settings.

3.13. Statistical Analysis

All experiments were repeated across multiple random seeds and independent runs to account for stochastic variability in model initialization and streaming behavior. Specifically, each experimental configuration was evaluated across five seeds, {17, 42, 67, 92, 117}, and five independent runs per seed, yielding 25 repetitions per configuration and a total of 375 simulations across all datasets and CE strategies.

Simulation outputs were logged per run and aggregated using an automated post-processing pipeline. For each configuration defined by dataset, classifier, CE method, and classification mode, we compute summary statistics over all runs, including mean, standard deviation, minimum, and maximum values for each reported metric. Coverage of all expected seed–run combinations is explicitly verified to ensure completeness of the evaluation.
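A post-processing step of this kind might look like the sketch below, which groups per-run records by configuration, verifies that every expected seed-run pair is present, and computes the four summary statistics. The record layout, the function name `summarize`, the reduced configuration key (dataset and CE method only), and the use of the sample standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from statistics import mean, stdev

SEEDS = (17, 42, 67, 92, 117)
RUNS = range(5)


def summarize(records, metric):
    """Aggregate one metric per configuration after a coverage check."""
    by_config = defaultdict(dict)
    for rec in records:
        key = (rec["dataset"], rec["ce_method"])  # simplified config key
        by_config[key][(rec["seed"], rec["run"])] = rec[metric]

    expected = set(product(SEEDS, RUNS))
    summary = {}
    for key, values in by_config.items():
        missing = expected - set(values)
        if missing:
            raise ValueError(f"{key}: missing seed-run pairs {sorted(missing)}")
        xs = list(values.values())
        summary[key] = {
            "mean": mean(xs),
            "std": stdev(xs),  # sample standard deviation (assumed)
            "min": min(xs),
            "max": max(xs),
            "n": len(xs),
        }
    return summary


# Toy records with made-up F1 values, used only to exercise the function
records = [
    {"dataset": "DFAIR", "ce_method": "Approx-CCE",
     "seed": s, "run": r, "f1": 0.90 + 0.01 * r}
    for s, r in product(SEEDS, RUNS)
]
stats = summarize(records, "f1")[("DFAIR", "Approx-CCE")]
```

Raising an error on incomplete coverage, rather than silently averaging over fewer runs, keeps a missing log file from biasing the reported mean and standard deviation.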

Primary results are reported as mean ± standard deviation to capture both central tendency and variability across runs. This variance-aware reporting allows us to assess the stability of drift detection performance under repeated stochastic conditions.

In addition to reporting mean and standard deviation, we examine variability across runs to assess the consistency of observed trends. While formal hypothesis testing is not the primary focus of this work, the multi-run, multi-seed design enables qualitative assessment of statistical stability and reduces the likelihood that reported differences arise from random initialization effects. Future work will extend this analysis with formal statistical significance testing and effect size estimation.

4. Results

We evaluate whether Approx-CCE preserves CCE-level detection quality while reducing computational cost, whether ACC reduces calibration overhead without sacrificing performance, and whether the journal-only additions strengthen the robustness and breadth of the empirical study.

4.1. RQ1: Performance vs. Computational Overhead

We evaluate RQ1: whether CE-based methods, particularly Approx-CCE, can preserve detection performance while reducing computational overhead in streaming intrusion detection settings. Variance-aware results across multiple seeds and runs are presented in Section 4.4, confirming that these performance–efficiency trade-offs remain stable under stochastic variation.

Table 7 summarizes performance across in-domain and transfer settings. Approx-CCE matches CCE to within 0.1% on both accuracy and F1 while consistently reducing runtime. ICE occasionally fails to trigger retraining under drift due to its conservative calibration structure, while Approx-TCE remains slower due to its inherited multi-fold overhead [6].

These results support the central motivation of Approx-CCE: in repeated recalibration settings, it provides a more practical operating point than standard CCE while maintaining essentially the same predictive behavior.

The near-ceiling performance observed in the DFAIR drift setting reflects an in-domain evaluation scenario. The drift dataset was collected on the same physical network and devices used for training, with new attack patterns introduced into otherwise stable background traffic. As a result, these results should be interpreted as demonstrating system behavior under controlled in-domain drift rather than as evidence of generalization to fully independent environments. The cross-dataset experiments using UNSW-NB15 and CICIDS2018 therefore provide a more challenging transfer scenario where covariate shift and attack taxonomy differences reduce separability. The corresponding metric traces for the in-domain DFAIR setting and the two cross-dataset transfer settings are shown in Figure 2, Figure 3 and Figure 4.

In the in-domain setting, Approx-CCE reduces runtime by approximately 13.9% relative to CCE while maintaining identical accuracy to four decimal places. Under cross-dataset transfer, runtime reductions increase to approximately 60–64%, while preserving near-parity in detection metrics.

To further analyze the trade-off between detection performance and computational cost, we evaluate the effect of adaptive chunking on recalibration behavior and runtime.

The Adaptive Chunking Controller is designed to reduce calibration overhead without degrading detection quality. Table 8 shows that adaptive chunking preserves strong metrics while avoiding the calibration explosions that occur with very small fixed chunks. At the same time, it avoids the underreaction and reduced recall that can appear with very large chunk sizes.

Across all datasets, adaptive chunking provides a superior operational trade-off compared with fixed chunking strategies. In the in-domain setting, it reduces recalibration frequency by over an order of magnitude relative to small fixed chunks while maintaining near-perfect detection performance. In transfer settings, it avoids the extreme runtime costs associated with fine-grained chunking and eliminates the need to manually tune chunk size, demonstrating robust behavior across varying drift conditions [18,24,25,56,57].

Answer to RQ1: These results support the hypothesis that the statistical benefits of CCE are primarily attributable to its disjoint calibration partition structure rather than fold-specific model diversity. Approx-CCE preserves this structure while eliminating repeated model training, suggesting that the latter contributes minimally to nonconformity score quality in this setting. This has practical implications beyond FADES: it suggests that cross-conformal-style calibration can be made computationally feasible for repeated recalibration scenarios without sacrificing the statistical properties that motivate CCE over simpler inductive evaluators.

4.2. RQ2: CE vs. CADE Runtime Feasibility

We evaluate RQ2: how representation-space drift detection (CADE) compares with CE-based methods in terms of runtime feasibility and operational practicality in streaming settings.

To broaden the baseline landscape beyond prediction-space CE methods, we evaluate CADE [7], a representation-space drift detection approach based on contrastive autoencoders. CADE learns a latent embedding of the training distribution and detects drift by identifying samples that deviate from this learned representation manifold.

To evaluate CADE within the FADES framework, we implemented a compatible runtime wrapper using the original CADE architecture and training procedure. The conformal evaluator component was replaced with CADE-based drift detection, while preserving the same streaming simulation, retraining, and recalibration pipeline.

This setup enables a controlled comparison between prediction-space and representation-space drift detection along three dimensions: (i) retraining cost, (ii) drift-trigger frequency, and (iii) overall runtime scalability. In particular, we compare the following:

CE-triggered retraining, where drift is signaled via conformal p-value deviations.
CADE-triggered retraining, where drift is signaled via representation-space distance from the learned manifold.

This integration allows CADE to be evaluated under the same streaming intrusion detection workload and operational constraints as FADES.
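For readers unfamiliar with the CE-triggered side of this comparison, the sketch below computes a standard conformal p-value from calibration nonconformity scores and applies a simple chunk-level rule to it. The p-value formula is the usual one; the chunk-level rule, its thresholds, and the function names are hypothetical and are not the FADES drift criterion.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration scores at least as nonconforming as the test score."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (n + 1)


def chunk_drift_signal(calibration_scores, chunk_scores,
                       p_threshold=0.05, min_fraction=0.2):
    """Assumed rule: flag drift when enough samples in a chunk have low p-values."""
    low = sum(
        1 for s in chunk_scores
        if conformal_p_value(calibration_scores, s) < p_threshold
    )
    return low / len(chunk_scores) >= min_fraction


# Toy calibration scores spread evenly over [0, 1)
calibration = [i / 100 for i in range(100)]

stable_chunk = [0.1] * 10                   # scores typical of calibration data
drifted_chunk = [0.995] * 5 + [0.1] * 5     # half the samples look unfamiliar

stable_flag = chunk_drift_signal(calibration, stable_chunk)
drift_flag = chunk_drift_signal(calibration, drifted_chunk)
```

The point of the example is that the trigger depends only on scores derived from classifier probabilities, so evaluating it requires no additional model training.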

To avoid evaluating CADE only under a single potentially unfavorable configuration, we tested three increasingly lightweight CADE variants. The original configuration used the default high-epoch training schedule. The reduced-epoch configuration preserved the same architecture and drift parameters but lowered the training schedule to 50 epochs. The lightened configuration further reduced model width, training epochs, and drift-trigger sensitivity. This progression allows us to separate the effect of training duration from the broader cost of repeatedly updating a representation-space detector inside the streaming loop.

4.2.1. Runtime Feasibility

Our experiments reveal that CADE incurs substantially higher computational overhead than CE-based detectors in this streaming setting. These experiments were conducted on a workstation with 10 allocated CPU cores, 64 GB of RAM, and an NVIDIA A6000 GPU. Under these conditions, the original 250-epoch CADE configuration failed to complete a single DFAIR → DFAIR Drift simulation within the 7-day walltime limit.

We then evaluated a reduced-epoch CADE configuration that preserved the original architecture and drift-detection parameters but reduced the training schedule from 250 epochs to 50 epochs. This configuration also failed to complete within the 7-day walltime limit, initiating 175 retraining cycles before timeout. This result indicates that simply reducing the number of epochs was insufficient to make CADE feasible under the FADES streaming adaptation policy.

Finally, we evaluated a lightened CADE configuration that reduced the hidden dimensions from (512, 128, 32) to (128, 64, 32), reduced the training schedule to 10 epochs, lowered the contrastive margin, and made drift triggering less aggressive by increasing both the MAD threshold and the minimum drift requirements. Despite these changes, the lightened CADE configuration again failed to complete a single simulation within the 7-day walltime limit.

In contrast, FADES’s CE-based pipeline completed the identical simulation configuration using Approx-CCE in approximately 785 s (13.1 min). Within the same 7-day window in which a single CADE run failed to complete, we executed over 225 FADES simulations across multiple datasets, seeds, and CE configurations. Since 7 days corresponds to 604,800 s, the timeout threshold alone implies a lower-bound runtime difference of approximately 770× relative to the completed Approx-CCE run.
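The lower bound follows from simple arithmetic, reproduced here as a check. The runtime of 785.36 s is the Approx-CCE figure reported in Section 3.12.2; the variable names are ours.

```python
# Slurm walltime limit of 7 days, expressed in seconds
WALLTIME_S = 7 * 24 * 60 * 60

# Reported runtime of the completed Approx-CCE run (Section 3.12.2)
APPROX_CCE_S = 785.36

# CADE did not finish, so its runtime is at least the walltime limit;
# the ratio is therefore a lower bound on the slowdown
lower_bound_ratio = WALLTIME_S / APPROX_CCE_S

# The same runtime expressed in minutes, as quoted in the text
approx_cce_min = APPROX_CCE_S / 60
```

Because every CADE configuration timed out, the true ratio is unknown and could be considerably larger than this bound.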

4.2.2. Operational Implications

These results highlight an important operational distinction between representation-space drift detectors and prediction-space uncertainty signals. While CADE provides rich explanations for drifting samples, the computational cost of repeatedly retraining a contrastive autoencoder makes it difficult to deploy in streaming environments that require frequent retraining or recalibration.

In contrast, CE methods operate directly on classifier prediction probabilities and require only lightweight recalibration, allowing them to support rapid retraining cycles. This makes CE-based drift signaling more compatible with real-time intrusion detection pipelines where both responsiveness and computational efficiency are critical.

Table 9 summarizes the observed runtime differences.

Answer to RQ2: Under the specific high-frequency retraining regime evaluated in FADES, CE-based drift monitoring provides a substantially more practical runtime profile than the CADE configurations tested. The reduced-epoch CADE experiment showed that lowering the training schedule from 250 to 50 epochs was insufficient to make CADE feasible, and the lightened CADE experiment further showed that reducing model width, training epochs, and drift-trigger sensitivity still did not allow a single simulation to complete within the 7-day walltime limit. These results should not be interpreted as proving that CADE is inefficient in all deployment settings. Rather, they show that directly inserting CADE-style representation learning into a repeated online retraining loop introduces substantial operational cost, whereas CE-based drift monitoring can support the same streaming workload using lightweight prediction-space recalibration.

4.3. Attack Presence and Retraining Trigger Analysis

Reviewer feedback raised the concern that drift-triggered retraining may be caused merely by the presence of attack traffic rather than by broader distributional change. To evaluate this possibility, we performed a chunk-level association analysis across all CE-enabled seed runs. Each completed chunk was labeled according to whether it contained at least one attack sample (Actual = 1) and whether it ended in a drift-triggered retraining event. We then compared P(retrain ∣ attack chunk) against P(retrain ∣ no attack chunk) using 2 × 2 contingency tables, Fisher exact tests, risk ratios, odds ratios, and phi coefficients.

Across 300 CE-enabled log files and 53,947 completed chunks, 34,640 chunks contained at least one attack sample and 995 chunks triggered retraining. Attack-containing chunks were not more likely to trigger retraining. Instead, retraining occurred in 551 of 34,640 attack-containing chunks (1.59%) and 444 of 19,307 no-attack chunks (2.30%). This corresponds to a risk difference of −0.71 percentage points, risk ratio of 0.69, odds ratio of 0.69, and phi coefficient of −0.025. The aggregate and subgroup association results are summarized in Table 10. Although the Fisher exact test is significant due to the large number of chunks, the effect is small and in the opposite direction of the reviewer concern.
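The aggregate effect sizes can be recomputed from the four cell counts of the 2 × 2 table. The sketch below uses the counts reported above and standard definitions of the risk difference, risk ratio, odds ratio, and phi coefficient; the function name and dictionary keys are our own. The Fisher exact test is omitted to keep the example dependency-free, although a routine such as `scipy.stats.fisher_exact` could be applied to the same table.

```python
import math


def association(a, b, c, d):
    """Effect sizes for a 2x2 table.

    a: attack chunks that retrained      b: attack chunks that did not
    c: no-attack chunks that retrained   d: no-attack chunks that did not
    """
    risk_attack = a / (a + b)
    risk_no_attack = c / (c + d)
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return {
        "risk_attack_pct": 100 * risk_attack,
        "risk_no_attack_pct": 100 * risk_no_attack,
        "risk_diff_pp": 100 * (risk_attack - risk_no_attack),
        "risk_ratio": risk_attack / risk_no_attack,
        "odds_ratio": (a * d) / (b * c),
        "phi": phi,
    }


# Counts from the text: 551 of 34,640 attack chunks and
# 444 of 19,307 no-attack chunks ended in retraining
result = association(551, 34640 - 551, 444, 19307 - 444)
```

A risk ratio below 1 and a negative phi coefficient both indicate that, in aggregate, attack-containing chunks were slightly less likely to end in retraining than no-attack chunks.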

Dataset-level results further support this interpretation. In the DFAIR in-domain setting, retraining rates were nearly identical for attack-containing and no-attack chunks (1.22% vs. 1.25%; Fisher p = 0.832). In the CICIDS2018 and UNSW-NB15 transfer settings, attack-containing chunks were less likely to trigger retraining than no-attack chunks. Similarly, CE-type-level analysis showed no positive attack-retraining association for Approx-CCE, Approx-TCE, CCE, or ICE in aggregate. Approx-CCE, the main proposed configuration, exhibited nearly identical retraining rates for attack-containing and no-attack chunks (1.03% vs. 1.16%; Fisher p = 0.485).

These findings do not imply that attack traffic is irrelevant to distributional change. Rather, they indicate that retraining in FADES is not triggered merely by the presence of attack samples in a chunk. Drift-triggered retraining appears to reflect broader changes in conformal confidence behavior across chunks rather than a direct attack-presence rule.

4.4. RQ3: Stability Across Seeds and Transfer Settings

We evaluate RQ3: the stability of drift detection performance across multiple runs, random seeds, and dataset transfer settings.

To assess robustness under stochastic variation, each experimental configuration was evaluated across five random seeds and five independent runs per seed, yielding 25 repetitions per configuration. Results are reported as mean ± standard deviation across runs.

Table 11 summarizes the aggregated results across all configurations.

Across all configurations, Approx-CCE with adaptive chunking exhibits low variance in both detection performance and runtime, indicating stable behavior under stochastic initialization and streaming variability. In the in-domain setting, variance is minimal across all metrics, reflecting consistent performance when training and drift distributions are closely aligned.

Under cross-dataset transfer, variability increases slightly, particularly in precision and recall, due to differences in feature distributions and attack taxonomies. However, standard deviations remain small relative to mean values, indicating that performance degradation under transfer is systematic rather than stochastic.

Runtime variability is also limited, suggesting that the computational efficiency gains observed in RQ1 are consistent across runs and not sensitive to initialization or data ordering effects.

Answer to RQ3: Approx-CCE demonstrates stable performance across multiple runs and random seeds, with low variance in both detection metrics and runtime. While cross-dataset transfer introduces modest variability, overall behavior remains consistent, indicating that the observed performance and efficiency trade-offs are robust under stochastic and distributional variation.

4.5. RQ4: Unified Framework Comparison

We evaluate RQ4: whether a unified framework enables consistent and meaningful comparison across fundamentally different drift detection paradigms.

FADES provides a shared streaming evaluation pipeline in which both prediction-space and representation-space drift detection methods operate under identical conditions, including the same data streams, retraining policies, and evaluation metrics. This allows for controlled, apples-to-apples comparison between CE-based methods and representation-space approaches such as CADE.

Results from RQ1 and RQ2 demonstrate that while both paradigms are capable of detecting drift, they exhibit substantially different operational characteristics. Approx-CCE achieves detection performance comparable to standard CCE while significantly reducing computational overhead, and adaptive chunking further improves the trade-off between responsiveness and efficiency. In contrast, CADE provides a more expressive representation of distributional change, but the CADE configurations evaluated here incurred prohibitive runtime cost when embedded directly into the repeated online retraining loop.

RQ3 further shows that CE-based methods exhibit stable performance across multiple runs and seeds, with limited variability under both in-domain and transfer conditions. This stability, combined with low computational overhead, makes CE-based drift detection more suitable for deployment in real-time streaming environments.

Taken together, these results highlight a key insight: drift detection methods must be evaluated not only in terms of detection capability, but also in terms of their computational feasibility and stability under repeated adaptation. The unified FADES framework enables this evaluation by placing different drift detection strategies within a common operational context.

Answer to RQ4: FADES enables consistent and controlled comparison across drift detection paradigms, revealing that CE-based methods provide a more practical balance of performance, efficiency, and stability than representation-space approaches for streaming intrusion detection.

5. Discussion

The journal extension clarifies and strengthens several points that emerged through reviewer feedback and manuscript compression.

5.1. Novelty Relative to Transcend and Prior Work

A central question raised by reviewers concerns the novelty of FADES relative to Transcend and Transcending Transcend. FADES does not claim to introduce CE to security applications. Rather, its primary contribution lies in adapting CE to a realistic streaming IDS setting, where recalibration must occur repeatedly and under tight computational constraints [5,6].

Within this deployment-oriented context, Approx-CCE addresses a key limitation of standard cross-conformal evaluation: the repeated cost of training multiple models during recalibration. By replacing this with a single shared model and fold-based calibration, Approx-CCE preserves the structure of cross-conformal evaluation while substantially reducing computational overhead. ACC complements this by dynamically adjusting chunk sizes, eliminating the need for manual tuning of recalibration intervals.
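One possible reading of "a single shared model and fold-based calibration" is sketched below: nonconformity scores from one already-trained model are split into disjoint calibration folds, a conformal p-value is computed against each fold, and the per-fold values are combined. This is an illustrative interpretation only. The fold assignment, the averaging step, the default of five folds, and the function names are assumptions, and the actual Approx-CCE procedure may aggregate fold results differently.

```python
def split_folds(items, k):
    """Partition items into k disjoint folds by striding."""
    return [items[i::k] for i in range(k)]


def approx_cce_p_value(calibration_scores, test_score, k=5):
    """Mean conformal p-value over k disjoint calibration folds.

    All scores are assumed to come from ONE shared model, so no
    per-fold model is trained (unlike standard cross-conformal evaluation).
    """
    p_values = []
    for fold in split_folds(calibration_scores, k):
        at_least = sum(1 for s in fold if s >= test_score)
        p_values.append((at_least + 1) / (len(fold) + 1))
    return sum(p_values) / k  # averaging is an assumed aggregation rule


# Toy calibration scores produced by a single (hypothetical) model
calibration = [i / 100 for i in range(100)]
folds = split_folds(calibration, 5)

p_unfamiliar = approx_cce_p_value(calibration, 0.995)  # beyond all scores
p_typical = approx_cce_p_value(calibration, 0.0)       # most conforming
```

The cost saving in this reading comes from the loop body: it ranks scores against each fold but never retrains a model, so recalibration scales with the buffer size rather than with k training runs.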

Taken together, Approx-CCE and ACC are not isolated algorithmic modifications, but system-level components that enable CE-based drift detection to operate under realistic streaming constraints.

Beyond these operational contributions, FADES also advances understanding of CE behavior in two important ways. First, the Approx-CCE results provide empirical evidence that the statistical properties of cross-conformal evaluation can be decomposed, separating the role of calibration partitioning from the role of model diversity. Second, the 375-simulation protocol characterizes CE behavior under repeated recalibration, a regime that is analytically distinct from the fixed-calibration setting considered in prior work.

These aspects are not addressed by Transcend or Transcending Transcend, which primarily evaluate CE in static or infrequently recalibrated environments. As a result, FADES contributes both a practical framework for deployment and new empirical insight into the behavior of conformal methods under continuous adaptation.

5.2. Model Architecture Clarification

Another reviewer concern involved the terminology around the neural components of FADES. The framework uses standard feedforward multilayer architectures. The intrusion detection classifier is a feedforward neural network that produces probabilistic predictions for network flows. The CE pipeline uses a multilayer perceptron to model probability outputs for nonconformity scoring and temperature-scaled calibration. Both are feedforward networks, and the distinct naming is intended only to disambiguate roles inside the system.

The multilayer perceptron was selected for the CE role because it natively produces probabilities suitable for probability-based nonconformity measures, enabling efficient temperature scaling rather than heavier class-specific calibration procedures often required by other models [4,34]. FADES itself remains classifier agnostic in principle.

5.3. Evaluation Design and Drift Scenarios

The experiments in FADES emphasize abrupt drift caused by previously unseen attacks, a scenario that is common in security contexts where new attack behaviors can appear suddenly in traffic. The cross-dataset experiments with UNSW-NB15 and CICIDS2018 further show that the observed trends are not limited to the DFAIR dataset alone [39,40].

At the same time, FADES does not yet exhaust the broader space of drift types. Gradual drift, recurring drift, and mixed open-world attack emergence remain important directions for future study. Likewise, the CADE comparison introduced in this journal extension begins to broaden the baseline set, but there is room for further comparison across representation-space, confidence-based, and streaming-window baselines [8,9,11,16,17].

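We further analyze drift-trigger behavior using simulation logs to distinguish between attack-driven deviations and sustained distributional shifts. Preliminary observations indicate that retraining is not triggered solely by isolated attack bursts: across CE-enabled runs, chunks containing attacks were associated with retraining less often than attack-free chunks (Table 10; overall $P(R \mid A) = 0.0159$ versus $P(R \mid \neg A) = 0.0230$). This suggests that the drift criterion captures broader changes in data characteristics. A more systematic study of this behavior remains future work.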

5.4. Operational Implications

From a deployment perspective, the findings suggest three practical conclusions. First, Approx-CCE offers a more attractive runtime-performance trade-off than CCE in repeated recalibration scenarios. Second, ACC helps eliminate brittle fixed chunk-size selection. Third, the CADE runtime observations reinforce that prediction-space CE can be more operationally feasible than representation-space approaches when the detector is embedded inside a repeated online retraining loop. Importantly, this does not imply that CADE is unsuitable for offline or semi-offline drift analysis. Instead, our results identify a specific deployment mismatch: representation-space retraining becomes costly when drift triggers can occur repeatedly during a streaming IDS simulation. The reduced-epoch and lightened CADE experiments strengthen this interpretation by showing that the runtime issue persists even after reducing training duration, model width, and drift-trigger aggressiveness.

These observations matter because the core value of FADES is not simply that it detects drift, but that it does so in a way that remains compatible with the real-time constraints of IDS operation [56,57]. In this context, real-time compatibility refers to the ability to process streaming flow records, evaluate drift signals, and complete any required recalibration within a time scale that does not cause the detector to fall behind the incoming traffic stream. A detector that requires hours or days of representation-model retraining after each drift trigger may still be useful for forensic analysis or periodic offline model maintenance, but it is poorly matched to inline or near-real-time IDS operation. By contrast, the CE-based components operate on prediction probabilities and calibration buffers, allowing drift checks and recalibration to remain lightweight relative to the streaming workload.
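The notion of real-time compatibility described above can be illustrated with a small backlog model. This is a sketch under stated assumptions: chunks arrive at a fixed interval, processing is single-threaded, and all timings are hypothetical rather than measured FADES or CADE values.

```python
def max_lag(arrival_interval_s, process_s, recalib_s, drift_chunks, n_chunks):
    """Worst delay (seconds) between a chunk's arrival and the end of its processing.

    drift_chunks holds the indices of chunks that trigger a recalibration or
    retraining step costing recalib_s extra seconds.
    """
    free_at = 0.0  # time at which the detector becomes idle again
    worst = 0.0
    for i in range(n_chunks):
        arrival = i * arrival_interval_s
        start = max(arrival, free_at)  # wait if the detector is still busy
        cost = process_s + (recalib_s if i in drift_chunks else 0.0)
        free_at = start + cost
        worst = max(worst, free_at - arrival)
    return worst


# Hypothetical numbers: one chunk per second, 0.2 s to score a chunk,
# and a single drift trigger at chunk 10.
light = max_lag(1.0, 0.2, 5.0, {10}, 100)      # lightweight recalibration
heavy = max_lag(1.0, 0.2, 3600.0, {10}, 100)   # hour-long retraining
print(f"worst lag, light recalibration: {light:.1f} s")
print(f"worst lag, heavy retraining:    {heavy:.1f} s")
```

In this toy model the lightweight update is absorbed within a few chunks, while the heavy update leaves the detector behind the stream for the rest of the run.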

5.5. Limitations and Future Work

This study has several limitations. First, the current work focuses on binary intrusion detection. Extending FADES to multi-class and open-world settings remains an important future direction.

Second, while CE-based drift signaling operates independently of label quality, the retraining step can inherit biases from imperfect labels or pseudo-labels. Future work should explore more conservative retraining triggers, active selection, and uncertainty-aware filtering.

Third, the current evaluation emphasizes abrupt attack-driven distribution shifts. Additional drift scenarios, such as gradual or recurring drift, would further broaden the study.

Fourth, portions of benchmark datasets may contain sparse or unevenly distributed attack labels, which can affect rolling-buffer behavior and retraining dynamics.

Fifth, the CADE integration in this journal version clarifies the practicality gap between FADES’s CE-based detectors and contrastive autoencoder approaches under a repeated online retraining policy. We evaluated original, reduced-epoch, and lightened CADE configurations, but this comparison still does not constitute an exhaustive CADE tuning study, nor does it evaluate CADE under less frequent offline or semi-offline update schedules. Future work should evaluate representation-space detectors under update policies better aligned with their intended deployment assumptions.

Finally, the current experiments do not yet cover distributed or federated deployments. Extending FADES to such environments while preserving the practical benefits of CE and Approx-CCE is a promising area for future work.

6. Conclusions

We presented FADES, a framework for adaptive drift estimation via conformal signals in streaming IoT intrusion detection. FADES combines supervised classification, CE, drift signaling, retraining, and adaptive chunking into a single operational pipeline.

The framework is centered on two main design contributions. Approx-CCE preserves the main statistical structure of cross-conformal evaluation while reducing repeated calibration cost, making it better suited to streaming use than standard CCE. ACC dynamically adjusts evaluation granularity in response to recent drift activity, avoiding both calibration explosions and overly coarse fixed evaluation windows.

Across both in-domain and cross-dataset settings, FADES demonstrates that CE-based drift detection is a promising and efficient approach within the evaluated streaming IDS setting. The journal extension further strengthens this conclusion by incorporating variance-aware reporting and by evaluating CADE as a representation-space comparison point. Across original, reduced-epoch, and lightened CADE configurations, CADE did not complete a single streaming simulation within the 7-day walltime limit, whereas Approx-CCE with ACC completed the same setting in 785.36 s. This suggests that CE-based monitoring is better matched to the repeated recalibration regime studied here, while representation-space detectors may require less frequent or differently structured update policies.

For pairwise comparisons between CE configurations, we additionally compute practical effect sizes using absolute differences in mean F1 score, mean runtime, and mean retraining count. Because the same seed and run structure is used across configurations, these comparisons emphasize whether observed improvements are operationally meaningful rather than only statistically distinguishable. Runtime improvements are interpreted using both absolute time savings and multiplicative speedup factors, while predictive-performance differences are interpreted relative to their run-to-run standard deviations. This allows us to distinguish cases where two methods are statistically similar in detection quality but materially different in computational cost.
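The practical effect sizes described above can be sketched as follows. The helper names are assumptions, and the inputs are the single-run CCE and Approx-CCE values for CICIDS2018 → DFAIR Drift from Table 7, used here only as stand-ins for per-configuration means; the F1 standard deviation is taken from Table 11.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    f1: float
    runtime_s: float
    retrains: float


def practical_effects(baseline, candidate, f1_std):
    """Absolute and relative differences between two CE configurations."""
    f1_diff = candidate.f1 - baseline.f1
    return {
        "f1_diff": f1_diff,
        "f1_diff_in_std": f1_diff / f1_std,  # difference relative to run-to-run spread
        "runtime_saved_s": baseline.runtime_s - candidate.runtime_s,
        "speedup": baseline.runtime_s / candidate.runtime_s,
        "retrain_diff": candidate.retrains - baseline.retrains,
    }


# Values copied from Table 7 (CICIDS2018 -> DFAIR Drift) and Table 11.
cce = Summary(f1=0.9811, runtime_s=16584.1373, retrains=2)
approx_cce = Summary(f1=0.9816, runtime_s=6020.9825, retrains=2)
effects = practical_effects(cce, approx_cce, f1_std=0.2489)

for name, value in effects.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the F1 difference is a small fraction of the run-to-run standard deviation, while the runtime differs by a factor of roughly 2.75, which is the kind of case the paragraph above distinguishes.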

Future work will extend FADES to multi-class and open-world intrusion detection, broaden drift and baseline coverage, and refine the theoretical and empirical characterization of CE behavior under streaming non-exchangeability.

Author Contributions

Conceptualization, S.B., L.L., G.D. and S.R.; methodology, S.B.; software, S.B.; validation, S.B.; formal analysis, S.B.; investigation, S.B.; resources, L.L., G.D. and S.R.; data curation, S.B.; writing—original draft preparation, S.B.; writing—review and editing, S.B., L.L., G.D. and S.R.; visualization, S.B.; supervision, L.L., G.D. and S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Security Agency grant number H98230-21-1-0163.

Data Availability Statement

The code for FADES, the CADE runtime, CAPEX (our capture script), and our dataset are publicly available [37,58,59].

Acknowledgments

We acknowledge Bradley Boswell, Lin Li, and Swarna Rajaganapathy for their help, insightful discussions, and work on this project. Portions of this work build on preliminary research supported in part by the National Security Agency (NSA) National Centers of Academic Excellence in Cybersecurity (NCAE-C) Grant H98230-21-1-0163. The views expressed herein are those of the authors and do not necessarily represent the views of the NSA.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 2014, 46, 1–37.
2. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: Boston, MA, USA, 2005; Volume 29.
3. Shafer, G.; Vovk, V. A tutorial on conformal prediction. J. Mach. Learn. Res. 2008, 9, 371–421.
4. Angelopoulos, A.N.; Bates, S. Conformal prediction: A gentle introduction. Found. Trends Mach. Learn. 2023, 16, 494–591.
5. Jordaney, R.; Sharad, K.; Dash, S.K.; Wang, Z.; Papini, D.; Nouretdinov, I.; Cavallaro, L. Transcend: Detecting concept drift in malware classification models. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017; pp. 625–642.
6. Barbero, F.; Pendlebury, F.; Pierazzi, F.; Cavallaro, L. Transcending transcend: Revisiting malware classification in the presence of concept drift. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2022; pp. 805–823.
7. Yang, L.; Guo, W.; Hao, Q.; Ciptadi, A.; Ahmadzadeh, A.; Xing, X.; Wang, G. CADE: Detecting and explaining concept drift samples for security applications. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Online, 11–13 August 2021; pp. 2327–2344.
8. Alam, M.T.; Piplai, A.; Rastogi, N. ADAPT: A Pseudo-labeling Approach to Combat Concept Drift in Malware Detection. arXiv 2025, arXiv:2507.08597.
9. Ying, J.; Zhu, T.; Zheng, A.; Chen, T.; Lv, M.; Chen, Y. METANOIA: A Lifelong Intrusion Detection and Investigation System for Mitigating Concept Drift. arXiv 2024, arXiv:2501.00438.
10. Le, D.C.; Zincir-Heywood, N. Anomaly detection for insider threats using unsupervised ensembles. IEEE Trans. Netw. Serv. Manag. 2021, 18, 1152–1164.
11. Gupta, R.; Liu, S.; Zhang, R.; Hu, X.; Kommaraju, P.; Wang, X.; Benkraouda, H.; Feamster, N.; Nahrstedt, K. Generative active adaptation for drifting and imbalanced network intrusion detection. arXiv 2025, arXiv:2503.03022.
12. Baldini, G.; Amerini, I. Online Distributed Denial of Service (DDoS) intrusion detection based on adaptive sliding window and morphological fractal dimension. Comput. Netw. 2022, 210, 108923.
13. Baena-García, M.; del Campo-Ávila, J.; Fidalgo, R.; Bifet, A.; Gavalda, R.; Morales-Bueno, R. Early drift detection method. In Proceedings of the Fourth International Workshop on Knowledge Discovery from Data Streams, Philadelphia, PA, USA, 20 August 2006; Volume 6, pp. 77–86.
14. Barrett, S.; Li, L.; Dorai, G.; Rajaganapathy, S. FIRCE: A Framework for Intrusion Response and Conformal Evaluation. arXiv 2026, arXiv:2605.01962.
15. Soltani, M.; Khajavi, K.; Jafari Siavoshani, M.; Jahangir, A.H. A multi-agent adaptive deep learning framework for online intrusion detection. Cybersecurity 2024, 7, 9.
16. Yang, S.; Zheng, X.; Li, J.; Xu, J.; Zhang, X.; Ngai, E.C. Self-Supervised Adaptation Method to Concept Drift for Network Intrusion Detection. IEEE Trans. Dependable Secur. Comput. 2025, 22, 7632–7646.
17. Xu, R.; Cheng, Y.; Liu, Z.; Xie, Y.; Yang, Y. Improved Long Short-Term Memory based anomaly detection with concept drift adaptive method for supporting IoT services. Future Gener. Comput. Syst. 2020, 112, 228–242.
18. Bifet, A.; Gavalda, R. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining; SIAM: Philadelphia, PA, USA, 2007; pp. 443–448.
19. Spinosa, E.J.; de Carvalho, A.P.d.L.F.; Gama, J. Novelty detection with application to data streams. Intell. Data Anal. 2009, 13, 405–422.
20. Gardner, E.S., Jr. Exponential smoothing: The state of the art. J. Forecast. 1985, 4, 1–28.
21. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018.
22. Brown, R.G. Smoothing, Forecasting and Prediction of Discrete Time Series; Courier Corporation: North Chelmsford, MA, USA, 2004.
23. Holt, C.C. Forecasting seasonals and trends by exponentially weighted moving averages. Int. J. Forecast. 2004, 20, 5–10.
24. Boswell, B.; Barrett, S.; Rajaganapathy, S.; Dorai, G. FLARE: Feature-based Lightweight Aggregation for Robust Evaluation of IoT Intrusion Detection. arXiv 2025, arXiv:2504.15375.
25. Boswell, B.; Dorai, G.; Barrett, S.; Rajaganapathy, S.; Li, L. FIRE: Fog-Based Intrusion Detection Framework for Real-Time Security in IoT Environments. In Proceedings of the Future Technologies Conference; Springer: Cham, Switzerland, 2025; pp. 209–226.
26. Kato, Y.; Tax, D.M.; Loog, M. A review of nonconformity measures for conformal prediction in regression. Conform. Probabilistic Predict. Appl. 2023, 204, 369–383.
27. Linusson, H.; Johansson, U.; Boström, H.; Löfström, T. Efficiency comparison of unstable transductive and inductive conformal classifiers. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations; Springer: Berlin/Heidelberg, Germany, 2014; pp. 261–270.
28. Messoudi, S.; Rousseau, S.; Destercke, S. Deep conformal prediction for robust models. In Proceedings of the International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems; Springer: Cham, Switzerland, 2020; pp. 528–540.
29. Johansson, U.; Boström, H.; Löfström, T.; Linusson, H. Regression conformal prediction with random forests. Mach. Learn. 2014, 97, 155–176.
30. Hočevar, T.; Zupan, B.; Stålring, J. Conformal Prediction with Orange. J. Stat. Softw. 2021, 98, 1–22.
31. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 2021, 34, 18932–18943.
32. Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90.
33. Gorishniy, Y.; Kotelnikov, A.; Babenko, A. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. arXiv 2024, arXiv:2410.24210.
34. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: London, UK, 2017; pp. 1321–1330.
35. Barrett, S.; Boswell, B.; Dorai, G. Exploring the vulnerabilities of IoT devices: A comprehensive analysis of mirai and bashlite attack vectors. In Proceedings of the 2023 10th International Conference on Internet of Things: Systems, Management and Security (IOTSMS); IEEE: Piscataway, NJ, USA, 2023; pp. 125–132.
36. Boswell, B.; Barrett, S.; Dorai, G. Unraveling iot traffic patterns: Leveraging principal component analysis for network anomaly detection and optimization. In Proceedings of the 2024 12th International Symposium on Digital Forensics and Security (ISDFS); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
37. Barrett, S. CAPEX-Capture-for-Evaluation: IoT Attack and Baseline Data Capture Scripts. 2024. Available online: https://github.com/DFAIR-LAB-Augusta/CAPEX-Capture-for-Evaluation (accessed on 30 June 2025).
38. Leevy, J.L.; Khoshgoftaar, T.M. A survey and analysis of intrusion detection models based on cse-cic-ids2018 big data. J. Big Data 2020, 7, 104.
39. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 2018, 1, 108–116.
40. Moustafa, N.; Slay, J. UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Proceedings of the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10–12 November 2015; pp. 1–6.
41. GitHub - Ahlashkari/CICFlowMeter: CICFlowmeter-V4.0 (Formerly Known as ISCXFlowMeter) is an Ethernet Traffic Bi-Flow Generator and Analyzer for Anomaly Detection That Has Been Used in Many Cybersecurity Datsets such as Android Adware-General Malware Dataset (CICAAGM2017), IPS/IDS dataset (CICIDS2017), Android Malware Dataset (CICAndMal2017) and Distributed Denial of Service (CICDDoS2019). Available online: https://github.com/ahlashkari/CICFlowMeter (accessed on 9 April 2026).
42. GitHub - Hieulw/Cicflowmeter: CICFlowmeter Written in Python for Easy to Try Out. Available online: https://github.com/hieulw/cicflowmeter (accessed on 9 April 2026).
43. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
44. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
45. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
46. Hendrycks, D. Gaussian Error Linear Units (Gelus). arXiv 2016, arXiv:1606.08415.
47. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
48. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
49. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
50. Canadian Institute for Cybersecurity. IDS 2018|Datasets|Research|Canadian Institute for Cybersecurity|UNB. Available online: https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 9 November 2025).
51. Amazon Web Services. A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018) - Registry of Open Data on AWS. Available online: https://registry.opendata.aws/cse-cic-ids2018/ (accessed on 9 November 2025).
52. UNSW Canberra Cyber. The UNSW-NB15 Dataset|UNSW Research. Available online: https://research.unsw.edu.au/projects/unsw-nb15-dataset (accessed on 9 November 2025).
53. Canadian Institute for Cybersecurity. Applications|Research|Canadian Institute for Cybersecurity|UNB. Available online: https://www.unb.ca/cic/research/applications.html (accessed on 9 November 2025).
54. Songma, S.; Sathuphan, T.; Pamutha, T. Optimizing intrusion detection systems in three phases on the CSE-CIC-IDS-2018 dataset. Computers 2023, 12, 245.
55. Yoo, A.B.; Jette, M.A.; Grondona, M. Slurm: Simple linux utility for resource management. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing; Springer: Berlin/Heidelberg, Germany, 2003; pp. 44–60.
56. Nelson, A.; Rekhi, S.; Souppaya, M.; Scarfone, K. Incident Response Recommendations and Considerations for Cybersecurity Risk Management: A CSF 2.0 Community Profile; NIST Special Publication NIST SP 800-61r3; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2025.
57. Verizon. 2025 Data Breach Investigations Report; Technical Report; Verizon: Basking Ridge, NJ, USA, 2025.
58. Barrett, S. XSecIoT - FIRCE Backup Branch. 2025. Available online: https://github.com/DFAIR-LAB-Augusta/XSecIoT/tree/FIRCE_bkp (accessed on 30 June 2025).
59. Limin Yang, S.B. GitHub - DFAIR-LAB-Augusta/CADE_FIRCE: Code from the USENIX Security 2021 Paper - CADE: Detecting and Explaining Concept Drift Samples for Security Applications; Updates to Work with FIRCE. Available online: https://github.com/DFAIR-LAB-Augusta/CADE_FIRCE (accessed on 9 April 2026).

Figure 1. System-level overview of the FADES streaming pipeline. Flow chunks are streamed incrementally, evaluated using CE, and logged. Upon drift detection, retraining and recalibration are triggered from a rolling history buffer.

Figure 2. FADES model metrics by CE type with the DFAIR training and drift datasets.

Figure 3. FADES model metrics by CE type with UNSW-NB15 as the source dataset and the DFAIR Drift dataset as the target stream.

Figure 4. FADES model metrics by CE type with CICIDS2018 as the source dataset and the DFAIR Drift dataset as the target stream.

Table 1. Comparison of model adaptation frameworks under concept drift.

Paper	Context	Drift Detection Method	Adaptation Trigger	Retraining Strategy	Notable Features/ Contributions
Jordaney et al. [5] (Transcend)	Malware Detection	Nonconformity score drop-off	Manual threshold	None (reject-only)	Introduces transductive conformal prediction for drift rejection
Barbero et al. [6] (Transcending Transcend)	Malware Classification	Approximate p-value tracking	Statistical p-value thresholds	Retrains with buffered calibration	Improves Transcend efficiency; adds ICE and CCE variants
Alam et al. [8] (ADAPT)	Malware Classification	Pseudo-label confidence filters	Low confidence + drift signal	Online retraining from pseudo-labeled cache	Confidence-aware retraining without labels
Soltani et al. [15]	Sequential Flow-Based IDS	Continual retraining using small packet windows	Stream update cycle	Lightweight federated deep model retrained on new flows	Demonstrates 95%+ detection rate and fast adaptation to new patterns
Gupta et al. [11]	IDS (CIC-IDS)	Density-aware active sampling	Generative + active retraining on selected samples	Combines augmentation with label-efficient active learning	F1-score improvement from 0.60 to 0.86 in experiments
Ying et al. [9] (METANOIA)	Unsupervised IDS (PIDS)	Streaming anomaly detection	Incremental anomaly model updates over time windows	Minimizes false positives via rehearsal nodes	Continuous adaptation with reduced false positives
Yang et al. [16] (ReCDA)	IDS under concept drift	None (self-supervised alignment)	New unlabeled window	Self-supervised rep. update + weakly supervised classifier tuning	Label-efficient; plug-and-play rep. module; results on UNSW-NB15, CICIDS-2017, Kyoto-2006+
Xu et al. [17] (I-LSTM + CDA)	IoT anomaly detection	None (time-weighted sampling)	Time-window update	Periodic LSTM retrain on CDA-balanced samples	Time-aware LSTM + smooth activation; smart-home results
Yang et al. [7] (CADE)	Security concept drift detection	Contrastive autoencoder in representation space	Drifting-sample discovery	Representation-space adaptation and explanation	Explains drifting samples but is runtime-heavy in our setting
FADES	IoT Intrusion Detection	Conformal Evaluation (Approx-CCE)	p-value drift trigger	Rolling log retrain + CE recalibration	Real-time CE with ACC

Table 2. ML models used to compute nonconformity measures [14]. A ✓ indicates that the model generally satisfies the corresponding criterion, whereas an × indicates that it generally does not. Text in parentheses provides qualifications or implementation-specific conditions for the entry.

Model	I1 Prob.-Based NCM $(1 - p_{\hat{y}})$	I2 Post Hoc Calib.	I3 Fast (Re)Train	I4 Stable NCM (After Calib.)	I5 Good for Tabular Flows	I6 Temp. Scaling Suffices
Linear/Kernel SVM	✓ (via Platt/iso)	✓	(linear:) ✓ (kernel:) ×	✓ (margins)	✓	× (needs $A, B$ or iso)
Logistic Regression	✓	✓	✓	✓	✓	×
Random Forest	✓ (vote probs)	✓	✓	✓	✓	×
Gradient Boosting/XGBoost	✓ (`softprob`)	✓	✓	✓	✓	×
k-NN	✓ (freq.)	(mixed)	× (scale)	(data dependent)	(depends)	×
Naive Bayes	✓	(mixed)	✓	(data dependent)	✓	×
MLP	✓	✓ (temp. scaling)	✓ (compact)	✓ (with reg.)	✓	✓

Table 3. IoT devices used in dataset collection [14].

Device	Interaction Method
Amazon Echo Dot (5th Gen)	App & Voice
Google Home Cam	App
Google Nest Mini	App & Voice
Kasa Smart Plug	App & On-Device Control
LongPlus Baby Monitor	App
NiteBird Smart Bulb	App
OKP K2 Vacuum	App
Philips Hue Hub	App
Ring Video Doorbell	App & On-Device Control
Roborock K2 Vacuum	App & On-Device Control

Table 4. Comparison of dataset collection characteristics.

Dataset	Environment	Devices/Hosts	Users	Capture Method	Flow Exporter	Attack Types
DFAIR	IoT testbed	10 IoT devices	1 user	tcpdump PCAP	CICFlowMeter	Floods
DFAIR Drift	IoT testbed	10 IoT devices	1 user	tcpdump PCAP	CICFlowMeter	Floods + HULK
CICIDS2018	Enterprise emulation	∼450 hosts	Multiple	PCAP logs	CICFlowMeter-V3	Scenario attacks
UNSW-NB15	Cyber-range	Synthetic hosts	Automated	PCAP capture	Argus/Bro	9 attack classes

Table 5. CE algorithm asymptotic time complexity [14].

Method	Calibration	Test (per Sample)
TCE	$O (n^{2})$	$O (1)$
Approx-TCE	$O (\frac{n}{1 - p})$	$O (1)$
ICE	$O (p n)$	$O (1)$
CCE	$O (\frac{p n}{1 - p})$	$O (1)$
Approx-CCE	$O (n)$	$O (1)$

Table 6. CADE configurations evaluated under the FADES streaming simulation loop.

Configuration	DFAIR Dims.	UNSW Dims.	Margin	MAD Thresh.	Min. Ratio	Min. Count	Epochs
Original CADE	$(76, 512, 128, 32)$	$(21, 512, 128, 32)$	10.0	3.5	0.05	1	250
Reduced-Epoch CADE	$(76, 512, 128, 32)$	$(21, 512, 128, 32)$	10.0	3.5	0.05	1	50
Lightened CADE	$(76, 128, 64, 32)$	$(21, 128, 64, 32)$	5.0	4.0	0.10	3	10

Table 7. Approx-CCE versus other CE variants across in-domain and cross-dataset benchmarks [14]. The arrow denotes the transfer setting, with the source dataset on the left and the target drift stream on the right.

Dataset	CE Type	CE Acc.	Prec.	Rec.	F1	Runtime (s)	#Calibs
DFAIR → DFAIR Drift	ICE	1.0000	0.9999	0.9999	0.9999	626.6639	No Retrain
	Approx-TCE	1.0000	1.0000	1.0000	1.0000	768.7549	4
	CCE	1.0000	0.9998	0.9999	0.9998	785.3604	2
	Approx-CCE	1.0000	0.9998	0.9999	0.99987	676.2752	2
CICIDS2018 → DFAIR Drift	ICE	0.9953	0.8990	0.9701	0.9288	5628.1554	2
	Approx-TCE	0.9976	0.9111	0.9852	0.9439	6637.4307	4
	CCE	0.9950	0.9962	0.9667	0.9811	16584.1373	2
	Approx-CCE	0.9952	0.9973	0.9667	0.9816	6020.9825	2
UNSW-NB15 → DFAIR Drift	ICE	0.9988	0.9897	0.9962	0.9929	600.9195	No Retrain
	Approx-TCE	0.9991	0.9939	0.9961	0.9950	720.1210	4
	CCE	0.9975	0.9881	0.9842	0.9861	1792.7051	7
	Approx-CCE	0.9980	0.9892	0.9885	0.9889	717.1803	4

Table 8. Adaptive chunking controller results with Approx-CCE [14].

Dataset	Chunk Size	Num Calibs	CE Accuracy	CE Precision	CE Recall	CE F1 Score	Runtime (s)
DFAIR Drift	100	11	1.0000	0.9999	1.0000	1.0000	598.0396
	75	15	1.0000	0.9999	1.0000	1.0000	764.7714
	50	18	1.0000	0.9999	1.0000	1.0000	761.2931
	25	18	1.0000	0.9999	1.0000	1.0000	772.4752
	15	17	0.9954	0.9970	0.9658	0.9789	877.6462
	10	54	1.0000	1.0000	1.0000	1.0000	2247.3858
	5	22	1.0000	0.9999	1.0000	0.9999	978.3778
	1	39	1.0000	0.9999	1.0000	0.9999	1870.6970
	Adaptive	2	1.0000	0.9998	0.9999	0.9998	676.2752
CICIDS2018 → DFAIR Drift	100	4	0.9973	0.9478	0.9833	0.9637	6328.3502
	75	2	0.9957	0.9969	0.9701	0.9831	7856.0773
	50	4	0.9973	0.9624	0.9814	0.9703	6214.5404
	25	3	0.9963	0.9978	0.9750	0.9862	6481.8293
	15	3	0.9964	0.9980	0.9750	0.9863	6289.8090
	10	3	0.9968	0.9462	0.9779	0.9596	6016.8208
	5	2	0.9952	0.9970	0.9668	0.9815	6113.9819
	1	2	0.9952	0.9972	0.9668	0.9816	6426.8002
	Adaptive	2	0.9952	0.9973	0.9667	0.9816	6020.9825
UNSW-NB15 → DFAIR Drift	100	5	0.9978	0.9891	0.9866	0.9878	732.5098
	75	26	0.9977	0.9888	0.9855	0.9871	1227.4375
	50	2	0.9982	0.9881	0.9915	0.9898	627.9414
	25	90	0.9975	0.9878	0.9841	0.9859	2721.6069
	15	90	0.9976	0.9891	0.9838	0.9864	2879.1444
	10	102	0.9977	0.9896	0.9844	0.9870	3213.1473
	5	180	0.9976	0.9888	0.9839	0.9863	5257.7047
	1	1296	0.9976	0.9885	0.9840	0.9862	38461.2289
	Adaptive	4	0.9980	0.9892	0.9885	0.9889	717.1803

Table 9. Runtime comparison between CADE variants and Approx-CCE under identical streaming simulation conditions.

Method	Configuration	Runtime	Completed	Retrains
CADE	Original, 250 epochs	>7 days	No	–
CADE	Reduced-epoch, 50 epochs	>7 days	No	175
CADE	Lightened, 10 epochs	>7 days	No	1016
Approx-CCE	Prediction-space CE	785.36 s	Yes	2

Table 10. Chunk-level association between attack presence and retraining across CE-enabled runs.

Group	$P(R \mid A)$	$P(R \mid \neg A)$	Risk Diff.	Odds Ratio	Fisher $p$
Overall	0.0159	0.0230	−0.0071	0.6866	<0.001
CIC_UNSW	0.0046	0.0128	−0.0082	0.3548	<0.001
DFAIR	0.0122	0.0125	−0.0003	0.9695	0.832
NB15	0.0312	0.0433	−0.0121	0.7108	<0.001
Approx-CCE	0.0103	0.0116	−0.0013	0.8827	0.485
Approx-TCE	0.0075	0.0156	−0.0082	0.4740	<0.001
CCE	0.0385	0.0441	−0.0056	0.8663	0.113
ICE	0.0080	0.0198	−0.0117	0.4024	<0.001
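As a worked check on how the risk difference and odds ratio columns of Table 10 relate to the two conditional retraining probabilities, the sketch below recomputes them from the rounded table values. The function names are assumptions. Because the inputs are rounded to four decimals, the results match the table only approximately, and the Fisher test is omitted because it requires the raw chunk counts, which are not listed in the table.

```python
def risk_difference(p_retrain_attack, p_retrain_no_attack):
    """P(R | A) - P(R | not A)."""
    return p_retrain_attack - p_retrain_no_attack


def odds_ratio(p_retrain_attack, p_retrain_no_attack):
    """Odds of retraining with an attack present divided by odds without one."""
    odds_attack = p_retrain_attack / (1.0 - p_retrain_attack)
    odds_no_attack = p_retrain_no_attack / (1.0 - p_retrain_no_attack)
    return odds_attack / odds_no_attack


# (P(R | A), P(R | not A)) copied from three rows of Table 10.
rows = {
    "Overall": (0.0159, 0.0230),
    "DFAIR": (0.0122, 0.0125),
    "ICE": (0.0080, 0.0198),
}

for group, (p_a, p_not_a) in rows.items():
    rd = risk_difference(p_a, p_not_a)
    oratio = odds_ratio(p_a, p_not_a)
    print(f"{group}: risk diff = {rd:+.4f}, odds ratio = {oratio:.4f}")
```

An odds ratio below 1 together with a negative risk difference means retraining was less likely in chunks that contained attacks than in chunks that did not.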

Table 11. Multi-run, multi-seed results for Approx-CCE + ACC, reported as mean ± standard deviation across 25 runs per configuration.

Dataset	Acc. ($\mu \pm \sigma$)	Prec.	Rec.	F1	Runtime (s)	#Retrains
DFAIR → DFAIR Drift	0.9930 ± 0.0010	0.9653 ± 0.0054	0.9997 ± 0.0002	0.9819 ± 0.0028	238.0271 ± 16.8674	1.40 ± 0.82
UNSW-NB15 → DFAIR Drift	0.9921 ± 0.0018	0.9700 ± 0.0108	0.9465 ± 0.0259	0.9579 ± 0.0187	499.9271 ± 19.2662	3.40 ± 1.04
CICIDS2018 → DFAIR Drift	0.9936 ± 0.0005	0.7950 ± 0.2469	0.7621 ± 0.2508	0.7780 ± 0.2489	5069.8297 ± 140.1692	1.00 ± 0.00
