1. Introduction
Internet of Things (IoT) infrastructures have become a foundational layer of modern cyber-physical systems, connecting vast numbers of resource-constrained devices across heterogeneous and often untrusted environments. Billions of heterogeneous devices now operate at the network edge, continuously generating high-dimensional telemetry across diverse environments. Industry projections estimate that more than 39 billion IoT endpoints will be active by 2030 [1], spanning critical sectors such as industrial control systems, smart cities, and healthcare infrastructures. While this proliferation enables fine-grained monitoring and automation, it also substantially enlarges the attack surface. Large-scale incidents, including Mirai-family botnets and stealthy infiltration campaigns, have demonstrated that compromised IoT devices can be leveraged to launch persistent and disruptive attacks [2]. Consequently, anomaly-based intrusion detection at distributed gateways has emerged as a critical security requirement. For clarity, we distinguish between the terms intrusion detection and anomaly detection. Intrusion detection refers to the broader network security task of identifying malicious activity in network traffic, whereas anomaly detection denotes a learning paradigm that identifies deviations from normal behavior. In this work, anomaly detection serves as the underlying mechanism for intrusion detection, enabling the detection of previously unseen threats without requiring labeled attack data.
2]. Consequently, anomaly-based intrusion detection at distributed gateways has emerged as a critical security requirement. For clarity, we distinguish between the terms intrusion detection and anomaly detection. Intrusion detection refers to the broader network security task of identifying malicious activity in network traffic, whereas anomaly detection denotes a learning paradigm that identifies deviations from normal behavior. In this work, anomaly detection serves as the underlying mechanism for intrusion detection, enabling the detection of previously unseen threats without requiring labeled attack data.
Despite its importance, practical intrusion detection in IoT environments faces several systemic challenges. First, raw traffic data is inherently sensitive and subject to regulatory and organizational constraints that preclude centralized collection. Second, traffic characteristics vary widely across deployment sites due to differences in device populations, protocol usage, and operational patterns, leading to pronounced non-independent and identically distributed (non-IID) behavior. Third, labeled attack data is typically scarce, incomplete, or entirely unavailable in operational settings, particularly for emerging zero-day threats. Together, these factors severely limit the applicability of conventional supervised learning approaches and post-hoc thresholding strategies, which implicitly assume access to representative labeled attacks.
Federated learning (FL) offers a natural architectural response to these constraints by enabling collaborative model training without centralizing raw telemetry [3]. However, existing FL-based intrusion detection systems remain inadequate for realistic IoT deployments. Prior empirical analyses of centralized and federated IDS under data and label skew report substantial performance degradation in heterogeneous settings [4,5]. Supervised federated approaches rely on labeled attack samples and consequently exhibit poor generalization to previously unseen threat families, due to their dependence on closed-set decision boundaries learned from historical attack distributions [6]. Reconstruction-based methods, while label-free, frequently entangle benign client-specific behavior with anomalous deviations, leading to elevated false-positive rates under heterogeneous conditions, because non-IID variations across clients distort the learned notion of global normality [7]. Other approaches that depend on public proxy datasets, cross-client embedding exchange, or centralized alignment mechanisms introduce additional privacy risks and increase deployment complexity, due to reliance on shared data or representation exposure beyond local clients [8,9], thereby undermining the core motivations of federated learning.
Beyond the federated setting, representation learning for network traffic presents its own challenges. Many centralized contrastive and temporal models employ generic data augmentations such as random cropping, shuffling, or aggressive temporal distortion [10,11,12]. When applied to network traffic, these operations can violate protocol causality and distort the semantics of communication flows. Some methods further rely on computationally intensive preprocessing, including dynamic time warping or causal discovery [3,13,14], rendering them impractical for resource-constrained edge gateways. Moreover, standard instance-level contrastive objectives implicitly treat all non-positive samples as equally dissimilar [10], an assumption that breaks down in benign-dominated regimes where semantically consistent traffic patterns may appear superficially different [11,12]. Collectively, these limitations expose a clear gap in existing solutions.
Based on limitations in prior work and the constraints of realistic federated IoT deployments, an effective intrusion detection framework should satisfy four key requirements: (i) operate in an unsupervised, benign-only training setting to enable zero-day detection when labels are limited; (ii) preserve protocol-level causal semantics during representation learning to avoid distorting communication behavior; (iii) disentangle globally shared invariants from client-specific patterns to mitigate non-IID effects in federated training; and (iv) remain lightweight and privacy-preserving to support deployment on resource-constrained edge devices. Existing approaches address some of these requirements in isolation, but do not satisfy them jointly.
To address this gap, we introduce Fed-DTCN (Federated Dual Temporal Contrastive Network), an unsupervised federated learning framework for zero-day anomaly detection in IoT networks. Fed-DTCN adopts a dual-encoder architecture built on compact Temporal Convolutional Networks (TCNs), comprising a shared encoder that captures globally invariant benign patterns and a private encoder that models client-specific behaviors. Training is performed via contrastive learning with semantic-aware, causality-preserving augmentations tailored to network traffic, including causal time warping, volumetric scaling, stochastic feature perturbation, and protocol-aware masking. To mitigate erroneous repulsion between semantically similar benign samples, Fed-DTCN employs a lightweight, soft-weighted contrastive objective that leverages a frozen auxiliary projector for affinity estimation. An explicit orthogonality regularizer further enforces decorrelation between shared and private representations, enabling selective aggregation of shared parameters while preserving personalization and privacy.
During inference, the learned representations support both global and client-adaptive anomaly scoring through centroid-based similarity measures, allowing deployments to balance sensitivity to universal attack patterns with robustness to local operational variations. Extensive experiments on diverse IoT and network security benchmarks demonstrate that Fed-DTCN matches or exceeds supervised baselines on standard attacks and achieves strong zero-day detection performance when evaluated on previously unseen attack classes. In particular, when the Botnet class in CSE-CIC-IDS2018 is withheld during training, Fed-DTCN achieves an F1-Score of 96%, whereas a state-of-the-art supervised federated baseline attains 0.52%. Additional ablation studies confirm the semantic value of the proposed augmentations, while evaluations under non-IID partitions show reduced inter-client variance and consistent per-client improvements, indicating suitability for heterogeneous federated deployments.
The main contributions of this work are summarized as follows:
We propose Fed-DTCN, a fully unsupervised federated framework for zero-day anomaly detection that learns representations of normal network traffic without relying on labeled attacks or reconstruction-based objectives.
We design semantic-aware, causality-preserving data augmentations tailored to network traffic, enabling effective contrastive learning while maintaining protocol semantics.
We introduce a dual-encoder architecture with an orthogonality regularizer to disentangle shared invariants from client-specific variations, addressing statistical heterogeneity and enhancing privacy through selective aggregation.
We develop a soft-weighted contrastive objective that reduces erroneous repulsion among semantically consistent benign samples in benign-dominated regimes.
We show through extensive experiments that Fed-DTCN provides strong zero-day generalization (96% F1-Score with a withheld Botnet class) and consistently lower inter-client variance across IID and non-IID federated partitions.
The remainder of this paper is organized as follows.
Section 2 reviews related work.
Section 3 formalizes the problem and the threat model.
Section 4 presents the proposed Fed-DTCN methodology.
Section 5 reports experimental results and ablation studies.
Section 6 concludes with limitations and deployment considerations.
3. Threat Model and Problem Formulation
3.1. Threat Model
We consider a network-level adversary targeting an IoT infrastructure by injecting previously unseen and stealthy attack traffic that complies with standard communication protocols. The adversary’s objective is test-time evasion: to generate malicious network flows whose statistical and protocol-level characteristics partially overlap with benign traffic, thereby reducing detectability while preserving functional attack semantics.
The attacker operates under an open-set (zero-day) setting. Entire attack families drawn from an unknown distribution are absent from the training data and appear only during inference: the support of the attack distribution encountered at test time is disjoint from the support of the benign training distribution.
The adversary is assumed to be black-box with respect to the learning system and has no access to model parameters, internal representations, training data, or semantic augmentation policies. Its interaction with the system is limited to generating network traffic at test time.
The scope of this work is restricted to evasion under semantic distribution shift. Attacks against the federated learning process itself, including Byzantine behavior, data poisoning, model poisoning, and backdoor insertion, are outside the scope of this study. The federated infrastructure is therefore assumed to execute the training protocol correctly, with all clients following the prescribed optimization procedure.
Privacy Assumptions. In addition to the external network adversary described above, we assume the system follows an honest-but-curious server model in which the server correctly executes the aggregation protocol but may attempt to infer information from received model updates. Prior work has shown that parameter sharing in FL may expose systems to risks, such as gradient leakage or model inversion attacks. In Fed-DTCN, only shared encoder updates are transmitted during aggregation, while client-specific private encoder parameters remain strictly local, reducing the amount of information exposed to the server. However, the proposed framework does not implement formal cryptographic protections such as secure aggregation or differential privacy, and defending against such inference attacks is outside the scope of this work.
This threat model is consistent with the experimental protocol adopted in the evaluation, in which entire attack categories are excluded during training and introduced only at test time.
3.2. Problem Formulation
We consider a federated IoT environment comprising K edge gateways (clients), denoted by $\{C_1, \dots, C_K\}$. Each client $C_k$ observes a private and continuous stream of multivariate network traffic, represented as a local dataset $\mathcal{D}_k = \{X_i\}_{i=1}^{N_k}$. Each sample $X_i \in \mathbb{R}^{T \times F}$ corresponds to a traffic window of length T time steps with F features.
For clarity and to support semantic-aware augmentation, each traffic window is decomposed along the feature axis as $X = [\,X^{\mathrm{id}} \,\|\, X^{\mathrm{beh}}\,]$, where $X^{\mathrm{id}}$ captures identity- or header-related attributes (e.g., protocol fields and device identifiers), while $X^{\mathrm{beh}}$ represents behavioral flow statistics such as packet counts, byte volumes, and inter-arrival times. Throughout the remainder of the paper, $X$ denotes a complete traffic window, while $X^{\mathrm{id}}$ and $X^{\mathrm{beh}}$ are referenced explicitly when augmentation operations are applied selectively. The global dataset is defined as $\mathcal{D} = \bigcup_{k=1}^{K} \mathcal{D}_k$, with a total number of training samples $N = \sum_{k=1}^{K} N_k$. The main notations used in this paper are summarized in Table 2.
The anomaly detection task is formulated under three fundamental constraints. First, the system operates in an unsupervised zero-day setting: the training dataset contains only benign traffic, while anomalies originate from unseen attack distributions that are disjoint from the training support. Second, the federated environment exhibits pronounced statistical heterogeneity, as client data are drawn from distinct benign distributions $P_k$, with $P_k \neq P_{k'}$ for $k \neq k'$ in general. This non-IID property must be explicitly addressed to prevent performance degradation of the global model. Third, the representation learning process must preserve semantic validity: any augmentation used to generate contrastive views must maintain the causal semantics of the original traffic to ensure physically meaningful invariances.
To satisfy these requirements, the federated system maintains a global shared parameter set $\theta_s$, while each client maintains a local parameter set $(\theta_s^k, \theta_p^k)$. Here, $\theta_s^k$ denotes the local instance of the globally shared parameters synchronized from the server, and $\theta_p^k$ represents client-specific private parameters. The federated optimization objective is formulated as
$$\min_{\theta_s,\, \{\theta_p^k\}_{k=1}^{K}} \; \sum_{k=1}^{K} \frac{N_k}{N} \, \mathcal{L}_k\!\left(\theta_s^k, \theta_p^k\right).$$
This objective follows the standard federated empirical risk minimization formulation [15], where the shared parameters $\theta_s$ are optimized across clients while the private parameters $\theta_p^k$ remain client-specific. This optimization is subject to strict privacy and communication constraints: raw data $\mathcal{D}_k$ never leave the local client, and only the local copies of the shared parameters $\theta_s^k$ are synchronized with the central server. Private parameters $\theta_p^k$ remain local to absorb client-specific variability. After convergence, each client computes an anomaly score designed to maximize detection probability for zero-day attacks while ensuring a bounded false-positive rate on benign traffic.
4. Methodology
This section presents Fed-DTCN, a federated and unsupervised framework for zero-day anomaly detection in IoT systems. The proposed methodology is structured around four tightly integrated components: (i) semantic-aware causal augmentations, (ii) a dual-encoder architecture that disentangles globally shared and client-specific representations, (iii) a soft-weighted contrastive learning objective, and (iv) an orthogonality regularizer that explicitly enforces representational decorrelation. All components are designed to operate under strict privacy constraints and pronounced statistical heterogeneity across clients. An overview of the complete training and inference pipeline is illustrated in Figure 1. The figure distinguishes between the federated training phase, where shared representations are learned collaboratively across clients, and the local inference phase, where anomaly detection is performed independently at each client using frozen encoders.
4.1. Semantic-Aware Causal Augmentations
Let $\mathcal{T}$ denote a family of augmentation operators designed to preserve protocol-level causal semantics. Following the formulation introduced in the problem definition, each input traffic window is decomposed into identity features $X^{\mathrm{id}}$ and behavioral flow metadata $X^{\mathrm{beh}}$. This decomposition enables augmentation operators to act selectively on semantically distinct subspaces without violating protocol constraints or causal ordering.
Throughout this section, the full traffic window is denoted by $X$, while $X^{\mathrm{id}}$ and $X^{\mathrm{beh}}$ are referenced explicitly only when an augmentation targets a specific feature subspace. For each input $X$, two correlated augmented views $\tilde{X}^{(1)}$ and $\tilde{X}^{(2)}$ are generated locally at the client by sampling distinct but complementary pipelines from $\mathcal{T}$.
Unlike generic augmentation strategies commonly adopted in computer vision, all operators in $\mathcal{T}$ are explicitly constrained to preserve the physical semantics and causal structure of network traffic. The augmentation pipelines are summarized in Figure 2 and described as follows:
Protocol-aware masking (Pipeline 1): Structured masking is applied to contiguous protocol fields within the identity subspace $X^{\mathrm{id}}$. By selectively obscuring fixed header attributes such as IP addresses and device identifiers, this operation suppresses shortcut learning based on site-specific information and encourages the shared encoder to extract invariant protocol-level patterns.
Volumetric scaling (Pipeline 2): Volume-related features in the behavioral metadata subspace $X^{\mathrm{beh}}$, including packet counts and byte totals, are scaled by a multiplicative factor $\gamma$. This augmentation simulates benign traffic burstiness while preserving the intrinsic behavioral manifold relevant for detecting volumetric anomalies.
Causal time warping (Pipeline 3): Temporal variability is introduced through a smooth, strictly monotonic time-warping function implemented via spline interpolation. This operation perturbs packet inter-arrival times while strictly preserving event ordering, thereby maintaining protocol causality.
Stochastic feature perturbation (Pipeline 4): Fine-grained additive noise is injected into continuous features of $X^{\mathrm{beh}}$. For each feature $f$, the perturbed value is $\tilde{x}_f = x_f + \epsilon_f$, where $\epsilon_f \sim \mathcal{N}(0, \sigma_f^2)$. The noise magnitude $\sigma_f$ is calibrated using robust interquartile ranges to reflect realistic sensor uncertainty.
Collectively, these augmentation pipelines expand the benign data manifold while preserving semantic validity and causal coherence. This design ensures that the learned representations remain invariant to operational fluctuations commonly observed in real-world IoT traffic, which is essential for robust zero-day anomaly detection. The effectiveness of these semantic-aware augmentation operators is empirically validated in the ablation analysis presented in Section 5.8.1, where replacing the proposed transformations with unconstrained perturbations significantly degrades anomaly detection performance, demonstrating the importance of preserving protocol semantics during augmentation.
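As a concrete illustration, the four augmentation families can be sketched with lightweight NumPy stand-ins. All function names, parameter values, and noise ranges below are illustrative assumptions for exposition, not the configurations used in the paper (those are listed in Table 4):

```python
import numpy as np

rng = np.random.default_rng(42)

def volumetric_scale(x_beh, low=0.8, high=1.2):
    """Scale volume features (packet/byte counts) by one multiplicative factor."""
    return x_beh * rng.uniform(low, high)

def protocol_aware_mask(x_id, frac=0.25):
    """Mask a contiguous block of identity/header columns to suppress shortcut cues."""
    x = x_id.copy()
    f = x.shape[1]
    w = max(1, int(frac * f))
    s = rng.integers(0, f - w + 1)
    x[:, s:s + w] = 0.0
    return x

def causal_time_warp(x, strength=0.2):
    """Monotonic time warp: resample along a strictly increasing warped grid,
    perturbing inter-arrival spacing while preserving event order."""
    t_len = x.shape[0]
    incr = np.clip(1.0 + strength * rng.uniform(-1.0, 1.0, size=t_len), 0.1, None)
    warped = np.cumsum(incr)
    warped = (warped - warped[0]) / (warped[-1] - warped[0]) * (t_len - 1)
    base = np.arange(t_len)
    # interpolate each feature column back onto the regular grid
    return np.stack([np.interp(base, warped, x[:, j]) for j in range(x.shape[1])], axis=1)

def feature_jitter(x_beh, iqr_scale=0.05):
    """Additive Gaussian noise calibrated by per-feature interquartile range."""
    q75, q25 = np.percentile(x_beh, [75, 25], axis=0)
    sigma = iqr_scale * (q75 - q25 + 1e-8)
    return x_beh + rng.normal(0.0, 1.0, x_beh.shape) * sigma
```

Because the warp increments are clipped to be strictly positive, event ordering is never inverted, which is the property that distinguishes this operator from generic temporal shuffling.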
4.2. Dual Encoder Architecture
To explicitly disentangle global invariants from client-specific variations, each client maintains two backbone encoders: a shared encoder $f_s$ and a private encoder $f_p^k$. The parameters of the shared encoder are aggregated across clients via federated learning, whereas the private parameters remain local to client k and capture client-specific characteristics.
For an augmented input $\tilde{X}$, the encoders produce backbone embeddings $h_s = f_s(\tilde{X})$ and $h_p = f_p^k(\tilde{X})$. Each embedding is subsequently passed through a lightweight projection head, yielding $z_s = g_s(h_s)$ and $z_p = g_p(h_p)$. The canonical dimensionalities used in all experiments are reported later in Section 5. Projection outputs are $\ell_2$-normalized and used exclusively for contrastive learning. Gradients propagate through the projection heads into the backbone encoders during training, while the backbone embeddings $h_s$ and $h_p$ are reserved for orthogonality regularization and inference-time anomaly scoring.
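The separation between backbone embeddings and normalized projections can be sketched as follows. Linear layers stand in for the TCN backbones, and all dimensions and class names are illustrative assumptions:

```python
import numpy as np

def l2_normalize(z, axis=-1, eps=1e-12):
    """Normalize rows to unit length for contrastive similarity."""
    return z / (np.linalg.norm(z, axis=axis, keepdims=True) + eps)

class LinearEncoder:
    """Stand-in backbone: flattens a (T, F) window into an embedding h."""
    def __init__(self, in_dim, h_dim, seed):
        self.W = np.random.default_rng(seed).normal(0, 0.05, (in_dim, h_dim))
    def __call__(self, x):
        return x.reshape(x.shape[0], -1) @ self.W

class ProjectionHead:
    """Two-layer MLP head; outputs are l2-normalized for contrastive learning."""
    def __init__(self, h_dim, z_dim, seed):
        g = np.random.default_rng(seed)
        self.W1 = g.normal(0, 0.05, (h_dim, h_dim))
        self.W2 = g.normal(0, 0.05, (h_dim, z_dim))
    def __call__(self, h):
        return l2_normalize(np.maximum(h @ self.W1, 0.0) @ self.W2)

B, T, F = 4, 16, 8
x = np.random.default_rng(0).normal(size=(B, T, F))
f_shared, f_private = LinearEncoder(T * F, 32, 1), LinearEncoder(T * F, 32, 2)
g_shared, g_private = ProjectionHead(32, 16, 3), ProjectionHead(32, 16, 4)

h_s, h_p = f_shared(x), f_private(x)      # backbone embeddings: scoring / orthogonality
z_s, z_p = g_shared(h_s), g_private(h_p)  # normalized projections: contrastive loss only
```

The design choice mirrored here is that the contrastive objective never sees $h_s$ or $h_p$ directly, which keeps the backbone embeddings free of the distortion the projection head absorbs.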
4.3. Soft-Weighted Contrastive Objective
To prevent representation collapse and encourage uniformity among semantically related benign samples, we introduce a soft-weighted contrastive learning objective that modulates negative sample contributions based on semantic proximity. Semantic proximity is estimated using a frozen auxiliary projector $\phi$, implemented as a lightweight MLP operating on flattened input windows. This auxiliary projector is used solely for affinity estimation and is never updated during federated training. The conceptual mapping and distance-aware weighting mechanism are illustrated in Figure 3.
For a mini-batch of B anchors, we construct a flattened view set $\mathcal{V}$ of size $2B$ containing both augmented views of each anchor. Each view $\tilde{X}_i$ is mapped to an auxiliary embedding $e_i = \phi(\tilde{X}_i)$, and pairwise distances are computed as $d_{ij} = \lVert e_i - e_j \rVert_2$. Equation (5) then assigns a soft repulsion weight $w_{ij}$ to each negative pair by passing the distance $d_{ij}$ through the sigmoid function $\sigma$, allowing semantically similar negatives to exert stronger contrastive pressure. These weights emphasize semantically close (hard) negatives while suppressing distant, non-informative samples.
The shared and private contrastive losses are defined symmetrically using the normalized projection vectors, resulting in $\mathcal{L}_{\mathrm{con}}^{s}$ and $\mathcal{L}_{\mathrm{con}}^{p}$, respectively. For a positive pair of augmented views $(i, j)$, the contrastive objective is defined by the soft-weighted negative log-likelihood:
$$\ell(i, j) = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big) + \sum_{n \neq i, j} w_{in} \exp\!\big(\mathrm{sim}(z_i, z_n)/\tau\big)}.$$
This formulation extends the standard InfoNCE contrastive objective [31] by incorporating soft weights $w_{in}$ for negative samples. Here, $\tau$ is the temperature parameter and $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity between $\ell_2$-normalized projections. The final shared and private contrastive losses, $\mathcal{L}_{\mathrm{con}}^{s}$ and $\mathcal{L}_{\mathrm{con}}^{p}$, are computed by averaging these contributions over all $2B$ views in the mini-batch, where j is the index of the positive counterpart for each anchor i:
This ensures that the learned benign manifold is constrained effectively, facilitating reliable detection of out-of-distribution zero-day threats while remaining robust to benign operational variations.
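A minimal sketch of the soft-weighted objective follows. Since the exact weighting form of Equation (5) is not reproduced here, the sigmoid weighting below (centered at the mean auxiliary distance, with a hypothetical scale `w_tau`) is one plausible instantiation that up-weights close negatives and suppresses distant ones:

```python
import numpy as np

def soft_weighted_info_nce(z, aux_dist, tau=0.5, w_tau=1.0):
    """
    z:        (2B, d) l2-normalized projections; views i and i+B are positive pairs.
    aux_dist: (2B, 2B) pairwise distances from the frozen auxiliary projector.
    Returns the mean soft-weighted negative log-likelihood over all 2B anchors.
    """
    n = z.shape[0]
    b = n // 2
    sim = z @ z.T / tau                                   # cosine sim / temperature
    mu = aux_dist[~np.eye(n, dtype=bool)].mean()          # reference distance
    w = 1.0 / (1.0 + np.exp(-(mu - aux_dist) / w_tau))    # sigmoid: close => larger weight
    pos = np.concatenate([np.arange(b) + b, np.arange(b)])  # index of positive view
    losses = []
    for i in range(n):
        j = pos[i]
        neg_mask = np.ones(n, dtype=bool)
        neg_mask[[i, j]] = False
        num = np.exp(sim[i, j])
        den = num + np.sum(w[i, neg_mask] * np.exp(sim[i, neg_mask]))
        losses.append(-np.log(num / den))
    return float(np.mean(losses))
```

Setting all weights to one recovers the standard InfoNCE denominator, so the soft weighting is a strict generalization rather than a different objective.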
4.4. Orthogonality Regularization
To explicitly decouple shared and private representations, we impose an orthogonality constraint on the backbone embeddings. Let $\bar{H}_s$ and $\bar{H}_p$ denote the column-wise mean-centered and $\ell_2$-normalized batch matrices constructed from the flattened set of augmented views. The orthogonality loss is defined as
$$\mathcal{L}_{\mathrm{orth}} = \big\lVert \bar{H}_s^{\top} \bar{H}_p \big\rVert_F^2.$$
This regularizer penalizes correlations between shared and private embeddings, encouraging the two representation spaces to capture complementary information.
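The regularizer reduces to a Frobenius norm of the cross-correlation between the two batch matrices, which can be computed in a few lines. This is a sketch of that computation, not the paper's implementation:

```python
import numpy as np

def orthogonality_loss(h_s, h_p):
    """Squared Frobenius norm of the cross-correlation between mean-centered,
    column-l2-normalized shared and private embedding matrices."""
    def center_norm(h):
        h = h - h.mean(axis=0, keepdims=True)              # column-wise mean-centering
        return h / (np.linalg.norm(h, axis=0, keepdims=True) + 1e-12)
    hs, hp = center_norm(h_s), center_norm(h_p)
    return float(np.sum((hs.T @ hp) ** 2))
```

When the shared and private embedding columns are uncorrelated across the batch, the loss is zero; identical embeddings are maximally penalized, which is what drives the two branches toward complementary information.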
4.5. Local Objective and Optimization
The total objective minimized locally at client k is
$$\mathcal{L}_k = \mathcal{L}_{\mathrm{con}}^{s} + \mathcal{L}_{\mathrm{con}}^{p} + \lambda_{\mathrm{orth}} \, \mathcal{L}_{\mathrm{orth}}.$$
This objective combines the shared contrastive loss, the client-specific private contrastive loss, and an orthogonality regularization term weighted by $\lambda_{\mathrm{orth}}$ to encourage disentanglement between the two representation spaces. All loss components are optimized jointly during local training. Soft negative weights are computed once per mini-batch and reused across both contrastive branches to improve computational efficiency.
4.6. Federated Training with Selective Aggregation
Federated optimization proceeds over synchronous communication rounds. At each round r, the central server selects a subset of participating clients $\mathcal{S}_r \subseteq \{1, \dots, K\}$ and broadcasts the current shared encoder parameters $\theta_s^{(r)}$ to the selected clients. In the experimental configuration described in Section 5.4, all clients participate in each round (i.e., $\mathcal{S}_r = \{1, \dots, K\}$). Each participating client performs local optimization on its private dataset, updating both the shared encoder parameters and its client-specific private encoder parameters.
After local training, each client transmits only the update to the shared encoder parameters, denoted by $\Delta\theta_s^k$, back to the server, while the private encoder parameters remain strictly local and are never communicated. This design enables personalization by allowing client-specific representations to adapt to local traffic characteristics while preserving privacy.
The server aggregates the received shared updates using a sample-size-weighted federated averaging (FedAvg) rule [15]:
$$\theta_s^{(r+1)} = \theta_s^{(r)} + \sum_{k \in \mathcal{S}_r} \frac{N_k}{\sum_{k' \in \mathcal{S}_r} N_{k'}} \, \Delta\theta_s^{k},$$
where $N_k$ denotes the number of local samples held by client k. By aggregating only the shared encoder parameters while preserving private components locally, the selective aggregation mechanism enables the global model to capture invariant representations across heterogeneous clients while maintaining client-level specialization.
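The selective, sample-size-weighted aggregation step can be sketched as follows. Parameter dictionaries stand in for encoder state; note that private parameters simply never appear in the client payloads:

```python
import numpy as np

def fedavg_shared(global_params, client_updates, client_sizes):
    """Sample-size-weighted aggregation of shared-encoder updates only.

    global_params:  {param_name: array} current shared parameters.
    client_updates: {client_id: {param_name: delta array}}; private encoder
                    parameters are never transmitted, so they cannot appear here.
    client_sizes:   {client_id: number of local samples N_k}.
    """
    total = sum(client_sizes[k] for k in client_updates)
    new_params = {}
    for name, value in global_params.items():
        agg = sum((client_sizes[k] / total) * client_updates[k][name]
                  for k in client_updates)
        new_params[name] = value + agg
    return new_params
```

With client sizes 1 and 3 and updates 0 and 4 on a scalar parameter, the weighted update is 0.25·0 + 0.75·4 = 3, illustrating how larger clients dominate the aggregate.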
4.7. Inference and Anomaly Scoring
During inference, an incoming traffic window X is encoded by the frozen shared and private backbones. Each client computes a global centroid $c_s$ and a client-specific centroid $c_p^k$ from a held-out benign calibration set. Cosine similarity scores $s_g = \cos(h_s, c_s)$ and $s_l = \cos(h_p, c_p^k)$ are computed between the normalized backbone embeddings and the corresponding centroids. The final anomaly score combines the global and client-specific similarity measures through a convex combination with mixing coefficient $\lambda$, so that windows dissimilar to both benign centroids receive high scores. The score is thresholded using a client-specific value $\eta_k$ selected from the calibration set to enforce the desired false-positive rate.
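The scoring and calibration steps can be sketched as follows. The specific fused form below (one minus the convex combination of similarities) is a plausible assumption consistent with the description, not a verbatim reproduction of the paper's scoring rule:

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def anomaly_score(h_s, h_p, c_global, c_private, lam=0.5):
    """Convex fusion of global and client-specific centroid similarities;
    high similarity to both benign centroids yields a low anomaly score."""
    s_g = cosine(h_s, c_global)
    s_l = cosine(h_p, c_private)
    return 1.0 - (lam * s_g + (1.0 - lam) * s_l)

def calibrate_threshold(benign_scores, target_fpr=0.01):
    """Client-specific threshold: the (1 - target_fpr) quantile of the
    anomaly scores observed on the held-out benign calibration set."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))
```

A window whose embeddings coincide with the centroids scores near zero, while an embedding pointing away from both centroids approaches the maximum score of two.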
4.8. Optimization and Inference Procedures
4.8.1. Federated Training Procedure
Algorithm 1 summarizes the federated optimization workflow of Fed-DTCN. The server initializes the global shared encoder parameters $\theta_s^{(0)}$ and coordinates training over R communication rounds. At each round r, a subset of clients $\mathcal{S}_r$ is selected, and the current shared parameters $\theta_s^{(r)}$ are broadcast.
Algorithm 1 Fed-DTCN Federated Training Procedure
Require: Clients $\{C_1, \dots, C_K\}$; global rounds R; local epochs E; batch size B; augmentation family $\mathcal{T}$; frozen auxiliary projector $\phi$; learning rate $\eta$.
Ensure: Global shared encoder $\theta_s$; local private encoders $\{\theta_p^k\}_{k=1}^{K}$.
1: Initialize global shared parameters $\theta_s^{(0)}$ // Server-side initialization
2: for round $r = 0, \dots, R-1$ do
3:  Select participating clients $\mathcal{S}_r$
4:  Broadcast $\theta_s^{(r)}$ to all $k \in \mathcal{S}_r$
5:  for each client $k \in \mathcal{S}_r$ do
6:   $\Delta\theta_s^k \leftarrow$ ClientUpdate$(k, \theta_s^{(r)})$
7:  end for
8:  $\theta_s^{(r+1)} \leftarrow \theta_s^{(r)} + \sum_{k \in \mathcal{S}_r} \frac{N_k}{\sum_{k' \in \mathcal{S}_r} N_{k'}} \Delta\theta_s^{k}$
9: end for
10: return $\theta_s^{(R)}$
11: function ClientUpdate$(k, \theta_s)$
12:  Load shared encoder $f_s$ with parameters $\theta_s$
13:  Load private encoder $f_p^k$ with parameters $\theta_p^k$
14:  for epoch $e = 1, \dots, E$ do
15:   for mini-batch $\mathcal{B} \subset \mathcal{D}_k$ do
16:    Generate two augmented views per sample using $\mathcal{T}$
17:    Flatten views into $\mathcal{V}$ with $|\mathcal{V}| = 2B$
18:    Map views to auxiliary embeddings via frozen projector $\phi$
19:    Compute soft weights $w_{ij}$ using Equation (5)
20:    Encode views with $f_s$ and $f_p^k$; obtain $\ell_2$-normalized projections
21:    Compute contrastive losses $\mathcal{L}_{\mathrm{con}}^{s}$ and $\mathcal{L}_{\mathrm{con}}^{p}$
22:    Construct centered matrices $\bar{H}_s, \bar{H}_p$; compute $\mathcal{L}_{\mathrm{orth}}$
23:    $\mathcal{L}_k \leftarrow \mathcal{L}_{\mathrm{con}}^{s} + \mathcal{L}_{\mathrm{con}}^{p} + \lambda_{\mathrm{orth}} \mathcal{L}_{\mathrm{orth}}$
24:    Update $(\theta_s, \theta_p^k)$ using Adam
25:   end for
26:  end for
27:  return $\Delta\theta_s^k$
28: end function
Each participating client executes a local ClientUpdate routine and returns only the shared-parameter update $\Delta\theta_s^k$. Client-specific private parameters are never transmitted, ensuring that local variations remain isolated. The server aggregates updates using a sample-size-weighted average to obtain the next global model $\theta_s^{(r+1)}$.
Within ClientUpdate, training proceeds using fully vectorized mini-batch operations. For each mini-batch, two semantic-preserving augmented views are generated per sample using the transformation family $\mathcal{T}$. A frozen auxiliary projector maps the resulting views to an affinity space, from which pairwise distances and soft negative weights are computed. Shared and private encoders jointly process all views, and normalized projections are used to compute the corresponding contrastive objectives. An orthogonality regularizer is applied to explicitly disentangle global and client-specific representations. After E local epochs, only the shared-parameter update is returned to the server.
4.8.2. Local Inference and Calibration
Algorithm 2 describes the client-side inference and decision-making procedure. During an offline calibration phase, each client computes branch-specific centroids using a held-out benign calibration set. These centroids characterize the expected representation of normal behavior in both the global and private embedding spaces.
At test time, each incoming window is encoded by the shared and private encoders, normalized, and compared to the corresponding centroids using cosine similarity. Global and local anomaly scores are computed independently and fused using a fixed mixing coefficient $\lambda$. The resulting anomaly score is thresholded using a client-specific decision threshold $\eta_k$. All calibration statistics remain local and may be refreshed periodically to accommodate slow changes in benign traffic patterns.
Algorithm 2 Fed-DTCN Inference and Anomaly Detection
Require: Test window X; trained encoders $f_s, f_p^k$; centroids $c_s, c_p^k$; mixing weight $\lambda$; threshold $\eta_k$.
Ensure: Anomaly label $y \in \{0, 1\}$.
1: $h_s \leftarrow f_s(X)$ // Step 1: Feature extraction
2: $h_p \leftarrow f_p^k(X)$
3: Normalize $h_s$ and $h_p$
4: $s_g \leftarrow \cos(h_s, c_s)$ // Cosine similarity to global centroid
5: $s_l \leftarrow \cos(h_p, c_p^k)$ // Cosine similarity to local centroid
6: $a \leftarrow 1 - \big[\lambda s_g + (1 - \lambda) s_l\big]$ // Fused anomaly score
7: if $a \geq \eta_k$ then
8:  return 1 // Anomaly Detected
9: else
10:  return 0 // Benign Traffic
11: end if
5. Performance Evaluation
5.1. Evaluation Datasets
Fed-DTCN is evaluated on two widely used intrusion detection benchmarks that represent complementary network environments: CSE–CIC–IDS2018 and TON_IoT.
Table 3 reports the exact class labels and flow counts used consistently across all experiments to ensure reproducibility.
CSE–CIC–IDS2018 [32] is an enterprise-scale intrusion detection dataset released by the Communications Security Establishment and the Canadian Institute for Cybersecurity. It contains benign traffic alongside multiple attack categories, including brute-force and denial-of-service variants. For zero-day evaluation, the Bot attack class is entirely withheld from training and used exclusively during testing.
TON_IoT [33] is an industrial IoT dataset collected from a realistic IIoT testbed by the Australian Centre for Cyber Security. The network traffic portion includes normal activity and structured attack categories such as DoS, scanning, and injection. In this dataset, all attack classes are included during training, and no category is withheld.
While Table 3 details the global class distributions, the training regimes for the evaluated methods differ fundamentally. The supervised baseline (FeCo [6]) utilizes the full labeled training partition (benign and attack flows). In contrast, Fed-DTCN operates in a strictly unsupervised manner, accessing only the Normal/Benign samples during training to learn a nominal representation. For CSE–CIC–IDS2018, this creates an open-set challenge in which the Bot class is entirely withheld from the training pool. For TON_IoT, although attack classes are available in the training partition for supervised methods, Fed-DTCN treats them as out-of-distribution entities and trains solely on benign telemetry. This setup ensures that Fed-DTCN is evaluated under the most stringent conditions relative to supervised alternatives.
5.2. Global Preprocessing Pipeline
The preprocessing pipeline transforms heterogeneous raw network traffic into a standardized feature representation suitable for federated contrastive learning. Categorical features are partitioned based on cardinality using a threshold of 50 unique values. Low-cardinality attributes undergo One-Hot Encoding, while high-cardinality features (e.g., URI strings and DNS queries) are processed using a Robust Feature Hasher with 64 output dimensions to prevent dimensionality explosion. Missing values are handled via median imputation for numeric features and a constant MISSING token for categorical entries.
To ensure that the model learns an uncontaminated baseline of normal behavior, numeric features are standardized using statistics computed exclusively from benign training samples: $\tilde{x}_j = (x_j - \mu_j^{\mathrm{benign}}) / \sigma_j^{\mathrm{benign}}$, where $\mu_j^{\mathrm{benign}}$ and $\sigma_j^{\mathrm{benign}}$ denote the per-feature mean and standard deviation over the benign training set.
This benign-only normalization strategy prevents anomaly-induced distribution shifts from influencing the normalization parameters, ensuring that anomalous samples remain detectable as statistical deviations. The resulting feature vectors are subsequently passed to the semantic-aware augmentation module and dual-encoder representation learning framework.
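The benign-only standardization step amounts to fitting a z-score scaler on benign samples only and applying it to all traffic. A minimal sketch:

```python
import numpy as np

def fit_benign_scaler(x_benign):
    """Compute per-feature mean and standard deviation from benign samples only,
    so anomalous traffic cannot shift the normalization statistics."""
    mu = x_benign.mean(axis=0)
    sd = x_benign.std(axis=0) + 1e-8   # epsilon guards constant features
    return mu, sd

def transform(x, mu, sd):
    """Apply the benign-fitted z-score transform to any sample."""
    return (x - mu) / sd
```

Because the statistics come only from benign data, a volumetrically anomalous flow maps to a large z-score rather than being absorbed into the scale, which is exactly the detectability property the paragraph describes.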
5.3. Federated Data Partitioning and Heterogeneity
To simulate heterogeneous federated deployments, the global dataset is partitioned across K clients under three distribution regimes: (i) an IID balanced split obtained by uniformly shuffling and evenly distributing the samples across clients, (ii) a non-IID quantity-skew split generated using a Dirichlet allocation over client sample counts, and (iii) a non-IID label distribution skew generated via class-wise Dirichlet sampling. These scenarios emulate federated environments where clients may exhibit different traffic volumes and observe different subsets of traffic classes (Figure 4).
5.3.1. Scenario I: Statistical Homogeneity (IID)
An IID partition is constructed by uniformly shuffling the data and allocating an equal number of samples to each client. This ensures that each client's empirical class distribution closely approximates the global prior and serves as an upper-bound reference for convergence and detection performance.
5.3.2. Scenario II: Quantity Skew (Resource Imbalance)
To model heterogeneous device sampling rates and storage capacities, client sample counts are generated using a Dirichlet proportion vector $p \sim \mathrm{Dir}(\beta \mathbf{1}_K)$, where $\beta$ controls concentration. Each sample is assigned to client k with probability $p_k$. This approach preserves global class proportions within each client while inducing severe volume imbalance. We evaluate several concentration settings for $\beta$.
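The quantity-skew partition can be sketched in a few lines with NumPy's Dirichlet sampler; the concentration value used below is illustrative:

```python
import numpy as np

def quantity_skew_partition(n_samples, n_clients, beta, seed=0):
    """Draw client proportions p ~ Dir(beta * 1_K), then assign each sample
    independently to client k with probability p_k. Small beta concentrates
    mass on a few clients, producing severe volume imbalance."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.full(n_clients, beta))
    assign = rng.choice(n_clients, size=n_samples, p=p)
    return [np.where(assign == k)[0] for k in range(n_clients)]
```

Because each sample is assigned independently of its label, the per-client class mix stays close to the global prior even when volumes are highly skewed, matching the intent of this scenario.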
5.3.3. Scenario III: Label Distribution Skew
To simulate semantic heterogeneity, where devices observe disjoint subsets of the attack manifold, we employ class-wise Dirichlet allocation [34]. For each class c, a client-proportion vector p_c ~ Dir(α · 1_K) is sampled, and each sample of class c is assigned to client k with probability p_{c,k}. The concentration parameter α controls the severity of label skew, with α → 0 inducing extreme fragmentation and α → ∞ approaching balanced allocation.
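A sketch of the class-wise Dirichlet allocation, following the standard recipe the citation refers to (parameter names are illustrative):

```python
import numpy as np

def label_skew_partition(labels, n_clients, alpha, seed=0):
    """Class-wise Dirichlet allocation: for each class c, draw client proportions
    p_c ~ Dir(alpha * 1_K) and scatter that class's samples accordingly.
    Small alpha concentrates each class on a few clients (extreme fragmentation);
    large alpha approaches a balanced allocation."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        p_c = rng.dirichlet(np.full(n_clients, alpha))
        assignment = rng.choice(n_clients, size=len(idx), p=p_c)
        for k in range(n_clients):
            client_indices[k].extend(idx[assignment == k].tolist())
    return client_indices
```

Unlike the quantity-skew split, this procedure deliberately distorts each client's class mix, so different clients end up observing different subsets of the attack classes.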
5.4. Implementation and Hyperparameter Settings
The Fed-DTCN framework is implemented in PyTorch 2.6 and evaluated in a GPU-accelerated federated simulation environment. To ensure fair and reproducible comparison, all baselines and proposed variants share the same backbone architecture, optimization schedule, and random seed initialization.
For feature extraction, we employ a Temporal Convolutional Network (TCN) with causal dilated convolutions. The architecture uses progressively larger dilation factors to capture multi-scale temporal dependencies while strictly preventing future information leakage. The contrastive learning module consists of a two-layer MLP that projects backbone embeddings into a latent projection space; the shared and private branches each use a projection dimension matched to the latent size of the corresponding backbone encoder.
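To make the causality constraint concrete, here is a framework-agnostic NumPy sketch of the core operation a causal dilated TCN layer performs (the actual model is a multi-layer PyTorch network; this single-channel version only illustrates the padding scheme):

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1-D convolution: y[t] = sum_i kernel[i] * x[t - i*dilation].
    Zero padding is applied on the LEFT only, so y[t] never depends on any
    sample after time t (no future information leakage)."""
    T, k = len(x), len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    y = np.empty(T)
    for t in range(T):
        taps = xp[t + pad - np.arange(k) * dilation]  # x[t], x[t-d], x[t-2d], ...
        y[t] = float(np.dot(kernel, taps))
    return y
```

Stacking such layers with growing dilations widens the receptive field exponentially while each output still sees only the past, which is what makes the backbone suitable for streaming traffic.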
Optimization is performed using the Adam optimizer with a fixed local learning rate and mini-batch size. All semantic augmentation parameters, including time-warp amplitudes, masking probabilities, and volumetric scaling ranges, follow the final configurations listed in Table 4. In the experiments, the federated system consists of K simulated clients, all of which participate in each communication round, corresponding to a full-participation setting. Each client performs a fixed number of local training epochs per round. Aggregation of the shared encoder parameters is performed once per communication round using the sample-size-weighted FedAvg scheme defined in Equation (10). Preliminary experiments indicated that model convergence stabilizes within approximately 12–15 communication rounds; the total number of rounds is therefore fixed accordingly as the termination criterion.
To ensure a fair and transparent comparison, the personalized supervised baseline FeCo [6] is evaluated under the same preprocessing and experimental protocol as Fed-DTCN. In particular, it uses the identical feature sets (77 features for CSE–CIC–IDS2018 and 43 features for TON_IoT) and the normalization procedures described in Section 5.2.
Hyperparameters for the personalized supervised baseline are initialized according to the configurations reported in the original FeCo study [6] and further refined using a held-out validation split from the training data. To preserve realistic data characteristics, we retain the original dataset distributions and do not introduce additional resampling or synthetic balancing techniques. This unified experimental setup ensures that all methods are evaluated under the same preprocessing and data conditions.
5.5. Evaluation Metrics
All evaluation metrics are computed at the sliding-window level. Each input window is treated as an independent decision instance and assigned a binary label, where the positive class corresponds to malicious activity and the negative class corresponds to benign traffic.
Let TP denote the number of attack windows correctly identified as malicious, TN the number of benign windows correctly classified as normal, FP the number of benign windows incorrectly flagged as attacks, and FN the number of attack windows missed by the detector. These quantities form the basis of all reported metrics.
Accuracy: Measures the proportion of correctly classified windows over the total number of windows: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision: Also known as positive predictive value, it quantifies the reliability of attack alarms: Precision = TP / (TP + FP).
Recall: Also referred to as sensitivity, it measures the fraction of actual attacks successfully detected: Recall = TP / (TP + FN).
F1-Score: The harmonic mean of precision and recall, providing a balanced view when classes are uneven: F1 = 2 · Precision · Recall / (Precision + Recall).
Area under the Precision–Recall Curve (PR-AUC): Measures the area under the precision–recall curve across all possible decision thresholds. PR-AUC is particularly informative for intrusion detection tasks with highly imbalanced data, as it focuses on the detector’s ability to correctly identify the positive (attack) class without being dominated by the large number of benign samples.
Receiver Operating Characteristic (ROC-AUC): Represents the area under the ROC curve, which characterizes the trade-off between the true positive rate and false positive rate across decision thresholds. Although widely used, ROC-AUC may appear overly optimistic in severely imbalanced intrusion detection settings; therefore, it is interpreted together with PR-AUC and F1-score.
All metrics reported in Section 5.6 are computed on the held-out test data using the above definitions. In addition to the threshold-dependent metrics, PR-AUC and ROC-AUC provide threshold-independent evaluation of the anomaly scores produced by the model: they assess the ranking quality of anomaly scores across all possible decision thresholds, which is particularly important for intrusion detection scenarios characterized by highly imbalanced traffic distributions.
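The window-level counts and the threshold-dependent metrics above can be computed with a few lines of plain Python (the guard denominators are our own defensive choice for empty classes):

```python
def window_metrics(y_true, y_pred):
    """Window-level confusion counts and the threshold-dependent metrics of
    Section 5.5. Labels: 1 = attack (positive), 0 = benign (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)          # guard: no alarms raised
    recall = tp / max(tp + fn, 1)             # guard: no attack windows
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

PR-AUC and ROC-AUC, being threshold-independent, are instead computed from the raw anomaly scores (e.g., via `sklearn.metrics.average_precision_score` and `roc_auc_score`).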
5.6. Comparative Analysis of Anomaly Detection Performance
We evaluate the anomaly detection performance of the proposed unsupervised Fed-DTCN framework against FeCo [6], a supervised federated contrastive learning baseline that relies on explicit attack labels during training. This comparison is intentionally stringent: FeCo [6] benefits from label supervision that is fundamentally unavailable to Fed-DTCN, thereby providing a strong reference point for detection performance on known attack categories. Although FeCo [6] is also designed as a personalized federated learning framework, the personalization mechanisms differ in their granularity. FeCo [6] introduces personalization through device-type grouping, where IoT devices are first categorized by type and separate federated models are trained for each device category. This strategy enables device-type-specific intrusion detection models but implicitly assumes that devices within the same category exhibit similar traffic distributions. In contrast, Fed-DTCN performs personalization at the individual client level through selective parameter aggregation. The proposed dual-encoder architecture disentangles shared and client-specific representations: the shared encoder captures global traffic patterns and is aggregated across clients, while the private encoder remains local to each client. This design allows Fed-DTCN to preserve client-specific behavioral characteristics while still benefiting from collaborative knowledge sharing across the federation, and to adapt to heterogeneous client traffic distributions without requiring predefined device-type groupings.
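A toy sketch of the dual-encoder split described above, using linear maps in place of the actual TCN encoders (names and shapes are illustrative only):

```python
import numpy as np

class DualEncoder:
    """Toy dual-encoder: a shared map whose weights are aggregated via FedAvg,
    and a private map whose weights never leave the client. The real Fed-DTCN
    encoders are TCNs; linear layers suffice to show the parameter split."""

    def __init__(self, in_dim, shared_dim, private_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.standard_normal((in_dim, shared_dim)) * 0.1
        self.W_private = rng.standard_normal((in_dim, private_dim)) * 0.1

    def forward(self, x):
        """Return (shared embedding, private embedding) for a batch x."""
        return x @ self.W_shared, x @ self.W_private

    def shared_parameters(self):
        """Only these tensors are uploaded to the server each round."""
        return {"W_shared": self.W_shared}
```

Selective aggregation then operates exclusively on `shared_parameters()`, which is what keeps client-specific behavior local while the global benign structure is learned collaboratively.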
To ensure the statistical reliability of the reported results, we conducted multiple experimental runs using three different random seeds (42, 123, and 1046).
Table 5 summarizes the detection performance on the TON_IoT dataset, reporting the mean and standard deviation for each metric. The results indicate that both Fed-DTCN and the supervised baseline exhibit high stability, with Fed-DTCN consistently achieving near-saturated performance across all independent trials. We analyze performance under two complementary regimes: detection of attacks observed during training and generalization to previously unseen (zero-day) threats. Threshold-dependent metrics are reported in Table 6, while threshold-independent behavior is examined using ROC curves in Figure 5.
5.6.1. Performance on Standard IoT Traffic (TON_IoT)
We first report results on the TON_IoT dataset, which represents a relatively stationary IoT environment characterized by structured and recurring attack patterns. This setting evaluates the ability of Fed-DTCN to learn a discriminative representation of benign traffic in the presence of predictable attack behavior. As shown in Table 6, both approaches achieve near-saturated detection performance. Fed-DTCN attains a marginally higher F1-Score (99.99% vs. 99.34%), indicating that an unsupervised contrastive model of benign traffic—constructed using semantic-aware augmentations and dual-encoder disentanglement—can match the discriminative capability of a supervised approach when test-time attacks closely resemble those observed during training.
The ROC curves in Figure 5a indicate that both models operate near the optimal region when evaluated on known attacks. This narrow gap suggests that learning a robust representation of benign traffic through semantic-aware contrastive objectives can provide detection capability comparable to supervised approaches when the test distribution aligns with training conditions.
5.6.2. Robustness Under Open-Set Conditions (CSE-CIC-IDS2018)
To evaluate zero-day detection capability, we adopt an open-set protocol on the CSE–CIC–IDS2018 dataset. The entire Botnet attack category is excluded from the training data and reserved exclusively for testing, ensuring that the model does not observe this attack type during representation learning. Training is performed in a strictly unsupervised benign-only regime. After training, anomaly detection thresholds are calibrated using a held-out benign calibration subset that is disjoint from both the training and testing data. Each client computes anomaly-score statistics on this benign subset and selects a local threshold to satisfy a target false-positive rate on benign traffic. No attack samples or labels are used during threshold calibration; all attack classes, including the withheld Botnet category, are introduced only during testing.
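The benign-only threshold calibration step can be sketched as a simple quantile rule (the paper does not specify the exact estimator; a target-FPR quantile on the benign calibration scores is one standard realization):

```python
import numpy as np

def calibrate_threshold(benign_scores, target_fpr=0.01):
    """Choose the local anomaly-score threshold as the (1 - target_fpr) quantile
    of anomaly scores on a held-out BENIGN calibration subset. No attack
    samples or labels are involved at any point."""
    return float(np.quantile(np.asarray(benign_scores, dtype=float),
                             1.0 - target_fpr))

def detect(scores, threshold):
    """Flag a window as malicious (1) when its anomaly score exceeds the
    client-local threshold."""
    return [1 if s > threshold else 0 for s in scores]
```

Each client runs this calibration independently on its own benign subset, so the resulting thresholds adapt to local score distributions without any cross-client label exchange.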
Known (Volumetric) Attack Classes. For attack categories observed during training, both methods achieve high recall, indicating effective detection of high-intensity volumetric threats. Fed-DTCN attains higher precision (94.72% vs. 87.78%), yielding an improved F1-Score of 96.85%. FeCo [6] achieves a higher PR-AUC (99.99% vs. 93.33%), which is consistent with supervised optimization toward labeled attack classes. In contrast, Fed-DTCN assigns anomaly scores based on deviation from a learned benign manifold, which may be more conservative for attacks whose characteristics partially overlap with normal high-volume traffic.
Zero-Day (Stealthy) Attack Detection. Under the zero-day evaluation protocol, the supervised FeCo [6] model exhibits a pronounced performance collapse (Recall: 0.28%, F1-Score: 0.52%), indicating limited generalization to the held-out Botnet category. This outcome reflects decision boundaries that are closely coupled to observed attack semantics.
In contrast, Fed-DTCN preserves strong detection capability, achieving an F1-Score of 96.00% and a PR-AUC of 89.16%. The ROC curves in Figure 5b show that Fed-DTCN maintains a steep ascent, whereas FeCo [6] exhibits near-random discrimination. These results indicate that unsupervised modeling of benign traffic dynamics yields superior robustness to unseen attack families.
5.7. Robustness to Data Heterogeneity and Zero-Day Threats
In realistic federated IoT deployments, client data are rarely IID. To evaluate robustness under statistical heterogeneity, we benchmark Fed-DTCN under the IID, quantity-skew, and label-skew partitioning schemes described in Section 5.3.
5.7.1. Impact of Non-IID Distributions on Global Detection
Figure 6 summarizes global detection performance on known volumetric attacks. Across all heterogeneity settings, Fed-DTCN demonstrates stable and consistently high performance. This robustness is attributable to the dual-encoder architecture, in which the shared encoder captures globally consistent benign patterns while private encoders absorb client-specific variability.
Under severe label skew, the supervised FeCo [6] baseline exhibits noticeable degradation, whereas Fed-DTCN maintains an F1-Score near 96%, indicating reduced sensitivity to class imbalance.
5.7.2. Generalization to Zero-Day Threats Under Heterogeneity
Figure 7 shows that the supervised baseline exhibits severe generalization failure across non-IID settings. In contrast, Fed-DTCN maintains F1-Scores above 95% and recall exceeding 97% even under extreme quantity skew. Although PR-AUC values are moderately lower, such behavior is expected for stealthy attacks under class imbalance.
5.7.3. Client-Level Stability and Personalization
Figure 8 illustrates client-level F1-Score distributions. Fed-DTCN produces tightly clustered scores with low variance, whereas the supervised baseline exhibits substantial inter-client variability. This stability is attributed to orthogonality regularization, which isolates client-specific variability while preserving globally shared benign structure.
Finally, to further characterize client-level personalization beyond variance-based analysis, we quantify the absolute performance lift achieved by the proposed unsupervised framework relative to the supervised federated baseline. Figure 9 reports the absolute F1-score gain of Fed-DTCN over the baseline for each client, with clients sorted in descending order of improvement.
All participating clients exhibit strictly positive gains, with improvements ranging from approximately 0.58 to 0.70 absolute F1-score points. This consistent behavior holds across all data partitioning regimes, including IID and severe quantity-skewed settings, indicating that Fed-DTCN does not induce negative transfer for any individual client.
Taken together with the low inter-client variance observed earlier in this subsection, these results demonstrate that Fed-DTCN achieves personalization without sacrificing global consistency. The explicit disentanglement of shared and private representations enables local client characteristics to be absorbed by private encoders, while shared updates remain broadly beneficial across heterogeneous data distributions.
5.7.4. Computational Complexity and Communication Overhead
To assess the practical feasibility of the proposed framework in federated IoT environments, we analyze both the model complexity and the communication overhead incurred during federated training. The computational complexity of the shared temporal convolutional encoder for an input sequence of length T can be approximated as O(T · Σ_{l=1}^{L} k_l · C_{l−1} · C_l), where L denotes the number of temporal convolution layers, k_l represents the kernel size of layer l, and C_l denotes the number of output channels of layer l. This formulation indicates that the computational cost of the encoder scales linearly with the sequence length, which is suitable for streaming IoT traffic analysis.
The complete Fed-DTCN architecture contains 6,277,536 parameters, corresponding to approximately 23.95 MB when stored using 32-bit floating-point representation. However, due to the selective aggregation mechanism described in Section 4.6, only the parameters of the shared encoder are transmitted between clients and the central server during each communication round. The shared encoder contains 4,023,808 parameters, representing 64.1% of the total model size.
Consequently, the communication payload per client per round is approximately 15.36 MB, while the remaining parameters belong to the client-specific private encoder and remain strictly local to each client. This design reduces the amount of data exchanged during federated training while allowing the model to maintain personalized representations for heterogeneous IoT environments.
Table 7 summarizes the model complexity and communication characteristics of Fed-DTCN and compares them with the supervised federated contrastive baseline FeCo. Although Fed-DTCN employs a larger backbone due to the temporal convolutional architecture and the dual-encoder design, selective aggregation ensures that only the globally shared parameters are exchanged during training. For a federation with K participating clients, the total uplink communication per round is approximately 15.36 · K MB.
5.8. Ablation Study
To isolate and quantify the contribution of the proposed Semantic-Aware Causal Augmentation module, we conduct an ablation study under a challenging setting: zero-day detection on CSE-CIC-IDS2018 with severe data scarcity (quantity skew). This configuration stresses both representation robustness and generalization under heterogeneous federated conditions.
We compare the full Fed-DTCN model against four degraded variants: (i) replacement of semantic augmentation with Gaussian noise, and (ii)–(iv) removal of one semantic component at a time, namely Time-Warping, Volumetric Scaling, and Protocol Masking.
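For intuition, the three semantic components can be sketched as toy NumPy transforms on a feature window of shape (time, features); these are simplified stand-ins, and the actual amplitudes, probabilities, and scaling ranges are those listed in Table 4:

```python
import numpy as np

rng = np.random.default_rng(0)

def time_warp(x, max_shift=2):
    """Toy temporal perturbation: roll the window by a small random offset,
    leaving the traffic content itself unchanged."""
    return np.roll(x, int(rng.integers(-max_shift, max_shift + 1)), axis=0)

def volumetric_scale(x, low=0.9, high=1.1):
    """Scale volume-like features by a random factor, preserving the relative
    shape of the traffic while varying its magnitude."""
    return x * rng.uniform(low, high)

def protocol_mask(x, mask_prob=0.1, mask_value=0.0):
    """Randomly zero out protocol-level feature columns so the encoder cannot
    rely on any single protocol attribute."""
    mask = rng.random(x.shape[1]) < mask_prob
    out = x.copy()
    out[:, mask] = mask_value
    return out
```

Each transform enforces invariance along one dimension (temporal, volumetric, protocol), which is exactly what the component-removal variants of the ablation take away one at a time.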
5.8.1. Effectiveness of Semantic-Aware Augmentation
Replacing semantic-aware augmentation with Gaussian noise produces a substantial degradation in detection performance, as detailed in Table 8. The F1-Score drops from 96.57% to 79.62%, accompanied by significant reductions in both Precision and Recall. These results indicate that unconstrained statistical perturbations disrupt the causal structure required to learn a coherent manifold of benign traffic behavior.
In addition to the augmentation analysis, we evaluate the contribution of the core architectural components of Fed-DTCN. As shown in Table 8, removing the private encoder (w/o Dual-Encoder) significantly reduces the F1-Score from 96.57% to 90.89%, indicating that separating shared and client-specific representations is critical for mitigating negative transfer across heterogeneous clients. Replacing the soft-weighted contrastive objective with a standard hard contrastive loss (w/o Soft-Weighting) further decreases the F1-Score to 89.29%, suggesting that adaptive weighting improves representation stability under non-IID conditions. Finally, replacing the TCN backbone with a standard CNN (w/o TCN) reduces performance to 90.29%, highlighting the importance of dilated temporal receptive fields for modeling the temporal dynamics of network traffic. These results confirm that each architectural component contributes independently to the overall effectiveness of Fed-DTCN.
The ROC curves in Figure 10 further confirm this observation. The full semantic-aware model achieves a ROC-AUC of 0.97, whereas the Random Noise variant exhibits a flatter trajectory with a ROC-AUC of 0.93, consistent with reduced discriminative power.
5.8.2. Component-Level Analysis and Decision Boundary Behavior
Removing individual semantic components yields a consistent trade-off between sensitivity and precision. In all cases, Recall approaches saturation (>99.6%), while Precision decreases to approximately 81–82%, indicating a more permissive decision boundary. Each component enforces invariance along a specific dimension; its removal increases sensitivity to that dimension, causing benign fluctuations to be misclassified as anomalies.
The full Fed-DTCN model jointly enforces temporal, volumetric, and protocol-level invariances, achieving the best balance between sensitivity and false alarms, as reflected by the highest F1-Score, Accuracy, and PR-AUC.
5.9. Discussion and Analysis
The experimental results highlight a fundamental limitation of supervised federated intrusion detection approaches. While supervised baselines such as FeCo [6] perform well under stationary and closed-set conditions, their decision boundaries become brittle under distributional shift. When exposed to zero-day threats or heterogeneous client data distributions, performance degradation can be substantial.
5.9.1. Decoupling Representation from Supervision
Fed-DTCN demonstrates that learning a representation of benign traffic dynamics provides improved robustness in open-world IoT settings. Benign traffic patterns tend to exhibit greater statistical consistency across clients than attack behaviors, enabling a federated model trained on normal activity to generalize more effectively than models optimized over skewed and incomplete attack labels.
5.9.2. Structural Robustness to Heterogeneity
Client-level analysis further indicates that Fed-DTCN mitigates negative transfer effects commonly observed in supervised federated learning. Because local objectives are aligned around benign protocol dynamics rather than conflicting class labels, global aggregation remains stable even under severe non-IID conditions. This property is particularly important for large-scale IoT deployments, where client data distributions are inherently diverse and evolving.
5.9.3. Semantic Constraints and Decision Boundary Formation
The ablation study clarifies the role of semantic constraints as more than a representational aid. Removing individual constraints—such as protocol-aware masking, temporal warping, or volumetric scaling—leads to near-saturated recall but substantially reduced precision, indicating an overly permissive notion of normality.
In contrast, the full Fed-DTCN configuration jointly enforces temporal, volumetric, and protocol-level invariances, yielding a more selective and operationally viable decision boundary. These constraints suppress benign variability while preserving sensitivity to stealthy deviations, highlighting their importance not only for contrastive learning but also for practical anomaly detection in heterogeneous IoT environments.
6. Conclusions and Future Work
This paper presented Fed-DTCN, a federated unsupervised framework for zero-day intrusion detection in IoT networks operating under statistical heterogeneity. By integrating a dual-encoder architecture with semantic-aware causal augmentations, the framework disentangles globally shared benign invariants from client-specific variations without reliance on labeled attack data. This design enables robust anomaly detection while preserving privacy and accommodating non-IID data distributions.
Extensive evaluation on standard and industrial IoT benchmarks demonstrates that Fed-DTCN matches supervised federated baselines on known attacks and substantially outperforms them under zero-day and heterogeneous conditions. The results indicate that modeling benign traffic dynamics provides a more stable and generalizable foundation for intrusion detection than supervised alignment to observed attack semantics.
Future work will explore several extensions of the proposed framework. First, we plan to investigate communication-efficient training strategies, including gradient compression and partial model aggregation, to further reduce bandwidth overhead in large-scale deployments. Second, we aim to extend Fed-DTCN to an online and continual learning setting to address gradual concept drift in benign traffic over long-term operation. Finally, incorporating lightweight adaptation mechanisms at the client level may further improve responsiveness to rapidly evolving local behaviors without compromising global model stability.