1. Introduction
The rapid expansion of encrypted network communications has significantly transformed modern traffic analysis and cybersecurity practices. As encryption becomes pervasive across web services, mobile applications, and cloud platforms, traditional payload inspection techniques have become increasingly ineffective. Consequently, network operators and security systems have shifted toward statistical flow-level analysis combined with machine learning techniques to maintain visibility over network behavior [1, 2].
Traffic flow classification has therefore become a critical component in network management, quality-of-service enforcement, and threat detection. Prior studies demonstrate that machine learning models can successfully infer traffic types from aggregated flow statistics even when payloads are unavailable. However, the growing scale and diversity of network traffic continue to challenge the robustness and generalization capability of these models in real-world deployments [3, 4].
Most existing traffic classification systems rely on centralized learning architectures in which raw traffic data are collected and processed at a single location. While centralized models often achieve strong predictive accuracy, they raise significant concerns related to privacy protection, regulatory compliance, and cross-domain data sharing. In distributed network environments such as IoT and multi-organizational infrastructures, transferring raw traffic traces is frequently impractical or restricted, limiting the applicability of purely centralized solutions [5, 6].
Federated Learning (FL) has recently emerged as a promising paradigm that enables collaborative model training while keeping sensitive data localized at client devices. In networking and cybersecurity contexts, FL has been applied to intrusion detection, anomaly detection, and traffic classification tasks, showing encouraging performance while preserving data confidentiality [2]. Several recent studies report that federated models can approach centralized accuracy under controlled conditions, making FL an attractive candidate for privacy-sensitive network analytics [7, 8].
Despite this progress, deploying FL for network traffic classification in realistic environments remains challenging. A fundamental issue is the presence of heterogeneous client data distributions. In operational networks, different monitoring points typically observe different subsets of applications and services, resulting in strongly non-identically distributed (non-IID) datasets. Prior research indicates that such heterogeneity can slow convergence, destabilize training, and degrade global model performance, yet many studies still evaluate FL under simplified or near-IID assumptions [2, 8].
Another critical factor influencing federated performance is the server-side aggregation mechanism. The widely adopted FedAvg algorithm provides an efficient baseline but is known to suffer under highly skewed data distributions. To improve robustness, adaptive and momentum-based optimizers such as FedAdam, FedYogi, and FedAvgM have been proposed. Although these methods show promise in general federated optimization tasks, their comparative behavior in encrypted traffic classification scenarios has not been thoroughly investigated [9, 10].
Moreover, prior work often reports overall accuracy as the primary evaluation metric, with limited attention to class-level reliability and misclassification patterns. For security-oriented applications, detailed analysis using F1-score, ROC-AUC, and confusion matrices is essential to assess detection quality across diverse traffic categories. The lack of controlled, side-by-side comparisons across multiple aggregation strategies and heterogeneous data settings reveals a clear gap in the current literature [2, 11].
To address these limitations, this study presents a systematic investigation of federated traffic classification under heterogeneous data distributions. The proposed framework evaluates multiple client partitioning strategies ranging from near-IID to highly non-IID conditions and compares four representative aggregation algorithms within a unified experimental setting. By maintaining a consistent neural architecture and training protocol, the study isolates the impact of aggregation design and data heterogeneity on classification performance in realistic network environments.
Modern networks carry a large share of their traffic in encrypted form, which limits the usefulness of payload inspection and shifts attention toward flow-level statistical analysis. Network operators still need reliable visibility to manage performance, enforce policies, and detect misuse, yet the flow-level metadata that enables this visibility, such as packet sizes and inter-arrival times, remains sensitive and subject to privacy and regulatory constraints even when payloads are encrypted. In practice, traffic data are generated across many sites and devices, each reflecting local usage patterns and services, making centralised collection impractical due to privacy rules, data ownership concerns, and operational costs. It is important to note that encryption and federated learning address complementary privacy dimensions: encryption protects packet content from network observers, while federated learning prevents the flow-level statistics used for classification from being centralised and exposed. Together they provide layered protection that motivates the use of federated learning even when traffic is already encrypted. Federated learning fits this setting by enabling collaborative model training without moving raw traffic records across organisational boundaries. Understanding how well it operates under realistic network diversity, where different sites observe different subsets of services, is therefore essential for dependable deployment.
Current research on privacy-preserving traffic analysis has not yet provided a clear picture of how federated learning behaves under realistic network conditions. In operational networks, different sites typically observe different subsets of services and applications, which leads to strongly skewed and non-identical data distributions. Such heterogeneity can affect model convergence and reduce classification reliability, yet it is often simplified in experimental studies. In addition, multiple server-side aggregation methods have been proposed, but their comparative impact on encrypted traffic classification remains insufficiently examined. There is also a need for evaluations that go beyond overall accuracy and reflect class-level behavior. This work addresses these issues by examining federated traffic classification under controlled and diverse data partitions. It further studies how aggregation design influences stability and predictive performance in these settings.
This study makes several practical contributions to privacy-preserving traffic classification in distributed environments. It provides a structured comparison of four aggregation algorithms selected to cover the principal design families in the federated optimisation literature: FedAvg as the foundational weighted averaging baseline, FedAvgM as a momentum-based extension, and FedAdam and FedYogi as adaptive server-side optimizers from the FedOpt framework, where FedAdam follows the standard Adam update and FedYogi replaces the second moment accumulation with a Yogi-style update designed to control second moment growth under non-IID conditions. This selection covers the main axes of variation in server-side aggregation design, namely plain averaging, momentum correction, and adaptive optimisation, within a controlled and comparable experimental setting. The evaluation is conducted across multiple client data distributions that reflect realistic variations in how services are observed at different sites. A consistent learning architecture and training protocol are maintained to ensure fair and reproducible comparisons. Performance is examined using complementary metrics, including accuracy, F1-score, ROC-AUC, and confusion matrices, to capture both overall and class-level behaviour. This design gives a clear view of how aggregation strategy and data heterogeneity interact in practice, and the findings offer practical guidance for selecting federated configurations in network monitoring scenarios.
2. Related Work
The rapid growth of encrypted network traffic has significantly complicated application and behavior identification, as modern protocols such as TLS, QUIC, and VPN obscure payload visibility and limit the effectiveness of traditional inspection techniques [3]. Recent surveys therefore report a clear shift toward flow-level statistical analysis combined with machine learning to maintain visibility in encrypted environments, while noting persistent challenges related to dataset realism, protocol diversity, and model generalization [1]. To enhance discriminative capability, multi-flow behavioral modeling has been explored, demonstrating improved performance by capturing contextual relationships among flows generated by the same application [12]. In parallel, graph-based deep learning and feature-fusion architectures have been shown to effectively model temporal and structural dependencies in encrypted traffic using metadata alone, achieving strong results across multiple benchmark datasets [13, 14]. Attention-enhanced hybrid models further improve classification accuracy and interpretability, although most evaluations remain confined to centralized training scenarios [15]. Despite these advances, dataset construction and labeling granularity continue to represent major bottlenecks, motivating automated fine-grained labeling frameworks that support reliable supervised learning without requiring privileged device access [16]. Classical TLS fingerprinting and SNI-based techniques remain useful baselines but suffer from feature overlap and reduced robustness under evolving encryption patterns [17]. More recently, few-shot meta-learning has been investigated to mitigate data scarcity in encrypted traffic classification; however, these approaches largely assume centralized data availability and do not address distributed heterogeneity [18]. Earlier deep learning studies also confirmed that flow-level models can preserve user privacy while maintaining high classification accuracy, yet they did not consider decentralized training environments [19].
Federated learning (FL) has consequently emerged as a promising paradigm for privacy-preserving network analytics by enabling collaborative model training without sharing raw traffic data. Comprehensive reviews highlight its potential for intrusion detection and traffic analysis while identifying key challenges, including non-IID data distributions, communication overhead, and secure aggregation requirements [20]. In the domain of traffic classification specifically, Bakopoulou et al. [21] propose a federated learning approach to mobile packet classification that enables devices to collaboratively train a global model without uploading locally collected traffic data, representing one of the closest prior works to the federated traffic classification framework studied here and confirming the practical feasibility of the approach in mobile network environments. In response to these challenges, several federated frameworks have been proposed to improve robustness under heterogeneous conditions. Lightweight federated intrusion detection systems that combine dynamic feature fusion with hybrid deep models have demonstrated strong detection capability with reduced computational cost in vehicular and IoT environments [22]. Adaptive FL frameworks further incorporate personalization and cryptographic protection of model updates to maintain accuracy under heterogeneous traffic distributions [23], while knowledge-distillation-based approaches attempt to reduce model drift and improve consistency across distributed IoT clients [24]. Additional studies confirm the practical feasibility of federated IDS deployments in industrial settings, showing that properly configured federated models can approach centralized performance while preserving data locality and reducing communication overhead [25, 26]. Mutual-learning-based federated traffic classification has also been explored to enhance generalization across non-IID clients through collaborative knowledge exchange prior to aggregation [27]. Hybrid approaches that combine federated learning with incremental learning have been proposed to further improve adaptability in privacy-sensitive deployments. The HERALD framework [28] demonstrates the feasibility of decentralised model updates under heterogeneous and evolving data distributions in healthcare scenarios, highlighting design principles that are transferable to dynamic network monitoring environments.
Despite these advances, the performance and reliability of federated systems remain strongly influenced by the server-side aggregation mechanism. The Federated Averaging (FedAvg) algorithm established the foundational communication-efficient paradigm for decentralized optimization through weighted aggregation of locally trained models [29]. Subsequent analyses demonstrated that statistical heterogeneity can significantly slow convergence and destabilize training, necessitating careful control of optimization dynamics under non-IID data distributions [30]. To address these limitations, adaptive federated optimizers such as FedAdam and FedYogi were introduced, showing improved convergence behavior in heterogeneous environments [31]. A comprehensive survey of model aggregation techniques in federated learning [32] provides a broad taxonomy of server-side aggregation methods, highlighting the trade-offs among communication efficiency, convergence stability, and robustness under heterogeneous data distributions, which further motivates the systematic comparison of aggregation strategies conducted in this study. From a security standpoint, secure multi-party computation-based aggregation has been investigated to enhance robustness against poisoning attacks, although it introduces additional communication overhead and does not fully eliminate vulnerabilities to malicious clients [33]. Privacy-focused studies further emphasize the importance of protecting model updates using differential privacy, homomorphic encryption, and secure computation, while noting persistent risks related to gradient leakage and system heterogeneity [34]. More recent work on verifiable secure aggregation enables post-aggregation correctness validation using lightweight cryptographic commitments, yet achieving an effective balance among robustness, scalability, and heterogeneity tolerance remains an open challenge in practical federated traffic analysis systems [35].
Overall, although substantial progress has been achieved in encrypted traffic analysis and federated intrusion detection, existing studies rarely provide a unified and systematic evaluation of server-side aggregation strategies under controlled heterogeneous data distributions. In particular, the comparative behavior of classical and adaptive federated optimizers across varying degrees of client heterogeneity in flow-level traffic classification remains insufficiently explored.
6. Experimental Results
This section examines the performance of network traffic classification under different federated learning settings. We compare a centralized baseline with multiple federated configurations that differ in client data distribution, label balance, and server-side aggregation strategy.
6.1. Evaluation Metrics
We evaluate the proposed federated traffic classification framework using accuracy, weighted precision, weighted recall, weighted F1-score, macro F1-score, and ROC-AUC. These metrics quantify overall correctness, robustness under class imbalance, and threshold-independent separability. We further use confusion matrices to analyze class-level error patterns.
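Let $C$ denote the set of classes. For each class $c \in C$, we compute true positives $TP_c$, false positives $FP_c$, and false negatives $FN_c$ in a one-vs.-rest manner. Class-wise precision, recall, and F1-score are
$$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2\,P_c\,R_c}{P_c + R_c}.$$
Overall accuracy is defined as
$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(y_i = \hat{y}_i\right),$$
where $N$ is the number of test samples, $y_i$ is the true label, $\hat{y}_i$ is the predicted label, and $\mathbb{1}(\cdot)$ is the indicator function.
To account for class imbalance, we report weighted averages. Let $n_c$ be the number of test samples in class $c$. Weighted precision, weighted recall, and weighted F1-score are computed as
$$P_w = \sum_{c \in C} \frac{n_c}{N} P_c, \qquad R_w = \sum_{c \in C} \frac{n_c}{N} R_c, \qquad F1_w = \sum_{c \in C} \frac{n_c}{N} F1_c.$$
Macro F1-score gives equal importance to each class:
$$F1_{\mathrm{macro}} = \frac{1}{|C|}\sum_{c \in C} F1_c.$$
We report ROC-AUC using two aggregation schemes. Micro-averaged AUC ($\mathrm{AUC}_{\mathrm{micro}}$) pools all one-vs-rest scores across classes and samples before computing the ROC curve, which emphasizes frequent classes in figures. Macro one-vs-rest AUC ($\mathrm{AUC}_{\mathrm{macro}}$) computes AUC separately for each class and then averages over classes, giving equal importance to all services.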
Finally, confusion matrices summarize the number of samples assigned to each predicted class versus the true class and highlight dominant misclassifications across services.
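The threshold-dependent metrics defined in this subsection can be illustrated with a minimal pure-Python sketch. The function names (`confusion_matrix`, `summarize`), the zero-division convention, and the toy labels are assumptions made for illustration; they are not taken from the study's implementation.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count samples per (true class, predicted class); rows are true classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def _ratio(num, den):
    # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
    return num / den if den else 0.0


def summarize(y_true, y_pred, classes):
    """Return accuracy, weighted precision/recall/F1, and macro F1."""
    cm = confusion_matrix(y_true, y_pred, classes)
    n = len(y_true)
    k = len(classes)
    out = {"weighted_precision": 0.0, "weighted_recall": 0.0,
           "weighted_f1": 0.0, "macro_f1": 0.0}
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted as class i, but wrong
        fn = sum(cm[i]) - tp                       # class i predicted as another class
        precision = _ratio(tp, tp + fp)
        recall = _ratio(tp, tp + fn)
        f1 = _ratio(2 * precision * recall, precision + recall)
        support = sum(cm[i])                       # n_c, test samples in class i
        out["weighted_precision"] += support / n * precision
        out["weighted_recall"] += support / n * recall
        out["weighted_f1"] += support / n * f1
        out["macro_f1"] += f1 / k                  # every class counts equally
    out["accuracy"] = sum(cm[i][i] for i in range(k)) / n
    return out
```

For example, `summarize(["a", "a", "a", "b", "b", "c"], ["a", "a", "b", "b", "b", "c"], ["a", "b", "c"])` can be used to check that weighted recall coincides with accuracy, which follows directly from the definitions.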
6.2. Centralized vs. Federated Performance
Table 5 compares the centralized baseline (C0) against the best federated configuration obtained under each client partitioning case. The centralized model achieves near-perfect performance, confirming that the selected flow-level statistics provide strong discriminative power even without payload access. Under mixed-label clients (C1), federated learning matches the centralized baseline, with FedAdam reaching the same accuracy (0.9969) and an almost identical AUC (1.0000). Under hash-based semi-IID clients (C3), performance remains high (best accuracy 0.9946 with FedAdam), indicating that moderate heterogeneity does not substantially reduce classification quality when clients still observe multiple services. In contrast, the service-based split (C2) produces a marked performance drop across all strategies. Even the best C2 configuration (FedYogi) reaches 0.9287 accuracy, which is substantially lower than C0 and C1. This gap highlights that extreme label skew, where each client observes a single service, is the dominant factor limiting federated traffic classification performance.
Figure 3 supports this observation. ROC curves for C0, C1, and C3 largely overlap near the upper-left corner, whereas C2 configurations show visibly weaker discrimination.
Figure 4 presents a bar chart comparing accuracy across all cases considered in this study.
6.3. Impact of Client Data Distribution
Table 6 quantifies the effect of client data heterogeneity by averaging performance across aggregation strategies within each federated case. The results show that client partitioning has a stronger impact than the choice of server-side optimizer. Mixed-label clients (C1) and hash-based semi-IID clients (C3) remain close to the centralized baseline, which indicates that federated optimization is stable when each client observes a reasonable mixture of services and the global objective is represented locally. The service-based split (C2) represents the most challenging setting because each client contains a single service class. This extreme label-skew scenario reduces the ability of local training to produce updates that generalize across classes, and it increases client drift during aggregation. As a result, C2 yields the lowest average performance among all cases (accuracy 0.8661 and weighted F1-score 0.8388 in Table 6). Despite this degradation, the achieved C2 performance remains meaningful for a realistic distributed deployment, where sites may naturally observe narrow service subsets. Importantly, the C2 results should be interpreted relative to a naive distributed alternative in which each client trains a model only on its local single-class data. In such a setting, a client-specific model cannot learn a global multiclass decision boundary and will not generalize to unseen services, making pooled multiclass evaluation effectively infeasible. In contrast, federated learning in C2 still produces a single global classifier that recognizes all five services and reaches strong performance when evaluated on the pooled test set. This confirms that server-side aggregation enables knowledge transfer across disjoint clients and converts isolated single-service observations into a usable global traffic classifier. The remaining performance gap compared to C1 and C3 therefore reflects the inherent difficulty of learning under extreme label skew rather than a failure of the federated approach.
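To make the three partitioning regimes concrete, the following sketch shows one way such splits could be produced. It is an illustration under stated assumptions, not the study's partitioning code: the function names are hypothetical, C1 is approximated by a shuffled round-robin split, and C3 is approximated by hashing a hypothetical flow identifier modulo the number of clients, since the actual hash key is not described in this section.

```python
import hashlib
import random
from collections import Counter, defaultdict


def partition_mixed(labels, num_clients, seed=0):
    """C1-like (assumed): shuffle indices and deal them round-robin,
    so each client receives a mixture of labels."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    return [indices[k::num_clients] for k in range(num_clients)]


def partition_by_service(labels):
    """C2-like: one client per service, so every client is single-class."""
    clients = defaultdict(list)
    for i, label in enumerate(labels):
        clients[label].append(i)
    return [clients[label] for label in sorted(clients)]


def partition_by_hash(flow_ids, num_clients):
    """C3-like (assumed): a stable hash of a flow identifier picks the client."""
    clients = [[] for _ in range(num_clients)]
    for i, flow_id in enumerate(flow_ids):
        digest = hashlib.sha256(str(flow_id).encode("utf-8")).digest()
        clients[int.from_bytes(digest[:8], "big") % num_clients].append(i)
    return clients


def label_counts(partition, labels):
    """Per-client label histogram, useful for inspecting label skew."""
    return [Counter(labels[i] for i in client) for client in partition]
```

Calling `label_counts` on each partition gives a quick check of how strongly the label distribution differs between clients: under the service-based split every histogram has a single key, whereas the other two splits typically spread several labels across each client.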
6.4. Impact of Aggregation Strategy
Table 7 isolates the impact of server-side aggregation under fixed local training and evaluation protocols. Under C1 and C3, aggregation has limited influence. All strategies achieve high accuracy and near-perfect macro-AUC, which indicates that when client data approximate IID or semi-IID conditions, standard averaging is sufficient and adaptive server-side optimizers offer only marginal gains. Under C2, aggregation choice becomes critical. FedAvg performs the worst (accuracy 0.7860, weighted F1-score 0.7373), reflecting severe client drift when local updates are dominated by single-class gradients. Adaptive or momentum-based strategies improve performance substantially. FedAdam increases accuracy to 0.8450, while FedAvgM reaches 0.9047. FedYogi attains the highest accuracy (0.9287) and the best weighted F1-score (0.9283), demonstrating that server-side optimization can partially mitigate non-IID effects under extreme label skew.
However, macro-AUC trends reveal an important divergence between threshold-dependent and threshold-independent metrics. In C2, FedAvgM achieves the highest macro-AUC (0.9885), while FedYogi attains the highest accuracy (0.9287) and weighted F1-score (0.9283) despite its lower macro-AUC (0.9586). This divergence arises because accuracy and weighted F1-score measure classification performance at a fixed decision rule, specifically the argmax of the predicted softmax probabilities, whereas macro-AUC evaluates the ranking quality of the probability scores across all possible thresholds, with equal weight given to each class regardless of its sample count. This equal weighting makes macro-AUC particularly informative in the presence of class imbalance, as it reflects classification reliability for minority classes such as Google Music that would otherwise be overshadowed in micro-averaged metrics. Notably, the figures and result tables show that FedAdam has the largest gap between micro-AUC and macro-AUC in C2 (0.9776 vs. 0.9461), indicating that its adaptive scaling disproportionately benefits frequent classes while leaving minority class discrimination weaker. FedYogi’s Yogi-style second moment update controls the growth of the second moment estimate, which under extreme label skew sharpens the decision boundary at the default operating point and produces higher accuracy, but simultaneously compresses the spread of predicted probability scores across classes, reducing ranking quality and lowering macro-AUC. FedAvgM, by contrast, applies server-side momentum that smooths conflicting client updates over successive rounds, producing probability scores that rank classes more reliably across thresholds and therefore a higher macro-AUC, but without the same degree of decision boundary sharpening at the default threshold.
Figure 5 reflects this difference: FedAvgM produces a stronger curve shape across the full threshold range for all classes equally, while FedYogi yields stronger point performance at the default operating point. The practical implication is that the preferred aggregator in C2-like environments depends on the deployment requirement: if a fixed threshold decision suffices, FedYogi is preferable; if the operating threshold will be tuned or well-calibrated probability scores are needed across all service classes including minority ones, FedAvgM is the stronger choice.
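The four server-side strategies compared in this subsection differ only in how the server turns the averaged client models into the next global model. The sketch below shows this on plain Python lists. It follows the commonly published update rules for FedAvg, FedAvgM, and the FedOpt family, but the function names, default hyperparameters, sample-size weighting, and the initialisation of the second moment to `tau ** 2` are assumptions for illustration and do not describe the configuration used in the experiments.

```python
import math


def weighted_average(client_weights, client_sizes):
    """FedAvg aggregation: average client models weighted by sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(n * w[j] for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]


def second_moment(kind, v, d, beta2):
    """Second-moment update for one coordinate with pseudo-gradient d."""
    if kind == "fedadam":
        # Adam: exponential moving average of squared pseudo-gradients.
        return beta2 * v + (1 - beta2) * d * d
    # Yogi: additive update whose direction depends on sign(v - d^2),
    # which limits how quickly v can change.
    diff = v - d * d
    sign = (diff > 0) - (diff < 0)
    return v - (1 - beta2) * d * d * sign


def make_state(dim, tau=1e-3):
    # Assumed initialisation: zero momentum, second moment set to tau^2.
    return {"m": [0.0] * dim, "v": [tau ** 2] * dim}


def server_step(kind, global_w, avg_w, state, lr=1.0, beta1=0.9, beta2=0.99, tau=1e-3):
    """Return the next global model given the weighted client average."""
    # Pseudo-gradient: how far the averaged client model moved this round.
    delta = [a - g for a, g in zip(avg_w, global_w)]
    if kind == "fedavg":
        return list(avg_w)
    if kind == "fedavgm":
        state["m"] = [beta1 * m + d for m, d in zip(state["m"], delta)]
        return [g + lr * m for g, m in zip(global_w, state["m"])]
    if kind not in ("fedadam", "fedyogi"):
        raise ValueError("unknown strategy: %s" % kind)
    state["m"] = [beta1 * m + (1 - beta1) * d for m, d in zip(state["m"], delta)]
    state["v"] = [second_moment(kind, v, d, beta2) for v, d in zip(state["v"], delta)]
    return [g + lr * m / (math.sqrt(v) + tau)
            for g, m, v in zip(global_w, state["m"], state["v"])]
```

The only difference between the `fedadam` and `fedyogi` branches is `second_moment`: with `v = 1.0`, `d = 2.0`, and `beta2 = 0.99`, the Adam rule gives about 1.03 and the Yogi rule about 1.04, and the two rules diverge further as the pseudo-gradient magnitude fluctuates from round to round.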
6.5. ROC Curve Analysis
ROC curves provide a threshold-independent view of service separability and allow direct comparison between aggregation strategies under the same client distribution.
Figure 3 presents a global overview across the centralized baseline (C0) and all federated configurations. To isolate the role of aggregation within each distribution regime, we further report case-specific ROC overlays for C1, C2, and C3 in Figure 5 and Figure 6. Under C1 (mixed-label) and C3 (hash-based semi-IID), ROC curves for all strategies remain close to the upper-left corner, which indicates that the classifier preserves strong separability despite distributed training. In these settings, differences between FedAvg, FedAdam, and FedAvgM are marginal, and the macro-AUC values remain close to 1.0, confirming that near-IID conditions produce reliable classification across all service classes including minority ones. This behavior is consistent with the near-centralized accuracy observed in Table 7. Under C2 (service-based), the ROC curves separate clearly across strategies. FedAvg yields one of the weakest curves (macro-AUC 0.9489), reflecting sensitivity to extreme label skew and poor ranking quality for minority classes. FedAdam has the lowest macro-AUC (0.9461), well below its micro-AUC (0.9776), indicating that its adaptive scaling disproportionately benefits frequent classes while leaving minority class discrimination weaker. FedAvgM attains the highest macro-AUC in C2 (0.9885), producing the strongest and most consistent curve shape across all service classes. FedYogi exhibits a different trade-off, achieving the highest accuracy in C2 (0.9287) while producing a lower macro-AUC (0.9586) than FedAvgM, which suggests that the server optimizer can change both ranking behavior and the decision boundary under severe heterogeneity, and that threshold-dependent and threshold-independent metrics can favour different aggregators when class imbalance and label skew are both present.
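The difference between micro-averaged and macro one-vs-rest AUC discussed here can be reproduced on a toy example. The sketch below computes binary AUC through the rank-sum (Mann-Whitney) formulation, with tied scores sharing an average rank. The function names are hypothetical, and any score matrix passed to it in an example is illustrative rather than a result from the study.

```python
def binary_auc(targets, scores):
    """AUC as the probability that a random positive is scored above a
    random negative; tied scores share their average rank."""
    pairs = sorted(zip(scores, targets))
    n = len(pairs)
    n_pos = sum(targets)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1                                  # pairs[i:j] share one score
        avg_rank = (i + 1 + j) / 2.0                # mean of ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(t for _, t in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def macro_ovr_auc(y_true, probs, num_classes):
    """Average the one-vs-rest AUC of each class with equal weight."""
    aucs = []
    for c in range(num_classes):
        targets = [1 if y == c else 0 for y in y_true]
        aucs.append(binary_auc(targets, [row[c] for row in probs]))
    return sum(aucs) / num_classes


def micro_auc(y_true, probs, num_classes):
    """Pool every (sample, class) score into one binary problem."""
    targets, scores = [], []
    for y, row in zip(y_true, probs):
        for c in range(num_classes):
            targets.append(1 if y == c else 0)
            scores.append(row[c])
    return binary_auc(targets, scores)
```

Because the micro average pools all scores before ranking, classes with many samples contribute most of the positive-negative pairs, whereas the macro average lets a class with a single test sample weigh as much as the largest class.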
6.6. Confusion Matrix Analysis
Confusion matrices provide a class-level view of model errors and reveal systematic confusions that are not visible in aggregate metrics. We report confusion matrices computed on the pooled test set for the centralized baseline (C0) and for every federated configuration across C1, C2, and C3. This presentation supports a complete comparison of how client heterogeneity and server-side aggregation interact to shape misclassification patterns.
Figure 7 shows the centralized baseline. The matrix is almost perfectly diagonal, indicating that the selected flow-level statistics produce strong separability among the five services. The residual errors are rare and concentrated in very small off-diagonal entries.
Figure 8, Figure 9 and Figure 10 summarize the federated results by client case. Under C1 (mixed-label), all aggregation strategies preserve a largely diagonal structure, consistent with the near-centralized performance reported in Table 7. Small differences appear mainly in minority classes, but no dominant confusion pattern emerges. Under C3 (hash-based semi-IID), the confusion matrices remain close to diagonal for FedAvg, FedAdam, and FedAvgM, while FedYogi shows slightly larger off-diagonal mass, consistent with its reduced accuracy and AUC relative to the other strategies.
Under C2 (service-based), confusion patterns become more pronounced and strategy-dependent. FedAvg exhibits the strongest degradation, with increased off-diagonal mass indicating instability under extreme label skew. Adaptive and momentum-based strategies reduce the dominant confusions to varying degrees. FedAvgM and FedAdam improve separability relative to FedAvg, and FedYogi yields the best point classification performance in this setting, although some residual confusions remain. These results support the key finding of this work: server-side aggregation has limited impact under near-IID and semi-IID partitions, but it becomes a decisive factor under highly heterogeneous, label-skewed client data.
8. Conclusions and Future Work
This paper studied the impact of server-side aggregation on federated QUIC traffic classification under heterogeneous client data distributions. Using a consistent model architecture, training protocol, and pooled-test evaluation, we compared four aggregation strategies across three data partitioning cases. The results show that client data distribution is the primary factor that determines performance. Under mixed-label (C1) and hash-based semi-IID (C3) clients, federated learning matches or closely approaches the centralized baseline, and the choice of aggregation strategy has a limited effect. Under the service-based split (C2), which represents extreme label skew, performance drops for all methods, but adaptive and momentum-based aggregation substantially improve results compared to FedAvg. These findings confirm that advanced server-side optimizers become most valuable when heterogeneity induces strong client drift.
Several limitations of the current study should be noted. The experiments are conducted on a single QUIC traffic dataset with five Google service classes captured in a controlled environment, and evaluation under additional protocols, larger class sets, and different capture environments would help establish how broadly the conclusions transfer. The federated simulation assumes synchronous rounds and full client participation, which may not reflect client availability constraints in operational networks. The partitioning cases C1 and C3 produce statistically similar label distributions; adopting a Dirichlet-based partition in future work would provide a cleaner three-way heterogeneity spectrum. The analysis is based on flow-level statistical features and a fixed MLP architecture; more expressive sequence-based representations or self-supervised embeddings may change the relative advantage of aggregation strategies. Finally, evaluation is conducted on a pooled test set derived from the same dataset source, and cross-dataset testing and domain shift scenarios remain necessary to assess generalization across networks and time.
Future work should extend the study in directions that reflect operational deployments. First, evaluation should include additional heterogeneity sources such as unbalanced client sizes, partial participation, stragglers, and asynchronous training. Second, robustness should be tested under domain shift by training on one capture setting and evaluating on different networks, devices, or time periods. Third, the dataset scope should be broadened to include additional protocols and larger-scale traffic collections to assess the generalizability of the findings across diverse network environments. Fourth, model and feature design can be expanded beyond aggregated statistics to include packet-length and timing sequences, as well as self-supervised representations tailored to encrypted traffic. Fifth, a Dirichlet-based partitioning scheme should be adopted to produce a genuinely distinguishable semi-IID heterogeneity regime. Sixth, personalization and clustered federated learning can be explored to handle persistent client-specific distributions in C2-like environments. Finally, integrating privacy mechanisms such as secure aggregation or differential privacy would enable a more complete assessment of the accuracy–privacy trade-off for practical federated traffic classification.
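As a rough illustration of the Dirichlet-based partitioning proposed above, the following sketch splits each class across clients according to proportions drawn from a symmetric Dirichlet distribution. The function name, the rounding scheme, and the use of normalised Gamma draws are assumptions for illustration. The concentration parameter `alpha` controls the skew: small values concentrate each class on few clients, while large values approach an even split.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients with per-class Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalised.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total == 0.0:
            # Guard against numerical underflow for very small alpha.
            draws, total = [1.0] * num_clients, float(num_clients)
        start, cumulative = 0, 0.0
        for k in range(num_clients):
            cumulative += draws[k] / total
            if k == num_clients - 1:
                end = len(indices)                  # last client takes the remainder
            else:
                cut = round(cumulative * len(indices))
                end = min(len(indices), max(start, cut))
            clients[k].extend(indices[start:end])
            start = end
    return clients
```

Sweeping `alpha` over a few orders of magnitude would give a graded heterogeneity spectrum between the near-IID and single-service extremes, which is the property the current C1 and C3 cases do not clearly provide.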