Article

Impact of Server-Side Aggregation on Federated Traffic Classification Under Heterogeneous Data Distributions

by
Salam Allawi Hussein
* and
Sándor R. Répás
Faculty of Informatics and Electrical Engineering, Department of Electrical Engineering and Infocommunications, Széchenyi István University of Győr, Egyetem tér 1, 9026 Győr, Hungary
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(6), 167; https://doi.org/10.3390/bdcc10060167
Submission received: 18 March 2026 / Revised: 12 May 2026 / Accepted: 20 May 2026 / Published: 22 May 2026

Abstract

The growing prevalence of encrypted network traffic has rendered traditional payload-based inspection ineffective, shifting attention toward flow-level statistical analysis combined with machine learning. At the same time, privacy regulations and distributed network architectures make centralised data collection increasingly impractical, motivating federated learning as a privacy-preserving alternative. Despite its promise, deploying federated learning for encrypted traffic classification in realistic environments remains challenging, particularly under heterogeneous client data distributions that arise when different network sites observe different subsets of services. This paper examines how server-side aggregation affects federated QUIC traffic classification under such heterogeneous conditions. We use a five-class Google QUIC dataset and represent each flow with eight statistical features derived from packet size and timing. We compare a centralised baseline with federated learning under three client partitions: mixed-label clients (C1), service-based single-class clients (C2), and hash-based semi-IID clients (C3). For each case, we evaluate four Flower aggregation strategies: FedAvg, FedAdam, FedAvgM, and FedYogi. Results show that client distribution has a greater impact on performance than the choice of aggregation strategy. Federated models match or closely approach centralised performance in C1 and C3, with accuracy up to 0.9969 and macro-AUC near 1.0. In C2, accuracy drops due to extreme label skew, but adaptive aggregation mitigates the effect. FedYogi achieves the best C2 accuracy of 0.9287, while FedAvgM attains the highest C2 macro-AUC of 0.9885. ROC curves and confusion matrices confirm that the choice of aggregation matters mainly under severe heterogeneity.

1. Introduction

The rapid expansion of encrypted network communications has significantly transformed modern traffic analysis and cybersecurity practices. As encryption becomes pervasive across web services, mobile applications, and cloud platforms, traditional payload inspection techniques have become increasingly ineffective. Consequently, network operators and security systems have shifted toward statistical flow-level analysis combined with machine learning techniques to maintain visibility over network behavior [1,2].
Traffic flow classification has therefore become a critical component in network management, quality-of-service enforcement, and threat detection. Prior studies demonstrate that machine learning models can successfully infer traffic types from aggregated flow statistics even when payloads are unavailable. However, the growing scale and diversity of network traffic continue to challenge the robustness and generalization capability of these models in real-world deployments [3,4].
Most existing traffic classification systems rely on centralized learning architectures in which raw traffic data are collected and processed at a single location. While centralized models often achieve strong predictive accuracy, they raise significant concerns related to privacy protection, regulatory compliance, and cross-domain data sharing. In distributed network environments such as IoT and multi-organizational infrastructures, transferring raw traffic traces is frequently impractical or restricted, limiting the applicability of purely centralized solutions [5,6].
Federated Learning (FL) has recently emerged as a promising paradigm that enables collaborative model training while keeping sensitive data localized at client devices. In networking and cybersecurity contexts, FL has been applied to intrusion detection, anomaly detection, and traffic classification tasks, showing encouraging performance while preserving data confidentiality [2]. Several recent studies report that federated models can approach centralized accuracy under controlled conditions, making FL an attractive candidate for privacy-sensitive network analytics [7,8].
Despite this progress, deploying FL for network traffic classification in realistic environments remains challenging. A fundamental issue is the presence of heterogeneous client data distributions. In operational networks, different monitoring points typically observe different subsets of applications and services, resulting in strongly non-identically distributed (non-IID) datasets. Prior research indicates that such heterogeneity can slow convergence, destabilize training, and degrade global model performance, yet many studies still evaluate FL under simplified or near-IID assumptions [2,8].
Another critical factor influencing federated performance is the server-side aggregation mechanism. The widely adopted FedAvg algorithm provides an efficient baseline but is known to suffer under highly skewed data distributions. To improve robustness, adaptive and momentum-based optimizers such as FedAdam, FedYogi, and FedAvgM have been proposed. Although these methods show promise in general federated optimization tasks, their comparative behavior in encrypted traffic classification scenarios has not been thoroughly investigated [9,10].
Moreover, prior work often reports overall accuracy as the primary evaluation metric, with limited attention to class-level reliability and misclassification patterns. For security-oriented applications, detailed analysis using F1-score, ROC-AUC, and confusion matrices is essential to assess detection quality across diverse traffic categories. The lack of controlled, side-by-side comparisons across multiple aggregation strategies and heterogeneous data settings reveals a clear gap in the current literature [2,11].
To address these limitations, this study presents a systematic investigation of federated traffic classification under heterogeneous data distributions. The proposed framework evaluates multiple client partitioning strategies ranging from near-IID to highly non-IID conditions and compares four representative aggregation algorithms within a unified experimental setting. By maintaining a consistent neural architecture and training protocol, the study isolates the impact of aggregation design and data heterogeneity on classification performance in realistic network environments.
  • Motivation
Modern networks carry a large share of their traffic in encrypted form, which limits the usefulness of payload inspection and shifts attention toward flow-level statistical analysis. Network operators still need reliable visibility to manage performance, enforce policies, and detect misuse, yet the flow-level metadata that enables this visibility, such as packet sizes and inter-arrival times, remains sensitive and subject to privacy and regulatory constraints even when payloads are encrypted. In practice, traffic data are generated across many sites and devices, each reflecting local usage patterns and services, making centralised collection impractical due to privacy rules, data ownership concerns, and operational costs. It is important to note that encryption and federated learning address complementary privacy dimensions: encryption protects packet content from network observers, while federated learning prevents the flow-level statistics used for classification from being centralised and exposed. Together they provide layered protection that motivates the use of federated learning even when traffic is already encrypted. Federated learning fits this setting by enabling collaborative model training without moving raw traffic records across organisational boundaries. Understanding how well it operates under realistic network diversity, where different sites observe different subsets of services, is therefore essential for dependable deployment.
  • Problem statement
Current research on privacy-preserving traffic analysis has not yet provided a clear picture of how federated learning behaves under realistic network conditions. In operational networks, different sites typically observe different subsets of services and applications, which leads to strongly skewed and non-identical data distributions. Such heterogeneity can affect model convergence and reduce classification reliability, yet it is often simplified in experimental studies. In addition, multiple server-side aggregation methods have been proposed, but their comparative impact on encrypted traffic classification remains insufficiently examined. There is also a need for evaluations that go beyond overall accuracy and reflect class-level behavior. This work addresses these issues by examining federated traffic classification under controlled and diverse data partitions. It further studies how aggregation design influences stability and predictive performance in these settings.
  • Contributions
This work makes several practical contributions to privacy-preserving traffic classification in distributed environments. It provides a structured comparison of four aggregation algorithms selected to cover the principal design families in the federated optimisation literature: FedAvg as the foundational weighted-averaging baseline, FedAvgM as a momentum-based extension, and FedAdam and FedYogi as adaptive server-side optimizers from the FedOpt framework. FedAdam follows the standard Adam update, whereas FedYogi replaces the second-moment accumulation with a Yogi-style update designed to control second-moment growth under non-IID conditions. This selection covers the main axes of variation in server-side aggregation design, namely plain averaging, momentum correction, and adaptive optimisation, within a controlled and comparable experimental setting. The evaluation is conducted across multiple client data distributions that reflect realistic variations in how services are observed at different sites. A consistent learning architecture and training protocol are maintained to ensure fair and reproducible comparisons. Performance is examined using complementary metrics, including accuracy, F1-score, ROC-AUC, and confusion matrices, to capture both overall and class-level behaviour. This design enables a clear view of how aggregation strategy and data heterogeneity interact in practice, and the findings offer practical insights for selecting federated configurations in network monitoring scenarios.

2. Related Work

The rapid growth of encrypted network traffic has significantly complicated application and behavior identification, as modern protocols such as TLS, QUIC, and VPN obscure payload visibility and limit the effectiveness of traditional inspection techniques [3]. Recent surveys therefore report a clear shift toward flow-level statistical analysis combined with machine learning to maintain visibility in encrypted environments, while noting persistent challenges related to dataset realism, protocol diversity, and model generalization [1]. To  enhance discriminative capability, multi-flow behavioral modeling has been explored, demonstrating improved performance by capturing contextual relationships among flows generated by the same application [12]. In parallel, graph-based deep learning and feature-fusion architectures have been shown to effectively model temporal and structural dependencies in encrypted traffic using metadata alone, achieving strong results across multiple benchmark datasets [13,14]. Attention-enhanced hybrid models further improve classification accuracy and interpretability, although most evaluations remain confined to centralized training scenarios [15]. Despite these advances, dataset construction and labeling granularity continue to represent major bottlenecks, motivating automated fine-grained labeling frameworks that support reliable supervised learning without requiring privileged device access [16]. Classical TLS fingerprinting and SNI-based techniques remain useful baselines but suffer from feature overlap and reduced robustness under evolving encryption patterns [17]. More recently, few-shot meta-learning has been investigated to mitigate data scarcity in encrypted traffic classification; however, these approaches largely assume centralized data availability and do not address distributed heterogeneity [18]. 
Earlier deep learning studies also confirmed that flow-level models can preserve user privacy while maintaining high classification accuracy, yet they did not consider decentralized training environments [19].
Federated learning (FL) has consequently emerged as a promising paradigm for privacy-preserving network analytics by enabling collaborative model training without sharing raw traffic data. Comprehensive reviews highlight its potential for intrusion detection and traffic analysis while identifying key challenges, including non-IID data distributions, communication overhead, and secure aggregation requirements [20]. In the domain of traffic classification specifically, Bakopoulou et al. [21] propose a federated learning approach to mobile packet classification that enables devices to collaboratively train a global model without uploading locally collected traffic data, representing one of the closest prior works to the federated traffic classification framework studied here and confirming the practical feasibility of the approach in mobile network environments. In response, several federated frameworks have been proposed to improve robustness under heterogeneous conditions. Lightweight federated intrusion detection systems that combine dynamic feature fusion with hybrid deep models have demonstrated strong detection capability with reduced computational cost in vehicular and IoT environments [22]. Adaptive FL frameworks further incorporate personalization and cryptographic protection of model updates to maintain accuracy under heterogeneous traffic distributions [23], while knowledge-distillation-based approaches attempt to reduce model drift and improve consistency across distributed IoT clients [24]. Additional studies confirm the practical feasibility of federated IDS deployments in industrial settings, showing that properly configured federated models can approach centralized performance while preserving data locality and reducing communication overhead [25,26]. Mutual-learning-based federated traffic classification has also been explored to enhance generalization across non-IID clients through collaborative knowledge exchange prior to aggregation [27]. 
Hybrid approaches that combine federated learning with incremental learning have been proposed to further improve adaptability in privacy-sensitive deployments. The HERALD framework [28] demonstrates the feasibility of decentralised model updates under heterogeneous and evolving data distributions in healthcare scenarios, highlighting design principles that are transferable to dynamic network monitoring environments.
Despite these advances, the performance and reliability of federated systems remain strongly influenced by the server-side aggregation mechanism. The  Federated Averaging (FedAvg) algorithm established the foundational communication-efficient paradigm for decentralized optimization through weighted aggregation of locally trained models [29]. Subsequent analyses demonstrated that statistical heterogeneity can significantly slow convergence and destabilize training, necessitating careful control of optimization dynamics under non-IID data distributions [30]. To address these limitations, adaptive federated optimizers such as FedAdam and FedYogi were introduced, showing improved convergence behavior in heterogeneous environments [31]. A comprehensive survey of model aggregation techniques in federated learning [32] provides a broad taxonomy of server-side aggregation methods, highlighting the trade-offs among communication efficiency, convergence stability, and robustness under heterogeneous data distributions, which further motivates the systematic comparison of aggregation strategies conducted in this study. From a security standpoint, secure multi-party computation-based aggregation has been investigated to enhance robustness against poisoning attacks, although it introduces additional communication overhead and does not fully eliminate vulnerabilities to malicious clients [33]. Privacy-focused studies further emphasize the importance of protecting model updates using differential privacy, homomorphic encryption, and secure computation, while noting persistent risks related to gradient leakage and system heterogeneity [34]. More recent work on verifiable secure aggregation enables post-aggregation correctness validation using lightweight cryptographic commitments, yet achieving an effective balance among robustness, scalability, and heterogeneity tolerance remains an open challenge in practical federated traffic analysis systems [35].
Overall, although substantial progress has been achieved in encrypted traffic analysis and federated intrusion detection, existing studies rarely provide a unified and systematic evaluation of server-side aggregation strategies under controlled heterogeneous data distributions. In particular, the comparative behavior of classical and adaptive federated optimizers across varying degrees of client heterogeneity in flow-level traffic classification remains insufficiently explored.

3. Dataset and Feature Extraction

3.1. Traffic Dataset

The experiments use a QUIC traffic dataset (https://www.kaggle.com/datasets/guillaumefraysse/ucdavisquic, accessed on 28 December 2025) containing five Google services: Google Docs, Google Drive, Google Music, Google Search, and YouTube [36,37]. The dataset consists of multiple independently captured QUIC flows per service. Since QUIC encrypts transport-layer payloads, packet content is not available. The learning task therefore relies only on flow-level statistics derived from observable packet headers and timing.

3.2. Flow Representation and Feature Extraction

Each bidirectional QUIC flow is mapped to a fixed-length feature vector by aggregating packet-level information over the flow duration. We extract eight statistical features per flow:
  • Packet count
  • Mean packet size
  • Standard deviation of packet size
  • Minimum packet size
  • Maximum packet size
  • Mean inter-arrival time
  • Standard deviation of inter-arrival time
  • Uplink packet ratio
These features summarize traffic volume, dispersion, temporal dynamics, and directionality without using any encrypted content. Table 1 reports the number of flows per service in the QUIC dataset.
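The eight statistics listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `flow_features`, its three inputs (per-packet sizes, arrival timestamps, and an uplink flag), the feature order, and the use of the population standard deviation are not taken from the paper, and the flow is assumed to contain at least one packet.

```python
import numpy as np

def flow_features(sizes, times, is_uplink):
    """Map one flow to eight summary statistics (hypothetical helper).

    sizes: packet sizes in bytes; times: arrival timestamps in seconds;
    is_uplink: True for client-to-server packets.
    """
    sizes = np.asarray(sizes, dtype=float)
    times = np.asarray(times, dtype=float)
    up = np.asarray(is_uplink, dtype=bool)
    # Inter-arrival times need at least two packets; otherwise use a single zero.
    iat = np.diff(np.sort(times)) if sizes.size > 1 else np.zeros(1)
    return np.array([
        sizes.size,    # packet count
        sizes.mean(),  # mean packet size
        sizes.std(),   # std of packet size (population, ddof=0: an assumption)
        sizes.min(),   # minimum packet size
        sizes.max(),   # maximum packet size
        iat.mean(),    # mean inter-arrival time
        iat.std(),     # std of inter-arrival time
        up.mean(),     # uplink packet ratio
    ])

# Toy flow with four packets, two of them sent uplink.
features = flow_features(
    sizes=[100, 1200, 1200, 60],
    times=[0.0, 0.1, 0.3, 0.6],
    is_uplink=[True, False, False, True],
)
```

None of these quantities requires access to packet payloads, which is the property the representation relies on.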
We adopt a stratified holdout protocol with a fixed random seed for reproducibility. For each experimental case, the data assigned to each client are split into training and testing subsets using an 80/20 ratio. When a client contains more than one class, the split is stratified to preserve the local class distribution. Feature standardization is performed using a global scaler fitted only on the union of all client training samples. The fitted scaler is then applied to transform both training and testing data for every client. This design avoids test set leakage while ensuring that all clients share the same feature scaling during federated optimization.
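A minimal sketch of the shared scaling policy follows, assuming each client's training features are available as a NumPy array. The helpers `fit_global_scaler` and `apply_scaler` are hypothetical stand-ins for a standard scaler, and the two-feature toy arrays are invented for illustration (real flows have eight features).

```python
import numpy as np

def fit_global_scaler(client_train_sets):
    """Fit per-feature mean and std on the union of all clients' training data."""
    pooled = np.vstack(client_train_sets)
    mean = pooled.mean(axis=0)
    std = pooled.std(axis=0)
    std[std == 0.0] = 1.0  # guard constant features against division by zero
    return mean, std

def apply_scaler(x, mean, std):
    """Standardize features with statistics fitted elsewhere."""
    return (x - mean) / std

# Toy data: two clients, two features each.
train_a = np.array([[1.0, 10.0], [3.0, 30.0]])
train_b = np.array([[5.0, 50.0], [7.0, 70.0]])
test_a = np.array([[4.0, 40.0]])

mean, std = fit_global_scaler([train_a, train_b])
scaled_train = apply_scaler(np.vstack([train_a, train_b]), mean, std)
scaled_test_a = apply_scaler(test_a, mean, std)  # test rows never influence the fit
```

Because the statistics come from training rows only, the test partition is transformed but never used for fitting, which is the leakage-avoidance property described above.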

4. Learning Framework

4.1. Implementation Environment

All experiments were implemented in Python 3.9.21 using TensorFlow 2.11.0 for neural network training and the Flower 1.6.0 framework for federated learning simulation, with Ray as the client simulation backend. Supporting libraries included NumPy 1.23.5, pandas 1.5.3, scikit-learn 1.2.2, SciPy 1.7.3, and Matplotlib 3.8.3. Experiments were executed on a workstation equipped with an AMD Ryzen 7 CPU, 32 GB system memory, and an NVIDIA GeForce GPU with 12 GB of video memory. To ensure full reproducibility, a fixed random seed of 42 was applied to all NumPy operations (np.random.seed(42)), TensorFlow graph construction (tf.random.set_seed(42)), and all data partitioning procedures. All partition-level random number generators were instantiated as np.random.default_rng(42).

4.2. Model Architecture

We use a feedforward neural network implemented in TensorFlow Keras. The model takes a fixed-length feature vector as input and outputs a probability distribution over services.
  • Input layer with dimension $d$ equal to the number of extracted flow features ($d = 8$ in all experiments)
  • Fully connected layer with 128 units and ReLU activation
  • Fully connected layer with 64 units and ReLU activation
  • Output layer with $K$ units and softmax activation ($K = 5$ for the five Google services)
The MLP architecture was selected for three reasons. First, the fixed-length tabular feature representation used in this study does not carry sequential or spatial structure that would benefit from recurrent or convolutional models. For aggregate statistical flow features of this kind, MLP architectures are well established as strong and efficient baselines in the encrypted traffic classification literature. Second, the primary objective of this study is to isolate the effect of server-side aggregation strategy and client data heterogeneity on federated classification performance. Introducing multiple model architectures would add a confounding variable that would prevent clean attribution of observed performance differences to aggregation design. Third, the simplicity of the MLP reduces client-side computational cost, which is relevant for federated deployments on resource-constrained monitoring devices. The network is compiled with the Adam optimizer at a fixed learning rate of $10^{-3}$, using sparse categorical cross-entropy as the loss function. The same architecture and optimizer configuration are used identically on every client in all federated experiments; no client-specific modifications are applied. Each client instantiates a fresh local model at the start of every communication round by loading the current global parameters via set_weights before local training begins. Figure 1 shows the architecture of the feedforward neural network for service classification.

4.3. Centralized Baseline (C0)

The centralized baseline trains a single global model on the full dataset. A stratified holdout split with an 80/20 ratio is applied to preserve the overall class distribution, yielding 5151 training samples and 1288 test samples. A StandardScaler is fitted on the training subset only and applied to both training and test subsets to prevent test-set leakage. The model is trained for 30 epochs with a batch size of 32. Evaluation is performed on the held-out test set using the metrics reported for the federated runs. Table 2 summarises all training hyperparameters shared across the centralized and federated configurations.

5. Federated Learning Setup

Figure 2 illustrates the federated learning workflow adopted in this study. The framework consists of multiple clients and a central aggregation server. In each communication round, clients train the local model on their private data and transmit the updated model parameters to the server. The server then aggregates the received parameters to produce an updated global model, which is broadcast back to clients for the next round. This iterative procedure enables collaborative model training without sharing raw traffic data.

5.1. System Configuration

We simulate federated learning using the Flower 1.6.0 framework with Ray as the virtual client engine. Each client trains a local instance of the same neural network architecture described in Section 4. The server coordinates training for 30 communication rounds with full client participation in every round. All configuration values are listed in Table 2.
For evaluation, we construct a pooled test set by concatenating all client test partitions. After each communication round, the server evaluates the current global model on this pooled test set. After the final round, the final global model is evaluated and accuracy, together with class-level metrics, is reported. Feature standardization follows a shared scaling policy: a single StandardScaler is fitted on the concatenation of all clients’ training data only, and the fitted scaler is then applied to transform both training and test data for every client. This design avoids test-set leakage while ensuring consistent feature scaling across all clients and experimental cases.

Computational and Communication Complexity

All four aggregation strategies share the same asymptotic time complexity. Let $P$ denote the total number of trainable model parameters, $K$ the number of clients, $R$ the number of communication rounds, $N_k$ the number of local training samples at client $k$, $E$ the number of local epochs per round, and $B$ the local batch size. For the model used in this study, $P = 9733$, derived from three fully connected layers: $8 \times 128 + 128 = 1152$ parameters in the first layer, $128 \times 64 + 64 = 8256$ in the second, and $64 \times 5 + 5 = 325$ in the output layer.
The dominant cost in every round is local client training. Each client performs one forward and one backward pass per batch, with a per-sample cost of $O(d \cdot L_1 + L_1 \cdot L_2 + L_2 \cdot K)$, which evaluates to 9536 multiply-accumulate operations per forward pass for $d = 8$, $L_1 = 128$, $L_2 = 64$, and $K = 5$ (in this expression $K$ is the number of output classes, which here equals the number of clients). With $E = 1$ local epoch and batch size $B = 32$, each client processes approximately $N_k / B$ batches per round. Under C1 and C3 each client holds approximately 1030 training samples (≈32 batches), while under C2 client sizes range from 473 (Google Music) to 1532 (Google Search) training samples.
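The parameter and operation counts quoted above can be re-derived with plain arithmetic. The sketch below assumes only the layer widths stated in the text; the helper `batches_per_epoch` is hypothetical and counts a final partial batch as one full step, which is why it returns 33 rather than roughly 32 for 1030 samples.

```python
import math

layer_sizes = [8, 128, 64, 5]  # input d, hidden L1, hidden L2, output classes

# Each dense layer has fan_in * fan_out weights plus fan_out biases.
params_per_layer = [a * b + b for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(params_per_layer)

# Multiply-accumulates for one forward pass over one sample (biases excluded).
macs_per_sample = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def batches_per_epoch(n_samples, batch_size=32):
    """Number of optimizer steps in one local epoch, counting a partial batch."""
    return math.ceil(n_samples / batch_size)

print(params_per_layer, total_params, macs_per_sample)
print(batches_per_epoch(1030), batches_per_epoch(473), batches_per_epoch(1532))
```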
The server-side aggregation cost per round for each strategy is as follows. FedAvg performs a single weighted average requiring $O(K \cdot P)$ operations. FedAvgM adds one momentum buffer update of $O(P)$. FedAdam and FedYogi additionally maintain first and second moment vectors and perform element-wise adaptive scaling, adding $O(3P)$ operations. Since $K \cdot P = 5 \times 9733 = 48{,}665$ and the additional adaptive operations cost at most $3P = 29{,}199$ element-wise scalar operations, the server overhead is negligible relative to the local training cost across all five clients per round. Communication overhead is identical across all strategies: each round transmits $P = 9733$ parameters from the server to each client and back, giving $2KP = 97{,}330$ parameter transfers per round regardless of aggregation rule. Table 3 summarises these costs.
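The per-round accounting in this paragraph can be written out as a short calculation. The function `round_costs` and its dictionary layout are hypothetical; the counts simply mirror the element-wise operation tallies given in the text and are not measured timings.

```python
def round_costs(num_clients=5, num_params=9733):
    """Per-round element-wise operation counts, following the text's accounting."""
    averaging = num_clients * num_params  # weighted average, needed by every strategy
    extra = {
        "FedAvg": 0,
        "FedAvgM": num_params,       # one momentum buffer update
        "FedAdam": 3 * num_params,   # two moment vectors plus adaptive scaling
        "FedYogi": 3 * num_params,
    }
    server_ops = {name: averaging + cost for name, cost in extra.items()}
    transfers = 2 * num_clients * num_params  # download plus upload for each client
    return server_ops, transfers

server_ops, transfers = round_costs()
```

Even the most expensive server rule stays below 80,000 scalar operations per round in this tally, while the number of parameters exchanged is the same for all four strategies.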

5.2. Client Data Partitioning Strategies

Three partitioning strategies are used to cover a range of data heterogeneity conditions: Mixed-Label Clients (C1), Service-Based Clients (C2), and Hash-Based Semi-IID Clients (C3). In all three cases the number of clients is $N = 5$, matching the five Google service classes in the dataset.

5.2.1. C1 Mixed-Label Clients

C1 approximates an IID setting. All flow indices are shuffled using np.random.default_rng(42) and split into five equal partitions via np.array_split, giving approximately 1288 flows per client. Each partition becomes one client dataset and contains samples from all five classes in proportions close to the global distribution. Within each client an 80/20 stratified train–test split is applied to preserve the local class distribution.
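The C1 procedure can be sketched as follows. It is an assumption that the shuffle is performed with `Generator.permutation`; the text only names the generator and `np.array_split`. The total of 6439 flows is the sum of the centralised training and test set sizes (5151 + 1288), and `partition_c1` is a hypothetical helper name.

```python
import numpy as np

def partition_c1(num_flows, num_clients=5, seed=42):
    """Shuffle flow indices, then cut them into contiguous, near-equal blocks."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_flows)
    return np.array_split(shuffled, num_clients)

clients_c1 = partition_c1(6439)
sizes_c1 = [len(idx) for idx in clients_c1]
```

Since 6439 is not divisible by five, `np.array_split` gives four clients 1288 flows and the last one 1287, consistent with the "approximately 1288" figure above.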

5.2.2. C2 Service-Based Clients

C2 represents an extreme non-IID setting. One client is created per service class, so client $i$ receives only flows from class $i$. Client sizes reflect the class counts in the dataset: Google Docs (1221), Google Drive (1634), Google Music (592), Google Search (1915), and YouTube (1077). Within each client an 80/20 train–test split is applied without stratification, since labels are single-class. This setting produces maximally heterogeneous label distributions with zero label overlap across clients.
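A sketch of the service-based split is given below, using synthetic labels that reproduce the class counts quoted above. The helpers `partition_c2` and `holdout_split` are hypothetical, the class order is arbitrary, and rounding the test share up is an assumption about how the 80/20 split treats fractional sizes.

```python
import numpy as np

def partition_c2(labels):
    """One client per class: client c holds the indices of every flow labelled c."""
    labels = np.asarray(labels)
    return {int(c): np.flatnonzero(labels == c) for c in np.unique(labels)}

def holdout_split(indices, test_fraction=0.2, seed=42):
    """Plain (unstratified) shuffle split, adequate when a client has one class."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    n_test = int(np.ceil(test_fraction * len(shuffled)))  # assumption: round test size up
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic labels with the per-service flow counts from the text.
counts = [1221, 1634, 592, 1915, 1077]
labels = np.repeat(np.arange(5), counts)

clients_c2 = partition_c2(labels)
train_sizes = {c: len(holdout_split(idx)[0]) for c, idx in clients_c2.items()}
```

Under this rounding convention the smallest and largest clients end up with 473 and 1532 training flows, which agrees with the range reported in the complexity discussion.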

5.2.3. C3 Hash-Based Semi-IID Clients

C3 represents a structurally distinct but distributionally similar variant of C1. All flow indices are shuffled using np.random.default_rng(42) and sample $j$ in the permuted order is assigned to client $(j \bmod 5)$, producing five clients of approximately equal size (1288 flows each) with partial label mixing. Within each client an 80/20 stratified train–test split is applied. Because both C1 and C3 apply a global random shuffle before assignment, their per-client class distributions are statistically similar and both produce near-IID conditions. The distinction is structural: C1 uses contiguous index slicing while C3 uses modular interleaving. The two cases are therefore not intended to represent two distinct heterogeneity regimes, but rather to confirm that near-IID federated performance is consistent across two mechanically different partitioning procedures.
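The modular interleaving can be expressed with a strided slice over the shuffled indices. As with the other partitioning sketches, `partition_c3` is a hypothetical name, the use of `Generator.permutation` is an assumption, and 6439 is the total flow count implied by the dataset figures.

```python
import numpy as np

def partition_c3(num_flows, num_clients=5, seed=42):
    """Shuffle flow indices, then deal them out round-robin (position j -> client j mod 5)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_flows)
    # shuffled[c::num_clients] picks positions c, c + 5, c + 10, ...
    return [shuffled[c::num_clients] for c in range(num_clients)]

clients_c3 = partition_c3(6439)
sizes_c3 = [len(idx) for idx in clients_c3]
```

The only difference from contiguous slicing is which positions of the shuffled order land on each client, which illustrates why the resulting class mixtures are statistically similar.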

5.3. Aggregation Algorithms

Four aggregation algorithms are evaluated to isolate the impact of server-side design on federated classification performance. All strategy-specific server hyperparameters are listed in Table 4; local training follows the shared protocol in Table 2 for all strategies.

5.3.1. FedAvg

FedAvg performs weighted averaging of client model updates at the server. In each round, all clients train locally for one epoch and return updated weights. The server aggregates the weights $w_t^{k}$ returned by the clients in the participating set $S_t$, weighting each client by its number of local training samples $n_k$, which forms the baseline aggregation rule [38]:
$$\bar{w}_{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, w_t^{k}. \tag{1}$$
The next global model is set directly to the aggregated parameters:
$$w_{t+1} = \bar{w}_{t+1}. \tag{2}$$
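The weighted average can be sketched in NumPy as below. This is an illustrative stand-alone function rather than the Flower implementation; `fedavg`, its argument names, and the toy weights are assumptions made for the example.

```python
import numpy as np

def fedavg(client_weights, num_examples):
    """Sample-weighted average of client parameters, computed layer by layer.

    client_weights: one list of NumPy arrays per client (identical shapes).
    num_examples: local training-set size n_k for each client.
    """
    total = float(sum(num_examples))
    return [
        sum(n / total * layer for n, layer in zip(num_examples, layers))
        for layers in zip(*client_weights)
    ]

# Two toy clients, each with one weight matrix and one bias vector.
client_a = [np.ones((2, 2)), np.zeros(2)]
client_b = [3 * np.ones((2, 2)), 4 * np.ones(2)]
global_weights = fedavg([client_a, client_b], num_examples=[1, 3])
```

With one sample on the first client and three on the second, the result sits three quarters of the way toward the second client's parameters.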

5.3.2. FedAvgM

FedAvgM extends FedAvg with server-side momentum. Flower implements momentum using a pseudo-gradient defined as the difference between the current global parameters and the aggregated parameters [39]:
$$g_t = w_t - \bar{w}_{t+1}. \tag{3}$$
The server maintains a momentum buffer $u_t$ updated as
$$u_t = \begin{cases} g_t, & t = 0, \\ \mu\, u_{t-1} + g_t, & t > 0, \end{cases} \tag{4}$$
and applies an SGD-style server step
$$w_{t+1} = w_t - \eta\, u_t, \tag{5}$$
where $\mu = 0.9$ is the server momentum coefficient and $\eta = 1.0$ is the server learning rate.
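The momentum update can be sketched as a small stateful class operating on a flattened parameter vector. The class name `FedAvgMServer` and the flat-vector representation are assumptions for illustration; the aggregate passed to `step` is taken to be the FedAvg result for the round.

```python
import numpy as np

class FedAvgMServer:
    """Server momentum applied to the pseudo-gradient g_t = w_t - w_bar_{t+1}."""

    def __init__(self, initial_weights, momentum=0.9, server_lr=1.0):
        self.weights = np.asarray(initial_weights, dtype=float)
        self.momentum = momentum
        self.server_lr = server_lr
        self.buffer = None  # momentum buffer u_t, created in the first round

    def step(self, aggregated):
        grad = self.weights - np.asarray(aggregated, dtype=float)
        if self.buffer is None:
            self.buffer = grad                               # t = 0 case
        else:
            self.buffer = self.momentum * self.buffer + grad  # t > 0 case
        self.weights = self.weights - self.server_lr * self.buffer
        return self.weights

server = FedAvgMServer(np.zeros(2))
w1 = server.step(np.array([1.0, 2.0]))  # first round reproduces the aggregate
w2 = server.step(np.array([1.0, 2.0]))  # momentum carries the model past it
```

With a server learning rate of 1.0 the first round coincides with plain FedAvg; from the second round on, the buffer keeps pushing in the previous direction, so the global model can overshoot the current aggregate.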

5.3.3. FedAdam

FedAdam applies an Adam-style adaptive optimizer at the server, following the FedOpt design [40]. Starting from the FedAvg aggregate $\bar{w}_{t+1}$, a server pseudo-gradient is formed as
$$\Delta_t = \frac{\bar{w}_{t+1} - w_t}{\eta_l}, \tag{6}$$
where $\eta_l = 0.1$ is the client learning-rate reference parameter used by the FedOpt server optimizer (Flower 1.6.0 default). The server maintains first and second moments
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \Delta_t, \tag{7}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, \Delta_t^2, \tag{8}$$
with bias-corrected effective learning rate
$$\eta_t = \eta\, \frac{\sqrt{1 - \beta_2^{\,t+1}}}{1 - \beta_1^{\,t+1}}, \tag{9}$$
and updates the global model as
$$w_{t+1} = w_t + \eta_t\, \frac{m_t}{\sqrt{v_t} + \tau}. \tag{10}$$
All FedAdam server hyperparameters follow the Flower 1.6.0 defaults ($\eta = 0.1$, $\beta_1 = 0.9$, $\beta_2 = 0.99$, $\tau = 10^{-9}$); no explicit values were passed in the implementation.
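A compact sketch of the Adam-style server step follows. It is not the Flower source: the class name `FedAdamServer` and the flat parameter vector are assumptions, and the pseudo-gradient is taken as the plain difference between the aggregate and the current weights. A constant rescaling of the pseudo-gradient leaves the ratio of first moment to root second moment unchanged apart from the influence of $\tau$, so this simplification does not alter the behaviour shown here.

```python
import numpy as np

class FedAdamServer:
    """Adam-style server update driven by the FedAvg aggregate (illustrative)."""

    def __init__(self, initial_weights, eta=0.1, beta_1=0.9, beta_2=0.99, tau=1e-9):
        self.weights = np.asarray(initial_weights, dtype=float)
        self.eta, self.beta_1, self.beta_2, self.tau = eta, beta_1, beta_2, tau
        self.m = np.zeros_like(self.weights)  # first moment m_t
        self.v = np.zeros_like(self.weights)  # second moment v_t
        self.t = 0                            # round counter, starting at 0

    def step(self, aggregated):
        # Assumption: plain difference as pseudo-gradient (no eta_l rescaling).
        delta = np.asarray(aggregated, dtype=float) - self.weights
        self.m = self.beta_1 * self.m + (1 - self.beta_1) * delta
        self.v = self.beta_2 * self.v + (1 - self.beta_2) * delta**2
        # Bias-corrected effective learning rate eta_t.
        correction = np.sqrt(1 - self.beta_2 ** (self.t + 1)) / (1 - self.beta_1 ** (self.t + 1))
        eta_t = self.eta * correction
        self.weights = self.weights + eta_t * self.m / (np.sqrt(self.v) + self.tau)
        self.t += 1
        return self.weights

server = FedAdamServer(np.zeros(2))
w1 = server.step(np.array([1.0, -2.0]))
```

In the first round every coordinate moves by approximately $\eta$ in the direction of the aggregate, regardless of how large the pseudo-gradient is, which illustrates the per-coordinate normalisation that distinguishes adaptive server optimizers from plain averaging.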

5.3.4. FedYogi

FedYogi applies a Yogi-style adaptive update at the server, designed to control the growth of the second moment estimate under non-IID updates [41]. It uses the same $\Delta_t$ and $m_t$ as in Equations (6) and (7), but replaces the second moment update with
$$v_t = v_{t-1} - (1 - \beta_2)\, \Delta_t^2\, \operatorname{sign}\!\left(v_{t-1} - \Delta_t^2\right), \tag{11}$$
and applies the same server step as FedAdam:
$$w_{t+1} = w_t + \eta\, \frac{m_t}{\sqrt{v_t} + \tau}. \tag{12}$$
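The Yogi-style second-moment rule can be sketched in the same way. The class name `FedYogiServer`, the flat parameter vector, and the default values of `eta` and `tau` are illustrative placeholders rather than the study's settings (those are listed in Table 4), and the pseudo-gradient is again taken as the plain difference between the aggregate and the current weights.

```python
import numpy as np

class FedYogiServer:
    """Yogi-style server update; eta and tau defaults are illustrative placeholders."""

    def __init__(self, initial_weights, eta=0.01, beta_1=0.9, beta_2=0.99, tau=1e-3):
        self.weights = np.asarray(initial_weights, dtype=float)
        self.eta, self.beta_1, self.beta_2, self.tau = eta, beta_1, beta_2, tau
        self.m = np.zeros_like(self.weights)  # first moment m_t
        self.v = np.zeros_like(self.weights)  # second moment v_t

    def step(self, aggregated):
        delta = np.asarray(aggregated, dtype=float) - self.weights
        delta_sq = delta**2
        self.m = self.beta_1 * self.m + (1 - self.beta_1) * delta
        # Additive correction: v moves toward delta^2 by a step of (1 - beta_2) * delta^2.
        self.v = self.v - (1 - self.beta_2) * delta_sq * np.sign(self.v - delta_sq)
        self.weights = self.weights + self.eta * self.m / (np.sqrt(self.v) + self.tau)
        return self.weights

server = FedYogiServer(np.zeros(2))
server.step(np.array([1.0, -2.0]))
v_after_first = server.v.copy()
server.step(server.weights.copy())  # aggregate equals current weights: zero pseudo-gradient
v_after_second = server.v.copy()
```

When the pseudo-gradient is zero, the second moment stays where it was, whereas the Adam rule would have shrunk it by a factor of $\beta_2$. This additive behaviour is what limits abrupt changes in the effective step size.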

6. Experimental Results

This section evaluates network traffic classification under different federated learning settings. We compare a centralized baseline with multiple federated configurations that differ in client data distribution, label balance, and aggregation strategy.

6.1. Evaluation Metrics

We evaluate the proposed federated traffic classification framework using accuracy, weighted precision, weighted recall, weighted F1-score, macro F1-score, and ROC-AUC. These metrics quantify overall correctness, robustness under class imbalance, and threshold-independent separability. We further use confusion matrices to analyze class-level error patterns.
Let $C$ denote the set of classes. For each class $c \in C$, we compute true positives $TP_c$, false positives $FP_c$, and false negatives $FN_c$ in a one-vs-rest manner. Class-wise precision, recall, and F1-score are
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c},$$
$$\mathrm{F1}_c = \frac{2 \times \mathrm{Precision}_c \times \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}.$$
Overall accuracy is defined as
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\hat{y}_i = y_i\right),$$
where $N$ is the number of test samples, $y_i$ is the true label, $\hat{y}_i$ is the predicted label, and $\mathbb{I}(\cdot)$ is the indicator function.
To account for class imbalance, we report weighted averages. Let $n_c$ be the number of test samples in class $c$. Weighted precision, weighted recall, and weighted F1-score are computed as
$$\mathrm{Precision}_{\mathrm{weighted}} = \sum_{c \in C} \frac{n_c}{N}\, \mathrm{Precision}_c, \qquad \mathrm{Recall}_{\mathrm{weighted}} = \sum_{c \in C} \frac{n_c}{N}\, \mathrm{Recall}_c, \qquad \mathrm{F1}_{\mathrm{weighted}} = \sum_{c \in C} \frac{n_c}{N}\, \mathrm{F1}_c.$$
Macro F1-score gives equal importance to each class:
$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{|C|} \sum_{c \in C} \mathrm{F1}_c.$$
We report ROC-AUC using two aggregation schemes. Micro-averaged AUC ($\mathrm{AUC}_{\mathrm{micro}}$) pools all one-vs-rest scores across classes and samples before computing the ROC curve, which emphasizes frequent classes in figures. Macro one-vs-rest AUC ($\mathrm{AUC}_{\mathrm{macro,OVR}}$) computes AUC separately for each class and then averages over classes, giving equal importance to all services.
Finally, confusion matrices summarize the number of samples assigned to each predicted class versus the true class and highlight dominant misclassifications across services.
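The relationship between the weighted and macro averages can be sketched directly from a confusion matrix. The example below is illustrative only: the function `f1_summary`, the two-class matrix, and the convention of returning 0.0 when a denominator is zero are assumptions, and the numbers are not results from this study.

```python
from typing import Dict, List


def f1_summary(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy, weighted F1, and macro F1 from a confusion matrix.

    cm[i][j] counts samples with true class i and predicted class j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    f1_scores, supports = [], []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp
        fn = sum(cm[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
        supports.append(sum(cm[c]))
    return {
        "accuracy": sum(cm[c][c] for c in range(k)) / n,
        "f1_weighted": sum(s / n * f for s, f in zip(supports, f1_scores)),
        "f1_macro": sum(f1_scores) / k,
    }


# Hypothetical imbalanced case: 100 majority samples, 10 minority samples.
summary = f1_summary([[90, 10],
                      [5, 5]])
```

Here the weighted F1-score is about 0.876 while the macro F1-score is about 0.662, because the minority class (F1 of 0.4) counts for half of the macro average but for only 10 of 110 samples in the weighted one.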

6.2. Centralized vs. Federated Performance

Table 5 compares the centralized baseline (C0) against the best federated configuration obtained under each client partitioning case. The centralized model achieves near-perfect performance, confirming that the selected flow-level statistics provide strong discriminative power even without payload access. Under mixed-label clients (C1), federated learning matches the centralized baseline, with FedAdam reaching the same accuracy (0.9969) and an almost identical AUC (1.0000). Under hash-based semi-IID clients (C3), performance remains high (best accuracy 0.9946 with FedAdam), indicating that moderate heterogeneity does not substantially reduce classification quality when clients still observe multiple services. In contrast, the service-based split (C2) produces a marked performance drop across all strategies. Even the best C2 configuration (FedYogi) reaches only 0.9287 accuracy, which is substantially lower than C0 and C1. This gap highlights that extreme label skew, where each client observes a single service, is the dominant factor limiting federated traffic classification performance. Figure 3 supports this observation: ROC curves for C0, C1, and C3 largely overlap near the upper-left corner, whereas C2 configurations show visibly weaker discrimination. Figure 4 compares accuracy across the centralized baseline and all federated configurations as a bar chart.

6.3. Impact of Client Data Distribution

Table 6 quantifies the effect of client data heterogeneity by averaging performance across aggregation strategies within each federated case. The results show that client partitioning has a stronger impact than the choice of server-side optimizer. Mixed-label clients (C1) and hash-based semi-IID clients (C3) remain close to the centralized baseline, which indicates that federated optimization is stable when each client observes a reasonable mixture of services and the global objective is represented locally. The service-based split (C2) represents the most challenging setting because each client contains a single service class. This extreme label-skew scenario reduces the ability of local training to produce updates that generalize across classes, and it increases client drift during aggregation. As a result, C2 yields the lowest average performance among all cases (accuracy 0.8661 and weighted F1-score 0.8388 in Table 6). Despite this degradation, the achieved C2 performance remains meaningful for a realistic distributed deployment, where sites may naturally observe narrow service subsets. Importantly, the C2 results should be interpreted relative to a naive distributed alternative in which each client trains a model only on its local single-class data. In such a setting, a client-specific model cannot learn a global multiclass decision boundary and will not generalize to unseen services, making pooled multiclass evaluation effectively infeasible. In contrast, federated learning in C2 still produces a single global classifier that recognizes all five services and reaches strong performance when evaluated on the pooled test set. This confirms that server-side aggregation enables knowledge transfer across disjoint clients and converts isolated single-service observations into a usable global traffic classifier. 
The remaining performance gap compared to C1 and C3 therefore reflects the inherent difficulty of learning under extreme label skew rather than a failure of the federated approach.

6.4. Impact of Aggregation Strategy

Table 7 isolates the impact of server-side aggregation under fixed local training and evaluation protocols. Under C1 and C3, aggregation has limited influence. All strategies achieve high accuracy and near-perfect macro-AUC, which indicates that when client data approximate IID or semi-IID conditions, standard averaging is sufficient and adaptive server-side optimizers offer only marginal gains. Under C2, aggregation choice becomes critical. FedAvg performs the worst (accuracy 0.7860, weighted F1-score 0.7373), reflecting severe client drift when local updates are dominated by single-class gradients. Adaptive or momentum-based strategies improve performance substantially. FedAdam increases accuracy to 0.8450, while FedAvgM reaches 0.9047. FedYogi attains the highest accuracy (0.9287) and the best weighted F1-score (0.9283), demonstrating that server-side optimization can partially mitigate non-IID effects under extreme label skew.
However, macro-AUC trends reveal an important inconsistency between threshold-dependent and threshold-independent metrics. In C2, FedAvgM achieves the highest macro-AUC (0.9885), while FedYogi attains the highest accuracy (0.9287) and weighted F1-score (0.9283) despite its lower macro-AUC (0.9586). This divergence arises because accuracy and weighted F1-score measure classification performance at a fixed decision threshold, specifically the argmax of the predicted softmax probabilities, whereas macro-AUC evaluates probability score ranking quality across all possible thresholds with equal weight given to each class regardless of its sample count. This equal weighting makes macro-AUC particularly informative in the presence of class imbalance, as it reflects classification reliability for minority classes such as Google Music that would otherwise be overshadowed in micro-averaged metrics. Notably, FedAdam shows the largest gap between micro-AUC and macro-AUC in C2 (0.9776 vs. 0.9461), indicating that its adaptive scaling disproportionately benefits frequent classes while leaving minority class discrimination weaker. FedYogi’s Yogi-style second moment update controls the growth of the second moment estimate, which under extreme label skew sharpens the decision boundary at the default operating point and produces higher accuracy, but simultaneously compresses the spread of predicted probability scores across classes, reducing ranking quality and lowering macro-AUC. FedAvgM, by contrast, applies server-side momentum that smooths conflicting client updates over successive rounds, producing better-calibrated probability estimates across thresholds and therefore higher macro-AUC, but without the same degree of decision boundary sharpening at the default threshold.
Figure 5 reflects this difference: FedAvgM produces a stronger curve shape across the full threshold range for all classes equally, while FedYogi yields stronger point performance at the default operating point. The practical implication is that the preferred aggregator in C2-like environments depends on the deployment requirement: if a fixed threshold decision suffices, FedYogi is preferable; if the operating threshold will be tuned or well-calibrated probability scores are needed across all service classes including minority ones, FedAvgM is the stronger choice.
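The general point that accuracy at a fixed operating point and AUC can rank two models differently can be seen in a toy binary example. The sketch below is a simplification of the multiclass argmax setting used in this study: the labels, the two score vectors, and the helper functions are hypothetical and do not reproduce any reported result.

```python
from typing import List


def pairwise_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(labels: List[int], scores: List[float],
                threshold: float = 0.5) -> float:
    """Accuracy when scores at or above the threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


labels = [1, 1, 1, 0, 0, 0]
# Model A: confident scores, but one positive is ranked below every negative.
scores_a = [0.9, 0.8, 0.1, 0.4, 0.3, 0.2]
# Model B: perfect ranking, but every score sits below the 0.5 threshold.
scores_b = [0.45, 0.4, 0.35, 0.3, 0.2, 0.1]

acc_a, auc_a = accuracy_at(labels, scores_a), pairwise_auc(labels, scores_a)
acc_b, auc_b = accuracy_at(labels, scores_b), pairwise_auc(labels, scores_b)
```

Model A wins on accuracy (5/6 against 1/2) while model B wins on AUC (1.0 against 2/3). Lowering the threshold for model B to 0.35 would recover perfect accuracy, which is why a ranking-oriented metric matters when the operating point can be tuned.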

6.5. ROC Curve Analysis

ROC curves provide a threshold-independent view of service separability and allow direct comparison between aggregation strategies under the same client distribution. Figure 3 presents a global overview across the centralized baseline (C0) and all federated configurations. To isolate the role of aggregation within each distribution regime, we further report case-specific ROC overlays for C1, C2, and C3 in Figure 5 and Figure 6. Under C1 (mixed-label) and C3 (hash-based semi-IID), ROC curves for all strategies remain close to the upper-left corner, which indicates that the classifier preserves strong separability despite distributed training. In these settings, differences between FedAvg, FedAdam, and FedAvgM are marginal, and the macro-AUC values remain close to 1.0, confirming that near-IID conditions produce reliable classification across all service classes including minority ones. This behavior is consistent with the near-centralized accuracy observed in Table 7. Under C2 (service-based), the ROC curves separate clearly across strategies. FedAvg yields the weakest curve (macro-AUC 0.9489), reflecting sensitivity to extreme label skew and poor ranking quality for minority classes. FedAdam shows a notable gap between its macro-AUC (0.9461) and its micro-averaged AUC (0.9776), indicating that its adaptive scaling disproportionately benefits frequent classes while leaving minority class discrimination weaker. FedAvgM attains the highest macro-AUC in C2 (0.9885), producing the strongest and most consistent curve shape across all service classes. FedYogi exhibits a different trade-off: it achieves the highest accuracy in C2 (0.9287) while producing a lower macro-AUC (0.9586) than FedAvgM. This suggests that the server optimizer can change both ranking behavior and the decision boundary under severe heterogeneity, and that threshold-dependent and threshold-independent metrics can favour different aggregators when class imbalance and label skew are both present.

6.6. Confusion Matrix Analysis

Confusion matrices provide a class-level view of model errors and reveal systematic confusions that are not visible in aggregate metrics. We report confusion matrices computed on the pooled test set for the centralized baseline (C0) and for every federated configuration across C1, C2, and C3. This presentation supports a complete comparison of how client heterogeneity and server-side aggregation interact to shape misclassification patterns.
Figure 7 shows the centralized baseline. The matrix is almost perfectly diagonal, indicating that the selected flow-level statistics produce strong separability among the five services. The residual errors are rare and concentrated in very small off-diagonal entries.
Figure 8, Figure 9 and Figure 10 summarize the federated results by client case. Under C1 (mixed-label), all aggregation strategies preserve a largely diagonal structure, consistent with the near-centralized performance reported in Table 7. Small differences appear mainly in minority classes, but no dominant confusion pattern emerges. Under C3 (hash-based semi-IID), the confusion matrices remain close to diagonal for FedAvg, FedAdam, and FedAvgM, while FedYogi shows slightly larger off-diagonal mass, consistent with its reduced accuracy and AUC relative to the other strategies.
Under C2 (service-based), confusion patterns become more pronounced and strategy-dependent. FedAvg exhibits the strongest degradation, with increased off-diagonal mass indicating instability under extreme label skew. Adaptive and momentum-based strategies reduce the dominant confusions to varying degrees. FedAvgM and FedAdam improve separability relative to FedAvg, and FedYogi yields the best point classification performance in this setting, although some residual confusions remain. These results support the key finding of this work: server-side aggregation has limited impact under near-IID and semi-IID partitions, but it becomes a decisive factor under highly heterogeneous, label-skewed client data.

7. Discussion

7.1. Key Observations

The results show that the client data distribution is the dominant factor that determines federated traffic classification performance. Under C1 (mixed-label) and C3 (hash-based semi-IID), all aggregation strategies achieve accuracy and macro-AUC values that remain close to the centralized baseline (Table 5 and Table 7). This indicates that when each client observes multiple services, local training provides updates aligned with the global objective, and the server can aggregate them reliably with limited sensitivity to the chosen optimizer.
C2 (service-based) creates an extreme label-skew regime, where each client is restricted to a single service. This setting causes the largest degradation across all metrics (Table 6), and it produces the most visible separation among aggregation strategies in ROC curves and confusion matrices (Figure 3, Figure 5, and Figure 9). The key implication is that the value of advanced server-side aggregation becomes most apparent only when the heterogeneity is severe enough to induce client drift.
Despite the difficulty of C2, the achieved results remain meaningful in a practical sense. A naive alternative where each site trains a local model only on its single-service data cannot yield a usable global multiclass classifier. In contrast, the federated approach still produces a single model that recognizes all five services and reaches strong pooled-test performance. This confirms that server-side aggregation enables transfer of discriminative structure across disjoint clients, even when local data are not representative of the global label space.
The trade-off between privacy and efficiency in federated traffic classification is acceptable in several practical situations. When regulatory constraints such as data sovereignty or cross-domain ownership rules prohibit centralised collection, federated learning is the only viable path to collaborative model training, making the efficiency trade-off a necessity rather than a choice. In multi-organisational environments where data sharing is contractually restricted, federated learning enables collaboration that would otherwise be infeasible. When client data heterogeneity is moderate, as in C1 and C3 in this study, the accuracy cost of federation is negligible, making the privacy-efficiency trade-off particularly favourable. Even under severe heterogeneity as in C2, adaptive aggregation strategies recover substantial performance, suggesting that the trade-off remains acceptable for deployments where high-frequency class detection is the primary requirement. These considerations suggest that the acceptability of the trade-off should be evaluated relative to the regulatory environment, the organisational structure, and the specific performance requirements of the deployment scenario rather than as an absolute judgment.

7.2. Interpreting the Aggregation Methods

FedAvg provides a strong baseline under C1 and C3, but it degrades substantially under C2 (Table 7). This behavior is consistent with client drift in label-skewed settings, where local updates move the model toward single-class optima that conflict across clients. Under such conditions, simple averaging fails to correct for the bias in local gradients, and the global model converges to a compromised solution.
FedAdam improves upon FedAvg in C2 by using adaptive scaling of aggregated updates. This reduces sensitivity to heterogeneous update magnitudes and stabilizes convergence. The gain is reflected in higher accuracy and weighted F1-score compared to FedAvg under C2, while differences remain small in C1 and C3 where heterogeneity is moderate.
FedAvgM introduces server-side momentum, which can smooth oscillations caused by conflicting client updates. The results show that FedAvgM achieves the strongest macro-AUC in C2 (Figure 5 and Table 7). This suggests improved ranking behavior across thresholds, which is desirable when operating points may vary or when decision thresholds are tuned for deployment.
FedYogi achieves the best point classification performance in C2, with the highest accuracy and weighted F1-score among all C2 strategies (Table 5 and Table 7). At the same time, its macro-AUC is lower than FedAvgM in C2, revealing a meaningful inconsistency between threshold-dependent and threshold-independent metrics. This divergence has a mechanistic explanation rooted in how each aggregator’s server-side update rule affects the predicted probability distribution. Accuracy and weighted F1-score are evaluated at a fixed decision threshold, specifically the argmax of the softmax output, and therefore reflect decision boundary sharpness at a single operating point. Macro-AUC, by contrast, measures the quality of probability score ranking across all possible thresholds and is therefore sensitive to how well-spread and calibrated the predicted scores are across classes. FedYogi’s Yogi-style second moment update controls the growth of the second moment estimate, which under extreme label skew concentrates probability mass near the predicted class and sharpens the decision boundary, boosting accuracy and F1-score at the default threshold. However, this concentration compresses the score spread across classes, degrading ranking quality and lowering AUC. FedAvgM’s server-side momentum smooths the conflicting gradient directions produced by single-class clients over successive rounds, preserving broader score separation across classes and producing better-calibrated probability estimates across thresholds, which explains its higher AUC. This trade-off is visible in both the ROC curves and the confusion matrix structure in Figure 9, and it carries a practical implication: in C2-like deployments where the decision threshold is fixed, FedYogi is preferable; where the threshold will be tuned or probability calibration matters, FedAvgM is the stronger choice.
Overall, the results support a deployment-oriented interpretation. If the expected environment is closer to C1 or C3, FedAvg is likely sufficient and provides competitive performance with minimal complexity. If the environment resembles C2 with strong label skew, then the choice of aggregator becomes a primary design decision. In that regime, adaptive or momentum-based methods mitigate client drift and improve the feasibility of federated traffic classification.

8. Conclusions and Future Work

This paper studied the impact of server-side aggregation on federated QUIC traffic classification under heterogeneous client data distributions. Using a consistent model architecture, training protocol, and pooled-test evaluation, we compared four aggregation strategies across three data partitioning cases. The results show that client data distribution is the primary factor that determines performance. Under mixed-label (C1) and hash-based semi-IID (C3) clients, federated learning matches or closely approaches the centralized baseline, and the choice of aggregation strategy has a limited effect. Under the service-based split (C2), which represents extreme label skew, performance drops for all methods, but adaptive and momentum-based aggregation substantially improve results compared to FedAvg. These findings confirm that advanced server-side optimizers become most valuable when heterogeneity induces strong client drift.
Several limitations of the current study should be noted. The experiments are conducted on a single QUIC traffic dataset with five Google service classes captured in a controlled environment, and evaluation under additional protocols, larger class sets, and different capture environments would help establish how broadly the conclusions transfer. The federated simulation assumes synchronous rounds and full client participation, which may not reflect client availability constraints in operational networks. The partitioning cases C1 and C3 produce statistically similar label distributions; adopting a Dirichlet-based partition in future work would provide a cleaner three-way heterogeneity spectrum. The analysis is based on flow-level statistical features and a fixed MLP architecture; more expressive sequence-based representations or self-supervised embeddings may change the relative advantage of aggregation strategies. Finally, evaluation is conducted on a pooled test set derived from the same dataset source, and cross-dataset testing and domain shift scenarios remain necessary to assess generalization across networks and time.
Future work should extend the study in directions that reflect operational deployments. First, evaluation should include additional heterogeneity sources such as unbalanced client sizes, partial participation, stragglers, and asynchronous training. Second, robustness should be tested under domain shift by training on one capture setting and evaluating on different networks, devices, or time periods. Third, the dataset scope should be broadened to include additional protocols and larger-scale traffic collections to assess the generalizability of the findings across diverse network environments. Fourth, model and feature design can be expanded beyond aggregated statistics to include packet-length and timing sequences, as well as self-supervised representations tailored to encrypted traffic. Fifth, a Dirichlet-based partitioning scheme should be adopted to produce a genuinely distinguishable semi-IID heterogeneity regime. Sixth, personalization and clustered federated learning can be explored to handle persistent client-specific distributions in C2-like environments. Finally, integrating privacy mechanisms such as secure aggregation or differential privacy would enable a more complete assessment of the accuracy–privacy trade-off for practical federated traffic classification.

Author Contributions

Conceptualization, S.A.H. and S.R.R.; methodology, S.A.H. and S.R.R.; validation, S.A.H. and S.R.R.; formal analysis, S.A.H. and S.R.R.; investigation, S.A.H. and S.R.R.; resources, S.A.H.; data curation, S.A.H. and S.R.R.; writing-original draft preparation, S.A.H.; writing-review and editing, S.A.H. and S.R.R.; visualization, S.A.H.; supervision, S.R.R.; project administration, S.A.H. and S.R.R.; funding acquisition, S.A.H. and S.R.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. The study did not involve humans or animals.

Informed Consent Statement

Not applicable. The study did not involve humans.

Data Availability Statement

The data presented in this study are openly available on Kaggle at https://www.kaggle.com/datasets/guillaumefraysse/ucdavisquic, accessed on 28 December 2025.

Acknowledgments

The authors would like to acknowledge Széchenyi István University of Győr for institutional support during the research period and for the Stipendium Scholarship Program.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Razooqi, Y.S.; Pekar, A. VPN traffic analysis: A survey on detection and application identification. IEEE Access 2025, 13, 132830–132848.
  2. Pekar, A.; Makara, L.A.; Biczok, G. Incremental federated learning for traffic flow classification in heterogeneous data scenarios. Neural Comput. Appl. 2024, 36, 20401–20424.
  3. Razooqi, Y.S.; Pekar, A. Binary VPN Traffic Detection Using Wavelet Features and Machine Learning. In Proceedings of the 2025 International Conference on Software, Telecommunications and Computer Networks (SoftCOM); IEEE: Split, Croatia, 2025; pp. 1–6.
  4. Sonal; Deswal, S. Enhancing IoT gateway management security: A deep learning based framework for intrusion detection and threat mitigation. Int. J. Inf. Technol. 2025, 17, 3695–3705.
  5. Mahlool, D.H.; Alsalihi, M.H. A Comprehensive Survey on Federated Learning: Concept and Applications. In Mobile Computing and Sustainable Informatics; Lecture Notes on Data Engineering and Communications Technologies; Springer: Singapore, 2022; pp. 539–553.
  6. Srivastava, V.; K, S.R.; Lamba, V.; Mathada, V.S.; Bulla, C.; Gupta, N.; Veeramanikandan, P. An IoT-based framework employing fuzzy logic and federated learning for decentralized decision-making. Int. J. Inf. Technol. 2025, 17, 4943–4949.
  7. Sharma, P.; Sharma, S.K.; Dani, D. Edge-assisted federated learning for anomaly detection in diverse IoT network. Int. J. Inf. Technol. 2025, 17, 3035–3045.
  8. Verma, R.; Bhatt, R. SCA-FLOD: A federated multi-objective learning framework for big data-driven threat detection and adaptive security policy enforcement in cloud infrastructure. Int. J. Inf. Technol. 2026, 18, 165–172.
  9. Mahlool, D.H.; Abed, M.H. Optimize weight sharing for aggregation model in federated learning environment of brain tumor classification. J. Al-Qadisiyah Comput. Sci. Math. 2022, 14, 76–87.
  10. Husseina, S.A.; Repas, S.R. A federated learning approach to network event classification. In Proceedings of the International Conference on Formal Methods and Foundations of Artificial Intelligence (FMF-AI), Eger, Hungary, 5–7 June 2025.
  11. Kishanthan, S.; Hevapathige, A. Deep learning meets oversampling: A learning framework to handle imbalanced classification. Int. J. Inf. Technol. 2025, 17, 4491–4503.
  12. Ge, M.; Feng, R.; Liu, L.; Yu, X.; Vinay, S.; Xie, X.; Liu, Y. Enmob: Unveil the Behavior with Multi-flow Analysis of Encrypted App Traffic. Cybersecurity 2025, 8, 26.
  13. Liu, Z.; Wei, Q.; Song, Q.; Duan, C. Fine-Grained Encrypted Traffic Classification Using Dual Embedding and Graph Neural Networks. Electronics 2025, 14, 778.
  14. Li, H.; Tao, J.; Yu, L.; Luo, Y.; Wang, Z. GSPB: A global-statistic and packet-byte fusion framework for encrypted traffic classification. Cybersecurity 2025, 8, 120.
  15. Sharma, A.; Lashkari, A.H. Hybrid attention-enhanced explainable model for encrypted traffic detection and classification. Int. J. Inf. Secur. 2025, 24, 144.
  16. Xu, K.; Cheng, G. F3L: An automated and secure function-level low-overhead labeled encrypted traffic dataset construction method for IM in Android. Cybersecurity 2024, 7, 1.
  17. Burgetová, I.; Matoušek, P.; Ryšavý, O. Towards identification of network applications in encrypted traffic. Ann. Telecommun. 2025, 80, 1015–1032.
  18. Li, Z.; Wang, J.; Song, Y.F.; Yue, S.H. Unlocking Few-Shot Encrypted Traffic Classification: A Contrastive-Driven Meta-Learning Approach. Electronics 2025, 14, 4245.
  19. Zhang, C.; Li, Q.; Zhang, P.; Chen, G.C. A TDMA Protocol Based on Data Priority for In-Vivo Wireless NanoSensor Networks. In Proceedings of the IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; pp. 1–6.
  20. Buyuktanir, B.; Altinkaya, S.; Baydogmus, G.K.; Yildiz, K. Federated learning in intrusion detection: Advancements, applications, and future directions. Clust. Comput. 2025, 28, 473.
  21. Bakopoulou, E.; Tillman, B.; Markopoulou, A. FedPacket: A federated learning approach to mobile packet classification. IEEE Trans. Mob. Comput. 2021, 21, 3609–3628.
  22. Li, J.; Ma, Y.; Bai, J.; Chen, C.; Xu, T.; Ding, C. A Lightweight Intrusion Detection System with Dynamic Feature Fusion Federated Learning for Vehicular Network Security. Sensors 2025, 25, 4622.
  23. Naz, A.; Ullah, I.; Uzair, M.; Khokhar, M.F.; Sabir, A.; Khan, R.U. AFL-SecNet: An adaptive federated learning framework for secure and privacy-preserving network traffic analysis. Peer-Peer Netw. Appl. 2026, 29, 25.
  24. Peng, H.; Wu, C.; Xiao, Y. FD-IDS: Federated Learning with Knowledge Distillation for Intrusion Detection in Non-IID IoT Environments. Sensors 2025, 25, 4309.
  25. Pecherle, G.D.; Györödi, R.Ș.; Györödi, C.A. Federated Learning-Based Intrusion Detection in Industrial IoT Networks. Future Internet 2026, 18, 2.
  26. Devine, M.; Ardakani, S.P.; Al-Khafajiy, M.; James, Y. Federated Machine Learning to Enable Intrusion Detection Systems in IoT Networks. Electronics 2025, 14, 1176.
  27. Xue, H.; Hu, Y.; Wang, Y. Federated Distributed Network Traffic Classification Based on Deep Mutual Learning. Electronics 2025, 14, 4928.
  28. Tricomi, G.; Cicceri, G.; Ficili, I.; Vitabile, S.; Merlino, G.; Puliafito, A. HERALD: A Hybrid distributEd leaRning incrementAL & feDerated solution for knowledge distillation in COVID-19 classification. Future Gener. Comput. Syst. 2025, 174, 107991.
  29. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  30. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on Non-IID Data. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020.
  31. Reddi, S.J.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive Federated Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021.
  32. Qi, P.; Chiaro, D.; Guzzo, A.; Ianni, M.; Fortino, G.; Piccialli, F. Model aggregation techniques in federated learning: A comprehensive survey. Future Gener. Comput. Syst. 2024, 150, 272–293.
  33. Abdullah, Y.; Alshawki, M.B.; Ligeti, P.; Soussi, W.; Stiller, B. Byzantine-Resilient Federated Learning: Evaluating MPC Approaches. In Proceedings of the IEEE International Conference on Distributed Computing Systems Workshops (ICDCSW), Glasgow, UK, 21–23 July 2025.
  34. Shenoy, D.; Bhat, R.; Prakasha, K. Exploring privacy mechanisms and metrics in federated learning. Artif. Intell. Rev. 2025, 58, 223.
  35. Yao, W.; Zhou, T.; Han, Y.; Wang, X. Verifiable secure aggregation scheme for privacy protection in federated learning networks. Discov. Comput. 2025, 28, 175.
  36. Rezaei, S.; Liu, X. How to Achieve High Classification Accuracy with Just a Few Labels: A Semi-supervised Approach Using Sampled Packets. arXiv 2018, arXiv:1812.09761.
  37. Tong, V.; Tran, H.A.; Souihi, S.; Mellouk, A. A novel QUIC traffic classifier based on convolutional neural networks. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM); IEEE: New York, NY, USA, 2018; pp. 1–6.
  38. Flower FedAvg Strategy API Reference. Available online: https://flower.ai/docs/framework/ref-api/flwr.serverapp.strategy.FedAvg.html (accessed on 27 February 2026).
  39. Flower FedAvgM Strategy Source Code. Available online: https://flower.ai/docs/framework/_modules/flwr/serverapp/strategy/fedavgm.html (accessed on 27 February 2026).
  40. Flower FedAdam Strategy Source Code. Available online: https://flower.ai/docs/framework/_modules/flwr/serverapp/strategy/fedadam.html (accessed on 27 February 2026).
  41. Flower FedYogi Strategy Source Code. Available online: https://flower.ai/docs/framework/_modules/flwr/serverapp/strategy/fedyogi.html (accessed on 27 February 2026).
Figure 1. Architecture of the feedforward neural network for service classification.
Figure 2. Illustration of the federated learning framework. The global server initializes and distributes the model to all participating clients (solid arrows). Each client trains independently on its local data within an isolated local environment.
Figure 3. Micro-averaged ROC curves for the centralized baseline (C0) and all federated configurations across client partitioning cases and server-side aggregation strategies.
Figure 4. Accuracy comparison across the centralized baseline and all federated configurations grouped by client partitioning case (C1, C2, C3).
Figure 5. Micro-averaged ROC curves under different server-side aggregation strategies for C2 (service-based, extreme non-IID).
Figure 6. Micro-averaged ROC curves under different server-side aggregation strategies for C1 and C3. (a) C1 mixed-label. (b) C3 hash-based semi-IID.
Figure 7. Confusion matrix for the centralized baseline (C0) evaluated on the pooled test set.
Figure 8. Confusion matrices for C1 (mixed-label clients) under different aggregation strategies. (a) FedAvg. (b) FedAdam. (c) FedAvgM. (d) FedYogi.
Figure 9. Confusion matrices for C2 (service-based, extreme non-IID) under different aggregation strategies. (a) FedAvg. (b) FedAdam. (c) FedAvgM. (d) FedYogi.
Figure 10. Confusion matrices for C3 (hash-based semi-IID clients) under different aggregation strategies. (a) FedAvg. (b) FedAdam. (c) FedAvgM. (d) FedYogi.
Table 1. Google services and number of flows in the QUIC dataset.
| Google Service | Number of Flows |
|---|---|
| Google Drive | 1634 |
| YouTube | 1077 |
| Google Docs | 1221 |
| Google Search | 1915 |
| Google Music | 592 |
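The class balance implied by Table 1 can be checked with a few lines of arithmetic. The sketch below only uses the flow counts from the table; the variable names are illustrative. It is included because the imbalance between the largest and smallest class is one reason the macro-averaged and weighted metrics reported later can diverge.

```python
# Flow counts copied from Table 1.
flows = {
    "Google Drive": 1634,
    "YouTube": 1077,
    "Google Docs": 1221,
    "Google Search": 1915,
    "Google Music": 592,
}

total = sum(flows.values())
# Share of each service in the full dataset.
shares = {name: count / total for name, count in flows.items()}
# Ratio between the most and least frequent class.
imbalance_ratio = max(flows.values()) / min(flows.values())

for name, share in shares.items():
    print(f"{name:<14} {flows[name]:>5}  {share:6.1%}")
print(f"Total flows: {total}, largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Google Music is the smallest class, so errors on it weigh more heavily on macro F1 than on weighted F1.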
Table 2. Training hyperparameters shared across all experimental configurations.
| Parameter | Value |
|---|---|
| Random seed | 42 |
| Train/test split ratio | 80/20 |
| Centralized training epochs | 30 |
| Centralized batch size | 32 |
| Local optimizer (all clients) | Adam |
| Local learning rate | $10^{-3}$ |
| Loss function | Sparse categorical cross-entropy |
| FL communication rounds | 30 |
| Local epochs per round | 1 |
| Local batch size | 32 |
| Clients per round (fraction_fit) | 1.0 (full participation) |
| Number of clients (C1, C2, C3) | 5 |
| Feature standardization | Global StandardScaler fitted on union of all clients' training data |
Table 3. Computational and communication complexity of the four aggregation strategies per round. P = 9733 model parameters; K = 5 clients; server extra ops refers to element-wise vector operations beyond the weighted average.
| Strategy | Local Training | Server Aggregation | Server Extra Ops | Communication |
|---|---|---|---|---|
| FedAvg | $O(K \cdot N \cdot P / B)$ | $O(K \cdot P)$ = 48,665 ops | - | $2KP$ = 97,330 params |
| FedAvgM | $O(K \cdot N \cdot P / B)$ | $O(K \cdot P)$ = 48,665 ops | $O(P)$ = 9733 ops | $2KP$ = 97,330 params |
| FedAdam | $O(K \cdot N \cdot P / B)$ | $O(K \cdot P)$ = 48,665 ops | $O(3P)$ = 29,199 ops | $2KP$ = 97,330 params |
| FedYogi | $O(K \cdot N \cdot P / B)$ | $O(K \cdot P)$ = 48,665 ops | $O(3P)$ = 29,199 ops | $2KP$ = 97,330 params |
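The per-round counts in Table 3 follow directly from P and K, as the short sketch below shows. Two things in it are assumptions rather than statements from the table: the 8-128-64-5 dense layout is a hypothetical architecture, chosen only because it is consistent with P = 9733 for eight input features and five classes, and the byte estimate assumes 32-bit floats. The FedAvg extra-ops value of 0 is likewise inferred from the caption's definition.

```python
P = 9733  # model parameters (Table 3 caption)
K = 5     # clients (Table 3 caption)


def dense_params(layer_sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical layout: 8 features -> 128 -> 64 -> 5 classes.
assumed_layers = [8, 128, 64, 5]

# Weighted average over K client models of P parameters each.
aggregation_ops = K * P

# Element-wise vector operations beyond the weighted average.
extra_ops = {
    "FedAvg": 0,       # assumed: nothing beyond the weighted average
    "FedAvgM": P,      # one momentum vector
    "FedAdam": 3 * P,  # first moment, second moment, scaled update
    "FedYogi": 3 * P,
}

# Each round sends P parameters down to and up from every client.
communication_params = 2 * K * P
communication_bytes = communication_params * 4  # assuming float32

print(dense_params(assumed_layers), aggregation_ops, communication_params)
print(f"about {communication_bytes / 1024:.0f} KiB per round under float32")
```

Under these assumptions the communication cost is identical for all four strategies, so the strategies differ only in the server-side bookkeeping.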
Table 4. Server-side hyperparameters for each aggregation strategy as configured in Flower 1.6.0. Dashes indicate parameters not applicable to that strategy.
| Strategy | Server $\eta$ | $\eta_l$ | $\beta_1$ | $\beta_2$ | Momentum $\mu$ | $\tau$ |
|---|---|---|---|---|---|---|
| FedAvg | - | - | - | - | - | - |
| FedAvgM | 1.0 | - | - | - | 0.9 | - |
| FedAdam | 0.1 | 0.1 | 0.9 | 0.99 | - | $10^{-9}$ |
| FedYogi | 1.0 | - | 0.9 | 0.99 | - | $10^{-9}$ |
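To make the role of the hyperparameters in Table 4 concrete, the following is a minimal pure-Python sketch of the three non-trivial server updates, written from the standard formulations of server momentum and of the adaptive rules in Adaptive Federated Optimization [31]. It is not Flower's code, and the function names, the toy weights, and the zero-initialised state are illustrative assumptions.

```python
import math


def mean_delta(global_w, client_ws, client_sizes):
    """Example-weighted average of client updates (client minus global)."""
    total = sum(client_sizes)
    return [
        sum(n * (w[j] - global_w[j]) for w, n in zip(client_ws, client_sizes)) / total
        for j in range(len(global_w))
    ]


def sign(x):
    return (x > 0) - (x < 0)


def fedavgm_step(global_w, delta, state, eta=1.0, mu=0.9):
    """Server momentum: accumulate the averaged update, then apply it."""
    state["m"] = [mu * m + d for m, d in zip(state["m"], delta)]
    return [w + eta * m for w, m in zip(global_w, state["m"])]


def adaptive_step(global_w, delta, state, rule, eta,
                  beta_1=0.9, beta_2=0.99, tau=1e-9):
    """FedAdam / FedYogi: per-coordinate step scaled by a second-moment estimate."""
    state["m"] = [beta_1 * m + (1 - beta_1) * d for m, d in zip(state["m"], delta)]
    if rule == "adam":
        state["v"] = [beta_2 * v + (1 - beta_2) * d * d
                      for v, d in zip(state["v"], delta)]
    elif rule == "yogi":
        state["v"] = [v - (1 - beta_2) * d * d * sign(v - d * d)
                      for v, d in zip(state["v"], delta)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    return [w + eta * m / (math.sqrt(v) + tau)
            for w, m, v in zip(global_w, state["m"], state["v"])]


# Toy round: two equally sized clients, two parameters (values are made up).
global_w = [0.0, 0.0]
delta = mean_delta(global_w, [[1.0, -2.0], [3.0, 2.0]], [1, 1])

w_avgm = fedavgm_step(global_w, delta, {"m": [0.0, 0.0]})
w_adam = adaptive_step(global_w, delta, {"m": [0.0, 0.0], "v": [0.0, 0.0]}, "adam", eta=0.1)
w_yogi = adaptive_step(global_w, delta, {"m": [0.0, 0.0], "v": [0.0, 0.0]}, "yogi", eta=0.1)
print(delta, w_avgm, w_adam, w_yogi)
```

From a zero state the first FedAvgM step coincides with plain FedAvg, and the first FedAdam and FedYogi steps coincide with each other; the rules diverge only in later rounds, where Yogi changes its second-moment estimate additively rather than by exponential averaging.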
Table 5. Centralized baseline versus best federated model per client distribution.
| Setting | Client Case | Best Strategy | Acc. | Prec. (weighted) | Rec. (weighted) | F1 (weighted) | F1 (macro) | AUC (macro) |
|---|---|---|---|---|---|---|---|---|
| Centralized | C0 | Centralized | 0.9969 | 0.9970 | 0.9969 | 0.9969 | 0.9954 | 1.0000 |
| Federated | C1 Mixed-label | FedAdam | 0.9969 | 0.9969 | 0.9969 | 0.9969 | 0.9951 | 0.9999 |
| Federated | C2 Service-based | FedYogi | 0.9287 | 0.9394 | 0.9287 | 0.9283 | 0.9186 | 0.9586 |
| Federated | C3 Hash semi-IID | FedAdam | 0.9946 | 0.9949 | 0.9946 | 0.9946 | 0.9921 | 0.9994 |
Table 6. Average performance across aggregation strategies for each client distribution. Values are averaged across FedAvg, FedAdam, FedAvgM, and FedYogi for each client case.
| Client Case | Acc. | Prec. (weighted) | Rec. (weighted) | F1 (weighted) | F1 (macro) | AUC (macro) |
|---|---|---|---|---|---|---|
| C1 Mixed-label | 0.9919 | 0.9922 | 0.9919 | 0.9918 | 0.9880 | 0.9980 |
| C2 Service-based | 0.8661 | 0.8494 | 0.8661 | 0.8388 | 0.7574 | 0.9605 |
| C3 Hash semi-IID | 0.9886 | 0.9887 | 0.9886 | 0.9886 | 0.9836 | 0.9944 |
Table 7. Impact of aggregation strategy under each client distribution.
| Client Case | Strategy | Acc. | Prec. (weighted) | Rec. (weighted) | F1 (weighted) | F1 (macro) | AUC (macro) |
|---|---|---|---|---|---|---|---|
| C1 Mixed-label | FedAvg | 0.9930 | 0.9930 | 0.9930 | 0.9930 | 0.9902 | 0.9999 |
| C1 Mixed-label | FedAdam | 0.9969 | 0.9969 | 0.9969 | 0.9969 | 0.9951 | 0.9999 |
| C1 Mixed-label | FedAvgM | 0.9915 | 0.9915 | 0.9915 | 0.9915 | 0.9871 | 0.9999 |
| C1 Mixed-label | FedYogi | 0.9860 | 0.9862 | 0.9860 | 0.9859 | 0.9795 | 0.9923 |
| C2 Service-based | FedAvg | 0.7860 | 0.7912 | 0.7860 | 0.7373 | 0.6271 | 0.9489 |
| C2 Service-based | FedAdam | 0.8450 | 0.8310 | 0.8450 | 0.8245 | 0.7185 | 0.9461 |
| C2 Service-based | FedAvgM | 0.9047 | 0.8360 | 0.9047 | 0.8653 | 0.7653 | 0.9885 |
| C2 Service-based | FedYogi | 0.9287 | 0.9394 | 0.9287 | 0.9283 | 0.9186 | 0.9586 |
| C3 Hash semi-IID | FedAvg | 0.9938 | 0.9940 | 0.9938 | 0.9938 | 0.9910 | 0.9995 |
| C3 Hash semi-IID | FedAdam | 0.9946 | 0.9949 | 0.9946 | 0.9946 | 0.9921 | 0.9994 |
| C3 Hash semi-IID | FedAvgM | 0.9938 | 0.9940 | 0.9938 | 0.9938 | 0.9909 | 0.9995 |
| C3 Hash semi-IID | FedYogi | 0.9721 | 0.9720 | 0.9721 | 0.9720 | 0.9606 | 0.9793 |
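The per-case averages of Table 6 and the best strategies of Table 5 can be recomputed from the rows of Table 7. The sketch below does this with the standard library only; the dictionary layout and variable names are illustrative. It also reports the accuracy spread between the best and worst strategy in each case, which is one simple way to quantify how much the aggregation choice matters. Because the inputs are the rounded four-decimal values, results can differ from Table 6 in the last digit; the C1 weighted precision comes out at 0.9919 rather than the tabulated 0.9922.

```python
from statistics import fmean

metrics = ("acc", "prec_w", "rec_w", "f1_w", "f1_macro", "auc_macro")

# Values copied from Table 7.
table7 = {
    "C1": {
        "FedAvg":  (0.9930, 0.9930, 0.9930, 0.9930, 0.9902, 0.9999),
        "FedAdam": (0.9969, 0.9969, 0.9969, 0.9969, 0.9951, 0.9999),
        "FedAvgM": (0.9915, 0.9915, 0.9915, 0.9915, 0.9871, 0.9999),
        "FedYogi": (0.9860, 0.9862, 0.9860, 0.9859, 0.9795, 0.9923),
    },
    "C2": {
        "FedAvg":  (0.7860, 0.7912, 0.7860, 0.7373, 0.6271, 0.9489),
        "FedAdam": (0.8450, 0.8310, 0.8450, 0.8245, 0.7185, 0.9461),
        "FedAvgM": (0.9047, 0.8360, 0.9047, 0.8653, 0.7653, 0.9885),
        "FedYogi": (0.9287, 0.9394, 0.9287, 0.9283, 0.9186, 0.9586),
    },
    "C3": {
        "FedAvg":  (0.9938, 0.9940, 0.9938, 0.9938, 0.9910, 0.9995),
        "FedAdam": (0.9946, 0.9949, 0.9946, 0.9946, 0.9921, 0.9994),
        "FedAvgM": (0.9938, 0.9940, 0.9938, 0.9938, 0.9909, 0.9995),
        "FedYogi": (0.9721, 0.9720, 0.9721, 0.9720, 0.9606, 0.9793),
    },
}

averages, best, spread = {}, {}, {}
for case, rows in table7.items():
    # Column-wise mean over the four strategies.
    averages[case] = {m: fmean(col) for m, col in zip(metrics, zip(*rows.values()))}
    # Strategy with the highest accuracy, and the best-minus-worst accuracy gap.
    best[case] = max(rows, key=lambda s: rows[s][0])
    accs = [r[0] for r in rows.values()]
    spread[case] = max(accs) - min(accs)

for case in table7:
    print(case, best[case],
          f"mean acc={averages[case]['acc']:.4f}",
          f"acc spread={spread[case]:.4f}")
```

The accuracy spread is roughly 0.01 for C1, 0.02 for C3, and 0.14 for C2, in line with the observation that the aggregation strategy matters mainly under severe heterogeneity.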
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
