1. Introduction
5G has been commercially deployed for about six years, and research toward 6G commercialization around 2030 is actively underway. As related technologies have matured, a wide range of services built on 5G infrastructure has emerged, increasing the importance of Mobile Edge Computing (MEC). The network closest to the User Equipment (UE) is the Radio Access Network (RAN). Because MEC is deployed in the RAN and provides computing services there, it offers lower latency and better energy efficiency than communicating with remote service servers.
6G aims for even higher communication speeds and lower latency compared to 5G, and has added target metrics for new service areas such as intelligence, security, and ubiquitous connectivity. The 6G use case scenario services defined in the 6G framework recommendation published by the International Telecommunication Union Radiocommunication Sector (ITU-R) [1] are as follows:
Immersive communication: Services such as extended reality (XR), holographic communication, and mixed video/audio traffic.
Hyper Reliable and Low-Latency Communication: Use cases including smart industries, automated processes, energy services, and remote medical treatment.
Massive communication: Services supporting smart cities, transportation, logistics centers, healthcare, energy, and agriculture.
Integrated AI and communication: Applications such as autonomous driving, digital twins, medical assistance, and robotics.
Integrated sensing and communication: 6G-based navigation, motion and gesture detection, environmental monitoring, and provision of sensing information for AI, XR, and digital twin applications.
Ubiquitous connectivity: Ranging from Internet of Things (IoT) services to basic broadband access.
As shown above, it is expected that the introduction of MEC will become increasingly effective for services that need to meet these target requirements or that leverage artificial intelligence (AI).
In particular, research on deploying AI models within mobile networks for use in services or security has continued steadily to this day. Most studies on AI models for detecting attack traffic or anomalies have been conducted under the assumption that training takes place on a single node. In today’s advanced 5G and 6G environments, the importance of edge networks has increased, leading to the deployment of multiple MEC nodes. As a result, each MEC can become a target for attacks, and attack traffic can be distributed across multiple nodes. Therefore, distributed learning across multiple nodes within the network needs to be discussed [2,3,4,5,6].
As mobile networks evolve to connect an ever-growing number of devices, the amount of data generated also increases significantly. Since such data is produced in a distributed manner across different base stations, collecting and processing it entirely on a central server inevitably causes additional latency. As AI models become increasingly complex, distributed learning across multiple MEC nodes becomes essential for efficient training while maintaining low latency [7,8].
Our main novelty lies in the learning procedure: we design an intrusion detection AI model using a stacking ensemble structure that trains across multiple MECs while keeping node-specific traffic and avoiding heavy central aggregation. Specifically, to reduce the burden on Worker and Master nodes in a distributed learning environment, we partition the features of the original training data in a column-wise manner to compress the data, enabling the meta-classifier to learn patterns from diverse nodes. Each node trains an encoder on its own feature subset, and only compact latent features are sent to the master for meta-classifier training, avoiding repeated parameter averaging used in federated learning.
In this paper, we propose a distributed learning approach for multi-MEC environments. The proposed method considers minimizing network traffic overhead while enabling the sharing of training data within mobile networks. To evaluate the effectiveness of the proposed approach, we utilized the 5G-NIDD dataset, which contains normal and denial-of-service (DoS) traffic collected from two real base stations. The main contributions of this paper are as follows:
We propose a novel distributed learning framework for multi-MEC environments, where base models are trained locally at each MEC node and a meta-classifier is aggregated at the master MEC node. This design reflects realistic mobile network conditions and ensures scalability.
Through extensive experiments using the 5G-NIDD dataset, we demonstrate that the proposed distributed learning approach achieves performance comparable to centralized learning, with slight gains in Accuracy, Recall, and F1-score, while maintaining near-perfect AUC performance.
While centralized learning used three encoders as base models for meta-classifier training, the proposed distributed learning approach combined latent features from six encoders: three base encoders from each node (MEC1 and MEC2). Despite doubling the input dimension of the meta-classifier from 21 to 42, the proposed method achieved comparable or slightly improved performance metrics, absorbing data variance between next-generation Node Bs (gNBs) and yielding more stable integrated performance. This is a significant result because it overcomes a limitation of existing stacking-based IDS research, where increased input dimensions typically lead to higher model complexity and learning difficulty, achieving model scalability and performance enhancement simultaneously.
The remainder of this paper is organized as follows:
Section 2 introduces background concepts on autoencoders and stacking ensemble models;
Section 3 reviews related work in distributed learning and network security;
Section 4 presents the proposed multi-MEC-based stacking ensemble distributed learning framework;
Section 5 describes experiments across four stages (local single-node, centralized, multi-edge distributed, and logical N-node scalability analysis);
Section 6 analyzes experimental results; and
Section 7 concludes with contributions and future directions.
4. Proposed Method
For clarity, we provide a short autoencoder overview in Section 2. Here, we summarize its role in our proposed framework: each encoder compresses its assigned feature subset into a compact latent feature, and these latent features are used by the stacking meta-classifier for final classification.
This study proposes a distributed learning method across edge nodes using a stacking ensemble technique, with the goal of minimizing data transfer between nodes.
As illustrated in Figure 3, multiple gNBs are deployed across different regions, and MEC nodes provide access to specific 5G services. The proposed approach is applicable in such an environment. In this context, an attacker may target an MEC that delivers 5G edge services, attempting to launch attacks through any connected gNB.
From the administrator’s perspective, attacks can be detected either within the 5G core network or in the RAN. Detection in the RAN can be performed directly within the base stations or by deploying additional intrusion detection systems (IDSs) in the RAN domain.
In this paper, we propose a training approach for AI-based IDS models that are distributed across MEC nodes in the RAN, rather than adopting a centralized IDS in the core network. A centralized approach requires aggregating all traffic into the core, which leads to resource consumption and increased latency. In contrast, since attacks targeting MEC nodes often bypass the core network, it is more effective to implement detection directly at the MEC level, where attacks are likely to occur. Accordingly, this study adopts a distributed IDS architecture based on MEC and designs a training procedure to support it.
The distributed learning procedure across multiple edge nodes proposed in this paper is as follows:
Each MEC located near a base station (Worker MEC1, MEC2) holds the base models of the stacking ensemble, while MEC3 (Master) holds the meta-classifier.
Each Worker MEC collects data from its associated base station. The data is split into training sets for the base models and for the meta-classifier in a 7:3 ratio, and the base models are trained.
Each worker node (MEC1, MEC2) preprocesses the original traffic data stored in CSV format (e.g., scaling, missing value handling), then converts it into a pickle file containing both the preprocessed data and the parameters of the trained base model. This pickle file is transmitted to the master node (MEC3).
The Master MEC (MEC3) integrates the received base model parameters to construct the full ensemble model.
Finally, the master node (MEC3) uses the meta-classifier training data, derived from portions of the datasets of MEC1 and MEC2, to train and test the complete ensemble model and evaluate its performance.
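The worker-to-master hand-off in the procedure above can be sketched as a simple serialization round trip. This is an illustrative sketch, not the authors’ implementation: plain dicts stand in for the PyTorch state_dict and the preprocessed splits, and `package_worker_payload`/`unpack_worker_payload` are hypothetical helper names.

```python
import io
import pickle

def package_worker_payload(encoder_params, meta_train_data, node_id):
    """Bundle trained encoder parameters and preprocessed data for the master.

    `encoder_params` stands in for a PyTorch state_dict (name -> weights);
    `meta_train_data` is the preprocessed split reserved for the meta-classifier.
    """
    payload = {
        "node_id": node_id,
        "encoder_params": encoder_params,
        "meta_train_data": meta_train_data,
    }
    buf = io.BytesIO()
    pickle.dump(payload, buf)  # in deployment, these bytes are sent to MEC3
    return buf.getvalue()

def unpack_worker_payload(raw):
    """Master-side counterpart: restore the payload from the received bytes."""
    return pickle.loads(raw)
```

In the actual system, the payload would be written with `torch.save()` and transferred over the network; the round trip above only illustrates that one file carries both the model parameters and the data needed for meta-classifier training.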
Cooperation among MECs follows a star layout. Encoders are trained at worker MECs, and only encoder parameters and preprocessed validation/test splits are sent to the master MEC. Raw packets and gradients are not exchanged between workers, and there is no direct worker-to-worker synchronization, which lowers coordination cost. Unlike federated learning, we do not repeatedly average parameters or gradients: each worker sends encoder parameters once, and the master trains the meta-classifier on concatenated latent features from all workers, preserving local patterns without repeated synchronization. This architecture reduces communication by sending only encoder parameters and compact latent features, keeps node-specific patterns through separate encoders, and avoids mismatch from global averaging because the meta-classifier learns on concatenated features.
The key design choice of the proposed framework is to separate local representation learning from global decision learning. Instead of aggregating raw traffic or repeatedly synchronizing model parameters across MEC nodes, each worker learns a compact latent representation only from its own feature subset. This column-wise design reduces the dimensionality handled by each local encoder, preserves node-specific traffic characteristics, and enables the master node to build a global detector from concatenated latent features rather than from centrally aggregated raw data. As a result, the proposed framework is better suited to heterogeneous multi-MEC environments, where direct data aggregation may blur local traffic patterns and increase communication overhead.
The structure of the stacking ensemble model proposed in this study for distributed learning is illustrated in Figure 4. The base models are constructed using the encoder components of autoencoders, while the meta-classifier is implemented with an XGBoost classifier. Each encoder consists of three fully connected layers that progressively compress the input features: the first layer reduces the input dimension to 16, the second reduces it to 12, and the third produces a latent representation of dimension 7.
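A minimal PyTorch sketch of one base encoder is given below. The layer dimensions (input → 16 → 12 → 7) follow the text; the ReLU activations and the `BaseEncoder` name are assumptions, since the paper specifies only the dimensions.

```python
import torch
import torch.nn as nn

class BaseEncoder(nn.Module):
    """Encoder half of one base autoencoder: input -> 16 -> 12 -> 7."""

    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 16), nn.ReLU(),
            nn.Linear(16, 12), nn.ReLU(),
            nn.Linear(12, 7),  # 7-dimensional latent representation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

During autoencoder training a mirrored decoder would reconstruct the input from the 7-dimensional latent vector; only the encoder above is kept as a base model of the stacking ensemble.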
More specifically, the input features are partitioned into fixed column groups, and each encoder processes only its assigned subset to produce a low-dimensional latent vector. This design enables each base model to specialize in a limited feature space instead of learning from the full input dimension. The resulting latent vectors are then concatenated to form the input of the meta-classifier, which learns the final decision boundary for intrusion detection. Because only compact latent representations are propagated to the meta level, the proposed architecture reduces the dimensionality of the data handled in the final stage and lowers the communication burden compared with transferring raw features. At the same time, the use of separate encoders allows each feature subset to preserve its own local structure, while the XGBoost meta-classifier captures cross-subset interactions from the concatenated latent space. In this way, the model combines lightweight local encoding with expressive global classification, making it suitable for distributed training across multiple MEC nodes.
A key characteristic of this model is that it sequentially partitions the features of the training data across multiple encoders, which locally compress the data to form the input dataset for the meta-classifier. This design addresses the limitations mentioned in Related Work: (1) it reduces communication overhead by transmitting only compact latent features instead of raw data, addressing the high aggregation cost of centralized IDS; (2) it handles heterogeneous data by preserving node-specific patterns through separate encoders at each node, addressing the heterogeneous data handling challenge in distributed IDS; and (3) it enables efficient scaling to MEC by distributing computation across nodes and reducing training cost through column-wise feature partitioning, addressing the scaling and training cost issues in ensemble IDS.
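The column-wise partitioning can be sketched as follows, assuming remainder columns go to the last group so that 25 input features split across three encoders yield the (8, 8, 9) grouping reported in the experiments; `partition_columns` is an illustrative helper, not the authors’ code.

```python
def partition_columns(n_features: int, n_encoders: int):
    """Split column indices 0..n_features-1 into near-even contiguous groups.

    Remainder columns are assigned to the trailing group(s), so 25 features
    over 3 encoders yields groups of sizes (8, 8, 9).
    """
    base, rem = divmod(n_features, n_encoders)
    groups, start = [], 0
    for i in range(n_encoders):
        size = base + (1 if i >= n_encoders - rem else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups
```

Each encoder then sees only the columns in its own group, which is what lets it specialize in a limited feature space.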
The training procedure is as follows:
Split the columns of the training dataset evenly, according to the number of autoencoders used as base models.
Train each autoencoder using its assigned subset of the training data as input.
Extract the trained encoders from the autoencoders and utilize them as the base models of the stacking ensemble.
Feed the training data into the base encoders, concatenate the resulting latent features, and construct the dataset for the meta-classifier.
Train the XGBoost meta-classifier using the dataset composed of the latent features.
Algorithm 1 summarizes the overall learning procedure of the proposed architecture, which corresponds to the conceptual workflow illustrated in Figure 4. The algorithm proceeds in four major stages as follows.
Algorithm 1 Distributed stacking ensemble training workflow across multi-MEC nodes

Require: Local datasets D_1 (MEC1), D_2 (MEC2)
Require: At MEC3: identical encoder architecture and received encoder parameters θ_{i,j} (PyTorch state_dict)
Require: Feature-partition vector P splitting the input into three fixed subsets assigned to three autoencoders
Ensure: Meta-classifier M and threshold τ

Notation: AE_{i,j}: j-th autoencoder at worker i; E_{i,j}: encoder reconstructed at MEC3 using parameters θ_{i,j}; ⊕: feature-wise concatenation

Worker nodes (MEC1, MEC2)
for i = 1, 2 do
    Load and preprocess D_i
    Split into train/val/test sets
    Slice the train set into 3 subsets using P
    for j = 1 to 3 do
        Train AE_{i,j} using subset j
        Extract parameters θ_{i,j}
        Save θ_{i,j} via torch.save()
    end for
    Transmit θ_{i,1}, θ_{i,2}, θ_{i,3} and the val/test sets to MEC3
end for

Master node (MEC3)
Load all θ_{i,j} and reconstruct E_{i,j}
Merge validation sets V_1 and V_2; merge test sets in the same way
Meta dataset construction
for each merged split do
    for i = 1, 2 do
        Slice the split into 3 subsets according to P
        Create latent vectors z_{i,j} = E_{i,j}(subset j) for j = 1, 2, 3
        Construct node-specific meta dataset: Z_i = z_{i,1} ⊕ z_{i,2} ⊕ z_{i,3}
    end for
    Merge node-level meta datasets: Z = Z_1 ⊕ Z_2
    Convert labels to binary form: y = 1 (if attack), 0 otherwise
end for
Train M with the validation meta dataset
Threshold selection
Sweep τ with a fixed step and compute the F1-score on the validation meta dataset
Select the τ that maximizes the validation F1-score
Final evaluation
Evaluate M on the test meta dataset with the selected threshold τ
Report Accuracy, Precision, Recall, F1-score, AUC
5. Experiments
In this study, we considered four experimental scenarios to compare AI training methods in multi-node environments: (1) local single-node model training, (2) centralized model training, (3) the proposed 2-node distributed model training, and (4) logical N-node distributed model training for scalability evaluation. This section describes the experimental settings for each scenario and compares both the traffic generated during training and the attack detection performance of the models.
To clarify the evaluation scope, this comparison is designed as a system-level scenario comparison rather than a strictly capacity-controlled algorithm-only comparison. The three scenarios share the same proposed model structure (column-wise autoencoder with stacking ensemble), but the number of integrated encoders and the meta-input dimensionality vary by deployment scenario. In the centralized setting, the meta-classifier uses three encoders trained on the aggregated data, resulting in a meta-input dimension of 21 (3 encoders × 7 latent dimensions). In the multi-node distributed setting, the master-side ensemble integrates latent features from two MEC nodes, resulting in six encoder outputs and a meta-input dimension of 42 (6 encoders × 7 latent dimensions). Accordingly, the local vs. multi-node performance difference is interpreted by jointly considering deployment-scenario effects and representational integration, rather than being treated as a pure algorithm-only superiority claim under identical input dimensionality. For fairness between centralized and distributed settings, we use identical encoder architectures and training configurations: feature split (8,8,9), three encoders per node (centralized uses three encoders, distributed uses six encoders from two nodes), XGBoost meta-classifier, and 20 training epochs for base models. All experiments use a fixed random seed of 42 for reproducibility.
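The meta-input dimensions quoted here follow directly from the 7-dimensional latent output of each base encoder:

```python
LATENT_DIM = 7  # latent dimension produced by each base encoder

def meta_input_dim(n_encoders: int) -> int:
    """Width of the concatenated latent vector fed to the meta-classifier."""
    return n_encoders * LATENT_DIM
```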
5.1. Experimental Setup
5.1.1. Experimental Environment Configuration
The experimental environment was configured to simulate a system with three MEC nodes, as shown in Figure 5. All nodes were workstations running Windows 11 OS, and Docker Desktop (Docker version 28.3.2) was employed to build containerized environments to ensure consistent execution conditions across nodes. The PyTorch 2.8.0 deep learning framework was used for model implementation.
Each node contained either a worker or a master service container, and additionally included a TCPDUMP service container (tcpdump version 4.9.0) for measuring traffic performance. The worker and master containers provided functionalities for model training and data transmission and reception. To enable GPU-based model training within Docker containers, container images developed by NVIDIA were used [31]. Furthermore, to capture traffic in conjunction with the training containers, the tcpdump container developed by Kaazing [32] was employed.
Node 1 and Node 2 were equipped with identical hardware environments, each consisting of an AMD Ryzen7 7800X3D CPU (Advanced Micro Devices, Santa Clara, CA, USA), 128 GB of memory, and an NVIDIA GeForce RTX 5070 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA).
In contrast, Node 3, which served as the Master node, was configured with a higher-performance AMD Ryzen9 9950X CPU (Advanced Micro Devices, Santa Clara, CA, USA), 128 GB of memory, and an NVIDIA GeForce RTX 5090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) to handle computation and model integration tasks.
For the logical N-node distributed learning experiments, we extended the experimental setup to evaluate the scalability of the proposed method across a larger number of worker nodes. Due to hardware constraints in our experimental environment, we conducted logical N-node experiments where multiple worker nodes are simulated on a single physical machine. The logical N-node experiments were designed with 10 logical worker nodes, where each worker node is assigned to process data from either gNB1 or gNB2 without overlap. The 5G-NIDD dataset’s gNB1 and gNB2 data are evenly and non-overlappingly distributed across the logical worker nodes, maintaining the same data distribution characteristics as the physical 2-node distributed learning setup. This logical node assignment allows us to evaluate the distributed learning framework’s behavior under various scenarios including node failures and data loss. The master node configuration remains the same as in the 2-node distributed learning experiments, handling the aggregation of encoder parameters and training the meta-classifier. Note that these are logical validations rather than physical multi-MEC deployments, which represents a limitation of this study.
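A sketch of the logical shard assignment is given below, under the assumption that the 10 workers are divided evenly between the two gNBs (five each) and each gNB’s samples are distributed round-robin without overlap; the function name and the exact assignment policy are illustrative, not the authors’ implementation.

```python
def assign_to_logical_workers(gnb1_idx, gnb2_idx, n_workers=10):
    """Distribute gNB1 and gNB2 sample indices across logical workers.

    Each worker receives data from exactly one gNB, with no overlap; each
    gNB's index pool is split evenly across its assigned workers.
    """
    half = n_workers // 2
    shards = []
    for pool, n in ((gnb1_idx, half), (gnb2_idx, n_workers - half)):
        for i in range(n):
            shards.append(pool[i::n])  # round-robin slices are disjoint
    return shards
```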
5.1.2. Dataset Description
The dataset used in the experiments is 5G-NIDD [33]. This dataset was collected in a real 5G network environment at the University of Oulu, Finland. It was gathered in a 5G setup comprising two base stations and contains normal traffic from real users as well as five types of DoS attacks targeting specific MEC nodes within the 5G environment and three types of port-scan attacks.
The 5G-NIDD dataset provides separated normal and attack traffic collected from each of the two base stations. The proportions of normal traffic, attack traffic, and specific attack types collected from each base station are shown in Figure 6.
Looking at the overall distribution of normal and malicious traffic, gNB1 contained a larger portion of normal traffic at 85.18%, which was 35.18 percentage points higher than that of gNB2. In contrast, gNB2 had a larger share of malicious traffic, accounting for 56.46%. When examining the detailed attack traffic, the amounts of SYN flood and UDP flood attacks were about 2.09 times and 1.60 times greater in gNB2, respectively. For the remaining attack types, the differences ranged from approximately 1.0005 to 1.1767 times, indicating that gNB1 and gNB2 had similar volumes of these attacks.
5.1.3. Dataset Splitting
In both experiments, the datasets from gNB1 and gNB2 were divided into three subsets: train, validation, and test. First, the original data was split into train + validation and test sets at a ratio of 7:3. Then, the train + validation set was further divided into train and validation sets using the same 7:3 ratio. Following this procedure, the resulting data volumes for gNB1 and gNB2 are shown in Table 2.
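The resulting split sizes can be reproduced from the per-gNB totals, assuming the 30% side of each split is rounded up (as scikit-learn’s train_test_split does); this is an illustrative reconstruction of the bookkeeping, not the authors’ code.

```python
from math import ceil

def split_counts(total: int):
    """Apply the 7:3 split twice: total -> (train+val, test),
    then train+val -> (train, val), rounding the 30% side up."""
    test = ceil(total * 0.3)
    trainval = total - test
    val = ceil(trainval * 0.3)
    train = trainval - val
    return train, val, test
```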
5.1.4. Model Layer Architecture and Hyper-Parameters
The stacking ensemble proposed in this study is composed of autoencoder encoders as the base models and an XGBoost classifier as the meta-classifier. The autoencoder encoders are based on fully connected layers, and the structures and hyper-parameters of each component are summarized in Table 3 and Table 4.
All training hyperparameters and stopping rules used in our experiments are detailed below. For the base autoencoders, we use the Adam optimizer with a learning rate of 0.005 and MSE loss. The batch size is fixed at 256 for all train/validation/test loaders, and the maximum number of training epochs for the base models is 20. Early stopping is consistently applied in min mode with a patience of three epochs, based on validation loss. For the meta-classifier, we use XGBoost with the hyperparameters specified in Table 4. The XGBoost model uses its built-in early stopping mechanism based on validation error. To ensure reproducibility, we apply random seed control with a fixed seed value of 42 throughout the training pipeline. The operating threshold is selected on the validation set by maximizing the validation F1-score, and is then fixed for test evaluation.
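The min-mode early-stopping rule described above can be sketched as a small helper; this is an illustrative reimplementation, not the authors’ training code.

```python
class EarlyStopper:
    """Min-mode early stopping on validation loss with a fixed patience."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With patience 3, training halts after three consecutive epochs without a new best validation loss.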
5.1.5. Evaluation Metrics
This study employs the following metrics to evaluate performance from two perspectives:
In the experiments conducted in this study, the AI model performs binary classification of traffic (0: normal, 1: attack). The prediction results of the AI model can be represented using a confusion matrix. The confusion matrix consists of four components: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). These values are used to calculate accuracy, precision, recall, and F1-score.
Accuracy represents the proportion of correct predictions among all predictions made by the model. However, accuracy can be misleading when the test dataset is imbalanced across classes. For example, if the proportion of normal traffic is significantly higher, the model may predict all instances as normal and still achieve a high accuracy value, while performing poorly on minority classes. Therefore, accuracy should only be used as a supplementary metric.
Precision indicates the proportion of correctly predicted positive samples among all samples predicted as positive. This metric is affected by FP, and precision increases as FP decreases. From a security perspective, a low precision value means a high number of false alarms, where normal traffic is misclassified as attacks, thereby increasing the workload for administrators.
Recall indicates the proportion of actual positive samples that are correctly identified as positive. This metric is affected by FN, and recall increases as FN decreases. From a security perspective, a low recall value means that more attacks are missed, which increases the risk of security incidents.
F1-score is the harmonic mean of precision and recall. Since it reflects the balance between FP (affecting precision) and FN (affecting recall), it is suitable as a comprehensive performance metric for evaluating the model.
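For reference, the four confusion-matrix metrics defined above can be computed directly from the TP/FP/FN/TN counts:

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```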
The ROC curve visualizes the relationship between the true positive rate (TPR, equivalent to recall) and the false positive rate (FPR) at various thresholds. AUC represents the area under the ROC curve. A higher AUC value indicates that the model is more reliable in classifying attacks across different thresholds. The operating threshold τ used in Table 5 is the decision boundary applied to the meta-classifier’s sigmoid output (probability score). We select τ on the validation set by sweeping candidate values and maximizing the F1-score, and then fix it for test evaluation. Smaller thresholds in the centralized and distributed settings reflect different score distributions and class ratios after feature concatenation; in this paper, τ is defined as the meta-classifier’s decision boundary.
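The validation-side threshold selection can be sketched as a simple sweep; the 0.01 step size is an assumption for illustration, since the paper does not fix the step here.

```python
def select_threshold(scores, labels, step=0.01):
    """Pick the threshold in (0, 1) that maximizes F1 on validation scores."""
    best_t, best_f1 = 0.5, -1.0
    t = step
    while t < 1.0:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
        t = round(t + step, 10)
    return best_t, best_f1
```

The selected threshold is then frozen and applied unchanged to the test-set probability scores.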
5.2. Experiment 1: Local Single-Node Model Training
The first experiment trains and evaluates the stacking ensemble model at MEC1 and MEC2 using only the data generated from the gNB1 and gNB2 base stations, respectively.
To further characterize this locality-preserving setting, the pre-split class distributions differ substantially across regions: MEC1 (gNB1) is relatively balanced in binary labels (Benign 55.88%, Attack 44.12%), whereas MEC2 (gNB2) is attack-dominant (Benign 14.52%, Attack 85.48%). In addition, the dominant attack composition also varies by node (e.g., UDPFlood-heavy distribution in MEC2). These results confirm regional data heterogeneity across gNB/MEC sites. Therefore, Experiment 1 intentionally preserves regional heterogeneity and should be interpreted as a realistic deployment baseline.
The dataset used for training and evaluation at MEC1 (gNB1 data) contained the following number of samples:
Train set: 356,874.
Validation set: 152,947.
Test set: 218,495.
The dataset used for training and evaluation at MEC2 (gNB2 data) contained the following number of samples:
Train set: 238,910.
Validation set: 102,391.
Test set: 146,273.
Through this experiment, the performance of the model trained on traffic from each independent base station network at a local single node is evaluated.
This experiment presents the results of training autoencoders separately in the MEC1 (gNB1 data) and MEC2 (gNB2 data) environments, and subsequently using the trained encoders as base models for meta-classifier training.
In the MEC1 environment, the losses and validation losses of the three autoencoders converged stably, and the training of all base models took approximately 1 min and 18.99 s. Then, training and validation meta data were generated through the encoder outputs. Based on this meta data, the meta-classifier was trained for about 20.2 s over four epochs. The loss started at 0.6449 and remained in the range of 0.6761–0.6909.
In the MEC2 environment, the three autoencoders completed training in a comparatively shorter time of about 57.80 s, with both training and validation losses converging at low levels. Using the training and validation meta data generated from the encoder outputs, the meta-classifier was trained for about 19.5 s over six epochs. The loss gradually decreased from an initial value of 0.3266 to 0.2411 at the final epoch.
To mitigate class imbalance in MEC2 without synthetic data generation, we applied benign-weighted BCE loss only during meta-classifier training on the train split, while keeping validation and test distributions unchanged.
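A sketch of the benign-weighted BCE idea on probability outputs is shown below; the helper name and the choice of weight are illustrative, since the paper does not state the exact weighting (a common choice is the attack-to-benign sample ratio).

```python
from math import log

def weighted_bce(probs, labels, benign_weight):
    """Mean BCE where benign samples (label 0) are up-weighted to counter
    the attack-dominant class ratio observed at MEC2."""
    total = 0.0
    for p, y in zip(probs, labels):
        w = benign_weight if y == 0 else 1.0
        total += -w * (y * log(p) + (1 - y) * log(1 - p))
    return total / len(probs)
```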
In both environments, the base models converged stably. For meta-classifier training based on the generated meta data, the MEC1 results showed that the loss remained within a stable range, while the MEC2 results exhibited a gradual decrease in the loss.
In all main experiments, the operating threshold is selected on the validation set by maximizing F1-score (F1-opt) for consistency. For MEC2 in Experiment 1, we additionally report a balanced-accuracy-optimized threshold (BalAcc-opt) as a sensitivity analysis under severe class imbalance.
As shown in Table 5, using the F1-opt operating point, MEC1 adopts a best threshold of 0.40, while MEC2 uses 0.66. Under this setting, MEC1 achieves Accuracy 0.4449, Precision 0.4426, Recall 0.9963, and F1-score 0.6130, whereas MEC2 attains Accuracy 0.8568, Precision 0.8592, Recall 0.9954, and F1-score 0.9223. The AUC is 0.6163 for MEC1 and 0.6214 for MEC2. The number of training samples is 356,874 for MEC1 and 238,910 for MEC2, and the total runtime is 262.7 s for MEC1 and 177.0 s for MEC2.
These results indicate that MEC2 provides higher F1-score and overall balanced performance than MEC1 under the common F1-opt protocol, while MEC1 still exhibits slightly lower AUC.
5.3. Experiment 2: Centralized Model Training
In the second experiment, the traffic data collected from the two base stations was preprocessed at MEC1 and MEC2 through scaling and missing value handling, then converted into a pickle file as a preprocessed dataset and transmitted to MEC3. Based on this data, MEC3 carried out the training of the centralized stacking ensemble model.
The dataset used for training and evaluation at MEC3 (gNB1 data + gNB2 data) contained the following number of samples:
Train set: 595,784.
Validation set: 255,338.
Test set: 364,768.
The model achieved overall high detection performance.
Looking at the detailed performance metrics, the Accuracy, Precision, Recall, and F1-score reached 0.9992, 0.9994, 0.9992, and 0.9993, respectively, while the AUC was 0.9999. The best threshold for the meta-classifier was 0.70.
The time metrics refer to the working time of each MEC node. MEC1 (worker1) and MEC2 (worker2) required 00:27 and 00:13, respectively, to preprocess and transmit the data to the Master (MEC3). The Master (MEC3) then required 05:35 to train the stacking ensemble model using the received data. Since the worker nodes operate in parallel, the total execution time was approximately 06:02 (max(00:27, 00:13) + 05:35), which reflects the characteristics of the centralized learning architecture, where all data must be aggregated at the Master node (Table 6).
In summary, centralized learning achieved stable detection performance with high accuracy, precision, and recall, but it showed the limitation of concentrated workload on the Master node and relatively longer working time during the training process.
5.4. Experiment 3: Multi-Edge Distributed Model Training
The third experiment begins with preprocessing data generated from two base stations on each worker node (MEC1, MEC2), followed by training the base model for the stacking ensemble model. Once base model training is complete, the worker nodes extract the encoder parameters from the trained autoencoder and transmit them to the master node, MEC3, along with the preprocessed validation and test sets. The master node reconstructs the base model using the received base model parameters to train the encoder. It then trains and evaluates the meta-classifier model based on the received preprocessed validation and test sets. The composition of the dataset used in this experiment is identical to that in Experiment 2.
In the distributed learning environment, the training convergence was observed across both worker nodes. Each worker’s three autoencoders converged to stable reconstruction losses within 3–5 epochs, similar to the local single-node experiments. As shown in Figure 7, the master node’s XGBoost meta-classifier training showed efficient convergence with validation error steadily decreasing. The distributed training maintained stable learning dynamics despite the increased model complexity from combining encoders from multiple nodes. The model exhibited very high detection performance.
Looking at the detailed performance metrics, the Accuracy, Precision, Recall, and F1-score were 0.9993, 0.9996, 0.9992, and 0.9994, respectively, again with an AUC of 0.9999. The best threshold for the meta-classifier was 0.75.
The time metrics refer to the working time measured at each MEC node. MEC1 (worker1) and MEC2 (worker2) required 04:29 and 03:07, respectively, to preprocess the data, train the base models, and then transmit the meta-training data and trained base-model parameters to the Master node (MEC3). The Master (MEC3) required an additional 00:04 to load the received base models, reconstruct the stacking ensemble model, and perform training. Since the worker nodes operate in parallel, the total execution time was approximately 04:33 (max(04:29, 03:07) + 00:04) (Table 7).
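The total-time rule used in Experiments 2 and 3 (parallel workers, then the master's sequential step) can be checked with a small helper; the helper names are illustrative, and the times are those reported above.

```python
def to_sec(mmss):
    """Convert an 'MM:SS' working-time string to seconds."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)


def total_time(worker_times, master_time):
    """Workers run in parallel: total = max(worker times) + master time."""
    return max(map(to_sec, worker_times)) + to_sec(master_time)


def to_mmss(sec):
    return f"{sec // 60:02d}:{sec % 60:02d}"


# Experiment 3 (distributed): max(04:29, 03:07) + 00:04
print(to_mmss(total_time(["04:29", "03:07"], "00:04")))  # 04:33
# Experiment 2 (centralized): max(00:27, 00:13) + 05:35
print(to_mmss(total_time(["00:27", "00:13"], "05:35")))  # 06:02
```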
In summary, distributed learning maintained both precision and recall at high levels, achieving performance comparable to centralized learning in F1-score and AUC. Furthermore, the total working time was shorter than that of centralized learning, demonstrating improved efficiency while maintaining comparable performance.
5.5. Experiment 4: Logical N-Node Distributed Model Training
The fourth experiment extends the distributed learning framework to evaluate scalability and robustness across multiple worker nodes. This experiment was designed to assess how the proposed method performs when the number of worker nodes increases, and how it handles realistic distributed system challenges such as node failures and data loss.
The logical N-node experiment configuration consists of 10 logical worker nodes, each processing data from either gNB1 or gNB2 without overlap. The gNB1 and gNB2 data of the 5G-NIDD dataset are evenly distributed across the logical worker nodes, preserving the same underlying data characteristics as the physical two-node distributed setup. The master node aggregates encoder parameters from all active worker nodes and trains the XGBoost meta-classifier on the concatenated latent features.
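The non-overlapping, per-gNB partitioning can be sketched as follows. The sample counts and the 5/5 split of workers between the two gNBs are illustrative assumptions; only the 10-node, two-gNB structure comes from the experiment description.

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for 5G-NIDD traffic captured at the two base stations;
# the sample counts here are hypothetical.
gnb1 = rng.normal(size=(500, 20))
gnb2 = rng.normal(size=(500, 20))

# Each logical worker receives data from exactly one gNB, without overlap:
# here, workers 0-4 share gNB1 evenly and workers 5-9 share gNB2 evenly.
shards = np.array_split(gnb1, 5) + np.array_split(gnb2, 5)

assert len(shards) == 10
assert sum(s.shape[0] for s in shards) == 1000  # nothing lost or duplicated
```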
We conducted multiple experimental scenarios to evaluate different aspects of the distributed learning framework:
Baseline scenario: All 10 worker nodes are active with no failures or data loss. This scenario establishes the baseline performance when all nodes contribute their encoder parameters to the meta-classifier training.
Node failure scenarios: We simulated node failures at different rates (10%, 20%, 30%, and 40% of worker nodes) to evaluate how the system responds when some nodes become unavailable. When a node fails, its encoder parameters are not available for meta-classifier training, reducing the input dimension to the meta-classifier accordingly.
Data loss scenarios: We simulated data loss at different rates (10%, 20%, 30%, and 40% of total samples) to evaluate how the system performs when training data is partially unavailable. Data loss is distributed proportionally across nodes and data splits (train, validation, test).
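The two failure models above can be sketched with simple simulation helpers. This is a dependency-light illustration; the paper does not specify how failed nodes or lost samples are selected, so uniform random selection is assumed here, and the function names are hypothetical.

```python
import numpy as np

N_NODES, ENC_PER_NODE, LATENT_DIM = 10, 3, 7


def simulate_failures(failure_rate, rng):
    """Drop a fraction of workers; their encoder parameters never reach
    the master, so the meta-classifier's input dimension shrinks."""
    n_failed = round(N_NODES * failure_rate)
    failed = set(rng.choice(N_NODES, size=n_failed, replace=False).tolist())
    active = [n for n in range(N_NODES) if n not in failed]
    meta_dim = len(active) * ENC_PER_NODE * LATENT_DIM
    return active, meta_dim


def simulate_data_loss(X, loss_rate, rng):
    """Drop a fraction of samples uniformly at random (proportional loss)."""
    keep = rng.random(len(X)) >= loss_rate
    return X[keep]


rng = np.random.default_rng(0)
active, meta_dim = simulate_failures(0.40, rng)
print(len(active), meta_dim)  # 6 active nodes, meta-input dimension 126
```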
For each scenario, the experimental procedure follows the same workflow as described in Experiment 3, but with the extended number of worker nodes. Each worker node trains three autoencoder encoders locally using its assigned data subset, extracts the encoder parameters, and transmits them to the master node along with preprocessed validation and test sets. The master node reconstructs all available encoders, creates the meta-dataset by concatenating latent features from all active nodes, and trains the XGBoost meta-classifier.
The dataset composition for the logical N-node experiments follows the same splitting strategy as Experiments 2 and 3. Each worker node processes data from either gNB1 or gNB2, maintaining the same 7:3 data-splitting ratio across the train, validation, and test sets. The total number of training samples across all worker nodes is the same as in the centralized and two-node distributed experiments, but the data is distributed across 10 logical nodes instead of two physical nodes.
The performance evaluation for logical N-node experiments includes the same metrics as previous experiments: Accuracy, Precision, Recall, F1-score, and AUC. Additionally, we analyze the impact of node failures and data loss on detection performance and communication overhead.
Table 8 summarizes the performance metrics across different scenarios. In the baseline logical N-node scenario with all 10 workers active, the meta-classifier achieved Accuracy 0.9995, Precision 0.9997, Recall 0.9994, F1-score 0.9996, and AUC 0.9999, with a total training time of 719 s at the master side.
These results confirm that the logical N-node distributed learning framework maintains near-perfect detection performance while scaling to a larger number of logical worker nodes, even under node failure and data loss conditions.
6. Analysis of Experimental Results
6.1. Performance Comparison of Local Models in Experiment 1
Comparing the local models under the primary F1-opt protocol, MEC1 achieved Accuracy 0.4449, Precision 0.4426, Recall 0.9963, and F1-score 0.6130, while MEC2 achieved Accuracy 0.8568, Precision 0.8592, Recall 0.9954, and F1-score 0.9223 (Table 5). Accordingly, MEC2 outperformed MEC1 in Accuracy, Precision, and F1-score, while both models achieved near-perfect recall (Figure 8a).
For threshold-independent ranking metrics, ROC-AUC was 0.6163 for MEC1 and 0.6214 for MEC2, indicating comparable ranking-based discrimination in this local setting.
From the perspective of training resources, MEC1 used 356,874 training samples and required 262.7 s, while MEC2 used 238,910 samples and required 177.0 s. Thus, MEC2 was faster with fewer samples in this local training setup.
In summary, MEC2 provides higher Accuracy and F1-score under the F1-opt operating point, while both MECs attain very high recall. Therefore, in Experiment 1, the main difference between MEC1 and MEC2 is the overall balance between precision and recall under heterogeneous class distributions with severe class imbalance.
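The reported F1-scores can be recomputed directly from the reported precision and recall values. Because the inputs are themselves rounded to four decimals, the recomputed values agree with the reported ones only up to rounding.

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recomputing Experiment 1's F1-scores from the reported precision/recall:
print(round(f1(0.4426, 0.9963), 4))  # ~0.6129 (reported 0.6130; rounding)
print(round(f1(0.8592, 0.9954), 4))  # 0.9223, matching the reported value
```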
6.2. Performance Comparison of AI Models in Experiments 2 and 3
Centralized and distributed learning were compared in terms of performance, and both methods achieved near-perfect detection metrics (Table 9). Accuracy was 0.9992 in centralized learning and 0.9993 in distributed learning, while Precision remained above 0.9994 in both settings. Recall and F1-score also exceeded 0.9992 for both architectures, and AUC stayed at 0.9999, indicating virtually identical classification stability.
In contrast, the time characteristics differed more clearly. The centralized setting required a total of 06:02, whereas the distributed setting completed in 04:33, reflecting the benefit of parallelized base-model training at the worker MECs. As illustrated in Figure 9, these results confirm that the proposed distributed learning maintains centralized-level performance while reducing overall training time. The slight performance improvement stems from preserving node-specific features: each MEC trains its own encoders on local data, and the master concatenates the resulting latent features for stacking. This retains the distinct traffic patterns of each node and helps the meta-classifier reduce false negatives without increasing false positives; in our experiments, Accuracy, Recall, and F1-score improved slightly while AUC remained almost unchanged.
Following the comparison of performance metrics and training time, the network traffic generated by the two approaches was analyzed. In the centralized learning (Master) scenario, a total of 617.85 MB of data and 46,088 packets were generated, with an average bandwidth of 112.67 Mbps and a maximum bandwidth of 183.46 Mbps.
In contrast, the distributed learning (Master) scenario produced 64.21 MB of data and 4911 packets, approximately one-tenth of the centralized case. This significant reduction resulted from transmitting compact binary data containing preprocessed features directly usable by the AI model, instead of raw traffic data. The average bandwidth was reduced to 53.86 Mbps, less than half of the centralized value, while the maximum bandwidth was similar at 173.98 Mbps; however, the traffic duration was considerably shorter.
Additionally, a key analytical point is that the proposed distributed learning architecture maintained comparable performance while improving efficiency, despite utilizing more encoders than the centralized approach. While centralized learning trained the meta-classifier using three encoders (meta-input dimension of 21), the distributed method combined the latent features of six encoders, three trained at MEC1 and three at MEC2 (meta-input dimension of 42). Even though the input dimension of the meta-classifier doubled from 21 to 42, the distributed approach showed comparable performance with slight gains in some metrics while reducing computational time. These results indicate that, in distributed learning, combining latent features generated across two nodes absorbs data imbalance and distribution differences, improves learning efficiency, and yields more stable integrated performance.
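The meta-input dimensions quoted above follow directly from the encoder count and the 7-dimensional latent space; the helper name is illustrative.

```python
LATENT_DIM = 7  # latent dimensions produced by each encoder


def meta_input_dim(n_encoders, latent_dim=LATENT_DIM):
    """Input size of the meta-classifier: all latent features concatenated."""
    return n_encoders * latent_dim


print(meta_input_dim(3))  # 21: centralized (three encoders at the master)
print(meta_input_dim(6))  # 42: two-node distributed (3 encoders x 2 MECs)
```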
6.3. Performance Comparison Across Experiments
Table 9 summarizes the performance metrics and training time across all experimental scenarios. As shown in the table, local single-node learning showed lower performance (average Accuracy 0.6508, F1-score 0.7677) due to limited data diversity, with notable differences between MEC1 and MEC2 reflecting the heterogeneous traffic distribution across base stations. Centralized and distributed learning both achieved near-perfect detection performance, with distributed learning maintaining centralized-level performance (Accuracy 0.9993 vs. 0.9992, F1-score 0.9994 vs. 0.9993) while reducing training time from 368 s to 293 s. The logical N-node distributed learning demonstrated scalability with the highest performance metrics (Accuracy 0.9995, F1-score 0.9996) even under node failure and data loss conditions, though with increased training time (719 s) due to the larger number of worker nodes.
Figure 10 visually compares the performance metrics across all learning methods. The figure clearly shows that local single-node learning exhibits substantially lower performance across all metrics compared to centralized and distributed learning, which both achieve near-perfect scores. Distributed learning shows slight improvements over centralized learning in Accuracy, Precision, and F1-score, while maintaining identical AUC performance. The logical N-node distributed learning achieves the highest performance metrics, demonstrating the scalability of the proposed framework.
Figure 11 compares the total training time across different learning methods. Local single-node learning requires the shortest time (219.9 s on average) but at the cost of significantly lower performance. Centralized learning takes the longest time (368 s) due to the need to aggregate all data at the master node. Distributed learning achieves a balance between performance and efficiency, reducing training time to 293 s while maintaining comparable or slightly improved performance. The logical N-node distributed learning requires 719 s, reflecting the increased computational overhead from processing data across 10 worker nodes, but still demonstrates efficient scalability given the larger number of nodes involved.
6.4. Performance Analysis of Logical N-Node Distributed Learning
The logical N-node distributed learning experiments were conducted to evaluate the scalability and robustness of the proposed framework across a larger number of worker nodes. The baseline logical N-node scenario (10 worker nodes, no failures, no data loss) establishes the reference performance when all nodes contribute their encoder parameters. In this configuration, the meta-classifier receives latent features from 30 encoders (10 nodes × 3 encoders per node), resulting in a meta-input dimension of 210 (30 encoders × 7 latent dimensions). This increased input dimension compared to the two-node setup (42 dimensions) allows the meta-classifier to learn from a more diverse set of feature representations while still achieving Accuracy 0.9995, Precision 0.9997, Recall 0.9994, F1-score 0.9996, and AUC 0.9999.
The node failure scenarios evaluate how the system responds when a portion of the worker nodes becomes unavailable. As the failure rate increases from 10% to 40%, the number of active nodes decreases from 9 to 6, and the meta-input dimension shrinks proportionally. As shown in Table 8, detection metrics remain near-perfect across all failure rates, with Accuracy and F1-score staying above 0.9994 even at a 40% failure rate. The resulting performance curves in Figure 12 visually confirm this robustness, indicating graceful degradation and resilience against node outages.
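The proportional reduction of active nodes and meta-input dimension can be tabulated for the failure rates studied; this assumes failed nodes are dropped outright, as in the experiment description.

```python
N_NODES, ENC_PER_NODE, LATENT_DIM = 10, 3, 7

rows = []
for rate in (0.10, 0.20, 0.30, 0.40):
    active = N_NODES - round(N_NODES * rate)
    rows.append((active, active * ENC_PER_NODE * LATENT_DIM))
    print(f"{rate:.0%} failures -> {rows[-1][0]} active nodes, "
          f"meta-input dim {rows[-1][1]}")
# 10% -> 9 nodes/189 dims, 20% -> 8/168, 30% -> 7/147, 40% -> 6/126
```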
The data loss scenarios evaluate the system’s performance when training data is partially unavailable. Data loss affects base-model training at the worker nodes, potentially reducing the quality of the encoder representations. As shown in Table 8, as the data loss rate increases from 10% to 40%, all detection metrics remain near-perfect, with Accuracy and F1-score staying above 0.9995 across all scenarios.
Figure 13 visually illustrates that Accuracy, Precision, Recall, F1-score, and AUC all stay near 1.0, with only minor degradation, demonstrating robustness to partial data unavailability.
Overall, the logical N-node distributed learning framework maintains the core advantages observed in the two-node setup: reduced communication overhead compared to centralized learning, preservation of node-specific traffic patterns, and efficient meta-classifier training. In addition, the extended node configuration confirms that near-perfect detection quality can be sustained even when scaling to 10 logical MEC nodes and under realistic node failure and data loss conditions.