1. Introduction
The vision for 6G mobile networks is a hyper-connected, intelligent infrastructure capable of supporting ultra-reliable low-latency communication (URLLC), massive-scale IoT, and bandwidth-intensive applications such as holographic communications and extended reality (XR) [
1,
2]. Realizing this vision requires more than ultra-high data rates and ultra-low latency: it demands intelligent, adaptive, and fine-grained resource management that can react to highly dynamic and heterogeneous traffic. Network slicing will be essential for enhancing adaptability across diverse vertical applications, including mission-critical communications, enhanced broadband with massive multiple input multiple output (MIMO), and ultra-reliable low-latency services. Recent studies demonstrate the potential of deep reinforcement learning (DRL)-based slicing and resource allocation in these domains, such as dynamic SDN-based slicing for URLLC and enhanced mobile broadband (eMBB) services [
3], DRL-enabled resource allocation for massive MIMO slicing [
4], and mission-critical service management architectures [
5]. These examples underline the need for intelligent, fine-grained control mechanisms capable of supporting heterogeneous verticals under dynamic 6G conditions.
Complementing these intelligent core network mechanisms, radio resource management (RRM) in the radio access network (RAN) plays a crucial role in maintaining Quality of Service (QoS) and optimizing spectral efficiency. In 6G systems, RRM must dynamically allocate time–frequency resources considering both the QoS requirements of diverse service types and the instantaneous channel quality experienced by users. A recent study [
6] proposes channel-aware and QoS-aware RRM frameworks based on multi-numerology operation. Such adaptive RRM mechanisms enable flexible coexistence of multiple waveform configurations within the same spectrum, improving service differentiation and spectral utilization for heterogeneous latency and throughput demands.
Conventional resource allocation in mobile networks is largely static or rule-based, designed for relatively stable conditions. Such mechanisms lack the agility needed in 6G, where QoS requirements shift rapidly and traffic is fundamentally non-stationary [
7]. Machine learning (ML) and reinforcement learning (RL), in particular, have emerged as a promising paradigm for autonomous network control. For example, ref. [
8] reports that multi-agent RL can reduce latency by up to 32.4% and increase throughput by 6.3% in 5G radio access networks compared to heuristic methods, underscoring RL’s effectiveness in dynamic settings.
However, existing RL-based solutions remain constrained by limited responsiveness to real-time conditions, poor generalization across diverse network states, and significant computational overhead [
9,
10].
These challenges are exacerbated by a scarcity of fine-grained simulation environments that capture the spatio-temporal complexity of next-generation networks. Moreover, most prior studies target slice- or service-level allocation and therefore overlook the micro-level dynamics of individual queues, where QoS violations often originate. Addressing this gap is crucial for dependable, fine-grained performance in 6G.
SDN provides programmability and centralized control, making it a natural enabler for intelligent resource orchestration [
11]. However, integrating RL with SDN introduces additional complexity: achieving real-time, per-queue decisions at scale while preserving stability and robustness is non-trivial [
12]. In the context of 6G, SDN plays a central role as a unifying control and management layer. Future 6G infrastructures are expected to be highly heterogeneous, spanning terrestrial, aerial, and satellite segments, and to integrate massive numbers of connected devices and intelligent services. SDN provides the programmability, abstraction, and centralized orchestration required to manage such complexity. By decoupling the control and data planes, SDN enables the network’s reconfiguration in real time according to service-level and context-specific requirements, which is essential for 6G’s dynamic environment.
SDN-enabled resource allocation is crucial for 6G, as it directly affects URLLC performance, spectrum efficiency, and energy-aware operation. RL offers an intelligent mechanism to automate these SDN control decisions and to adapt dynamically to varying traffic conditions, user mobility, and network states. Integrating RL with SDN can transform the 6G control plane into a self-optimizing and self-healing entity capable of meeting diverse QoS demands across network slices and service types. This research contributes to that vision by addressing how queue-level RL scheduling can enhance responsiveness and efficiency in SDN-enabled 6G networks.
This paper tackles these challenges by introducing a novel actor–critic RL framework for queue-level bandwidth allocation in SDN-enabled 6G networks. Unlike prior DRL approaches to bandwidth allocation and slicing, which are largely confined to the slice or service level, our method performs real-time, per-queue decision-making based on live QoS metrics (e.g., packet latency and loss) and is integrated with SDN through a modular interface. We train the RL agent in a custom simulation environment that reflects realistic traffic variability, topological dynamics, and stochastic load patterns, improving robustness and generalization. A stochastic traffic generator and an EVE-NG-based testbed further demonstrate a practical and deployable RL–SDN integration for beyond-5G/6G networks.
Paper Contributions
The main contributions of this paper can be summarized as follows:
Modular Resource Scheduling Framework—we propose a modular architecture that integrates reinforcement learning agents for queue-level bandwidth allocation in SDN-enabled 6G networks.
High-fidelity Stochastic Simulation—we design a dynamic, stochastic simulation environment to support robust RL training under realistic traffic variations.
Queue-level Actor–Critic Algorithm—we develop an actor–critic RL algorithm tailored for fine-grained per-queue scheduling, achieving lower latency, reduced packet loss, and improved adaptability under varying conditions.
Modular SDN Interface—we introduce an abstract SDN interface that enables plug-and-play connection of RL agents to SDN controllers, bridging the gap between theoretical research and practical deployment.
Comprehensive Evaluation—we validate the proposed framework through extensive testing in a virtualized EVE-NG-based 6G testbed, demonstrating superior adaptability and robustness compared to a baseline forecasting model.
The rest of this paper is organized as follows:
Section 2 reviews related work and highlights open research gaps.
Section 3 presents the proposed framework and actor–critic model.
Section 4 discusses the evaluation setup and results, while
Section 5 concludes the paper.
2. Related Work
The development of intelligent resource management in 6G networks builds on multiple research directions, including network slicing with SDN support, the application of reinforcement learning in network control, and the advancement of standardization for autonomous architectures. This section outlines key efforts in each area and highlights the limitations that our work addresses across three main domains: (1) network slicing and SDN-based architectures, (2) reinforcement learning for resource management, and (3) standardization and emerging frameworks for intelligent 6G networks.
2.1. Network Slicing and SDN-Oriented Architectures
Network slicing has become a foundational concept in 5G and beyond, enabling the deployment of multiple isolated logical networks over shared physical infrastructure. Standardization bodies such as the NGMN Alliance, 3GPP, and ITU-T define network slicing as the logical partitioning of virtualized resources managed through programmable control and data planes [
13,
14,
15]. These logical slices can be tailored to distinct service types, including eMBB, URLLC, and massive machine-type communications (mMTC). As discussed in [
16], network slicing introduces key challenges in mobility management and dynamic resource allocation to meet diverse service requirements in 5G and future mobile networks.
Programmability, elasticity, isolation, and automation are core enablers of slicing architectures [
17]. Researchers have extended this foundation by integrating SDN-based control and orchestration to improve the responsiveness and manageability of slicing mechanisms. For instance, SDN-based control planes have been used to dynamically reconfigure slice resources in response to traffic fluctuations [
18,
19]. As ref. [
20] highlights, SDN significantly enhances network management by decoupling the control plane from the data plane, enabling centralized, programmable, and more agile network control. Additionally, platforms like Open Source MANO (OSM) have been adopted to demonstrate and validate real-world slicing deployments [
21].
However, despite architectural advances, existing slicing systems typically depend on pre-configured policies or simple adaptation mechanisms, which cannot satisfy the stringent QoS demands of dynamic, heterogeneous 6G applications [
22]. The lack of real-time, adaptive decision-making at a fine-grained level (e.g., per-queue or per-flow) remains a critical shortcoming, as ensuring slice-level QoS/SLA guarantees and isolation generally requires QoS-aware resource allocation mechanisms [
23]. Although SDN introduces significant flexibility, most approaches still rely on static decision-making, fixed policies, or heuristic algorithms that respond slowly to sudden traffic fluctuations, thereby overlooking the micro-dynamics at the queue level where QoS degradation often originates. This gap directly motivates our approach, which focuses on queue-level adaptability through reinforcement learning.
2.2. Reinforcement Learning in Management of Network Resources
Reinforcement learning (RL) has emerged as a powerful tool for enabling autonomous, data-driven decision-making in complex systems. In networking, RL has been applied to traffic engineering, dynamic resource scheduling, edge caching, and slicing. The survey by Luong et al. [
24] comprehensively covers applications of DRL in wireless networks. Tang et al. [
25] further demonstrate its effectiveness for dynamic network slicing.
Zhou et al. [
8] demonstrated a multi-agent RL (MARL) approach for RAN slicing in 5G environments, reporting a 32.4% reduction in latency and a 6.3% improvement in throughput over heuristic baselines. RL has also been applied to highly dynamic settings, such as V2V communication [
26], edge-native multi-tenant orchestration [
27], and federated control tasks [
28], where rule-based approaches often struggle to scale and adapt. Recently, Zhang et al. [
29] introduced a multi-agent algorithm based on Stackelberg game theory for joint resource allocation and task offloading in mobile edge computing (MEC)-enabled cooperative intelligent transportation systems (C-ITSs). Such a hierarchical multi-agent framework balances computational and communication resources efficiently. Moreover, it demonstrates the potential of game-theoretic RL for distributed decision-making in 6G environments.
Despite these advances, most RL models operate at coarse levels of granularity, typically at the service or slice level. This limits their ability to respond dynamically to fine-grained variations in traffic and queue behavior, particularly in SDN environments. Moreover, existing implementations often rely on simplified simulation frameworks that lack key features such as per-queue state tracking, realistic packet-level dynamics, and stochastic traffic generation. These limitations restrict their applicability in practical 6G deployments.
In summary, although RL has emerged as a promising paradigm, existing approaches generally fall short in achieving fine-grained adaptability and robust generalization under realistic network conditions. This work contributes by introducing a queue-level actor–critic RL model, trained in a stochastic environment and validated in the EVE-NG testbed. Through this design, we provide a step toward bridging the gap between theoretical RL research and practical deployment in 6G networks.
2.3. Standardization and AI-Based Network Autonomy
Recent efforts by standardization organizations have recognized the role of AI in future network management. The ETSI Experiential Networked Intelligence (ENI) framework [
21] introduces a reference architecture for AI-driven closed-loop network control, supporting cognitive decision-making based on observed experience. Similarly, ITU-T Recommendations Y.3156 [
30] and Q.5023 [
31] emphasize the need for coordination between AI modules and programmable control planes in IMT-2020 and IMT-2030 networks.
Beyond ETSI ENI and ITU-T, 3GPP and the O-RAN Alliance [
32] are actively standardizing AI-ready management and control enablers. In 3GPP, SA5’s TS 28.530 [
33] defines concepts, use cases, and requirements for network slicing management and orchestration; TS 28.531 [
34] specifies provisioning (procedures, services, and NSMF/NSSMF interactions) for the lifecycle of network slice and slice subnet instances. Additionally, TS 28.541 [
35] specifies the 5G Network Resource Model (NRM), the information model, managed objects, and relationships used to manage 5G networks and network slicing. SA2’s NWDAF (TS 23.288) [
36] exposes network data analytics to control functions (e.g., PCF/SMF/AMF) for policy optimization and intent-driven operations. These capabilities are being extended in Releases 18 and 19 [
37,
38] toward enhanced data collection, analytics placement, and automation. O-RAN’s AI-native split (Non-RT RIC: WG2/A1; Near-RT RIC: WG3/E2), together with SMO/O1, provides integration points for our learning-based scheduler; enforcement can use A1/E2/O1 or 3GPP models, complementing our abstract SDN interface. In addition, numerous studies have explored AI-based admission control, congestion management, and load-balancing algorithms, which adapt resource usage in near real time. For example, supervised learning models have been used for slice selection and traffic prediction [
39]. On the other hand, unsupervised and hybrid models are being tested for anomaly detection and proactive configuration [
39,
40].
Even with these frameworks and early studies, integration challenges remain. The lack of standard APIs, limited compatibility with legacy systems, and complex deployment processes make adoption difficult for network operators. In other words, although standardization efforts acknowledge the importance of AI in network autonomy, they are often abstract and lack concrete mechanisms for integration. Our work addresses this by introducing a modular SDN interface that allows plug-and-play connection of RL agents to SDN controllers, thereby narrowing the gap between standardization recommendations and practical implementation.
In summary, prior research has laid a strong theoretical and architectural foundation for intelligent resource allocation in 5G and future networks. Network slicing and SDN provide the structural enablers for control, while reinforcement learning has shown promising early results in optimizing network behavior. However, the intersection of fine-grained control (e.g., per-queue scheduling), real-time responsiveness, and practical SDN integration remains underexplored. Closing this gap is what enables scalable, real-time, and autonomous control of resource allocation in SDN-enabled 6G networks, and it motivates the approach we describe in detail in the subsequent sections.
Prior studies have addressed various aspects of intelligent and adaptive network control. However, most of them focus on resource allocation at the service or slice level and rely on simplified or static simulation environments, which limits their responsiveness and generalization in realistic conditions. Unlike these approaches, our work introduces a queue-level actor–critic RL model integrated with a modular SDN interface, enabling real-time and fine-grained scheduling decisions. In addition, we provide a stochastic traffic generator and an EVE-NG-based testbed that together offer a realistic training and evaluation environment for beyond-5G/6G scenarios. The next section presents our proposed framework designed to bridge these gaps.
3. Proposed Method
QoS-aware resource allocation across network slices is essential to meet heterogeneous SLA targets in 5G/6G, and SDN programmability is a practical enabler for such control [
23]. We instantiate these principles in an SDN-enabled 6G architecture with multiple slices, where programmable SDN switches support per-queue scheduling (
Figure 1). Each slice is an isolated logical network tailored to a service type (e.g., URLLC, eMBB, mMTC). Within each slice, one or more logical egress queues correspond to flows or service classes, enabling fine-grained control. Centralized SDN controllers provide programmability, dynamic allocation, and real-time reconfiguration via APIs, such as OpenFlow/OVSDB, allowing end-to-end QoS enforcement across access, transport, and core domains [
23].
3.1. System Architecture
We propose a feedback-driven control loop that combines reinforcement learning (RL) with SDN for real-time, queue-level bandwidth allocation. Each control cycle consists of (i) observing per-queue metrics, (ii) computing bandwidth allocations, (iii) enforcing them via the SDN control plane, and (iv) updating the policy based on observed QoS (
Figure 2). This closed loop adapts to traffic dynamics and service demands while preserving slice-level QoS.
Figure 2 illustrates the architecture of the proposed resource scheduling framework. It consists of the following main components:
Synthetic Traffic Generator—emulates realistic, stochastic traffic conditions for training and evaluation.
SDN Infrastructure—provides programmable per-queue bandwidth control via OpenFlow-compatible switches and routers.
The Abstract SDN Interface connects the SDN system with the RL agent. It collects per-queue telemetry (such as throughput, packet drops, and latency) from the controller using standardized APIs and transforms these data into state vectors for the agent, while the agent's decisions are enforced through OpenFlow or OVSDB commands. This interface decouples the learning logic from the SDN implementation, ensuring flexible and real-time data exchange.
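To make this contract concrete, the following minimal Python sketch shows one possible shape of such an interface. The class, method, and field names (AbstractSDNInterface, poll_queue_stats, apply_allocation, QueueStats) are illustrative assumptions rather than the framework's actual API; a concrete subclass would bind them to a specific controller via OpenFlow or OVSDB.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class QueueStats:
    bytes_sent: int
    bytes_received: int
    dropped: int
    avg_latency_ms: float

class AbstractSDNInterface(ABC):
    """Decouples the RL agent from the concrete SDN backend; names are illustrative."""

    @abstractmethod
    def poll_queue_stats(self) -> Dict[str, QueueStats]:
        """Fetch per-queue telemetry from the SDN controller."""

    @abstractmethod
    def apply_allocation(self, rates_bps: Dict[str, int]) -> None:
        """Push per-queue rate settings to the data plane (e.g., via OVSDB)."""

    def build_state(self) -> List[float]:
        """Flatten the telemetry into the state vector consumed by the agent."""
        stats = self.poll_queue_stats()
        state: List[float] = []
        for qid in sorted(stats):
            s = stats[qid]
            state.extend([float(s.bytes_sent), float(s.bytes_received),
                          float(s.dropped), s.avg_latency_ms])
        return state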
The monitoring interface provides three primary QoS metrics for each queue in a slice: packet loss rate, average queuing latency, and throughput. Low packet loss and minimal latency are critical for time-sensitive applications, while throughput ensures overall service efficiency. These metrics form the state vector $s_t$, which is defined as
$$s_t = \left\{ \left( b^{\mathrm{tx}}_{t,i},\ b^{\mathrm{rx}}_{t,i},\ d_{t,i},\ l_{t,i} \right) \right\}_{i=1}^{n},$$
where
$s_t$—state at time step t;
$b^{\mathrm{tx}}_{t,i}$—bytes sent;
$b^{\mathrm{rx}}_{t,i}$—bytes received;
$d_{t,i}$—packet drop count;
$l_{t,i}$—average queue latency;
i—queue index;
n—total number of monitored queues.
This fine-grained observability allows for rapid adaptation to micro-level variations in traffic and service conditions.
We identified the packet drop count ($d_{t,i}$) and the average queue latency ($l_{t,i}$) as key QoS metrics for capturing congestion and responsiveness at the queue level. Packet drops indicate buffer overflows and insufficient bandwidth allocation, while queue latency reflects cumulative delays caused by resource contention and scheduling inefficiencies. Together, these metrics provide a compact yet comprehensive representation of the network state that correlates strongly with end-to-end service degradation under dynamic load conditions. Although jitter (latency variation) is also a relevant QoS parameter for time-sensitive applications, it was excluded from the state vector for two main reasons: (i) jitter can be derived from latency dynamics and is therefore a correlated secondary measure, and (ii) including it would increase the dimensionality of the state space without yielding a proportional improvement in learning stability or policy convergence. Focusing on packet drop count and average queue latency thus ensures a balanced trade-off between observability, computational efficiency, and model convergence in reinforcement learning-based scheduling.
The reward engine evaluates the effectiveness of each action taken by the RL agent by computing a scalar reward signal from the observed QoS metrics. The reward is defined as
$$r_t = \begin{cases} +1, & \text{if } d_{t,i} = 0 \text{ and } l_{t,i} = 0 \ \ \forall i, \\ -1, & \text{if } d_{t,i} > 0 \text{ for any } i, \\ f\!\left(\bar{l}_t\right) \in (0, 1), & \text{otherwise}, \end{cases}$$
where $\bar{l}_t$ denotes the average latency across all monitored queues and $f(\cdot)$ is a monotonically decreasing mapping of that latency onto the unit interval. In other words, the reward engine returns +1 when all queues exhibit no packet loss and no latency, and −1 if any packet drops occur; this negative reward penalizes actions that cause packet loss in any queue and steers the agent toward actions that do not degrade reliability. In all other situations, the engine returns a score between 0 and 1 based on the average latency across the queues, with lower latency yielding a higher score. This design encourages the agent to be both reliable and responsive: it penalizes congestion and delays while still allowing gradual improvement under less-than-ideal conditions. The bounded reward interval also ensures numerical stability during training by preventing exploding gradients.
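As a minimal illustration of this logic, the following Python sketch computes the reward from per-queue drop counts and latencies. The 1/(1 + latency) mapping onto (0, 1) is an assumed instantiation of the decreasing function $f(\cdot)$, not necessarily the exact mapping used in our implementation.

def compute_reward(drops, latencies):
    """Sketch of the reward engine: -1 if any queue drops packets, +1 if all
    queues show zero drops and zero latency, otherwise a score in (0, 1)
    that decreases with the average latency across queues."""
    if any(d > 0 for d in drops):
        return -1.0
    avg_latency = sum(latencies) / len(latencies)
    if avg_latency == 0:
        return 1.0
    # Assumed monotonically decreasing mapping onto (0, 1).
    return 1.0 / (1.0 + avg_latency)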
The data storage layer maintains the historical interaction data between the RL agent and the SDN environment. It logs a tuple of state, action, reward, and next state, $(s_t, a_t, r_t, s_{t+1})$, at each time step, enabling both online learning and offline training through experience replay. This layer also supports logging and diagnostic tools for performance analysis and visualization. It is designed to be lightweight and modular, able to operate with various back-ends, ensuring flexibility and ease of integration with different system architectures.
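A minimal in-memory sketch of this layer is shown below, assuming a simple experience-replay buffer; in a deployed system the same interface could be backed by a database or message queue, and the class and method names are illustrative.

import random
from collections import deque

class ReplayBuffer:
    """Data-storage layer sketch: logs (state, action, reward, next_state)
    tuples and serves random mini-batches for experience replay."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def log(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))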
3.2. RL Agent Architecture
The RL agent is implemented using an actor–critic architecture. The actor learns a policy that maps observed network states to resource allocation actions, while the critic estimates the expected reward of the current state; the critic's estimates guide the actor toward the optimal policy. Both networks share the same input, a window of recent queue metrics, which enables temporal awareness. This architecture allows the agent to continuously adapt its resource allocation policy based on real-time feedback, balancing immediate QoS needs with long-term performance goals. The actions produced by the actor are
Transmitted to the SDN controller;
Translated into configuration commands (e.g., OpenFlow meter or queue settings);
Applied to the corresponding switches in the data plane;
This real-time actuation ensures low-latency responsiveness to traffic changes.
The process of training the actor–critic model can be viewed as an optimization problem with the objective of finding the policy $\pi^{*}$ that maximizes the expected cumulative discounted reward. We define the parameterized policy $\pi_{\theta}$ as
$$\pi_{\theta}(a_t \mid s_t) = P\left(A_t = a_t \mid S_t = s_t;\, \theta\right),$$
and the optimal policy $\pi^{*}$ as
$$\pi^{*} = \arg\max_{\pi_{\theta}} \; \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1}\right],$$
where $\gamma \in (0, 1]$ is the discount factor, $s_t$ is the observed state at time step t, $a_t$ is the corresponding bandwidth allocation action, and $r_{t+1}$ is the reward returned by the environment.
Both the actor and critic components are implemented as neural networks designed to process detailed queue-level state information in real time. Specifically, the first four layers of both networks are convolutional layers that extract relevant features from the input state vectors, with each layer using ReLU activation functions. For each queue requiring resource scheduling, a dedicated set of these convolutional layers is instantiated. The extracted features are then forwarded to a fully connected (dense) part of the network, which also employs ReLU activations to enhance non-linear representation capabilities.
The final output layer of the actor network has as many neurons as there are queues plus an additional neuron for unassigned resources, and it uses a Softmax activation function. This ensures that the output vector can be interpreted as normalized bandwidth allocation percentages for each queue. Meanwhile, the output layer of the critic network consists of a single neuron without an activation function, representing the estimated expected reward for the given state. A graphical representation of the architecture of the implemented actor–critic model is given in
Figure 3. This design allows the agent to continuously refine its policy based on real-time feedback, effectively balancing immediate Quality of Service (QoS) requirements with long-term network efficiency.
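The sketch below outlines this architecture in PyTorch under several stated assumptions: the number of convolutional channels, kernel sizes, dense-layer width, and observation window length are illustrative, and the per-queue feature extractors are shared between the actor and critic heads for brevity, whereas the implementation described above instantiates separate actor and critic networks.

import torch
import torch.nn as nn

class QueueFeatureExtractor(nn.Module):
    """Four 1-D convolutional layers with ReLU, instantiated once per queue.
    Channel widths and kernel sizes are illustrative assumptions."""
    def __init__(self, in_features: int = 4, window: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_features, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.out_dim = 32 * window

    def forward(self, x):  # x: (batch, features, window)
        return self.net(x)

class ActorCritic(nn.Module):
    """Actor outputs normalized bandwidth shares for each queue plus one
    'unassigned' slot (Softmax); the critic outputs a scalar value estimate."""
    def __init__(self, n_queues: int, window: int = 8):
        super().__init__()
        self.extractors = nn.ModuleList(
            [QueueFeatureExtractor(window=window) for _ in range(n_queues)])
        hidden = self.extractors[0].out_dim * n_queues
        self.actor_head = nn.Sequential(
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, n_queues + 1), nn.Softmax(dim=-1))
        self.critic_head = nn.Sequential(
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, per_queue_obs):
        # per_queue_obs: list of (batch, features, window) tensors, one per queue
        feats = torch.cat(
            [f(x) for f, x in zip(self.extractors, per_queue_obs)], dim=-1)
        return self.actor_head(feats), self.critic_head(feats)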
The training of the actor and critic networks is implemented using the following Algorithm 1:
| Algorithm 1 Actor–critic training loop |
Require: policy network πθ, critic network Vw, environment env
1: θ, w ⇐ rand(); γ ⇐ (0, 1]; α ⇐ (0, 1]; β ⇐ (0, 1]
2: for episode = 1 to … do
3:   for t = 1 to … do
4:     St ⇐ nextState(env)
5:     At ⇐ πθ(St)
6:     St+1, rt+1 ⇐ env(At)
7:     δt ⇐ rt+1 + γ·Vw(St+1) − Vw(St)
8:     w ⇐ w + β·δt·∇w Vw(St)
9:     θ ⇐ θ + α·δt·∇θ log πθ(At | St)
10:   end for
11: end for
The training process continues until the agent's performance stabilizes, defined as the point at which the moving average of cumulative rewards over the last 100 episodes changes by less than 1% for 20 consecutive iterations. This criterion helps prevent overtraining. Hyperparameters were chosen based on empirical tuning and previous research on actor–critic models for network control. The actor and critic weights, θ and w, respectively, were updated using the Adam optimization algorithm. The actor's learning rate α and the critic's learning rate β were both set to 0.01 to balance convergence speed and stability, and the discount factor γ was set to 0.85 to prioritize long-term QoS improvements over short-term gains. These values yielded the most stable learning dynamics and fastest convergence in preliminary experiments using the stochastic SDN environment.
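The stopping criterion can be expressed compactly in code. The following Python sketch tracks the moving average of cumulative episode rewards over the last 100 episodes and declares convergence once the relative change stays below 1% for 20 consecutive episodes; the helper name and closure-based structure are illustrative.

from collections import deque

def make_convergence_check(window: int = 100, tol: float = 0.01, patience: int = 20):
    """Returns a callable fed the cumulative reward of each finished episode;
    it reports True once the moving average has changed by less than `tol`
    for `patience` consecutive episodes."""
    rewards = deque(maxlen=window)
    state = {"prev_avg": None, "stable": 0}

    def converged(episode_reward: float) -> bool:
        rewards.append(episode_reward)
        if len(rewards) < window:
            return False
        avg = sum(rewards) / len(rewards)
        prev = state["prev_avg"]
        if prev is not None and prev != 0 and abs(avg - prev) / abs(prev) < tol:
            state["stable"] += 1
        else:
            state["stable"] = 0
        state["prev_avg"] = avg
        return state["stable"] >= patience

    return converged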
4. Evaluation
4.1. Training Data
Training data for the RL agent are generated using a custom SDN environment simulator, which allows large amounts of data to be created without a complete physical SDN setup or a resource-intensive network simulator. The generator simulates stochastic queue behaviors that closely mirror real-world SDN traffic dynamics, including variations in resource usage and QoS performance. It operates step by step, giving precise control over the simulation resolution: during each step, every queue can be updated to reflect changes before the environment advances, allowing the RL agent and the simulated environment to interact in lockstep. This design supports the closed-loop training process by providing timely feedback in the form of updated states, actions, and rewards.
For each logical queue, an individual data generator produces usage patterns by sampling from one uniform distribution and two Gamma distributions: the connection count change distribution and the packet size distribution. In each simulation step, the uniform distribution is sampled to determine whether the number of active connections changes. If the sampled value exceeds a predefined threshold, a new value is drawn from the connection count change distribution and randomly multiplied by −1, 0, or 1 to represent a decrease, no change, or an increase in connections. If the sampled value is instead below the threshold, the threshold is reduced by a multiplicative decay factor, which incrementally increases the probability of future changes and reflects temporal traffic variability.
The total number of active connections is then updated accordingly. Based on this number, an array of packet sizes is generated, where each element is a sample from the packet size distribution. The number of dropped packets is computed by taking the cumulative sum of packet sizes and counting the packets for which this sum exceeds the available bandwidth of the assigned queue. Each packet's latency is estimated by dividing its cumulative size by the queue's bandwidth and scaling by the step duration dt, and the overall latency for the step is computed as the average of the per-packet latencies.
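A condensed sketch of one generator step is given below, under stated assumptions: the Gamma shape and scale values, the initial connection count, the decay value, the reset of the threshold after a change, and the choice of one packet per active connection are illustrative, since the exact parameterization is not fixed by the description above.

import numpy as np

class QueueTrafficGenerator:
    """Per-queue stochastic generator sketch: a uniform draw decides whether
    the connection count changes; packet sizes follow a Gamma distribution;
    drops and latency are derived from the cumulative packet size."""
    def __init__(self, threshold=0.9, decay=0.95, dt=0.1,
                 conn_shape=2.0, conn_scale=2.0,
                 pkt_shape=2.0, pkt_scale=600.0, seed=None):
        self.rng = np.random.default_rng(seed)
        self.threshold, self.decay, self.dt = threshold, decay, dt
        self.conn_shape, self.conn_scale = conn_shape, conn_scale
        self.pkt_shape, self.pkt_scale = pkt_shape, pkt_scale
        self.connections = 10           # assumed initial load
        self._thr = threshold

    def step(self, bandwidth_bytes: float):
        # Decide whether the number of active connections changes this step.
        if self.rng.uniform() > self._thr:
            delta = self.rng.gamma(self.conn_shape, self.conn_scale)
            self.connections = max(
                0, self.connections + int(round(delta)) * self.rng.choice([-1, 0, 1]))
            self._thr = self.threshold  # reset after a change (assumption)
        else:
            self._thr *= self.decay     # future changes become more likely
        # One packet per active connection, sizes drawn from the Gamma distribution.
        sizes = self.rng.gamma(self.pkt_shape, self.pkt_scale, self.connections)
        cumulative = np.cumsum(sizes)
        dropped = int(np.sum(cumulative > bandwidth_bytes))
        latency = float(np.mean(cumulative / bandwidth_bytes * self.dt)) if self.connections else 0.0
        return dropped, latency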
This simulation methodology allows for the flexible creation of diverse usage scenarios by fine-tuning the distribution parameters and the control variables, such as the threshold value, the decay factor, and the step duration dt. It supports multi-queue setups in which each queue has its own generator within a shared training environment, so the RL agent can be trained under dynamic traffic conditions, leading to more adaptive resource allocation policies.
4.2. Testbed Setup
We validate the proposed approach in a virtualized multi-slice testbed built on the EVE-NG platform [
7,
41,
42], designed to emulate the control and dataplane characteristics of beyond-5G/6G deployments (
Figure 4). The environment instantiates three logically isolated slices, each governed by its own SDN controller responsible for intra-slice routing, access control, and packet management. To increase realism, each user can subscribe to multiple service offerings simultaneously and thus consume services from different providers in parallel. At the subscriber edge, each user equipment (UE) hosts an SDN-capable switch with a 6G modem and exposes dedicated uplink and downlink queues per subscribed slice, enabling fine-grained QoS enforcement.
The BasestationSDN is an SDN-enabled base station that aggregates traffic from multiple user devices and supports per-slice custom scheduling and rate limiting. It connects through the distribution network to a centralized QoSController that acts as the infrastructure-level orchestrator and hosts the RL agent. The QoSController collects slice-level telemetry from the UE and BasestationSDN switches, computes bandwidth allocations using the RL algorithm, and enforces rate settings over the SDN control channel, sending the optimized allocation settings back to the BasestationSDN and the UEs to manage both downstream and upstream traffic.
Our goal is to simulate realistic and time-varying wireless channel conditions. For this reason, a configurable traffic mechanism is inserted between the BasestationSDN and UE, artificially constraining available capacity, latency, jitter, and packet loss. This mechanism induces controlled congestion and can emulate link outages, thereby closely mimicking the variable interface capacity in operational 6G deployments.
For algorithm input, the system monitors parameters such as maximum slice queue size, current bandwidth load, number of dropped packets, and end-to-end communication delay for each slice individually (we calculate the communication delay in each slice on the QoSController through the POX SDN controller module [
43]). Then, we aggregate data for each slice so that all the necessary input data are collected and stored in one location.
Queue state information is retrieved from the BasestationSDN and delivered to the QoSController (Listing 1), where data are aggregated and processed. The computed output defines the maximum queue size each slice should utilize under the constraint of total available bandwidth. These parameters are then applied at the BasestationSDN via the OVSDB protocol, ensuring dynamic and adaptive QoS enforcement across all slices (for remote OpenVSwitch management [
44]).
| Listing 1. Queue input parameters for RL-based QoS control. |
{"queue": "1:c", "data": {"sent": 5,416,166,310, "bytes": 4,983,710, "dropped": 0}} {"queue": "1:d", "data": {"sent": 6,046,803,152, "bytes": 5,564,520, "dropped": 0}} {"queue": "1:e", "data": {"sent": 5,047,958,851, "bytes": 4,646,440, "dropped": 0}} |
4.3. Baseline Model
To evaluate the results of our approach, we provide a baseline resource scheduling model for comparison. The baseline model is a sliding window neural network model trained on the same data sample as our proposed model. The model takes a single window of predefined size and outputs a vector of values corresponding to resource allocation percentages. The size of the vector is equal to the number of SDN nodes within the simulation environment; in our case, these were three individual nodes. The output of the model is a vector containing percentages of bandwidth assigned to each slice. The model was trained using backpropagation and the Adam optimization algorithm with a learning rate of 0.001. The mean squared error was used to calculate loss. The same dataset was used to train both the baseline and RL model; however, for training the baseline model, the dataset was split into randomly chosen subsets and shuffled in each epoch.
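For reproducibility, a minimal PyTorch sketch of such a sliding-window baseline is shown below. The hidden-layer sizes, the window length, the number of features per node, and the Softmax output normalization are assumptions, while the MSE loss and the Adam optimizer with a learning rate of 0.001 follow the description above.

import torch
import torch.nn as nn

class SlidingWindowBaseline(nn.Module):
    """Baseline sketch: maps a flattened window of recent queue metrics to
    bandwidth-share predictions for the three SDN nodes."""
    def __init__(self, window: int = 8, features_per_node: int = 4, n_nodes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * features_per_node * n_nodes, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_nodes), nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)

model = SlidingWindowBaseline()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()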
Figure 5 shows that the baseline model reached convergence after 1400 epochs. Both models were tested in the previously described testbed environment.
4.4. Evaluation Setup
The testbed is set up to have a limited total amount of bandwidth that it can allocate to each slice defined within it. The traffic in the testbed is generated in the same stochastic manner by sampling the same distributions used for generating training data, as described in
Section 4.1. The sampled values are then used to construct UDP packets that are sent to the appropriate queues. Unlike during training, the latency and packet loss are calculated by the QoS controller. To facilitate seamless communication between ML models and the testbed, we also implemented an abstract SDN interface and a data storage layer, as described in
Section 4.2. The data storage layer was implemented as a simple server that handles requests to store data generated by the testbed and serves requested data to the ML models. Each data point generated by the testbed is stored in a relational database as a tuple indexed by q and t, where q is the queue's unique identifier and t is a Unix timestamp.
Besides storing the time series data, the server also stores the environment metadata, such as the total bandwidth available to the environment, the configurations of each slice at each timestep, and the QoS value at each timestep. The most important function of the abstract SDN interface is to allow ML algorithms to allocate bandwidth for each queue.
4.5. Results
We evaluate the performance of the baseline model and our model with respect to training speed and resource scheduling performance. The training speed is evaluated by comparing the cumulative reward received from the environment during 1000 epochs. We estimate the reward for both models as described in
Section 3.1.
Figure 6 shows the cumulative reward trends for both models over 1000 training epochs. The RL-based scheduler shows increased rewards after 240 epochs, leading to faster convergence and better policy optimization. In comparison, the baseline model has a nearly linear reward progression, indicating its limited capacity to adapt to dynamic state changes. This highlights that the actor–critic structure enhances learning efficiency and ensures more stable policy updates in stochastic traffic conditions.
The performance of resource scheduling is evaluated by comparing the average reward, the cumulative number of dropped packets, and the amount of overestimated bandwidth per queue. The models are evaluated in five different scenarios. Although the last metric is less critical than the number of dropped packets, it provides insight into how differently the baseline model and our model behave in scenarios where resources are abundant. For each of these metrics, fully trained models running in inference mode are evaluated. Additionally, in the case of the actor–critic model, only the actor network was used for evaluation, since the critic network is relevant only during training. Both models were run three times for 1000 time steps, and the average over all runs was calculated (an illustrative aggregation sketch is given after the scenario list below).
Scenario 1—slowly changing resource demand patterns, demand cannot be greater than the total resources available, no network outages.
Scenario 2—rapid resource demand changes, very stochastic, demand cannot be greater than the total resources available, no network outages.
Scenario 3—slowly changing resource demand patterns, the total available resources are less than the total possible demand, no network outages.
Scenario 4—slowly changing resource demand patterns, demand cannot be greater than the total resources available, common network outages.
Scenario 5—rapid resource demand changes, very stochastic, demand cannot be greater than the total resources available, common network outages.
The last two scenarios are intended to evaluate the behavior of models when some of the nodes are unavailable, and their resources can be reallocated to other nodes.
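As a minimal illustration of the metric aggregation described above, the following sketch averages the per-run statistics over the repeated inference runs; the log structure (per-step reward, dropped-packet count, allocated and demanded bandwidth) and the use of the clipped surplus as the overestimation measure are assumptions.

import numpy as np

def aggregate_runs(run_logs):
    """Average the three evaluation metrics over repeated inference runs
    (three runs of 1000 steps in our setup)."""
    rewards, drops, overestimation = [], [], []
    for log in run_logs:
        rewards.append(np.mean(log["reward"]))
        drops.append(np.sum(log["dropped"]))
        surplus = np.asarray(log["allocated"]) - np.asarray(log["demanded"])
        overestimation.append(np.mean(np.clip(surplus, 0, None)))
    return {
        "avg_reward": float(np.mean(rewards)),
        "cumulative_drops": float(np.mean(drops)),
        "avg_overestimated_bw": float(np.mean(overestimation)),
    }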
The RL model consistently earns higher average rewards in all five scenarios, as shown in
Figure 7. The difference in performance grows in situations with quickly changing resource needs (Scenarios 2 and 5), showing that the RL agent handles variable traffic better. In situations with limited resources (Scenario 3), both models perform similarly. This is mainly due to the fact that there is an insufficient amount of resources to allocate to each queue; thus, when the sum of all resources required by queues exceeds the total available resources, both models are unable to properly distribute resources and do not receive full rewards. Any apparent advantage of one model over the other in this scenario is mainly due to the randomness of generated evaluation data. Under conditions of node outage (Scenarios 4 and 5), both models struggle, but the RL agent recovers faster once the network becomes stable.
Figure 8 shows the number of dropped packets after 1000 time steps. The RL scheduler keeps the packet drop count low in all scenarios, consistent with its reward design, which penalizes congestion. In contrast, the baseline model exhibits higher variability and more frequent spikes in packet loss, indicating that it reacts to congestion rather than anticipating it under changing traffic loads.
Figure 9 shows the bandwidth allocation for a single node in Scenario 5, which involves sudden demand changes and node failures. The baseline model tends to overestimate resources after sudden demand increases, while the RL model makes more gradual adjustments with fewer overestimations. However, both models misallocate resources during the outage between time steps 400 and 700, indicating that fault awareness remains a challenge for future work.
Figure 10 shows the accuracy of the resource allocation predictions by comparing the predicted values with the actual ones; a smaller difference indicates higher accuracy. The RL-based scheduler demonstrates smaller errors and more consistent performance over time, effectively capturing traffic variation dependencies. This result complements the findings from
Figure 6 and
Figure 7, reinforcing the stability and precision of the proposed method.
4.6. Discussion
Although the current prototype assumes a globally observable environment, the proposed MARL framework can be feasibly deployed in SDN-enabled 6G networks under a centralized-training and distributed-execution paradigm. In real deployments, rewards can be computed by SDN control entities, such as the QoSController, ensuring scalability and minimal signaling overhead. Each local agent operates on partial observations, while the controller aggregates per-queue telemetry and delivers reward signals through lightweight channels (e.g., OpenFlow statistics). Similar to the approach of Miuccio et al. [
45], this architecture supports MARL-based coordination under constrained observability and enables near-real-time adaptability in operational 6G environments.
5. Conclusions
In this paper, we presented a reinforcement learning framework for queue-level resource scheduling in SDN-enabled 6G networks. The proposed actor–critic model achieves faster convergence, higher cumulative rewards, and reduced packet loss compared to a baseline forecasting model, particularly under dynamic traffic conditions. The main contribution is a novel queue-level actor–critic RL approach that focuses on micro-level queue dynamics for more precise and adaptive network control. On the practical side, the work delivers a modular plug-and-play environment that allows for the seamless integration of different ML algorithms with SDN controllers via an abstract interface, bridging simulation and testing through a stochastic traffic generator and a virtualized EVE-NG testbed.
The current framework relies on centralized control, which limits scalability and fault tolerance. Future research will focus on extending the system with multi-agent and federated RL to enable decentralized learning and distributed control. Integrating real network trace data and developing self-healing mechanisms represent promising directions for further improving adaptability and resilience in real-world 6G deployments.
Overall, this study contributes both theoretically and practically to the development of intelligent, fine-grained, and adaptive resource management for next-generation SDN-enabled 6G networks.