Slice-Aware and Computationally Efficient Resource Orchestration for Converged mmWave–PON O-RAN: A Reward-Shaped PPO Approach for Joint DBA and PRB Allocation

Shezi, Nokwanda; Nleya, Bakhe; Pule, Beverly

doi:10.3390/telecom7030075


by Nokwanda Shezi, Bakhe Nleya * and Beverly Pule

Department of Electronic and Computer Engineering, Faculty of Engineering and the Built Environment, Durban University of Technology, Durban 4001, South Africa

* Author to whom correspondence should be addressed.

Telecom 2026, 7(3), 75; https://doi.org/10.3390/telecom7030075

Submission received: 18 April 2026 / Revised: 22 May 2026 / Accepted: 26 May 2026 / Published: 9 June 2026

Abstract

Converging millimetre-wave (mmWave) radio access with passive optical network (PON) fronthaul under the Open RAN (O-RAN) architecture promises unprecedented capacity for beyond-5G and 6G systems. Yet today, dynamic bandwidth allocation (DBA) in the PON and physical resource block (PRB) scheduling in the mmWave RAN operate independently, a critical design flaw that causes severe latency accumulation, resource fragmentation, and consistent failure to meet the divergent quality-of-service requirements of network slices. This paper breaks that deadlock by introducing the first slice-aware, computationally efficient orchestration framework that jointly optimises DBA and PRB allocation in a converged mmWave-PON O-RAN. We formulate the problem as a constrained Markov decision process (CMDP) with explicit latency, reliability, and throughput constraints for URLLC, eMBB, and mMTC slices. The core technical advance is a reward-shaped proximal policy optimisation (RS-PPO) algorithm whose potential-based shaping function directly penalises DBA–PRB misalignment and provides dense feedback on queue build-up, accelerating learning without compromising optimality. To make this work in near-real time on the O-RAN RIC, we embed three complementary efficiency engines: graph convolutional network (GCN) state abstraction, action masking, and prioritised N-step replay. Extensive 3GPP-compliant simulations show that RS-PPO slashes URLLC end-to-end latency by 37% (from 1.38 ms to 0.87 ms), boosts PRB utilisation by 28% in relative terms (from 68% to 87%), and delivers 99.999% reliability, all while converging 45% faster and cutting inference time by 45% (from 4.2 ms to 2.3 ms). The result is a sub-5 ms control cycle, compatible with O-RAN specifications and deployable as an xApp on the near-RT RIC. Our framework closes a long-standing coordination gap left unresolved by prior art, enabling true slice-aware convergence between the optical and wireless domains.

Keywords:

O-RAN; mmWave; PON; dynamic bandwidth allocation; resource orchestration; reward shaping; proximal policy optimization; network slicing

1. Introduction

The Open Radio Access Network (O-RAN) paradigm disaggregates traditional base stations into Radio Units (RUs), Distributed Units (DUs), and Central Units (CUs), while introducing standardised interfaces for intelligent control via the RAN Intelligent Controller (RIC) [1]. This architecture enables multi-vendor interoperability, accelerates innovation through xApps, and enhances network adaptability. The rapid increase in data traffic, along with the development of latency-sensitive applications such as autonomous driving, augmented reality, and the tactile internet, necessitates both ultra-high capacity and ultra-reliable low-latency communication (URLLC). Millimetre-wave (mmWave) bands (24–100 GHz) provide the required bandwidth; however, they are characterized by significant path loss, vulnerability to blockage, and increased power consumption [2]. To interconnect dense mmWave small cells efficiently, passive optical networks (PONs) are widely adopted as a fronthaul solution due to their high bandwidth, low power consumption, and point-to-multipoint topology [3].

Despite the individual merits of mmWave and PON, their convergence creates a critical coordination gap. The mmWave RAN operates with transmission time intervals (TTIs) of 1 ms or less. The PON, in contrast, relies on dynamic bandwidth allocation (DBA) cycles that range from 125 µs (GPON) to 1 ms (XGS-PON). When a packet arrives at an optical network unit (ONU) immediately after a DBA grant, it may experience a delay of nearly an entire cycle before upstream transmission begins. This situation directly violates URLLC’s strict 1 ms latency target [4]. Conventional DBA algorithms (gated, fixed-cycle) ignore radio-layer scheduling. Prevalent physical resource block (PRB) schedulers (proportional fair, round-robin) disregard PON buffer status. This lack of cross-domain coordination produces three compounding inefficiencies. First, latency accumulates because DBA scheduling delays add to radio transmission delays. Second, resource fragmentation occurs because PRBs may be allocated to an RU whose associated ONU lacks a sufficient upstream grant. Third, grant underutilisation happens because DBA grants may be issued for ONUs that will receive little or no radio traffic in the upcoming TTI.

Network slicing magnifies these problems: A URLLC slice demands sub-millisecond latency and 99.999% reliability. An enhanced mobile broadband (eMBB) slice requires high data rates (e.g., 100 Mbps per user). A massive machine-type communications (mMTC) slice must support numerous connections with low energy consumption [5]. Meeting these diverse service-level agreements (SLAs) in a converged mmWave-PON O-RAN environment requires a unified orchestration system. It must jointly manage radio and optical resources while adhering to slice-specific constraints. Independent DBA and PRB scheduler operation is even more problematic under slicing. A misalignment tolerable for eMBB can directly violate URLLC’s latency and reliability targets. Inefficient PRB usage can prevent eMBB from achieving its required throughput.

Overall, resource management in O-RAN has been extensively studied, but almost all existing solutions treat the fronthaul as an ideal, zero-delay pipe. The DORA framework [6] employs proximal policy optimisation (PPO) for dynamic PRB allocation across slices, assuming an ideal fronthaul without PON constraints. The PandORA system [7] automates the design of DRL agents for O-RAN and shows that reward shaping can boost performance, yet it focuses on radio-side decisions. For slicing, the MARSRA xApp [8] uses Soft Actor-Critic for joint user assignment and PRB allocation, and an O-RAN-enabled intelligent slicing solution [9] adopts a two-level DRL structure. However, all these works treat the transport network as a passive backhaul. The integration of PON with 5G RAN has been standardised through the Cooperative Transport Interface (CTI) [3]. A recent demonstration [4] showed that CTI can reduce latency, but the coordination remains heuristic. A Cooperative DBA scheme that exploits TDD pattern information [10] reduces latency without slice awareness. Energy-efficient integrated O-RAN/PON access networks [11] focus on sleep modes rather than joint scheduling. Deep reinforcement learning has proven effective for complex resource allocation [12], and PPO is widely adopted for its stability and data efficiency [13]. Reward shaping has been applied in wireless networks [14] and edge computing [15]. Computational efficiency techniques such as graph neural networks for state abstraction [7] and prioritised experience replay [16] have been used individually. Nevertheless, to the best of our knowledge, no prior work provides a unified, slice-aware, and computationally efficient DRL framework for joint DBA–PRB orchestration in converged mmWave-PON O-RAN. 
Specifically, four critical gaps remain: (1) existing solutions ignore PON constraints or treat them independently, (2) reward shaping has not been designed to penalise DBA–PRB misalignment explicitly, (3) the temporal asynchrony between mmWave TTIs and PON DBA cycles is not captured in any DRL formulation, and (4) no prior work integrates state abstraction, action masking, and prioritised replay into a single framework that can operate within the sub-5 ms control cycle required by the near-RT RIC.

This paper closes those gaps. We propose a slice-aware, computationally efficient resource orchestration framework that jointly optimises DBA and PRB allocation in a converged mmWave-PON O-RAN. The problem is formulated as a constrained Markov decision process (CMDP) with explicit slice-specific latency, reliability, and throughput constraints. A reward-shaped proximal policy optimisation (RS-PPO) algorithm is developed, augmenting the reward function with potential-based shaping terms that explicitly penalise DBA–PRB misalignment and queue buildup, thereby guiding the agent toward QoS-satisfying policies without altering the optimal policy [17]. To ensure computational efficiency suitable for near-real-time operation on the O-RAN RIC, the framework incorporates three complementary enhancements: graph convolutional network (GCN) state abstraction to reduce state dimensionality, action masking to prune invalid actions, and prioritised N-step replay to accelerate learning.

The main contributions of this paper are fourfold:

Firstly, we develop a detailed system model that captures mmWave channel dynamics (including blockage), PON DBA cycles, ONU buffer evolution, and slice-specific traffic characteristics (URLLC, eMBB, mMTC).
Secondly, we design a potential-based reward shaping function that penalises DBA–PRB misalignment and provide a proof that the shaped reward preserves the optimal policy.
Thirdly, we introduce three computational efficiency enhancements—GCN state abstraction, action masking, and prioritised N-step replay—that together reduce inference time and accelerate convergence.
Fourthly, through extensive simulations using 3GPP-compliant mmWave channel models, GPON DBA dynamics, and mixed slice traffic, we demonstrate that RS-PPO reduces URLLC end-to-end latency by 37% (from 1.38 ms to 0.87 ms), improves PRB utilisation by 28% (from 68% to 87%), achieves 99.999% reliability for URLLC slices, converges 45% faster, and reduces inference time by 45% (from 4.2 ms to 2.3 ms). The framework is compatible with O-RAN specifications and can be deployed as an xApp on the near-real-time RIC.

The remainder of the paper is organised as follows. Section 2 reviews related work. Section 3 details the system model for the converged mmWave-PON O-RAN. Section 4 formulates the joint DBA–PRB allocation problem as a CMDP and describes the RS-PPO algorithm, including the reward shaping design. Section 5 presents the computational efficiency enhancements. Section 6 reports the simulation setup and results. Section 7 concludes the paper.

2. Related Work

Resource management in converged mmWave-PON O-RAN spans three interconnected domains: O-RAN resource allocation and slicing, PON-RAN integration, and deep reinforcement learning (DRL) techniques for network optimization. This section critically examines prior work in each domain and concludes by identifying the specific gap that our framework fills.

2.1. O-RAN Resource Allocation and Network Slicing

Several DRL-based frameworks have been proposed for dynamic resource allocation in O-RAN, but almost all treat the fronthaul as an ideal, zero-delay pipe. The DORA framework [6] employs proximal policy optimisation (PPO) for dynamic PRB allocation across URLLC, eMBB, and mMTC slices, demonstrating the viability of on-policy DRL for slice-aware scheduling. However, DORA assumes a perfect fronthaul and therefore cannot capture the latency penalties imposed by PON DBA cycles. The PandORA system [7] automates the design of DRL agents for O-RAN; it benchmarks 23 xApps and shows that reward shaping can boost performance. Nevertheless, PandORA focuses on radio-side decisions and does not consider optical transport constraints. For slicing-specific solutions, the MARSRA xApp [8] uses Soft Actor-Critic for joint user assignment and PRB allocation, achieving a 22% improvement in user satisfaction. Another O-RAN-enabled intelligent slicing solution [9] adopts a two-level scheduling structure with DRL at the inter-slice level. While these works advance slice-aware radio resource management, they share a critical blind spot: the transport network remains a passive backhaul. In a converged mmWave-PON deployment, this assumption is invalid because PON DBA cycles directly impact end-to-end latency and can nullify even the most sophisticated radio scheduler.

2.2. PON-RAN Integration and Cooperative Transport

The integration of PON with 5G RAN has been standardised through the Cooperative Transport Interface (CTI) [3]. A recent proof-of-concept demonstration [4] showed that CTI can reduce latency by enabling limited information exchange between the OLT and the DU. However, the coordination remains heuristic—rule-based and static—rather than adaptive to dynamic traffic conditions or slice requirements. A Cooperative DBA scheme that exploits TDD pattern information [10] reduces latency in time-division duplexing systems, but it lacks slice awareness and does not optimise PRB allocation jointly. Energy-efficient integrated O-RAN/PON access networks have been studied [11], yet the focus there is on optical sleep modes to save power, not on joint scheduling for latency or throughput. These works confirm that cross-domain coordination is beneficial, but they stop at heuristic or single-domain improvements. None of them formulate the joint DBA–PRB allocation problem as a learning task, nor do they provide a mechanism to satisfy heterogeneous slice SLAs simultaneously.

2.3. DRL, Reward Shaping, and Computational Efficiency

Deep reinforcement learning has proven effective for complex resource allocation in wireless networks [12]. Among DRL algorithms, proximal policy optimisation (PPO) is widely adopted for its stability, data efficiency, and ease of hyperparameter tuning [13]. Reward shaping, particularly potential-based shaping [17], has been shown to accelerate learning without altering the optimal policy. In wireless contexts, reward shaping has been applied to intelligent jamming decision-making [14] and edge resource management with graph convolutional transformers [15]. However, no prior work has designed a shaping function specifically to penalise DBA–PRB misalignment—the core friction point in converged mmWave-PON systems. Computational efficiency techniques such as graph neural networks for state abstraction [15] and prioritised experience replay [16] have been explored individually. Yet, to the best of our knowledge, no existing framework integrates GCN-based state abstraction, action masking, and prioritised N-step replay into a single DRL agent that runs within the sub-5 ms control cycle required by the near-RT RIC.

2.4. Differentiation from PandORA and GNN-DRL Hybrids

PandORA [7] automates DRL agent design for O-RAN and demonstrates reward shaping, but it does not model PON constraints nor jointly optimise DBA and PRB allocation. Its reward shaping is generic (task-agnostic), whereas our potential-based shaping is specifically designed to penalise DBA–PRB misalignment. Other GNN-DRL works [15] apply graph neural networks to edge computing resource allocation, but they do not address optical-wireless convergence nor the temporal asynchrony between mmWave TTIs and PON DBA cycles. Moreover, none of these prior works integrate action masking and prioritised N-step replay to achieve sub-5 ms inference, a requirement for near-RT RIC deployment. Thus, RS-PPO is the first framework to combine a domain-specific misalignment penalty, GCN state abstraction, and near-real-time inference for converged mmWave-PON O-RAN.

2.5. Synthesis and Identified Gap

Taken together, the literature reveals a clear divide: radio-centric O-RAN slicing solutions ignore PON constraints; PON-RAN integration works are heuristic and not slice-aware; and DRL-based optimisations have not been tailored to the specific misalignment between mmWave TTIs and PON DBA cycles. Consequently, there is no unified, slice-aware, and computationally efficient DRL framework for joint DBA–PRB orchestration in converged mmWave-PON O-RAN. Our work directly addresses this gap by formulating a CMDP with slice-specific constraints, designing a reward-shaped PPO that penalises misalignment, and integrating three computational efficiency enhancements that together enable near-real-time deployment on the O-RAN RIC.

Table 1 summarises the capabilities and limitations of representative prior works compared to our proposed RS-PPO framework.

3. System Model for Converged mmWave–PON O-RAN

This section presents the mathematical and architectural models that underpin the joint DBA–PRB optimisation problem. We first describe the O-RAN deployment with mmWave RUs and PON fronthaul, including the control plane interfaces and temporal dynamics. Then, we detail the mmWave radio access model (channel, SINR, achievable rates, PRB allocation), followed by the PON fronthaul model (DBA cycles, ONU buffer dynamics, grant allocation). Finally, we specify the traffic models for URLLC, eMBB, and mMTC slices. Table 2 summarises the principal notation used throughout this section; additional local notations are defined where they first appear.

3.1. O-RAN Architecture with mmWave RUs and PON Fronthaul

For this model, we consider an O-RAN deployment compliant with O-RAN Alliance specifications [1]. The network consists of a set $\mathcal{R} = \{1, 2, \dots, R\}$ of mmWave radio units (RUs), each serving a coverage area with multiple user equipments (UEs). The RUs are connected to a centralized optical line terminal (OLT) via a time-division multiplexing passive optical network (TDM-PON) operating in the upstream direction. The OLT is collocated with the DU and the near-RT RIC. The RIC hosts the resource orchestration xApp, which executes the RS-PPO agent. The xApp communicates with the DU via the E2 interface and with the OLT via a PON management interface.

The system operates in discrete time steps of duration $T_{ctrl} = 10$ ms, which corresponds to 10 subframes (each 1 ms) and 80 DBA cycles (each 125 µs). At each step $t$, the xApp collects observations from the RUs (channel states, buffer occupancies, traffic demands) and from the OLT (ONU queue lengths, past grants). It then computes actions: PRB allocation for the next 10 ms and DBA grants for the next 80 cycles. The actions are enforced via the DU scheduler and the OLT DBA engine.

3.2. mmWave Radio Access Model

The mmWave RAN operates at a carrier frequency $f_{c} = 28$ GHz with total bandwidth $B_{total} = 100$ MHz, divided into $K = 50$ physical resource blocks (PRBs), each of bandwidth $B_{PRB} = 2$ MHz. The set of UEs is denoted by $\mathcal{U}$, and each UE belongs to exactly one slice $s \in \mathcal{S} = \{\text{URLLC}, \text{eMBB}, \text{mMTC}\}$. The association between UEs and RUs is assumed fixed for the simulation horizon. Let $\mathcal{U}_{r}$ denote the set of UEs served by RU $r$. The mmWave channel is characterized by path loss, shadowing, fading, and blockage events. For a UE $u$ associated with RU $r$, the received signal power on PRB $k$ at time $t$ is

$$P_{r,u,k}^{rx}(t) = P_{tx} \cdot G_{tx} \cdot G_{rx} \cdot L_{path}(d_{r,u}) \cdot \chi_{r,u}(t) \cdot \psi_{r,u,k}(t) \cdot b_{r,u}(t)$$ (1)

where $P_{tx}$ is the transmit power, $G_{tx}$ and $G_{rx}$ are the antenna gains, $L_{path}(d) = c^{2} / (16 \pi^{2} d^{2} f_{c}^{2})$ is the free-space path loss, $\chi_{r,u}(t)$ is log-normal shadowing, $\psi_{r,u,k}(t)$ is Rayleigh fading, and $b_{r,u}(t) \in \{0, 1\}$ is the blockage indicator. The SINR for UE $u$ on PRB $k$ when allocated to RU $r$ is

$$\mathrm{SINR}_{r,u,k}(t) = \frac{P_{r,u,k}^{rx}(t)}{N_{0} B_{PRB} + I_{r,u,k}(t)}$$ (2)

where $N_{0}$ is the noise power spectral density and $I_{r,u,k}(t)$ is interference from other RUs. The achievable data rate is

$$R_{r,u,k}(t) = B_{PRB} \cdot \log_{2}\left(1 + \mathrm{SINR}_{r,u,k}(t)\right)$$ (3)

Let $X(t) \in \{0, 1\}^{R \times K}$ be the PRB allocation matrix, where $x_{r,k}(t) = 1$ if PRB $k$ is assigned to RU $r$. Each PRB can be allocated to at most one RU: $\sum_{r=1}^{R} x_{r,k}(t) \leq 1$. The total rate achieved by UE $u$ is

$$R_{u}(t) = \sum_{k=1}^{K} x_{r,k}(t) \cdot R_{r,u,k}(t) \cdot \mathbb{1}_{\{\text{UE } u \text{ scheduled}\}}$$ (4)

Within each RU, PRBs are shared among UEs in a round-robin fashion over the 10 ms control period.
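As a rough illustration of Eqs. (1)–(3), the following Python sketch chains the received power, SINR, and per-PRB rate for a single link. The function names are hypothetical, and the link-budget numbers (transmit power, antenna gains, distance, noise density) are illustrative assumptions rather than the paper's simulation settings; only $f_c$ and $B_{PRB}$ come from the text.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s


def path_gain(d_m: float, fc_hz: float) -> float:
    """Free-space factor L_path(d) = c^2 / (16 pi^2 d^2 f_c^2), linear scale."""
    return C_LIGHT ** 2 / (16.0 * math.pi ** 2 * d_m ** 2 * fc_hz ** 2)


def rx_power(p_tx_w, g_tx, g_rx, d_m, fc_hz, shadowing, fading, b):
    """Eq. (1): product of linear-scale factors; b = 0 zeroes the power."""
    return p_tx_w * g_tx * g_rx * path_gain(d_m, fc_hz) * shadowing * fading * b


def prb_rate(p_rx_w, n0_w_per_hz, b_prb_hz, interference_w):
    """Eqs. (2)-(3): SINR, then the Shannon rate of one PRB in bit/s."""
    sinr = p_rx_w / (n0_w_per_hz * b_prb_hz + interference_w)
    return b_prb_hz * math.log2(1.0 + sinr)


FC_HZ = 28e9   # carrier frequency from the text
B_PRB = 2e6    # PRB bandwidth from the text
N0 = 4e-21     # assumed: roughly thermal noise density at room temperature, W/Hz

# Assumed link: 1 W transmit power, gains 100 and 10 (linear), 100 m distance.
p_clear = rx_power(1.0, 100.0, 10.0, 100.0, FC_HZ, 1.0, 1.0, 1)
p_blocked = rx_power(1.0, 100.0, 10.0, 100.0, FC_HZ, 1.0, 1.0, 0)
rate_clear = prb_rate(p_clear, N0, B_PRB, 0.0)
rate_blocked = prb_rate(p_blocked, N0, B_PRB, 0.0)
print(f"clear link: {rate_clear / 1e6:.1f} Mbit/s, blocked link: {rate_blocked:.1f} bit/s")
```

Because the blockage indicator multiplies the received power, a blocked link contributes a rate of zero on every PRB, which is why PRBs assigned to a blocked RU are effectively wasted.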

3.3. PON Fronthaul Model

The PON is a TDM GPON with upstream line rate $C_{PON} = 2.488$ Gbps and DBA cycle time $T_{DBA} = 125$ µs. There are $N$ ONUs, each connected to one RU (for simplicity, $N = R$). Each ONU $n$ has a buffer that stores upstream packets from its associated RU. The buffer occupancy evolves as

$$Q_{n}(t + 1) = \max\left(0, Q_{n}(t) + A_{n}(t) - T_{n}(t)\right)$$ (5)

where $A_{n}(t)$ is the traffic arriving from the RU (in bits) during the control step and $T_{n}(t)$ is the number of bits transmitted during the DBA cycles. The transmitted bits depend on the grant allocated by the OLT. At the beginning of each DBA cycle (125 µs), the OLT collects buffer reports from all ONUs (via REPORT messages). It then computes grants $g_{n}(t) \geq 0$ for the next cycle, satisfying $\sum_{n=1}^{N} g_{n}(t) \leq C_{PON} \cdot T_{DBA}$. In our model, we assume a gated DBA baseline, but the RS-PPO agent can override the grant allocation. The actual transmitted traffic is $T_{n}(t) = \min(Q_{n}(t), g_{n}(t))$. The end-to-end latency for a packet from a UE in slice $s$ includes queuing at the ONU, DBA scheduling delay (worst-case one full cycle, 125 µs), and transmission delay over PON (up to 200 µs). The critical source of latency variability is the DBA scheduling delay, especially when grants are not aligned with packet arrivals.

3.4. Traffic Models for Network Slices

We consider three slice types following 3GPP TR 38.913. URLLC traffic is modelled as a periodic arrival process with small packets (32 bytes), a strict latency target $L_{\max} = 1$ ms, and a reliability target of 99.999%. eMBB traffic is full-buffer with a target throughput of 100 Mbps per UE. mMTC traffic follows an ON-OFF model: an ON period of 100 ms with a constant packet rate of 100 packets/s (each 10 bytes), followed by an OFF period of 900 ms. The number of UEs is 5 for URLLC, 20 for eMBB, and 100 for mMTC.
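For orientation, the aggregate load implied by these parameters can be worked out directly. The short Python sketch below uses only the figures stated above; the helper name is hypothetical, and the URLLC load is omitted because the arrival period is not given in this section.

```python
def mmtc_mean_rate_bps(pkt_per_s=100, pkt_bytes=10, on_ms=100, off_ms=900):
    """Long-run mean bit rate of one ON-OFF mMTC device."""
    on_rate_bps = pkt_per_s * pkt_bytes * 8          # rate while ON
    return on_rate_bps * on_ms / (on_ms + off_ms)    # scaled by the duty cycle


N_EMBB, N_MMTC = 20, 100                              # UE counts from the text
mmtc_total_bps = N_MMTC * mmtc_mean_rate_bps()        # mean load of all mMTC UEs
embb_target_total_bps = N_EMBB * 100e6                # sum of per-UE eMBB targets

print(f"mMTC mean load: {mmtc_total_bps / 1e3:.0f} kbit/s")
print(f"eMBB aggregate target: {embb_target_total_bps / 1e9:.1f} Gbit/s")
```

The mMTC slice therefore contributes a negligible bit rate compared with eMBB; its difficulty lies in the number of connections rather than in volume.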

3.5. Cooperative DBA Heuristic (Baseline)

The Cooperative DBA scheme, introduced in [10], extends standard IPACT (Interleaved Polling with Adaptive Cycle Time) by incorporating TDD pattern information from the mmWave scheduler. It serves as a non-learning baseline for comparison with RS-PPO.

For each DBA cycle (125 µs), the OLT performs the following steps:

Collects buffer reports $Q_{n}(t)$ from all ONUs and receives a predicted traffic arrival $\hat{A}_{n}(t)$ from the DU (based on the current TDD uplink/downlink configuration).
Computes a provisional grant for each ONU as $\hat{g}_{n}(t) = Q_{n}(t) + \hat{A}_{n}(t)$.
Scales the grants to satisfy the PON capacity constraint: $g_{n}(t) = \hat{g}_{n}(t) \cdot \frac{C_{PON} T_{DBA}}{\sum_{m} \hat{g}_{m}(t)}$.
Allocates grants to ONUs; each ONU transmits up to $\min(Q_{n}(t), g_{n}(t))$ bits.

This heuristic improves over decoupled DBA (e.g., pure IPACT) by anticipating radio-layer traffic. However, it has three limitations: (i) it uses a fixed prediction model (no learning), (ii) it does not optimise PRB allocation jointly (the DU still uses a separate proportional-fair scheduler), and (iii) it is not slice-aware—all traffic is treated equally. Consequently, Cooperative DBA cannot adapt to dynamic slice loads or non-stationary channel conditions. It is included in our evaluation to quantify the benefit of learning-based joint optimisation.
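The four steps of the baseline can be sketched as follows. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the predicted arrivals are passed in rather than derived from a TDD pattern, and the zero-demand guard is an addition not described in the text. The scaling is applied exactly as written in step 3, so it also scales grants up when the provisional sum is below capacity.

```python
def cooperative_dba_grants(queues, predicted_arrivals, capacity_bits):
    """Steps 1-3: provisional grants Q_n + A_hat_n, scaled to the cycle capacity."""
    provisional = [q + a for q, a in zip(queues, predicted_arrivals)]
    total = sum(provisional)
    if total == 0:
        # Assumed guard: nothing queued or predicted, so nothing is granted.
        return [0.0] * len(queues)
    scale = capacity_bits / total
    return [p * scale for p in provisional]


def transmit(queues, grants):
    """Step 4: each ONU sends at most min(Q_n, g_n) bits."""
    return [min(q, g) for q, g in zip(queues, grants)]


# Toy cycle with a 4000-bit capacity (illustrative, not the GPON value).
queues = [3000.0, 1000.0]
predicted = [1000.0, 3000.0]
grants = cooperative_dba_grants(queues, predicted, capacity_bits=4000.0)
sent = transmit(queues, grants)
print(grants, sent)
```

In the toy cycle both ONUs receive the same grant, although the second holds only 1000 bits at grant time, so part of its grant is usable only if the predicted arrivals materialise; the fixed prediction model is the first limitation listed above.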

4. Problem Formulation and Reward-Shaped PPO

We formulate the joint DBA–PRB orchestration problem as a constrained Markov decision process (CMDP) to capture slice-specific QoS constraints and the hybrid action space. The section first defines the state and action spaces and then explicitly specifies latency, reliability, and throughput constraints. Next, we introduce the base reward function and develop a potential-based reward shaping mechanism that penalises DBA–PRB misalignment. Finally, we present the RS-PPO algorithm and provide commentary on its key steps.

4.1. Constrained Markov Decision Process Formulation

The joint DBA and PRB allocation problem is modelled as a constrained Markov decision process (CMDP). The state space $\mathcal{S}$ is defined by

$$s(t) = \left(Q(t), H(t), D(t), G_{prev}(t)\right)$$ (6)

where $Q(t) \in \mathbb{R}^{N}$ is the vector of ONU queue lengths (bytes), $H(t) \in \mathbb{R}^{R \times K}$ is the matrix of SINR values, $D(t) \in \mathbb{R}^{|\mathcal{S}|}$ is the vector of slice traffic demands, and $G_{prev}(t) \in \mathbb{R}^{N}$ is the previous DBA grant allocation. The action space $\mathcal{A}$ is hybrid: discrete for PRB allocation (each PRB assigned to one RU or idle) and continuous for DBA grants $g(t) \in [0, g_{\max}]^{N}$ with $\sum_{n} g_{n}(t) \leq C_{PON} T_{DBA}$.

Explicit Constraint Specifications

The CMDP is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, P, R, C, \gamma \rangle$, where $C = \{C_{1}, C_{2}, C_{3}\}$ represents three constraint functions with corresponding thresholds $\{d_{1}, d_{2}, d_{3}\}$:

Latency Constraint ($C_{1}$): For the URLLC slice $s_{URLLC}$, the average end-to-end latency $L_{s}(t)$ must satisfy $\mathbb{E}[L_{s}(t)] \leq L_{\max}$, where $L_{\max} = 1$ ms. The latency includes radio transmission time, ONU queuing delay, DBA scheduling delay, and PON transmission delay: $L_{s}(t) = L_{radio}(t) + L_{queue}(t) + L_{DBA}(t) + L_{PON}(t)$.

Reliability Constraint ($C_{2}$): For the URLLC slice, the reliability $R_{s}(t)$ (fraction of packets delivered within $L_{\max}$) must satisfy $\mathbb{E}[R_{s}(t)] \geq R_{\min}$, where $R_{\min} = 0.99999$. This is equivalent to requiring that the tail-latency probability satisfies $P(L_{s}(t) > 1 \text{ ms}) \leq 10^{-5}$.

Throughput Constraint ($C_{3}$): For the eMBB slice, the average throughput $T_{s}(t)$ must satisfy $\mathbb{E}[T_{s}(t)] \geq T_{\min}$, where $T_{\min} = 100$ Mbps per user. For mMTC, the constraint is on device density: $D_{s}(t) \geq D_{\min} = 10^{6}$ devices/km².

The policy $\pi$ is feasible if it satisfies $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} C_{i}(s_{t}, a_{t})\right] \leq d_{i}$ for all $i \in \{1, 2, 3\}$.

The base reward function is

$$R_{base}(t) = \sum_{s \in \mathcal{S}} w_{s} \cdot U_{s}(t) - \lambda_{dba} \cdot \eta_{dba}(t) - \lambda_{mis} \cdot \phi_{mis}(t) - \sum_{i=1}^{3} \kappa_{i} \cdot \max\left(0, C_{i}(t) - d_{i}\right)$$ (7)

where $U_{s}(t)$ is the slice utility (negative latency for URLLC, throughput satisfaction for eMBB/mMTC), $\eta_{dba}(t)$ penalises wasted grant, $\phi_{mis}(t)$ penalises DBA–PRB misalignment, and $\kappa_{i}$ are penalty coefficients ($\kappa_{1} = 10$ for latency violation, $\kappa_{2} = 100$ for reliability violation, $\kappa_{3} = 1$ for throughput violation). The weighting coefficients are set to $\lambda_{dba} = 0.2$ and $\lambda_{mis} = 0.5$, and the slice weights $w_{s}$ are $w_{URLLC} = 0.5$, $w_{eMBB} = 0.3$, $w_{mMTC} = 0.2$, determined via grid search. Note that $\lambda_{mis}$ is a hyperparameter that tunes the relative importance of this alignment penalty compared to other reward components (e.g., slice utility, DBA waste penalty, constraint violations).
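A minimal Python sketch of Eq. (7) is given below, using the coefficients stated above. The function name and the example inputs are hypothetical, and the sketch assumes that each constraint signal $C_{i}(t)$ is expressed as a cost for which $C_{i}(t) \leq d_{i}$ means the constraint is met (for example, latency in ms against a 1 ms threshold).

```python
WEIGHTS = {"URLLC": 0.5, "eMBB": 0.3, "mMTC": 0.2}  # slice weights w_s
LAMBDA_DBA, LAMBDA_MIS = 0.2, 0.5                    # waste / misalignment weights
KAPPA = (10.0, 100.0, 1.0)                           # latency, reliability, throughput


def base_reward(utilities, eta_dba, phi_mis, costs, thresholds):
    """Eq. (7): weighted utility minus waste, misalignment and violation terms.

    utilities: dict slice name -> U_s(t)
    costs, thresholds: sequences (C_1..C_3) and (d_1..d_3), cost-style (assumed)
    """
    utility_term = sum(WEIGHTS[s] * utilities[s] for s in WEIGHTS)
    violation_term = sum(k * max(0.0, c - d)
                         for k, c, d in zip(KAPPA, costs, thresholds))
    return (utility_term
            - LAMBDA_DBA * eta_dba
            - LAMBDA_MIS * phi_mis
            - violation_term)


# Illustrative step: URLLC latency of 1.5 ms against a 1 ms threshold.
utilities = {"URLLC": -0.5, "eMBB": 1.0, "mMTC": 1.0}
r_violating = base_reward(utilities, 0.5, 0.2, (1.5, 0.0, 0.0), (1.0, 0.0, 0.0))
r_compliant = base_reward(utilities, 0.5, 0.2, (0.8, 0.0, 0.0), (1.0, 0.0, 0.0))
print(r_violating, r_compliant)
```

With these numbers the single latency violation costs 5 reward units, far more than the utility terms can recover, which reflects how strongly the penalty coefficients prioritise constraint satisfaction.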

4.2. Reward Shaping for Accelerated Convergence

We apply potential-based reward shaping [17] with potential function:

$$\Phi(s) = -\alpha \sum_{n=1}^{N} Q_{n} - \beta \sum_{n=1}^{N} \left| Q_{n} - \frac{g_{n}}{T_{ctrl}} \cdot T_{DBA} \right|$$ (8)

4.2.1. Detailed Design of the Reward Shaping Function

The potential function comprises two terms:

Term 1 (Queue length penalty): $\Phi_{1}(s) = -\alpha \sum_{n=1}^{N} Q_{n}$ with $\alpha = 0.01$. This provides dense feedback about congestion: whenever an ONU queue grows, the potential decreases, reducing the shaped reward and guiding the agent to prevent queue buildup.

Term 2 (DBA–PRB alignment penalty): $\Phi_{2}(s) = -\beta \sum_{n=1}^{N} \left| Q_{n} - \frac{g_{n}}{T_{ctrl}} \cdot T_{DBA} \right|$ with $\beta = 0.1$. This penalises misalignment between the actual queue length $Q_{n}$ and the amount that can be transmitted given the DBA grant over the control period. Perfect alignment yields zero penalty.

The shaped reward preserves the optimal policy because $\Phi$ is a potential-based shaping function [17]. For any transition $(s_{t}, a_{t}, s_{t+1})$,

$$R_{shaped}(t) = R_{base}(t) + \gamma \Phi(s_{t+1}) - \Phi(s_{t})$$ (9)

Summed along a trajectory, the discounted shaping terms telescope to an offset that depends only on the initial state (plus a terminal term that vanishes as the horizon grows, for bounded $\Phi$) and not on the actions taken, so the ranking of policies, and hence the optimal policy, is unchanged.
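The telescoping argument can be checked numerically. The Python sketch below implements Eqs. (8) and (9) with the stated $\alpha$, $\beta$, $T_{ctrl}$, and $T_{DBA}$, and verifies on a toy trajectory that the shaped and base returns differ by $\gamma^{T} \Phi(s_{T}) - \Phi(s_{0})$. The function names, the toy states, and the base rewards are illustrative assumptions; the grants passed to the potential are read as the previous grants held in the state.

```python
ALPHA, BETA = 0.01, 0.1          # shaping coefficients from the text
T_CTRL, T_DBA = 10e-3, 125e-6    # control step and DBA cycle in seconds
GAMMA = 0.99                     # discount factor from the text


def potential(queues, grants):
    """Eq. (8): queue-length term plus DBA-PRB alignment term."""
    backlog = sum(queues)
    misalignment = sum(abs(q - g / T_CTRL * T_DBA)
                       for q, g in zip(queues, grants))
    return -ALPHA * backlog - BETA * misalignment


def shaped_reward(r_base, phi_s, phi_next, gamma=GAMMA):
    """Eq. (9): base reward plus the potential difference."""
    return r_base + gamma * phi_next - phi_s


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma^t * r_t, accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy trajectory of three states (queues, previous grants) and two base rewards.
states = [([800.0, 0.0], [0.0, 0.0]),
          ([400.0, 100.0], [8000.0, 0.0]),
          ([0.0, 0.0], [0.0, 0.0])]
base = [1.0, 2.0]
phis = [potential(q, g) for q, g in states]
shaped = [shaped_reward(r, phis[t], phis[t + 1]) for t, r in enumerate(base)]

gap = discounted_return(shaped) - discounted_return(base)
offset = GAMMA ** len(base) * phis[-1] - phis[0]
print(gap, offset)  # the two values agree up to rounding
```

The offset depends on the first and last states only, which is why the shaping changes the learning signal at each step without changing which policy is optimal.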

4.2.2. Mechanism for Addressing DBA–PRB Misalignment

The proposed RS-PPO framework addresses misalignment through three coordinated mechanisms. First, the state space $s(t)$ includes both ONU queue lengths $Q(t)$ and previous DBA grants $G_{prev}(t)$, making the agent aware of the optical domain status when making PRB allocation decisions. Second, the reward shaping term $\Phi_{2}(s) = -\beta \sum_{n} \left| Q_{n} - \frac{g_{n}}{T_{ctrl}} T_{DBA} \right|$ explicitly penalises any deviation between the ONU queue length and the amount that can be transmitted given the allocated grant. This creates a dense incentive for the agent to align PRB assignments (which determine $A_{n}(t)$) with DBA grants (which determine $T_{n}(t)$). Third, the hybrid action space allows the agent to output both allocations jointly in the same control step, eliminating the temporal asynchrony that plagues decoupled approaches. During training, the agent learns that allocating a PRB to an RU whose associated ONU has insufficient grant leads to a negative shaping reward, while allocating a grant without corresponding radio traffic also incurs a penalty. Over time, the policy converges to a coordinated strategy where PRB allocations and DBA grants are mutually consistent.

4.3. Proximal Policy Optimization with Shaped Reward

We employ PPO [13] with a clipped surrogate objective. The advantage is estimated using GAE ($\lambda = 0.95$, $\gamma = 0.99$). The training procedure is given in Algorithm 1.

Algorithm 1: Reward-Shaped PPO for Joint DBA–PRB Allocation

Input: Policy $\pi_{\theta}(a \mid s)$, value function $V_{\varphi}(s)$; potential function $\Phi(s)$; discount factor $\gamma$, GAE parameter $\lambda$, clipping parameter $\varepsilon$; learning rates $\alpha_{\theta}, \alpha_{\varphi}$; episodes $E$, horizon $T$

Output: Optimal policy $\pi_{\theta}^{*}$

1: Initialise $\theta$, $\varphi$
2: for episode = 1 to $E$ do
3:   Observe initial state $s_{0}$
4:   Initialise trajectory buffer $\mathcal{D}$
5:   for $t = 0$ to $T - 1$ do
6:     Sample action: $a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$
7:     Execute action $a_{t}$; observe $r_{t}^{base}$ and next state $s_{t+1}$
8:     Compute shaped reward: $r_{t} \leftarrow r_{t}^{base} + \gamma \Phi(s_{t+1}) - \Phi(s_{t})$
9:     Store $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $\mathcal{D}$
10:     $s_{t} \leftarrow s_{t+1}$
11:   end for
12:   Compute temporal-difference residual: $\delta_{t} = r_{t} + \gamma V_{\varphi}(s_{t+1}) - V_{\varphi}(s_{t})$
13:   Compute advantage (GAE): $A_{t} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}$
14:   Compute return: $R_{t} = A_{t} + V_{\varphi}(s_{t})$
15:   for each mini-batch $\mathcal{B} \subset \mathcal{D}$ do
16:     Compute probability ratio: $\rho_{t} = \pi_{\theta}(a_{t} \mid s_{t}) / \pi_{\theta,old}(a_{t} \mid s_{t})$
17:     Update actor (gradient ascent): $\theta \leftarrow \theta + \alpha_{\theta} \nabla_{\theta} \mathbb{E}\left[\min\left(\rho_{t} A_{t}, \mathrm{clip}(\rho_{t}, 1 - \varepsilon, 1 + \varepsilon) A_{t}\right)\right]$
18:     Update critic (MSE loss): $\varphi \leftarrow \varphi - \alpha_{\varphi} \nabla_{\varphi} \mathbb{E}\left[\left(V_{\varphi}(s_{t}) - R_{t}\right)^{2}\right]$
19:   end for
20: end for
21: return $\pi_{\theta}^{*}$

Algorithmic Commentary

The clipped surrogate objective in PPO prevents excessively large policy updates, improving stability. In Algorithm 1, lines 12–14 compute the generalised advantage estimate (GAE) with $λ = 0.95$, balancing bias and variance. The shaped reward $r_{t}$ (line 8) replaces the base reward in all temporal-difference calculations, ensuring that the policy gradient reflects the dense feedback from the potential function. The actor update (line 17) clips the probability ratio $ρ_{t}$ to $[1 - ε, 1 + ε]$, which bounds the size of each policy step. The critic (line 18) is trained by mean-squared-error regression on the return $R_{t}$. All hyperparameters were selected via grid search over 20 pilot runs.
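To make the data flow of lines 8 and 12–17 of Algorithm 1 concrete, the following NumPy sketch computes the shaped reward, the GAE advantages, and the clipped surrogate value for a single trajectory. It is a minimal illustration rather than the authors' implementation: the function names, the array layout (potentials and values of length $T + 1$), and the truncation of the GAE sum at the end of the trajectory are assumptions made here.

```python
import numpy as np

def shaped_rewards(r_base, phi, gamma=0.99):
    """Line 8: r_t = r_base_t + gamma * Phi(s_{t+1}) - Phi(s_t).
    `phi` holds Phi(s_0), ..., Phi(s_T), so it has length T + 1."""
    r_base = np.asarray(r_base, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return r_base + gamma * phi[1:] - phi[:-1]

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Lines 12-14: TD residuals, advantages, and returns.
    `values` holds V(s_0), ..., V(s_T); the sum is truncated at T (assumption)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # Backward recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    returns = advantages + values[:-1]
    return advantages, returns

def clipped_surrogate(ratio, advantages, eps=0.2):
    """Line 17: PPO clipped objective (a quantity to be maximised)."""
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

# Illustrative three-step trajectory (all numbers are hypothetical).
r = shaped_rewards([1.0, 0.5, -0.2], phi=[0.0, -0.4, -0.1, 0.0])
adv, ret = gae(r, values=[0.8, 0.6, 0.3, 0.0])
objective = clipped_surrogate([1.1, 0.7, 1.4], adv)
```

A useful sanity check is that the discounted sum of the shaping terms telescopes to $γ^{T} Φ(s_{T}) - Φ(s_{0})$, so the shaping changes the per-step signal but not the ranking of complete trajectories that share the same start and end potentials.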

The overall architecture of the proposed RS-PPO-based orchestration framework is illustrated in Figure 1. The near-RT RIC hosts the RS-PPO xApp, which integrates a graph convolutional network for state abstraction, actor and critic networks for policy learning, and a reward shaping module. The xApp continuously observes the mmWave channel states, ONU buffer reports, and slice traffic demands via the E2 interface and the PON management interface. It outputs joint actions: PRB allocations to the DU and DBA grants to the OLT. The DU schedules mmWave transmissions through the RUs, while the OLT allocates upstream grants to the ONUs. Dashed arrows indicate observation flows, and solid arrows indicate action enforcement and data paths.

This architecture enables the RS-PPO xApp to jointly observe both mmWave and PON states and to output coordinated actions, thereby addressing the DBA–PRB misalignment problem that plagues decoupled approaches.

5. Computational Efficiency Enhancements

To enable near-real-time operation on the O-RAN near-RT RIC (control cycles of 10 ms or less), the RS-PPO framework incorporates three complementary techniques that reduce computational overhead during inference and accelerate learning during training. First, graph convolutional state abstraction compresses the high-dimensional state space into a compact latent representation. Second, action masking prunes infeasible actions at inference time, reducing the effective action space. Third, prioritised N-step replay focuses training on the most informative experiences. Together, these enhancements reduce inference time by 45% (from 4.2 ms to 2.3 ms) and accelerate convergence by 45%, as demonstrated in Section 6. The following subsections detail each technique.

Graph Convolutional State Abstraction

We encode the state using a graph convolutional network (GCN) with two layers:

$$H^{(1)} = σ(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X^{(0)} W^{(0)})$$ (10)

$$Z = σ(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(1)} W^{(1)})$$ (11)

where $\tilde{A}$ is the adjacency matrix of the RU graph with self-loops, $\tilde{D}$ is its degree matrix, $X^{(0)}$ is the initial feature matrix (SINR, queue lengths), and $W^{(0)}, W^{(1)}$ are trainable weights. The output $Z$ of dimension $R \times 32$ is flattened and fed to the actor and critic networks. Figure 2 illustrates the GCN-based state abstraction pipeline, from the RU graph and feature matrix to the flattened latent vector fed into the actor and critic networks.

In Figure 2, the RU graph (nodes 1, 2, 3 with adjacency matrix $\tilde{A}$) provides the structural topology. Node features (SINR and queue length $Q$) form the input feature matrix $X^{(0)}$. Two GCN layers transform the graph-structured input into a latent representation $Z$, which is then flattened and fed into the actor and critic networks for joint DBA–PRB policy learning.
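The two propagation rules in Equations (10) and (11) can be sketched in a few lines of NumPy. The example uses a three-node RU graph like the one drawn in Figure 2; the path-shaped edge set, the feature values, the hidden width of 16, the ReLU activation, and the random weights are all assumptions for illustration, since the text fixes only the output width of 32.

```python
import numpy as np

def normalised_adjacency(adj):
    """Return D~^{-1/2} A~ D~^{-1/2}, where A~ = A + I adds self-loops."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_encode(adj, x, w0, w1):
    """Two-layer GCN of Eqs. (10)-(11); sigma is taken to be ReLU (assumption)."""
    a_hat = normalised_adjacency(adj)
    h1 = np.maximum(a_hat @ x @ w0, 0.0)   # Eq. (10)
    z = np.maximum(a_hat @ h1 @ w1, 0.0)   # Eq. (11)
    return z.reshape(-1)                   # flattened input to actor and critic

rng = np.random.default_rng(0)
# Hypothetical path graph RU1 - RU2 - RU3.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
# One row per RU: [SINR in dB, normalised queue length] (illustrative values).
x = np.array([[12.0, 0.3],
              [8.5, 0.7],
              [15.1, 0.1]])
w0 = rng.normal(scale=0.1, size=(2, 16))   # untrained placeholder weights
w1 = rng.normal(scale=0.1, size=(16, 32))
z = gcn_encode(adj, x, w0, w1)             # length 3 * 32 = 96
```

With the $R = 5$ RUs listed in Table 2, the same code would produce a flattened vector of length $5 \times 32 = 160$, independent of how many raw features each RU carries.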

The remaining two techniques are action masking and prioritised replay.

Action Masking and Prioritised Replay: The hybrid action space (discrete PRB assignments, continuous DBA grants) contains many infeasible actions. For example, a PRB cannot be allocated to more than one RU simultaneously, and a DBA grant cannot exceed the ONU's queue length or the PON's total capacity. Rather than letting the agent learn these constraints through trial and error, we apply action masking at inference time. For discrete actions, the logits of invalid choices are set to $-\infty$ before the softmax, ensuring zero probability of selection. For continuous actions, raw outputs are clipped to the feasible range $[0, \min(g_{max}, Q_{n})]$ and then projected to satisfy the sum constraint $\sum_{n} g_{n} \leq C_{PON} T_{DBA}$. This masking reduces the exploration space and eliminates obviously wasteful actions, accelerating convergence.
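A minimal sketch of the two feasibility steps follows: a masked softmax for the discrete PRB choice and a clip-then-project step for the continuous grants. The text does not say how the projection onto the sum constraint is carried out, so proportional down-scaling is assumed here; the function names and the numbers in the example are likewise hypothetical.

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax in which entries with mask == 0 receive probability exactly 0."""
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be feasible")
    masked = np.where(mask, logits, -np.inf)
    shifted = masked - masked[mask].max()   # subtract max for numerical stability
    weights = np.exp(shifted)               # exp(-inf) evaluates to 0.0
    return weights / weights.sum()

def project_grants(raw, queues, g_max, budget):
    """Clip each grant to [0, min(g_max, Q_n)], then enforce sum(g) <= budget.
    Proportional scaling is an assumed projection, not taken from the paper."""
    raw = np.asarray(raw, dtype=float)
    queues = np.asarray(queues, dtype=float)
    grants = np.clip(raw, 0.0, np.minimum(g_max, queues))
    total = grants.sum()
    if total > budget:
        grants = grants * (budget / total)
    return grants

# PRB choice among three RUs where RU 2 is infeasible for this PRB.
probs = masked_softmax([1.0, 2.0, 3.0], mask=[1, 0, 1])
# Grants in bytes for three ONUs with a 35-byte budget (toy numbers).
grants = project_grants([-5.0, 100.0, 30.0], queues=[10.0, 40.0, 50.0],
                        g_max=60.0, budget=35.0)
```

Proportional scaling preserves the relative shares chosen by the actor and never increases a grant, so the per-ONU bounds still hold after the projection.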

To further improve sample efficiency, we employ prioritised experience replay (PER) [16] with N-step returns ($N = 4$). Transitions are stored in a replay buffer and sampled with probability proportional to their temporal-difference (TD) error. High-error transitions (those that surprise the current value function) are replayed more often, focusing learning on the most informative experiences. The N-step return reduces bias compared to single-step returns and better captures delayed reward structures, which is important in our setting because the effect of a DBA grant may take several cycles to manifest.
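The two ingredients of this paragraph, the N-step return and TD-error-proportional sampling, can be sketched as follows. The small constant `eps` (which keeps zero-error transitions sampleable) and the exponent `alpha` are common PER conventions assumed here rather than values reported in the text; `alpha = 1` corresponds to sampling strictly in proportion to the absolute TD error.

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """sum_{k<N} gamma^k r_{t+k} + gamma^N V(s_{t+N}), with N = len(rewards)."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sampling_probabilities(td_errors, alpha=1.0, eps=1e-6):
    """Replay probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = (np.abs(np.asarray(td_errors, dtype=float)) + eps) ** alpha
    return priorities / priorities.sum()

# Four-step return (N = 4) with illustrative rewards and bootstrap value.
g4 = n_step_return([1.0, 1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)

# Draw a mini-batch of two transitions from a toy buffer of four.
rng = np.random.default_rng(0)
probs = sampling_probabilities([0.1, 2.0, 0.5, 0.0])
batch = rng.choice(len(probs), size=2, replace=False, p=probs)
```

Because the bootstrap term is discounted by $γ^{N}$, errors in the value estimate weigh less in a 4-step target than in a 1-step target, which is the bias reduction the paragraph refers to.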

Inference Procedure: Algorithm 2 summarises the inference step with action masking. At each control step $t$, the agent first encodes the current state $s(t)$ using the trained GCN to obtain the latent representation $z$. For each PRB $k$, the actor network computes logits $l_{k} = f_{PRB}(z, k)$; invalid allocations (e.g., a PRB already assigned to another RU) are masked by setting $l_{k} = -\infty$. A valid PRB allocation is then sampled from the resulting categorical distribution. For each ONU $n$, the actor outputs a raw grant $\tilde{g}_{n} = f_{grant}(z, n)$, which is projected to the feasible region $[0, \min(g_{max}, Q_{n})]$. The final joint action $(a_{PRB}, g)$ is sent to the DU and OLT. Because the GCN forward pass and the actor network involve only a few matrix multiplications, the entire inference takes just 2.3 ms on the testbed hardware, well within the 10 ms control period and leaving margin for other xApps running on the same RIC.

Algorithm 2: Inference with Action Masking for Joint DBA–PRB Allocation

Input: Trained policy $π_{θ}$; state $s$; action mask $M_{k} \in \{0, 1\}$; maximum grant $g_{max}$; queue lengths $Q_{n}$

Output: Joint action $a = (a_{PRB}, g)$

1: Encode state: $z \leftarrow f_{GCN}(s)$
2: PRB allocation (discrete actions)
   for each PRB $k = 1$ to $K$ do
     Compute logits: $l_{k} = f_{PRB}(z, k)$
     Apply mask: if $M_{k} = 0$ then $l_{k} \leftarrow -\infty$ end if
     Compute probabilities: $π_{θ}(a_{k} | s) = \exp(l_{k}) / \sum_{j} \exp(l_{j})$
     Sample allocation: $a_{PRB,k} \sim π_{θ}(\cdot | s)$
   end for
3: DBA grant allocation (continuous actions)
   for each ONU $n = 1$ to $N$ do
     Compute raw grant: $\tilde{g}_{n} = f_{grant}(z, n)$
     Project to feasible region: $g_{n} = \min(\max(\tilde{g}_{n}, 0), g_{max}, Q_{n})$
   end for
4: return $a = (a_{PRB}, g)$

6. Simulation Results

We evaluate the proposed RS-PPO framework through extensive discrete-event simulations using OMNeT++ 6.3.0 with the Simu5G framework. The simulated converged mmWave-PON O-RAN deployment follows the system model described in Section 3. We compare RS-PPO against three baselines: (i) a decoupled baseline where DBA uses standard IPACT and PRB scheduling uses proportional fair, (ii) Cooperative DBA [10] which adds TDD-pattern awareness to DBA but no learning, and (iii) Standard PPO (without reward shaping or efficiency enhancements). Key performance metrics include URLLC end-to-end latency, PRB utilisation for eMBB, reliability, training convergence speed, and inference time. The following subsections present and discuss the results.

6.1. Simulation Setup

The evaluation is conducted using the OMNeT++ (Objective Modular Network Testbed in C++) discrete-event simulation framework, version 6.3.0. OMNeT++ provides a component-based, modular architecture that enables detailed modelling of communication networks, protocols, and distributed algorithms. The simulation models leverage the INET Framework (version 4.5) for standard TCP/IP protocol implementations and the Simu5G framework (version 3.x), which adds detailed 5G New Radio (NR) models, including support for the 28 GHz mmWave band. The simulation environment runs on a dedicated workstation equipped with an Intel Xeon Gold 6248 processor (40 cores, 2.5 GHz) and 128 GB of RAM, running Ubuntu 22.04 LTS.

The simulated network comprises a single O-RAN compliant gNodeB (gNB) with its associated Distributed Unit (DU) and a collocated OLT, representing a converged mmWave–PON deployment. The gNB serves $U = 125$ user equipments (UEs), which are divided among the three network slices: 5 URLLC UEs, 20 eMBB UEs, and 100 mMTC UEs. The mmWave interface of the gNB is configured with a total bandwidth of 100 MHz at a carrier frequency of 28 GHz, which is divided into $K = 50$ physical resource blocks (PRBs), each with a bandwidth of 2 MHz. The channel model follows the 3GPP TR 38.901 specifications for an urban micro (UMi) scenario, incorporating distance-dependent path loss, log-normal shadowing with 8 dB standard deviation, and a two-state Markov chain model for blockage events.

The PON fronthaul is modelled as a GPON system connecting the gNB's collocated OLT to the RUs. The OLT is connected to $N = 5$ ONUs via a 1:5 passive optical splitter. The PON operates with an upstream line rate of 2.488 Gbps and a DBA cycle time of $T_{DBA} = 125$ μs. The OLT implements a standard Interleaved Polling with Adaptive Cycle Time (IPACT) DBA scheme, which the RS-PPO agent can override.

The DRL agent is implemented as an OMNeT++ "simple module" written in C++. The module is instantiated within the RIC and interfaces with both the DU scheduler of the gNB and the DBA module of the OLT via standard method calls, acting as the orchestrator. It features the graph convolutional network (GCN) encoder, actor, and critic networks. All training is performed offline, where the agent interacts with the simulation environment for a predefined number of episodes, collecting experiences, and periodically updating its policy. After training, the agent's inference is fast enough for near-real-time operation, as the module performs only a forward pass through the actor network at each control step.
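The setup parameters imply a few derived quantities that are useful when reading the constraints, in particular the upstream budget $C_{PON} T_{DBA}$ that bounds the sum of grants. The short calculation below works them out in integer arithmetic. It uses the raw line rate and ignores GPON framing and guard-time overhead, which is a simplifying assumption; the 10 ms control period is taken from Table 2.

```python
# Parameters quoted in the simulation setup and Table 2.
C_PON_BPS = 2_488_000_000      # upstream line rate, bit/s
T_DBA_US = 125                 # DBA cycle time, microseconds
T_CTRL_US = 10_000             # RIC control period, microseconds
K_PRB = 50                     # number of PRBs
B_PRB_HZ = 2_000_000           # bandwidth per PRB, Hz
UES_PER_SLICE = {"URLLC": 5, "eMBB": 20, "mMTC": 100}

# Upper bound on upstream payload per DBA cycle (raw line rate, no overhead).
bits_per_dba_cycle = C_PON_BPS * T_DBA_US // 1_000_000
bytes_per_dba_cycle = bits_per_dba_cycle // 8

# Number of DBA cycles that elapse within one RIC decision interval.
dba_cycles_per_control_period = T_CTRL_US // T_DBA_US

# Consistency checks on the radio and UE configuration.
total_bandwidth_hz = K_PRB * B_PRB_HZ
total_ues = sum(UES_PER_SLICE.values())

print(bytes_per_dba_cycle, dba_cycles_per_control_period,
      total_bandwidth_hz, total_ues)
```

Under these assumptions each 125 μs cycle carries at most 38,875 bytes shared among the five ONUs, and one 10 ms control decision spans 80 DBA cycles.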

6.2. URLLC Latency Performance

Figure 3 presents the average end-to-end latency for URLLC packets as a function of normalized offered traffic load.

RS-PPO demonstrates the lowest latency across all load levels, achieving a 37% reduction at nominal load (1.0) compared to the decoupled baseline (0.87 ms versus 1.38 ms). Cooperative DBA and Standard PPO achieve 1.10 ms and 0.98 ms, respectively. RS-PPO also sustains 99.999% reliability, with only 0.001% of packets exceeding 1 ms, thereby meeting URLLC requirements. In contrast, the decoupled baseline does not meet the 1 ms target at loads above 0.8. The results demonstrate that joint DBA–PRB optimization in RS-PPO significantly reduces URLLC latency compared to decoupled and cooperative baselines, while simultaneously achieving the required reliability target.

6.3. PRB Utilization for eMBB Slice

Figure 4 presents PRB utilisation for the eMBB slice as a function of normalised eMBB traffic load. RS-PPO achieves 87% utilisation at nominal load, representing a 28% increase over the decoupled baseline (68%). Cooperative DBA and Standard PPO achieve 76% and 81% utilisation, respectively. Under higher loads (1.8), RS-PPO sustains 95% utilisation, indicating efficient resource usage during congestion. This improvement results from the joint DBA–PRB optimisation, which prevents allocation of PRBs to RUs whose corresponding ONUs lack sufficient upstream grants.

The improvement in PRB utilisation demonstrates that RS-PPO effectively avoids resource fragmentation by aligning radio resource allocation with optical fronthaul grant availability, leading to more efficient use of both domains.

6.4. Training Convergence

Figure 5 compares cumulative rewards per episode during training. RS-PPO achieves 90% of its final reward by episode 3200, whereas Standard PPO requires 5800 episodes, representing a 45% reduction in training time. The reward-shaping terms, which penalise DBA–PRB misalignment and queue buildup, offer denser feedback and guide the agent more efficiently toward policies that satisfy quality of service (QoS) requirements. The final rewards converge to similar values, indicating that reward shaping does not affect the optimal policy.

The accelerated convergence of RS-PPO is therefore attributable to the dense feedback supplied by the reward-shaping terms, which penalise DBA–PRB misalignment and queue buildup, and not to a change in the policy that is ultimately learned.
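The convergence metric used here, episodes needed to reach 90% of the final reward, can be computed from a per-episode reward curve as sketched below. The definition of the final reward as the mean over the last `tail` episodes and the assumption of a positive final reward are choices made for this illustration; the synthetic curve is hypothetical, and a noisy curve would normally be smoothed first.

```python
import numpy as np

def episodes_to_fraction(rewards, fraction=0.9, tail=100):
    """First episode (1-indexed) whose reward reaches `fraction` of the final
    reward, where the final reward is the mean of the last `tail` episodes.
    Assumes the final reward is positive; returns None if never reached."""
    rewards = np.asarray(rewards, dtype=float)
    final = rewards[-tail:].mean()
    reached = np.nonzero(rewards >= fraction * final)[0]
    return int(reached[0]) + 1 if reached.size else None

# Synthetic saturating learning curve (illustrative only).
episodes = np.arange(1, 8001)
curve = 1.0 - np.exp(-episodes / 1000.0)
n90 = episodes_to_fraction(curve)

# Relative reduction implied by the episode counts reported for Figure 5.
reduction = 1.0 - 3200 / 5800
print(f"{100 * reduction:.1f}% fewer episodes")
```

Applied to the reported values, 3200 against 5800 episodes gives a reduction of about 44.8%, which the text rounds to 45%.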

6.5. Multi-Metric Performance Comparison

Figure 6 presents a radar chart that compares four methods across five normalised performance metrics: URLLC latency (lower values indicate superior performance), PRB utilisation (higher is preferable), convergence speed (measured as episodes to 90% reward, with lower values indicating faster convergence), inference time per step (lower is preferable), and reliability (higher is preferable). Each metric is normalised on a scale from 0 to 1, where 1 denotes the best performance among all methods and 0 the worst. RS-PPO demonstrates superior performance, achieving the highest scores in four of the five metrics (latency, utilisation, convergence, and reliability) and the second-highest score in inference time, marginally trailing the heuristic methods. The decoupled baseline exhibits poor performance in latency and reliability, while Cooperative DBA and Standard PPO yield moderate results but do not achieve the comprehensive performance demonstrated by RS-PPO.

The radar chart shows that RS-PPO leads on four of the five evaluated metrics and trails only the heuristic methods on inference time, supporting its overall suitability for slice-aware resource orchestration in converged mmWave–PON O-RAN.
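One way to obtain scores in which 1 is the best method and 0 the worst is min-max normalisation with the direction flipped for lower-is-better metrics. The sketch below applies it to the nominal-load latency and utilisation figures reported in Sections 6.2 and 6.3. The exact normalisation used for Figure 6 is not stated in the text, so this formula is an assumption.

```python
def normalise(values, lower_is_better=False):
    """Min-max normalise a {method: value} mapping so that best = 1, worst = 0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {name: 1.0 for name in values}
    if lower_is_better:
        return {name: (hi - v) / (hi - lo) for name, v in values.items()}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}

# Values at nominal load, as reported in the text.
latency_ms = {"Decoupled": 1.38, "Cooperative DBA": 1.10,
              "Standard PPO": 0.98, "RS-PPO": 0.87}
utilisation_pct = {"Decoupled": 68, "Cooperative DBA": 76,
                   "Standard PPO": 81, "RS-PPO": 87}

latency_score = normalise(latency_ms, lower_is_better=True)
utilisation_score = normalise(utilisation_pct)

for method in latency_ms:
    print(f"{method:16s} latency={latency_score[method]:.2f} "
          f"utilisation={utilisation_score[method]:.2f}")
```

Because each axis is rescaled independently, the chart conveys rank and relative spacing per metric but not the absolute size of the gaps, which is why the raw values in the preceding subsections remain the primary evidence.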

6.6. Computational Overhead

Figure 7 illustrates the inference time per control step for four configurations: Standard PPO (no optimisations), RS-PPO with GCN state abstraction only, RS-PPO with GCN plus action masking, and the full RS-PPO (GCN, action masking, and prioritised replay). Inference time decreases incrementally with each optimisation. Standard PPO requires 4.2 ms, which consumes 42% of the 10 ms control budget and leaves less margin for additional xApps. Incorporating GCN alone reduces inference time to 3.1 ms, and adding action masking further lowers it to 2.5 ms. The complete RS-PPO framework achieves an inference time of 2.3 ms, corresponding to a 45% reduction relative to Standard PPO. This result satisfies the real-time constraint and supports deployment on the near-RT RIC. The horizontal dashed line in Figure 7 denotes the 10 ms control period limit, which is met by all configurations; however, the full RS-PPO offers the greatest safety margin.

The observed reduction in inference time demonstrates the effectiveness of the proposed computational efficiency enhancements, thereby rendering the complete RS-PPO framework suitable for near-real-time deployment on the O-RAN near-RT RIC.
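The per-stage figures can be turned into cumulative reductions and budget shares with a few lines of arithmetic. The helper name and the stage labels are illustrative; the timings and the 10 ms budget are the ones quoted for Figure 7.

```python
BUDGET_MS = 10.0

# Inference time per control step for each configuration (ms).
STAGES = [
    ("Standard PPO", 4.2),
    ("+ GCN state abstraction", 3.1),
    ("+ action masking", 2.5),
    ("+ prioritised replay (full RS-PPO)", 2.3),
]

def stage_report(stages, budget_ms):
    """Return (name, time_ms, reduction_vs_first_pct, share_of_budget_pct)."""
    baseline = stages[0][1]
    report = []
    for name, time_ms in stages:
        reduction = 100.0 * (baseline - time_ms) / baseline
        share = 100.0 * time_ms / budget_ms
        report.append((name, time_ms, reduction, share))
    return report

report = stage_report(STAGES, BUDGET_MS)
for name, time_ms, reduction, share in report:
    print(f"{name:36s} {time_ms:.1f} ms  -{reduction:4.1f}%  {share:4.1f}% of budget")
```

The cumulative reductions come out at roughly 26%, 40%, and 45%, so most of the saving is attributable to the GCN abstraction, with action masking contributing the next largest step.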

6.7. Robustness to Mobility and Dynamic Blockage

While the main simulations assumed static UEs and stationary blockage, real-world deployments involve UE mobility and time-varying blockages. To evaluate the robustness of the trained RS-PPO policy, we conducted two additional simulation campaigns without retraining the agent. The same policy (trained under static conditions) was tested under more challenging scenarios:

Scenario 1: UE Mobility. UEs move according to the random waypoint model within an urban micro-cell (radius 200 m). Speeds are uniformly distributed between 0 and 30 km/h (pedestrian and low-speed vehicular). Channel states are updated every 10 ms based on instantaneous UE positions.

Scenario 2: Dynamic Blockage. The two-state Markov chain for blockage [2] is configured with a transition probability $p = 0.1$ per 10 ms control step, resulting in an average blockage duration of 50 ms and an average clear duration of 450 ms. This models moving obstacles such as vehicles or rotating machinery.
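For a two-state Markov chain sampled every 10 ms, the time spent in a state before leaving it is geometrically distributed, so a mean duration $\bar{T}$ corresponds to a per-step exit probability of $10\ \text{ms} / \bar{T}$. The sketch below derives the exit probabilities implied by the quoted 50 ms and 450 ms durations and simulates the chain. These per-step values and the function names are derived here for illustration and are not stated in the text; under this reading the long-run fraction of blocked steps is 0.1.

```python
import numpy as np

STEP_MS = 10.0

def exit_probability(mean_duration_ms, step_ms=STEP_MS):
    """Per-step probability of leaving a state with the given mean sojourn time."""
    return step_ms / mean_duration_ms

def simulate_blockage(p_clear_to_blocked, p_blocked_to_clear, steps, seed=0):
    """Simulate the chain; returns a boolean array that is True when blocked."""
    rng = np.random.default_rng(seed)
    blocked = np.zeros(steps, dtype=bool)
    state = False                      # start in the clear (LOS) state
    for t in range(steps):
        blocked[t] = state
        p_flip = p_blocked_to_clear if state else p_clear_to_blocked
        if rng.random() < p_flip:
            state = not state
    return blocked

p_blocked_to_clear = exit_probability(50.0)    # 0.2 per step
p_clear_to_blocked = exit_probability(450.0)   # about 0.022 per step
stationary_blocked = p_clear_to_blocked / (p_clear_to_blocked + p_blocked_to_clear)

trace = simulate_blockage(p_clear_to_blocked, p_blocked_to_clear, steps=100_000)
print(f"analytical blocked fraction: {stationary_blocked:.3f}")
print(f"simulated blocked fraction:  {trace.mean():.3f}")
```

The analytical blocked fraction equals $50 / (50 + 450) = 0.1$, and the simulated fraction approaches it as the number of steps grows.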

Results: Under UE mobility, URLLC latency increases from 0.87 ms (static) to 0.94 ms, which remains below the 1 ms target. Reliability falls slightly to 99.998% (0.002% failures, compared with 0.001% in the static case). eMBB throughput drops by approximately 8% (from 104 Mbps to 96 Mbps) due to fast fading, but PRB utilisation declines only slightly from 87% to 83%. Under dynamic blockage, URLLC latency reaches 0.91 ms, reliability stays at 99.999% (unchanged), and eMBB throughput is 100 Mbps (a 4% reduction).

Interpretation: The trained policy generalises well without retraining. The GCN encodes the RU graph structure; as UEs move, it still aggregates neighbourhood information effectively. The reward-shaping term penalises queue buildup regardless of the cause (congestion, mobility, or blockage), so the policy has learned to keep ONU buffers low, which provides headroom against sudden changes. Under dynamic blockage, the agent quickly reallocates PRBs to RUs with clear LOS paths. The slight eMBB degradation is expected because fast fading reduces spectral efficiency. Overall, RS-PPO demonstrates strong robustness to both mobility and dynamic blockage: the 1 ms URLLC latency target is met in all tested scenarios, and reliability falls below 99.999% only under mobility (99.998%).

Figure 8 illustrates the URLLC latency and reliability across the three scenarios, confirming that the policy meets targets without retraining.

7. Conclusions

The paper introduces a slice-aware, computationally efficient resource orchestration framework for converged mmWave-PON O-RAN systems. The joint DBA and PRB allocation problem is formulated as a CMDP, and a reward-shaped PPO algorithm is developed. A potential-based reward shaping function penalises DBA–PRB misalignment and queue buildup to accelerate convergence. Computational efficiency is improved through graph convolutional state abstraction, action masking, and prioritised N-step replay. Simulations show that RS-PPO reduces URLLC latency by 37%, improves PRB utilisation by 28%, converges 45% faster, and reduces inference time by 45%, while maintaining 99.999% reliability and satisfying the strict URLLC latency (1 ms), reliability (99.999%), and eMBB throughput (100 Mbps) constraints. The framework is compatible with O-RAN standards and can be deployed as an xApp on the near-RT RIC. Future work will extend the framework to multi-PON topologies, explore meta-learning for non-stationary slice mixes, and implement FPGA-based sub-millisecond inference.

Author Contributions

Conceptualization, N.S. and B.N.; Formal analysis, N.S. and B.P.; Investigation, B.N. and B.P.; Resources, B.N.; Software, N.S., B.N. and B.P.; Supervision, B.N.; Visualization, B.N. and B.P.; Writing—review and editing, N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CMDP	Constrained Markov Decision Process
CTI	Cooperative Transport Interface
CU	Central Unit
DBA	Dynamic Bandwidth Allocation
DRL	Deep Reinforcement Learning
DU	Distributed Unit
eMBB	Enhanced Mobile Broadband
GAE	Generalized Advantage Estimation
GCN	Graph Convolutional Network
gNB	gNodeB (5G base station)
GPON	Gigabit-capable Passive Optical Network
IPACT	Interleaved Polling with Adaptive Cycle Time
mMTC	Massive Machine-Type Communications
mmWave	Millimetre-wave (or millimeter-wave)
near-RT	Near-Real-Time
OLT	Optical Line Terminal
OMNeT++	Objective Modular Network Testbed in C++
ONU	Optical Network Unit
O-RAN	Open Radio Access Network
PON	Passive Optical Network
PPO	Proximal Policy Optimization
PRB	Physical Resource Block
QoS	Quality of Service
RIC	RAN Intelligent Controller
RS-PPO	Reward-Shaped Proximal Policy Optimization
RU	Radio Unit
SINR	Signal-to-Interference-plus-Noise Ratio
SLA	Service-Level Agreement
TDM-PON	Time-Division Multiplexing Passive Optical Network
TTI	Transmission Time Interval
UE	User Equipment
URLLC	Ultra-Reliable Low-Latency Communication
xApp	Application Running on the near-RT RIC (O-RAN term)

References

1. Alam, K.; Habibi, M.A.; Tammen, M.; Krummacker, D.; Saad, W.; Di Renzo, M.; Melodia, T.; Costa-Pérez, X.; Debbah, M.; Dutta, A.; et al. A comprehensive tutorial and survey of O-RAN: Exploring slicing-aware architecture, deployment options, use cases, and challenges. IEEE Commun. Surv. Tutor. 2026, 28, 1637–1678.
2. Wang, S.; Xiong, G.; Zhang, S.; Zeng, H.; Li, J.; Panwar, S.S. Structured reinforcement learning for delay-optimal data transmission in dense mmWave networks. IEEE Trans. Wirel. Commun. 2024, 23, 14546–14559.
3. O-RAN Alliance. O-RAN Transport Protocols for R1 Services. O-RAN.WG2.TS.R1TP-R004-v04.03. June 2025. Available online: https://specifications.o-ran.org/specifications (accessed on 18 April 2026).
4. Slyne, F.; O’Sullivan, K.; Dzaferagic, M.; Richardson, B.; Wrzeszcz, M.; Ryan, B.; Power, N.; Giller, R.; Ruffini, M. Demonstration of cooperative transport interface using open-source 5G OpenRAN and virtualised PON network. In Proceedings of the 2024 Optical Fiber Communications Conference and Exhibition (OFC), San Diego, CA, USA, 24–28 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–3.
5. Yaqoob, A.; Muntean, G.-M. A slice-centric and SLA-aware flexible radio resources allocation solution for 5G network slice management. In Proceedings of the 2025 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Dublin, Ireland, 11–13 June 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.
6. Dorcheh, A.E.; Seyfi, T.; Afghah, F. DORA: Dynamic O-RAN resource allocation for multi-slice 5G networks. arXiv 2025, arXiv:2509.07242.
7. Tsampazi, M.; D’Oro, S.; Polese, M.; Bonati, L.; Poitau, G.; Healy, M.; Alavirad, M.; Melodia, T. PandORA: Automated design and comprehensive evaluation of deep reinforcement learning agents for Open RAN. IEEE Trans. Mob. Comput. 2025, 24, 3223–3240.
8. Ebrahimi, S.; Bouali, F.; Haas, O.C.L. MARSRA: Mobility-aware RAN slicing resource allocation for Open RAN deployments. In Proceedings of the 2024 IEEE 29th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), Athens, Greece, 21–23 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
9. Dai, J.; Li, L.; Safavinejad, R.; Mahboob, S.; Chen, H.; Ratnam, V.V.; Wang, H.; Zhang, J.; Liu, L. O-RAN-enabled intelligent network slicing to meet service-level agreement (SLA). IEEE Trans. Mob. Comput. 2025, 24, 890–906.
10. Bidkar, S.; Christodoulopoulos, K.; Pfeiffer, T.; Bonk, R. Evaluating bandwidth efficiency and latency of scheduling schemes for 5G fronthaul over TDM-PON. In Proceedings of the 2022 European Conference on Optical Communication (ECOC), Basel, Switzerland, 18–22 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–4.
11. Valcarenghi, L.; Marotta, A.; Centofanti, C.; Graziosi, F.; Kondepu, K. Energy-efficient integrated O-RAN/PON access network. In Proceedings of the ICC 2024–IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4967–4972.
12. Mehdaoui, M.; Abouaomar, A. Dynamics of resource allocation in O-RANs: An in-depth exploration of on-policy and off-policy deep reinforcement learning for real-time applications. arXiv 2024, arXiv:2412.01839.
13. Zheng, M.; Zhang, J.; Zhan, C.; Ren, X.; Lü, S. Proximal policy optimization with reward-based prioritization. Expert Syst. Appl. 2025, 283, 127659.
14. Xiong, X.; Hu, S.; Yan, T.; Xing, Z.; Ma, T.; Yin, K.; Wang, J.; Wei, X. Intelligent jamming decision-making system based on reinforcement learning. Comput. Electr. Eng. 2025, 123, 110288.
15. Ding, S.; Lin, D.; Zhou, X. Graph convolutional reinforcement learning for dependent task allocation in edge computing. In Proceedings of the 2021 IEEE International Conference on Agents (ICA), Kyoto, Japan, 13–15 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 25–30.
16. Hu, G.; Zhang, W.; Zhu, W. Prioritized experience replay for continual learning. In Proceedings of the 2021 6th International Conference on Computational Intelligence and Applications (ICCIA), Xiamen, China, 11–13 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16–20.
17. Mannion, P.; Devlin, S.; Mason, K.; Duggan, J.; Howley, E. Policy invariance under reward transformations for multi-objective reinforcement learning. Neurocomputing 2017, 263, 60–73.

Figure 1. Proposed resource orchestration architecture for converged mmWave–PON O-RAN.

Figure 2. Graph Convolutional Network (GCN) architecture for state abstraction.

Figure 3. URLLC latency vs. normalized load.

Figure 4. PRB utilization vs. eMBB traffic load.

Figure 5. Training convergence (cumulative reward).

Figure 6. Radar chart comparing four methods across five normalized performance metrics.

Figure 7. Inference time per control step for successive optimizations of the RS-PPO framework.

Figure 8. Robustness of the trained RS-PPO policy under static, mobility, and dynamic blockage scenarios (no retraining).

Table 1. Qualitative comparison of related works with the proposed RS-PPO framework.

Work/Framework	Slice-Aware	PON/Fronthaul Constraints	Joint DBA–PRB Optimisation	Reward Shaping for Misalignment	Computational Efficiency (Sub-5 ms)	Deployable as xApp on Near-RT RIC
DORA [6]	✓	✗ (ideal fronthaul)	✗ (radio only)	✗	✗ (not reported)	✓
PandORA [7]	✓	✗ (ideal fronthaul)	✗ (radio only)	✓ (general shaping)	✗ (not reported)	✓
MARSRA [8]	✓	✗ (ideal fronthaul)	✗ (radio only)	✗	✗ (not reported)	✓
CTI demo [4]	✗	✓ (heuristic coordination)	✗ (no learning)	✗	✓ (rule-based)	✓
Cooperative DBA [10]	✗	✓ (TDD-aware DBA)	✗ (optical only)	✗	✓ (heuristic)	✗ (no RIC integration)
Energy-efficient O-RAN/PON [11]	✗	✓ (sleep modes)	✗	✗	✓ (simple)	✓
RS-PPO (Ours)	✓	✓ (full CMDP formulation)	✓ (joint DBA + PRB)	✓ (potential-based, misalignment-specific)	✓ (2.3 ms inference, 45% reduction)	✓ (as xApp)

Legend: ✓ = fully addressed/included; ✗ = not addressed/missing.

Table 2. Principal notation used in the system model and problem formulation.

Symbol	Description	Typical Value/Domain
$R, U, N, K$	Number of RUs, UEs, ONUs, PRBs	5, 125, 5, 50
$S$	Set of slices	{URLLC, eMBB, mMTC}
$T_{ctrl}$	Control period (RIC decision interval)	10 ms
$T_{DBA}$	DBA cycle time	125 µs
$B_{PRB}$	Bandwidth per PRB	2 MHz
$C_{PON}$	PON upstream line rate	2.488 Gbps
${SINR}_{r, u, k} (t)$	Signal-to-interference-plus-noise ratio	Dimensionless
$R_{r, u, k} (t)$	Achievable data rate on PRB $k$	bps
$x_{r, k} (t)$	PRB allocation indicator (1 if assigned)	$\{0, 1\}$
$Q_{n} (t)$	ONU $n$ queue length	bytes
$g_{n} (t)$	DBA grant allocated to ONU $n$	bytes
$s (t)$	State vector (CMDP)	–
$a (t)$	Action vector (PRB + DBA grants)	–
$R_{base} (t), R_{shaped} (t)$	Base and shaped rewards	scalar
$Φ (s)$	Potential function for reward shaping	scalar
$γ, λ, ε$	Discount factor, GAE parameter, PPO clip	0.99, 0.95, 0.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

