Slice-Aware and Computationally Efficient Resource Orchestration for Converged mmWave–PON O-RAN: A Reward-Shaped PPO Approach for Joint DBA and PRB Allocation
Abstract
1. Introduction
- Firstly, we develop a detailed system model that captures mmWave channel dynamics (including blockage), PON DBA cycles, ONU buffer evolution, and slice-specific traffic characteristics (URLLC, eMBB, mMTC).
- Secondly, we design a potential-based reward shaping function that penalises DBA–PRB misalignment and provide a proof that the shaped reward preserves the optimal policy.
- Thirdly, we introduce three computational efficiency enhancements—GCN state abstraction, action masking, and prioritised N-step replay—that together reduce inference time and accelerate convergence.
- Fourthly, through extensive simulations using 3GPP-compliant mmWave channel models, GPON DBA dynamics, and mixed slice traffic, we demonstrate that RS-PPO reduces URLLC end-to-end latency by 37% (from 1.38 ms to 0.87 ms), improves PRB utilisation by 28% in relative terms (from 68% to 87%, i.e., 19 percentage points), achieves 99.999% reliability for URLLC slices, converges 45% faster, and reduces inference time by 45% (from 4.2 ms to 2.3 ms). The framework is compatible with O-RAN specifications and can be deployed as an xApp on the near-real-time RIC.
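The headline percentages in the last contribution can be cross-checked from the before/after values quoted there. The snippet below is a minimal sketch of that arithmetic; the helper name `relative_change` is illustrative and does not come from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change (after - before) / before."""
    return (after - before) / before


# Before/after pairs quoted in the contribution list.
latency_ms = (1.38, 0.87)        # URLLC end-to-end latency
utilisation_pct = (68.0, 87.0)   # PRB utilisation
inference_ms = (4.2, 2.3)        # inference time per decision

latency_cut = -relative_change(*latency_ms)
utilisation_gain = relative_change(*utilisation_pct)
inference_cut = -relative_change(*inference_ms)

print(f"latency reduction:   {latency_cut:.1%}")
print(f"utilisation gain:    {utilisation_gain:.1%}")
print(f"inference reduction: {inference_cut:.1%}")
```

The three ratios round to 37%, 28%, and 45%. The utilisation figure is a relative gain; the absolute difference is 19 percentage points.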
2. Related Work
2.1. O-RAN Resource Allocation and Network Slicing
2.2. PON-RAN Integration and Cooperative Transport
2.3. DRL, Reward Shaping, and Computational Efficiency
2.4. Differentiation from PandORA and GNN-DRL Hybrids
2.5. Synthesis and Identified Gap
3. System Model for Converged mmWave–PON O-RAN
3.1. O-RAN Architecture with mmWave RUs and PON Fronthaul
3.2. mmWave Radio Access Model
3.3. PON Fronthaul Model
3.4. Traffic Models for Network Slices
3.5. Cooperative DBA Heuristic (Baseline)
- Collects buffer reports from all ONUs and receives the predicted traffic arrivals from the DU (based on the current TDD uplink/downlink configuration).
- Computes a provisional grant for each ONU as .
- Scales the grants to satisfy the PON capacity constraint: .
- Allocates grants to ONUs; ONUs transmit up to bits.
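The four steps above can be sketched as a single DBA cycle. The formulas for the provisional grant and the scaling are not available in this text, so the rules below are assumptions: the provisional grant is taken as reported buffer plus predicted arrival, scaling is proportional, and an ONU can send at most what it has queued. The function name `cooperative_dba_cycle` and the example numbers are hypothetical.

```python
from typing import List, Tuple


def cooperative_dba_cycle(
    buffers: List[float],
    predicted_arrivals: List[float],
    capacity: float,
) -> Tuple[List[float], List[float]]:
    """One DBA cycle of the baseline heuristic (all quantities in bits).

    ASSUMPTIONS (the source formulas are not reproduced here):
      * provisional grant = reported buffer + predicted arrival
      * grants are scaled proportionally when they exceed the capacity
      * an ONU transmits at most what it currently has queued
    """
    # Steps 1-2: provisional grant per ONU from report plus prediction.
    provisional = [b + a for b, a in zip(buffers, predicted_arrivals)]

    # Step 3: proportional scaling so the grants fit the PON capacity.
    total = sum(provisional)
    scale = min(1.0, capacity / total) if total > 0 else 0.0
    grants = [g * scale for g in provisional]

    # Step 4: each ONU transmits up to its grant, limited by its queue.
    sent = [min(g, b) for g, b in zip(grants, buffers)]
    return grants, sent


# Hypothetical example: three ONUs sharing 311,000 bits per cycle.
buffers = [100_000.0, 200_000.0, 50_000.0]
arrivals = [20_000.0, 40_000.0, 10_000.0]
grants, sent = cooperative_dba_cycle(buffers, arrivals, capacity=311_000.0)
print(grants, sent)
```

In this example the provisional grants total 420,000 bits, so every grant is scaled by 311,000/420,000 and the scaled grants sum exactly to the capacity.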
4. Problem Formulation and Reward-Shaped PPO
4.1. Constrained Markov Decision Process Formulation
Explicit Constraint Specifications
4.2. Reward Shaping for Accelerated Convergence
4.2.1. Detailed Design of the Reward Shaping Function
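For reference, the standard potential-based form and the reason it leaves the optimal policy unchanged can be written out as below. This follows the classical result of Ng, Harada and Russell (1999) and assumes a bounded potential and a discount factor below one, so that the tail term vanishes. The symbols $r$, $\tilde{r}$, $\Phi$, and $\gamma$ are assumed notation, and the paper's specific misalignment potential is not reproduced here.

```latex
% Potential-based shaping (assumed notation)
\tilde{r}(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)

% Discounted sum along any trajectory: the potential terms telescope
\sum_{t=0}^{\infty} \gamma^{t}\,\tilde{r}_t
  = \sum_{t=0}^{\infty} \gamma^{t} r_t
    + \sum_{t=0}^{\infty} \bigl(\gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t)\bigr)
  = \sum_{t=0}^{\infty} \gamma^{t} r_t - \Phi(s_0)

% Hence, for every policy \pi
\tilde{Q}^{\pi}(s, a) = Q^{\pi}(s, a) - \Phi(s)
\quad\Longrightarrow\quad
\arg\max_{a} \tilde{Q}^{*}(s, a) = \arg\max_{a} Q^{*}(s, a)
```

Because the offset $\Phi(s)$ does not depend on the action, the ranking of actions in every state is unchanged, which is the sense in which shaping preserves the optimal policy.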
4.2.2. Mechanism for Addressing DBA–PRB Misalignment
4.3. Proximal Policy Optimization with Shaped Reward
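| Algorithm 1: Reward-Shaped PPO for Joint DBA–PRB Allocation |
| Input: |
| Policy $\pi_\theta$; value function $V_\phi$; potential function $\Phi$; discount factor $\gamma$; GAE parameter $\lambda$; clipping parameter $\epsilon$; learning rates $\alpha_\theta, \alpha_\phi$; episodes $E$; horizon $T$ |
| Output: Optimal policy $\pi_{\theta^*}$ |
| 1: Initialise $\theta$, $\phi$ |
| 2: for episode $= 1$ to $E$ do |
| 3: Observe initial state $s_0$ |
| 4: Initialise trajectory buffer $\mathcal{D} \leftarrow \emptyset$ |
| 5: for $t = 0$ to $T - 1$ do |
| 6: Sample action: $a_t \sim \pi_\theta(\cdot \mid s_t)$ |
| 7: Execute action $a_t$ |
| Observe base reward $r_t$ and next state $s_{t+1}$ |
| 8: Compute shaped reward: $\tilde{r}_t = r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)$ |
| 9: Store $(s_t, a_t, \tilde{r}_t, s_{t+1})$ in $\mathcal{D}$ |
| 10: $s_t \leftarrow s_{t+1}$ |
| 11: end for |
| 12: Compute temporal-difference residual: $\delta_t = \tilde{r}_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ |
| 13: Compute advantage (GAE): $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\,\delta_{t+l}$ |
| 14: Compute return: $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$ |
| 15: for each mini-batch $\mathcal{B} \subset \mathcal{D}$ do |
| 16: Compute probability ratio: $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ |
| 17: Update actor (gradient ascent): $\theta \leftarrow \theta + \alpha_\theta \nabla_\theta\, \mathbb{E}_{\mathcal{B}}\big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\big]$ |
| 18: Update critic (MSE loss): $\phi \leftarrow \phi - \alpha_\phi \nabla_\phi\, \mathbb{E}_{\mathcal{B}}\big[\big(V_\phi(s_t) - \hat{R}_t\big)^2\big]$ |
| 19: end for |
| 20: end for |
| 21: return $\pi_\theta$ |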
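The numerical core of Algorithm 1 (shaped reward, TD residual, GAE, and the clipped surrogate) can be illustrated with a short NumPy sketch. It uses the standard textbook forms of these quantities, and the hyperparameters 0.99, 0.95, and 0.2 from the notation table. The function names, the toy trajectory, and the potential values are hypothetical, and no neural network or environment is included.

```python
import numpy as np


def shaped_rewards(rewards, potentials, gamma):
    """Potential-based shaping: r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t).

    `potentials` has length T + 1 (one value per visited state).
    """
    rewards = np.asarray(rewards, dtype=float)
    potentials = np.asarray(potentials, dtype=float)
    return rewards + gamma * potentials[1:] - potentials[:-1]


def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one trajectory.

    `values` has length T + 1; the last entry bootstraps the final state.
    Returns (advantages, returns), each of length T.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages, advantages + values[:-1]


def clipped_surrogate(logp_new, logp_old, advantages, eps):
    """PPO clipped objective (to be maximised), averaged over the batch."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))


# Toy trajectory with hypothetical numbers (T = 3).
gamma, lam, eps = 0.99, 0.95, 0.2
r = [1.0, 0.0, 2.0]
phi = [0.5, 0.2, 0.4, 0.0]  # hypothetical potentials, Phi(terminal) = 0
v = [0.0, 0.0, 0.0, 0.0]    # untrained critic

r_shaped = shaped_rewards(r, phi, gamma)
adv, ret = gae(r_shaped, v, gamma, lam)
objective = clipped_surrogate(np.zeros(3), np.zeros(3), adv, eps)
print(r_shaped, adv, objective)
```

With a zero critic and identical old and new log-probabilities, the ratio is 1 and the objective reduces to the mean advantage. The discounted sum of the shaped rewards differs from the base sum by exactly the initial potential, because the terminal potential is set to zero.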
Algorithmic Commentary
5. Computational Efficiency Enhancements
Graph Convolutional State Abstraction
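As an illustration of graph-convolutional state abstraction, the sketch below applies one standard GCN layer (symmetric normalisation with self-loops, as in Kipf and Welling) to a small access-network graph and pools the node embeddings into a fixed-size state code. The star topology, the feature meanings, the layer sizes, and the mean pooling are assumptions for illustration; the paper's actual graph construction and architecture are not given in this text.

```python
import numpy as np


def gcn_layer(adj, features, weight):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    adj = np.asarray(adj, dtype=float)
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    deg = a_hat.sum(axis=1)                     # every degree is >= 1
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt  # symmetric normalisation
    return np.maximum(norm_adj @ features @ weight, 0.0)


# Hypothetical topology: one OLT (node 0) connected to 5 ONU/RU nodes.
num_nodes = 6
adj = np.zeros((num_nodes, num_nodes))
adj[0, 1:] = 1.0
adj[1:, 0] = 1.0

rng = np.random.default_rng(0)
features = rng.normal(size=(num_nodes, 4))  # e.g. queue, grant, load, SINR
weight = rng.normal(size=(4, 8))            # untrained, for shapes only

embedding = gcn_layer(adj, features, weight)  # shape (6, 8)
state_code = embedding.mean(axis=0)           # fixed-size summary, shape (8,)
print(embedding.shape, state_code.shape)
```

Mean pooling keeps the policy input size independent of the number of nodes, which is one plausible way such an abstraction could reduce inference cost.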
| Algorithm 2: Inference with Action Masking for Joint DBA–PRB Allocation |
| Input: |
| Trained policy ; State ; Action mask Maximum grant ; |
| Queue lengths |
| Output: |
| Joint action |
| 1: Encode state: |
| -------------------------------------------------- |
| 2: PRB Allocation (Discrete Actions) |
| -------------------------------------------------- |
| for each PRB to do |
| Compute logits: |
| Apply mask: |
| if then |
| end if |
| Compute probabilities: |
| Sample allocation: |
| end for |
| -------------------------------------------------- |
| 3: DBA Grant Allocation (Continuous Actions) |
| -------------------------------------------------- |
| for each ONU to do |
| Compute raw grant: |
| Project to feasible region: |
| end for |
| -------------------------------------------------- |
| 4: return |
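The two inference stages of Algorithm 2 can be sketched as follows. Masked sampling sets the logits of disallowed actions to minus infinity before the softmax, so they receive zero probability. The grant projection is an assumption, since the exact rule is not available in this text: each raw grant is clipped to the range from zero to the smaller of the maximum grant and the queue length, and the result is scaled down if the total exceeds a capacity. The function names and numbers are hypothetical.

```python
import numpy as np


def masked_sample(logits, mask, rng):
    """Sample one index from softmax(logits), never picking mask == 0."""
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be allowed")
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked[mask].max()  # stabilise the exponentials
    probs = np.exp(masked)                # exp(-inf) == 0 for masked entries
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs


def project_grants(raw, queues, g_max, capacity):
    """ASSUMED rule: clip to [0, min(g_max, queue)], then fit the capacity."""
    raw = np.asarray(raw, dtype=float)
    upper = np.minimum(np.asarray(queues, dtype=float), g_max)
    grants = np.clip(raw, 0.0, upper)
    total = grants.sum()
    if total > capacity:
        grants = grants * (capacity / total)
    return grants


rng = np.random.default_rng(0)

# One PRB, four candidate UEs; UEs 1 and 3 are masked out.
ue, probs = masked_sample([2.0, 1.0, 0.5, 3.0], [1, 0, 1, 0], rng)

# Three ONUs with hypothetical raw grants from the policy.
grants = project_grants(
    [-5.0, 300.0, 120.0], queues=[50.0, 200.0, 500.0], g_max=150.0, capacity=200.0
)
print(ue, probs, grants)
```

Masking at the logit level means infeasible assignments are never sampled, so no post-hoc rejection or repair step is needed for the discrete part of the action.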
6. Simulation Results
6.1. Simulation Setup
6.2. URLLC Latency Performance
6.3. PRB Utilisation for eMBB Slice
6.4. Training Convergence
6.5. Multi-Metric Performance Comparison
6.6. Computational Overhead
6.7. Robustness to Mobility and Dynamic Blockage
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| CMDP | Constrained Markov Decision Process |
| CTI | Cooperative Transport Interface |
| CU | Central Unit |
| DBA | Dynamic Bandwidth Allocation |
| DRL | Deep Reinforcement Learning |
| DU | Distributed Unit |
| eMBB | Enhanced Mobile Broadband |
| GAE | Generalized Advantage Estimation |
| GCN | Graph Convolutional Network |
| gNB | gNodeB (5G base station) |
| GPON | Gigabit-capable Passive Optical Network |
| IPACT | Interleaved Polling with Adaptive Cycle Time |
| mMTC | Massive Machine-Type Communications |
| mmWave | Millimetre-wave (or millimeter-wave) |
| near-RT | Near-Real-Time |
| OLT | Optical Line Terminal |
| OMNeT++ | Objective Modular Network Testbed in C++ |
| ONU | Optical Network Unit |
| O-RAN | Open Radio Access Network |
| PON | Passive Optical Network |
| PPO | Proximal Policy Optimization |
| PRB | Physical Resource Block |
| QoS | Quality of Service |
| RIC | RAN Intelligent Controller |
| RS-PPO | Reward-Shaped Proximal Policy Optimization |
| RU | Radio Unit |
| SINR | Signal-to-Interference-plus-Noise Ratio |
| SLA | Service-Level Agreement |
| TDM-PON | Time-Division Multiplexing Passive Optical Network |
| TTI | Transmission Time Interval |
| UE | User Equipment |
| URLLC | Ultra-Reliable Low-Latency Communication |
| xApp | Application Running on the near-RT RIC (O-RAN term) |
References
1. Alam, K.; Habibi, M.A.; Tammen, M.; Krummacker, D.; Saad, W.; Di Renzo, M.; Melodia, T.; Costa-Pérez, X.; Debbah, M.; Dutta, A.; et al. A comprehensive tutorial and survey of O-RAN: Exploring slicing-aware architecture, deployment options, use cases, and challenges. IEEE Commun. Surv. Tutor. 2026, 28, 1637–1678.
2. Wang, S.; Xiong, G.; Zhang, S.; Zeng, H.; Li, J.; Panwar, S.S. Structured reinforcement learning for delay-optimal data transmission in dense mmWave networks. IEEE Trans. Wirel. Commun. 2024, 23, 14546–14559.
3. O-RAN Alliance. O-RAN Transport Protocols for R1 Services. O-RAN.WG2.TS.R1TP-R004-v04.03. June 2025. Available online: https://specifications.o-ran.org/specifications (accessed on 18 April 2026).
4. Slyne, F.; O’Sullivan, K.; Dzaferagic, M.; Richardson, B.; Wrzeszcz, M.; Ryan, B.; Power, N.; Giller, R.; Ruffini, M. Demonstration of cooperative transport interface using open-source 5G OpenRAN and virtualised PON network. In Proceedings of the 2024 Optical Fiber Communications Conference and Exhibition (OFC), San Diego, CA, USA, 24–28 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–3.
5. Yaqoob, A.; Muntean, G.-M. A slice-centric and SLA-aware flexible radio resources allocation solution for 5G network slice management. In Proceedings of the 2025 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Dublin, Ireland, 11–13 June 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.
6. Dorcheh, A.E.; Seyfi, T.; Afghah, F. DORA: Dynamic O-RAN resource allocation for multi-slice 5G networks. arXiv 2025, arXiv:2509.07242.
7. Tsampazi, M.; D’Oro, S.; Polese, M.; Bonati, L.; Poitau, G.; Healy, M.; Alavirad, M.; Melodia, T. PandORA: Automated design and comprehensive evaluation of deep reinforcement learning agents for Open RAN. IEEE Trans. Mob. Comput. 2025, 24, 3223–3240.
8. Ebrahimi, S.; Bouali, F.; Haas, O.C.L. MARSRA: Mobility-aware RAN slicing resource allocation for Open RAN deployments. In Proceedings of the 2024 IEEE 29th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), Athens, Greece, 21–23 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
9. Dai, J.; Li, L.; Safavinejad, R.; Mahboob, S.; Chen, H.; Ratnam, V.V.; Wang, H.; Zhang, J.; Liu, L. O-RAN-enabled intelligent network slicing to meet service-level agreement (SLA). IEEE Trans. Mob. Comput. 2025, 24, 890–906.
10. Bidkar, S.; Christodoulopoulos, K.; Pfeiffer, T.; Bonk, R. Evaluating bandwidth efficiency and latency of scheduling schemes for 5G fronthaul over TDM-PON. In Proceedings of the 2022 European Conference on Optical Communication (ECOC), Basel, Switzerland, 18–22 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–4.
11. Valcarenghi, L.; Marotta, A.; Centofanti, C.; Graziosi, F.; Kondepu, K. Energy-efficient integrated O-RAN/PON access network. In Proceedings of the ICC 2024–IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4967–4972.
12. Mehdaoui, M.; Abouaomar, A. Dynamics of resource allocation in O-RANs: An in-depth exploration of on-policy and off-policy deep reinforcement learning for real-time applications. arXiv 2024, arXiv:2412.01839.
13. Zheng, M.; Zhang, J.; Zhan, C.; Ren, X.; Lü, S. Proximal policy optimization with reward-based prioritization. Expert Syst. Appl. 2025, 283, 127659.
14. Xiong, X.; Hu, S.; Yan, T.; Xing, Z.; Ma, T.; Yin, K.; Wang, J.; Wei, X. Intelligent jamming decision-making system based on reinforcement learning. Comput. Electr. Eng. 2025, 123, 110288.
15. Ding, S.; Lin, D.; Zhou, X. Graph convolutional reinforcement learning for dependent task allocation in edge computing. In Proceedings of the 2021 IEEE International Conference on Agents (ICA), Kyoto, Japan, 13–15 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 25–30.
16. Hu, G.; Zhang, W.; Zhu, W. Prioritized experience replay for continual learning. In Proceedings of the 2021 6th International Conference on Computational Intelligence and Applications (ICCIA), Xiamen, China, 11–13 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 16–20.
17. Mannion, P.; Devlin, S.; Mason, K.; Duggan, J.; Howley, E. Policy invariance under reward transformations for multi-objective reinforcement learning. Neurocomputing 2017, 263, 60–73.

| Work/Framework | Slice-Aware | PON/Fronthaul Constraints | Joint DBA–PRB Optimisation | Reward Shaping for Misalignment | Computational Efficiency (Sub-5 ms) | Deployable as xApp on Near-RT RIC |
|---|---|---|---|---|---|---|
| DORA [6] | ✓ | ✗ (ideal fronthaul) | ✗ (radio only) | ✗ | ✗ (not reported) | ✓ |
| PandORA [7] | ✓ | ✗ (ideal fronthaul) | ✗ (radio only) | ✓ (general shaping) | ✗ (not reported) | ✓ |
| MARSRA [8] | ✓ | ✗ (ideal fronthaul) | ✗ (radio only) | ✗ | ✗ (not reported) | ✓ |
| CTI demo [4] | ✗ | ✓ (heuristic coordination) | ✗ (no learning) | ✗ | ✓ (rule-based) | ✓ |
| Cooperative DBA [10] | ✗ | ✓ (TDD-aware DBA) | ✗ (optical only) | ✗ | ✓ (heuristic) | ✗ (no RIC integration) |
| Energy-efficient O-RAN/PON [11] | ✗ | ✓ (sleep modes) | ✗ | ✗ | ✓ (simple) | ✓ |
| RS-PPO (Ours) | ✓ | ✓ (full CMDP formulation) | ✓ (joint DBA + PRB) | ✓ (potential-based, misalignment-specific) | ✓ (2.3 ms inference, 45% reduction) | ✓ (as xApp) |
| Symbol | Description | Typical Value/Domain |
|---|---|---|
| | Number of RUs, UEs, ONUs, PRBs | 5, 125, 5, 50 |
| | Set of slices | {URLLC, eMBB, mMTC} |
| | Control period (RIC decision interval) | 10 ms |
| | DBA cycle time | 125 µs |
| | Bandwidth per PRB | 2 MHz |
| | PON upstream line rate | 2.488 Gbps |
| | Signal-to-interference-plus-noise ratio | Dimensionless |
| | Achievable data rate on PRB | bps |
| | PRB allocation indicator (1 if assigned) | {0, 1} |
| | ONU queue length | bytes |
| | DBA grant allocated to ONU | bytes |
| | State vector (CMDP) | – |
| | Action vector (PRB + DBA grants) | – |
| | Base and shaped rewards | scalar |
| | Potential function for reward shaping | scalar |
| | Discount factor, GAE parameter, PPO clip | 0.99, 0.95, 0.2 |
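The typical values in the notation table can be collected into a small configuration object, which also makes a few derived quantities explicit. This is a sketch only: the class name `SimConfig` and all field names are illustrative, and the total-bandwidth property assumes the 50 PRBs are simply summed at 2 MHz each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Typical values from the notation table; field names are illustrative."""

    num_rus: int = 5
    num_ues: int = 125
    num_onus: int = 5
    num_prbs: int = 50
    control_period_s: float = 10e-3  # RIC decision interval
    dba_cycle_s: float = 125e-6      # DBA cycle time
    prb_bandwidth_hz: float = 2e6    # bandwidth per PRB
    pon_rate_bps: float = 2.488e9    # PON upstream line rate
    gamma: float = 0.99              # discount factor
    gae_lambda: float = 0.95         # GAE parameter
    clip_eps: float = 0.2            # PPO clip

    @property
    def dba_cycles_per_decision(self) -> int:
        """Number of DBA cycles that fit into one RIC control period."""
        return round(self.control_period_s / self.dba_cycle_s)

    @property
    def bits_per_dba_cycle(self) -> float:
        """Upstream bits the PON can carry in one DBA cycle."""
        return self.pon_rate_bps * self.dba_cycle_s

    @property
    def total_bandwidth_hz(self) -> float:
        """Aggregate radio bandwidth, assuming PRB bandwidths simply add."""
        return self.num_prbs * self.prb_bandwidth_hz


cfg = SimConfig()
print(cfg.dba_cycles_per_decision, cfg.bits_per_dba_cycle, cfg.total_bandwidth_hz)
```

With these values, one 10 ms control period spans 80 DBA cycles, each cycle carries about 311,000 bits (38,875 bytes) upstream across all ONUs, and the 50 PRBs add up to 100 MHz.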
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Shezi, N.; Nleya, B.; Pule, B. Slice-Aware and Computationally Efficient Resource Orchestration for Converged mmWave–PON O-RAN: A Reward-Shaped PPO Approach for Joint DBA and PRB Allocation. Telecom 2026, 7, 75. https://doi.org/10.3390/telecom7030075

