Techno-Economic and SLA-Aware Control of 5G Cloud-RAN via Multi-Objective and Penalty-Constrained Reinforcement Learning

Aboul, Sherif M.; Abd El Kader, Hala M.; Eid, Esraa M.; Ali, Shimaa S.

doi:10.3390/network6020020

¹ Electrical Engineering Department, Faculty of Engineering at Shoubra, Benha University, Cairo 11629, Egypt

² Electrical Engineering Department, Faculty of Engineering, MTI University, Cairo 11439, Egypt

³ Faculty of Computer Science, Benha National University, Cairo 13518, Egypt

* Author to whom correspondence should be addressed.

Network 2026, 6(2), 20; https://doi.org/10.3390/network6020020

Submission received: 6 February 2026 / Revised: 12 March 2026 / Accepted: 16 March 2026 / Published: 31 March 2026

Abstract

Fifth-generation (5G) mobile networks must simultaneously satisfy stringent latency targets, high user density, and energy-aware operation across heterogeneous services. Cloud Radio Access Networks (C-RAN) provide architectural flexibility through centralized baseband processing, but they also introduce new control challenges related to fronthaul constraints, dynamic traffic variations, and joint radio–compute coordination with Mobile Edge Computing (MEC). This paper proposes a unified AI-driven optimization framework for adaptive 5G C-RAN management, in which the controller dynamically tunes key system decisions, including functional split selection, TDD downlink ratio, user–RU association, fronthaul load management, and MEC offloading proportion. To enable fair benchmarking under identical simulation settings, a static baseline policy is compared against five adaptive control strategies: Deep Q-Network (DQN), Proximal Policy Optimization (PPO), Deep Deterministic Policy Gradient (DDPG), Multi-Objective Reinforcement Learning (MORL), and the Penalty-Constrained Hierarchical Action Controller (PCHAC), a deterministic Service-Level Agreement (SLA)-aware controller. Performance evaluation across techno-economic and service KPIs shows that intelligent control significantly improves operational profit, tail-latency behavior, and energy efficiency while enhancing SLA compliance compared with non-adaptive operation. The results highlight the practicality of multi-objective and constraint-aware learning for next-generation C-RAN orchestration under scaling traffic demand.

Keywords: C-RAN; 5G NR; MEC; deep reinforcement learning; multi-objective optimization; fronthaul constraint

1. Introduction

The continuous expansion of 5G mobile communication networks has created strong demand for high data rates, ultra-reliable low-latency services, and massive user connectivity under highly dynamic radio and traffic conditions [1]. These requirements become more complex in C-RAN deployments, where baseband processing centralization, fronthaul limitations, and distributed radio resources jointly impact system performance [2]. The integration of MEC further increases system complexity by introducing computation-related delay components and additional control decisions for latency-sensitive services [3].

Traditional optimization methods for C-RAN and MEC-enabled systems, such as fixed policies, handcrafted heuristics, or single-objective formulations, often struggle to maintain stable performance when multiple objectives must be satisfied simultaneously [4,5]. In practical deployments, network operators must balance conflicting goals, including profit maximization, energy consumption minimization, fronthaul load control, and quality-of-service assurance (e.g., latency and coverage constraints). The underlying decision process is non-linear and depends on stochastic factors such as user distribution, propagation effects, interference behavior, and traffic load, which limits the effectiveness of static or purely deterministic resource management [6].

To address these challenges, this work develops an AI-driven optimization framework for a 5G C-RAN architecture enhanced with MEC capabilities, enabling adaptive decision-making across radio, fronthaul, and computation dimensions. The proposed framework evaluates multiple intelligent strategies under a unified simulation and techno-economic modeling environment. A baseline configuration is used as a reference point, and the study compares five AI-based methods representing different learning and control paradigms: MORL for balancing competing KPIs through explicit trade-off modeling; PCHAC, a deterministic SLA-aware controller for constraint enforcement; and three standard deep reinforcement learning algorithms (DQN, PPO, and DDPG), which collectively cover value-based learning, policy-gradient optimization, and continuous-control decision spaces.

In 5G cloud-RAN deployments, network operators must meet strict service-level targets—particularly tail-latency requirements for delay-sensitive applications—while limiting operating costs such as energy usage and fronthaul consumption. Centralized RAN processing and MEC improve resource pooling and flexibility, but they also create tightly coupled configuration decisions, including functional split choice, TDD scheduling, offloading fraction, and RU sleep control, each of which impacts both user experience and cost. Since traffic loads and radio conditions fluctuate over time, fixed parameter settings can either violate SLAs when resources are insufficient or waste resources when provisioning is excessive. These practical constraints motivate a techno-economic, SLA-aware control framework that dynamically selects cloud-RAN/MEC configurations to improve operator utility while respecting latency and fronthaul limits.

This study aims to quantify these trade-offs and develop an adaptive configuration-control framework that increases profit/utility while satisfying SLA constraints under stochastic traffic and channel variations.

Core research question: How can cloud-RAN/MEC configuration knobs (functional split, TDD ratio, MEC offloading, and RU sleep) be jointly controlled to maximize techno-economic utility while satisfying tail-latency and fronthaul constraints under stochastic traffic and channel variations?

The central contribution of this paper is a unified techno-economic and SLA-aware configuration-control formulation and evaluation pipeline for cloud-RAN/MEC; PCHAC and the DRL algorithms are included as representative controllers/baselines to demonstrate and benchmark the framework under identical conditions.

The main contributions of this study are summarized as follows:

Unified 5G C-RAN + MEC System Framework: A comprehensive simulation model incorporating radio propagation, user association, SINR-based link adaptation, MEC processing delay, and fronthaul performance within a single evaluation pipeline.
Joint KPI Optimization with Realistic Techno-Economic Metrics: A multi-dimensional performance model capturing network profit, energy cost, fronthaul cost, latency behavior, and user connectivity in scalable scenarios.
Incorporation of Constraint-Aware Control via PCHAC: A deterministic SLA-aware controller designed to guide operation toward feasibility under service-level constraints while maintaining competitive techno-economic performance.
Cross-Method Benchmarking Under Identical Conditions: A fair comparison of baseline, MORL, PCHAC, DQN, PPO, and DDPG using the same network topology, simulation parameters, and KPI definitions across different user densities.
Extensive KPI-Level Evaluation and Scalability Analysis: A detailed performance assessment using economics, service KPIs, coverage probability, CQI distributions, fairness across radio units, and execution time behavior as the number of users increases.

The structure of this paper is as follows. Section 2 reviews related work. Sections 3 and 4 describe the considered 5G C-RAN architecture with MEC integration and outline the environment and state modeling adopted in the study. Section 5 details the methodology and the AI-based optimization approaches used for system control. Section 6 presents the simulation results and provides a comprehensive comparative analysis of the evaluated strategies. Section 7 discusses key observations, limitations, and directions for future research, followed by the concluding remarks.

2. Related Work

Recent studies indicate that Proximal Policy Optimization (PPO) is an effective reinforcement learning approach for network management tasks such as spectrum allocation, scheduling, power control, and network slicing, delivering improvements in throughput, energy efficiency, and SLA compliance. Sherif et al. [7] propose a PPO-based framework for O-RAN slicing that integrates latency-aware reward design, action filtering, and policy transfer learning between non-real-time and near-real-time RIC agents, achieving faster convergence and reduced energy consumption. Mhatre et al. [8] address QoS-aware intra-slice resource allocation in beyond-5G O-RAN using a DQN-based model deployed at the near-real-time RIC, where parameterized user association reduces action-space complexity while improving eMBB throughput and ultra-reliable low-latency communication (URLLC) latency. Vidhya et al. [9] introduce a dynamic network slicing framework that combines multi-agent deep reinforcement learning with optimization-based VNF migration, employing Soft Actor-Critic and Red Panda Optimization to enhance slice admission and QoS under resource constraints. Wang et al. [10] investigate end-to-end resource allocation for network slicing with public and non-public networks, combining Independent PPO for RAN decisions with a Random Forest model for feasibility checking, resulting in improved SLA satisfaction and operator profit.

Beyond PPO-based approaches, Deep Q-Network (DQN) methods have been widely shown to outperform static and heuristic optimization techniques. Luo et al. [11] enhance resource allocation by integrating DQN with gradient boosting, while Sun et al. [12] propose a dueling DQN architecture for autonomous RRH activation, achieving significant energy savings. In O-RAN environments, Wang et al. [13] develop a DQN-based xApp for dynamic radio card deactivation, improving energy efficiency without throughput degradation. Additional contributions include adaptive functional split selection in non-terrestrial O-RAN by Shahabi et al. [14] and joint RAN–MEC resource allocation with slicing support by Martínez-Morfa et al. [15].

Actor–critic reinforcement learning has also been applied to Cloud RAN optimization. Mohsen Khani et al. [16] formulate C-RAN resource allocation using actor–critic methods—including Twin-Delayed Deep Deterministic Policy Gradient (TD3)—to control continuous radio-resource variables under QoS and fronthaul constraints, demonstrating improved stability over conventional DDPG. Chenjing Tian et al. [17] extend this approach to a multi-agent setting using MADDPG for network slice reconfiguration, enabling coordinated optimization of resource cost and QoS in dynamic environments.

Finally, Multi-Objective Reinforcement Learning (MORL) has emerged as a promising framework for addressing conflicting objectives in 5G and edge computing systems. Zhang et al. [18] propose a multi-DNN approach that jointly optimizes MEC task offloading and resource allocation under secure data transmission constraints. Song et al. [19] apply MORL to dependent task offloading in MEC, achieving reductions in latency and energy consumption through Pareto-optimal scheduling. Huang et al. [20] develop a MORL-based task offloading strategy for heterogeneous vehicular edge computing, while Xu et al. [21] introduce a MORL-driven offloading algorithm for satellite edge computing that jointly optimizes latency, resource utilization, and load balancing. Ismail et al. [22] provide a comprehensive survey of MEC resource scheduling, highlighting the growing role of DRL and MORL in enabling adaptive and energy-aware service management.

3. System Model

The proposed framework represents an intelligent C-RAN that integrates multiple artificial intelligence–driven control mechanisms to enhance network performance, reduce operational expenditure, and improve energy efficiency under time-varying 5G traffic demands. The simulated architecture consists of Remote Units (RUs), Distributed Units (DUs), a centralized Baseband Unit (BBU), and a MEC platform interconnected via high-capacity fronthaul and backhaul links.

Figure 1 illustrates the considered 5G C-RAN architecture with MEC integration and the main information flows over fronthaul/backhaul links.

To ensure a consistent and reproducible evaluation, the proposed 5G C-RAN with MEC framework is parameterized using a set of widely adopted physical-layer, architectural, and traffic assumptions. These parameters define the operating spectrum, bandwidth, transmit power, numerology, duplexing configuration, functional split, fronthaul capacity, network scale, and user density, and are selected in accordance with commonly used 5G NR and C-RAN modeling practices reported in the literature [23,24,25,26]. The main simulation and system parameters adopted in this study are summarized in Table 1.

In addition to radio and fronthaul configuration, the system model incorporates edge-computing capabilities to account for computation offloading and service latency behavior. The MEC-related parameters, including offloading fraction, processing latency, and core/backhaul delay assumptions, are aligned with prior studies on edge-enabled 5G systems and computation offloading frameworks [23,27,28]. These parameters and latency components are reported in Table 2.

L_mec and L_core are modeled as constant latency terms to isolate the impact of configuration control on end-to-end performance. Fixed-latency modeling is commonly adopted in 5G/MEC latency studies, and low-millisecond core-network latency targets and sub-millisecond transport segments are reported in the low-latency 5G literature. To ensure conclusions are not tied to a single setting, a sensitivity analysis over L_mec and L_core is provided in Section 6 [29,30].

For each user equipment (UE), the downlink signal-to-interference-plus-noise ratio (SINR) is computed as

\mathrm{SINR}_{u} = \frac{P_{s,u}}{\sum_{i \neq s} \alpha_{i} P_{i,u} + N_{0}}

(1)

where P_s,u denotes the received signal power from the serving RU, P_i,u represents the interference contributions from neighboring RUs, α_i captures interference weighting, and N₀ is the thermal noise power. The resulting SINR is mapped to a Channel Quality Indicator (CQI), which in turn determines the achievable spectral efficiency through standardized link adaptation functions.
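As a minimal sketch of Equation (1), the following Python function evaluates the SINR from linear-scale powers. The function names, the per-interferer weight list, and the numeric values are illustrative assumptions, not the simulator's implementation; the CQI mapping is not reproduced because its table is not given here.

```python
import math

def sinr_linear(p_serving_w, interferers_w, weights, noise_w):
    """Eq. (1): serving power over weighted interference plus noise (all in watts)."""
    if len(interferers_w) != len(weights):
        raise ValueError("one weight per interfering RU is required")
    interference = sum(w * p for w, p in zip(weights, interferers_w))
    return p_serving_w / (interference + noise_w)

def to_db(x):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(x)

# Hypothetical received powers and weights, chosen only for illustration.
example_sinr = sinr_linear(1e-9, [2e-10, 1e-10], [1.0, 0.5], 1e-12)
example_sinr_db = to_db(example_sinr)
```

With no active interferers the expression reduces to the signal-to-noise ratio, which is a quick sanity check on any implementation.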

The achievable downlink throughput for UE u is modeled as

R_{u} = B \cdot SE_{u} \cdot \eta_{TDD} \cdot \eta_{OH} \cdot (1 - \mathrm{BLER}) \cdot \mathrm{share}_{u}

(2)

where B is the system bandwidth, SE_u is the spectral efficiency obtained from the CQI mapping, η_TDD represents the downlink time allocation in the TDD frame, η_OH accounts for protocol overhead, BLER denotes the block error rate, and share_u reflects the scheduler’s resource allocation to the user.
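Equation (2) is a product of a raw rate and several efficiency factors, which the sketch below makes explicit. The function name and all numeric inputs except the 0.7 downlink ratio (the baseline TDD setting) are hypothetical values used only to show the arithmetic.

```python
def downlink_throughput_bps(bandwidth_hz, se_bps_per_hz, eta_tdd, eta_oh, bler, share):
    """Eq. (2): bandwidth x spectral efficiency, scaled by four unitless factors."""
    for name, value in (("eta_tdd", eta_tdd), ("eta_oh", eta_oh),
                        ("bler", bler), ("share", share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return bandwidth_hz * se_bps_per_hz * eta_tdd * eta_oh * (1.0 - bler) * share

# Hypothetical inputs: 100 MHz carrier, 4 bit/s/Hz, 70% downlink slots,
# 86% payload after overhead, 10% BLER, and a 5% scheduler share.
rate = downlink_throughput_bps(100e6, 4.0, 0.7, 0.86, 0.1, 0.05)
```

Under these assumed inputs the user receives roughly 10.8 Mbit/s; halving share_u or the downlink ratio halves the rate, since every factor enters multiplicatively.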

End-to-end latency is modeled as the sum of air-interface, processing, fronthaul, core-network, and MEC components. To account for MEC congestion, a load-dependent queuing delay term may be incorporated for offloaded tasks, as described below.

To capture MEC congestion effects, MEC processing can be approximated by an M/M/1 queue with arrival rate λ_mec (offloaded task arrivals) and service rate μ_mec (effective MEC service capacity). The mean queuing delay is

W_{q} = \frac{\lambda_{mec}}{\mu_{mec} (\mu_{mec} - \lambda_{mec})}, \quad \lambda_{mec} < \mu_{mec}

(3)

Accordingly, the end-to-end delay can be written as

L_{u} = L_{air} + L_{proc} + L_{fh} + L_{core} + L_{mec} + W_{q}

(4)

where L_air denotes the air-interface transmission delay, L_proc represents the processing delay, L_fh corresponds to the fronthaul delay, L_core captures the core network delay, and L_mec accounts for the MEC processing delay.

Queueing dynamics at the MEC are not explicitly simulated in the current evaluation; L_mec is treated as a constant term to isolate the impact of configuration control and to keep the simulator tractable. The M/M/1 extension above is provided to show how load-dependent congestion can be incorporated into the latency model. A full empirical re-evaluation of all methods under explicit queueing dynamics is left for future work.

Under an M/M/1 approximation, queuing delay increases sharply as utilization ρ = λ_mec / μ_mec approaches 1. In the evaluated settings, λ_mec is bounded by the offloading fraction and the task generation rate, while μ_mec is determined by the MEC compute capacity (Table 2). Thus, the fixed-delay assumption corresponds to operating in a regime where ρ remains sufficiently below 1; if ρ approaches saturation, the queuing term should be activated and the controller would need to account for congestion.
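The growth of the queuing term near saturation can be illustrated with a short sketch of Equations (3) and (4). The function names and the rates used below are assumptions for illustration; the evaluated simulator keeps the MEC delay constant and does not run this queue.

```python
def mm1_queue_delay(arrival_rate, service_rate):
    """Eq. (3): mean waiting time in queue for a stable M/M/1 system."""
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative (arrival) and positive (service)")
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return arrival_rate / (service_rate * (service_rate - arrival_rate))

def end_to_end_latency(l_air, l_proc, l_fh, l_core, l_mec, w_q=0.0):
    """Eq. (4): sum of latency components; w_q = 0 gives the fixed-delay model."""
    return l_air + l_proc + l_fh + l_core + l_mec + w_q

# Hypothetical service rate of 100 tasks/s; delays are in seconds.
delays = {rho: mm1_queue_delay(rho * 100.0, 100.0) for rho in (0.5, 0.9, 0.99)}
```

For this assumed service rate the mean wait rises from 10 ms at ρ = 0.5 to 90 ms at ρ = 0.9 and about 990 ms at ρ = 0.99, which is why the fixed-delay assumption is only reasonable well below saturation.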

The optimization study compares a baseline policy with five adaptive control strategies that tune three operational control variables (the C-RAN functional split, the TDD downlink–uplink ratio, and the MEC offloading fraction) with the goal of maximizing system profit while meeting latency and fronthaul-capacity constraints. The compared methods are: (1) a static baseline based on predefined heuristic rules; (2) Deep Q-Network (DQN), representing value-based reinforcement learning; (3) Proximal Policy Optimization (PPO), representing a policy-gradient method; (4) Multi-Objective Reinforcement Learning (MORL), which searches for Pareto-efficient trade-offs among competing objectives; (5) PCHAC, a deterministic SLA-aware controller that selects configurations using an SLA-aware scoring rule with explicit constraint handling; and (6) Deep Deterministic Policy Gradient (DDPG), an actor–critic approach that learns deterministic control policies for resource management.

All strategies are evaluated under identical simulation settings and KPI definitions; results are aggregated over 10 independent random seeds, which re-sample stochastic channel and traffic realizations to quantify variability.

4. Environment and State Modeling

The environment (CRANMORLEnv) reproduces the key dynamics of a 5G C-RAN, including user distribution, wireless channel conditions, interference, and traffic load. The action space is defined as:

a = (\mathrm{split}, \mathrm{tdd}, \mathrm{sleep}_{k}, \mathrm{mec}_{frac})

(5)

where each parameter represents:

split: Functional split configuration between DU and RU.
tdd: Time Division Duplex (TDD) ratio for uplink/downlink balance.
sleep_k: Number of RUs in sleep mode to save energy.
mec_frac: Fraction of computation offloaded to MEC.

RU sleep is modeled as a per-epoch on/off decision without explicit state-transition costs (e.g., wake-up delay, switching energy, or minimum on/off durations). Therefore, the reported energy savings should be interpreted as an upper bound under idealized RU state transitions; incorporating transition overheads is left for future work.

Configuration selection is evaluated as periodic snapshot-based reconfiguration; therefore, the learning baselines are trained in a single-step setting where each action is scored by the immediate techno-economic return under stochastic traffic/channel realizations.

The goal is to maximize a composite reward function that reflects network profit while minimizing latency, energy, and fronthaul costs. The multi-objective formulation is given by

f(a) = [\mathrm{profit}, \; -p95_{latency}, \; -\mathrm{energy}_{cost}, \; -\mathrm{fronthaul}_{cost}]

(6)

The objective is to increase operator profit while reducing tail latency and operating costs. To maintain a consistent maximization direction, profit is represented with a positive contribution, whereas latency and cost terms are included with negative signs so that larger values correspond to better performance.

Table 3 summarizes the reinforcement learning environment, including the state representation, joint control actions, reward formulation with SLA penalties, and the main training parameters used for all evaluated methods, following standard reinforcement learning formulations and MEC optimization practices [31,32].

5. Methodology

To enable a rigorous and unbiased evaluation of diverse control techniques, all considered methods are realized within a common decision and simulation framework. While the approaches differ in learning mechanisms, action formulations, and optimization goals, they interact with the same 5G C-RAN with MEC environment, rely on identical system observations, and are assessed using uniform techno-economic and service-level performance indicators.

Figure 2 summarizes the end-to-end control workflows of the baseline, MORL, PCHAC, DQN, PPO, and DDPG approaches. The figure outlines their respective action spaces, objective structures, SLA-enforcement mechanisms, training processes, and execution logic. By presenting all methods within a unified workflow, the figure highlights their methodological distinctions and ensures that the comparative results reported in Section 6 stem from algorithmic design choices rather than differences in implementation.

5.1. Baseline Model

The baseline policy serves as a non-adaptive benchmark for all adaptive approaches. It applies fixed configuration parameters—such as TDD ratio and MEC offload level—based on predefined averages. This model has minimal computational cost but cannot react to varying network load or channel fluctuations. It provides a stable reference for quantifying performance improvements achieved by AI-driven control strategies.

The baseline configuration is fixed to Split = 2, TDD_DL = 0.7, offloading fraction ρ_off = 0.4, and all RUs active (no sleep) throughout the simulation. All other simulation parameters are held constant across methods and follow Table 1 and Table 2, including the network snapshot settings (e.g., RU/UE counts, carrier frequency, bandwidth, antenna configuration, transmit power, noise figure, and propagation assumptions) and the computation/transport parameters (e.g., task size and CPU-cycle requirements, RU/DU/CU capacities, and fixed MEC/core latency terms).

5.2. Deep Q-Network (DQN)

The DQN algorithm, a key advancement in deep reinforcement learning, has shown remarkable capability for autonomous decision-making and control in next-generation C-RANs and Open RAN (O-RAN) systems. In these architectures, baseband processing is centralized in BBU pools, while Remote Radio Heads (RRHs) or radio frequency frontends are geographically distributed. This configuration introduces multiple optimization challenges, including dynamic traffic variability, interference management, energy efficiency enhancement, and stringent latency requirements.

The DQN framework estimates the optimal state–action value function using deep neural networks, expressed as:

Q(s, a; \theta) \approx \mathbb{E}\left[ r_{t} + \gamma \max_{a'} Q(s', a'; \theta^{-}) \right]

(7)

where θ and θ⁻ represent the parameters of the online and target networks, respectively.

At each learning step, the agent observes the current network state s_t, selects an action a_t based on an ε-greedy exploration strategy, and receives a reward determined by:

r_{t} = \alpha_{1} \cdot \mathrm{profit} - \alpha_{2} \cdot p95_{latency} - \alpha_{3} \cdot \mathrm{energy}_{cost} - \alpha_{4} \cdot \mathrm{fronthaul}_{cost}

(8)
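The weighted reward of Equation (8) can be sketched as below, together with one possible SLA penalty. The penalty form (normalized excess over the limit), the default weights, and the function names are assumptions made for illustration; the 40 ms tail-latency and $0.80/h fronthaul limits are the example thresholds quoted in Section 5.4, and the actual penalty definition is the one summarized in Table 3.

```python
def sla_penalty(value, limit):
    """Assumed penalty: normalized excess over an SLA limit, zero when the limit is met."""
    return max(0.0, (value - limit) / limit)

def shaped_reward(profit, p95_latency_ms, energy_cost, fronthaul_cost,
                  alphas=(1.0, 1.0, 1.0, 1.0), penalty_weight=1.0,
                  latency_limit_ms=40.0, fronthaul_limit=0.80):
    """Eq. (8) minus a weighted sum of SLA penalties (the penalty term is an assumption)."""
    a1, a2, a3, a4 = alphas
    base = (a1 * profit - a2 * p95_latency_ms
            - a3 * energy_cost - a4 * fronthaul_cost)
    penalty = (sla_penalty(p95_latency_ms, latency_limit_ms)
               + sla_penalty(fronthaul_cost, fronthaul_limit))
    return base - penalty_weight * penalty

# Hypothetical KPI values: the second call violates the 40 ms latency target.
r_ok = shaped_reward(10.0, 20.0, 1.0, 0.5, alphas=(1.0, 0.1, 1.0, 1.0))
r_violation = shaped_reward(10.0, 60.0, 1.0, 0.5, alphas=(1.0, 0.1, 1.0, 1.0))
```

Normalizing the excess by the limit keeps latency and cost violations on a comparable scale, so a single penalty weight can be used for both constraints.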

The model’s parameters are optimized by minimizing the temporal-difference loss function:

L(\theta) = \mathbb{E}\left[ \left( r_{t} + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]

(9)
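To make the loss in Equation (9) concrete, the sketch below computes the temporal-difference target and the mean squared error for a small batch using plain Python lists in place of network outputs. The function names and the batch layout are illustrative assumptions; a real agent would obtain the Q-values from the online and target networks.

```python
def td_target(reward, next_q_values, gamma):
    """Bootstrapped target: r + gamma * max over target-network Q-values of the next state."""
    bootstrap = max(next_q_values) if next_q_values else 0.0
    return reward + gamma * bootstrap

def td_loss(batch, gamma):
    """Eq. (9): mean squared TD error over (q_sa, reward, next_q_values) tuples."""
    errors = [(td_target(r, next_q, gamma) - q_sa) ** 2 for q_sa, r, next_q in batch]
    return sum(errors) / len(errors)

# Hypothetical transition: Q(s, a) = 2.0, reward 1.0, two next-state Q-values.
batch = [(2.0, 1.0, [0.5, 2.0])]
loss_multi_step = td_loss(batch, gamma=0.9)   # target is 1.0 + 0.9 * 2.0
loss_single_step = td_loss(batch, gamma=0.0)  # target is the immediate reward only
```

Setting gamma to zero removes the bootstrap term, so the loss reduces to a regression of Q(s, a) onto the immediate reward, which matches the single-step snapshot setting described in Section 4.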

Through this iterative process, the DQN agent learns an adaptive policy that maps real-time network observations—such as channel state information, traffic demand, and resource utilization—to discrete configuration actions, including RRH activation/deactivation, functional split selection, bandwidth allocation, and computation offloading. In the snapshot reconfiguration setting, the objective is to learn a state-to-configuration mapping that maximizes the expected immediate return under stochastic traffic and channel realizations, consistent with the techno-economic and SLA-aware performance targets considered here [11,12,33,34].

Stability and convergence in high-dimensional state spaces are commonly enhanced through techniques such as experience replay, target network updates, and dueling architectures [33]. However, practical deployment still requires the precise design of state and reward models, the use of safe exploration methods to avoid service disruption, and low-latency inference mechanisms that satisfy the stringent timing constraints of 5G and future wireless systems [34,35].

5.3. Proximal Policy Optimization (PPO)

Cloud Radio Access Network (C-RAN) has become a pivotal architecture in fifth-generation (5G) mobile communication systems, where BBUs are centralized within cloud infrastructures to enable the joint optimization of computational and radio resources. This architectural shift enhances spectral efficiency while simultaneously reducing capital and operational expenditures [36]. However, C-RAN deployment presents several challenges, including dynamic traffic variation, limited fronthaul capacity, inter-cell interference, and stringent latency constraints—particularly in URLLC scenarios [2]. Conventional optimization approaches often fail to adapt effectively to such non-stationary environments, leading to the growing adoption of machine learning, and specifically reinforcement learning (RL), for C-RAN control and resource management [37].

Among RL techniques, PPO has attracted considerable attention as a policy-gradient method capable of achieving stable and efficient learning through its clipped surrogate objective, which mitigates the risk of unstable or overly aggressive policy updates that could degrade network performance [38]. The PPO agent directly models the policy distribution over actions instead of estimating Q-values, optimizing the following clipped objective:

L^{CLIP}(\theta) = \mathbb{E}_{t}\left[ \min\left( r_{t}(\theta) \hat{A}_{t}, \; \mathrm{clip}\left( r_{t}(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{t} \right) \right]

(10)

where

r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{old}}(a_{t} \mid s_{t})}

denotes the probability ratio between the new and old policies, ε is the clipping range, and Â_t represents the advantage estimate derived from the value function. Here r_t(θ) is distinct from the reward r_t in Equation (8). PPO jointly trains both policy and value networks using a multi-objective reward structure consistent with that employed in DQN, ensuring effective optimization across competing performance metrics. Empirical studies have demonstrated that PPO can be successfully applied to several critical C-RAN management functions, including dynamic spectrum allocation, energy-efficient power control, and network slice orchestration, yielding measurable gains in system throughput, energy efficiency, and SLA compliance [7,8,9,10]. Owing to its model-free nature, PPO operates efficiently in stochastic and partially observable environments without requiring explicit channel modeling. Furthermore, its integration with deep neural networks facilitates scalability across the large state-action spaces typical of dense 5G network deployments [39]. Collectively, these attributes position PPO as a promising reinforcement learning framework for real-time, data-driven decision-making in C-RAN systems, offering a stable foundation for future multi-objective and hierarchical control strategies designed to meet the stringent QoS requirements of next-generation wireless networks.

In practice, PPO performance can be sensitive to the clipping range and reward/penalty scaling, particularly in highly stochastic environments.
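The effect of the clipping range can be seen in a small sketch of the clipped surrogate in Equation (10), evaluated on log-probabilities and advantage estimates. The function name and the default ε = 0.2 are assumptions (a commonly used value, not one reported in this study).

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Eq. (10): mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # keep ratio near 1
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage is capped at 1 + eps.
gain_capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage is not clipped: the min keeps the worse value.
loss_kept = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

The asymmetry is deliberate: the objective stops rewarding moves that push the ratio further outside the clip range in the favorable direction, while still fully penalizing updates that make a poor action more likely.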

5.4. Deep Deterministic Policy Gradient (DDPG)

DDPG generalizes actor–critic methods to continuous control by coupling a deterministic actor that maps states to actions with a critic that scores state–action returns, enabling precise policies beyond purely discrete choices [40,41]. In hybrid settings that combine continuous and discrete decisions, we let DDPG govern the continuous variables and resolve the discrete ones via a compact wrapper search. Specifically, in our 5G C-RAN system with MEC offloading, the actor sets the downlink TDD ratio and MEC offload fraction, while a lightweight grid enumerates the functional split {2, 7.2} and RU sleep counts k ∈ {0, 1, 2, 3}, snapping the actor’s proposals to the simulator’s nearest admissible levels and selecting the option with the highest shaped return [42]. The reward aggregates four operator objectives (profit, negative p95 latency, negative energy cost, and negative fronthaul cost) via linear weights, and applies normalized penalties when service targets are violated (e.g., p95 latency > 40 ms or fronthaul cost > $0.80/h), steering learning toward utility while honoring service constraints [43].

In this study, configuration selection is modeled as a periodic system-level reconfiguration over quasi-static intervals (snapshot-based evaluation), so each decision is assessed on the current state without modeling multi-step transition overheads. Accordingly, DDPG is evaluated in a single-step setting (γ = 0), where the critic fits immediate returns for the selected configuration and the actor updates via the deterministic policy gradient. This design matches the configuration formulation used in the simulator and enables a consistent comparison across methods. While this formulation is appropriate for periodic configuration updates, it does not capture transition costs (e.g., RU wake-up latency, split switching overhead) or queue evolution across multiple decision steps; incorporating these effects requires a multi-step MDP with γ > 0.

Training occurs in this single-step, contextual-bandit regime: the critic minimizes mean-squared error to the immediate reward on concatenated state–action inputs, and the actor follows the deterministic policy gradient by back-propagating through the critic; exploration is injected as zero-mean Gaussian noise on the actor’s pre-tanh outputs, and a replay buffer (capacity 2000, minibatch 32) improves stability [44]. Concretely, the observation stacks normalized per-RU loads, normalized per-RU fronthaul utilization, and the fraction of attached UEs; the actor is a two-layer ReLU MLP with a tanh head whose outputs are clipped to [−1, 1] and linearly mapped to TDD ∈ [0.5, 0.8] and MEC ∈ [0.0, 0.8]. Each episode consists of a single rollout, and training uses compact networks (width 64), learning rates of 10⁻³, a noise schedule decaying from 0.3 to 0.05, and 240 optimization steps for main runs (80 for sweeps). All reported results are aggregated over 10 independent random-seed runs to account for stochastic channel conditions and RL training variability.

This formulation captures the principal C-RAN trade-offs (split 7.2’s higher fixed fronthaul versus split 2’s traffic-proportional load, the TDD ratio’s impact on downlink capacity, and MEC’s latency/backhaul relief versus local processing), making it well suited for periodic system-level reconfiguration rather than mobility control.
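The hybrid action handling described above (continuous actor outputs plus a small discrete grid) can be sketched as follows. The ranges and discrete sets are taken from the text; the function names, the `shaped_return` callable, and the toy scoring rule are hypothetical, and the snapping of TDD/MEC values to the simulator's admissible levels is omitted because those levels are not listed here.

```python
from itertools import product

TDD_RANGE = (0.5, 0.8)       # downlink TDD ratio range from the text
MEC_RANGE = (0.0, 0.8)       # MEC offload fraction range from the text
SPLITS = (2, 7.2)            # functional split options
SLEEP_COUNTS = (0, 1, 2, 3)  # number of RUs put to sleep

def scale(x, lo, hi):
    """Clip a tanh-style output to [-1, 1] and map it linearly onto [lo, hi]."""
    x = max(-1.0, min(1.0, x))
    return lo + 0.5 * (x + 1.0) * (hi - lo)

def select_configuration(actor_out, shaped_return):
    """Fix the continuous knobs from the actor, then grid-search the discrete ones."""
    tdd = scale(actor_out[0], *TDD_RANGE)
    mec = scale(actor_out[1], *MEC_RANGE)
    candidates = [(split, tdd, k, mec) for split, k in product(SPLITS, SLEEP_COUNTS)]
    return max(candidates, key=lambda a: shaped_return(*a))

def toy_return(split, tdd, sleep_k, mec_frac):
    """Hypothetical score that prefers split 7.2 and no sleeping RUs."""
    return -abs(split - 7.2) - sleep_k

best = select_configuration((0.0, 1.0), toy_return)
```

Because the grid has only eight discrete combinations, exhaustive enumeration per decision is cheap, which is presumably why a wrapper search is preferred here over a learned discrete policy.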

5.5. Multi-Objective Reinforcement Learning (MORL)

MORL extends conventional reinforcement learning by jointly optimizing multiple conflicting performance metrics [19,20,22]. In the context of 5G and beyond networks, MORL enables intelligent decision-making for resource allocation, task offloading, and energy management, where objectives such as throughput, latency, energy efficiency, and cost must be balanced dynamically [19,20,21].

The MORL framework explores trade-offs among these objectives through stochastic preference sampling [45]. Weight vectors w are drawn from a Dirichlet distribution, and for each sample, the corresponding scalarized reward is computed as:

R(a, w) = w^{⊤} f(a)

(11)

where f(a) represents the vector of objective functions [46]. The agent evaluates available actions and selects the one maximizing the scalarized reward [47]. Repeating this process across multiple weight vectors yields an approximate Pareto frontier, representing the set of non-dominated policies that achieve different trade-offs among objectives.
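The preference-sampling procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: all objectives are assumed to be oriented so that larger is better (costs negated or inverted), the Dirichlet concentration is assumed to be all ones, and the action labels and objective values are invented for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one weight vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(w, f):
    """Scalarized reward of Equation (11): the inner product of w and f(a)."""
    return sum(wi * fi for wi, fi in zip(w, f))


def dominates(f, g):
    """True if f is no worse than g in every objective and better in one."""
    return all(x >= y for x, y in zip(f, g)) and any(x > y for x, y in zip(f, g))


def approximate_pareto(objectives, num_samples=200, seed=0):
    """Keep the best action per sampled weight vector, then drop dominated ones."""
    rng = random.Random(seed)
    dim = len(next(iter(objectives.values())))
    chosen = set()
    for _ in range(num_samples):
        w = sample_dirichlet([1.0] * dim, rng)
        chosen.add(max(objectives, key=lambda a: scalarize(w, objectives[a])))
    return sorted(a for a in chosen
                  if not any(dominates(objectives[b], objectives[a]) for b in chosen))


if __name__ == "__main__":
    # Hypothetical two-objective values per action (higher is better).
    objs = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4), "D": (0.2, 0.2)}
    print(approximate_pareto(objs))
```

In this toy example, action C is not dominated by any other action, yet no weight vector selects it. This is the known limitation of linear scalarization: it recovers only points on the convex hull of the Pareto frontier, which is why the resulting frontier is approximate.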

A closed-loop MORL pipeline integrates data collection, policy learning, and runtime adaptation. Telemetry from the RAN, core, and MEC layers defines a multi-objective optimization space consistent with SLAs. The MORL agent refines operational decisions such as power control and task offloading while maintaining a Pareto archive for context-aware policy selection [19,20,47]. Continuous monitoring ensures SLA compliance and adaptive balancing of performance and energy consumption across network functions [15].

MORL vs. Penalty-Constrained Control

MORL addresses a multi-objective optimization problem in which several competing criteria are optimized jointly. In this work, MORL treats the objective vector f(a) (e.g., profit/utility, tail latency, energy, and fronthaul cost) as a set of objectives and learns policies that balance them via preference sampling, using a scalarization of the form R(a, ω) = ω^{⊤} f(a) for different weight vectors ω (the weight vector written as w in Equation (11)).

By contrast, penalty-constrained formulations emphasize constraint satisfaction: a primary objective is optimized while enforcing SLA limits (e.g., tail-latency and fronthaul constraints). Here, PCHAC is a deterministic SLA-aware controller that selects actions using the penalty-augmented scoring rule and selection criterion defined in Equations (12)–(14). Accordingly, MORL is used to learn trade-off policies across objectives, whereas PCHAC enforces SLA-aware feasibility through explicit constraint handling and serves as a stable, interpretable benchmark/controller.

5.6. Penalty-Constrained Hierarchical Action Controller (PCHAC): Deterministic SLA-Aware Controller

PCHAC is a deterministic, constraint-aware controller for SLA-feasible configuration selection. It is included as a strong, interpretable benchmark (and supervisory decision rule) alongside adaptive controllers. Figure 3 illustrates the PCHAC workflow for SLA-aware C-RAN/MEC configuration selection with KPI/reward feedback.

Unlike gradient-based reinforcement learning approaches such as DQN, DDPG, or PPO, PCHAC avoids stochastic exploration and neural policy approximation. Instead, it deterministically evaluates candidate configurations using an SLA-aware scoring rule with explicit constraint handling and selects the action that maximizes a preference-weighted score while enforcing SLA feasibility. This ensures stability, interpretability, and transparent decision-making.

PCHAC follows a multi-objective, constraint-aware decision rule for deterministic configuration selection. It employs weighted scalarization to merge multiple performance objectives—such as profit, latency, energy efficiency, and fronthaul utilization—into a single scalar score. To enforce feasibility, hinge-based penalties are integrated into the objective function:

S(a) = \hat{ω}^{⊤} f(a) - λ_{p95} ν_{p95} - λ_{fh} ν_{fh}

(12)

where \hat{ω} is the normalized weight vector, f(a) represents the objective values, λ_{p95} and λ_{fh} are the penalty coefficients, and the penalty terms are defined as:

ν_{p95} = \max(0, \frac{p95_{latency} - p95_{target}}{p95_{target}}), ν_{fh} = \max(0, \frac{fronthaul_{cost} - cost_{target}}{cost_{target}})

(13)

The optimal configuration is selected deterministically as:

a^{*} = \arg\max_{a} S(a)

(14)

This constitutes a configuration-control problem over a discrete action space, where the controller maps observed network state to an SLA-feasible configuration.
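The scoring and selection rule of Equations (12)–(14) can be illustrated with a short Python sketch. The candidate names, objective values, targets, and penalty coefficients below are hypothetical placeholders chosen for the example. The objective entries are assumed to be pre-normalized and oriented so that larger is better.

```python
def hinge(value, target):
    """Relative violation of Equation (13); zero when the target is met."""
    return max(0.0, (value - target) / target)


def normalize(weights):
    """Rescale operator preferences so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def score(f, weights, p95_ms, fh_cost, p95_target, fh_target, lam_p95, lam_fh):
    """Penalty-augmented score of Equation (12)."""
    w_hat = normalize(weights)
    base = sum(w * x for w, x in zip(w_hat, f))
    return (base
            - lam_p95 * hinge(p95_ms, p95_target)
            - lam_fh * hinge(fh_cost, fh_target))


def select(candidates, **settings):
    """Deterministic arg max of Equation (14) over the candidate set."""
    return max(
        candidates,
        key=lambda c: score(c["f"], p95_ms=c["p95_ms"], fh_cost=c["fh_cost"], **settings),
    )


if __name__ == "__main__":
    # Hypothetical candidates: "a" has better raw objectives but violates the
    # latency target, so the hinge penalty pushes its score below that of "b".
    candidates = [
        {"name": "a", "f": (0.9, 0.5), "p95_ms": 40.0, "fh_cost": 0.01},
        {"name": "b", "f": (0.7, 0.6), "p95_ms": 24.0, "fh_cost": 0.01},
    ]
    best = select(candidates, weights=(1.0, 1.0), p95_target=25.0,
                  fh_target=0.02, lam_p95=1.0, lam_fh=1.0)
    print(best["name"])
```

Note that the penalties are soft: a violating configuration is discouraged in proportion to its relative excess, so the size of the penalty coefficients determines how strongly SLA feasibility outweighs the weighted objectives.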

Although PCHAC evaluates a discrete set of configurations, its contribution is the SLA-aware selection rule: objective terms are normalized to comparable scales and combined with operator-defined preferences, while constraint satisfaction is enforced through explicit penalty/feasibility handling. The enumeration over the discrete action set is only an implementation mechanism for configuration selection.

PCHAC evaluates a predefined discrete configuration set (functional split, RRH activation count k, TDD ratio, and MEC offload fraction) and selects the configuration that maximizes the SLA-aware score. Because RU transitions are modeled without wake-up penalties, PCHAC’s energy gains should be interpreted under idealized switching assumptions. While this guarantees deterministic convergence and strict SLA compliance, scalability is limited in high-dimensional or continuous spaces. The per-epoch runtime scales linearly with the number of candidate configurations evaluated, since each candidate is scored once under the same KPI model.

The computational cost of PCHAC is dominated by scoring the evaluated configuration set. Let N_{cfg} denote the number of admissible discrete configurations; the runtime per decision epoch is O(N_{cfg} \cdot C_{sim}), where C_{sim} is the cost of computing the KPIs for one candidate configuration. In the present study, the action set is intentionally compact (functional split, RRH activation count k, discretized TDD ratio, and discretized MEC offloading fraction), yielding a manageable N_{cfg}. If the resolution of continuous knobs is refined (more TDD/offloading levels) or the decision space is expanded (e.g., per-RU actions or larger RU clusters), then N_{cfg} grows multiplicatively with the number of discrete levels per control dimension and may become prohibitive. In such settings, PCHAC can be used as a strong constraint-aware benchmark, while learning-based controllers provide a scalable alternative for large or continuous action spaces.
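The multiplicative growth of the configuration count can be made concrete with a small Python sketch. The discretization levels below are assumptions chosen for illustration, since the exact grids are not listed here. Only the two functional splits, the ten RUs, and the TDD and MEC ranges are taken from the text.

```python
from itertools import product
from math import prod

# Assumed coarse grid: 2 splits x 10 activation counts x 4 TDD x 5 MEC levels.
COARSE = {
    "split": ["7.2", "2"],
    "k": list(range(1, 11)),
    "tdd": [0.5, 0.6, 0.7, 0.8],
    "mec": [0.0, 0.2, 0.4, 0.6, 0.8],
}

# Assumed finer grid: same splits and k, with 7 TDD levels and 9 MEC levels.
FINE = {
    "split": ["7.2", "2"],
    "k": list(range(1, 11)),
    "tdd": [0.5 + 0.05 * i for i in range(7)],
    "mec": [0.1 * i for i in range(9)],
}


def count_configs(levels):
    """Number of configurations: product of levels per control dimension."""
    return prod(len(v) for v in levels.values())


def enumerate_configs(levels):
    """Yield every configuration as a dict of control-knob values."""
    names = list(levels)
    for combo in product(*(levels[n] for n in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    print(count_configs(COARSE), count_configs(FINE))
```

Under these assumed grids, refining only the two continuous knobs raises the candidate count from 400 to 1260, and each candidate requires one KPI evaluation per decision epoch.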

Overall, PCHAC provides a MORL-inspired, constraint-regularized decision model that unifies preference weighting, penalty-based feasibility, and deterministic selection—making it a reliable baseline or supervisory control layer for adaptive reinforcement learning systems.

5.7. Comparative Decision Workflow

All evaluated methods follow the same end-to-end decision loop to ensure a fair and consistent comparison: the simulator initializes the 5G C-RAN + MEC environment, extracts the current observation, selects a control action, applies the selected configuration to the network, computes the resulting KPIs, and then updates the corresponding optimization mechanism (if applicable).

Within this shared workflow, the baseline executes a fixed configuration and serves as a static performance reference. DQN and PPO operate as single-objective deep reinforcement learning agents, using value-based and policy-gradient learning, respectively, to improve decisions over repeated interactions.

In contrast, DDPG extends the same pipeline to a continuous-control setting through an actor–critic structure, enabling fine-grained adjustment of continuous decision variables while maintaining compatibility with discrete configuration constraints through bounded mapping.

For multi-objective decision-making, MORL explicitly explores trade-offs across competing KPIs by sampling preference weights and selecting non-dominated policies, whereas PCHAC (deterministic SLA-aware controller) applies SLA-aware scoring to select feasible configurations without stochastic exploration. Overall, this unified evaluation workflow enables systematic benchmarking across heterogeneous strategies by comparing adaptability, computational cost, and optimization quality under identical network and traffic conditions.

Although the evaluation is snapshot-based, learning-based controllers remain useful because the mapping from observed network state to the best configuration is high-dimensional and non-linear under stochastic channel/traffic variation. The RL baselines are therefore used as generic policy-learning function approximators that can learn this mapping from interaction data, providing a consistent comparison against MORL and the deterministic SLA-aware benchmark (PCHAC).

5.8. Computational Complexity

Let N_UE and N_RU denote the number of users and radio units, respectively. A single simulation rollout includes UE association, SINR estimation, link adaptation and latency evaluation. The dominant cost per rollout can be approximated as:

C_sim = O(N_UE N_RU)

(15)

The baseline method performs one rollout only:

O(C_sim)

(16)

Both MORL and PCHAC evaluate multiple candidate configurations from a discrete action set A. MORL samples S preference vectors and selects the best solution across the evaluated set, yielding:

O(S |A| C_sim)

(17)

As network density increases (larger N_UE and N_RU) or as the discrete action set is expanded (larger |A|), the runtime increases proportionally through C_sim and |A|, respectively.

PCHAC applies preference-based scoring with constraint penalties over candidate actions, leading to a worst-case complexity of:

O(|A| C_sim)

(18)

For learning-based schemes, DQN performs iterative rollouts with replay-based neural updates. Let T_DQN be the number of training steps, B the batch size, and H the hidden dimension. The complexity becomes:

O(T_DQN(C_sim + B H²))

(19)

PPO collects on-policy trajectories and updates actor–critic networks over multiple epochs. With T_PPO training iterations, E episodes per batch, and U update epochs, the complexity is:

O(T_PPO(E C_sim + U B H²))

(20)

DDPG employs an actor–critic structure with continuous control outputs, combined with a lightweight wrapper search across discrete candidates. Let T_DDPG be the number of training iterations and K_d the number of discrete evaluations per iteration. The resulting complexity is:

O(T_DDPG(K_d C_sim + B H²))

(21)

Overall, the baseline strategy has the lowest computational overhead, MORL and PCHAC scale mainly with the action-space size, while DQN, PPO, and DDPG incur additional cost due to iterative training and neural optimization. PPO typically exhibits the highest training burden because of on-policy sampling and repeated policy updates.
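To make the ordering implied by Equations (15)–(21) tangible, the following Python sketch evaluates the dominant terms for one hypothetical parameter set. The values are illustrative assumptions: a single iteration count is used for all learning methods, and the results are order-of-growth terms with constants dropped, not measured runtimes.

```python
def c_sim(n_ue, n_ru):
    """Dominant rollout cost of Equation (15)."""
    return n_ue * n_ru


def cost_terms(n_ue, n_ru, n_actions, s_prefs, t_steps,
               batch, hidden, episodes, update_epochs, k_discrete):
    """Dominant terms of Equations (16)-(21) in arbitrary cost units."""
    c = c_sim(n_ue, n_ru)
    nn = batch * hidden ** 2              # the B * H^2 neural-update term
    return {
        "Baseline": c,
        "PCHAC": n_actions * c,
        "MORL": s_prefs * n_actions * c,
        "DQN": t_steps * (c + nn),
        "PPO": t_steps * (episodes * c + update_epochs * nn),
        "DDPG": t_steps * (k_discrete * c + nn),
    }


if __name__ == "__main__":
    # Hypothetical setting: 200 UEs, 10 RUs, 400 candidate actions, 10
    # preference samples, 240 iterations, batch 32, width 64, 4 episodes per
    # batch, 4 update epochs, and 8 discrete evaluations per DDPG iteration.
    terms = cost_terms(200, 10, 400, 10, 240, 32, 64, 4, 4, 8)
    for name, value in sorted(terms.items(), key=lambda kv: kv[1]):
        print(f"{name:8s} {value:,}")
```

With these assumed values the neural-update term dominates the learning-based methods, which is consistent with the qualitative ranking stated above. Different choices of |A|, S, or the iteration counts can change the relative position of MORL and the DRL methods.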

6. Results and Evaluation

All reported results are averaged over N = 10 independent random seeds for each method. Each seed re-initializes the stochastic components of the simulation (e.g., channel and traffic realizations) as well as the learning process (e.g., network initialization and exploration). Figures and tables report the mean across seeds; uncertainty is shown as 95% confidence intervals (CI) and computed as 1.96 σ / \sqrt{N}, where σ is the standard deviation across seeds.
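The interval computation can be written as a short Python helper. This is a minimal sketch: the function name is hypothetical, and the use of the sample standard deviation (N − 1 in the denominator) is an assumption, since the text does not state which estimator is used.

```python
import math
import statistics


def mean_ci95(samples):
    """Return (mean, half-width) with half-width = 1.96 * sigma / sqrt(N)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)   # sample standard deviation (assumed)
    return mean, 1.96 * sigma / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical per-seed values of one KPI for a single method.
    per_seed = [10.1, 10.5, 10.3, 10.6, 10.2, 10.4, 10.3, 10.5, 10.4, 10.5]
    m, h = mean_ci95(per_seed)
    print(f"{m:.3f} ± {h:.3f}")
```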

Figure 4 presents the spatial configuration of the baseline C-RAN scenario, depicting a 1 km × 1 km region populated with randomly positioned radio units (RUs), uniformly distributed user equipments (UEs), and a centrally located DU/CU node. The heterogeneous RU placement creates natural variations in coverage density, leading to unequal traffic loads and diverse interference conditions across the area. The central DU/CU location reduces average fronthaul distance, thereby limiting propagation and serialization delays, whereas RUs situated near the edges inevitably face higher fronthaul latency. The uniform UE distribution highlights regions of concentrated demand that may overload nearby RUs, producing fluctuations in SINR, throughput, and latency during operation. Overall, this layout provides the physical context underlying all subsequent performance evaluations, as the topology directly influences resource utilization, fronthaul usage, and the effectiveness of the optimization methods compared later in the study.

Figure 5 and Table 4 summarize the economic outcomes across methods in terms of profit, energy expenditure, fronthaul (FH) cost, and net utility (log-scaled to reflect the different value ranges). Across 10 independent random-seed runs, PCHAC achieves the highest mean profit (10.382 ± 0.143 $/h) and the highest mean net utility (8.458 ± 0.077 $/h), indicating the most favorable revenue–cost trade-off among the evaluated approaches. PCHAC also yields the lowest mean energy cost (1.813 ± 0.035 $/h), improving upon the baseline (2.002 ± 0.034 $/h) and the adaptive baselines (e.g., MORL 1.977 ± 0.031 $/h, DQN 1.989 ± 0.033 $/h). For FH cost, PCHAC attains the minimum level (0.012 ± 0.001 $/h), matching the baseline and remaining below MORL (0.022 ± 0.001 $/h), DQN (0.019 ± 0.001 $/h), PPO (0.021 ± 0.001 $/h), and DDPG (0.013 ± 0.001 $/h). Overall, these results show that PCHAC consistently provides superior mean economic performance with low variability across seeds.

Figure 6 and Table 5 report the end-to-end latency distribution using percentile statistics (p50, p75, p90, p95, and p99). Results are aggregated over 10 independent random-seed runs and presented as mean ± uncertainty, which captures variability due to channel/traffic randomness and RL training stochasticity. Across all reported percentiles, PCHAC maintains the lowest latency levels, indicating improved typical performance (median and upper-quartile) as well as stronger tail behavior (p95 and p99) compared with the baseline and the learning-based baselines. The smaller uncertainty ranges for PCHAC at high percentiles further suggest stable tail-latency control across independent runs, supporting the robustness of the observed latency improvements.
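For reference, the percentile statistics can be computed from per-user latency samples as sketched below. The linear-interpolation convention is an assumption, because the text does not specify how percentiles are interpolated between samples, and the function names are hypothetical.

```python
def percentile(values, p):
    """p-th percentile (0-100) with linear interpolation between samples."""
    xs = sorted(values)
    pos = (len(xs) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def latency_summary(latencies_ms, levels=(50, 75, 90, 95, 99)):
    """Percentile table in the form reported for the latency distribution."""
    return {f"p{p}": percentile(latencies_ms, p) for p in levels}


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    samples = [18.2, 21.0, 19.5, 35.4, 22.8, 27.1, 24.3, 30.9, 20.6, 41.7]
    print(latency_summary(samples))
```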

Figure 7 illustrates the empirical cumulative distribution functions (CDFs) of per-user latency for the six evaluated methods, where curves positioned further to the left indicate lower latency across a larger proportion of samples. PCHAC exhibits the most favorable latency distribution, with the majority of latency values concentrated between approximately 18 and 30 ms and a rapid rise in cumulative probability, indicating consistently low delay. MORL and DDPG form the next performance tier, showing slightly higher latency distributions, with most samples falling roughly within the 20–35 ms range. DQN is further shifted to the right, reflecting increased latency at comparable percentiles and a broader spread extending beyond 40 ms. The baseline method demonstrates a more pronounced right shift, with latency values extending toward approximately 50 ms, indicating slower service for a substantial fraction of users. PPO shows the least favorable latency behavior, with a wide distribution spanning from around 30 ms to over 60 ms and a slower increase in the CDF, revealing both higher typical latency and a heavier high-latency tail. Overall, the ordering of the CDFs confirms that PCHAC achieves the strongest latency performance, followed by MORL and DDPG, then DQN, with baseline and PPO exhibiting progressively higher latency distributions.

Figure 8 illustrates the distribution of channel quality indicator (CQI) levels across the evaluated methods using a UE count heatmap (top) and the corresponding normalized UE percentage curves (bottom). The baseline, MORL, PPO, and DDPG methods exhibit a strong concentration of users in low-to-medium CQI ranges, primarily between CQI 1 and 6, with clear peaks around CQI 2–3 where counts typically exceed 40 UEs. For these methods, the number of users decreases sharply beyond CQI 7, and higher CQI levels (CQI ≥ 13) contain few or no users. DQN follows a similar trend, showing a pronounced peak at CQI 2 (47 UEs) and a rapid decline at higher CQI values, with only limited presence beyond CQI 10. In contrast, PCHAC displays a distinctly right-shifted CQI distribution, allocating relatively few users to low CQI bins and progressively increasing the number of users in higher CQI levels. Specifically, PCHAC assigns substantially more users to CQI 10–15, with counts peaking between 22 and 24 UEs, resulting in a steadily increasing UE percentage profile across CQI levels. Overall, the combined heatmap and percentage curves indicate that PCHAC consistently maintains a larger fraction of users under favorable channel conditions, whereas the other methods predominantly operate in lower-CQI regimes.

Figure 9 depicts the per–radio unit (RU) fronthaul (FH) load for RU-0 through RU-9 under all evaluated methods, shown on a logarithmic scale with a dashed reference line indicating the 1 Gbps capacity limit. For every RU and method, the FH load remains well below the capacity threshold, confirming ample fronthaul headroom across the network. The FH demand varies noticeably across RUs, with higher loads observed at RU-2, RU-3, and RU-6, while RU-9 consistently exhibits the lowest FH consumption. Across methods, PCHAC generally achieves the lowest FH load on a per-RU basis, indicating more efficient fronthaul utilization. In contrast, DQN incurs the highest FH demand for most RUs, with pronounced peaks at RU-3 and RU-6 that approach the upper end of the observed range. The baseline, MORL, PPO, and DDPG methods fall between these extremes, displaying similar per-RU load patterns with moderate variations. Overall, the results highlight clear differences in fronthaul efficiency across control strategies while confirming that all methods operate comfortably within fronthaul capacity limits.

Figure 10 illustrates Jain’s fairness index measured per radio unit (RU-0 to RU-9) for all evaluated control methods. The results reveal noticeable variability in fairness across RUs, with comparatively low values at RU-0, RU-1, and RU-9, and substantially higher fairness observed at RU-8, indicating uneven user-level resource distribution across the network. PCHAC consistently achieves the highest or near-highest fairness across most RUs, showing clear improvements at RU-3, RU-4, RU-7, and RU-8. The baseline and MORL approaches generally yield moderate fairness levels and exhibit similar trends across RUs. DQN shows reduced fairness at several locations, most notably at RU-2 and RU-8, where its values fall well below those of the other methods. PPO and DDPG display intermediate to strong performance depending on the RU, with DDPG closely matching or slightly exceeding other methods at RU-4 and maintaining competitive fairness at RU-8. Overall, the figure highlights PCHAC’s superior ability to promote balanced resource allocation across radio units.
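Jain's fairness index for n allocations x_1, ..., x_n is (Σ x_i)² / (n Σ x_i²); it equals 1 when all allocations are equal and 1/n when a single user receives everything. The sketch below computes it per RU. Treating the per-user throughput of the UEs attached to each RU as the allocation vector is an assumption, since the text does not state which quantity the index is computed over.

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in [1/n, 1]."""
    n = len(allocations)
    total = sum(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        return 0.0          # convention for empty or all-zero input (assumed)
    return total * total / (n * squares)


def per_ru_fairness(throughput_by_ru):
    """Map each RU label to the fairness index of its attached users."""
    return {ru: jain_index(x) for ru, x in throughput_by_ru.items()}


if __name__ == "__main__":
    # Hypothetical per-user throughputs (Mbps) for two RUs.
    print(per_ru_fairness({"RU-0": [5.0, 5.0, 5.0], "RU-1": [12.0, 2.0, 1.0]}))
```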

Figure 11 illustrates the normalized multi-objective performance of the evaluated methods using a radar chart that captures profit, inverse P95 latency, inverse energy cost, inverse fronthaul load, and the number of connected users. PCHAC consistently achieves the highest normalized values across all five dimensions, forming the outermost and most balanced profile, which indicates superior joint optimization of economic performance, service latency, energy efficiency, fronthaul utilization, and user connectivity. MORL and DDPG demonstrate strong performance in latency and connectivity; however, both methods exhibit reduced scores in energy and fronthaul efficiency, resulting in less uniform coverage across objectives. DQN attains relatively high profit and latency performance but shows noticeably weaker fronthaul efficiency and connectivity compared to PCHAC. The baseline approach performs well in fronthaul efficiency but displays substantially lower normalized values for profit, energy efficiency, and user connectivity, limiting its overall multi-objective effectiveness. PPO presents a smaller and uneven radar profile, characterized by particularly low latency performance and moderate values in the remaining dimensions. Overall, the radar visualization highlights PCHAC as the most well-balanced and robust solution under multi-objective evaluation, outperforming alternative methods across all considered criteria.

Figure 12 presents the percentage of served users whose downlink SINR meets or exceeds three thresholds (0, 5, and 10 dB) for the baseline scheme and the five adaptive policies (four RL-based methods and the deterministic SLA-aware controller PCHAC). PCHAC delivers the strongest coverage at every threshold, achieving 95.5% for SINR ≥ 0 dB, 85.9% for SINR ≥ 5 dB, and 67.7% for SINR ≥ 10 dB, which indicates a markedly larger share of users operating under favorable radio conditions. By comparison, the other approaches show substantially lower coverage, particularly at the stricter 10 dB target, where the baseline and MORL remain around the low single digits and the remaining DRL methods stay below roughly 12%. Among these DRL baselines, DQN is comparatively better at the 0–5 dB thresholds (e.g., 45.0% at 0 dB and 23.0% at 5 dB), whereas PPO and DDPG provide weaker performance at higher thresholds (about 5–9% at 10 dB). Overall, the figure demonstrates that PCHAC substantially increases the probability of users meeting moderate and high SINR requirements, supporting improved link robustness and interference management in the considered C-RAN scenario.
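The coverage metric is the share of served users whose SINR is at or above each threshold. A minimal Python sketch follows; the function name and the sample SINR values are hypothetical.

```python
def coverage_percent(sinr_db, thresholds=(0.0, 5.0, 10.0)):
    """Percentage of users with SINR >= each threshold (in dB)."""
    n = len(sinr_db)
    return {t: 100.0 * sum(s >= t for s in sinr_db) / n for t in thresholds}


if __name__ == "__main__":
    # Hypothetical downlink SINR samples (dB) for eight served users.
    print(coverage_percent([-3.0, 0.0, 4.0, 5.0, 12.0, 20.0, 7.0, 9.0]))
```

By construction, the percentages are non-increasing in the threshold, so the 10 dB figure is always the most demanding of the three.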

Figure 13 and Table 6 report profit per hour versus the number of active users for the six evaluated methods, with each point summarized over 10 independent random-seed runs and shown as the mean with uncertainty. Across all approaches, profit generally rises as user load increases, with several methods yielding negative profit at low load (50 users) before becoming positive as demand grows. PCHAC delivers the highest mean profit over the full load range, increasing from a small positive value at 50 users to the best performance at 300 users, and it maintains a clear advantage at the representative operating point of 200 users. MORL and the baseline also scale well at higher loads, reaching similarly high profit levels at 300 users, whereas DQN and DDPG exhibit weaker gains beyond 200 users and therefore remain below the top performers at heavy load. PPO shows less consistent behavior at the highest load, with profit increasing up to 250 users but dropping at 300 users, indicating reduced robustness under peak demand. Overall, the results indicate that PCHAC provides the most reliable and scalable profit performance as user demand increases.

Figure 14 and Table 7 present the p95 latency as a function of the number of active users for the six evaluated methods, with each point reported as the mean over 10 independent random-seed runs and accompanied by uncertainty to reflect variability from channel/traffic randomness and RL training. As expected, tail latency increases with user load for all approaches. PCHAC maintains the lowest mean p95 latency across the entire range, increasing from around 21 ms at 50 users to about 40 ms at 300 users. MORL and DDPG follow with slightly higher but closely tracking trends, while DQN exhibits consistently higher p95 latency than these methods. The baseline shows substantially larger tail latency, rising from the high-20 ms range at low load to above 50 ms at 300 users. PPO yields the highest p95 latency and degrades most sharply as load increases, exceeding 60 ms at 300 users. Overall, the method ranking remains stable across loads, indicating that PCHAC provides the most robust tail-latency control as demand scales.

PPO’s less favorable tail-latency behavior in this setting is consistent with the combination of (i) an on-policy update rule, which is typically more sample-demanding under stochastic channel/traffic variability, and (ii) the clipped surrogate objective, which can produce conservative policy updates when reward scales or advantage estimates are noisy. While PPO is widely effective in many networking tasks, its performance here is sensitive to hyperparameter choices (e.g., clipping range, learning rate, entropy regularization, batch size) and to reward scaling/penalty magnitudes. The reported PPO results therefore reflect a reasonable baseline configuration rather than an upper bound for PPO under exhaustive tuning, and further task-specific tuning may reduce the observed latency gap.

Figure 15 and Table 8 summarize energy cost versus the number of active users for the six evaluated methods, with each point reported as the mean over 10 independent random-seed runs and accompanied by uncertainty to capture variability across stochastic realizations. Energy cost increases with user load for all approaches, reflecting higher operational demand at larger scale. PCHAC consistently achieves the lowest mean energy cost across the full range, rising from about 1.35 $/h at 50 users to roughly 1.90 $/h at 300 users. The remaining methods follow similar upward trends but at higher cost levels; the baseline increases most sharply and reaches approximately 2.60 $/h at 300 users, while MORL, DQN, PPO, and DDPG remain between these extremes. Overall, the persistent separation between PCHAC and the other methods indicates superior energy efficiency as user demand scales.

Figure 16 and Table 9 report execution time versus the number of users for all evaluated methods, with each point summarized as the mean over 10 independent random-seed runs and accompanied by uncertainty. Execution time increases approximately linearly with user load for all approaches, reflecting the additional computation required at higher concurrency. The baseline remains the most efficient method, rising from about 2.0 ms at 50 users to roughly 3.0 ms at 300 users. MORL and PCHAC follow closely, with slightly higher but comparable execution times that reach around 3.1–3.2 ms at the highest load. DQN incurs additional overhead and grows to about 3.6 ms at 300 users, while DDPG increases further to approximately 4.0 ms. PPO consistently records the highest execution time, exceeding 4.3 ms at 300 users, indicating weaker scalability under heavy load. The higher runtime is consistent with PPO’s on-policy training/inference overhead and the additional policy/value-network computations per decision epoch under the adopted implementation. Overall, these results reveal clear differences in computational efficiency as user demand scales.

Sensitivity to MEC and Core Network Latency Assumptions

To assess whether the conclusions depend on the latency constants used in Table 2, a sensitivity analysis is conducted by sweeping L_{MEC} \in \{0.3, 1, 5\} ms and L_{core} \in \{2, 10, 20\} ms. For each setting, all methods are evaluated over N = 10 independent random seeds and reported as mean ± 95% confidence intervals. The profit sensitivity results are summarized in Table 10, while the p95 latency sensitivity results are reported in Table 11. The results confirm that the relative performance trends remain stable across a practical range of MEC and core-latency assumptions.
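The structure of such a sweep can be sketched as follows. Two points are assumptions rather than statements from the text: the sweep is taken to be a full 3 × 3 grid over both latency constants (a one-at-a-time sweep is equally possible), and `evaluate` is a hypothetical callback standing in for one simulator run.

```python
from itertools import product

L_MEC_MS = (0.3, 1.0, 5.0)      # MEC latency values from the text (ms)
L_CORE_MS = (2.0, 10.0, 20.0)   # core latency values from the text (ms)
SEEDS = range(10)               # N = 10 independent seeds


def sweep(evaluate, methods):
    """Collect per-seed KPI values for every (L_MEC, L_core, method) setting."""
    results = {}
    for l_mec, l_core in product(L_MEC_MS, L_CORE_MS):
        for method in methods:
            results[(l_mec, l_core, method)] = [
                evaluate(method, l_mec, l_core, seed) for seed in SEEDS
            ]
    return results


if __name__ == "__main__":
    # Placeholder evaluator: returns a dummy latency instead of simulating.
    dummy = lambda method, l_mec, l_core, seed: 20.0 + l_mec + l_core
    out = sweep(dummy, ["Baseline", "PCHAC"])
    print(len(out), "settings,", len(SEEDS), "seeds each")
```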

7. Discussion

The presented results collectively confirm that joint control across the radio, fronthaul, and edge-compute dimensions is essential for techno-economic optimization in 5G C-RAN. Across the evaluated policies, the strongest methods are those that explicitly account for competing objectives and operational constraints rather than optimizing a single KPI in isolation. In particular, the multi-objective and constraint-aware decision mechanisms deliver more stable behavior under load growth, where interference, fronthaul utilization, and MEC/core latency components jointly amplify tail latency and degrade user experience.

From an economic perspective, the methods that better manage the profit–cost trade-off achieve higher net utility by simultaneously improving revenue-related terms (e.g., connected users and service success) while limiting operational expenses (energy and fronthaul costs). This behavior is consistent with the observation that fronthaul efficiency and RU-level load balancing reduce unnecessary transport overhead and prevent localized congestion. The fronthaul-load visualization further indicates that capacity headroom is preserved under all methods, but the relative reductions produced by the best controller remain important because fronthaul usage directly impacts both cost and downstream latency components.

Service-level behavior is best understood by combining the tail-latency trend plots with the latency CDF. Methods that shift the CDF left and steepen its rise effectively reduce not only average delay but also high-percentile outliers—an outcome that is particularly relevant for SLA-driven operation. Importantly, a controller may appear competitive at moderate user density while becoming unstable at higher load; therefore, the scalability curves provide more reliable evidence of robustness than single operating-point comparisons. In this context, the most consistent policy is the one that maintains a favorable ordering across user-load scaling, rather than one that excels at a single snapshot.

Radio-side indicators (coverage probability, CQI distribution, and per-RU fairness) provide additional diagnostic insight into why certain controllers outperform others. A higher share of users operating in stronger link states (higher CQI and improved coverage probability at practical SINR thresholds) typically correlates with better scheduling decisions, RU–UE association balance, and reduced interference exposure. Similarly, higher Jain fairness across RUs indicates that the controller avoids systematically starving certain cells or concentrating traffic in a subset of RUs, which otherwise increases tail latency and can indirectly increase energy and fronthaul cost due to repeated retransmissions or inefficient resource usage.

The comparative workflow and complexity analysis highlight an important practical trade-off: learning-based methods can adapt to changing conditions but may impose higher training or inference overhead, whereas deterministic constrained selection can be more interpretable and stable but may face scalability limits as the action space expands. These results suggest that a promising deployment direction is hybrid orchestration, where a constraint-aware supervisor ensures SLA feasibility while a learning agent refines continuous control variables (e.g., TDD ratio and MEC offloading fraction) within safe bounds.

Finally, several limitations remain relevant for scientific rigor and reproducibility. First, the absolute latency level depends strongly on the latency decomposition, processing assumptions, and fronthaul modeling; the results are therefore best interpreted through percentile trends, relative ordering, and scalability rather than point estimates alone. Second, RU sleep/activation decisions are not yet tied to explicit energy-state modeling (transition costs, wake-up delay, and minimum on/off durations), so the reported energy savings may be optimistic. Third, extending the environment to include more realistic traffic mixtures, burstiness, and mobility would strengthen the generality of the findings. These extensions are not expected to change the main conclusion: multi-objective and constraint-aware control provides a more reliable foundation for 5G C-RAN orchestration under heterogeneous demands.

8. Conclusions

This paper investigated techno-economic and SLA-aware orchestration for 5G Cloud-RAN integrated with Mobile Edge Computing, where effective operation requires coordinated decisions across radio configuration, fronthaul management, RU activation behavior, and computation offloading. A unified evaluation framework was developed to benchmark a static baseline against multiple AI-driven strategies spanning value-based learning, policy-gradient learning, continuous-control actor–critic optimization, and explicitly multi-objective/constraint-aware decision mechanisms. The results demonstrate that intelligent control can improve the overall operational trade-off by increasing economic return while simultaneously enhancing service performance indicators such as tail latency, user connectivity, and radio-side quality distributions.

The comparative analysis highlights that controllers designed to handle competing objectives and enforce feasibility constraints provide the most stable behavior under increasing user load, where tail performance and resource bottlenecks become dominant. In contrast, methods that do not explicitly incorporate constraint handling or multi-objective balance may exhibit weaker robustness under scaling demand, despite performing well in limited operating points. Overall, the findings support the practical value of multi-objective and constraint-aware learning for next-generation C-RAN orchestration, especially in deployments where operational costs, fronthaul limits, and SLA targets must be simultaneously satisfied.

Future work will improve realism and deployment readiness by incorporating richer RU energy-state modeling (including transition costs and wake-up latency), adopting bursty and service-differentiated traffic patterns, and studying broader network scenarios with mobility and heterogeneous fronthaul. The simulator will also be extended to include explicit MEC queueing dynamics (e.g., M/M/1 or M/G/1) and to re-examine controller robustness under time-varying load. Moreover, hybrid designs that couple constraint-enforcing supervision with learning-based continuous control may offer a practical path toward scalable, reliable, and interpretable AI-assisted C-RAN management. Finally, the formulation will be generalized to a multi-step MDP with $\gamma > 0$ by introducing RU sleep/wake transition penalties, functional-split switching overheads, and explicit MEC queue evolution, enabling evaluation under delayed effects.
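As an illustration of what explicit MEC queueing would add over a constant latency term, the sketch below evaluates the standard mean sojourn time of an M/M/1 queue, W = 1/(μ − λ), and of an M/G/1 queue via the Pollaczek–Khinchine formula. The function names and the example loads are assumptions for illustration; only the 0.30 ms mean service time is borrowed from the L_MEC constant in Table 2, and treating that constant as a mean service time is itself an assumption.

```python
def mm1_sojourn_ms(arrival_per_ms, service_per_ms):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_per_ms >= service_per_ms:
        raise ValueError("unstable queue: need lambda < mu")
    return 1.0 / (service_per_ms - arrival_per_ms)


def mg1_sojourn_ms(arrival_per_ms, service_per_ms, scv):
    """Mean time in system for an M/G/1 queue (Pollaczek-Khinchine).

    scv is the squared coefficient of variation of the service time:
    scv = 1 recovers M/M/1 and scv = 0 gives M/D/1.
    """
    rho = arrival_per_ms / service_per_ms
    if rho >= 1.0:
        raise ValueError("unstable queue: need lambda < mu")
    mean_service = 1.0 / service_per_ms
    mean_wait = rho / (1.0 - rho) * (1.0 + scv) / 2.0 * mean_service
    return mean_service + mean_wait


if __name__ == "__main__":
    mu = 1.0 / 0.30  # assumed: mean service time of 0.30 ms
    for rho in (0.5, 0.8, 0.9):
        lam = rho * mu
        print(rho, mm1_sojourn_ms(lam, mu), mg1_sojourn_ms(lam, mu, 0.0))
```

Under these assumptions the M/M/1 delay grows from 0.6 ms at 50% utilization to 3.0 ms at 90%, so a load-dependent MEC term could behave quite differently from a fixed constant once offloading pushes the server toward saturation.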

Author Contributions

S.M.A. led the conceptualization and design of the study, developed the proposed approach, conducted the experiments, and prepared the manuscript. Together with E.M.E., S.M.A. contributed to system modeling and supported the analysis and interpretation of the results. S.M.A. and S.S.A. assisted in the development and evaluation of the adaptive methods and contributed to performance analysis. H.M.A.E.K. provided technical oversight, supported validation and analysis, and reviewed the methodology. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

The data used to support the findings of this study were generated through simulations and are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. 5G C-RAN with MEC architecture showing UE–RU access, fronthaul to centralized DU/CU processing, and MEC offloading with backhaul/core connectivity.

Figure 2. Unified control pipelines for the baseline and adaptive approaches in a 5G C-RAN system with MEC.

Figure 3. PCHAC workflow for SLA-aware C-RAN/MEC configuration selection with KPI/reward feedback.

Figure 4. Simulated 5G C-RAN topology with RUs, UEs, and centralized DU/CU.

Figure 5. Economics (mean over 10 seeds; 95% CI).

Figure 6. Service KPIs (mean over 10 seeds; 95% CI).

Figure 7. Cumulative distribution functions (CDFs) of end-to-end latency.

Figure 8. CQI distribution across users for all evaluated methods.

Figure 9. Per-RU fronthaul load with capacity limit.

Figure 10. Jain’s fairness index per radio unit.

Figure 11. Normalized multi-objective performance comparison.

Figure 12. Percentage of users exceeding SINR thresholds.

Figure 13. Profit vs. users (mean over 10 seeds; 95% CI).

Figure 14. p95 latency vs. users (mean over 10 seeds; 95% CI).

Figure 15. Energy cost vs. users (mean over 10 seeds; 95% CI).

Figure 16. Execution time vs. users (mean over 10 seeds; 95% CI).

Table 1. Simulation and system parameters for the C-RAN downlink scenario.

Parameter/Symbol	Meaning	Unit	Value
f_c	Carrier frequency (FR1 mid-band example)	GHz	3.5
B	System bandwidth	MHz	100
P_tx	RU/gNB transmit power (per RU, EIRP-related)	dBm	46
NF	Receiver noise figure	dB	7
σ_sh	Shadow-fading standard deviation	dB	6
μ	Numerology index (μ = 1 → 30 kHz SCS)	–	1
TDD_DL	Downlink time fraction (TDD duty factor)	–	0.7
Split	C-RAN functional split (e.g., 2 or 7.2)	–	“2” (baseline)
C_FH	Fronthaul capacity per RU (illustrative)	Gbps	1
Area	Simulation area size	m × m	1000 × 1000
N_RU, N_UE	Number of RUs and UEs in a snapshot	count	10, 200

Table 2. MEC-related parameters and latency assumptions for edge-enabled operation.

Parameter/Symbol	Meaning	Unit	Value
MEC_enabled	MEC availability flag	–	True
ρ_off	Offloaded traffic/task fraction	–	0.4
L_MEC	MEC processing latency term (simplified constant)	ms	0.30
L_core	Core/backhaul latency term (simplified constant)	ms	2.0
MEC_loc	Edge server co-location/placement (example)	–	Center of area

Table 3. State, action, reward, and training parameters of the reinforcement learning environment.

Parameter/Symbol	Meaning	Unit	Value
State (s)	Observation vector (RU loads, FH load, connected count)	–	Normalized RU-load + FH + n_connected
Action (a)	Joint control (split, TDD, RU sleep, MEC fraction)	–	Discrete grid; DDPG uses continuous (TDD, MEC) mapped to grid
Reward (r)	Weighted multi-objective with SLA penalties	–	$w^{T} f(a) - \text{penalties}$
DQN steps/batch	Training steps and minibatch size	steps/samples	240/32
PPO batches/updates	Episodes per batch and epochs per batch	episodes/epochs	64/6
DDPG steps/batch	Training steps and minibatch size	steps/samples	240/32
PER (α)	Prioritized replay exponent	–	0.6
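The reward entry in Table 3 can be read as a weighted sum of objective values minus SLA penalties. The sketch below is one plausible instantiation, not the paper's implementation: the hinge-style penalty, the function name `scalarized_reward`, and all weights, objective values, and penalty coefficients are assumptions for illustration.

```python
def scalarized_reward(objectives, weights, violations, penalty_coeffs):
    """Compute r = w^T f(a) - sum_k c_k * max(0, violation_k).

    objectives:     normalized objective values f(a), higher is better
    weights:        preference weights w, same length as objectives
    violations:     signed SLA constraint violations (<= 0 means satisfied)
    penalty_coeffs: non-negative penalty coefficients c_k (assumed form)
    """
    if len(objectives) != len(weights):
        raise ValueError("objectives and weights must have equal length")
    if len(violations) != len(penalty_coeffs):
        raise ValueError("violations and penalty_coeffs must have equal length")
    value = sum(w * f for w, f in zip(weights, objectives))
    penalty = sum(c * max(0.0, v) for c, v in zip(penalty_coeffs, violations))
    return value - penalty


if __name__ == "__main__":
    # Hypothetical numbers: three objectives, two SLA constraints.
    r = scalarized_reward(
        objectives=[0.8, 0.6, 0.5],
        weights=[0.5, 0.3, 0.2],
        violations=[-0.1, 0.2],   # first constraint met, second violated
        penalty_coeffs=[1.0, 2.0],
    )
    print(r)
```

In this hypothetical case the weighted value is 0.68 and the single violated constraint costs 0.40, so the reward is about 0.28. A satisfied constraint contributes nothing, which is what separates a penalty term from simply adding another objective.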

Table 4. Economic performance across methods, reported as mean ± 95% CI over 10 random-seed runs.

Metric	Baseline	MORL	PCHAC	DQN	PPO	DDPG
Profit ($/h) ↑	8.126 ± 0.144	9.834 ± 0.206	10.382 ± 0.143	9.999 ± 0.173	9.621 ± 0.202	9.485 ± 0.198
Energy ($/h) ↓	2.002 ± 0.034	1.977 ± 0.031	1.813 ± 0.035	1.989 ± 0.033	2.014 ± 0.030	2.009 ± 0.025
FH ($/h) ↓	0.012 ± 0.001	0.022 ± 0.001	0.012 ± 0.001	0.019 ± 0.001	0.021 ± 0.001	0.013 ± 0.001
Net utility ($/h) ↑	6.274 ± 0.165	7.622 ± 0.120	8.458 ± 0.077	8.139 ± 0.166	7.674 ± 0.122	7.492 ± 0.118
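The tables report each metric as mean ± 95% CI over 10 random-seed runs. How the interval is constructed is not stated here; a common choice, assumed in the sketch below, is a two-sided Student-t interval with 9 degrees of freedom. The per-seed samples in the usage example are hypothetical and are not the paper's data.

```python
import math
import statistics

# Two-sided 95% Student-t quantile for 9 degrees of freedom (n = 10 seeds).
T_CRIT_DF9 = 2.262


def mean_ci95(samples, t_crit=T_CRIT_DF9):
    """Return (mean, half_width) of a t-based 95% confidence interval.

    The default t_crit is only valid for exactly 10 samples; pass the
    matching quantile for any other sample size.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical per-seed profit values in $/h (not taken from the paper).
    profit = [10.1, 10.4, 10.3, 10.5, 10.2, 10.6, 10.4, 10.3, 10.5, 10.5]
    mean, half = mean_ci95(profit)
    print(f"{mean:.3f} ± {half:.3f}")
```

If the intervals were instead built from a normal approximation (quantile 1.96), they would be roughly 13% narrower for the same samples, which matters when judging whether two methods' intervals overlap.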

Table 5. Latency percentile summary (p50, p75, p90, p95, p99) for each method, reported as mean ± 95% CI over 10 random-seed runs.

Method	p50 (ms)	p75 (ms)	p90 (ms)	p95 (ms)	p99 (ms)
Baseline	32.95 ± 0.35	40.56 ± 0.58	47.94 ± 0.58	50.24 ± 0.46	51.46 ± 0.32
MORL	25.45 ± 0.35	28.59 ± 0.28	33.45 ± 0.28	36.77 ± 0.39	38.57 ± 0.58
PCHAC	22.00 ± 0.20	24.09 ± 0.26	28.29 ± 0.25	29.93 ± 0.37	30.83 ± 0.25
DQN	27.49 ± 0.29	32.07 ± 0.49	38.27 ± 0.67	43.62 ± 0.69	45.90 ± 0.74
PPO	39.96 ± 0.38	47.80 ± 0.72	55.79 ± 0.60	59.89 ± 0.72	61.52 ± 0.49
DDPG	24.24 ± 0.26	26.93 ± 0.27	30.12 ± 0.48	31.90 ± 0.44	32.73 ± 0.35

Table 6. Profit ($/h) versus users for all methods, reported as mean ± 95% CI over 10 random-seed runs.

Users	Baseline	MORL	PCHAC	DQN	PPO	DDPG
50	−1.699 ± 0.205	−0.997 ± 0.252	0.132 ± 0.226	−1.877 ± 0.247	−1.770 ± 0.290	−1.584 ± 0.208
100	1.557 ± 0.235	2.007 ± 0.263	2.372 ± 0.111	0.871 ± 0.317	1.859 ± 0.253	2.566 ± 0.210
150	5.467 ± 0.255	5.929 ± 0.264	6.322 ± 0.124	4.372 ± 0.258	5.086 ± 0.268	5.983 ± 0.194
200	8.309 ± 0.207	9.672 ± 0.089	10.436 ± 0.267	10.155 ± 0.338	9.707 ± 0.274	9.198 ± 0.148
250	11.215 ± 0.279	12.527 ± 0.171	12.750 ± 0.148	10.389 ± 0.277	10.549 ± 0.293	9.765 ± 0.336
300	12.943 ± 0.191	13.439 ± 0.237	13.694 ± 0.138	11.342 ± 0.119	9.674 ± 0.268	11.515 ± 0.224

Table 7. p95 latency (ms) versus users for all methods, reported as mean ± 95% CI over 10 random-seed runs.

Users	Baseline	MORL	PCHAC	DQN	PPO	DDPG
50	28.30 ± 0.62	24.01 ± 0.72	20.80 ± 0.68	24.08 ± 0.60	34.77 ± 0.97	23.79 ± 0.52
100	32.17 ± 0.71	27.02 ± 0.75	23.62 ± 0.33	28.68 ± 0.78	40.20 ± 0.84	26.66 ± 0.52
150	35.90 ± 0.76	29.80 ± 0.75	28.37 ± 0.37	33.18 ± 0.63	45.29 ± 0.89	28.96 ± 0.49
200	40.25 ± 0.62	33.83 ± 0.26	31.25 ± 0.80	37.23 ± 0.83	50.16 ± 0.91	32.32 ± 0.37
250	46.64 ± 0.84	38.65 ± 0.49	34.85 ± 0.44	41.46 ± 0.68	55.83 ± 0.98	36.41 ± 0.84
300	51.83 ± 0.57	43.68 ± 0.68	39.68 ± 0.41	46.37 ± 0.29	61.25 ± 0.89	42.04 ± 0.56

Table 8. Energy cost ($/h) versus users for all methods, reported as mean ± 95% CI over 10 random-seed runs.

Users	Baseline	MORL	PCHAC	DQN	PPO	DDPG
50	1.613 ± 0.026	1.500 ± 0.036	1.341 ± 0.030	1.508 ± 0.027	1.572 ± 0.032	1.520 ± 0.026
100	1.807 ± 0.029	1.651 ± 0.038	1.483 ± 0.015	1.706 ± 0.035	1.757 ± 0.028	1.653 ± 0.026
150	1.996 ± 0.032	1.790 ± 0.038	1.616 ± 0.017	1.908 ± 0.029	1.930 ± 0.030	1.778 ± 0.024
200	2.210 ± 0.026	1.942 ± 0.013	1.711 ± 0.036	2.061 ± 0.038	2.105 ± 0.030	1.886 ± 0.018
250	2.427 ± 0.035	2.132 ± 0.024	1.793 ± 0.020	2.241 ± 0.031	2.274 ± 0.033	2.051 ± 0.042
300	2.593 ± 0.024	2.284 ± 0.034	1.886 ± 0.018	2.371 ± 0.013	2.425 ± 0.030	2.222 ± 0.028

Table 9. Execution time versus users for all methods, reported as mean ± 95% CI over 10 random-seed runs.

Users	Baseline	MORL	PCHAC	DQN	PPO	DDPG
50	2.020 ± 0.041	2.101 ± 0.057	2.182 ± 0.060	2.316 ± 0.055	2.681 ± 0.077	2.579 ± 0.052
100	2.211 ± 0.047	2.301 ± 0.060	2.366 ± 0.030	2.571 ± 0.071	3.016 ± 0.067	2.866 ± 0.052
150	2.393 ± 0.051	2.484 ± 0.060	2.633 ± 0.033	2.816 ± 0.057	3.323 ± 0.071	3.096 ± 0.049
200	2.617 ± 0.041	2.687 ± 0.020	2.822 ± 0.071	3.021 ± 0.075	3.613 ± 0.073	3.332 ± 0.037
250	2.843 ± 0.056	2.952 ± 0.039	2.987 ± 0.039	3.342 ± 0.062	3.886 ± 0.078	3.641 ± 0.084
300	2.989 ± 0.038	3.155 ± 0.054	3.172 ± 0.037	3.543 ± 0.026	4.240 ± 0.071	4.004 ± 0.056

Table 10. Profit ($/h) sensitivity at 200 users under the (L_MEC, L_core) sweep (mean ± 95% CI, N = 10).

L_MEC (ms)	L_core (ms)	Baseline	MORL	PCHAC	DQN	PPO	DDPG
0.3	2	8.309 ± 0.207	9.672 ± 0.089	10.436 ± 0.267	10.155 ± 0.338	9.707 ± 0.274	9.198 ± 0.148
0.3	10	8.029 ± 0.207	9.432 ± 0.089	10.236 ± 0.267	9.891 ± 0.338	9.387 ± 0.274	8.926 ± 0.148
0.3	20	7.679 ± 0.207	9.132 ± 0.089	9.986 ± 0.267	9.561 ± 0.338	8.987 ± 0.274	8.586 ± 0.148
1.0	2	8.284 ± 0.207	9.651 ± 0.089	10.418 ± 0.267	10.132 ± 0.338	9.679 ± 0.274	9.174 ± 0.148
1.0	10	8.005 ± 0.207	9.411 ± 0.089	10.219 ± 0.267	9.868 ± 0.338	9.359 ± 0.274	8.902 ± 0.148
1.0	20	7.654 ± 0.207	9.111 ± 0.089	9.969 ± 0.267	9.538 ± 0.338	8.959 ± 0.274	8.562 ± 0.148
5.0	2	8.144 ± 0.207	9.531 ± 0.089	10.319 ± 0.267	10.000 ± 0.338	9.519 ± 0.274	9.038 ± 0.148
5.0	10	7.864 ± 0.207	9.291 ± 0.089	10.118 ± 0.267	9.736 ± 0.338	9.199 ± 0.274	8.766 ± 0.148
5.0	20	7.514 ± 0.207	8.991 ± 0.089	9.868 ± 0.267	9.406 ± 0.338	8.799 ± 0.274	8.426 ± 0.148

Table 11. p95 latency (ms) sensitivity at 200 users under the (L_MEC, L_core) sweep (mean ± 95% CI, N = 10).

L_MEC (ms)	L_core (ms)	Baseline	MORL	PCHAC	DQN	PPO	DDPG
0.3	2	40.25 ± 0.62	33.83 ± 0.26	31.25 ± 0.80	37.23 ± 0.83	50.16 ± 0.91	32.32 ± 0.37
0.3	10	48.25 ± 0.62	41.83 ± 0.26	39.25 ± 0.80	45.23 ± 0.83	58.16 ± 0.91	40.32 ± 0.37
0.3	20	58.25 ± 0.62	51.83 ± 0.26	49.25 ± 0.80	55.23 ± 0.83	68.16 ± 0.91	50.32 ± 0.37
1.0	2	40.95 ± 0.62	34.53 ± 0.26	31.95 ± 0.80	37.93 ± 0.83	50.86 ± 0.91	33.02 ± 0.37
1.0	10	48.95 ± 0.62	42.53 ± 0.26	39.95 ± 0.80	45.93 ± 0.83	58.86 ± 0.91	41.02 ± 0.37
1.0	20	58.95 ± 0.62	52.53 ± 0.26	49.95 ± 0.80	55.93 ± 0.83	68.86 ± 0.91	51.02 ± 0.37
5.0	2	44.95 ± 0.62	38.53 ± 0.26	35.95 ± 0.80	41.93 ± 0.83	54.86 ± 0.91	37.02 ± 0.37
5.0	10	52.95 ± 0.62	46.53 ± 0.26	43.95 ± 0.80	49.93 ± 0.83	62.86 ± 0.91	45.02 ± 0.37
5.0	20	62.95 ± 0.62	56.53 ± 0.26	53.95 ± 0.80	59.93 ± 0.83	72.86 ± 0.91	55.02 ± 0.37
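A pattern worth noting in Table 11 is that every entry equals the reference-row value at (L_MEC, L_core) = (0.3, 2) plus the change in the two latency terms, which is consistent with both terms being simplified constants (Table 2). The sketch below reproduces this reading; the additive rule is an observation drawn from the tabulated values rather than a model stated by the authors, and the function name `shifted_p95` is hypothetical.

```python
# Reference-row p95 latency (ms) at L_MEC = 0.3 ms, L_core = 2 ms (Table 11).
REF_P95_MS = {
    "Baseline": 40.25,
    "MORL": 33.83,
    "PCHAC": 31.25,
    "DQN": 37.23,
    "PPO": 50.16,
    "DDPG": 32.32,
}
L_MEC_REF_MS = 0.3
L_CORE_REF_MS = 2.0


def shifted_p95(method, l_mec_ms, l_core_ms):
    """Reference p95 plus the change in the two constant latency terms."""
    return (
        REF_P95_MS[method]
        + (l_mec_ms - L_MEC_REF_MS)
        + (l_core_ms - L_CORE_REF_MS)
    )


if __name__ == "__main__":
    # Spot-check a few cells against the tabulated means.
    checks = [
        ("Baseline", 5.0, 20.0, 62.95),
        ("PCHAC", 5.0, 20.0, 53.95),
        ("DDPG", 1.0, 10.0, 41.02),
        ("PPO", 5.0, 10.0, 62.86),
    ]
    for method, l_mec, l_core, tabulated in checks:
        predicted = shifted_p95(method, l_mec, l_core)
        print(method, round(predicted, 2), tabulated)
```

Because the shift is identical for every method, the latency ordering among methods is unchanged across the sweep; only the absolute level moves. The profit values in Table 10 do not follow a single common offset in the same way, so this shortcut applies to the latency table only.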
