1. Introduction
The continuous expansion of 5G mobile communication networks has created strong demand for high data rates, ultra-reliable low-latency services, and massive user connectivity under highly dynamic radio and traffic conditions [1]. These requirements become more complex in cloud radio access network (C-RAN) deployments, where centralized baseband processing, fronthaul limitations, and distributed radio resources jointly impact system performance [2]. The integration of multi-access edge computing (MEC) further increases system complexity by introducing computation-related delay components and additional control decisions for latency-sensitive services [3].
Traditional optimization methods for C-RAN and MEC-enabled systems, such as fixed policies, handcrafted heuristics, or single-objective formulations, often struggle to maintain stable performance when multiple objectives must be satisfied simultaneously [4,5]. In practical deployments, network operators must balance conflicting goals, including profit maximization, energy consumption minimization, fronthaul load control, and quality-of-service assurance (e.g., latency and coverage constraints). The underlying decision process is non-linear and depends on stochastic factors such as user distribution, propagation effects, interference behavior, and traffic load, which limits the effectiveness of static or purely deterministic resource management [6].
To address these challenges, this work develops an AI-driven optimization framework for a 5G C-RAN architecture enhanced with MEC capabilities, enabling adaptive decision-making across radio, fronthaul, and computation dimensions. The proposed framework evaluates multiple intelligent strategies under a unified simulation and techno-economic modeling environment. A baseline configuration serves as the reference point, and the study compares five AI-based methods representing different learning and control paradigms: multi-objective reinforcement learning (MORL), which balances competing KPIs through explicit trade-off modeling; PCHAC, a deterministic SLA-aware controller for constraint enforcement; and three standard deep reinforcement learning (DRL) algorithms, DQN, PPO, and DDPG, which collectively cover value-based learning, policy-gradient optimization, and continuous-control decision spaces.
In 5G cloud-RAN deployments, network operators must meet strict service-level targets—particularly tail-latency requirements for delay-sensitive applications—while limiting operating costs such as energy usage and fronthaul consumption. Centralized RAN processing and MEC improve resource pooling and flexibility, but they also create tightly coupled configuration decisions, including functional split choice, TDD scheduling, offloading fraction, and RU sleep control, each of which impacts both user experience and cost. Since traffic loads and radio conditions fluctuate over time, fixed parameter settings can either violate SLAs when resources are insufficient or waste resources when provisioning is excessive. These practical constraints motivate a techno-economic, SLA-aware control framework that dynamically selects cloud-RAN/MEC configurations to improve operator utility while respecting latency and fronthaul limits.
This study aims to quantify these trade-offs and develop an adaptive configuration-control framework that increases profit/utility while satisfying SLA constraints under stochastic traffic and channel variations.
Core research question: How can cloud-RAN/MEC configuration knobs (functional split, TDD ratio, MEC offloading, and RU sleep) be jointly controlled to maximize techno-economic utility while satisfying tail-latency and fronthaul constraints under stochastic traffic and channel variations?
The central contribution of this paper is a unified techno-economic and SLA-aware configuration-control formulation and evaluation pipeline for cloud-RAN/MEC; PCHAC and the DRL algorithms are included as representative controllers/baselines to demonstrate and benchmark the framework under identical conditions.
The main contributions of this study are summarized as follows:
Unified 5G C-RAN + MEC System Framework: A comprehensive simulation model incorporating radio propagation, user association, SINR-based link adaptation, MEC processing delay, and fronthaul performance within a single evaluation pipeline.
Joint KPI Optimization with Realistic Techno-Economic Metrics: A multi-dimensional performance model capturing network profit, energy cost, fronthaul cost, latency behavior, and user connectivity in scalable scenarios.
Incorporation of Constraint-Aware Control via PCHAC: A deterministic SLA-aware controller designed to guide operation toward feasibility under service-level constraints while maintaining competitive techno-economic performance.
Cross-Method Benchmarking Under Identical Conditions: A fair comparison of baseline, MORL, PCHAC, DQN, PPO, and DDPG using the same network topology, simulation parameters, and KPI definitions across different user densities.
Extensive KPI-Level Evaluation and Scalability Analysis: A detailed performance assessment using economics, service KPIs, coverage probability, CQI distributions, fairness across radio units, and execution time behavior as the number of users increases.
The structure of this paper is as follows.
Section 2 reviews related work.
Section 3 describes the considered 5G C-RAN architecture with MEC integration, and Section 4 outlines the environment and state modeling adopted in the study.
Section 5 details the methodology and the AI-based optimization approaches used for system control.
Section 6 presents the simulation results and provides a comprehensive comparative analysis of the evaluated strategies.
Section 7 discusses key observations, limitations, and directions for future research, followed by the concluding remarks.
2. Related Work
Recent studies indicate that Proximal Policy Optimization (PPO) is an effective reinforcement learning approach for network management tasks such as spectrum allocation, scheduling, power control, and network slicing, delivering improvements in throughput, energy efficiency, and SLA compliance. Sherif et al. [7] propose a PPO-based framework for O-RAN slicing that integrates latency-aware reward design, action filtering, and policy transfer learning between non-real-time and near-real-time RIC agents, achieving faster convergence and reduced energy consumption. Mhatre et al. [8] address QoS-aware intra-slice resource allocation in beyond-5G O-RAN using a DQN-based model deployed at the near-real-time RIC, where parameterized user association reduces action-space complexity while improving eMBB throughput and ultra-reliable low-latency communication (URLLC) latency. Vidhya et al. [9] introduce a dynamic network slicing framework that combines multi-agent deep reinforcement learning with optimization-based VNF migration, employing Soft Actor-Critic and Red Panda Optimization to enhance slice admission and QoS under resource constraints. Wang et al. [10] investigate end-to-end resource allocation for network slicing with public and non-public networks, combining Independent PPO for RAN decisions with a Random Forest model for feasibility checking, resulting in improved SLA satisfaction and operator profit.
Beyond PPO-based approaches, Deep Q-Network (DQN) methods have been widely shown to outperform static and heuristic optimization techniques. Luo et al. [11] enhance resource allocation by integrating DQN with gradient boosting, while Sun et al. [12] propose a dueling DQN architecture for autonomous RRH activation, achieving significant energy savings. In O-RAN environments, Wang et al. [13] develop a DQN-based xApp for dynamic radio card deactivation, improving energy efficiency without throughput degradation. Additional contributions include adaptive functional split selection in non-terrestrial O-RAN by Shahabi et al. [14] and joint RAN–MEC resource allocation with slicing support by Martínez-Morfa et al. [15].
Actor–critic reinforcement learning has also been applied to Cloud RAN optimization. Mohsen Khani et al. [16] formulate C-RAN resource allocation using actor–critic methods, including Twin-Delayed Deep Deterministic Policy Gradient (TD3), to control continuous radio-resource variables under QoS and fronthaul constraints, demonstrating improved stability over conventional DDPG. Chenjing Tian et al. [17] extend this approach to a multi-agent setting using MADDPG for network slice reconfiguration, enabling coordinated optimization of resource cost and QoS in dynamic environments.
Finally, Multi-Objective Reinforcement Learning (MORL) has emerged as a promising framework for addressing conflicting objectives in 5G and edge computing systems. Zhang et al. [18] propose a multi-DNN approach that jointly optimizes MEC task offloading and resource allocation under secure data transmission constraints. Song et al. [19] apply MORL to dependent task offloading in MEC, achieving reductions in latency and energy consumption through Pareto-optimal scheduling. Huang et al. [20] develop a MORL-based task offloading strategy for heterogeneous vehicular edge computing, while Xu et al. [21] introduce a MORL-driven offloading algorithm for satellite edge computing that jointly optimizes latency, resource utilization, and load balancing. Ismail et al. [22] provide a comprehensive survey of MEC resource scheduling, highlighting the growing role of DRL and MORL in enabling adaptive and energy-aware service management.
3. System Model
The proposed framework represents an intelligent C-RAN that integrates multiple artificial intelligence–driven control mechanisms to enhance network performance, reduce operational expenditure, and improve energy efficiency under time-varying 5G traffic demands. The simulated architecture consists of Remote Units (RUs), Distributed Units (DUs), a centralized Baseband Unit (BBU), and a MEC platform interconnected via high-capacity fronthaul and backhaul links.
Figure 1 illustrates the considered 5G C-RAN architecture with MEC integration and the main information flows over fronthaul/backhaul links.
To ensure a consistent and reproducible evaluation, the proposed 5G C-RAN with MEC framework is parameterized using a set of widely adopted physical-layer, architectural, and traffic assumptions. These parameters define the operating spectrum, bandwidth, transmit power, numerology, duplexing configuration, functional split, fronthaul capacity, network scale, and user density, and are selected in accordance with commonly used 5G NR and C-RAN modeling practices reported in the literature [23,24,25,26]. The main simulation and system parameters adopted in this study are summarized in Table 1.
In addition to radio and fronthaul configuration, the system model incorporates edge-computing capabilities to account for computation offloading and service latency behavior. The MEC-related parameters, including offloading fraction, processing latency, and core/backhaul delay assumptions, are aligned with prior studies on edge-enabled 5G systems and computation offloading frameworks [23,27,28]. These parameters and latency components are reported in Table 2.
The MEC and core-network delays, $L_{\mathrm{mec}}$ and $L_{\mathrm{core}}$, are modeled as constant latency terms to isolate the impact of configuration control on end-to-end performance. Fixed-latency modeling is commonly adopted in 5G/MEC latency studies, and low-ms core-network latency targets and sub-ms transport segments are reported in the low-latency 5G literature. To ensure conclusions are not tied to a single setting, a sensitivity analysis over $L_{\mathrm{mec}}$ and $L_{\mathrm{core}}$ is provided in Section 6 [29,30].
For each user equipment (UE) $u$, the downlink signal-to-interference-plus-noise ratio (SINR) is computed as
$$\mathrm{SINR}_u = \frac{P_{s,u}}{\sum_{i \neq s} \alpha_i P_{i,u} + N_0} \tag{1}$$
where $P_{s,u}$ denotes the received signal power from the serving RU, $P_{i,u}$ represents the interference contributions from neighboring RUs, $\alpha_i$ captures interference weighting, and $N_0$ is the thermal noise power. The resulting SINR is mapped to a Channel Quality Indicator (CQI), which in turn determines the achievable spectral efficiency through standardized link adaptation functions.
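The SINR-to-CQI step can be illustrated with a minimal sketch. It assumes linear-scale powers and a single scalar interference weight; the names `sinr_db`, `sinr_to_cqi`, `alpha`, and the threshold list `CQI_THRESHOLDS_DB` are hypothetical, and the thresholds are evenly spaced placeholders rather than the standardized mapping used by the simulator.

```python
import math

# Hypothetical SINR thresholds (dB) for CQI 1..15; illustrative placeholders,
# not the standardized link-adaptation table.
CQI_THRESHOLDS_DB = [-6.0 + 2.0 * i for i in range(15)]


def sinr_db(p_serving, p_interferers, noise, alpha=1.0):
    """SINR in dB from linear powers: P_s / (alpha * sum(P_i) + N0)."""
    interference = alpha * sum(p_interferers)
    return 10.0 * math.log10(p_serving / (interference + noise))


def sinr_to_cqi(sinr):
    """Highest CQI index whose threshold the SINR meets (0 means out of range)."""
    cqi = 0
    for index, threshold in enumerate(CQI_THRESHOLDS_DB, start=1):
        if sinr >= threshold:
            cqi = index
    return cqi
```

With a serving power of 1e-9, two interferers at 1e-10 each, and noise 1e-10, the sketch yields roughly 5.2 dB, which falls in the sixth band of the placeholder table.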
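The achievable downlink throughput for UE $u$ is modeled as
$$R_u = B \cdot \eta_{\mathrm{DL}} \cdot (1 - \mathrm{OH}) \cdot \mathrm{SE}(\mathrm{CQI}_u) \cdot (1 - \mathrm{BLER}) \cdot \mathrm{share}_u \tag{2}$$
where $B$ is the system bandwidth, $\eta_{\mathrm{DL}}$ represents the downlink time allocation in the TDD frame, $\mathrm{OH}$ accounts for protocol overhead, $\mathrm{SE}(\mathrm{CQI}_u)$ is the spectral efficiency obtained from the CQI mapping, $\mathrm{BLER}$ denotes the block error rate, and $\mathrm{share}_u$ reflects the scheduler's resource allocation to the user.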
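As a worked illustration of how these factors combine, the sketch below multiplies them in a single function. It assumes that overhead and BLER enter as `(1 - x)` factors and that the spectral efficiency has already been obtained from the CQI; the function name and all numeric inputs are hypothetical.

```python
def downlink_throughput_bps(bandwidth_hz, spectral_eff, dl_fraction,
                            overhead, bler, share):
    """Per-UE downlink throughput in bit/s under a multiplicative model.

    Assumes overhead and BLER act as (1 - x) efficiency factors.
    """
    return (bandwidth_hz * spectral_eff * dl_fraction
            * (1.0 - overhead) * (1.0 - bler) * share)


# Example inputs (hypothetical): 100 MHz, 4 bit/s/Hz, 70% downlink,
# 14% overhead, 10% BLER, and a quarter of the scheduled resources.
rate = downlink_throughput_bps(100e6, 4.0, 0.7, 0.14, 0.1, 0.25)
```

Under these inputs the rate is about 54.2 Mbit/s; raising the downlink fraction scales the result linearly, which is the lever the TDD control variable acts on.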
End-to-end latency is modeled as the sum of air-interface, processing, fronthaul, core-network, and MEC components. To account for MEC congestion, a load-dependent queuing delay term may be incorporated for offloaded tasks, as described below.
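To capture MEC congestion effects, MEC processing can be approximated by an M/M/1 queue with arrival rate $\lambda$ (offloaded task arrivals) and service rate $\mu$ (effective MEC service capacity). The mean queuing delay is
$$L_q = \frac{\lambda}{\mu(\mu - \lambda)}, \quad \lambda < \mu \tag{3}$$
Accordingly, the end-to-end delay can be written as
$$L_{\mathrm{e2e}} = L_{\mathrm{air}} + L_{\mathrm{proc}} + L_{\mathrm{fh}} + L_{\mathrm{core}} + L_{\mathrm{mec}} + L_q \tag{4}$$
where $L_{\mathrm{air}}$ denotes the air-interface transmission delay, $L_{\mathrm{proc}}$ represents the processing delay, $L_{\mathrm{fh}}$ corresponds to the fronthaul delay, $L_{\mathrm{core}}$ captures the core network delay, $L_{\mathrm{mec}}$ accounts for the MEC processing delay, and $L_q$ is the optional queuing term of Equation (3).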
Queueing dynamics at the MEC are not explicitly simulated in the current evaluation; Lmec is treated as a constant term to isolate the impact of configuration control and to keep the simulator tractable. The M/M/1 extension above is provided to show how load-dependent congestion can be incorporated into the latency model. A full empirical re-evaluation of all methods under explicit queueing dynamics is left for future work.
Under an M/M/1 approximation, queuing delay increases sharply as utilization $\rho = \lambda/\mu$ approaches 1. In the evaluated settings, the arrival rate $\lambda$ is bounded by the offloading fraction and the task generation rate, while the service rate $\mu$ is determined by the MEC compute capacity (Table 2). Thus, the fixed-delay assumption corresponds to operating in a regime where $\rho$ remains sufficiently below 1; if $\rho$ approaches saturation, the queuing term should be activated and the controller would need to account for congestion.
The optimization study compares a baseline policy with five control strategies that adjust the operational control variables, namely the C-RAN functional split, the TDD downlink–uplink ratio, the MEC offloading fraction, and the number of sleeping RUs, with the goal of maximizing system profit while meeting latency and fronthaul-capacity constraints. The compared methods are: (1) a static baseline based on predefined heuristic rules; (2) Deep Q-Network (DQN), representing value-based reinforcement learning; (3) Proximal Policy Optimization (PPO), representing a policy-gradient method; (4) Multi-Objective Reinforcement Learning (MORL), which searches for Pareto-efficient trade-offs among competing objectives; (5) the Penalty-Constrained Hierarchical Action Controller (PCHAC), a deterministic SLA-aware controller that selects configurations using a scoring rule with explicit constraint handling; and (6) Deep Deterministic Policy Gradient (DDPG), an actor–critic approach that learns deterministic control policies for resource management.
All strategies are evaluated under identical simulation settings and KPI definitions; results are aggregated over 10 independent random seeds, which re-sample stochastic channel and traffic realizations to quantify variability.
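A minimal sketch of seed-based aggregation is given below. The function `evaluate_snapshot` is a hypothetical stand-in for one simulator run (it only draws a noisy number), and reporting mean and sample standard deviation is an assumption about the summary statistics.

```python
import random
import statistics


def evaluate_snapshot(seed):
    """Hypothetical stand-in for one simulator run; returns a profit-like KPI."""
    rng = random.Random(seed)  # seeding re-samples the stochastic realization
    return 100.0 + rng.gauss(0.0, 5.0)


def aggregate_over_seeds(run_fn, seeds):
    """Run once per seed and return (mean, sample standard deviation)."""
    values = [run_fn(seed) for seed in seeds]
    return statistics.mean(values), statistics.stdev(values)


mean_kpi, std_kpi = aggregate_over_seeds(evaluate_snapshot, range(10))
```

Because each run is driven only by its seed, repeating the aggregation with the same seed list reproduces the same summary.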
4. Environment and State Modeling
The environment (CRANMORLEnv) reproduces the key dynamics of a 5G C-RAN, including user distribution, wireless channel conditions, interference, and traffic load. The action space is defined as:
$$a = \left(\mathrm{split},\ \mathrm{tdd},\ \mathrm{sleep}_k,\ \mathrm{mec}_{\mathrm{frac}}\right) \tag{5}$$
where each parameter represents:
$\mathrm{split}$: Functional split configuration between DU and RU.
$\mathrm{tdd}$: Time Division Duplex (TDD) ratio for uplink/downlink balance.
$\mathrm{sleep}_k$: Number of RUs in sleep mode to save energy.
$\mathrm{mec}_{\mathrm{frac}}$: Fraction of computation offloaded to MEC.
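The joint action can be sketched as a small typed tuple enumerated over a discrete grid. The split options {2, 7.2} and sleep counts {0, 1, 2, 3} follow the values stated in the DDPG description; the TDD and MEC levels are assumed discretizations, and the `Action` type is hypothetical rather than part of CRANMORLEnv.

```python
from itertools import product
from typing import NamedTuple


class Action(NamedTuple):
    split: float     # functional split option, e.g. 2 or 7.2
    tdd: float       # downlink share of the TDD frame
    sleep_k: int     # number of RUs put to sleep
    mec_frac: float  # fraction of computation offloaded to MEC


SPLITS = (2, 7.2)
SLEEP_COUNTS = (0, 1, 2, 3)
TDD_LEVELS = (0.5, 0.6, 0.7, 0.8)       # assumed discretization
MEC_LEVELS = (0.0, 0.2, 0.4, 0.6, 0.8)  # assumed discretization


def enumerate_actions():
    """All joint configurations as Action tuples."""
    grid = product(SPLITS, TDD_LEVELS, SLEEP_COUNTS, MEC_LEVELS)
    return [Action(*combo) for combo in grid]
```

With these assumed levels the grid contains 2 x 4 x 4 x 5 = 160 joint configurations.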
RU sleep is modeled as a per-epoch on/off decision without explicit state-transition costs (e.g., wake-up delay, switching energy, or minimum on/off durations). Therefore, the reported energy savings should be interpreted as an upper-bound under idealized RU state transition; incorporating transition overheads is left for future work.
Configuration selection is evaluated as periodic snapshot-based reconfiguration; therefore, the learning baselines are trained in a single-step setting where each action is scored by the immediate techno-economic return under stochastic traffic/channel realizations.
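The goal is to maximize a composite reward function that reflects network profit while minimizing latency, energy, and fronthaul costs. The multi-objective formulation is given by
$$\max_{a}\; R(a) = w_1\,\mathrm{Profit}(a) - w_2\,L_{p95}(a) - w_3\,C_{\mathrm{energy}}(a) - w_4\,C_{\mathrm{fh}}(a) \tag{6}$$
where $w_1, \dots, w_4 \geq 0$ are the objective weights.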
The objective is to increase operator profit while reducing tail latency and operating costs. To maintain a consistent maximization direction, profit is represented with a positive contribution, whereas latency and cost terms are included with negative signs so that larger values correspond to better performance.
Table 3 summarizes the reinforcement learning environment, including the state representation, joint control actions, reward formulation with SLA penalties, and the main training parameters used for all evaluated methods, following standard reinforcement learning formulations and MEC optimization practices [31,32].
5. Methodology
To enable a rigorous and unbiased evaluation of diverse control techniques, all considered methods are realized within a common decision and simulation framework. While the approaches differ in learning mechanisms, action formulations, and optimization goals, they interact with the same 5G C-RAN with MEC environment, rely on identical system observations, and are assessed using uniform techno-economic and service-level performance indicators.
Figure 2 summarizes the end-to-end control workflows of the baseline, MORL, PCHAC, DQN, PPO, and DDPG approaches. The figure outlines their respective action spaces, objective structures, SLA-enforcement mechanisms, training processes, and execution logic. By presenting all methods within a unified workflow, the figure highlights their methodological distinctions and ensures that the comparative results reported in Section 6 stem from algorithmic design choices rather than differences in implementation.
5.1. Baseline Model
The baseline policy serves as a non-adaptive benchmark for all adaptive approaches. It applies fixed configuration parameters—such as TDD ratio and MEC offload level—based on predefined averages. This model has minimal computational cost but cannot react to varying network load or channel fluctuations. It provides a stable reference for quantifying performance improvements achieved by AI-driven control strategies.
The baseline configuration fixes the functional split, the TDD ratio, and the offloading fraction, and keeps all RUs active (no sleep) throughout the simulation. All other simulation parameters are held constant across methods and follow Table 1 and Table 2, including the network snapshot settings (e.g., RU/UE counts, carrier frequency, bandwidth, antenna configuration, transmit power, noise figure, and propagation assumptions) and the computation/transport parameters (e.g., task size and CPU-cycle requirements, RU/DU/CU capacities, and fixed MEC/core latency terms).
5.2. Deep Q-Network (DQN)
The DQN algorithm, a key advancement in deep reinforcement learning, has shown remarkable capability for autonomous decision-making and control in next-generation C-RANs and Open RAN (O-RAN) systems. In these architectures, baseband processing is centralized in BBU pools, while Remote Radio Heads (RRHs) or radio frequency frontends are geographically distributed. This configuration introduces multiple optimization challenges, including dynamic traffic variability, interference management, energy efficiency enhancement, and stringent latency requirements.
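The DQN framework estimates the optimal state–action value function using deep neural networks, expressed as:
$$Q(s_t, a_t; \theta) \approx r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) \tag{7}$$
where $\theta$ and $\theta^{-}$ represent the parameters of the online and target networks, respectively, and $\gamma$ is the discount factor.
At each learning step, the agent observes the current network state $s_t$, selects an action $a_t$ based on an $\varepsilon$-greedy exploration strategy, and receives a reward determined by:
$$r_t = w_1\,\mathrm{Profit}_t - w_2\,L_{p95,t} - w_3\,C_{\mathrm{energy},t} - w_4\,C_{\mathrm{fh},t} - \Phi_{\mathrm{SLA},t} \tag{8}$$
where $w_1, \dots, w_4$ are the objective weights and $\Phi_{\mathrm{SLA},t}$ is the SLA penalty term.
The model's parameters are optimized by minimizing the temporal-difference loss function:
$$\mathcal{L}(\theta) = \mathbb{E}\!\left[\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta)\right)^{2}\right] \tag{9}$$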
Through this iterative process, the DQN agent learns an adaptive policy that maps real-time network observations, such as channel state information, traffic demand, and resource utilization, to discrete configuration actions, including RRH activation/deactivation, functional split selection, bandwidth allocation, and computation offloading. In the snapshot reconfiguration setting, the objective is to learn a state-to-configuration mapping that maximizes the expected immediate return under stochastic traffic and channel realizations, consistent with the techno-economic and SLA-aware performance targets considered here [11,12,33,34].
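The ε-greedy selection and temporal-difference update described above can be sketched without a neural network by operating on plain lists of Q-values. This is a generic illustration of the standard DQN quantities, not the study's implementation; all names are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_target, gamma):
    """Bootstrapped target: r + gamma * max_a' Q_target(s', a')."""
    return reward + gamma * max(next_q_target)


def td_loss(q_online, target):
    """Squared temporal-difference error for one transition."""
    return (target - q_online) ** 2
```

In the snapshot setting the discount factor is zero, so `td_target` reduces to the immediate reward and the bootstrap term drops out.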
Stability and convergence in high-dimensional state spaces are commonly enhanced through techniques such as experience replay, target network updates, and dueling architectures [33]. However, practical deployment still requires the precise design of state and reward models, the use of safe exploration methods to avoid service disruption, and low-latency inference mechanisms that satisfy the stringent timing constraints of 5G and future wireless systems [34,35].
5.3. Proximal Policy Optimization (PPO)
Cloud Radio Access Network (C-RAN) has become a pivotal architecture in fifth-generation (5G) mobile communication systems, where BBUs are centralized within cloud infrastructures to enable the joint optimization of computational and radio resources. This architectural shift enhances spectral efficiency while simultaneously reducing capital and operational expenditures [36]. However, C-RAN deployment presents several challenges, including dynamic traffic variation, limited fronthaul capacity, inter-cell interference, and stringent latency constraints, particularly in URLLC scenarios [2]. Conventional optimization approaches often fail to adapt effectively to such non-stationary environments, leading to the growing adoption of machine learning, and specifically reinforcement learning (RL), for C-RAN control and resource management [37].
Among RL techniques, PPO has attracted considerable attention as a policy-gradient method capable of achieving stable and efficient learning through its clipped surrogate objective, which mitigates the risk of unstable or overly aggressive policy updates that could degrade network performance [38]. The PPO agent directly models the policy distribution over actions instead of estimating Q-values, optimizing the following clipped objective:
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right] \tag{10}$$
where $r_t(\theta) = \pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ denotes the probability ratio between the new and old policies, $\epsilon$ is the clipping range, and $\hat{A}_t$ represents the advantage estimate derived from the value function. PPO jointly trains both policy and value networks using a multi-objective reward structure consistent with that employed in DQN, ensuring effective optimization across competing performance metrics. Empirical studies have demonstrated that PPO can be successfully applied to several critical C-RAN management functions, including dynamic spectrum allocation, energy-efficient power control, and network slice orchestration, yielding measurable gains in system throughput, energy efficiency, and SLA compliance [7,8,9,10]. Owing to its model-free nature, PPO operates efficiently in stochastic and partially observable environments without requiring explicit channel modeling. Furthermore, its integration with deep neural networks facilitates scalability across the large state-action spaces typical of dense 5G network deployments [39]. Collectively, these attributes position PPO as a promising reinforcement learning framework for real-time, data-driven decision-making in C-RAN systems, offering a stable foundation for future multi-objective and hierarchical control strategies designed to meet the stringent QoS requirements of next-generation wireless networks.
In practice, PPO performance can be sensitive to the clipping range and reward/penalty scaling, particularly in highly stochastic environments.
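The effect of the clipping range can be seen in a per-sample sketch of the clipped surrogate. The default `eps=0.2` is a commonly used value and an assumption here, not a setting reported in this study.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + eps) * A ...
capped_gain = clipped_surrogate(1.5, 2.0)
# ... while a large ratio with a negative advantage is not softened.
full_penalty = clipped_surrogate(1.5, -1.0)
```

The asymmetry is the point of the design: the objective removes the incentive to push the policy further once an improvement exceeds the clipping range, but it keeps the full penalty when an update makes things worse.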
5.4. Deep Deterministic Policy Gradient (DDPG)
DDPG generalizes actor–critic methods to continuous control by coupling a deterministic actor that maps states to actions with a critic that scores state–action returns, enabling precise policies beyond purely discrete choices [40,41]. In hybrid settings that combine continuous and discrete decisions, we let DDPG govern the continuous variables and resolve the discrete ones via a compact wrapper search. Specifically, in our 5G C-RAN system with MEC offloading, the actor sets the downlink TDD ratio and MEC offload fraction, while a lightweight grid enumerates the functional split {2, 7.2} and RU sleep counts k ∈ {0, 1, 2, 3}, snapping the actor's proposals to the simulator's nearest admissible levels and selecting the option with the highest shaped return [42]. The reward aggregates four operator objectives (profit, negative p95 latency, negative energy cost, and negative fronthaul cost) via linear weights, and applies normalized penalties when service targets are violated (e.g., p95 latency > 40 ms or fronthaul cost > $0.80/h), steering learning toward utility while honoring service constraints [43].
In this study, configuration selection is modeled as a periodic system-level reconfiguration over quasi-static intervals (snapshot-based evaluation), so each decision is assessed on the current state without modeling multi-step transition overheads. Accordingly, DDPG is evaluated in a single-step setting (γ = 0), where the critic fits immediate returns for the selected configuration and the actor updates via the deterministic policy gradient. This design matches the configuration formulation used in the simulator and enables a consistent comparison across methods. While this formulation is appropriate for periodic configuration updates, it does not capture transition costs (e.g., RU wake-up latency, split switching overhead) or queue evolution across multiple decision steps; incorporating these effects requires a multi-step MDP with γ > 0.
Training therefore occurs in a contextual-bandit regime: the critic minimizes mean-squared error to the immediate reward on concatenated state–action inputs, and the actor follows the deterministic policy gradient by back-propagating through the critic; exploration is injected as zero-mean Gaussian noise on the actor's pre-tanh outputs, and a replay buffer (capacity 2000, minibatch 32) improves stability [44]. Concretely, the observation stacks normalized per-RU loads, normalized per-RU fronthaul utilization, and the fraction of attached UEs; the actor is a two-layer ReLU MLP with a tanh head whose outputs are clipped to [−1, 1] and linearly mapped to TDD ∈ [0.5, 0.8] and MEC ∈ [0.0, 0.8]. Each episode consists of a single rollout, and training uses compact networks (width 64), learning rates of 10^−3, a noise schedule decaying from 0.3 to 0.05, and 240 optimization steps for main runs (80 for sweeps). All reported results are aggregated over 10 independent random-seed runs to account for stochastic channel conditions and RL training variability. This formulation captures the principal C-RAN trade-offs (split 7.2's higher fixed fronthaul versus split 2's traffic-proportional load, the TDD ratio's impact on downlink capacity, and MEC's latency/backhaul relief versus local processing), making it well suited for periodic system-level reconfiguration rather than mobility control.
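The hybrid action handling can be sketched as follows: the actor's tanh outputs are clipped, mapped linearly to the stated ranges, snapped to a grid, and then combined with an enumeration of the discrete choices. The ranges, split options, and sleep counts come from the description above; the grid levels, the function names, and the toy `score_fn` are assumptions for illustration.

```python
from itertools import product

SPLITS = (2, 7.2)
SLEEP_COUNTS = (0, 1, 2, 3)
TDD_LEVELS = (0.5, 0.6, 0.7, 0.8)       # assumed admissible levels
MEC_LEVELS = (0.0, 0.2, 0.4, 0.6, 0.8)  # assumed admissible levels


def scale(x, low, high):
    """Clip x to [-1, 1] and map it linearly onto [low, high]."""
    x = max(-1.0, min(1.0, x))
    return low + (x + 1.0) / 2.0 * (high - low)


def snap(value, levels):
    """Nearest admissible level."""
    return min(levels, key=lambda level: abs(level - value))


def resolve_action(actor_out, score_fn):
    """Snap the continuous proposals, then search the discrete options."""
    tdd = snap(scale(actor_out[0], 0.5, 0.8), TDD_LEVELS)
    mec = snap(scale(actor_out[1], 0.0, 0.8), MEC_LEVELS)
    candidates = ((s, tdd, k, mec) for s, k in product(SPLITS, SLEEP_COUNTS))
    return max(candidates, key=score_fn)
```

The wrapper evaluates only eight discrete combinations per decision, so the continuous actor carries the fine-grained choices while the search stays cheap.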
5.5. Multi-Objective Reinforcement Learning (MORL)
MORL extends conventional reinforcement learning by jointly optimizing multiple conflicting performance metrics [19,20,22]. In the context of 5G and beyond networks, MORL enables intelligent decision-making for resource allocation, task offloading, and energy management, where objectives such as throughput, latency, energy efficiency, and cost must be balanced dynamically [19,20,21].
The MORL framework explores trade-offs among these objectives through stochastic preference sampling [45]. Weight vectors $\mathbf{w}$ are drawn from a Dirichlet distribution, and for each sample, the corresponding scalarized reward is computed as:
$$r_{\mathbf{w}}(a) = \mathbf{w}^{\top}\mathbf{f}(a) = \sum_{j} w_j f_j(a) \tag{11}$$
where $\mathbf{f}(a)$ represents the vector of objective functions [46]. The agent evaluates available actions and selects the one maximizing the scalarized reward [47]. Repeating this process across multiple weight vectors yields an approximate Pareto frontier, representing the set of non-dominated policies that achieve different trade-offs among objectives.
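Preference sampling and the resulting frontier can be sketched with the standard library alone. The Dirichlet draw uses normalized gamma variates; the symmetric concentration of 1.0, the candidate objective vectors, and all function names are assumptions, and objectives are taken as already normalized so that larger is better.

```python
import random


def sample_dirichlet(alpha, rng):
    """Dirichlet sample via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(weights, objectives):
    """Weighted sum w . f(a)."""
    return sum(w * f for w, f in zip(weights, objectives))


def pareto_front(points):
    """Points not dominated by any other point (all objectives maximized)."""
    def dominates(p, q):
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]


def morl_select(candidates, n_samples, rng):
    """Actions that maximize the scalarized reward for some sampled weight."""
    dim = len(next(iter(candidates.values())))
    winners = set()
    for _ in range(n_samples):
        w = sample_dirichlet([1.0] * dim, rng)
        winners.add(max(candidates, key=lambda a: scalarize(w, candidates[a])))
    return winners
```

One known property of linear scalarization is worth keeping in mind: a non-dominated point that lies in a concave region of the frontier is never selected by any weight vector, so the sampled set approximates only the convex part of the Pareto frontier.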
A closed-loop MORL pipeline integrates data collection, policy learning, and runtime adaptation. Telemetry from the RAN, core, and MEC layers defines a multi-objective optimization space consistent with SLAs. The MORL agent refines operational decisions such as power control and task offloading while maintaining a Pareto archive for context-aware policy selection [19,20,47]. Continuous monitoring ensures SLA compliance and adaptive balancing of performance and energy consumption across network functions [15].
MORL vs. Penalty-Constrained Control
MORL addresses a multi-objective optimization problem in which several competing criteria are optimized jointly. In this work, MORL treats the objective vector (e.g., profit/utility, tail latency, energy, and fronthaul cost) as a set of objectives and learns policies that balance them via preference sampling, using a scalarization of the form $\mathbf{w}^{\top}\mathbf{f}(a)$ for different weight vectors $\mathbf{w}$.
By contrast, penalty-constrained formulations emphasize constraint satisfaction: a primary objective is optimized while enforcing SLA limits (e.g., tail-latency and fronthaul constraints). Here, PCHAC is a deterministic SLA-aware controller that selects actions using the penalty-augmented scoring rule and selection criterion defined in Equations (12)–(14). Accordingly, MORL is used to learn trade-off policies across objectives, whereas PCHAC enforces SLA-aware feasibility through explicit constraint handling and serves as a stable, interpretable benchmark/controller.
5.6. Penalty-Constrained Hierarchical Action Controller (PCHAC): Deterministic SLA-Aware Controller
PCHAC is a deterministic, constraint-aware controller for SLA-feasible configuration selection. It is included as a strong, interpretable benchmark (and supervisory decision rule) alongside adaptive controllers.
Figure 3 illustrates the PCHAC workflow for SLA-aware C-RAN/MEC configuration selection with KPI/reward feedback.
Unlike gradient-based reinforcement learning approaches such as DQN, DDPG, or PPO, PCHAC avoids stochastic exploration and neural policy approximation. Instead, it deterministically evaluates candidate configurations using an SLA-aware scoring rule with explicit constraint handling and selects the action that maximizes a preference-weighted score while enforcing SLA feasibility. This ensures stability, interpretability, and transparent decision-making.
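PCHAC follows a multi-objective, constraint-aware decision rule for deterministic configuration selection. It employs weighted scalarization to merge multiple performance objectives, such as profit, latency, energy efficiency, and fronthaul utilization, into a single scalar score. To enforce feasibility, hinge-based penalties are integrated into the objective function:
$$S(a) = \mathbf{w}^{\top}\mathbf{f}(a) - \beta_{L}\,\phi_{L}(a) - \beta_{\mathrm{fh}}\,\phi_{\mathrm{fh}}(a) \tag{12}$$
where $\mathbf{w}$ is the normalized weight vector, $\mathbf{f}(a)$ represents objective values, $\beta_{L}, \beta_{\mathrm{fh}} \geq 0$ are penalty coefficients, and the penalty terms are defined as:
$$\phi_{L}(a) = \max\!\left(0,\ \frac{L_{p95}(a) - L_{\max}}{L_{\max}}\right), \qquad \phi_{\mathrm{fh}}(a) = \max\!\left(0,\ \frac{C_{\mathrm{fh}}(a) - C_{\max}}{C_{\max}}\right) \tag{13}$$
with $L_{\max}$ and $C_{\max}$ denoting the SLA limits on tail latency and fronthaul cost.
The optimal configuration is selected deterministically as:
$$a^{*} = \arg\max_{a \in \mathcal{A}} S(a) \tag{14}$$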
This constitutes a configuration-control problem over a discrete action space, where the controller maps observed network state to an SLA-feasible configuration.
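A minimal sketch of penalty-augmented scoring with deterministic selection is shown below. It assumes a hinge penalty normalized by the SLA limit, a single shared penalty coefficient, and KPIs that are already on comparable scales; the limits of 40 ms and 0.80 $/h are borrowed from the example thresholds quoted in the DDPG description, and every name and candidate value is hypothetical.

```python
def hinge(value, limit):
    """Normalized hinge penalty: zero at or below the limit, linear above it."""
    return max(0.0, (value - limit) / limit)


def pchac_score(kpis, weights, lat_limit_ms=40.0, fh_limit=0.80,
                penalty_weight=10.0):
    """Preference-weighted score minus SLA penalties (larger is better)."""
    objectives = (kpis["profit"], -kpis["p95_latency_ms"],
                  -kpis["energy_cost"], -kpis["fh_cost"])
    base = sum(w * f for w, f in zip(weights, objectives))
    penalty = (hinge(kpis["p95_latency_ms"], lat_limit_ms)
               + hinge(kpis["fh_cost"], fh_limit))
    return base - penalty_weight * penalty


def pchac_select(candidates, weights):
    """Deterministic argmax over the candidate configurations."""
    return max(candidates, key=lambda name: pchac_score(candidates[name], weights))


CANDIDATES = {
    "A": {"profit": 1.5, "p95_latency_ms": 50.0, "energy_cost": 0.2, "fh_cost": 0.5},
    "B": {"profit": 0.8, "p95_latency_ms": 30.0, "energy_cost": 0.2, "fh_cost": 0.5},
}
WEIGHTS = (1.0, 0.01, 1.0, 1.0)
best = pchac_select(CANDIDATES, WEIGHTS)
```

Candidate A has the higher unpenalized score but exceeds the latency limit, so the penalty moves the selection to B. This is the intended behavior of a soft constraint; a hard feasibility filter would be an alternative design.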
Although PCHAC evaluates a discrete set of configurations, its contribution is the SLA-aware selection rule: objective terms are normalized to comparable scales and combined with operator-defined preferences, while constraint satisfaction is enforced through explicit penalty/feasibility handling. The enumeration over the discrete action set is only an implementation mechanism for configuration selection.
PCHAC evaluates a predefined discrete configuration set (functional split, RRH activation count k, TDD ratio, and MEC offload fraction) and selects the configuration that maximizes the SLA-aware score. While this exhaustive evaluation guarantees deterministic convergence and strict SLA compliance, scalability is limited in high-dimensional or continuous spaces: the per-epoch runtime scales linearly with the number of candidate configurations evaluated, since each candidate is scored once under the same KPI model. Because RU transitions are modeled without wake-up penalties, PCHAC’s energy gains should be interpreted under idealized switching assumptions.
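The enumeration-and-scoring procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the knob levels, the weights, the SLA limits, the penalty coefficient, and the `toy_kpis` model are all hypothetical placeholders standing in for the simulator's KPI computation.

```python
from itertools import product

# Hypothetical knob levels; the study's actual discretization is not reproduced here.
SPLITS = ("split_A", "split_B")
RRH_COUNTS = (4, 6, 8, 10)
TDD_RATIOS = (0.5, 0.6, 0.7)
OFFLOAD_FRACTIONS = (0.0, 0.5, 1.0)

# Assumed operator preferences (normalized to sum to 1) and assumed SLA limits.
WEIGHTS = {"profit": 0.4, "latency": 0.3, "energy": 0.2, "fronthaul": 0.1}
SLA_LATENCY_MS = 30.0
SLA_FH_UTILIZATION = 0.8
PENALTY_COEFF = 10.0


def toy_kpis(split, k, tdd, offload):
    """Stand-in KPI model (assumption): objectives scaled to roughly [0, 1]."""
    latency_ms = 50.0 - 2.0 * k - 8.0 * offload + (4.0 if split == "split_B" else 0.0)
    fh_util = 0.09 * k * (1.0 if split == "split_A" else 0.6)
    return {
        "profit": min(1.0, 0.08 * k * (0.5 + tdd)),
        "latency": latency_ms / 50.0,
        "energy": k / 10.0,
        "fronthaul": fh_util,
        "latency_ms": latency_ms,
        "fh_util": fh_util,
    }


def hinge(value, limit):
    """Zero while the constraint holds, linear in the violation otherwise."""
    return max(0.0, value - limit)


def score(config):
    """Preference-weighted utility minus hinge penalties for SLA violations."""
    kpis = toy_kpis(*config)
    utility = (
        WEIGHTS["profit"] * kpis["profit"]
        - WEIGHTS["latency"] * kpis["latency"]
        - WEIGHTS["energy"] * kpis["energy"]
        - WEIGHTS["fronthaul"] * kpis["fronthaul"]
    )
    penalty = PENALTY_COEFF * (
        hinge(kpis["latency_ms"], SLA_LATENCY_MS) / SLA_LATENCY_MS
        + hinge(kpis["fh_util"], SLA_FH_UTILIZATION)
    )
    return utility - penalty


def select_configuration():
    """Score every candidate once and return the arg-max (deterministic)."""
    candidates = list(product(SPLITS, RRH_COUNTS, TDD_RATIOS, OFFLOAD_FRACTIONS))
    return max(candidates, key=score)


best = select_configuration()
print(best, round(score(best), 4))
```

Because `max` scans the candidates in a fixed order and no random numbers are involved, repeated calls return the same configuration, which is the property the text refers to as deterministic selection.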
The computational cost of PCHAC is dominated by scoring the evaluated configuration set. Let \(|\mathcal{A}|\) denote the number of admissible discrete configurations; the runtime per decision epoch is then proportional to \(|\mathcal{A}|\) times the cost of computing the KPIs for one candidate configuration. In the present study, the action set is intentionally compact (functional split, RRH activation count \(k\), discretized TDD ratio, and discretized MEC offloading fraction), yielding a manageable \(|\mathcal{A}|\). If the resolution of continuous knobs is refined (more TDD/offloading levels) or the decision space is expanded (e.g., per-RU actions or larger RU clusters), then \(|\mathcal{A}|\) grows multiplicatively with the number of discrete levels per control dimension and may become prohibitive. In such settings, PCHAC can be used as a strong constraint-aware benchmark, while learning-based controllers provide a scalable alternative for large or continuous action spaces.
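The multiplicative growth mentioned above is easy to see numerically. The level counts and the per-candidate KPI cost in the following sketch are hypothetical; only the count of 10 RUs follows the RU-0 to RU-9 indexing used in the result figures.

```python
import math


def action_set_size(levels_per_dimension):
    """Number of joint configurations for independent discrete knobs."""
    return math.prod(levels_per_dimension)


def epoch_runtime_ms(levels_per_dimension, kpi_cost_ms):
    """One KPI evaluation per candidate, so runtime is linear in the set size."""
    return action_set_size(levels_per_dimension) * kpi_cost_ms


# Hypothetical level counts: (functional split, RRH count k, TDD ratio, MEC offload).
compact = [2, 4, 3, 3]
refined = [2, 4, 11, 11]       # finer TDD and offloading grids
per_ru = [2, 2 ** 10, 3, 3]    # independent on/off choice for each of 10 RUs

for name, levels in [("compact", compact), ("refined", refined), ("per-RU", per_ru)]:
    # kpi_cost_ms is an assumed cost of scoring one candidate.
    print(name, action_set_size(levels), epoch_runtime_ms(levels, kpi_cost_ms=0.05))
```

Refining two knobs from 3 to 11 levels raises the candidate count from 72 to 968, and replacing the activation count with a per-RU on/off pattern raises it to 18,432, which illustrates why exhaustive scoring is practical only for compact action sets.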
Overall, PCHAC provides a MORL-inspired, constraint-regularized decision model that unifies preference weighting, penalty-based feasibility, and deterministic selection—making it a reliable baseline or supervisory control layer for adaptive reinforcement learning systems.
5.7. Comparative Decision Workflow
All evaluated methods follow the same end-to-end decision loop to ensure a fair and consistent comparison: the simulator initializes the 5G C-RAN + MEC environment, extracts the current observation, selects a control action, applies the selected configuration to the network, computes the resulting KPIs, and then updates the corresponding optimization mechanism (if applicable).
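The shared loop can be sketched as a small controller interface. Everything in this example is an assumption made for illustration: the `select`/`update` method names, the `ToyEnvironment` class, and its latency and energy expressions are placeholders and do not reproduce the study's simulator.

```python
import random


class FixedBaseline:
    """Static reference: always returns the same configuration and never learns."""

    def __init__(self, config):
        self.config = config

    def select(self, observation):
        return self.config

    def update(self, observation, action, kpis):
        pass  # no optimization mechanism to update


class ToyEnvironment:
    """Placeholder for the C-RAN + MEC simulator (assumption, not the study's model)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.load = None

    def observe(self):
        self.load = self.rng.uniform(0.2, 1.0)  # stochastic traffic load
        return {"load": self.load}

    def apply(self, action):
        active_rus = action["active_rus"]
        latency_ms = 20.0 + 120.0 * self.load / active_rus
        return {"latency_ms": latency_ms, "energy": 0.2 * active_rus}


def run(controller, env, epochs):
    """Shared loop: observe, select, apply, compute KPIs, update."""
    history = []
    for _ in range(epochs):
        observation = env.observe()
        action = controller.select(observation)
        kpis = env.apply(action)
        controller.update(observation, action, kpis)
        history.append(kpis)
    return history


history = run(FixedBaseline({"active_rus": 8}), ToyEnvironment(seed=0), epochs=5)
print(history[0])
```

A learning agent would plug into the same `run` function by implementing `select` and `update`, so every method sees identical observations for a given seed.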
Within this shared workflow, the baseline executes a fixed configuration and serves as a static performance reference. DQN and PPO operate as single-objective deep reinforcement learning agents, using value-based and policy-gradient learning, respectively, to improve decisions over repeated interactions.
In contrast, DDPG extends the same pipeline to a continuous-control setting through an actor–critic structure, enabling fine-grained adjustment of continuous decision variables while maintaining compatibility with discrete configuration constraints through bounded mapping.
For multi-objective decision-making, MORL explicitly explores trade-offs across competing KPIs by sampling preference weights and selecting non-dominated policies, whereas PCHAC (deterministic SLA-aware controller) applies SLA-aware scoring to select feasible configurations without stochastic exploration. Overall, this unified evaluation workflow enables systematic benchmarking across heterogeneous strategies by comparing adaptability, computational cost, and optimization quality under identical network and traffic conditions.
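The two MORL ingredients named above, preference sampling and non-dominated selection, can be illustrated with a short sketch. It assumes all objectives are oriented so that larger is better, and the candidate values are made up; it is not the MORL procedure used in the study.

```python
import random


def sample_weights(rng, n_objectives):
    """Draw a preference vector on the simplex (non-negative, sums to 1)."""
    draws = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def best_per_preference(points, n_samples, seed=0):
    """For each sampled preference vector, pick the point with the best weighted sum."""
    rng = random.Random(seed)
    chosen = set()
    for _ in range(n_samples):
        w = sample_weights(rng, len(points[0]))
        chosen.add(max(points, key=lambda p: sum(wi * pi for wi, pi in zip(w, p))))
    return chosen


# Illustrative (profit, negated latency) pairs; the values are made up.
candidates = [(1.0, -20.0), (2.0, -25.0), (3.0, -40.0), (2.0, -45.0), (0.5, -30.0)]
front = non_dominated(candidates)
picked = best_per_preference(candidates, n_samples=50)
print(front, picked)
```

With strictly positive weights, the maximizer of a weighted sum is always non-dominated, so every configuration in `picked` also appears in `front`.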
Although the evaluation is snapshot-based, learning-based controllers remain useful because the mapping from observed network state to the best configuration is high-dimensional and non-linear under stochastic channel/traffic variation. The RL baselines are therefore used as generic policy-learning function approximators that can learn this mapping from interaction data, providing a consistent comparison against MORL and the deterministic SLA-aware benchmark (PCHAC).
5.8. Computational Complexity
Let \(N_{UE}\) and \(N_{RU}\) denote the number of users and radio units, respectively. A single simulation rollout includes UE association, SINR estimation, link adaptation, and latency evaluation, so the dominant cost per rollout grows with both \(N_{UE}\) and \(N_{RU}\). The baseline method performs one rollout only, so its cost is that of a single rollout.
Both MORL and PCHAC evaluate multiple candidate configurations from a discrete action set \(\mathcal{A}\). MORL samples \(S\) preference vectors and selects the best solution across the evaluated set, yielding a cost that depends on \(S\), on \(|\mathcal{A}|\), and on the per-rollout cost. As network density increases (larger \(N_{UE}\) and \(N_{RU}\)) or as the discrete action set is expanded (larger \(|\mathcal{A}|\)), the runtime increases proportionally through the per-rollout cost and \(|\mathcal{A}|\), respectively. PCHAC applies preference-based scoring with constraint penalties over the candidate actions, leading to a worst-case complexity that is linear in \(|\mathcal{A}|\) for a fixed per-rollout cost.
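For learning-based schemes, DQN performs iterative rollouts with replay-based neural updates. Let \(T_{DQN}\) be the number of training steps, \(B\) the batch size, and \(H\) the hidden dimension. The complexity then combines the rollouts performed over the \(T_{DQN}\) steps with the neural updates, whose cost depends on \(B\) and \(H\).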
PPO collects on-policy trajectories and updates actor–critic networks over multiple epochs. With \(E\) episodes per batch and \(U\) update epochs, the complexity grows with both the number of collected episodes \(E\) and the number of update epochs \(U\).
DDPG employs an actor–critic structure with continuous control outputs, combined with a lightweight wrapper search across discrete candidates. Let \(K_d\) be the number of discrete evaluations per iteration. The resulting complexity adds the cost of these \(K_d\) evaluations to the actor–critic updates in each iteration.
Overall, the baseline strategy has the lowest computational overhead, MORL and PCHAC scale mainly with the action-space size, while DQN, PPO, and DDPG incur additional cost due to iterative training and neural optimization. PPO typically exhibits the highest training burden because of on-policy sampling and repeated policy updates.
6. Results and Evaluation
All reported results are averaged over independent random seeds for each method. Each seed re-initializes the stochastic components of the simulation (e.g., channel and traffic realizations) as well as the learning process (e.g., network initialization and exploration). Figures and tables report the mean across seeds; uncertainty is shown as 95% confidence intervals (CI) computed from the standard deviation across seeds.
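A mean with a 95% confidence interval across seeds can be computed as in the sketch below. It assumes the common normal approximation, with half-width 1.96 times the sample standard deviation divided by the square root of the number of seeds; the exact expression used for the reported intervals is not reproduced here, and the per-seed values are made up.

```python
import math
import statistics


def mean_ci95(values, z=1.96):
    """Return (mean, half-width) using the normal approximation z * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width


# Made-up per-seed profits ($/h) for one method; not taken from the paper's tables.
per_seed_profit = [10.1, 10.5, 10.3, 10.6, 10.2, 10.4, 10.5, 10.3, 10.4, 10.5]
mean, half = mean_ci95(per_seed_profit)
print(f"{mean:.3f} ± {half:.3f} $/h")
```

With only 10 seeds, a Student-t critical value (about 2.262 for 9 degrees of freedom) would give a somewhat wider interval; it can be passed through the `z` argument.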
Figure 4 presents the spatial configuration of the baseline C-RAN scenario, depicting a 1 km × 1 km region populated with randomly positioned radio units (RUs), uniformly distributed user equipments (UEs), and a centrally located DU/CU node. The heterogeneous RU placement creates natural variations in coverage density, leading to unequal traffic loads and diverse interference conditions across the area. The central DU/CU location reduces average fronthaul distance, thereby limiting propagation and serialization delays, whereas RUs situated near the edges inevitably face higher fronthaul latency. The uniform UE distribution highlights regions of concentrated demand that may overload nearby RUs, producing fluctuations in SINR, throughput, and latency during operation. Overall, this layout provides the physical context underlying all subsequent performance evaluations, as the topology directly influences resource utilization, fronthaul usage, and the effectiveness of the optimization methods compared later in the study.
Figure 5 and Table 4 summarize the economic outcomes across methods in terms of profit, energy expenditure, fronthaul (FH) cost, and net utility (log-scaled to reflect the different value ranges). Across 10 independent random-seed runs, PCHAC achieves the highest mean profit (10.382 ± 0.143 $/h) and the highest mean net utility (8.458 ± 0.077 $/h), indicating the most favorable revenue–cost trade-off among the evaluated approaches. PCHAC also yields the lowest mean energy cost (1.813 ± 0.035 $/h), improving upon the baseline (2.002 ± 0.034 $/h) and the adaptive baselines (e.g., MORL 1.977 ± 0.031 $/h, DQN 1.989 ± 0.033 $/h). For FH cost, PCHAC attains the minimum level (0.012 ± 0.001 $/h), matching the baseline and remaining below MORL (0.022 ± 0.001 $/h), DQN (0.019 ± 0.001 $/h), PPO (0.021 ± 0.001 $/h), and DDPG (0.013 ± 0.001 $/h). Overall, these results show that PCHAC consistently provides superior mean economic performance with low variability across seeds.
Figure 6 and Table 5 report the end-to-end latency distribution using percentile statistics (p50, p75, p90, p95, and p99). Results are aggregated over 10 independent random-seed runs and presented as mean ± uncertainty, which captures variability due to channel/traffic randomness and RL training stochasticity. Across all reported percentiles, PCHAC maintains the lowest latency levels, indicating improved typical performance (median and upper-quartile) as well as stronger tail behavior (p95 and p99) compared with the baseline and the learning-based baselines. The smaller uncertainty ranges for PCHAC at high percentiles further suggest stable tail-latency control across independent runs, supporting the robustness of the observed latency improvements.
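For reference, percentile statistics of this kind can be computed from per-user latency samples as follows. The sketch uses the nearest-rank definition, which is an assumption; the estimator behind the reported tables is not stated here, and other estimators (for example, linear interpolation) give slightly different values on small samples. The latency values are made up.

```python
import math


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples, percentiles=(50, 75, 90, 95, 99)):
    """Map each requested percentile to its nearest-rank value."""
    return {f"p{p}": percentile_nearest_rank(samples, p) for p in percentiles}


# Made-up per-user latencies in ms.
latencies_ms = [18.2, 19.5, 21.0, 22.4, 23.1, 24.8, 26.0, 27.9, 31.5, 44.0]
print(latency_summary(latencies_ms))
```

In this toy sample a single slow user determines both p95 and p99, which is why tail percentiles are reported separately from the median.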
Figure 7 illustrates the empirical cumulative distribution functions (CDFs) of per-user latency for the six evaluated methods, where curves positioned further to the left indicate lower latency across a larger proportion of samples. PCHAC exhibits the most favorable latency distribution, with the majority of latency values concentrated between approximately 18 and 30 ms and a rapid rise in cumulative probability, indicating consistently low delay. MORL and DDPG form the next performance tier, showing slightly higher latency distributions, with most samples falling roughly within the 20–35 ms range. DQN is further shifted to the right, reflecting increased latency at comparable percentiles and a broader spread extending beyond 40 ms. The baseline method demonstrates a more pronounced right shift, with latency values extending toward approximately 50 ms, indicating slower service for a substantial fraction of users. PPO shows the least favorable latency behavior, with a wide distribution spanning from around 30 ms to over 60 ms and a slower increase in the CDF, revealing both higher typical latency and a heavier high-latency tail. Overall, the ordering of the CDFs confirms that PCHAC achieves the strongest latency performance, followed by MORL and DDPG, then DQN, with baseline and PPO exhibiting progressively higher latency distributions.
Figure 8 illustrates the distribution of channel quality indicator (CQI) levels across the evaluated methods using a UE count heatmap (top) and the corresponding normalized UE percentage curves (bottom). The baseline, MORL, PPO, and DDPG methods exhibit a strong concentration of users in low-to-medium CQI ranges, primarily between CQI 1 and 6, with clear peaks around CQI 2–3 where counts typically exceed 40 UEs. For these methods, the number of users decreases sharply beyond CQI 7, and higher CQI levels (CQI ≥ 13) contain few or no users. DQN follows a similar trend, showing a pronounced peak at CQI 2 (47 UEs) and a rapid decline at higher CQI values, with only limited presence beyond CQI 10. In contrast, PCHAC displays a distinctly right-shifted CQI distribution, allocating relatively few users to low CQI bins and progressively increasing the number of users in higher CQI levels. Specifically, PCHAC assigns substantially more users to CQI 10–15, with counts peaking between 22 and 24 UEs, resulting in a steadily increasing UE percentage profile across CQI levels. Overall, the combined heatmap and percentage curves indicate that PCHAC consistently maintains a larger fraction of users under favorable channel conditions, whereas the other methods predominantly operate in lower-CQI regimes.
Figure 9 depicts the per–radio unit (RU) fronthaul (FH) load for RU-0 through RU-9 under all evaluated methods, shown on a logarithmic scale with a dashed reference line indicating the 1 Gbps capacity limit. For every RU and method, the FH load remains well below the capacity threshold, confirming ample fronthaul headroom across the network. The FH demand varies noticeably across RUs, with higher loads observed at RU-2, RU-3, and RU-6, while RU-9 consistently exhibits the lowest FH consumption. Across methods, PCHAC generally achieves the lowest FH load on a per-RU basis, indicating more efficient fronthaul utilization. In contrast, DQN incurs the highest FH demand for most RUs, with pronounced peaks at RU-3 and RU-6 that approach the upper end of the observed range. The baseline, MORL, PPO, and DDPG methods fall between these extremes, displaying similar per-RU load patterns with moderate variations. Overall, the results highlight clear differences in fronthaul efficiency across control strategies while confirming that all methods operate comfortably within fronthaul capacity limits.
Figure 10 illustrates Jain’s fairness index measured per radio unit (RU-0 to RU-9) for all evaluated control methods. The results reveal noticeable variability in fairness across RUs, with comparatively low values at RU-0, RU-1, and RU-9, and substantially higher fairness observed at RU-8, indicating uneven user-level resource distribution across the network. PCHAC consistently achieves the highest or near-highest fairness across most RUs, showing clear improvements at RU-3, RU-4, RU-7, and RU-8. The baseline and MORL approaches generally yield moderate fairness levels and exhibit similar trends across RUs. DQN shows reduced fairness at several locations, most notably at RU-2 and RU-8, where its values fall well below those of the other methods. PPO and DDPG display intermediate to strong performance depending on the RU, with DDPG closely matching or slightly exceeding other methods at RU-4 and maintaining competitive fairness at RU-8. Overall, the figure highlights PCHAC’s superior ability to promote balanced resource allocation across radio units.
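Jain's fairness index has a standard closed form, the squared sum of the allocations divided by the number of users times the sum of squared allocations. The sketch below applies it to made-up per-user allocations within one RU; which per-user quantity the study feeds into the index (for example, throughput or resource blocks) is not specified here and is left as an assumption.

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2) for non-negative x."""
    n = len(allocations)
    total = sum(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero allocation")
    return total * total / (n * squares)


# Made-up per-user allocations for three hypothetical RUs.
print(jain_index([5.0, 5.0, 5.0, 5.0]))   # equal shares
print(jain_index([10.0, 0.0, 0.0, 0.0]))  # one user takes everything
print(jain_index([1.0, 2.0, 3.0]))        # moderately uneven shares
```

The index equals 1 when all users receive the same allocation and falls to 1/n when a single user receives everything, so values per RU are comparable even when RUs serve different numbers of users.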
Figure 11 illustrates the normalized multi-objective performance of the evaluated methods using a radar chart that captures profit, inverse P95 latency, inverse energy cost, inverse fronthaul load, and the number of connected users. PCHAC consistently achieves the highest normalized values across all five dimensions, forming the outermost and most balanced profile, which indicates superior joint optimization of economic performance, service latency, energy efficiency, fronthaul utilization, and user connectivity. MORL and DDPG demonstrate strong performance in latency and connectivity; however, both methods exhibit reduced scores in energy and fronthaul efficiency, resulting in less uniform coverage across objectives. DQN attains relatively high profit and latency performance but shows noticeably weaker fronthaul efficiency and connectivity compared to PCHAC. The baseline approach performs well in fronthaul efficiency but displays substantially lower normalized values for profit, energy efficiency, and user connectivity, limiting its overall multi-objective effectiveness. PPO presents a smaller and uneven radar profile, characterized by particularly low latency performance and moderate values in the remaining dimensions. Overall, the radar visualization highlights PCHAC as the most well-balanced and robust solution under multi-objective evaluation, outperforming alternative methods across all considered criteria.
Figure 12 presents the percentage of served users whose downlink SINR meets or exceeds three thresholds (0, 5, and 10 dB) for the baseline scheme and the five adaptive policies (four RL-based methods and the deterministic SLA-aware controller PCHAC). PCHAC delivers the strongest coverage at every threshold, achieving 95.5% for SINR ≥ 0 dB, 85.9% for SINR ≥ 5 dB, and 67.7% for SINR ≥ 10 dB, which indicates a markedly larger share of users operating under favorable radio conditions. By comparison, the other approaches show substantially lower coverage—particularly at the stricter 10 dB target—where the baseline and MORL remain around the low single digits, and the remaining DRL methods stay below roughly 12%. Among these DRL baselines, DQN is comparatively better at the 0–5 dB thresholds (e.g., 45.0% at 0 dB and 23.0% at 5 dB), whereas PPO and DDPG provide weaker performances at higher thresholds (about 5–9% at 10 dB). Overall, the figure demonstrates that PCHAC substantially increases the probability of users meeting moderate and high SINR requirements, supporting improved link robustness and interference management in the considered C-RAN scenario.
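The coverage metric in this figure is the share of served users whose SINR meets or exceeds each threshold. A minimal sketch of that computation, using made-up SINR samples rather than simulator output, is:

```python
def coverage_percent(sinr_db, thresholds_db=(0.0, 5.0, 10.0)):
    """Share of users (in %) whose SINR meets or exceeds each threshold."""
    if not sinr_db:
        raise ValueError("no SINR samples")
    return {
        t: 100.0 * sum(1 for s in sinr_db if s >= t) / len(sinr_db)
        for t in thresholds_db
    }


# Made-up downlink SINR values in dB for five served users.
sinr_samples_db = [-3.0, 0.0, 4.0, 5.0, 12.0]
print(coverage_percent(sinr_samples_db))
```

Because the comparison is inclusive, a user exactly at a threshold counts as covered, and the percentages can only decrease as the threshold rises.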
Figure 13 and Table 6 report profit per hour versus the number of active users for the six evaluated methods, with each point summarized over 10 independent random-seed runs and shown as the mean with uncertainty. Across all approaches, profit generally rises as user load increases, with several methods yielding negative profit at low load (50 users) before becoming positive as demand grows. PCHAC delivers the highest mean profit over the full load range, increasing from a small positive value at 50 users to the best performance at 300 users, and it maintains a clear advantage at the representative operating point of 200 users. MORL and the baseline also scale well at higher loads, reaching similarly high profit levels at 300 users, whereas DQN and DDPG exhibit weaker gains beyond 200 users and therefore remain below the top performers at heavy load. PPO shows less consistent behavior at the highest load, with profit increasing up to 250 users but dropping at 300 users, indicating reduced robustness under peak demand. Overall, the results indicate that PCHAC provides the most reliable and scalable profit performance as user demand increases.
Figure 14 and Table 7 present the p95 latency as a function of the number of active users for the six evaluated methods, with each point reported as the mean over 10 independent random-seed runs and accompanied by uncertainty to reflect variability from channel/traffic randomness and RL training. As expected, tail latency increases with user load for all approaches. PCHAC maintains the lowest mean p95 latency across the entire range, increasing from around 21 ms at 50 users to about 40 ms at 300 users. MORL and DDPG follow with slightly higher but closely tracking trends, while DQN exhibits consistently higher p95 latency than these methods. The baseline shows substantially larger tail latency, rising from the high-20 ms range at low load to above 50 ms at 300 users. PPO yields the highest p95 latency and degrades most sharply as load increases, exceeding 60 ms at 300 users. Overall, the method ranking remains stable across loads, indicating that PCHAC provides the most robust tail-latency control as demand scales.
PPO’s less favorable tail-latency behavior in this setting is consistent with the combination of (i) an on-policy update rule, which is typically more sample-demanding under stochastic channel/traffic variability, and (ii) the clipped surrogate objective, which can produce conservative policy updates when reward scales or advantage estimates are noisy. While PPO is widely effective in many networking tasks, its performance here is sensitive to hyperparameter choices (e.g., clipping range, learning rate, entropy regularization, batch size) and to reward scaling/penalty magnitudes. The reported PPO results therefore reflect a reasonable baseline configuration rather than an upper bound for PPO under exhaustive tuning, and further task-specific tuning may reduce the observed latency gap.
Figure 15 and Table 8 summarize energy cost versus the number of active users for the six evaluated methods, with each point reported as the mean over 10 independent random-seed runs and accompanied by uncertainty to capture variability across stochastic realizations. Energy cost increases with user load for all approaches, reflecting higher operational demand at larger scale. PCHAC consistently achieves the lowest mean energy cost across the full range, rising from about 1.35 $/h at 50 users to roughly 1.90 $/h at 300 users. The remaining methods follow similar upward trends but at higher cost levels; the baseline increases most sharply and reaches approximately 2.60 $/h at 300 users, while MORL, DQN, PPO, and DDPG remain between these extremes. Overall, the persistent separation between PCHAC and the other methods indicates superior energy efficiency as user demand scales.
Figure 16 and Table 9 report execution time versus the number of users for all evaluated methods, with each point summarized as the mean over 10 independent random-seed runs and accompanied by uncertainty. Execution time increases approximately linearly with user load for all approaches, reflecting the additional computation required at higher concurrency. The baseline remains the most efficient method, rising from about 2.0 ms at 50 users to roughly 3.0 ms at 300 users. MORL and PCHAC follow closely, with slightly higher but comparable execution times that reach around 3.1–3.2 ms at the highest load. DQN incurs additional overhead and grows to about 3.6 ms at 300 users, while DDPG increases further to approximately 4.0 ms. PPO consistently records the highest execution time, exceeding 4.3 ms at 300 users, indicating weaker scalability under heavy load. The higher runtime is consistent with PPO’s on-policy training/inference overhead and the additional policy/value-network computations per decision epoch under the adopted implementation. Overall, these results reveal clear differences in computational efficiency as user demand scales.
Sensitivity to MEC and Core Network Latency Assumptions
To assess whether the conclusions depend on the latency constants used in Table 2, a sensitivity analysis is conducted by sweeping the MEC and core-network latency values (in ms). For each setting, all methods are evaluated over independent random seeds and reported as mean ± 95% confidence intervals. The profit sensitivity results are summarized in Table 10, while the p95 latency sensitivity results are reported in Table 11. The results confirm that the relative performance trends remain stable across a practical range of MEC and core-latency assumptions.
7. Discussion
The presented results collectively confirm that joint control across the radio, fronthaul, and edge-compute dimensions is essential for techno-economic optimization in 5G C-RAN. Across the evaluated policies, the strongest methods are those that explicitly account for competing objectives and operational constraints rather than optimizing a single KPI in isolation. In particular, the multi-objective and constraint-aware decision mechanisms deliver more stable behavior under load growth, where interference, fronthaul utilization, and MEC/core latency components jointly amplify tail latency and degrade user experience.
From an economic perspective, the methods that better manage the profit–cost trade-off achieve higher net utility by simultaneously improving revenue-related terms (e.g., connected users and service success) while limiting operational expenses (energy and fronthaul costs). This behavior is consistent with the observation that fronthaul efficiency and RU-level load balancing reduce unnecessary transport overhead and prevent localized congestion. The fronthaul-load visualization further indicates that capacity headroom is preserved under all methods, but the relative reductions produced by the best controller remain important because fronthaul usage directly impacts both cost and downstream latency components.
Service-level behavior is best understood by combining the tail-latency trend plots with the latency CDF. Methods that shift the CDF left and steepen its rise effectively reduce not only average delay but also high-percentile outliers—an outcome that is particularly relevant for SLA-driven operation. Importantly, a controller may appear competitive at moderate user density while becoming unstable at higher load; therefore, the scalability curves provide more reliable evidence of robustness than single operating-point comparisons. In this context, the most consistent policy is the one that maintains a favorable ordering across user-load scaling, rather than one that excels at a single snapshot.
Radio-side indicators (coverage probability, CQI distribution, and per-RU fairness) provide additional diagnostic insight into why certain controllers outperform others. A higher share of users operating in stronger link states (higher CQI and improved coverage probability at practical SINR thresholds) typically correlates with better scheduling decisions, RU–UE association balance, and reduced interference exposure. Similarly, higher Jain fairness across RUs indicates that the controller avoids systematically starving certain cells or concentrating traffic in a subset of RUs, which otherwise increases tail latency and can indirectly increase energy and fronthaul cost due to repeated retransmissions or inefficient resource usage.
The comparative workflow and complexity analysis highlight an important practical trade-off: learning-based methods can adapt to changing conditions but may impose higher training or inference overhead, whereas deterministic constrained selection can be more interpretable and stable but may face scalability limits as the action space expands. These results suggest that a promising deployment direction is hybrid orchestration, where a constraint-aware supervisor ensures SLA feasibility while a learning agent refines continuous control variables (e.g., TDD ratio and MEC offloading fraction) within safe bounds.
Finally, several limitations remain relevant for scientific rigor and reproducibility. First, the absolute latency level depends strongly on the latency decomposition, processing assumptions, and fronthaul modeling; the results are therefore best read in terms of percentile trends, relative ordering, and scalability rather than point estimates alone. Second, RU sleep/activation decisions should be tied to explicit energy-state modeling (transition costs, wake-up delay, and minimum on/off durations) to avoid overly optimistic energy savings. Third, extending the environment to include more realistic traffic mixtures, burstiness, and mobility would strengthen the generality of the findings. These extensions are not expected to change the main conclusion: multi-objective and constraint-aware control provides a more reliable foundation for 5G C-RAN orchestration under heterogeneous demands.
8. Conclusions
This paper investigated techno-economic and SLA-aware orchestration for 5G Cloud-RAN integrated with Mobile Edge Computing, where effective operation requires coordinated decisions across radio configuration, fronthaul management, RU activation behavior, and computation offloading. A unified evaluation framework was developed to benchmark a static baseline against multiple AI-driven strategies spanning value-based learning, policy-gradient learning, continuous-control actor–critic optimization, and explicitly multi-objective/constraint-aware decision mechanisms. The results demonstrate that intelligent control can improve the overall operational trade-off by increasing economic return while simultaneously enhancing service performance indicators such as tail latency, user connectivity, and radio-side quality distributions.
The comparative analysis highlights that controllers designed to handle competing objectives and enforce feasibility constraints provide the most stable behavior under increasing user load, where tail performance and resource bottlenecks become dominant. In contrast, methods that do not explicitly incorporate constraint handling or multi-objective balance may exhibit weaker robustness under scaling demand, despite performing well in limited operating points. Overall, the findings support the practical value of multi-objective and constraint-aware learning for next-generation C-RAN orchestration, especially in deployments where operational costs, fronthaul limits, and SLA targets must be simultaneously satisfied.
Future work will improve realism and deployment readiness by incorporating richer RU energy-state modeling (including transition costs and wake-up latency), adopting bursty and service-differentiated traffic patterns, and studying broader network scenarios with mobility and heterogeneous fronthaul. The simulator will also be extended to include explicit MEC queueing dynamics (e.g., M/M/1 or M/G/1) and to re-examine controller robustness under time-varying load. Moreover, hybrid designs that couple constraint-enforcing supervision with learning-based continuous control may offer a practical path toward scalable, reliable, and interpretable AI-assisted C-RAN management. Finally, the formulation will be generalized to a multi-step MDP by introducing RU sleep/wake transition penalties, functional-split switching overheads, and explicit MEC queue evolution, enabling evaluation under delayed effects.