1. Introduction
Multiple-input multiple-output (MIMO) technology has become one of the main foundations of modern wireless communication systems because it can improve spectral efficiency, link reliability, and network capacity by exploiting spatial diversity and multiplexing [1]. In conventional massive MIMO systems, a large number of antennas are usually deployed at a centralized base station to serve many users within a cell. Although this architecture can provide high throughput, its performance may degrade for cell-edge users due to path loss, shadowing, inter-cell interference, and non-uniform service quality, as shown in Figure 1. These limitations become more critical in dense beyond-5G networks, where a large number of users require reliable connectivity, high spectral efficiency, and stable Quality of Service (QoS) [2].
Cell-free massive MIMO (CF-mMIMO) has been introduced as a promising distributed architecture to overcome the limitations of conventional cell-based networks [3]. In CF-mMIMO, a large number of geographically distributed access points (APs) jointly serve all user equipment (UE) over the same time-frequency resources with the assistance of a central processing unit (CPU). By removing cell boundaries, CF-mMIMO can provide more uniform coverage, reduce the cell-edge effect, and improve the reliability of user-centric transmission [4]. However, these advantages depend strongly on efficient power allocation, especially in the downlink, where the transmit power must be distributed among users while controlling interference and maintaining QoS requirements [5].
Dynamic downlink power allocation in CF-mMIMO remains a challenging problem. The channel conditions vary continuously due to user mobility, shadow fading, changing AP–UE distances, and interference coupling among users [6]. Classical optimization-based power-control methods, such as max-min fairness, product-based SINR optimization, and sum-rate maximization, can improve specific performance objectives, but they often require repeated optimization and may not adapt efficiently to fast network changes [7]. Moreover, purely throughput-oriented strategies may improve the total spectral efficiency while reducing the service quality of weak users, whereas QoS-oriented strategies may protect weak users but sacrifice aggregate throughput. Therefore, a flexible power-allocation framework is needed to adapt to dynamic network conditions while balancing spectral efficiency, SINR behavior, QoS satisfaction, and computational cost.
Recently, artificial intelligence-based resource management has attracted significant attention for beyond-5G and 6G wireless networks. Deep reinforcement learning (DRL) is particularly suitable for dynamic wireless optimization because it can learn control policies from interaction with the environment instead of solving a full optimization problem from the beginning at every time step [8]. Recent studies on AI-driven wireless edge networks and reinforcement learning-based CF-mMIMO access-point clustering have shown the potential of learning-based methods for scalable resource management. In parallel, reconfigurable intelligent surface (RIS)-assisted MIMO systems have also been studied as an emerging 5G/6G technology for improving wireless propagation and resource allocation [9]. However, RIS-based studies mainly focus on modifying the propagation environment, while the present work focuses on dynamic downlink power allocation in CF-mMIMO using a hybrid learning-and-optimization framework [10].
To address the above challenges, this paper proposes a Deep Hybrid Intelligent (DHI) framework for dynamic downlink power allocation in CF-mMIMO systems. The proposed framework integrates Soft Actor–Critic (SAC) reinforcement learning with three power-control strategies: DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate. The SAC agent learns adaptive power-allocation decisions from the network state, while the optimization objectives guide the allocation process toward different operating priorities. DHI-Max-Min emphasizes QoS-oriented service reliability, DHI-Max-Product provides a balanced trade-off between SINR improvement and spectral-efficiency performance, and DHI-Max-Sum-Rate prioritizes aggregate throughput. In addition, L-BFGS-B refinement is applied to the Max-Product and Max-Sum-Rate strategies to improve the SAC-generated power allocation under bounded transmit-power constraints. The main contributions of this paper are summarized as follows:
A DHI-based dynamic downlink power-allocation framework is proposed for CF-mMIMO systems under user mobility, time-varying channels, interference, and transmit-power constraints.
The proposed framework integrates SAC-based adaptive learning with three downlink power-control objectives, namely DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate, to support different QoS and spectral-efficiency requirements.
An optimization refinement stage based on L-BFGS-B is incorporated into the DHI-Max-Product and DHI-Max-Sum-Rate strategies to improve the power-allocation decisions while respecting bounded transmit-power constraints.
A QoS-aware reward and evaluation structure is adopted to assess the relationship between spectral efficiency, SINR, QoS satisfaction rate, outage probability, and computational time.
The remainder of this paper is organized as follows. Section 2 reviews related work on CF-mMIMO power allocation, optimization-based approaches, DRL-based resource management, and recent AI/RIS-assisted wireless studies. Section 3 presents the system model, mathematical formulation, SAC-based learning framework, and optimization objectives. Section 4 describes the simulation setup and parameter configuration. Section 5 discusses the simulation results, including spectral efficiency, SINR, QoS satisfaction, computational time, and convergence behavior. Section 6 discusses limitations and practical considerations, and Section 7 concludes the paper.
3. DHI Model Description
This section presents the system model, mathematical formulation, SAC-based learning structure, DHI power-control strategies, and complexity analysis of the proposed downlink power-allocation framework for CF-mMIMO systems.
3.1. System Model and Mathematical Formulation
This study considers a downlink cell-free massive multiple-input multiple-output (CF-mMIMO) system consisting of $M$ distributed access points (APs) and $K$ single-antenna user equipment (UE). The APs are geographically distributed over the coverage area and connected to a central processing unit (CPU) through fronthaul links. The CPU coordinates the downlink transmission and power-allocation decisions, while all APs jointly serve the UE using the same time-frequency resources. To remain consistent with the adopted simulation model and scalar SINR formulation, each AP–UE connection is represented by an effective scalar channel coefficient. This assumption corresponds to a single-antenna AP configuration or an equivalent scalar channel after local AP processing. Similar scalar-link representations are commonly used in CF-mMIMO power-control studies to evaluate distributed transmission, large-scale fading, and user-centric power allocation [19].
Let $h_{m,k}(t)$ denote the effective downlink channel coefficient between AP $m$ and UE $k$ at time step $t$. The channel coefficient is modeled as [20]

$$h_{m,k}(t) = \sqrt{\beta_{m,k}(t)}\, g_{m,k}(t),$$

where $\beta_{m,k}(t)$ represents the large-scale fading coefficient, including path loss and shadow fading, and $g_{m,k}(t)$ denotes the small-scale fading coefficient. Since the users are mobile, the AP–UE distance changes over time, and therefore the large-scale fading coefficient $\beta_{m,k}(t)$ is updated dynamically according to the user position and shadowing model.
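The downlink transmit power allocated from AP $m$ to UE $k$ at time step $t$ is denoted by $p_{m,k}(t)$, where

$$p_{m,k}(t) \ge 0.$$

The total transmit power of each AP is constrained by the maximum available AP power $P_{\max}$, such that [21]

$$\sum_{k=1}^{K} p_{m,k}(t) \le P_{\max}, \quad \forall m.$$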
If the implementation uses an effective UE-level power vector, the AP-level power coefficient can be written as

$$p_{m,k}(t) = \omega_{m,k}(t)\, p_k(t),$$

where $p_k(t)$ is the effective power assigned to UE $k$, and $\omega_{m,k}(t)$ is a non-negative AP–UE weighting coefficient that distributes the UE-level power across the serving APs.
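The received downlink signal at UE $k$ can be expressed as [19]

$$y_k(t) = \sum_{m=1}^{M} \sqrt{p_{m,k}(t)}\, h_{m,k}(t)\, x_k(t) + \sum_{j \neq k} \sum_{m=1}^{M} \sqrt{p_{m,j}(t)}\, h_{m,k}(t)\, x_j(t) + n_k(t),$$

where $x_k(t)$ is the intended data symbol for UE $k$, $x_j(t)$ is the interfering data symbol intended for UE $j \neq k$, and $n_k(t)$ is the additive noise at UE $k$, with variance $\sigma^2$. The first term represents the useful signal received by UE $k$, while the second term represents multi-user interference caused by downlink transmission to other UE.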
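Accordingly, the downlink signal-to-interference-plus-noise ratio (SINR) of UE $k$ is defined as [21]

$$\mathrm{SINR}_k(t) = \frac{\left| \sum_{m=1}^{M} \sqrt{p_{m,k}(t)}\, h_{m,k}(t) \right|^2}{\sum_{j \neq k} \left| \sum_{m=1}^{M} \sqrt{p_{m,j}(t)}\, h_{m,k}(t) \right|^2 + \sigma^2}.$$

This expression defines the UE-level SINR $\mathrm{SINR}_k(t)$. Therefore, a per-link notation such as $\mathrm{SINR}_{m,k}$ is not used as the final performance metric; instead, each UE is evaluated using the aggregated $\mathrm{SINR}_k(t)$, which includes the useful contribution from distributed APs and the interference generated by transmissions to other users.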
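The UE-level SINR described above can be illustrated with a minimal NumPy sketch. It assumes coherent combining of scalar AP–UE links, with the useful term in the numerator and the sum of interfering streams plus noise in the denominator; the function names, array layout, and numerical values are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np


def downlink_sinr(h, p, noise_var):
    """UE-level SINR for scalar AP-UE links (illustrative sketch).

    h: (M, K) complex array, h[m, k] is the channel from AP m to UE k.
    p: (M, K) non-negative array, p[m, j] is the power AP m spends on UE j.
    noise_var: noise variance at each UE.
    """
    # g[k, j] = sum_m sqrt(p[m, j]) * h[m, k]: amplitude of stream j seen at UE k.
    g = h.T @ np.sqrt(p)
    rx_power = np.abs(g) ** 2              # (K, K) received power per (UE, stream)
    signal = np.diag(rx_power)             # stream k received at UE k
    interference = rx_power.sum(axis=1) - signal
    return signal / (interference + noise_var)


def spectral_efficiency(sinr, prelog=1.0):
    """Per-UE spectral efficiency; prelog=1 ignores pilot overhead."""
    return prelog * np.log2(1.0 + np.asarray(sinr, dtype=float))


# Hypothetical example: M = 4 APs, K = 3 UE, Rayleigh small-scale fading.
rng = np.random.default_rng(0)
M, K = 4, 3
beta = rng.uniform(0.1, 1.0, size=(M, K))   # assumed large-scale fading values
g_small = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
h = np.sqrt(beta) * g_small
p = np.full((M, K), 0.2)                     # assumed equal power split
sinr = downlink_sinr(h, p, noise_var=0.1)
print(sinr, spectral_efficiency(sinr).sum())
```

The single matrix product evaluates all $M K^2$ AP–UE–stream contributions at once, which is why the number of users dominates the evaluation cost.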
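The spectral efficiency of UE $k$ is calculated as [19]

$$\mathrm{SE}_k(t) = \left(1 - \frac{\tau_p}{\tau_c}\right) \log_2\!\left(1 + \mathrm{SINR}_k(t)\right),$$

where $\tau_p$ is the pilot length and $\tau_c$ is the coherence block length. If pilot overhead is not explicitly considered in the simulation, the pre-log factor can be set to one, and the simplified expression becomes

$$\mathrm{SE}_k(t) = \log_2\!\left(1 + \mathrm{SINR}_k(t)\right).$$

The total downlink spectral efficiency of the system is then given by

$$\mathrm{SE}_{\mathrm{sum}}(t) = \sum_{k=1}^{K} \mathrm{SE}_k(t).$$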
To evaluate the Quality of Service (QoS) performance, a minimum SINR threshold $\gamma_{\mathrm{th}}$ is defined. UE $k$ is considered to satisfy the QoS requirement if

$$\mathrm{SINR}_k(t) \ge \gamma_{\mathrm{th}}.$$

The QoS satisfaction rate is computed as

$$\rho_{\mathrm{QoS}}(t) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\{\mathrm{SINR}_k(t) \ge \gamma_{\mathrm{th}}\},$$

where $\mathbb{1}\{\cdot\}$ is an indicator function equal to one when the condition is satisfied and zero otherwise. The outage probability is then given by

$$P_{\mathrm{out}}(t) = 1 - \rho_{\mathrm{QoS}}(t).$$

In this work, QoS is treated as a soft performance target rather than an absolute hard guarantee for every UE at every time step. This is because dynamic channel variation, user mobility, and transmit-power limitations may prevent all users from simultaneously satisfying the SINR threshold. Therefore, QoS performance is evaluated using the QoS satisfaction rate and outage probability.
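As a small illustration of these two metrics, the following sketch computes the QoS satisfaction rate as the fraction of UE whose SINR meets the threshold and the outage probability as its complement. The function name and the sample values are hypothetical.

```python
import numpy as np


def qos_metrics(sinr, sinr_threshold):
    """Return (QoS satisfaction rate, outage probability) for one time step."""
    sinr = np.asarray(sinr, dtype=float)
    satisfied = sinr >= sinr_threshold      # indicator function per UE
    qos_rate = float(satisfied.mean())      # fraction of UE meeting the threshold
    return qos_rate, 1.0 - qos_rate


# Hypothetical SINR values (linear scale) for four UE and a threshold of 1.0.
rate, outage = qos_metrics([0.5, 2.0, 3.0, 1.0], sinr_threshold=1.0)
print(rate, outage)
```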
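The general downlink power-allocation problem can be formulated as

$$\max_{\{p_{m,k}(t)\}} \; J_u(t)$$

subject to

$$p_{m,k}(t) \ge 0, \quad \forall m, k,$$

$$\sum_{k=1}^{K} p_{m,k}(t) \le P_{\max}, \quad \forall m,$$

$$\mathrm{SINR}_k(t) \ge \gamma_{\mathrm{th}}, \quad \forall k,$$

where $J_u(t)$ denotes the objective of the selected power-control strategy, $P_{\max}$ is the maximum AP transmit power, and $\gamma_{\mathrm{th}}$ is the minimum SINR threshold. However, because the QoS constraint may not always be feasible under dynamic channel and power-limited conditions, the proposed DHI framework handles QoS using a QoS-aware reward penalty and evaluates the final performance using the QoS satisfaction rate and outage probability. This avoids claiming an unrealistic full QoS guarantee while still guiding the learning model toward QoS-supportive power allocation.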
Figure 2 illustrates the architecture of the proposed DHI framework, which integrates deep reinforcement learning with hybrid optimization strategies for dynamic downlink power allocation in CF-mMIMO systems.
3.2. SAC-Based Learning Algorithm Description
The proposed DHI framework uses the Soft Actor–Critic (SAC) algorithm to learn a dynamic downlink power-allocation policy for the considered CF-mMIMO system. SAC is an off-policy actor–critic reinforcement learning method designed for continuous action spaces, making it suitable for downlink power control because transmit power is a continuous variable rather than a discrete decision. In the proposed framework, the SAC agent observes the current network condition, selects the downlink power-allocation action, receives a reward based on spectral efficiency and QoS behavior, and updates its policy using experience replay.
At each time step $t$, the interaction between the SAC agent and the CF-mMIMO environment is formulated as a Markov decision process (MDP), defined by the tuple

$$(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma),$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the continuous action space, $\mathcal{P}$ represents the state-transition probability caused by channel evolution and UE mobility, $r$ is the reward function, and $\gamma$ is the discount factor.
3.2.1. State Space
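The state vector contains the network information required by the SAC agent to make downlink power-allocation decisions. In this work, the state at time step $t$ is defined as [22]

$$s_t = \left[\mathrm{vec}\big(\tilde{\boldsymbol{\beta}}(t)\big),\; \tilde{\mathbf{q}}(t),\; \widetilde{\mathbf{SINR}}(t-1),\; \tilde{\mathbf{I}}(t-1),\; \mathbf{p}(t-1)\right],$$

where $\tilde{\beta}_{m,k}(t)$ is the normalized large-scale fading coefficient between AP $m$ and UE $k$, $\tilde{\mathbf{q}}(t)$ contains the normalized UE location coordinates, $\widetilde{\mathbf{SINR}}(t-1)$ is the normalized SINR vector from the previous time step, $\tilde{\mathbf{I}}(t-1)$ represents the previous interference level of the UE, and $\mathbf{p}(t-1)$ is the previous downlink power-allocation vector. The operator $\mathrm{vec}(\cdot)$ converts the AP–UE fading matrix into a vector.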
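For a system with $M$ APs and $K$ UE, the state dimension is [23]

$$d_s = MK + 2K + K + K + K = MK + 5K,$$

where the $2K$ term corresponds to two location coordinates per UE.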
For the simulation setup used in this paper, where
, the state dimension becomes
This explicit state representation improves reproducibility by clarifying which network parameters are used by the SAC agent during training and inference.
3.2.2. Action Space
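The action generated by the SAC actor network is a continuous vector representing the downlink power-control decision for all UE. The action vector is defined as [24]

$$a_t = [a_1(t), a_2(t), \ldots, a_K(t)],$$

where each action component is bounded as

$$-1 \le a_k(t) \le 1, \quad \forall k.$$

The normalized SAC action is then mapped to the physical transmit-power range as

$$p_k(t) = p_{\min} + \frac{a_k(t) + 1}{2}\,\big(p_{\max} - p_{\min}\big),$$

where $p_{\min}$ and $p_{\max}$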
denote the minimum and maximum allowable downlink transmit power, respectively. In this study,
and
. This mapping ensures that the SAC-generated power allocation always remains inside the permitted transmit-power range.
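The action-to-power mapping can be sketched as follows, assuming the normalized SAC action lies in $[-1, 1]$ (the usual range of a tanh-squashed policy output) and is mapped affinely onto the permitted power interval. The function name and the numerical power limits are hypothetical placeholders, not the values used in this study.

```python
import numpy as np


def map_action_to_power(action, p_min, p_max):
    """Affine map from a normalized action in [-1, 1] to [p_min, p_max]."""
    # Clipping guards against small numerical overshoots of the policy output.
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return p_min + 0.5 * (a + 1.0) * (p_max - p_min)


# Hypothetical limits: p_min = 0.1 and p_max = 1.1 (arbitrary units).
powers = map_action_to_power([-1.0, 0.0, 1.0, 1.7], p_min=0.1, p_max=1.1)
print(powers)
```

Because the map is affine and the action is clipped first, the resulting powers cannot leave the permitted interval, which is the property the text relies on.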
If AP-level power allocation is required, the effective UE-level power $p_k(t)$ is distributed across the serving APs using [25]

$$p_{m,k}(t) = \omega_{m,k}(t)\, p_k(t),$$

where $\omega_{m,k}(t)$ is a non-negative AP–UE weighting coefficient.
3.2.3. Reward Function
Since reward design is central to SAC-based power control, the proposed framework defines the reward as a QoS-aware objective that balances spectral efficiency, SINR performance, QoS satisfaction, and power-constraint violation. The general reward at time step $t$ is formulated as

$$r_t = w_1 \tilde{J}_u(t) + w_2\, \rho_{\mathrm{QoS}}(t) - w_3 V_{\mathrm{QoS}}(t) - w_4 V_{P}(t),$$

where $u$ denotes the selected DHI strategy, $\tilde{J}_u(t)$ is the normalized objective value of the selected strategy, $\rho_{\mathrm{QoS}}(t)$ is the QoS satisfaction rate, $V_{\mathrm{QoS}}(t)$ is the QoS violation penalty, $V_{P}(t)$ is the power-constraint violation penalty, and $w_1, w_2, w_3, w_4$ are non-negative weighting factors.
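The QoS violation penalty is defined as [26]

$$V_{\mathrm{QoS}}(t) = \frac{1}{K} \sum_{k=1}^{K} \max\!\big(0,\; \gamma_{\mathrm{th}} - \mathrm{SINR}_k(t)\big),$$

where $\gamma_{\mathrm{th}}$ is the minimum SINR threshold required for QoS satisfaction. This term penalizes the agent when the SINR of a UE falls below the QoS threshold.

The power violation penalty is expressed as

$$V_{P}(t) = \sum_{k=1}^{K} \Big[\max\!\big(0,\; p_k(t) - p_{\max}\big) + \max\!\big(0,\; p_{\min} - p_k(t)\big)\Big].$$

In the current implementation, the action mapping in Equation (23) already bounds the power values between $p_{\min}$ and $p_{\max}$. Therefore, $V_{P}(t)$ is normally zero, but it is retained in the formulation to make the constraint-handling mechanism explicit.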
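The strategy-dependent objective $J_u(t)$ is defined according to the selected DHI operating mode. For DHI-Max-Min, the objective emphasizes the weakest UE:

$$J_{\mathrm{MM}}(t) = \min_{k} \mathrm{SE}_k(t).$$

For DHI-Max-Product, the objective improves the overall SINR distribution using a logarithmic product form:

$$J_{\mathrm{MP}}(t) = \sum_{k=1}^{K} \log\!\big(\mathrm{SINR}_k(t) + \epsilon\big),$$

where $\epsilon$ is a small positive constant used to avoid numerical instability.

For DHI-Max-Sum-Rate, the objective maximizes the total downlink spectral efficiency:

$$J_{\mathrm{MSR}}(t) = \sum_{k=1}^{K} \mathrm{SE}_k(t).$$

The objective value is then normalized to obtain $\tilde{J}_u(t)$ before being used in the reward.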
This reward formulation allows the SAC agent to learn power-allocation policies that improve the selected objective while reducing QoS violations and maintaining bounded transmit power.
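To make the reward structure concrete, the sketch below combines a strategy objective with the QoS satisfaction rate and the two violation penalties. The weights, the averaging of the QoS penalty over UE, the use of spectral efficiency in the Max-Min objective, and all function names are assumptions made for illustration; the normalization of the objective is left to the caller because its exact form is not restated here.

```python
import numpy as np


def strategy_objective(sinr, mode, eps=1e-9):
    """Un-normalized objective for one DHI operating mode (illustrative)."""
    sinr = np.asarray(sinr, dtype=float)
    se = np.log2(1.0 + sinr)
    if mode == "max_min":
        return float(se.min())                      # weakest UE
    if mode == "max_product":
        return float(np.sum(np.log(sinr + eps)))    # log of the SINR product
    if mode == "max_sum_rate":
        return float(se.sum())                      # total spectral efficiency
    raise ValueError(f"unknown mode: {mode}")


def dhi_reward(j_norm, sinr, power, sinr_th, p_min, p_max,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """QoS-aware reward: objective + QoS rate - QoS penalty - power penalty."""
    w1, w2, w3, w4 = weights                        # assumed non-negative weights
    sinr = np.asarray(sinr, dtype=float)
    power = np.asarray(power, dtype=float)
    qos_rate = np.mean(sinr >= sinr_th)
    qos_penalty = np.mean(np.maximum(0.0, sinr_th - sinr))
    power_penalty = np.sum(np.maximum(0.0, power - p_max)
                           + np.maximum(0.0, p_min - power))
    return float(w1 * j_norm + w2 * qos_rate - w3 * qos_penalty - w4 * power_penalty)


# Hypothetical values: two UE, one below the SINR threshold.
print(dhi_reward(0.5, sinr=[2.0, 0.5], power=[0.5, 0.5],
                 sinr_th=1.0, p_min=0.0, p_max=1.0))
```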
3.2.4. SAC Policy and Critic Updates
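The SAC agent consists of an actor network, two critic networks, and their corresponding target networks. The actor network generates a stochastic policy $\pi_\phi(a_t \mid s_t)$, where $\phi$ denotes the actor parameters. The critic networks estimate the soft state-action value functions $Q_{\theta_1}(s_t, a_t)$ and $Q_{\theta_2}(s_t, a_t)$, where $\theta_1$ and $\theta_2$ denote the critic parameters.

For each transition $(s_t, a_t, r_t, s_{t+1}, d_t)$, the target soft value is computed as [22]

$$y_t = r_t + \gamma\,(1 - d_t)\left[\min_{i \in \{1,2\}} Q_{\bar{\theta}_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1})\right],$$

where $a_{t+1} \sim \pi_\phi(\cdot \mid s_{t+1})$, $\bar{\theta}_1$ and $\bar{\theta}_2$ are the target critic parameters, and $\alpha$ is the entropy-temperature coefficient.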
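The critic loss is defined as [27]

$$\mathcal{L}_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}, d_t) \sim \mathcal{D}}\left[\big(Q_{\theta_i}(s_t, a_t) - y_t\big)^2\right], \quad i \in \{1, 2\},$$

where $\mathcal{D}$ is the replay buffer.

The actor loss is formulated as [24]

$$\mathcal{L}_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\phi}\left[\alpha \log \pi_\phi(a_t \mid s_t) - \min_{i \in \{1,2\}} Q_{\theta_i}(s_t, a_t)\right].$$

The entropy coefficient $\alpha$ controls the balance between exploration and exploitation. A larger $\alpha$ encourages more exploration, while a smaller $\alpha$ encourages more deterministic power allocation. In this work, automatic entropy tuning is used to adjust $\alpha$ during training.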
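The target critic networks are updated using Polyak averaging:

$$\bar{\theta}_i \leftarrow \tau\, \theta_i + (1 - \tau)\, \bar{\theta}_i, \quad i \in \{1, 2\},$$

where $\tau$ is the Polyak update coefficient.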
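The soft target and the Polyak update follow the standard SAC recipe, and a framework-free NumPy sketch can show the arithmetic. It assumes the clipped double-Q form of the target (minimum over the two target critics) and represents network parameters as plain arrays; the function names and numbers are illustrative, and a real implementation would operate on neural-network tensors instead.

```python
import numpy as np


def soft_target(reward, done, q1_next, q2_next, logp_next, gamma, alpha):
    """Soft Bellman target with clipped double-Q and entropy bonus."""
    min_q = np.minimum(q1_next, q2_next)     # pessimistic target-critic estimate
    return reward + gamma * (1.0 - done) * (min_q - alpha * logp_next)


def polyak_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_bar
            for w_bar, w in zip(target_params, online_params)]


# Hypothetical single transition and a two-element parameter vector.
y = soft_target(reward=1.0, done=0.0, q1_next=2.0, q2_next=3.0,
                logp_next=-1.0, gamma=0.5, alpha=0.2)
new_target = polyak_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print(y, new_target)
```

A small `tau` keeps the target critics slowly varying, which stabilizes the regression target used in the critic loss.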
3.2.5. Replay Buffer and Training Procedure
The SAC agent stores each interaction with the environment in a replay buffer as

$$(s_t, a_t, r_t, s_{t+1}, d_t),$$

where $d_t$ is the terminal-state indicator. During training, mini-batches are sampled randomly from the replay buffer to update the actor and critic networks. This off-policy learning mechanism improves sample efficiency and reduces the correlation between consecutive training samples.
The training process begins with environment initialization, including AP and UE deployment, channel generation, mobility initialization, and initial power setting. At each time step, the SAC actor selects a downlink power action, the environment computes the resulting SINR, spectral efficiency, QoS satisfaction rate, and reward, and the transition is stored in the replay buffer. Once the replay buffer contains enough samples, the critic networks, actor network, entropy coefficient, and target networks are updated.
The actor and critic networks are implemented as multilayer perceptrons. The network architecture is selected through hyperparameter optimization, where candidate architectures include [64,64], [256,256], [400,300], [128,256,128], and [256,256,256]. The search also includes the learning rate, batch size, discount factor, replay-buffer size, learning-start steps, training frequency, and Polyak coefficient.
3.2.6. Convergence Monitoring
To evaluate the convergence behavior of the proposed SAC-based DHI framework, the training performance is monitored using the moving average of the episode reward, sum spectral efficiency, QoS satisfaction rate, and critic loss. The moving average reward over a window of $W$ episodes is calculated as

$$\bar{R}_e = \frac{1}{W} \sum_{i = e - W + 1}^{e} R_i,$$

where $R_i$ is the total reward of episode $i$. The policy is considered stable when the moving average reward and QoS satisfaction rate no longer show large fluctuations over consecutive training windows.
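The moving-average reward used for convergence monitoring can be computed with a uniform window, as in the following sketch. The function name, the sample rewards, and the window length are hypothetical.

```python
import numpy as np


def moving_average_reward(episode_rewards, window):
    """Mean of the last `window` episode rewards, for every full window."""
    rewards = np.asarray(episode_rewards, dtype=float)
    if window < 1 or window > rewards.size:
        raise ValueError("window must be between 1 and the number of episodes")
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely.
    return np.convolve(rewards, kernel, mode="valid")


smoothed = moving_average_reward([1.0, 2.0, 3.0, 4.0, 5.0], window=2)
print(smoothed)
```

Comparing the spread of the last few smoothed values against a tolerance is one simple way to turn the stability criterion into an automatic check.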
Figure 3 illustrates the SAC action-generation mechanism used to transform the unbounded policy output into a bounded downlink power allocation vector within the predefined power constraints.
3.3. DHI Power-Control Strategies and L-BFGS-B Refinement
The proposed DHI framework combines SAC-based learning with three downlink power-control strategies: DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate. These strategies are designed to guide the learning model toward different operating objectives. DHI-Max-Min focuses on QoS-oriented service reliability, DHI-Max-Product improves the overall SINR balance among users, and DHI-Max-Sum-Rate maximizes the aggregate spectral efficiency of the CF-mMIMO system. Therefore, the proposed framework does not rely on a single fixed power-control objective; instead, it provides flexible operating modes according to the required network performance target.
3.3.1. DHI-Max-Min Strategy
The DHI-Max-Min strategy aims to improve the performance of the weakest UE by maximizing the minimum spectral efficiency among all users. This strategy is suitable for QoS-oriented operation because it prevents users with poor channel conditions from being severely degraded. The objective is formulated as [28]

$$\max_{\mathbf{p}(t)} \; \min_{k} \; \mathrm{SE}_k(t)$$

subject to

$$p_{\min} \le p_k(t) \le p_{\max}, \quad \forall k.$$
In this strategy, the SAC agent learns a power-allocation policy that improves the minimum user performance. However, because the objective prioritizes weak users, the total sum spectral efficiency may be lower than that achieved by throughput-oriented strategies. Therefore, DHI-Max-Min is mainly suitable for scenarios where QoS reliability and minimum user service are more important than maximum aggregate throughput.
3.3.2. DHI-Max-Product Strategy
The DHI-Max-Product strategy aims to improve the overall SINR distribution by maximizing the product of user SINR values. Direct multiplication of SINR values may cause numerical instability, especially when the number of users is large. Therefore, the logarithmic form is used [29]:

$$\max_{\mathbf{p}(t)} \; \sum_{k=1}^{K} \log\!\big(\mathrm{SINR}_k(t) + \epsilon\big)$$

subject to

$$p_{\min} \le p_k(t) \le p_{\max}, \quad \forall k,$$

where $\epsilon$ is a small positive constant used to avoid $\log(0)$. This objective encourages the model to improve the SINR of all users while avoiding excessive concentration of resources on only a few strong users. As a result, DHI-Max-Product provides a balanced operating mode between QoS-oriented allocation and spectral-efficiency maximization.
3.3.3. DHI-Max-Sum-Rate Strategy
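The DHI-Max-Sum-Rate strategy aims to maximize the total downlink spectral efficiency of the CF-mMIMO system. The objective is written as [26]

$$\max_{\mathbf{p}(t)} \; \sum_{k=1}^{K} \mathrm{SE}_k(t)$$

subject to

$$p_{\min} \le p_k(t) \le p_{\max}, \quad \forall k.$$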
This strategy is suitable for throughput-priority scenarios because it allocates power in a way that increases the total system spectral efficiency. However, this may reduce the QoS satisfaction of users with weaker channel conditions. Therefore, in the proposed DHI framework, the reward function includes a QoS violation penalty to reduce the negative effect of throughput maximization on weak users.
3.3.4. L-BFGS-B Refinement
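After the SAC agent generates the initial power-allocation vector, the DHI-Max-Product and DHI-Max-Sum-Rate strategies apply an additional L-BFGS-B refinement step. L-BFGS-B is selected because it is suitable for bound-constrained optimization problems, where each user power must remain within the allowable interval $[p_{\min}, p_{\max}]$.

Let the SAC-generated power vector be [30]

$$\mathbf{p}^{\mathrm{SAC}}(t) = \big[p_1^{\mathrm{SAC}}(t), \ldots, p_K^{\mathrm{SAC}}(t)\big]^{T}.$$

The refined power vector is obtained as

$$\mathbf{p}^{\star}(t) = \arg\min_{\mathbf{p}} \; \mathcal{L}_u(\mathbf{p})$$

subject to

$$p_{\min} \le p_k \le p_{\max}, \quad \forall k,$$

where $u \in \{\mathrm{MP}, \mathrm{MSR}\}$ and $\mathcal{L}_u(\mathbf{p})$ is the negative form of the selected objective. For DHI-Max-Product, the loss function is

$$\mathcal{L}_{\mathrm{MP}}(\mathbf{p}) = -\sum_{k=1}^{K} \log\!\big(\mathrm{SINR}_k(\mathbf{p}) + \epsilon\big).$$

For DHI-Max-Sum-Rate, the loss function is

$$\mathcal{L}_{\mathrm{MSR}}(\mathbf{p}) = -\sum_{k=1}^{K} \log_2\!\big(1 + \mathrm{SINR}_k(\mathbf{p})\big).$$

The L-BFGS-B update can be generally expressed as

$$\mathbf{p}^{(i+1)} = \Pi_{[p_{\min},\, p_{\max}]}\!\left(\mathbf{p}^{(i)} - \eta_i\, \mathbf{H}_i\, \nabla \mathcal{L}_u\big(\mathbf{p}^{(i)}\big)\right),$$

where $i$ is the internal L-BFGS-B iteration index, $\eta_i$ is the step size, $\mathbf{H}_i$ is the limited-memory approximation of the inverse Hessian matrix, and $\Pi_{[p_{\min},\, p_{\max}]}(\cdot)$ denotes projection onto the feasible power interval.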
The DHI-Max-Min strategy does not require L-BFGS-B refinement in this framework because its main purpose is QoS-oriented minimum user protection rather than gradient-based throughput maximization. Therefore, the SAC-generated power vector is directly evaluated using the Max-Min objective and QoS satisfaction metrics.
The final output of the DHI framework is therefore written as

$$\mathbf{p}^{\mathrm{final}}(t) = \begin{cases} \mathbf{p}^{\mathrm{SAC}}(t), & \text{DHI-Max-Min}, \\ \mathbf{p}^{\star}(t), & \text{DHI-Max-Product or DHI-Max-Sum-Rate}. \end{cases}$$

This hybrid design allows the SAC agent to provide adaptive power-allocation decisions under dynamic network conditions, while L-BFGS-B further refines the throughput-oriented strategies within the bounded transmit-power range. As a result, the proposed DHI framework can support different downlink operating requirements, including QoS-oriented service reliability, balanced SINR improvement, and sum spectral-efficiency maximization. The main power-control strategies integrated within the proposed DHI framework are summarized in Figure 4, showing how the framework applies different optimization objectives for downlink power allocation.
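The refinement stage can be sketched with SciPy's `scipy.optimize.minimize` using `method="L-BFGS-B"` and per-variable bounds. The sketch assumes a simplified UE-level model in which `gain[k, j]` is an effective power gain from the stream of UE $j$ to UE $k$, lets SciPy approximate the gradient by finite differences, and keeps the SAC vector if the refinement does not lower the loss. The function names, gain matrix, and power limits are hypothetical and do not reproduce the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize


def ue_sinr(p, gain, noise_var):
    """SINR per UE under an assumed effective-gain model."""
    rx = gain * p[np.newaxis, :]             # rx[k, j] = gain[k, j] * p[j]
    signal = np.diag(rx)
    interference = rx.sum(axis=1) - signal
    return signal / (interference + noise_var)


def refine_power(p_sac, gain, noise_var, p_min, p_max,
                 mode="max_sum_rate", eps=1e-9):
    """Refine an SAC-generated power vector inside [p_min, p_max]."""
    p0 = np.asarray(p_sac, dtype=float)

    def loss(p):
        sinr = ue_sinr(p, gain, noise_var)
        if mode == "max_product":
            return -np.sum(np.log(sinr + eps))      # negative log SINR product
        if mode == "max_sum_rate":
            return -np.sum(np.log2(1.0 + sinr))     # negative sum spectral efficiency
        raise ValueError(f"unknown mode: {mode}")

    bounds = [(p_min, p_max)] * p0.size
    result = minimize(loss, x0=p0, method="L-BFGS-B", bounds=bounds)
    p_ref = np.clip(result.x, p_min, p_max)
    # Fall back to the SAC vector if the optimizer did not improve the loss.
    return p_ref if loss(p_ref) <= loss(p0) else p0


# Hypothetical two-UE example with mild cross-interference.
gain = np.array([[1.0, 0.3], [0.3, 1.0]])
print(refine_power([0.5, 0.5], gain, noise_var=0.1, p_min=0.1, p_max=1.0,
                   mode="max_product"))
```

Starting from the SAC output rather than a random point is what makes this a refinement: the optimizer only needs a few iterations to polish an already reasonable allocation.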
3.4. Proposed DHI Model
The overall procedure of the proposed DHI framework is summarized in Algorithm 1. The algorithm begins by initializing the CF-mMIMO environment, including AP and UE locations, channel coefficients, mobility parameters, transmit-power limits, and SAC learning parameters. At each time step, the SAC agent observes the current network state and generates a continuous power-allocation action. The action is then mapped into the feasible transmit-power range and evaluated using the selected DHI strategy. For DHI-Max-Min, the SAC-generated power vector is directly evaluated because the objective focuses on minimum user QoS-oriented performance. For DHI-Max-Product and DHI-Max-Sum-Rate, the SAC-generated power vector is further refined using the L-BFGS-B optimizer to improve the objective value while maintaining the transmit-power bounds.
After applying the final power-allocation vector, the environment computes the downlink SINR, spectral efficiency, QoS satisfaction rate, outage probability, and reward. The transition is stored in the replay buffer, and the SAC actor and critic networks are updated using mini-batch samples from the buffer. This process continues until the maximum number of training steps or episodes is reached. The trained policy is then used for online downlink power allocation under dynamic channel and mobility conditions.
Algorithm 1 shows how the proposed DHI framework combines SAC-based adaptive learning with strategy-dependent power-control objectives. The SAC agent provides the initial power-allocation decision based on the current network state, while the selected DHI strategy determines how the decision is evaluated and refined. The DHI-Max-Min strategy directly evaluates the SAC-generated power vector for QoS-oriented minimum user performance, whereas DHI-Max-Product and DHI-Max-Sum-Rate use L-BFGS-B refinement to improve SINR-product and sum-rate objectives under bounded transmit-power constraints. This structure allows the proposed framework to adapt to user mobility and channel variation while supporting different downlink operating requirements.
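Algorithm 1: Proposed DHI Framework for Dynamic Power Allocation
Require: $M$, $K$, $p_{\min}$, $p_{\max}$, $\gamma_{\mathrm{th}}$, selected strategy $u$, SAC hyperparameters
Ensure: trained policy $\pi_\phi$ and final downlink power vector $\mathbf{p}^{\mathrm{final}}(t)$
1. Initialize AP/UE locations, mobility model, channel parameters, SAC networks, and replay buffer $\mathcal{D}$.
2. Set the initial downlink power vector $\mathbf{p}(0)$.
3. For each episode do:
4.   Reset the CF-mMIMO environment and observe the initial state $s_0$.
5.   For each time step $t$ do:
6.     Update UE positions, channel coefficients, and interference conditions.
7.     Observe the current state $s_t$.
8.     Sample the action $a_t \sim \pi_\phi(\cdot \mid s_t)$.
9.     Map $a_t$ to the power vector $\mathbf{p}^{\mathrm{SAC}}(t) \in [p_{\min}, p_{\max}]$.
10.    If the strategy is DHI-Max-Min, then:
11.      Set $\mathbf{p}^{\mathrm{final}}(t) = \mathbf{p}^{\mathrm{SAC}}(t)$.
12.    Else if the strategy is DHI-Max-Product, then:
13.      Refine $\mathbf{p}^{\mathrm{SAC}}(t)$ to $\mathbf{p}^{\star}(t)$ using L-BFGS-B with the SINR-product objective.
14.      Set $\mathbf{p}^{\mathrm{final}}(t) = \mathbf{p}^{\star}(t)$.
15.    Else if the strategy is DHI-Max-Sum-Rate, then:
16.      Refine $\mathbf{p}^{\mathrm{SAC}}(t)$ to $\mathbf{p}^{\star}(t)$ using L-BFGS-B with the sum-rate objective.
17.      Set $\mathbf{p}^{\mathrm{final}}(t) = \mathbf{p}^{\star}(t)$.
18.    End if.
19.    Apply $\mathbf{p}^{\mathrm{final}}(t)$ to the downlink CF-mMIMO environment.
20.    Compute SINR, spectral efficiency, QoS satisfaction rate, outage probability, and reward.
21.    Store the transition $(s_t, a_t, r_t, s_{t+1}, d_t)$ in $\mathcal{D}$.
22.    If $\mathcal{D}$ contains enough samples, then:
23.      Update SAC critic networks, actor network, entropy coefficient, and target networks.
24.    End if.
25.  End for.
26. End for.
27. Return the trained policy $\pi_\phi$.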
3.5. Complexity and Practical Feasibility Analysis
The computational complexity of the proposed DHI framework depends on three main components: the CF-mMIMO environment evaluation, the SAC learning process, and the optional L-BFGS-B refinement used in the DHI-Max-Product and DHI-Max-Sum-Rate strategies. Since the system consists of distributed APs and UE, the computational cost increases with the number of AP–UE channel links, the number of users whose downlink powers are optimized, the neural network size of the SAC agent, and the number of refinement iterations required by L-BFGS-B.
At each time step, the environment computes the channel-dependent quantities, downlink SINR values, spectral efficiency, QoS satisfaction rate, outage probability, and reward. The dominant environment cost comes from evaluating the useful signal and multi-user interference terms for all UE. Since each UE receives useful and interfering contributions from $M$ APs and $K-1$ interfering user streams, the SINR evaluation can be approximated as

$$C_{\mathrm{env}} = \mathcal{O}(M K^2).$$
This cost reflects the repeated computation of AP–UE contributions for each user and interfering stream. Therefore, increasing the number of users has a stronger effect on the environment evaluation cost than increasing the number of APs.
For the SAC component, the offline training cost depends on the number of training time steps, the environment evaluation cost, and the actor–critic neural network update cost. If $T$ denotes the total number of training time steps and $C_{\mathrm{NN}}$ denotes the cost of one SAC neural network update, the total SAC training complexity can be expressed as

$$C_{\mathrm{train}} = \mathcal{O}\big(T\,(M K^2 + C_{\mathrm{NN}})\big).$$
The neural network update cost depends on the selected multilayer perceptron architecture, batch size, and number of critic and actor updates. For a network with $L$ layers and layer widths $n_0, n_1, \ldots, n_L$, the forward-pass cost can be approximated as

$$C_{\mathrm{fwd}} = \mathcal{O}\!\left(\sum_{l=1}^{L} n_{l-1}\, n_l\right).$$

Because SAC uses an actor network, two critic networks, and target critic networks, the practical training cost is higher than a single-network forward pass. Considering a mini-batch size $B$, the neural network update cost can be approximated as

$$C_{\mathrm{NN}} = \mathcal{O}\!\left(B \sum_{l=1}^{L} n_{l-1}\, n_l\right).$$

This cost is mainly incurred during offline training. Once training is completed, online deployment does not require full actor–critic updates. Instead, the trained actor network generates the downlink power-allocation action using only a forward pass. Therefore, the online SAC inference complexity is

$$C_{\mathrm{inf}} = \mathcal{O}\!\left(\sum_{l=1}^{L} n_{l-1}\, n_l\right).$$
This is significantly lower than the offline training cost and makes the trained policy suitable for real-time or near-real-time downlink power control.
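The gap between training and inference cost can be illustrated by counting multiply-accumulate operations for a fully connected network. The sketch below is a rough count under stated assumptions: biases and activations are ignored, the actor is assumed to output a mean and a log-standard-deviation per action component, and only forward passes through one actor, two critics, and two target critics are counted per sample. The state and action dimensions are hypothetical; the hidden-layer sizes are taken from the candidate architectures listed in Section 3.2.5.

```python
def mlp_forward_macs(input_dim, hidden_layers, output_dim):
    """Multiply-accumulate count of one forward pass through a dense MLP."""
    widths = [input_dim, *hidden_layers, output_dim]
    return sum(n_in * n_out for n_in, n_out in zip(widths[:-1], widths[1:]))


def sac_update_macs(state_dim, action_dim, hidden_layers, batch_size):
    """Rough forward-pass count for one SAC update on a mini-batch."""
    # Assumed Gaussian policy head: mean and log-std for each action component.
    actor = mlp_forward_macs(state_dim, hidden_layers, 2 * action_dim)
    critic = mlp_forward_macs(state_dim + action_dim, hidden_layers, 1)
    # One actor, two critics, and two target critics per sample.
    return batch_size * (actor + 4 * critic)


# Hypothetical dimensions: 10 state features, 4 UE power outputs.
print(mlp_forward_macs(10, [256, 256], 4))
print(sac_update_macs(10, 4, [64, 64], batch_size=2))
```

Online inference only pays the first quantity once per decision, while each training update pays a multiple of it for every sample in the mini-batch.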
For DHI-Max-Product and DHI-Max-Sum-Rate, L-BFGS-B refinement is applied after the SAC actor generates the initial power vector. Let $n = K$ be the number of optimization variables, corresponding to the UE-level downlink power vector, $I$ be the number of L-BFGS-B iterations, and $q$ be the limited-memory correction parameter. The complexity of the refinement stage can be approximated as

$$C_{\mathrm{LBFGS}} = \mathcal{O}\big(I\,(q K + C_{\mathrm{obj}})\big),$$

where $C_{\mathrm{obj}}$ is the cost of evaluating the selected objective function. Since the objective evaluation depends on SINR and spectral-efficiency computation, it is linked to $\mathcal{O}(M K^2)$. Therefore,

$$C_{\mathrm{LBFGS}} = \mathcal{O}\big(I\,(q K + M K^2)\big).$$
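The total online complexity of the proposed DHI framework depends on the selected strategy. For DHI-Max-Min, no L-BFGS-B refinement is used; therefore, the online complexity is mainly the SAC actor inference and environment evaluation:

$$C_{\mathrm{MM}} = \mathcal{O}\big(C_{\mathrm{fwd}} + M K^2\big).$$

For DHI-Max-Product and DHI-Max-Sum-Rate, the online complexity includes the SAC forward pass, environment evaluation, and L-BFGS-B refinement:

$$C_u = \mathcal{O}\big(C_{\mathrm{fwd}} + M K^2 + I\,(q K + M K^2)\big),$$

where $u \in \{\mathrm{MP}, \mathrm{MSR}\}$, $C_{\mathrm{fwd}}$ is the actor forward-pass cost, $I$ is the number of L-BFGS-B iterations, and $q$ is the limited-memory correction parameter.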
In terms of scalability, the number of AP–UE links grows as $\mathcal{O}(MK)$, while the interference-related SINR computation grows approximately as $\mathcal{O}(MK^2)$. This means that user density has a stronger impact on computational cost than AP density. As $K$ increases, the action dimension of the SAC policy also increases because the actor outputs one power-control value for each UE. Therefore, for larger CF-mMIMO deployments, practical scalability can be improved by using user clustering, AP selection, parameter sharing, or distributed/federated learning structures. These approaches reduce the effective number of AP–UE links and make the learning problem more manageable.
The proposed framework is practical because the computationally expensive SAC training process can be performed offline on a host processor, GPU, or cloud/edge server. The trained actor policy can then be deployed online at the network's central processing unit (CPU) or an edge controller to generate power-allocation decisions through fast inference. The code implementation supports automatic device selection among the host processor, a CUDA-enabled GPU, and Apple MPS when available, which allows training to benefit from hardware acceleration when supported.
From a deployment perspective, the CPU collects large-scale fading, UE location, SINR, and interference-related information through the fronthaul network. Since the SAC state mainly relies on slowly varying large-scale fading, mobility information, previous SINR, previous interference, and previous power allocation, the signaling overhead can be reduced compared with schemes that require full instantaneous CSI exchange at every time step. However, the framework still requires reliable fronthaul signaling between APs and the CPU to update state information and apply the optimized downlink power-control decisions.
The measured computational times support the feasibility of the proposed approach. In the reported results, DHI-Max-Product and DHI-Max-Sum-Rate achieved mean computational times of 0.0690 s and 0.0696 s, respectively, while the DDPG benchmark required 0.63 s. DHI-Max-Min required a higher mean duration of 0.3757 s because it focuses on minimum user QoS-oriented behavior. These results indicate that the proposed DHI-Max-Product and DHI-Max-Sum-Rate strategies can provide faster online execution than the benchmark while maintaining strong spectral-efficiency performance.
Overall, the DHI framework separates offline learning from online inference. The offline phase handles policy training, hyperparameter tuning, and convergence monitoring, while the online phase uses the trained actor and optional L-BFGS-B refinement to generate bounded downlink power-allocation decisions. This structure improves deployment feasibility by avoiding the need to train the full SAC model during real-time operation. Nevertheless, the practical implementation of the proposed framework still depends on fronthaul capacity, state-update frequency, network size, and the ability of the CPU or edge controller to perform inference and optional refinement within the required scheduling interval.
5. Results and Discussion
This section evaluates the performance of the proposed DHI framework for dynamic downlink power allocation in CF-mMIMO systems. The evaluation focuses on convergence behavior, spectral efficiency, SINR distribution, QoS satisfaction, outage probability, spatial SINR behavior, and computational time. The three proposed DHI strategies, namely DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate, are compared to show how different power-control objectives affect downlink performance. DHI-Max-Min is evaluated as a QoS-oriented strategy, DHI-Max-Product as a balanced SINR-aware strategy, and DHI-Max-Sum-Rate as a throughput-oriented strategy. The DDPG benchmark is also included in the sum spectral-efficiency and computational-time comparisons to assess the advantage of the SAC-based DHI design.
The purpose of this section is not only to compare numerical values but also to explain how each result supports the motivation of the paper. Specifically, the results examine whether the proposed DHI framework can adapt to dynamic downlink conditions, improve spectral-efficiency behavior, support QoS satisfaction, and reduce computational time compared with the benchmark method.
Figure 5 presents the empirical cumulative distribution function (CDF) and probability density function (PDF) of the downlink spectral efficiency obtained using DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate. The CDF curves show the probability that the spectral efficiency is below a given value, while the PDF curves indicate the concentration of spectral-efficiency values across the simulated downlink conditions. Therefore, these distributions provide more complete information than a single average value because they show how each strategy affects both low-performance and high-performance users.
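For reference, an empirical CDF of the kind plotted in these figures can be computed as in the following minimal sketch. The sample values are hypothetical and are not taken from the reported simulations.

```python
def empirical_cdf(samples):
    """Return (value, P[X <= value]) pairs for a list of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    # The i-th smallest sample (0-based) has (i + 1) samples at or below it.
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Hypothetical per-user spectral-efficiency samples in bit/s/Hz.
se_samples = [2.0, 1.0, 4.0, 3.0]
cdf_points = empirical_cdf(se_samples)
```

Each pair gives a spectral-efficiency value and the fraction of samples that do not exceed it.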
The DHI-Max-Min strategy produces a more conservative spectral-efficiency distribution because its objective prioritizes the weakest UE. This behavior improves minimum user service and supports QoS-oriented operation, but it also limits the aggregate spectral-efficiency gain because additional power is assigned to weaker users instead of only maximizing throughput. In contrast, DHI-Max-Sum-Rate shifts the spectral-efficiency distribution toward higher values because it directly maximizes the total system spectral efficiency. This confirms its suitability for throughput-priority operation. However, the improvement in aggregate spectral efficiency may come with reduced QoS satisfaction for users experiencing weaker channel conditions.
DHI-Max-Product provides an intermediate behavior between these two strategies. By improving the product-based SINR objective, it avoids the excessive throughput bias of DHI-Max-Sum-Rate while still achieving better spectral-efficiency behavior than the QoS-oriented DHI-Max-Min strategy. This makes DHI-Max-Product suitable for balanced downlink operation where both SINR improvement and spectral-efficiency performance are required.
Overall, Figure 5 confirms that the three DHI strategies provide different operating modes rather than a single fixed solution. DHI-Max-Min is more suitable when minimum user service and QoS reliability are prioritized, DHI-Max-Product provides a balanced operating point, and DHI-Max-Sum-Rate is preferable when the main objective is to maximize aggregate downlink throughput.
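The difference between the three operating modes can be illustrated with a small numerical sketch. The objective forms below are the common textbook versions (minimum SINR, sum of log-SINR, and sum of log2(1 + SINR)) and are assumed here for illustration; the SINR vectors are hypothetical and do not come from the reported simulations.

```python
import math


def max_min_objective(sinr):
    """Max-min fairness score: the SINR of the weakest user."""
    return min(sinr)


def log_product_objective(sinr):
    """Logarithm of the SINR product, i.e. the sum of log-SINR values."""
    return sum(math.log(s) for s in sinr)


def sum_rate_objective(sinr):
    """Sum spectral efficiency in bit/s/Hz."""
    return sum(math.log2(1.0 + s) for s in sinr)


# Two hypothetical SINR outcomes (linear scale) for four users.
balanced = [3.0, 3.0, 3.0, 3.0]
skewed = [63.0, 1.0, 1.0, 1.0]

for name, objective in [
    ("max-min", max_min_objective),
    ("log-product", log_product_objective),
    ("sum-rate", sum_rate_objective),
]:
    preferred = "balanced" if objective(balanced) > objective(skewed) else "skewed"
    print(f"{name}: prefers the {preferred} outcome")
```

In this toy case the max-min and log-product scores favor the balanced outcome, whereas the sum-rate score favors the skewed one, which mirrors the qualitative trade-off discussed above.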
Figure 6 compares the empirical cumulative distribution function of the sum spectral efficiency achieved by the proposed DHI strategies and the DDPG benchmark. The sum spectral efficiency is an important metric because it measures the aggregate downlink throughput of the CF-mMIMO system and directly reflects how efficiently the available transmit power and distributed AP resources are used.
The DHI-Max-Min strategy achieves the lowest sum spectral-efficiency distribution among the proposed strategies. This behavior is expected because DHI-Max-Min prioritizes the weakest users and attempts to improve minimum user service rather than maximizing the total throughput. As a result, part of the available power is allocated to users with weaker channel conditions, which improves QoS-oriented behavior but reduces aggregate spectral-efficiency gain.
The DHI-Max-Product strategy shifts the distribution toward higher spectral-efficiency values compared with DHI-Max-Min. This indicates that maximizing the logarithmic SINR-product objective improves the overall SINR balance among users while still maintaining better aggregate throughput than the minimum user-focused strategy. Therefore, DHI-Max-Product provides a balanced operating point between QoS-oriented service reliability and sum spectral-efficiency improvement.
The DHI-Max-Sum-Rate strategy achieves the highest sum spectral-efficiency performance because its objective directly maximizes the total downlink spectral efficiency. Compared with the DDPG benchmark, the DHI-Max-Sum-Rate curve is shifted toward higher spectral-efficiency values, showing that the proposed hybrid learning-and-optimization structure provides stronger throughput-oriented performance. The DDPG benchmark achieves a median sum spectral efficiency of approximately 35.24 bit/s/Hz, while the DHI-Max-Sum-Rate strategy provides higher overall spectral-efficiency behavior under the same CF-mMIMO simulation setting.
The improvement of DHI-Max-Sum-Rate can be explained by two factors. First, SAC uses entropy-regularized stochastic exploration, which helps the agent explore different continuous power-allocation actions during training instead of converging too early to a narrow deterministic policy. Second, the L-BFGS-B refinement stage further improves the SAC-generated power vector under bounded transmit-power constraints. This combination allows the proposed DHI-Max-Sum-Rate strategy to exploit both learning-based adaptability and optimization-based refinement.
The DDPG benchmark is selected because it is also an actor–critic DRL method designed for continuous control problems. Therefore, it provides a reasonable comparison for continuous downlink power allocation. However, unlike SAC, DDPG uses a deterministic policy and may be more sensitive to exploration noise and local convergence behavior. This is a plausible reason why the SAC-based DHI framework achieves stronger and more stable sum spectral-efficiency performance in the dynamic CF-mMIMO environment.
Figure 7 presents the empirical CDF and PDF distributions of the downlink SINR under the three proposed DHI strategies: DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate. The SINR distribution is an important indicator because it directly affects both spectral efficiency and QoS satisfaction. While spectral efficiency measures the achievable data rate, SINR shows how effectively each strategy controls interference and supports reliable downlink reception.
The maximum SINR distribution shows the highest SINR values achieved under each strategy. DHI-Max-Sum-Rate generally produces higher maximum SINR values because its objective prioritizes aggregate throughput. This means that users with favorable channel conditions can receive stronger power allocation and achieve higher link quality. However, high maximum SINR alone does not necessarily indicate better QoS for all users, because it may occur when the strategy benefits strong users more than weak users.
The minimum SINR distribution provides a more direct view of QoS-oriented behavior. DHI-Max-Min performs better in this part because it is designed to improve the weakest user performance. By increasing the lower tail of the SINR distribution, DHI-Max-Min reduces the probability that users experience very poor link quality. This explains why DHI-Max-Min achieves stronger QoS satisfaction and lower outage behavior, even though its aggregate spectral efficiency is lower than the throughput-oriented strategy.
The mean SINR distribution shows the average SINR behavior across users and provides a balanced view of network-level link quality. DHI-Max-Product is expected to perform strongly in this metric because the logarithmic product-based objective encourages improvement across multiple users rather than focusing only on the weakest user or only on the strongest throughput contributors. Therefore, DHI-Max-Product can be interpreted as a balanced strategy that improves SINR distribution while maintaining better spectral-efficiency behavior than DHI-Max-Min.
Overall, the SINR results confirm that the three DHI strategies produce different downlink operating behaviors. DHI-Max-Min is more suitable for QoS-oriented operation because it improves the minimum SINR. DHI-Max-Product provides a balanced SINR distribution by reducing extreme imbalance among users. DHI-Max-Sum-Rate is more suitable for throughput-oriented operation because it increases the higher-SINR region and improves aggregate spectral efficiency. Therefore, Figure 7 supports the main motivation of the paper by showing that the proposed DHI framework can adapt the power-allocation behavior according to the required downlink objective.
Table 5 summarizes the QoS-oriented performance of the proposed DHI downlink power-allocation strategies. The QoS satisfaction rate is calculated based on the percentage of UE whose SINR values satisfy the required SINR threshold. The outage probability represents the percentage of UE that do not satisfy this threshold. Therefore, these two metrics provide a direct evaluation of how well each strategy supports downlink service reliability under dynamic channel and mobility conditions.
The results show that DHI-Max-Min achieves the highest QoS satisfaction rate of 93.75%, corresponding to an average of 30 users satisfying the required SINR threshold out of 32 UE. It also achieves the lowest outage probability of 6.25%. This confirms that DHI-Max-Min is the most suitable strategy when the main objective is QoS-oriented service reliability and minimum user protection. However, this improvement is obtained at the cost of lower aggregate spectral efficiency, with a mean sum SE of 17.50 bit/s/Hz.
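The two QoS metrics can be computed as in the following minimal sketch. The SINR values and the 3 dB threshold are hypothetical; they are chosen only so that 30 of 32 UE satisfy the threshold, which reproduces the 93.75% and 6.25% figures quoted above.

```python
def qos_metrics(sinr_db, threshold_db):
    """Return (QoS satisfaction rate, outage probability) in percent."""
    satisfied = sum(1 for s in sinr_db if s >= threshold_db)
    satisfaction = 100.0 * satisfied / len(sinr_db)
    # Outage is the complement of satisfaction.
    return satisfaction, 100.0 - satisfaction


# Hypothetical snapshot: 30 UE above and 2 UE below an assumed 3 dB threshold.
sinr_db = [6.0] * 30 + [1.0] * 2
satisfaction, outage = qos_metrics(sinr_db, threshold_db=3.0)
print(f"QoS satisfaction: {satisfaction:.2f}%, outage: {outage:.2f}%")
```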
DHI-Max-Product achieves a QoS satisfaction rate of 91.04% and an outage probability of 8.96%. Its mean sum spectral efficiency reaches 19.89 bit/s/Hz, which is higher than DHI-Max-Min. This shows that DHI-Max-Product provides a balanced trade-off between QoS support and throughput improvement. It does not prioritize the weakest users as strongly as DHI-Max-Min, but it avoids the more aggressive throughput bias of DHI-Max-Sum-Rate.
DHI-Max-Sum-Rate achieves the highest mean sum spectral efficiency of 20.14 bit/s/Hz. However, its QoS satisfaction rate decreases to 87.98%, with an outage probability of 12.02%. This behavior is expected because the sum-rate objective prioritizes aggregate throughput, which may allocate more resources to users with better channel conditions. As a result, some weaker users may fail to satisfy the SINR threshold.
These results clarify that QoS in the proposed framework should be interpreted as a QoS-aware performance objective, not as an absolute hard guarantee for every UE at every time step. This is because mobility, interference, and transmit-power limitations may prevent all users from satisfying the SINR threshold simultaneously. Therefore, the proposed framework evaluates QoS using QoS satisfaction rate and outage probability rather than claiming complete QoS guarantee.
Figure 8 presents the spatial SINR heat maps obtained using the three proposed DHI downlink power-allocation strategies: DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate. The heat maps provide a visual interpretation of how each strategy distributes downlink link quality across the CF-mMIMO coverage area. Unlike the CDF and PDF plots, which summarize the statistical behavior of SINR, the heat maps show the geographical distribution of SINR and help identify regions with strong or weak service quality.
The DHI-Max-Min heat map shows a more uniform SINR distribution across the service area. This behavior is consistent with the objective of DHI-Max-Min, which focuses on improving the weakest user performance. As a result, extremely low-SINR regions are reduced, and the minimum user service is improved. However, because the strategy allocates more resources to weaker users, it does not produce the highest SINR peaks across the network. This confirms that DHI-Max-Min is more suitable for QoS-oriented downlink operation than for throughput-maximizing operation.
The DHI-Max-Product heat map shows a more balanced spatial pattern. Compared with DHI-Max-Min, it provides higher SINR values in several regions while still avoiding severe imbalance among users. This is because the product-based SINR objective encourages the improvement of multiple users simultaneously instead of focusing only on the weakest user or only on the strongest channel conditions. Therefore, DHI-Max-Product can be considered a balanced strategy for scenarios where both SINR improvement and QoS support are required.
The DHI-Max-Sum-Rate heat map shows stronger SINR values in selected areas, especially where AP–UE channel conditions are more favorable. This behavior is expected because the sum-rate objective prioritizes the maximization of aggregate spectral efficiency. However, this may create more spatial variation because users in weaker regions may receive less relative support compared with users in stronger channel conditions. Therefore, DHI-Max-Sum-Rate is more suitable for throughput-priority scenarios, while DHI-Max-Min remains more appropriate for QoS-oriented service coverage.
Overall, the heat map analysis confirms that the proposed DHI framework provides different spatial power-allocation behaviors according to the selected strategy. DHI-Max-Min improves spatial service uniformity, DHI-Max-Product provides a balanced SINR distribution, and DHI-Max-Sum-Rate increases high-SINR regions to improve total downlink throughput. These observations support the main objective of the proposed framework, which is to provide adaptive downlink power allocation under dynamic CF-mMIMO conditions.
Table 6 compares the computational time of the proposed DHI strategies with the DDPG benchmark. Computational time is an important metric for dynamic downlink power allocation because the power-control decision must be updated within a practical scheduling interval, especially under user mobility and time-varying channel conditions.
The results show that DHI-Max-Product and DHI-Max-Sum-Rate achieve the lowest mean computational times, with 0.0690 s and 0.0696 s, respectively. Compared with the DDPG benchmark, which requires 0.63 s, DHI-Max-Product reduces the mean computational time by approximately 89.05%, while DHI-Max-Sum-Rate reduces it by approximately 88.95%. This confirms that the proposed DHI framework can provide faster online execution for throughput-oriented downlink power allocation.
DHI-Max-Min records a higher mean computational time of 0.3757 s compared with DHI-Max-Product and DHI-Max-Sum-Rate. This is expected because DHI-Max-Min focuses on minimum user service and QoS-oriented power balancing, which requires more careful allocation to weaker users. However, DHI-Max-Min is still faster than the DDPG benchmark, reducing the mean computational time by approximately 40.37%.
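The quoted reductions follow directly from the mean computational times reported above, as the short check below shows. The helper name `reduction_percent` is introduced here only for illustration.

```python
def reduction_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


DDPG_MEAN_S = 0.63  # mean computational time of the DDPG benchmark
mean_times_s = {
    "DHI-Max-Product": 0.0690,
    "DHI-Max-Sum-Rate": 0.0696,
    "DHI-Max-Min": 0.3757,
}

reductions = {
    name: round(reduction_percent(DDPG_MEAN_S, t), 2)
    for name, t in mean_times_s.items()
}
print(reductions)
```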
The maximum duration values further support this observation. DHI-Max-Min reaches a maximum duration of 0.7430 s, while DHI-Max-Product and DHI-Max-Sum-Rate require only 0.3321 s and 0.3638 s, respectively. In comparison, DDPG reaches a maximum duration of 0.99 s. These values show that the proposed DHI strategies reduce not only the average execution time but also the worst-case computational delay compared with the benchmark.
The lower computational time of DHI-Max-Product and DHI-Max-Sum-Rate can be explained by the hybrid design of the proposed framework. The SAC actor provides an initial power-allocation decision through fast neural network inference, while L-BFGS-B refinement improves the objective within bounded power constraints. This avoids solving the full power-allocation problem from the beginning at every time step. Therefore, the proposed framework separates the expensive offline training phase from the faster online decision-making phase.
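The refinement step can be sketched as follows, assuming NumPy and SciPy are available. This is an illustration under strong simplifications, not the implementation used in the paper: the SINR expression is a toy interference model rather than the CF-mMIMO downlink SINR, the gain matrix, noise power, and power limit are hypothetical, and the vector `p_actor` stands in for the output of a trained SAC actor.

```python
import numpy as np
from scipy.optimize import minimize


def sinr(p, gain, noise):
    """Toy SINR model: desired power over interference plus noise."""
    signal = np.diag(gain) * p
    interference = gain @ p - signal
    return signal / (interference + noise)


def neg_sum_rate(p, gain, noise):
    """Negative sum spectral efficiency, since `minimize` minimizes."""
    return -np.sum(np.log2(1.0 + sinr(p, gain, noise)))


def refine(p_actor, gain, noise, p_max):
    """Refine an actor-proposed power vector with box-constrained L-BFGS-B."""
    x0 = np.clip(p_actor, 0.0, p_max)
    bounds = [(0.0, p_max)] * len(x0)
    result = minimize(neg_sum_rate, x0, args=(gain, noise),
                      method="L-BFGS-B", bounds=bounds)
    # Keep the actor output if the refinement did not improve the objective.
    if result.fun < neg_sum_rate(x0, gain, noise):
        return result.x
    return x0


# Hypothetical 3-user gain matrix (rows: receiving user, columns: transmit stream).
gain = np.array([[1.00, 0.10, 0.05],
                 [0.20, 0.80, 0.10],
                 [0.05, 0.15, 0.90]])
p_actor = np.array([0.5, 0.5, 0.5])  # stand-in for the SAC actor output
p_refined = refine(p_actor, gain, noise=0.1, p_max=1.0)
```

Starting the solver from the actor output, rather than from an arbitrary point, is what allows the online phase to avoid solving the allocation problem from scratch at every time step.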
Overall, the computational-time results confirm that the proposed DHI framework is suitable for dynamic CF-mMIMO downlink power allocation. DHI-Max-Sum-Rate is preferable when the main objective is high spectral efficiency with low execution time, DHI-Max-Product is suitable for balanced SINR and spectral-efficiency performance with very low computational cost, and DHI-Max-Min is appropriate when QoS-oriented service reliability is prioritized despite a higher computational time.
Figure 9 presents the downlink performance evolution of the proposed DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate strategies over the simulation steps. Figure 9a shows the moving average sum spectral efficiency, which smooths short-term fluctuations and highlights the general performance trend of each strategy. Figure 9b presents the instantaneous sum spectral efficiency, showing the step-by-step variation caused by mobility, channel changes, and dynamic power-allocation decisions. The results show that DHI-Max-Sum-Rate achieves the highest sum spectral-efficiency behavior, while DHI-Max-Product provides a close and stable performance. In contrast, DHI-Max-Min produces lower aggregate spectral efficiency because it prioritizes minimum user service and QoS-oriented allocation. Figure 9c shows the estimated QoS satisfaction behavior based on the spectral-efficiency threshold used in the evaluation, while Figure 9d presents the rolling standard deviation of the sum spectral efficiency as a stability indicator. A lower fluctuation level indicates more stable downlink performance over time. Overall, the figure confirms that the proposed DHI framework can maintain stable downlink performance under dynamic network conditions, with DHI-Max-Sum-Rate being more suitable for throughput-oriented operation and DHI-Max-Product providing a balanced performance trend.
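The moving average and rolling standard deviation used as trend and stability indicators can be computed as in the following minimal sketch. The window length and the sample trace are hypothetical, and the population standard deviation is assumed because the paper does not state which estimator is used.

```python
from statistics import mean, pstdev


def rolling(values, window, statistic):
    """Apply `statistic` to every full window of consecutive samples."""
    return [
        statistic(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


# Hypothetical instantaneous sum spectral efficiency per step (bit/s/Hz).
sum_se = [18.0, 20.0, 22.0, 20.0, 18.0]

moving_average = rolling(sum_se, window=3, statistic=mean)
rolling_std = rolling(sum_se, window=3, statistic=pstdev)
```

A flatter `rolling_std` trace corresponds to the more stable behavior described above.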
The overall results confirm that the proposed DHI framework provides a flexible and adaptive solution for dynamic downlink power allocation in CF-mMIMO systems. Instead of using a single fixed objective, the proposed framework supports three operating strategies: DHI-Max-Min, DHI-Max-Product, and DHI-Max-Sum-Rate. Each strategy produces a different performance behavior according to the selected downlink objective, as shown in Table 7.
DHI-Max-Min provides the strongest QoS-oriented performance. It achieves the highest QoS satisfaction rate and the lowest outage probability because it prioritizes the weakest users and improves minimum user service. This makes it suitable for scenarios where service reliability is more important than maximizing total throughput. However, this strategy provides the lowest mean sum spectral efficiency because more power is assigned to users with weaker channel conditions.
DHI-Max-Product provides the most balanced behavior among the three strategies. It improves the SINR distribution across users while maintaining better spectral-efficiency performance than DHI-Max-Min. Its low computational time also makes it suitable for practical dynamic downlink operation. Therefore, DHI-Max-Product can be considered an appropriate strategy when the network requires a balance between QoS support, SINR improvement, spectral-efficiency performance, and execution time.
DHI-Max-Sum-Rate achieves the highest sum spectral efficiency. This confirms that the sum-rate objective is effective when the main goal is to maximize aggregate downlink throughput. However, its QoS satisfaction rate is lower than DHI-Max-Min and DHI-Max-Product because throughput-oriented allocation may favor users with stronger channel conditions. Therefore, DHI-Max-Sum-Rate is most suitable for throughput-priority scenarios where maximizing total system spectral efficiency is the main requirement.
The comparison with the DDPG benchmark further confirms the advantage of the proposed SAC-based DHI framework. DHI-Max-Sum-Rate provides stronger sum spectral-efficiency behavior than DDPG, while DHI-Max-Product and DHI-Max-Sum-Rate achieve much lower computational time. This improvement is mainly due to the combination of SAC-based adaptive learning and L-BFGS-B refinement, which allows the proposed framework to generate effective power-allocation decisions without solving the complete optimization problem from the beginning at every time step.
The results also clarify that QoS in this paper should be interpreted as a QoS-aware performance objective rather than a strict guarantee for all users at all time steps. Under mobility, interference, and bounded transmit power, it is not always feasible for every UE to satisfy the SINR threshold simultaneously. For this reason, QoS is evaluated using QoS satisfaction rate and outage probability. This interpretation is consistent with the reported results and avoids over-claiming absolute QoS guarantee.
Overall, the proposed DHI framework demonstrates three main advantages. First, it adapts to dynamic downlink conditions through SAC-based learning. Second, it provides flexible strategy selection according to QoS-oriented, balanced, or throughput-oriented requirements. Third, it reduces computational time compared with the DDPG benchmark, especially for DHI-Max-Product and DHI-Max-Sum-Rate. These findings support the suitability of the proposed framework for dynamic downlink power allocation in CF-mMIMO systems.