Joint Optimization of Bandwidth and Power Allocation in Uplink Systems with Deep Reinforcement Learning

Wireless resource utilization is a central concern for future communication systems, where the explosive growth in the number of users aggravates interference and degrades communication quality, especially the inter-cell interference in multi-cell multi-user systems. To suppress this interference and improve the resource utilization rate, we propose a joint-priority-based reinforcement learning (JPRL) approach to jointly optimize the bandwidth and transmit power allocation. This method aims to maximize the average throughput of the system while suppressing the co-channel interference and guaranteeing the quality of service (QoS) constraint. Specifically, we decouple the joint problem into two sub-problems, i.e., the bandwidth assignment and power allocation sub-problems. A multi-agent double deep Q network (MADDQN) is developed to solve the bandwidth allocation sub-problem for each user, and a prioritized multi-agent deep deterministic policy gradient (P-MADDPG) algorithm, which deploys a prioritized replay buffer, is designed to handle the transmit power allocation sub-problem. Numerical results show that the proposed JPRL method accelerates model training and outperforms the alternative methods in terms of throughput. For example, the average throughput was approximately 10.4–15.5% higher than that of the homogeneous-learning-based benchmarks and about 17.3% higher than that of the genetic algorithm.


Introduction
The fifth generation (5G) and beyond fifth generation (B5G) era is driving mega growth in the number of mobile devices [1], resulting in explosively increasing demand that prompts the exploration of new technologies to ease the strain. Recently, the large-scale dense network has gradually developed into a trend for next-generation communication networks [2,3] due to its advantages in traffic capacity and diversified services [4]. The densification of the network [5] is one of the key features of the 5G wireless network architecture, which not only contributes to increasing the system capacity of 5G networks but is also closely related to user experience enhancement. As an important technique for improving the efficiency and quality of communications, multi-agent deep reinforcement learning (MADRL) can outperform the single agent in resource allocation, especially in the multi-cell multi-user system [34,35]. In [36], a joint resource allocation problem is settled by a MADRL method relying on independent Q-learning [37]. Similarly, Tian et al. [38] presented a DDPG-based MADRL method to allocate the channel and power by optimizing the QoS in vehicular networks.
Though MADRL has contributed great progress to the field of joint resource allocation, it typically still has the following limitations: (1) It generally ignores the importance of transitions when sampling a mini-batch from the replay buffer. In traditional MADRL, since complex communication environments usually contain a large amount of information, uniform experience replay leads to poor stability and slow convergence of the neural networks. (2) It weakens the interconnectivity between agents, especially in systems where each agent directly affects the others (for example, an action that benefits one agent individually may hinder the others). Therefore, the traditional MADRL, which uses a distributed training process to explore solutions, is unsuitable for capturing the action characteristics of each agent. (3) It is unrealistic to simplify the channel with a free-space propagation model, since this neglects the test scenarios of different channel models [39], including the urban macro-cell (UMa), rural macro-cell (RMa), and rural micro-cell (RMi) in IMT-2020.
Inspired by the success of DRL and the above research, a joint-priority-based RL (JPRL) method is proposed to maximize the average throughput, which considers the co-channel interference between different cells. Unlike traditional DRL algorithms that optimize multiple variables with a single model, we selected different algorithms to optimize the variables according to the problem properties and deployed a distributed learning and centralized training framework. The main contributions of this paper are summarized as follows:
• We propose a joint bandwidth and power allocation framework based on the JPRL method to maximize the average throughput of the uplink large-scale system, which considers the co-channel interference between different cells while assuring the QoS. Since the bandwidth assignment is a discrete problem while the power allocation is continuous, we decompose the joint problem into two sub-problems and use different algorithms to solve them.
• We propose a prioritized experience replay mechanism for power allocation. By analyzing the characteristics of the optimization sub-problems, the proposed replay mechanism is applied to a multi-agent DDPG (MADDPG), named the prioritized MADDPG (P-MADDPG), which trains on valuable experiences to improve the throughput during training, thereby coping with the infinite power action space.
• The proposed JPRL method is shown in Figure 1. It consists of a multi-agent DDQN (MADDQN) algorithm and the P-MADDPG algorithm, where the MADDQN is developed to solve the bandwidth assignment sub-problem, and the P-MADDPG is employed to solve the transmit power allocation sub-problem. Besides, both the MADDQN and the P-MADDPG use a centralized training framework with a joint action-value function.
The remainder of this paper is organized as follows. Section 2 introduces the system model and optimization problem. The proposed JPRL method is described in Section 3. Section 4 demonstrates the simulation results, and the conclusions are presented in Section 5.

System Model
Consider a large-scale uplink multi-cell multi-user network, where $M$ single-antenna base stations (BSs), collected by the set $\mathcal{M} = \{1, 2, \ldots, M\}$, are deployed at the centers of the $M$ cells, respectively. Assume that there are $N$ users, collected by the set $\mathcal{L}_m = \{l_{m,1}, l_{m,2}, \ldots, l_{m,N}\}$, in each cell $m$, where $l_{m,n}$ denotes the index of the $n$-th user in the $m$-th cell. The users of the considered system are collected by the set $\mathcal{K} = \{1, 2, \ldots, K\}$, where $K = MN$. The total bandwidth of the system is denoted by $W$ and is divided into three widths, collected by the set $\mathcal{B} = \{B_i\} = \{15\,\text{kHz}, 30\,\text{kHz}, 60\,\text{kHz}\}$, where $i \in \{1, 2, 3\}$ [40]. Let $\mathcal{X}_i = \{1, 2, \ldots, X_i\}$ denote the set of sub-bands of width $B_i$, where $X_i$ is the total number of allocated sub-bands of width $B_i$.
Since users in different cells occupy the same frequency band when transmitting their uplink signals, there exists interference between these users, which is called co-channel interference [41]. In this paper, each cell occupies the same frequency band and serves the same number of users $N$. Users in the same cell use different frequency sub-carriers, and thus each user $l_{m,n}$ is only subject to co-channel interference from users in other cells. Let $\mathcal{M}' = \{l_{m',n'} \mid m' \in \mathcal{M}, m' \neq m\}$ denote the set of interfering users; the users from the other cells belonging to the set $\mathcal{M}'$ interfere with user $l_{m,n}$. The channel gain between user $l_{m',n'}$ and BS $m$ at slot $t$ is represented by
$$g(d_{l_{m',n'},m}, t) = \beta_{l_{m',n'},m}\,|h_{l_{m',n'}}|^2,$$
where $\beta_{l_{m',n'},m} = 10^{-(PL_{l_{m',n'}} + \sigma_\beta z_\beta)/10}$ is the large-scale fading corresponding to the distance $d_{l_{m',n'},m}$ between user $l_{m',n'}$ and BS $m$, $PL_{l_{m',n'}}$ is the path loss of user $l_{m',n'}$, $\sigma_\beta$ is the standard deviation of shadow fading, $z_\beta \sim \mathcal{N}(0, 1)$ is a Gaussian random variable, and $h_{l_{m',n'}} \sim \mathcal{CN}(0, \sigma_h^2)$ is the small-scale fading with variance $\sigma_h^2$. Then, the power of the co-channel interference on user $l_{m,n}$ is expressed as
$$I_{l_{m,n}} = \sum_{l_{m',n'} \in \mathcal{M}'} p_{l_{m',n'}}\, g(d_{l_{m',n'},m}, t),$$
where $p_{l_{m',n'}}$ denotes the transmit power of user $l_{m',n'}$.
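As a concrete illustration, the composite channel gain above can be sampled as follows; this is a minimal sketch assuming the gain takes the form $g = \beta |h|^2$ with log-normal shadowing, and the function name is ours:

```python
import numpy as np

def channel_gain(path_loss_db, sigma_beta, sigma_h=1.0, rng=None):
    """Sample the composite channel power gain g = beta * |h|^2.

    beta = 10^(-(PL + sigma_beta * z) / 10) is the large-scale fading with
    shadowing z ~ N(0, 1); h ~ CN(0, sigma_h^2) is the small-scale fading.
    """
    rng = rng or np.random.default_rng()
    z = rng.standard_normal()
    beta = 10.0 ** (-(path_loss_db + sigma_beta * z) / 10.0)
    # Circularly symmetric complex Gaussian small-scale fading coefficient
    h = (rng.standard_normal() + 1j * rng.standard_normal()) * np.sqrt(sigma_h / 2.0)
    return beta * abs(h) ** 2
```

Note that the gain is a power gain, so it multiplies the transmit power directly in the interference and SINR expressions.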
The signal $y_{l_{m,n}}$ received by BS $m$ from user $l_{m,n}$ can be written as $y_{l_{m,n}} = x_{l_{m,n}} + I_{l_{m,n}} + n_{l_{m,n}}$, where $x_{l_{m,n}} = b_{l_{m,n}} \sqrt{g(d_{l_{m,n},m}, t)\, p_{l_{m,n}}}$ denotes the signal transmitted by user $l_{m,n}$, $b_{l_{m,n}}$ is the transmitted symbol from user $l_{m,n}$ to BS $m$, and $n_{l_{m,n}} \sim \mathcal{CN}(0, \sigma_{l_{m,n}}^2)$ is the additive white complex Gaussian noise. As a result, the received signal-to-interference-plus-noise ratio (SINR) at BS $m$ for user $l_{m,n}$ is given by
$$\xi_{l_{m,n}}\big(\mathbf{p}_{l_{m,n}}, B_{i,l_{m,n}}\big) = \frac{p_{l_{m,n}}\, g(d_{l_{m,n},m}, t)}{I_{l_{m,n}} + \sigma_{l_{m,n}}^2},$$
where $\sigma_{l_{m,n}}^2 = n_f B_{i,l_{m,n}}$ is the variance of the Gaussian white noise, $n_f$ is the power spectral density of the noise, $\mathbf{p}_{l_{m,n}}$ is the power vector that includes the power of user $l_{m,n}$ and its interfering users, and $B_{i,l_{m,n}}$ is the width-$i$ bandwidth allocated to user $l_{m,n}$. Then, considering the normalized rate [42], the achievable throughput of user $l_{m,n}$ at BS $m$ is $TH_{l_{m,n}} = \log_2\big(1 + \xi_{l_{m,n}}(\mathbf{p}_{l_{m,n}}, B_{i,l_{m,n}})\big)$.
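The SINR and normalized throughput above can be evaluated numerically as in the following sketch; the helper names and the dBm/Hz noise conversion are ours:

```python
import numpy as np

def sinr(p, g, p_interf, g_interf, bandwidth_hz, nf_dbm_hz=-174.0):
    """SINR of one user: desired power over co-channel interference plus noise.

    The noise variance is n_f * B (power spectral density times allocated
    width); n_f is given in dBm/Hz and converted to watts/Hz.
    """
    nf_w = 10.0 ** ((nf_dbm_hz - 30.0) / 10.0)
    noise = nf_w * bandwidth_hz
    interference = float(np.dot(p_interf, g_interf))  # sum of p' * g' terms
    return p * g / (interference + noise)

def throughput(p, g, p_interf, g_interf, bandwidth_hz):
    """Normalized achievable throughput log2(1 + SINR)."""
    return float(np.log2(1.0 + sinr(p, g, p_interf, g_interf, bandwidth_hz)))
```

Passing empty interference lists recovers the interference-free rate, which is useful for sanity-checking an allocation.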

Problem Formulation
This paper mainly focuses on maximizing the average throughput of the considered large-scale multi-cell multi-user system subject to the QoS of all users by jointly optimizing the transmit power and bandwidth allocation of all the users. Denoting the average throughput of all the users by $\overline{TH}$, the joint resource allocation problem is formulated as
$$\text{P1:} \quad \max_{\{p_{l_{m,n}},\, B_{i,l_{m,n}}\}} \ \overline{TH} = \frac{1}{K} \sum_{m \in \mathcal{M}} \sum_{l_{m,n} \in \mathcal{L}_m} TH_{l_{m,n}}$$
$$\text{s.t.} \quad \text{C1:} \ P_{\min} \le p_{l_{m,n}} \le P_{\max}, \quad \forall l_{m,n} \in \mathcal{L}_m, m \in \mathcal{M},$$
$$\text{C2:} \ \sum_{l_{m,n} \in \mathcal{L}_m} B_{i,l_{m,n}} \le W, \quad \forall m \in \mathcal{M},$$
$$\text{C3:} \ TH_{l_{m,n}} \ge TH_{\text{th}}, \quad \forall l_{m,n} \in \mathcal{L}_m, m \in \mathcal{M},$$
where $P_{\min}$ and $P_{\max}$ are the minimum and maximum transmit power of each user, respectively. Constraint C1 limits the transmit power budget per user; C2 indicates that the allocated bandwidth cannot exceed the total bandwidth of the system; and C3 ensures the QoS of each user, where $TH_{\text{th}}$ denotes the required minimum throughput. Note that $p_{l_{m,n}}$ and $B_{i,l_{m,n}}$ are the decision variables associated with user $l_{m,n}$, where $p_{l_{m,n}}$ is the allocated power of user $l_{m,n}$, and $B_{i,l_{m,n}}$ denotes the bandwidth of width $i$ assigned to user $l_{m,n}$. This paper aims at obtaining better throughput by jointly optimizing the two variables. Problem P1 is non-convex, and it is difficult to solve using traditional methods due to the high computational complexity. Furthermore, owing to the intricacy of the co-channel interference relationships in large-scale systems and the interaction between users in different cells, it is challenging to find an effective solution for the joint transmit power and bandwidth allocation directly. To tackle these challenges, we propose the JPRL method, which is well suited to the multi-cell multi-user system. In the proposed method, the MADDQN algorithm is used to allocate the bandwidth, and the P-MADDPG algorithm is developed to optimize the transmit power.
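A feasibility check for constraints C1-C3 can be written directly from the formulation; this is a sketch with hypothetical helper and argument names, treating C2 as a simple sum over the allocated widths:

```python
def feasible(powers, bandwidths, throughputs, p_min, p_max, total_bw, th_min):
    """Check constraints C1-C3 of Problem P1 for one candidate allocation."""
    c1 = all(p_min <= p <= p_max for p in powers)   # C1: per-user power budget
    c2 = sum(bandwidths) <= total_bw                # C2: total bandwidth limit
    c3 = all(th >= th_min for th in throughputs)    # C3: per-user QoS
    return c1 and c2 and c3
```

Such a check is also what drives the penalty term $c$ of the reward function introduced later.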

JPRL-Based Joint Resource Allocation Approach
The detailed structure of the joint uplink bandwidth and transmit power allocation is shown in Figure 2. Joint resource allocation often optimizes multiple variables simultaneously. However, for the joint allocation of bandwidth and transmit power, there exist infinite combinations of joint assignment schemes, influenced by the users' interactions, which leads to poor performance. In addition, the bandwidth assignment, with its limited choices, is a discrete assignment problem, rather than one over a continuous range such as the power allocation. Thus, we de-coupled problem P1 into two sub-problems and designed an efficient JPRL method to solve the joint resource allocation problem in the considered large-scale multi-cell multi-user system. Specifically, the MADDQN algorithm was developed to solve the bandwidth allocation sub-problem with a discrete action space, and the P-MADDPG algorithm was designed to solve the transmit power allocation sub-problem in the continuous domain. This resource assignment procedure satisfies the definition of a decentralized partially observable Markov decision process. Therefore, the proposed JPRL method models each user as an agent, which allows it to solve large-scale resource allocation while meeting the QoS constraints.
The RL problem can be described as a stochastic game, defined by a tuple $\langle \mathcal{K}, \mathcal{S}, \mathcal{A}, R, P \rangle$, where $\mathcal{K}$ is the set of agents, $\mathcal{S}$ and $\mathcal{A}$ denote the state space and the joint action space of all agents, respectively, $R$ is the reward function, and $P$ is the state transition probability. The game is concerned with the interaction between the environment and one or more agents over a series of iterations. In each iteration, an agent observes the environment state in $\mathcal{S}$ and takes an action from the action space $\mathcal{A}$. Then, the agent receives an immediate reward $R_t$ that reflects the quality of this iteration and observes a new state for the next step. Our goal is to maximize the long-term reward over the iterations. The details of the proposed framework are as follows.
• Agent: all users $\mathcal{K}$.
• State space: the state $s_k(t)$ of agent $k$ is defined by its co-channel interference, and the global environment state is thus the set of all agents' states, i.e., $S_t = \{s_1(t), s_2(t), \ldots, s_K(t)\}$.
• Action space: the actions of each agent consist of the bandwidth and power allocations and can be expressed as $a_k(t) = \{a^b_k(t), a^p_k(t)\}$.
• Reward function: since the whole performance is influenced by all users in the considered system, sparse rewards are a serious issue. Inspired by the long-term evaluation mechanism, in which previous lessons are indicative of the current learning, a novel reward function is defined as
$$R_t = TH_t - \overline{TH}_{t,\tau} - c,$$
where $TH_t$ denotes the average throughput at the current step $t$, $\tau$ denotes the moving window length, and $\overline{TH}_{t,\tau} = \frac{1}{\tau} \sum_{j=1}^{\tau} TH_{t-j+1}$ is the moving average of $TH_t$; $c$ is a nonnegative value, with $c = 0$ if constraint C3 of Problem P1 is satisfied for all users and $c > 0$ otherwise. Unlike typical reward functions that evaluate a single-step target against a fixed threshold, the proposed reward function employs a long short-term criterion that varies autonomously with the performance over time, which allows the agents to perform more stable exploration in the multi-cell multi-user system.
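The reward described above can be sketched as follows, assuming the form $R_t = TH_t$ minus the $\tau$-step moving average of the throughput, minus the penalty $c$ when constraint C3 is violated (the class name is ours):

```python
from collections import deque

class MovingAverageReward:
    """Reward R_t = TH_t - moving average of the last tau throughputs - c,
    where c > 0 penalizes a QoS (C3) violation and c = 0 otherwise."""

    def __init__(self, tau, penalty=1.0):
        self.history = deque(maxlen=tau)  # keeps at most tau past throughputs
        self.penalty = penalty

    def __call__(self, th_t, qos_satisfied):
        self.history.append(th_t)
        moving_avg = sum(self.history) / len(self.history)
        c = 0.0 if qos_satisfied else self.penalty
        return th_t - moving_avg - c
```

Consistent with the convergence discussion in the results section, this reward approaches zero once the throughput plateaus, since the moving average then tracks the current throughput.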
In the proposed JPRL method, we developed a distributed learning and centralized training framework, as shown in Figure 3, which allows the entire action space to be explored fully and encourages each agent to leverage the experience of the other agents. Specifically, all agents are guided by the harmonized loss feedback values of the MADDQN and P-MADDPG when learning the bandwidth and power allocations individually. The details of the proposed JPRL method are given as follows; its structure is illustrated in Algorithm 1, and the flow chart is shown in Figure 4. In the learning phase, the state of each agent is input into the MADDQN and P-MADDPG algorithms synchronously, and then each agent individually performs its bandwidth allocation and power allocation actions. Based on the actions, rewards and new states are generated and stored in the replay buffers of the two algorithms. Note that the reward is calculated by Equation (8), which corresponds to all the bandwidth and power actions. In the training phase, samples in the buffers are selected to compute the values that guide the agents in the direction of increasing throughput. The details are described as follows.

Bandwidth Allocation of MADDQN
Bandwidth allocation is a non-convex problem over a discrete space with finitely many choices; however, the size of the joint action space grows exponentially with the number of users. Therefore, a MADDQN algorithm with centralized training is presented to achieve sufficient exploration of the actions, which performs well in large-scale discrete action spaces.
A MADDQN model consists of a target Q network and an evaluated Q network, each holding its own copy of the neural network parameters. With multiple agents, an arbitrary agent taking actions to improve its own performance could degrade the overall performance, since the agents interact with each other; the mutual synergy between agents therefore cannot be ignored. To this end, the centralized training architecture defines a joint action-state function $Q^b_{\text{sum}}$ that composes the action-state functions of the different agents to promote cooperation between them:
$$Q^b_{\text{sum}}(S_t, A^b_t; \omega) = \sum_{k=1}^{K} Q^b_k\big(s_k(t), a^b_k(t); \omega\big),$$
where $\omega$ denotes the parameters of the evaluated Q network and $Q^b_k$ is the $k$-th user's action-state function based on its own state. In the training phase, the joint action-state function is used for back propagation to promote cooperation, and a mini-batch is randomly sampled from the replay memory $D_1$, which stores the states, actions, next states, and rewards of all the agents (note that all the agents share the same reward value), to minimize the loss function
$$L(\omega) = \mathbb{E}\Big[\big(R_t + \gamma_1\, Q^b_{\text{sum}}(S_{t+1}, A^b_{t+1}; \omega^-) - Q^b_{\text{sum}}(S_t, A^b_t; \omega)\big)^2\Big],$$
where $\mathbb{E}[\cdot]$ denotes the mathematical expectation and $\gamma_1$ is the discount ratio. For each agent, the soft update of the target network is given by
$$\omega^- \leftarrow \kappa\, \omega + (1 - \kappa)\, \omega^-,$$
where $\omega^-$ denotes the parameters (including the weights and biases) of the target Q network and $\kappa \in (0, 1)$ is the soft-update coefficient.
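The double-DQN target and the soft update used by the MADDQN can be sketched with per-agent Q-values represented as plain arrays; this is a generic sketch, and the function names are ours:

```python
import numpy as np

def ddqn_target(reward, q_eval_next, q_target_next, gamma):
    """Double-DQN target: the evaluated network selects the next action,
    the target network scores it, decoupling selection from evaluation."""
    a_star = int(np.argmax(q_eval_next))
    return reward + gamma * q_target_next[a_star]

def soft_update(target_params, eval_params, kappa=0.01):
    """Polyak averaging of target-network parameters toward the evaluated
    network: w_target <- kappa * w_eval + (1 - kappa) * w_target."""
    return [kappa * w + (1.0 - kappa) * w_t
            for w, w_t in zip(eval_params, target_params)]
```

In a joint-Q setting, the same target construction is applied to the summed per-agent Q-values.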
In the multi-cell multi-user system, the MADDQN model of agent $k$ chooses its bandwidth assignment action according to its own state $s_k(t)$ in step $t$. Note that the agents can share their past training process (states and the influence observed during training). Then, all the agents are trained centrally to minimize the loss value through $Q^b_{\text{sum}}$.

P-MADDPG-Based Uplink Power Allocation
For power allocation, a huge discretized action space is not helpful for exploration. In addition, although discrete DRL algorithms can quantize the power, they ignore the diversity of power choices. To this end, a novel P-MADDPG algorithm is proposed to solve the transmit power allocation sub-problem. It is an enhancement of the multi-agent DDPG with a prioritized replay buffer. In contrast to power quantization, the P-MADDPG directly outputs the power of all the users in a continuous domain with infinite choices. Furthermore, by applying the prioritized replay buffer, it is more sensitive to the negative effect of bad actions than the general MADDPG algorithm.
Similar to the DDPG, an actor-critic architecture [43] is applied for learning and training; both the actor and critic networks of each agent contain two identical neural networks, named the online network and the target network, respectively. For a multi-agent system, the actor network of agent $k$ outputs the power allocation under the current state through a policy $\pi$, i.e., $a^p_k(t) = \pi(s_k(t))$. However, the inherent exploration-exploitation dilemma in DRL makes an inflexible action policy undesirable. Borrowing from the DQN, this is balanced by a stochastic noise whose function is similar to the $\epsilon$-greedy mechanism. Consequently, the actions of all agents are written as
$$a^p_k(t) = \pi\big(s_k(t); \omega_{\mu}\big) + \Sigma_t, \quad k \in \mathcal{K},$$
where $\omega_{\mu}$ is the weight of the actor network and $\Sigma_t$ follows a Gaussian distribution $\mathcal{N}(0, \sigma_e^2)$; the variance $\sigma_e^2$ of the Gaussian noise decreases linearly to zero as the iterations proceed. Similarly, applying an individual action-value function to each agent ignores the features of the others, which reduces the learning stability and weakens agent interaction. To this end, the critic network uses a joint action-value function $Q^p_{\text{sum}}(S_t, A^p_t)$ to evaluate all actions. The specific $Q^p_{\text{sum}}$ is defined as
$$Q^p_{\text{sum}}(S_t, A^p_t) = \mathbb{E}_{D_2}\big[R_t + \gamma_2\, Q^p_{\text{sum}}(S_{t+1}, A^p_{t+1})\big],$$
where $D_2$ is the experience replay buffer and $\gamma_2 \in (0, 1]$ is a discount factor. According to the deterministic policy gradient theorem, the action-value function $Q^p_{\text{sum}}$ is used to update the actor parameters $\omega_{\mu}$ in the direction of increasing the cumulative discounted reward with the $D$ samples of a mini-batch, that is,
$$\nabla_{\omega_{\mu}} J \approx \frac{1}{D} \sum_{d=1}^{D} \nabla_{a} Q^p_{\text{sum}}\big(S_d, a; \omega_Q\big)\big|_{a = \pi(S_d; \omega_{\mu})}\, \nabla_{\omega_{\mu}} \pi\big(S_d; \omega_{\mu}\big),$$
where $\omega_Q$ is the weight of the critic network. A common method for training neural networks is to sample mini-batches randomly and uniformly from the buffer $D_2$, which often results in a high probability of selecting bad actions among the vast combinations of different actions, thereby lowering performance; such uniform sampling is inefficient and of little help in guiding the networks to update in the correct direction.
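The linearly decaying Gaussian exploration noise and the clipping of the power into $[P_{\min}, P_{\max}]$ can be sketched as follows (the function name and the exact decay schedule are our assumptions):

```python
import numpy as np

def noisy_action(policy_power, step, total_steps, sigma0, p_min, p_max, rng=None):
    """Add Gaussian exploration noise whose standard deviation decays
    linearly to zero over training, then clip into [p_min, p_max]."""
    rng = rng or np.random.default_rng()
    sigma = sigma0 * max(0.0, 1.0 - step / total_steps)  # linear decay
    action = policy_power + rng.normal(0.0, sigma)
    return float(np.clip(action, p_min, p_max))
```

Once the noise has decayed to zero, the clipped deterministic policy output is used directly.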
Considering the transition samples of all agents, we designed the P-MADDPG algorithm to enhance the MADDPG with a customized prioritized experience replay technique, where more important transition samples have a higher probability of being replayed to participate in the network update. Specifically, in each step $t$, the transition samples of all agents are measured by a corresponding importance value $V_t$, which is combined with $S_t$, $A^p_t$, $R_t$, and $S_{t+1}$ to form a tuple $\langle S_t, A^p_t, R_t, S_{t+1}, V_t \rangle$ stored in $D_2$. Similar to the MADDQN, the goal of the P-MADDPG update is to minimize the gap between the joint Q-value and the target joint Q-value, i.e., the joint temporal-difference (JTD) error. Transitions with a large JTD error contain more information and are more necessary for the update of the neural networks. Thus, the JTD error is a reasonable proxy for the importance, and $V_t$ is written as
$$V_t = \big| R_t + \gamma_2\, Q^{p\prime}_{\text{sum}}(S_{t+1}, A^p_{t+1}) - Q^p_{\text{sum}}(S_t, A^p_t) \big|,$$
where $Q^{p\prime}_{\text{sum}}$ is the joint action-value function of the target networks. However, in the sampling process, transition samples stored early with small JTD errors may never be replayed if the sampling relies on the importance alone. This can result in over-fitting, since the system then lacks sampling diversity over the transitions. To avoid this issue, a probability associated with the importance is assigned to each transition sample, which overcomes the above problem effectively. The probability of an arbitrary transition sample $\phi$ at step $t$ is expressed as
$$P^{\phi}_t = \frac{(V^{\phi}_t)^{\alpha}}{\sum_{j}(V^{j}_t)^{\alpha}},$$
where $\alpha \in [0, 1]$ is a contribution factor that controls the impact of the importance. In particular, $\alpha = 0$ means that all samples are sampled with equal probability, i.e., the importance makes no contribution (uniform sampling). The original samples are uniformly distributed in the replay buffer, but prioritized experience replay introduces bias, since it changes the original distribution by assigning different probabilities to the transitions.
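The importance-based sampling probabilities of Equation (16) can be sketched as follows; the small epsilon keeping zero-error transitions sampleable is an implementation detail we add, not from the paper:

```python
import numpy as np

def sampling_probabilities(jtd_errors, alpha=1.0, eps=1e-6):
    """P_phi = V_phi^alpha / sum_j V_j^alpha, with V the |JTD error|.

    alpha = 0 recovers uniform sampling; eps > 0 keeps transitions with a
    zero JTD error from becoming unsampleable."""
    v = np.abs(np.asarray(jtd_errors, dtype=float)) + eps
    p = v ** alpha
    return p / p.sum()
```

With `alpha=1` (the value used in the simulations), the probability is directly proportional to the JTD error.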
A compensation weight is thus introduced to correct this bias, which is expressed as
$$w_{\phi} = \frac{\big(|D_2|\, P^{\phi}_t\big)^{-\beta}}{\max_{j} w_{j}},$$
where $\beta \in [0, 1]$ is a hyperparameter that regulates the degree of bias compensation. In particular, there is no compensation for the non-uniform probabilities $P^{\phi}_t$ if $\beta = 0$; there is partial compensation if $0 < \beta < 1$; and there is full compensation if $\beta = 1$. As a result, the loss of a mini-batch after weight compensation is rewritten as
$$L = \frac{1}{D} \sum_{\phi=1}^{D} w_{\phi}\, \big(V^{\phi}_t\big)^2. \quad (19)$$
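The bias-correction weights and the compensated loss of Equation (19) can be sketched as follows; normalizing by the maximum weight is a common convention that we assume here:

```python
import numpy as np

def importance_weights(probs, beta):
    """w_phi = (N * P_phi)^(-beta), normalized by the maximum weight so
    that the correction only scales gradients down, never up."""
    probs = np.asarray(probs, dtype=float)
    w = (len(probs) * probs) ** (-beta)
    return w / w.max()

def weighted_td_loss(jtd_errors, weights):
    """Mean of the bias-corrected squared JTD errors over a mini-batch."""
    e = np.asarray(jtd_errors, dtype=float)
    return float(np.mean(weights * e ** 2))
```

With uniform probabilities the weights reduce to one, recovering the ordinary mean-squared TD loss.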

Algorithm 1: The proposed JPRL method
1: Initialize the evaluated and target networks of the MADDQN and P-MADDPG, and the replay buffers D1 and D2.
2: for episode i = 1, · · · , I do
3:   Initialize the actions of all agents.
4:   for step t = 1, · · · , T_i do
5:     for agent k = 1, · · · , K do
6:       if random number ζ < ε_t then
7:         Randomly choose a^b_k(t) from the bandwidth allocation action space.
8:       else
9:         Choose a^b_k(t) greedily from the evaluated Q network.
10:      end if
11:      Choose power allocation a^p_k(t) = [π(s_k(t)) + Σ_t]^{P_max}_{P_min}.
12:      Execute actions a^b_k(t), a^p_k(t) and observe the next state s_k(t + 1).
13:    end for
14:    Calculate the reward with all agents' actions by Equation (8).
15:    Store the bandwidth allocation transition ⟨S_t, A^b_t, S_{t+1}, R_t⟩ in D1.
16:    Store the power allocation transition φ, ⟨S_t, A^p_t, S_{t+1}, R_t, V^φ_t⟩, in D2.
17:    if both D1 and D2 are full then
18:      Sample a mini-batch of transitions from D1.
19:      Sample a mini-batch of transitions from D2 according to the sample importance.
20:      Compute the action-value functions of the MADDQN and P-MADDPG according to Equations (9) and (14), respectively.
21:      Update the evaluated Q network of the MADDQN by Equation (10).
22:      Update the actor online network by Equation (15).
23:      Update the critic online network by Equation (19).
24:      Update the MADDQN and P-MADDPG target networks by soft updating.
25:    end if
26:  end for
27: end for

For a mini-batch with D samples, directly traversing the experience buffer D2 requires a full pass per draw, and the complexity is intolerable. To tackle this, a sum-tree structure is designed for D2, where sample φ is stored with its sampling probability P^φ_t. As shown in Figure 2, the structure is a binary tree with a root node at the top, where each node of an upper level has exactly two child nodes. The leaf nodes at the bottom store the tuples ⟨S_t, A^p_t, R_t, S_{t+1}, V^φ_t⟩ of the transitions together with their probabilities according to Equation (16), and the value of each internal node is the sum of its children's values. We divide the value of the root node (the sum of the probabilities of all samples) into D segments of equal width. In each segment, a random value no larger than the segment's range is generated to backtrack a leaf node from top to bottom. The backtracking rule is applied repeatedly until a leaf node is selected: if the random value is less than or equal to the value of the left child node, backtracking continues from the left child; otherwise, it continues from the right child, and the difference between the random value and the value of the left child is used as the basis for the next step. Then, the critic and actor networks are updated with the selected transition samples. The proposed JPRL method is summarized in Algorithm 1.
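The sum-tree and the segment-based backtracking described above can be sketched as follows (a minimal implementation; class and method names are ours):

```python
import numpy as np

class SumTree:
    """Binary sum-tree for O(log N) prioritized sampling: each internal
    node stores the sum of its children; leaves hold transition priorities."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)  # internal nodes + leaves
        self.data = [None] * capacity
        self.write = 0

    def add(self, priority, transition):
        idx = self.write + self.capacity - 1
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx > 0:                      # propagate the change to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, value):
        """Backtrack from the root: go left if the value fits in the left
        child's mass, otherwise subtract that mass and go right."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return self.data[idx - self.capacity + 1], self.tree[idx]

def sample_batch(tree, batch_size, rng):
    """Split the total priority mass into equal segments and draw one
    uniform value per segment, as in the backtracking rule above."""
    segment = tree.tree[0] / batch_size
    return [tree.get(rng.uniform(i * segment, (i + 1) * segment))
            for i in range(batch_size)]
```

The root always holds the total priority mass, so each draw costs one root-to-leaf walk instead of a scan over the whole buffer.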

Time Analysis of the Proposed JPRL Method
We analyzed the time complexity of the proposed JPRL method. In Algorithm 1, let $I$ be the total number of training episodes and $T_i$ be the number of training steps in episode $i$. The total number of training iterations, $\sum_{i=1}^{I} T_i$, therefore determines the time complexity. For each iteration, the computational cost is governed by the size of the neural networks, i.e., the number of parameters; according to [44], this per-iteration cost is a constant $c$ for fixed network architectures. The total time complexity of the JPRL method is thus $O\big(c \sum_{i=1}^{I} T_i\big)$.

Simulations
In this section, we evaluate the performance of the proposed JPRL method. First, the simulation setup is described. Then, the experimental results are discussed in terms of convergence, learning rate analysis, and performance comparison. Lastly, the performance of the proposed method under different channel models is exhibited.

Setup
Parameter Setting of the Environment: We set seven base stations at the cell centers, with four users randomly distributed in each cell. The uplink user power was limited to P_min = −40 dBm and P_max = 23 dBm [40]. The total bandwidth of the system was W = 20 MHz. The minimum throughput requirement of all the users was TH_th = 0.15 bit/s, and the power spectral density n_f was −174 dBm/Hz.
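Since the power limits above are specified in dBm while the rate computation works in linear units, a small conversion helper is useful (a standard formula, not specific to this paper):

```python
def dbm_to_watts(p_dbm):
    """Convert a power level from dBm to watts: P_W = 10^((P_dBm - 30) / 10)."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)
```

For example, the maximum uplink power of 23 dBm corresponds to roughly 0.2 W, while −40 dBm is 0.1 µW.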
The size of the cells and the channel model change according to the scenario [39], referencing the test scenarios in the 3GPP protocol, such as the UMa, RMa, and RMi. By default, the outdoor non-line-of-sight scenario of the RMa model was selected to evaluate the proposed method. The RMa model stipulates the radius $r$ of a cell, and the path loss is defined as
$$PL_{l_{m,n}} = \max\big(PL_{l_{m,n},1}, PL_{l_{m,n},2}\big), \quad (20)$$
where $PL_{l_{m,n},1}$ and $PL_{l_{m,n},2}$ denote the line-of-sight and non-line-of-sight path losses, respectively. The line-of-sight path loss is piecewise:
$$PL_{l_{m,n},1} = \begin{cases} PL_{l_{m,n},11}, & 10\,\text{m} < d_h < d_{BP}, \\ PL_{l_{m,n},12}, & d_{BP} < d_h, \end{cases} \quad (21)$$
where
$$PL_{l_{m,n},11} = \min(0.03 h_b^{\varepsilon}, 10) \lg(d_s) - \min(0.044 h_b^{\varepsilon}, 14.77) + 0.002 \lg(h_b)\, d_s + 20 \lg(40 \pi d_s f_c), \quad (22)$$
and $PL_{l_{m,n},12}$ extends $PL_{l_{m,n},11}$ beyond the breakpoint distance $d_{BP}$. Here, $d_s = \sqrt{d_h^2 + (h_{a1} - h_{a2})^2}$ and $d_h = d_{l_{m,n},m}$ denote the straight-line and horizontal distances between the BS and the user, respectively, where $h_{a1}$ and $h_{a2}$ are the antenna heights at the BS and the user, respectively. $h_b$ is the building height, $l_w$ is the average width of the road, and $\varepsilon$ is the excitation factor. For the long-distance line-of-sight path loss $PL_{l_{m,n},12}$, $f_c$ is the central frequency, and $v$ denotes the propagation velocity. These parameter settings are listed in Table 1.
In this paper, five benchmarks were considered: (1) DDQN and DDPG: the existing DDQN for bandwidth assignment combined with the DDPG for allocating the power; the DDQN used a one-layer fully connected network, the DDPG deployed two-layer fully connected networks in the actor and critic networks, and both adopted uniform-sampling-based experience replay; (2) DDQN and P-DDPG; (3) MADDQN and MADDPG with centralized training (ct); (4) MADDQN and MADDPG with decentralized training (dt); and (5) GA: the GA framework in DEAP was used to realize this benchmark [45], where the bandwidth and power allocation schemes were encoded into the chromosome of each individual, i.e., the action sequence of the bandwidth and power allocations of all the users. We set the population size to 200, and the crossover rate and mutation rate were set to 0.8 and 0.05, respectively.
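The RMa path-loss model of Equations (20)-(22) can be sketched as follows; this assumes $f_c$ is given in GHz with the $40\pi d_s f_c/3$ argument of 3GPP TR 38.901, and a beyond-breakpoint extension $PL_{12} = PL_{11}(d_{BP}) + 40\lg(d_s/d_{BP})$ from the same specification, since the text's version of these formulas is partially garbled:

```python
import math

def pl_rma_los_short(d_s, h_b, f_c_ghz, epsilon=1.72):
    """RMa line-of-sight path loss (dB) for 10 m < d_h < d_BP, per
    Equation (22); h_b is the building height, epsilon its exponent."""
    scale = min(0.03 * h_b ** epsilon, 10.0)
    return (scale * math.log10(d_s)
            - min(0.044 * h_b ** epsilon, 14.77)
            + 0.002 * math.log10(h_b) * d_s
            + 20.0 * math.log10(40.0 * math.pi * d_s * f_c_ghz / 3.0))

def pl_rma_los(d_s, d_bp, h_b, f_c_ghz):
    """Piecewise LoS path loss of Equation (21), using the assumed
    beyond-breakpoint form PL_12 = PL_11(d_BP) + 40 * lg(d_s / d_BP)."""
    if d_s < d_bp:
        return pl_rma_los_short(d_s, h_b, f_c_ghz)
    return pl_rma_los_short(d_bp, h_b, f_c_ghz) + 40.0 * math.log10(d_s / d_bp)

def pl_rma(pl_los, pl_nlos):
    """Equation (20): the RMa path loss takes the max of LoS and NLoS."""
    return max(pl_los, pl_nlos)
```

Taking the maximum of the LoS and NLoS branches makes the model conservative in the NLoS-dominated regime used by the default simulation scenario.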
Note that the GA depends on fitness rather than a learning-based reward; thus, it is appropriate to compare the results after final convergence instead of comparing the entire optimization process with the learning-based approaches. Hyperparameter Setting of JPRL: The JPRL method contains the MADDQN algorithm and the P-MADDPG algorithm. The experience buffers of the two algorithms have the same size, set to |D1| = |D2| = 10,000. The learning rates of the MADDQN and of the actor and critic networks of the P-MADDPG were set to 0.0001. Furthermore, we set the hyperparameters of the prioritized replay buffer D2 in the P-MADDPG to α = 1 and β = 0.1. In the training phase, both the MADDQN and the P-MADDPG used the Adam optimizer to minimize the loss function. The sampling batch size was D = 128, and the reward discount factors were γ1 = γ2 = 0.89. The system began to train the neural networks when the memory buffers were full and updated the neural parameters at a one-step frequency thereafter. Besides, we set the number of episodes to I = 500. Note that the episodes did not have a fixed number of steps; to determine whether an episode was completed, a done flag was designed, which was set to true if the reward R increased within 200 steps and false otherwise (the learning of the episode was not finished). The other parameters of each neural network are listed in Table 2.
All experiments were run on a computer with a 12th Gen Intel(R) Core(TM) i7-12700F @ 2.10 GHz and 16 GB of RAM. The simulation results were produced using NumPy 1.21.5 and TensorFlow 2.3.0 on the Python 3.6 platform.

Results
The setting of the learning rate has a profound impact on learning the allocation scheme with the proposed method, as it determines the ability to explore the action space. Specifically, higher learning rates are detrimental to the exploration of the action space, as well as to the updating of the network parameters in large systems with large action spaces. Conversely, a lower learning rate implies finer-grained exploration, which does not necessarily mean that better actions are explored, since having more candidate actions in a large action space can degrade performance. Thus, it is necessary to study the setting of the learning rate in the multi-cell multi-user system. Firstly, Figure 5 compares the loss values of the multiple networks under different learning rates; to view the variation and performance clearly, the loss values within the first 3000 steps after training are given. Figure 5a-c imply an interaction between the MADDQN and the P-MADDPG in the proposed method. It is worth noting that the curves in Figure 5b show a clear loss reduction in the P-MADDPG with a lower learning rate, which reveals that the MADDQN exploring the bandwidth influences the P-MADDPG training. However, as shown in Figure 5c, a decrease in the loss value does not signify an increase in throughput; the model may instead be trapped in a sub-optimum. As a result, we set the learning rate of the MADDQN to 0.0001 to achieve a high throughput and a fast convergence speed of the P-MADDPG. Figure 6 illustrates the loss values and throughput of the actor and critic networks in the P-MADDPG at different learning rates. Among the learning rates of the actor network, the proposed JPRL achieved the best performance in terms of the loss value and throughput when the learning rate was 0.0001. The loss curves of the MADDQN in Figure 6a show a slight increase after 1000 steps, and a similar trend appears in Figure 6d.
The reason is that the power actions selected by the P-MADDPG affected the training process of the MADDQN. As shown in Figure 6b,c, the smaller the learning rate, the better the performance, since a larger learning rate may skip over good actions within the infinite action space. Finally, from Figure 6d-f, in large action spaces, a critic network with a higher learning rate converged faster but to a worse value; a larger learning rate of the critic network implies more coarse-grained exploration, which is prone to learning sub-optimal solutions. As a result, when the learning rates of the actor and critic networks were both set to 0.0001, our method could jump out of the local optimum. Recording the reward every 200 steps, Figure 7 plots the reward values of the proposed method and the benchmarks, which include the DDQN and DDPG, the DDQN and P-DDPG, and the MADDQN and MADDPG based on centralized training (ct) and decentralized training (dt). In the early random exploration process (before the buffers are full), the rewards decrease to negative values, because some users' throughput does not meet the QoS requirement. As the system begins to train, all five curves increase sharply. After a period of training, the moving average of the average throughput TH_{t,τ} approaches the average throughput TH_t, i.e., the reward approaches 0, which indicates that the methods have fallen into a local optimum or converged to an optimum. It is seen that the curve of the MADDQN and MADDPG (dt) fluctuated more than that of the MADDQN and MADDPG (ct). As a result, Figure 7 indicates that the JPRL method has an excellent ability to jump out of sub-optimal solutions and obtain good feedback. Figure 8 illustrates the average throughput of the different methods after 500 episodes.
In the random exploration stage, the throughput is unstable and relatively small because of the Gaussian noise and the randomly selected actions. All methods are prone to getting stuck in a local optimum during learning, and the average throughput fluctuates slightly because of the Gaussian noise. Since a small change in the power of any user may cause a large variation in the co-channel interference, the benchmarks fall into local optima easily and struggle to escape them. We can also see that the joint method MADDQN and MADDPG (dt) was extremely unstable, since decentralized training favors the individual performance of each agent at the expense of the overall performance. In other words, an agent that raises its power for its own benefit while ignoring the other agents increases the interference and decreases the throughput. The proposed JPRL outperformed the other methods in terms of throughput, since it explored the action spaces fully. It should also be observed that the average throughput decreased as the number of cells M increased, because more cells bring more interfering users l_{m,n} and, hence, stronger co-channel interference. Clearly, the RL-based approaches were far superior to the GA, which fell into local optima easily. The proposed JPRL also had a steeper curve than the others, since it explored the smaller action spaces better as the number of cells decreased. Therefore, the JPRL method could achieve high throughput.
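The sensitivity of a user's throughput to any single interferer's power follows from the standard Shannon-rate formulation (a generic sketch with assumed symbols and numeric values; the paper's exact channel gains and rate expression are not reproduced here):

```python
import math

def throughput_bps(bandwidth_hz, p_rx_signal, interferer_rx_powers, noise_power):
    """Shannon rate B * log2(1 + SINR) for one uplink user.

    p_rx_signal: received power of the desired user at its base station.
    interferer_rx_powers: received powers of co-channel users l_{m,n} in other cells.
    """
    sinr = p_rx_signal / (sum(interferer_rx_powers) + noise_power)
    return bandwidth_hz * math.log2(1.0 + sinr)

# Raising one interferer's received power lowers the victim user's throughput,
# which is why a small power change by any user perturbs the whole system.
base = throughput_bps(1e6, 1e-9, [1e-11], 1e-12)
with_stronger_interferer = throughput_bps(1e6, 1e-9, [1e-10], 1e-12)
```

Because the interference term sits inside the SINR denominator of every co-channel user, one agent's power action couples all agents' rewards, motivating the multi-agent formulation.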
As shown in Figure 10, we further tested the average throughput of the proposed JPRL under different channel models, including RMa, RMi, and UMa. The average throughput of the users in the urban environment (UMa model) is generally lower than that of the users in the rural scenarios (RMa and RMi models), because many users concentrated in a small area cause severe interference. It can be seen that the JPRL method is applicable to different environments.

Conclusions
This paper studied resource allocation to maximize the throughput by jointly optimizing the bandwidth assignment and power allocation subject to the QoS constraint in a multi-cell multi-user uplink system. According to the variable attributes of the joint resource-allocation problem, we proposed the JPRL method to decouple the optimization problem into two sub-problems, where the MADDQN allocated the bandwidth and the P-MADDPG assigned the uplink power using the importance of each transition. To compare the loss values and learning performance of the different networks under various learning rates, we set appropriate parameters for the proposed JPRL method and analyzed the impact of the different learning rates. Furthermore, we evaluated the reward value and throughput of the proposed JPRL method against existing methods. The simulation results showed that our approach can (1) obtain better performance and adapt better to complex environments than the alternative methods (e.g., the average throughput was approximately 10.4–15.5% higher than that of the benchmarks) and (2) be applicable to other large-scale scenarios.
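For reference, the "importance of transition" idea behind the P-MADDPG's prioritized replay buffer can be sketched as proportional prioritized experience replay; the hyper-parameters alpha and beta and the class layout below are illustrative assumptions, not the paper's implementation.

```python
import random

class PrioritizedReplayBuffer:
    """Proportional prioritized replay: sampling probability ~ priority^alpha."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha  # alpha = 0 recovers uniform sampling
        self.data, self.priorities = [], []
        self.pos = 0

    def add(self, transition, priority):
        p = priority ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:  # overwrite the oldest entry once the buffer is full
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        n = len(self.data)
        # importance-sampling weights correct the bias of non-uniform sampling
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        return [self.data[i] for i in idxs], idxs, [w / max_w for w in weights]
```

High-priority transitions (e.g., those with large temporal-difference error) are replayed more often, which is one way such a buffer can accelerate critic training relative to uniform replay.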
It is worth noting that, for simplicity, a single-antenna system was considered in this work. For multi-antenna systems such as MIMO, the impact on user interference of the more complex channel matrices introduced by multiple antennas needs to be considered. In future work, multiple antennas, user trajectories, and cloud computing will be taken into consideration in multi-cell systems to facilitate communication-computing integration. By accounting for the interference corresponding to the complex channel matrix, the optimization becomes a trade-off between computing delay and energy consumption, based on resource allocation and task offloading under various constraints, such as QoS constraints and offloading decisions. Moreover, a multi-dimensional, in-depth analysis will be conducted to validate this system trade-off.