Reinforcement Learning-Based UAVs Resource Allocation for Integrated Sensing and Communication (ISAC) System

Abstract: Due to the limited capability of a single unmanned aerial vehicle (UAV), groups of UAVs have attracted increasing attention in the communication and radar fields. An integrated sensing and communication (ISAC) system lets the communication and radar modules share the radar module's resources and, coupled with efficient resource allocation methods, can effectively address the inadequate resources of UAVs and their low utilization rate. In this paper, the resource allocation problem is addressed for group UAVs equipped with the ISAC system, in order to achieve a trade-off between detection and communication performance. The resource allocation is formulated as an optimization problem, but for group UAVs the problem is complex and cannot be solved efficiently. In contrast to traditional resource allocation schemes, which require heavy computation or a prepared sample set, a novel reinforcement-learning-based method is proposed. We formulate a new reward function by combining the mutual information (MI) and the communication rate (CR). The MI describes the radar detection performance, and the CR describes the wireless communication performance. Simulation results show that, compared with the traditional Kuhn-Munkres (KM) method and the deep neural network (DNN) method, the proposed method performs better as the problem complexity increases. Additionally, the execution time of the proposed scheme is close to that of the DNN scheme and shorter than that of the KM algorithm.


Introduction
In recent years, because of the limited capability of a single unmanned aerial vehicle (UAV), group UAVs have been proposed for complex applications. As modernization and intelligent technologies spread, the intelligence of UAVs has attracted growing attention. Group UAVs have prominent advantages such as high autonomy, multiple functions, timeliness, strong anti-damage ability and low cost. Their applications include logistics distribution, agricultural plant protection, reconnaissance and raids, electronic countermeasures, communication and navigation [1][2][3][4][5][6]. At present, group UAVs mainly use public resources to perform these tasks, which causes information leakage and communication interference between UAVs. With the exponential growth in the number of UAVs, expanding resources and improving their utilization rate have become urgent demands. Additionally, group UAVs are expected to have detection, localization and communication abilities according to their tasks. However, the weight and cost of UAV equipment limit the resources available for communication and detection. In view of these considerations, an integrated sensing and communication (ISAC) system, which is low-cost, lightweight and highly integrated, has been proposed, using the same hardware for both functions [7].
The ISAC protocol, system architecture design [8], signal-sharing [8][9][10][11][12][13], time-sharing [14], array-sharing [15], spectrum-sharing [16] and power-sharing [17] have been studied to UAV resources but also improves the anti-jamming ability of UAVs from the common frequency band.
• A novel reinforcement-learning-based method is proposed to solve the complex problem, where we formulate a new reward function by combining both the MI and the CR. The MI describes the radar detection performance, and the CR describes the wireless communication performance.
The rest of this paper is organized as follows: In Section 2, the system model is formulated. In Section 3, the design of the algorithms is introduced. In Section 4, the analysis and discussion of simulation results of different algorithms in different group UAVs under different environments are given. The conclusions are presented in Section 5.

Group UAVs Resource Allocation Model for ISAC System
Consider the ISAC system for group UAVs shown in Figure 1, where a group of $K$ UAVs share information with each other and detect targets. Three kinds of resources, namely $X$ beams, $Y$ power levels and $Z$ channels, are allocated among the group UAVs to achieve a trade-off between wireless communication and target detection. Because of its low cost, light weight and high level of integration, the ISAC system is employed in each UAV. When resources are limited, allocating them reasonably balances the two tasks. The problem can be described as finding the maximum of the reward of the ISAC system for group UAVs, $R(b, c, p)$, by allocating resources reasonably:
$$\max_{b,\,c,\,p} R(b, c, p), \tag{1}$$
where $b$, $c$ and $p$ represent the sets of beam values, channel values and power values, respectively, for the corresponding $n$-th task and $k$-th UAV. The definitions of $b$, $c$ and $p$ are given as:
$$b := \{ b_{n,k} \mid n = 1, 2;\ k = 1, 2, \cdots, K \},$$
$$c := \{ c_{n,k} \mid n = 1, 2;\ k = 1, 2, \cdots, K \},$$
$$p := \{ p_{n,k} \mid n = 1, 2;\ k = 1, 2, \cdots, K \},$$
where we use $n = 1$ and $n = 2$ to represent the information communication task and the radar detection task for the individuals of group UAVs, respectively. $b_{n,k}$, $c_{n,k}$ and $p_{n,k}$ represent the beam, the channel and the power of the $k$-th UAV performing the $n$-th task, respectively.
The domain values of $b$, $c$ and $p$ are defined as follows: which shows that the $k$-th UAV performs different tasks on different beams.
$\delta_1$ and $\delta_2$ represent the total bandwidth of all UAVs and the bandwidth of the $k$-th UAV, respectively. $c_{n,k} \neq c_{n',k}$ shows that the $k$-th UAV performs different tasks on different channels.
$\delta_3$ and $\delta_4$ represent the power of all UAVs and the power of the $k$-th UAV, respectively. The definition of $R(b, c, p)$ is given as: where $R_{1,k}(b_{1,k}, c_{1,k}, p_{1,k})$ represents the reward of the ISAC system for the $k$-th UAV performing the first task. It can be represented by the reward of the communication system [18], which is the communication rate. $R_{2,k}(b_{2,k}, c_{2,k}, p_{2,k})$ represents the reward of the ISAC system for the $k$-th UAV performing the second task. It can be represented by the reward of the radar system, which is the mutual information. The specific equations for $R_{1,k}(b_{1,k}, c_{1,k}, p_{1,k})$ and $R_{2,k}(b_{2,k}, c_{2,k}, p_{2,k})$ are given as follows.
The communication rate is taken as the reward of communication performance. It represents the performance of the communication link and is defined as: where $\lambda$ is the parameter used to achieve a trade-off between the communication and detection performance, $\xi\{\cdot\}$ is the normalization operation, $\zeta_k$ represents the channel loss of the $k$-th UAV, $\gamma_k$ represents the communication interference caused to an individual UAV when the group UAVs share information, $\kappa$ is the Boltzmann constant and $T$ represents the system noise temperature.
On the basis of information theory, target detection can be regarded as a non-cooperative communication problem, in which the detected target is reluctant to transmit information to the radar. Mutual information is used to measure the ability of the radar to acquire target information and is defined as: where $T_k$ is the pulse duration, $s_k(f)$ denotes the normalized baseband signal in the frequency domain, $\sigma(f)$ denotes the frequency response of the target to the transmitted signal, $w_k(f)$ is the noise power in the frequency domain and $\upsilon_k$ is the interference caused by other UAVs.
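As a rough numerical illustration of the two reward terms, the sketch below assumes a Shannon-type rate for the CR and a Bell-type mutual-information integral for the MI, built from the symbols defined above. The functional forms, the function names and every numerical value are assumptions made for illustration and should not be read as the definitions used in this paper.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant kappa, in J/K


def communication_rate(bandwidth_hz, power_w, channel_loss, interference_w, noise_temp_k):
    """Assumed Shannon-type rate: B * log2(1 + p * zeta / (gamma + kappa * T * B))."""
    noise_w = BOLTZMANN * noise_temp_k * bandwidth_hz  # thermal noise power
    sinr = power_w * channel_loss / (interference_w + noise_w)
    return bandwidth_hz * math.log2(1.0 + sinr)


def mutual_information(pulse_s, freqs_hz, signal_sq, target_sq, noise_psd, interference):
    """Assumed Bell-type MI: T_k * integral of log2(1 + |s|^2 |sigma|^2 / (T_k (w + v))) df.

    signal_sq, target_sq and noise_psd are samples of |s_k(f)|^2, |sigma(f)|^2
    and w_k(f) on the grid freqs_hz; the integral uses the trapezoidal rule.
    """
    def integrand(i):
        snr = signal_sq[i] * target_sq[i] / (pulse_s * (noise_psd[i] + interference))
        return math.log2(1.0 + snr)

    total = 0.0
    for i in range(len(freqs_hz) - 1):
        df = freqs_hz[i + 1] - freqs_hz[i]
        total += 0.5 * (integrand(i) + integrand(i + 1)) * df
    return pulse_s * total
```

Under these assumed forms, both quantities decrease as the interference terms grow, which is the coupling between UAVs that the allocation has to manage.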

Reinforcement-Learning-Based UAVs Resource Allocation Method
To optimize the resources among UAVs for the ISAC system, a reinforcement-learning-based method is proposed. As shown in Figure 2, at each time $t$, the group UAVs, as the agents, observe the state $E^{[t]}$ from the state space $S$ and take an action $A^{[t]}$, selecting the beam, channel and transmission power from the action space. Owing to its discrete characteristics, the resource allocation problem can be described as an interaction between resource allocation and the environment, and the quality of the corresponding resource allocation scheme can be described by the reward function. This is consistent with reinforcement learning. Therefore, in order to solve problem (1), we propose a resource allocation method based on reinforcement learning. According to each UAV's perceived service request and available resources, the group UAVs take the corresponding resource allocation action. The reward value obtained under the strategy describes the performance of the resource allocation. Unlike traditional resource allocation, reinforcement learning methods can effectively handle the high complexity of UAV resource allocation and the difficulty of realizing it. Each component is described in detail below:

Environment State
Since the state is the mapping and representation of the environment and also the basis on which agents take actions, the design of the environment state is very important. During the $t$-th state, it is defined as the current resource allocation strategy and the status of the UAVs, such as their locations $u^{[t]}$ and velocities $v^{[t]}$, the locations of targets $r$ and the last-time reward $l^{[t]}$: where $\mathbb{R}$ denotes the set of real numbers and $W$ denotes the number of targets.

Agent Actions
Actions are the outputs of an agent and the inputs to the environment. Group UAVs allocate resources reasonably according to their task requests and the available resource status of the system. Therefore, the action $A^{[t]}$ can be defined as below: where all the possible resource allocation strategies during the $t$-th state form an action set $\mathcal{M}^{[t]}$, the number of actions is $M^{[t]}$ and the $m$-th action is defined as: $a_m$ denote the channel and power allocation strategy in the $m$-th action, respectively. Notably, considering the channel interference, once a channel is used, it cannot be selected the next time.
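The action set and the channel-exclusion rule can be illustrated with a minimal sketch. It assumes that an action is an index triple (beam, channel, power) and borrows the resource counts from the simulation setup (3 beams, 24 channels, 6 power grades); the function name `build_action_set` and the triple encoding are hypothetical.

```python
from itertools import product


def build_action_set(num_beams, num_channels, num_powers, used_channels=frozenset()):
    """Enumerate (beam, channel, power) index triples, skipping channels already in use."""
    free_channels = [z for z in range(num_channels) if z not in used_channels]
    return list(product(range(num_beams), free_channels, range(num_powers)))


# All actions available at the start of an episode.
actions = build_action_set(3, 24, 6)

# After the first allocation, the chosen channel is removed from later choices.
chosen_beam, chosen_channel, chosen_power = actions[0]
remaining = build_action_set(3, 24, 6, used_channels={chosen_channel})
```

Because each used channel removes a whole slice of the product, the action set shrinks as the episode proceeds, so the number of actions depends on $t$.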

Reward
Reward refers to the feedback the agent receives after taking an action in a given environmental state. The reasonableness of the reward design is closely related to the return that the agent can obtain and to the performance of the dynamic resource allocation algorithm. In the ISAC system for group UAVs with dynamic resource allocation, it is necessary to give certain rewards in order to learn the optimal resource allocation strategy. According to the current allocation strategy $E^{[t]}$, the reward $r^{[t]}$ is defined as below: where $E_{\text{terminal}}$ denotes that resources have been allocated for this episode. The reward is set to 0 when the resource allocation has not ended and to the total reward $R(b, c, p)$ when the resource allocation has ended. Finally, we try to maximize the reward $r^{[t]}$.
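The episodic reward described above can be sketched as follows. The sketch assumes that the reward paid at the terminal state is the total ISAC reward $R(b, c, p)$ of the finished allocation; the function names and the example value 2.5 are hypothetical.

```python
def step_reward(is_terminal, total_reward):
    """Return 0 until the allocation is complete, then the (assumed) total reward."""
    return total_reward if is_terminal else 0.0


def episode_rewards(num_steps, total_reward):
    """Per-step rewards of one episode with num_steps allocation decisions."""
    return [step_reward(t == num_steps - 1, total_reward) for t in range(num_steps)]


# Example: five allocation steps, with an assumed final reward of 2.5.
rewards = episode_rewards(5, 2.5)
```

With such a sparse reward, the agent only learns the value of early allocations through the bootstrapped Q-value updates, not from immediate feedback.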
According to the above definitions of environment state, action and reward, four reinforcement learning algorithms, namely Q-Learning, SARSA, DQN and Dueling DQN, are used to realize the RL-based resource allocation strategy. All four are value-based, model-free reinforcement learning algorithms. DQN and Dueling DQN differ from Q-Learning and SARSA mainly in three aspects [39,40]. They remove the restriction of Q-Learning and SARSA that the state space and action space must be discrete and cannot be too large.
The differences between Q-Learning and SARSA are as follows. The purpose of Q-Learning is to learn the value of a specific action in a specific state. It creates a Q-table with states as rows and actions as columns, and updates the Q-table with the reward for each action. Q-Learning is off-policy, which means that the action strategy and the evaluation strategy are not the same strategy. In Q-Learning, the action strategy is ε-greedy, and the strategy used to update the Q-table is greedy.
SARSA stands for state-action-reward-state-action. The Q-table is also used to store the action value function. Moreover, the decision-making part is the same as in Q-Learning, which also adopts the ε-greedy strategy. The difference is that SARSA is an on-policy update: both its action strategy and its evaluation strategy are ε-greedy.
As can be seen above, $Q(s, a)$ and $Q(s', a')$ denote the Q-values at the current moment and the next moment, respectively, and $\alpha$ indicates the learning rate. Formula (14) is the update formula of the SARSA Q-value: the agent selects the next action using the ε-greedy strategy and then updates the value function based on the action actually performed. Formula (15) is the update formula of the Q-Learning Q-value: it assumes that the next step selects the action with the maximum Q-value when updating the value function, and the action to execute is then selected using the ε-greedy strategy.
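The on-policy and off-policy updates can be contrasted with a minimal tabular sketch in their standard textbook form. The dictionary-based Q-table, the function names and the parameter name `discount` (the discount factor) are illustrative choices, not notation from this paper.

```python
import random
from collections import defaultdict


def epsilon_greedy(q_table, state, actions, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])


def sarsa_update(q_table, s, a, r, s_next, a_next, alpha, discount):
    """On-policy: the target uses the action a_next that will actually be taken."""
    target = r + discount * q_table[(s_next, a_next)]
    q_table[(s, a)] += alpha * (target - q_table[(s, a)])


def q_learning_update(q_table, s, a, r, s_next, actions, alpha, discount):
    """Off-policy: the target uses the greedy (maximum) value in the next state."""
    target = r + discount * max(q_table[(s_next, a2)] for a2 in actions)
    q_table[(s, a)] += alpha * (target - q_table[(s, a)])


# Tiny example with two actions and hand-set next-state values.
q_sarsa = defaultdict(float, {("s1", "x"): 1.0, ("s1", "y"): 3.0})
q_learn = defaultdict(float, {("s1", "x"): 1.0, ("s1", "y"): 3.0})
sarsa_update(q_sarsa, "s0", "x", 1.0, "s1", "x", alpha=0.5, discount=0.9)
q_learning_update(q_learn, "s0", "x", 1.0, "s1", ["x", "y"], alpha=0.5, discount=0.9)
```

In this example the two rules receive the same transition but produce different Q-values, because SARSA bootstraps from the explored action "x" while Q-Learning bootstraps from the greedy action "y".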
To compare DQN and Dueling DQN, the algorithm flow is shown in Figure 3. Firstly, based on the greedy strategy, the main network generates the action $a^{[t]}$. The trajectory is stored in the experience replay, whose maximum size is 3000. The network begins to learn when the number of stored experiences reaches 500. Thus, 500 experiences are used as the input of the main network, which has an input layer of size 10, a hidden layer of size 20 and an output layer of size 26. The target network has the same structure as the main network.
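The experience replay described above can be sketched with a bounded queue. The capacity of 3000 and the learning threshold of 500 come from the text; the class name, the transition layout (state, action, reward, next state, done flag) and the uniform sampling are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience replay; the oldest transitions are dropped when full."""

    def __init__(self, capacity=3000, learn_start=500):
        self.memory = deque(maxlen=capacity)
        self.learn_start = learn_start

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def ready(self):
        """Learning starts once enough transitions have been stored."""
        return len(self.memory) >= self.learn_start

    def sample(self, batch_size, rng=random):
        """Draw a uniform random batch, which breaks temporal correlation."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

A training loop would call `push` after every environment step and, once `ready()` returns true, call `sample(500)` to build the batch fed to the main network.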
To train the network, the loss function is defined as: The main network generates the Q-value $Q(E^{[t]}, a^{[t]}; \theta)$ in state $E^{[t]}$ when performing action $a^{[t]}$. Additionally, the target network generates the maximum Q-value $y$ when performing a different action $a^{[t+1]}$.
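The roles of the two networks in the loss can be sketched in the standard DQN form, which is assumed here: a mean squared error between the target $y$ computed by the target network and the Q-value of the main network. The networks are represented by plain callables that map a state to a list of Q-values; all names and the parameter `discount` are hypothetical.

```python
def td_target(reward, next_q_values, discount, done):
    """Target y: the reward plus the discounted maximum Q-value of the target network."""
    if done:
        return reward  # no bootstrapping from a terminal state
    return reward + discount * max(next_q_values)


def dqn_loss(batch, main_q, target_q, discount):
    """Mean squared TD error over (state, action, reward, next_state, done) tuples."""
    total = 0.0
    for state, action, reward, next_state, done in batch:
        y = td_target(reward, target_q(next_state), discount, done)
        total += (y - main_q(state)[action]) ** 2
    return total / len(batch)


# Toy example with constant stand-in "networks" over two actions.
main_q = lambda state: [1.0, 2.0]
target_q = lambda state: [0.5, 3.0]
batch = [((0,), 1, 1.0, (1,), False), ((1,), 0, 2.0, (2,), True)]
loss = dqn_loss(batch, main_q, target_q, discount=0.5)
```

Keeping the target network separate means $y$ stays fixed while the main network parameters $\theta$ are being updated, which stabilizes training.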

Simulation Results
The simulation results of the ISAC system for group UAVs under different resource allocation methods are reported in this section. The simulations were carried out on a personal computer with an Intel Core i7-8750H CPU, 16 GByte of DDR4 memory and a 6 GByte Nvidia GeForce GTX 1060 with Max-Q Design GPU. Specific simulation parameters are shown in Table 1. With a direct component in the channel, the received signal is the superposition of complex Gaussian signals and the direct component, so the channel fading model adopted in this paper is the Rician channel fading model. As given in Table 1, we use 5 UAVs to form a group, with 24 channels, 3 beams and 6 power grades as the resource blocks to be allocated.
The reward function is $R(b, c, p)$, which is a combination of MI and CR and has been used in many papers. Specific simulation parameters are shown in Table 2 for Q-Learning, SARSA, DQN and Dueling DQN used in resource allocation. The DQN and Dueling DQN algorithms are deep reinforcement learning algorithms. They contain a neural network structure consisting of an input layer, a hidden layer and an output layer. Linear stands for the linear functions at the input and output layers. ReLU is the activation function that introduces nonlinearity. The size of the input layer is the dimension of the state $S$, and the dimension of the hidden layer is 20. The output layer of DQN has the dimension of the action $A$, while the output layer of Dueling DQN corresponds to the dimensions one and $A$.
In this paper, the convergence performance of each reinforcement learning algorithm is verified by simulation. Figure 4 shows the results for group UAVs' resource allocation simulated on a joint PyCharm and MATLAB platform, where $\lambda$ is 0.1, which is the weight of the target detection task, and the weight of the information sharing task is 0.9. At this point, the normalized value of CR is 1000 Mbit/s, and the normalized value of MI is 5000 bit/s.
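The weighting of the two tasks can be illustrated with a small sketch. The weights (0.1 for detection, 0.9 for communication) and the normalized values (1000 Mbit/s and 5000 bit/s) are taken from the text; combining them as a weighted sum of normalized terms is an assumption, and the names below are hypothetical.

```python
CR_NORM_BPS = 1000e6  # normalized value of CR: 1000 Mbit/s
MI_NORM_BPS = 5000.0  # normalized value of MI: 5000 bit/s


def total_reward(cr_bps, mi_bps, lam=0.1):
    """Assumed combination: (1 - lam) * CR / CR_norm + lam * MI / MI_norm.

    lam weights the target detection task (MI); 1 - lam weights the
    information sharing task (CR).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * cr_bps / CR_NORM_BPS + lam * mi_bps / MI_NORM_BPS


# With both terms at their normalized values, the assumed reward equals one.
reference = total_reward(1000e6, 5000.0, lam=0.1)
```

Sweeping `lam` from 0.1 to 0.9 in this form shifts the emphasis from communication to detection while keeping the two terms on a comparable scale.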
The CR is the communication weight times the communication normalized value times the total reward. The MI is the detection weight times the target detection normalized value times the total reward.
As can be seen from Figure 4, in the early stage of learning, when the number of learning iterations is relatively small, the four reinforcement learning algorithms have low total reward values, because the group UAVs have little awareness of the environment and are still in the exploratory stage. As their awareness of the environment increases, the group UAVs gradually learn the optimal strategy, and the total reward tends to become stable. As shown in the data fitting curves, Q-Learning and SARSA converged at round 5000 and round 7000, respectively, while DQN and Dueling DQN converged at round 500. Compared with the Q-Learning and SARSA algorithms, DQN and Dueling DQN are modified in three aspects based on Q-Learning: using a deep learning (DL) function approximator, using experience replay in the training process and establishing independent target networks to deal with the TD error in the temporal-difference algorithm. This largely solves the problem of an overly large action space and breaks the correlation between experiences. The number of episodes is also significantly reduced. Additionally, because the action space is large, the system performance of DQN and Dueling DQN is better than that of the Q-Learning and SARSA algorithms.
Figure 5 compares the loss of DQN and Dueling DQN under the same learning rate. As the number of episodes increases, the loss of Dueling DQN tends to 0, while that of DQN tends to 1.5, so Dueling DQN has a lower loss than DQN. Additionally, the curve of Dueling DQN fluctuates wildly while that of DQN tends to be flat. Dueling DQN has better convergence performance than DQN.
As shown in Figure 6, under the same learning rate, Dueling DQN requires fewer episodes than DQN, even as the complexity of the problem being solved increases. Therefore, the overall performance of Dueling DQN is better than that of DQN.
Second, we show the system performance under different methods. The proposed method is compared with two benchmark methods:
• KM method [41]: The traditional Kuhn-Munkres (KM) iterative optimization method is used for resource allocation by iterating over each resource to optimize the allocation.
• DNN method [42]: The deep neural network (DNN) method based on supervised learning is used for resource allocation by fitting initial data sets.
As shown in Figure 7, resources are allocated under different numbers of channels, ranging from 21 to 33. As the number of channels increases, the total reward performance curves of the three algorithms gradually improve, which is caused by the gradual increase in the system channel resources. It can be further seen from the figure that when the number of channels is small, the performance of Dueling DQN and DQN approaches that of the KM algorithm and is better than that of the DNN algorithm. However, as the number of channels increases, the performance of Dueling DQN and DQN becomes clearly better than that of both the KM algorithm and the DNN algorithm. This is mainly because the KM algorithm relies on iterating over multiple resources, while the DNN algorithm mainly depends on its fitting data set.
As shown in Figure 8, resources are allocated under different numbers of power grades, ranging from 6 to 18. As the number of power levels increases, the total reward performance curves of the three algorithms gradually increase, because a finer set of power levels is more likely to contain a value close to the optimal power. It can be further seen from the figure that the KM algorithm performs better than the Dueling DQN and DQN algorithms when the number of power levels is small. However, the performance of the KM algorithm grows slowly as the number of power levels increases, while the performance of the Dueling DQN and DQN algorithms is better. The DNN algorithm performs worse than the Dueling DQN and DQN algorithms from beginning to end.
As shown in Figure 9, resources are allocated under different numbers of beams, ranging from 3 to 7. As the number of beams increases, the total reward performance curves of the four algorithms gradually improve, mainly because of the increase in beam resources. It can be further seen from the figure that the Dueling DQN and DQN algorithms perform significantly better than the KM and DNN algorithms as the number of beams increases.
As shown in Figure 10, resources are allocated under different numbers of UAVs, ranging from 5 to 25. As the number of users increases, so does the number of channels, and the total reward performance curves of the four algorithms gradually increase. It can be clearly seen from the figure that when the numbers of users and channels are small, the performance of the KM algorithm is slightly higher than that of the Dueling DQN and DQN algorithms, and the performance of the DNN algorithm is the lowest. However, the performance of the Dueling DQN and DQN algorithms grows rapidly as the number of users increases, while that of the KM and DNN algorithms grows slowly. In addition, the Dueling DQN and DQN algorithms have the best performance.
As shown in Figure 11, resources are allocated under different values of $\lambda$, ranging from 0.1 to 0.9. Here $\lambda$ is varied to compare the performance of the four algorithms while the resource situation remains unchanged. The DNN algorithm has the worst performance, the Dueling DQN and DQN algorithms have the best performance and the KM algorithm lies in between. It can be seen from the figures that the DQN and Dueling DQN resource allocation algorithms are superior to the KM and DNN allocation algorithms as the environment complexity increases.
To show the computational complexity clearly, the computational time is given in Figure 12. As the environment complexity increases, the computational time of the proposed method is shorter than that of the KM algorithm and close to that of the DNN algorithm. When the environment becomes more and more complex, the time consumption of the KM algorithm increases markedly, while that of the Dueling DQN, DQN and DNN algorithms increases slowly.

Summary and Prospect
For the flexible deployment of the ISAC system, traditional fixed resource allocation can no longer allocate resources effectively according to the real-time situation, which results in low resource utilization. To solve this problem, this paper has first summarized and analyzed the resource allocation technology of the ISAC system and then introduced the related resource allocation technology. Second, considering the complexity of the resource allocation problem, reinforcement-learning-based methods including Q-Learning, SARSA, DQN and Dueling DQN have been proposed under a novel reward combining both the MI and the CR, and the reasons for this combination have been analyzed. Finally, the allocation of resources under different environments has been presented. Simulation results show that, compared with the KM method, the resource allocation method based on reinforcement learning has better performance and lower time complexity as the environment complexity increases. Compared with the DNN method, the proposed method does not require a data set to be prepared in advance. Additionally, when the time complexity of the two methods is almost the same, the system performance of the proposed method is better. In summary, the resource allocation problem for the ISAC system has been studied with reinforcement learning, and the simulation results show the advantages of the proposed method. In future work, we will realize a hardware system to test the proposed method.