UAV Trajectory Design and Power Optimization for Terahertz Band-Integrated Sensing and Communications

Sixth-generation (6G) wireless networks require very low latency and ultra-high data rates, which have become the main challenges for future wireless communications. To balance the requirements of 6G against the severe capacity shortage of existing wireless networks, sensing-assisted communications in the terahertz (THz) band with unmanned aerial vehicles (UAVs) are proposed. In this scenario, the THz-UAV acts as an aerial base station that transmits information to users and sends sensing signals to detect the THz channel, thereby assisting UAV communication. However, communication and sensing signals that use the same resources interfere with each other. Therefore, we study a cooperative method for the co-existence of sensing and communication signals that share the same frequency and time resources, in order to reduce this interference. We then formulate an optimization problem to minimize the total delay by jointly optimizing the UAV trajectory, frequency association, and transmission power of each user. The resulting problem is a non-convex mixed-integer optimization problem, which is challenging to solve. Using the Lagrange multiplier method and the proximal policy optimization (PPO) method, we propose an overall alternating optimization algorithm that solves this problem iteratively. First, given the UAV location and frequency, the sub-problem of the sensing and communication transmission powers is transformed into a convex problem, which is solved by the Lagrange multiplier method. Second, in each iteration, for given sensing and communication transmission powers, we relax the discrete variable to a continuous variable and use the PPO algorithm to tackle the sub-problem of jointly optimizing the UAV location and frequency. The results show that the proposed algorithm reduces the delay and improves the transmission rate compared with the conventional greedy algorithm.


Introduction
Following the emergence of various applications, such as holographic communication, sensory interconnection, three-dimensional immersive experiences, and the metaverse, terahertz (THz) band communication is envisioned as one of the key enabling technologies to satisfy their needs [1,2]. Specifically, the ultra-wide THz band, ranging from 0.1 THz to 10 THz, promises to support applications with high quality of service and terabit-per-second data rates [3]. Owing to this ultra-wide bandwidth, the THz band will enable new applications for future ultra-high data rate communication [4]. In addition to communication, THz frequencies will also enable high-resolution and high-accuracy sensing, such as radar, augmented human senses, and other scenarios [5]. Furthermore, THz networks can realize massive communication connectivity

•
Designing a new sensing and communication power optimization method that considers interference between sensing and communication signals in a THz sensing-assisted UAV communication network.

•
Formulating an optimization problem (Figure 1) and proposing an efficient alternating optimization algorithm to solve it. First, we use the Lagrangian dual decomposition method to obtain the sensing and communication powers for a fixed trajectory. Second, we use the proximal policy optimization (PPO) algorithm to jointly optimize the UAV location and frequency association for fixed sensing and communication powers.

•
Designing a PPO algorithm for optimizing the UAV trajectory and frequency association. The PPO algorithm combines a critic network that uses global information with an actor network that uses local information, and the two cooperate to explore the UAV flight angle and the frequency association.
The rest of this paper is organized as follows. Prior works are reviewed in Section 2. The system model is described in Section 3. The problem decomposition and the joint optimization design are presented in Section 4. Simulation results are provided and discussed in Section 5. Finally, Section 6 concludes the paper.

Prior Works
The study of UAVs is considered a new frontier field [12][13][14][15][16]. In [12], the authors formulated an optimization problem to maximize the sum rate of a satellite and aerial integrated network. In [13], the authors aimed to maximize the energy efficiency of UAV-enabled communication by optimizing its trajectory. In [14], the authors considered a UAV-assisted wireless communication system with limited storage space and energy to realize multi-user communication. In [15], the authors proposed a new protocol for UAV-to-UAV and UAV-to-GCS communication. In [16], the authors gave a short overview of the possible threats, attacks, and countermeasures related to UAV communications.
To exploit THz-band UAV wireless communication, some initial works have considered THz-enabled aerial communications [17][18][19]. In [17], the authors proposed a UAV-to-user THz sub-band association scheme to eliminate interference in THz transmission. They showed that THz frequencies can be used for communication and derived extensions of the wireless charging window and the THz transmission window.
In [18], the authors minimized the total uplink and downlink transmission delay between the UAV and the users by jointly optimizing the location of the operating UAV and the bandwidth of the users, while also minimizing the transmit power of the users. In doing so, they improved the performance of UAV communication at THz frequencies.
In [19], the authors studied UAV-supported THz communications in which an intelligent reflecting surface (IRS) was deployed to assist the transmission. Pan et al. aimed to maximize the minimum average rate of all users by optimizing and evaluating the resource allocation problem for THz UAVs.
Many works have been dedicated to integrated sensing and communications [20][21][22]. In [20], the authors provided a brief explanation of communication rate maximization theory. Their goals were to study the basic phenomenology of communications and to analyze such systems in an information-theoretic context. In [21], the authors aimed to further investigate the achievable performance of spectrally overlapping radar and communication systems by conjugating the detection. In [22], the authors developed a new approach for deriving joint radar-communications performance bounds, that is, they studied the performance bounds of combined communication and sensing.
There are growing research interests in power optimization [23][24][25][26]. In [23], the authors' design objective was to minimize the total transmission power of both the satellite and BS with a limited onboard power resource. In [24], the authors designed an objective function to maximize the system secrecy energy efficiency under the constraint of the total transmission power budget. In [25], the authors investigated the energy minimization problem of a UAV-assisted data collection sensor network. In [26], the authors designed a function that maximized the sum rate in a satellite-terrestrial integrated network, aiming to satisfy the constraints of per-antenna transmission power and quality-of-service requirements of both satellite and cellular users.
We have summarized the relevant work in Table 1.

System Model
We consider a downlink from a THz-UAV to N users over a time horizon T, as shown in Figure 2. We assume that the user locations follow a two-dimensional (2D) homogeneous Poisson point process (PPP) $\Phi_u$ with intensity $\lambda_u$. For ease of calculation, the time horizon T is equally divided into K + 1 time slots, each of length $T/(K+1)$. The THz-UAV uses integrated sensing and communication to improve system performance. Because sensing and communication signals share spectrum resources, achieving the critical trade-off between these two functionalities is challenging. To reduce the interference between communication and sensing signals on the same frequency, at time slot 0 the UAV sends only sensing signals, which the users receive. During time slots 1 to K + 1, the UAV sends both communication and sensing signals, and the users receive both.
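The user placement model above can be illustrated by drawing one realization of a homogeneous PPP in a circular coverage area. This is a minimal sketch that assumes a disc-shaped area and treats the intensity as users per square meter; the function name `sample_ppp_disc` and the intensity value 0.01 are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def sample_ppp_disc(intensity, radius, rng):
    """Draw one realization of a homogeneous 2D PPP inside a disc centered at the origin.

    intensity: expected number of users per square meter (lambda_u, assumed units).
    radius: disc radius in meters.
    rng: a numpy.random.Generator.
    Returns an (N, 2) array of user coordinates; N itself is random.
    """
    area = np.pi * radius ** 2
    n = rng.poisson(intensity * area)                 # N ~ Poisson(lambda_u * area)
    r = radius * np.sqrt(rng.uniform(0.0, 1.0, n))    # sqrt keeps the density uniform over the area
    theta = rng.uniform(0.0, 2.0 * np.pi, n)          # uniform angle
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))

rng = np.random.default_rng(seed=0)
users = sample_ppp_disc(intensity=0.01, radius=50.0, rng=rng)  # illustrative intensity
print(users.shape)
```

Conditioned on the number of points, PPP points are independent and uniform over the region, which is why the sketch draws the count first and the positions second.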
Therefore, for N targets, the signal $z_k^n$ received by user n at time slot k is the superposition of the sensing and communication signals, where $S_k$ is the sensing signal, $C_k$ is the communication signal, $P_k^{S,n}$ and $P_k^{C,n}$ are the transmit powers of the sensing and communication signals at time slot k, respectively, and $h_k^n$ is the THz channel gain from the UAV to user n. Without loss of generality, we assume that the UAV moves at a constant speed V, and its location at time slot k is denoted by $L_k = (x_k, y_k, H)$, where the altitude H is assumed to be constant. The coordinates of the UAV at time slot k are therefore determined by its coordinates at time slot k − 1, the speed V, and the direction $\psi_{k-1}^n \in [0, 2\pi]$ of the UAV at time slot k − 1 from the UAV to user n. The UAV trajectory must also satisfy the trajectory constraints in [27], where $L_0$ and $L_K$ are the initial and final locations, respectively.
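As a minimal sketch of the mobility model, assuming the common constant-speed update $x_k = x_{k-1} + V\delta\cos\psi_{k-1}$ and $y_k = y_{k-1} + V\delta\sin\psi_{k-1}$ with slot length $\delta = T/(K+1)$ (an assumption, since the exact update equation is not shown in this section), the planar trajectory can be rolled out from a sequence of headings. The helper `rollout_trajectory` is hypothetical.

```python
import numpy as np

def rollout_trajectory(start_xy, headings, speed, horizon, num_slots):
    """Roll out a constant-speed, constant-altitude UAV trajectory (assumed model).

    In each slot of length delta = horizon / num_slots the UAV flies straight
    with heading psi (radians), covering speed * delta meters.
    Returns an array of shape (len(headings) + 1, 2) with the (x, y) positions.
    """
    delta = horizon / num_slots            # slot length T/(K+1)
    step = speed * delta                   # distance flown in one slot
    positions = [np.asarray(start_xy, dtype=float)]
    for psi in headings:
        move = step * np.array([np.cos(psi), np.sin(psi)])
        positions.append(positions[-1] + move)
    return np.vstack(positions)

# Example: V = 10 m/s, T = 3 s, K + 1 = 3 slots, so each slot covers 10 m.
traj = rollout_trajectory((0.0, 0.0), headings=[0.0, np.pi / 2], speed=10.0,
                          horizon=3.0, num_slots=3)
print(np.round(traj, 6))
```

Because every slot covers exactly $V\delta$ meters, the speed constraint holds by construction; checks on the final location $L_K$ would be added on top of this rollout.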
Considering LoS transmission, the path loss between the UAV and user n can be written as in [28], where $f_{k,i}^n \in \{f_1, \ldots, f_I\}$ is the carrier frequency adopted by the UAV for communicating with user n, and $\varepsilon_k^n(f_{k,i}^n)$ is the absorption coefficient at time slot k, which depends on the carrier frequency $f_{k,i}^n$ and on the number of water molecules in the atmosphere. The free-space direct-ray (LoS) channel transfer function, $H_{\mathrm{LoS}}$, consists of the spreading loss function, $H_{\mathrm{Spr}}$, and the molecular absorption loss function, $H_{\mathrm{Abs}}$. The accuracy of the estimate of $\varepsilon_k^n(f_{k,i}^n)$ is positively correlated with the sensing power; for the specific formula, please refer to [22].
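For illustration only, the sketch below assumes the widely used THz LoS model of Jornet and Akyildiz, in which the spreading loss has amplitude $c/(4\pi f d)$ and the molecular absorption loss has amplitude $e^{-\kappa d/2}$ for distance d and absorption coefficient κ. This is not necessarily the exact expression of [28]; the function names, the symbol d, and the placeholder coefficient 0.01 m⁻¹ (standing in for $\varepsilon_k^n(f_{k,i}^n)$) are assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def thz_los_gain(freq_hz, dist_m, absorption_coeff):
    """Channel power gain |H_Spr * H_Abs|^2 under an assumed THz LoS model.

    absorption_coeff: molecular absorption coefficient in 1/m (placeholder for the
    frequency- and humidity-dependent coefficient described in the text).
    """
    h_spr = SPEED_OF_LIGHT / (4.0 * math.pi * freq_hz * dist_m)  # spreading loss amplitude
    h_abs = math.exp(-0.5 * absorption_coeff * dist_m)           # absorption loss amplitude
    return (h_spr * h_abs) ** 2

def path_loss_db(freq_hz, dist_m, absorption_coeff):
    """Path loss in dB, i.e., the inverse of the channel power gain."""
    return -10.0 * math.log10(thz_los_gain(freq_hz, dist_m, absorption_coeff))

# Compare two of the simulated carriers at 50 m with a placeholder coefficient.
for f in (300e9, 350e9):
    print(f / 1e9, "GHz:", round(path_loss_db(f, 50.0, 0.01), 2), "dB")
```

In this form the spreading term grows with frequency as $20\log_{10} f$, while the absorption term adds $10\kappa d\log_{10} e$ dB, which is why an accurate sensing estimate of the coefficient matters for carrier assignment.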
Because the environmental parameters change slowly, we use the sensing estimate obtained at time slot k − 1 as the estimate for time slot k. The communication signal of user n at time slot k is the total signal received at time slot k, $z_k^n$, minus the estimated sensing signal at time slot k, where $\tilde{h}_{k-1}^n$, a function of $f_{k-1,i}^n$ and $\varepsilon_{k-1}^n(f_{k-1,i}^n)$ obtained from the sensing signals, is the estimated THz channel gain used at frequency $f_{k,i}^n$.
The THz-UAV needs to extract sensing signals to estimate $\varepsilon_k^n(f_{k,i}^n)$ and to assign a THz carrier to each user; hence, the accuracy of $\varepsilon_k^n(f_{k,i}^n)$ affects the THz carrier allocation. The sensing signals received by the THz-UAV at time slot k can be expressed similarly. Because sensing and communication signals share spectrum resources, the error between the true sensing signal at time slot k and its estimate interferes with the communication signals. In addition, other users on the same THz carrier also interfere with user n. The SINR at user n therefore accounts for the noise, the residual sensing error, and the co-channel interference, where $N_0$ is the additive white Gaussian noise power at user n on the i-th carrier frequency of the THz band. Correspondingly, the achievable downlink rate from the UAV to user n follows from this SINR [29], where B is the bandwidth allocated to user n, which is assumed to be equal for all users. The total delay of all users at time slot k then follows from these rates, where $D_n$ is the amount of data required by user n.
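The SINR, rate, and delay expressions can be sketched as one plausible instance under stated assumptions: the residual sensing error is modeled as $|p_k^{S,n} h_k^n - p_{k-1}^{S,n} \tilde{h}_{k-1}^n|$ (mirroring the denominator of $\lambda_1$ in Appendix A), co-channel interference is the power intended for other users on the same carrier received through user n's channel, the rate is the Shannon rate $B\log_2(1+\mathrm{SINR})$, and the slot delay is the sum of $D_n$ divided by the rate over users. The function `slot_delay` and its argument names are hypothetical.

```python
import numpy as np

def slot_delay(p_c, p_s, p_s_prev, h, h_prev_est, carrier, noise, bandwidth, data_bits):
    """Per-user SINR and rate, and total delay, at one time slot (illustrative model).

    All per-user arguments have shape (N,); carrier holds each user's carrier index.
    """
    p_c, p_s, p_s_prev, h, h_prev_est, data_bits = (
        np.asarray(a, dtype=float) for a in (p_c, p_s, p_s_prev, h, h_prev_est, data_bits))
    carrier = np.asarray(carrier)
    # Residual sensing error after subtracting the estimate from slot k - 1 (assumption).
    residual = np.abs(p_s * h - p_s_prev * h_prev_est)
    sinr = np.empty(len(p_c))
    for n in range(len(p_c)):
        others = carrier == carrier[n]
        others[n] = False
        # Co-channel interference: power sent to other users on the same carrier,
        # received through user n's channel (assumption).
        interference = np.sum(p_c[others] + p_s[others]) * h[n]
        sinr[n] = p_c[n] * h[n] / (noise + residual[n] + interference)
    rate = bandwidth * np.log2(1.0 + sinr)          # bit/s
    total_delay = float(np.sum(data_bits / rate))   # seconds, summed over users
    return sinr, rate, total_delay

# Perfect sensing estimates and distinct carriers: SINR = 1 and rate = B for both users.
sinr, rate, delay = slot_delay(
    p_c=[1.0, 1.0], p_s=[0.2, 0.2], p_s_prev=[0.2, 0.2],
    h=[1e-3, 1e-3], h_prev_est=[1e-3, 1e-3], carrier=[0, 1],
    noise=1e-3, bandwidth=1e9, data_bits=[1e9, 1e9])
print(sinr, delay)  # → [1. 1.] 2.0
```

Putting both users on carrier 0 instead lowers each SINR to 1e-3 / 2.2e-3, which shows how frequency association trades off against delay.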

Problem Formulation
Using the above setup, we aim to minimize the total delay over the K + 1 time slots by jointly optimizing the UAV trajectory, frequency association, and transmission powers. This optimization problem is formulated as problem (12), subject to constraints C1-C5, where constraint C1 ensures that each user is associated with one carrier frequency at each time slot k, C2-C4 ensure that the UAV does not exceed its maximum speed over the time horizon T, and C5 limits the maximum transmission powers of the sensing and communication signals.

Problem Decomposition
Solving problem (12) is challenging for two reasons. First, the optimization variable $f_{k,i}^n$ for user n at time slot k is binary, so the feasible set of problem (12) is non-convex. Second, the variables $L_k$ and $f_{k,i}^n$ are strongly coupled with the sensing and communication powers. Hence, problem (12) is a mixed-integer non-convex optimization problem, for which there is, in general, no standard method that solves it efficiently.
To tackle these challenges, we decompose the original problem (12) into two sub-problems: the power allocation optimization (P1) and the optimization of the trajectory and frequency variables (P2).
We first optimize the power variables $p_k^{C,n}$ and $p_k^{S,n}$ in (P1) with the trajectory variable $L_k$ and the frequency variable $f_{k,i}^n$ fixed; the resulting subproblem (P1) is given in (13). We next optimize the trajectory and frequency variables in (P2), given in (14), with the power allocation variables $p_k^{C,n}$ and $p_k^{S,n}$ fixed. The two subproblems are optimized alternately over multiple iterations. In the (j + 1)-th iteration (j = 0, 1, 2, ..., $j_{\max}$), we first solve (P1) with the Lagrange multiplier method for fixed $L_k^j$ and $f_{k,i}^{n,j}$, obtaining the solution $p_k^{*C,n}$ and $p_k^{*S,n}$. We then solve (P2) with the PPO algorithm, obtaining the solution $L_k^{j+1}$ and $f_{k,i}^{n,j+1}$. Once the solution converges or the maximum number of iterations $j_{\max}$ is reached, the solution of (12) is obtained.
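The alternating procedure just described can be summarized as a generic loop. This is a sketch only: `solve_power` and `solve_trajectory` are hypothetical callables standing in for the Lagrangian solution of (P1) and the PPO solution of (P2), and the stopping rule (objective change below a tolerance, or $j_{\max}$ iterations) is an assumption consistent with the text.

```python
def alternating_optimization(solve_power, solve_trajectory, objective,
                             init_traj, j_max=50, tol=1e-6):
    """Alternate between the power subproblem (P1) and the trajectory/frequency
    subproblem (P2) until the objective stops changing or j_max is reached."""
    traj = init_traj
    power = solve_power(traj)
    history = [objective(power, traj)]
    for _ in range(j_max):
        power = solve_power(traj)        # (P1): trajectory and frequency fixed
        traj = solve_trajectory(power)   # (P2): powers fixed
        history.append(objective(power, traj))
        if abs(history[-2] - history[-1]) <= tol:
            break
    return power, traj, history

# Toy check with scalar stand-ins: minimize (x - y)^2 + (y - 3)^2 by alternating
# exact minimization over x (given y) and over y (given x); the optimum is x = y = 3.
power, traj, history = alternating_optimization(
    solve_power=lambda y: y,
    solve_trajectory=lambda x: (x + 3.0) / 2.0,
    objective=lambda x, y: (x - y) ** 2 + (y - 3.0) ** 2,
    init_traj=0.0, j_max=200, tol=1e-12)
print(round(power, 4), round(traj, 4), len(history))
```

When each subproblem is solved exactly, the objective is non-increasing across iterations, which is the usual argument for convergence of such alternating schemes; with a learned PPO policy for (P2) this monotonicity is not guaranteed.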

Joint Optimization Design
In this section, we will present the solution to the above two subproblems, and then propose a joint algorithm via separately optimizing the subproblems in an iterative way.

Joint Sensing and Communication Power
Before solving (13), we first establish the convexity of this sub-problem in Theorem 1, which is proved in Appendix A.
Because sub-problem (13) is convex, we choose the Lagrangian dual decomposition method to solve it and obtain the optimal solution $p_k^{*S,n}$ and $p_k^{*C,n}$. The Lagrangian function of (P1) is constructed with $\eta_k$, the Lagrange multiplier associated with constraint C5. Since (P1) is convex, its optimal solution satisfies the Karush-Kuhn-Tucker (KKT) conditions (16). Case 1. If $\eta_k \neq 0$, the KKT conditions (16) reduce to (17) and (18), where $P_k^{C,\max}$ and $P_k^{S,\max}$ denote the maximum communication and sensing powers at time slot k, respectively. Since $\eta_k \neq 0$, complementary slackness makes constraint C5 active, and the solution of (P1) is given in closed form by $\acute{p}_k^{C,n} = P_k^{C,\max}$ and $\acute{p}_k^{S,n} = P_k^{S,\max}$. Case 2. If $\eta_k = 0$, combining $\eta_k = 0$ with (17) and (18) yields the closed-form solution $\tilde{p}_k^{C,n}$ and $\tilde{p}_k^{S,n}$ of (P1). In summary, the optimal solution of (P1) is the candidate with the smaller objective value:
$$\left(p_k^{*C,n}, p_k^{*S,n}\right) = \arg\min \left\{ \Phi_k\left(\acute{p}_k^{C,n}, \acute{p}_k^{S,n}\right),\ \Phi_k\left(\tilde{p}_k^{C,n}, \tilde{p}_k^{S,n}\right) \right\}$$
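Because the closed-form expressions of Case 2 are not shown here, a brute-force grid search offers a quick numerical sanity check on the structure of the solution. The sketch assumes a single user, per-signal caps $P_k^{C,\max}$ and $P_k^{S,\max}$ as in Case 1, and an illustrative SINR model with a residual sensing error term (an assumption mirroring the denominator of $\lambda_1$ in Appendix A); it is not the Lagrangian method itself, and all names are hypothetical. Under this toy model the delay decreases in the communication power, so the search should return $p^C = P_k^{C,\max}$, and the best sensing power is the one that cancels the residual error, $p^S = p_{k-1}^{S}\tilde{h}_{k-1}/h_k$, clipped to its cap.

```python
import numpy as np

def single_user_delay(p_c, p_s, h, p_s_prev, h_prev_est, noise, interference,
                      bandwidth, data_bits):
    """Delay D / R of one user under an assumed SINR model with residual sensing error."""
    residual = abs(p_s * h - p_s_prev * h_prev_est)
    sinr = p_c * h / (noise + residual + interference)
    return data_bits / (bandwidth * np.log2(1.0 + sinr))

def grid_search_power(pc_max, ps_max, num=201, **model):
    """Exhaustive search over (p_c, p_s) in [0, pc_max] x [0, ps_max]."""
    best = (np.inf, None, None)
    for p_c in np.linspace(0.0, pc_max, num)[1:]:   # skip p_c = 0, which gives zero rate
        for p_s in np.linspace(0.0, ps_max, num):
            d = single_user_delay(p_c, p_s, **model)
            if d < best[0]:
                best = (d, p_c, p_s)
    return best

model = dict(h=1e-3, p_s_prev=0.5, h_prev_est=1e-3, noise=1e-6,
             interference=0.0, bandwidth=1e9, data_bits=1e9)
delay, p_c_opt, p_s_opt = grid_search_power(pc_max=1.0, ps_max=1.0, **model)
print(p_c_opt, p_s_opt)
```

A grid check like this is useful for validating a derived closed form on small instances before relying on it inside the alternating loop.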

Joint UAV Trajectory and Frequency Association
As shown in Figure 3, we pursue intelligent UAV trajectory optimization aided by the PPO algorithm to reduce the system delay. The proposed PPO framework treats the UAV as a learning agent. The learning process of the UAV interacting with the THz environment is described by the tuple (S, A, R), where S is the state space, A is the action space, and $R: S \times A \to \mathbb{R}$ is the reward function, which gives the immediate reward obtained when the agent's action moves it from one state to the next. The state, action, and reward are defined as follows:

•
State: The state observed by the agent is determined by the transmission powers of sensing and communication, and the state of the UAV at time step t is defined accordingly.

•
Action: The action is to choose a proper flight direction and a proper frequency association to obtain better rewards; the action performed at time step t is denoted by $a_k^{(t)}$.

•
Reward: The agent receives an immediate reward, expressed in terms of the delay $T_k$, which describes its benefit from taking action $a_k^{(t)}$. Future rewards are weighted by $\eta^{k'-k}$, where $\eta \in [0, 1]$ is the discount rate, which determines the effect of future rewards on the current action. As $\eta^{k'-k} \to 1$, the rewards of future states strongly influence the action-value function; as $\eta^{k'-k} \to 0$, they have little influence.
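The role of the discount rate can be illustrated by computing discounted returns backward over an episode. This sketch assumes that the per-slot reward is the negative delay, a hypothetical choice consistent with the delay-minimization objective, and uses a single scalar `eta` for the discount rate η.

```python
def discounted_returns(rewards, eta):
    """G_k = r_k + eta * r_{k+1} + eta^2 * r_{k+2} + ..., computed backward in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):
        running = rewards[k] + eta * running
        returns[k] = running
    return returns

# Hypothetical per-slot delays in seconds; reward = -delay.
delays = [2.0, 1.5, 1.0]
rewards = [-d for d in delays]
print(discounted_returns(rewards, eta=0.9))  # far-sighted: future delays count
print(discounted_returns(rewards, eta=0.0))  # myopic: returns equal the immediate rewards
```

With eta = 0 the agent behaves like a per-slot greedy rule, while values close to 1 make it weigh delays in later slots almost as heavily as the current one.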
In the policy gradient algorithm, shown in Algorithm 1, the agent updates the policy by gradient ascent. In PPO, the old actor obtains its parameters by copying those of the current actor. To avoid excessively large errors, we introduce $\mathrm{ratio}_k$, the ratio between the new and old policies, to limit the magnitude of each policy update: by bounding this ratio when computing the rewards, the change of the policy between successive updates is bounded. As a result, this not only improves the stability of the PPO algorithm but also reduces its complexity and improves computational efficiency. In this paper, the ratio between the new and old policies of each agent is calculated from their action probabilities. Figure 3 describes the operation of the PPO algorithm. During training, a batch of samples is drawn from the experience storage to update the network parameters. The value network determines the choice of action through the reward values of these samples, and the reward values in turn affect the sampling probability density functions. When exploring, the agent selects actions at random in pursuit of a higher long-term reward; otherwise, it selects the action that yields the highest immediate reward. To improve sampling efficiency, PPO adopts importance sampling, which allows samples collected under the old policy to be reused, turning the on-policy policy gradient algorithm into an off-policy one. The update formula of the actor network is then expressed over the trajectory $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_K, a_K\}$ of the agent in an entire episode.
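The probability ratio used above can be computed stably from log-probabilities. The sketch assumes a discrete policy over a hypothetical joint action set (for example, a quantized flight direction combined with a carrier index) parameterized by softmax logits. Since the paper relaxes the discrete variable to a continuous one, this categorical form is an illustrative simplification, and the helper names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax along the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def probability_ratio(new_logits, old_logits, actions):
    """ratio_k = pi_new(a_k | s_k) / pi_old(a_k | s_k) for each sampled (state, action) pair."""
    new_logits = np.asarray(new_logits, dtype=float)
    old_logits = np.asarray(old_logits, dtype=float)
    rows = np.arange(len(actions))
    new_lp = log_softmax(new_logits)[rows, actions]
    old_lp = log_softmax(old_logits)[rows, actions]
    return np.exp(new_lp - old_lp)   # subtracting log-probs avoids dividing tiny numbers

# One state, two candidate actions: the old policy is uniform (0.5, 0.5) and the new
# policy picks action 0 with probability 0.75, so the ratio is 1.5.
ratio = probability_ratio([[np.log(3.0), 0.0]], [[0.0, 0.0]], actions=[0])
print(ratio)
```

Weighting each sample by this ratio is what lets data gathered under the old policy be reused when estimating the gradient of the new one.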

PPO uses a clip function to directly limit the probability ratio to the range $[1 - \varepsilon, 1 + \varepsilon]$. As illustrated in Figure 4, the clipped PPO objective is governed by the hyperparameter $\varepsilon$, which represents the maximum difference between $P_{\theta_k}$ and $P_\theta$. Here, $P_{\theta_k}(\tau)$ is the policy that interacts with the environment, whereas $P_\theta(\tau)$ has already interacted with the environment. Furthermore, $A_{\theta_k}(s_t, a_t)$ denotes the estimated advantage function at time step t, where J is the number of points sampled with probability $P_\theta(a_k|s_k)$ and $P_{\theta_k}(a_k|s_k)$ is the probability density function with modified parameters $\theta_k$. The clip function is $\mathrm{clip}(x, 1-\varepsilon, 1+\varepsilon) = \min(\max(x, 1-\varepsilon), 1+\varepsilon)$, and the policy $P_{\theta_k}(\tau)$ is updated by maximizing the resulting clipped objective.
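The clipped objective described above can be sketched as the standard PPO-Clip surrogate, the mean of $\min(rA, \mathrm{clip}(r, 1-\varepsilon, 1+\varepsilon)A)$ over sampled pairs. Here the advantage is simply the return minus a baseline value, a simplifying assumption rather than the estimator used in the paper, and the function names are hypothetical.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-Clip objective (to be maximized), averaged over samples."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The elementwise minimum removes any gain from pushing the ratio outside [1-eps, 1+eps].
    return float(np.mean(np.minimum(unclipped, clipped)))

def simple_advantage(returns, values):
    """Advantage estimate A = G - V(s) (assumed baseline form)."""
    return np.asarray(returns, dtype=float) - np.asarray(values, dtype=float)

adv = simple_advantage(returns=[1.0, 1.0, -1.0, -1.0], values=[0.0, 0.0, 0.0, 0.0])
print(clipped_surrogate(ratio=[1.5, 0.5, 0.5, 1.5], advantage=adv, eps=0.2))
```

For a positive advantage the gain is capped once the ratio exceeds $1+\varepsilon$, and for a negative advantage the penalty is kept in full, so the objective is a pessimistic bound on the unclipped one.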

Computational Complexity
Theorem 2. The complexity of Algorithm 2 is given by $O(N + j_{\max} \cdot KN)$.
In line 2 of Algorithm 2, sub-problem (P1) is solved. Each user requires evaluating function (22); since there are N users, the computational complexity of the Lagrangian method is $O(N)$.
In line 3 of Algorithm 2, sub-problem (P2) is solved by Algorithm 1. The most computationally expensive steps are lines 3 and 4 of Algorithm 1, which compute the probability density function parameter $\theta_k$ and the reward function. Thus, their computational complexity is $O(KN)$. Assuming that the maximum number of iterations of Algorithm 1 is $j_{\max}$, the total computational complexity of Algorithm 1 is $O(j_{\max} \cdot KN)$.
To summarize, the overall computational complexity of Algorithm 2 is $O(N + j_{\max} \cdot KN)$. This concludes the proof.

Algorithm 1 (lines 2 to 7):
2: for action = 1, 2, ..., K do
3:   Run policy $\theta_k$ in the environment for K time steps according to (30)
4:   Compute advantage estimates $A_{\theta_k}$ according to (29)
5: end for
6: Optimize surrogate $\theta_k$
7: Calculate $J_{\theta_k}(\theta_k)$ according to (28)

Algorithm 2 (line 2):
2: Solve problem (P1) for given $L_k$, $f_{k,i}^n$ and denote the optimal solution as $P_k^{*C,n}$, $P_k^{*S,n}$.

Simulation Results
In this section, we numerically evaluate the performance of the proposed overall alternating optimization algorithm for intelligent trajectory planning through simulations in MATLAB. The radius of the UAV coverage area was set to 50 m, and the bandwidth allocated to the UAV was set to 10 GHz. We adopted THz carrier frequencies of 300 GHz, 310 GHz, 320 GHz, 330 GHz, 340 GHz, and 350 GHz. The details of the relevant parameters are listed in Table 2. To investigate the convergence behavior of the proposed algorithm, Figure 5 first illustrates the accumulated UAV communication rate versus the number of iterations when the user PPP intensity is $\lambda_u = 0.2$, 0.3, or 0.4 persons per meter. The proposed algorithm provides a higher system sum rate than the greedy sampling algorithm. This is because the PPO algorithm considers the rewards from time slot k + 1 to K + 1, whereas the greedy algorithm selects, at each time slot k, the best result found by extensive random sampling. Without considering other possible cases, the greedy algorithm selects a locally optimal solution each time and never backtracks, so it rarely obtains the optimal solution. This highlights the advantage of the PPO algorithm in achieving a higher system sum rate.

In Figure 6, we compare the system sum rate of the proposed algorithm and the greedy algorithm in the THz and Sub-6G frequency ranges under varying user distributions. The proposed algorithm again provides a higher sum rate than the greedy algorithm, because the greedy algorithm relies on extensive random sampling at time k, whereas the PPO algorithm considers not only the system performance at time k but also the system performance from time k to time K + 1. We also observe that the THz frequency range provides a higher system sum rate than the Sub-6G range.
That is because the signal-to-interference-plus-noise ratio (SINR) is much higher at THz frequencies than at Sub-6G frequencies: the high path loss of the THz channel results in low interference between users. This highlights the importance of an appropriate algorithm for THz frequencies. In Figure 7, the relationship between the maximum communication power and the sensing power is shown. As the maximum communication power increases, the transmitted sensing power first increases; once it reaches its maximum, both the sensing and communication powers begin to decrease to maintain the same communication rate. This is because the communication and sensing signals share the spectrum. When the maximum transmitted communication power increases, the THz-UAV raises its communication power to obtain a higher information rate. Because of constraint C5, the sensing power then becomes smaller, which degrades the precision of THz channel sensing. This in turn affects the THz-UAV's channel allocation and decreases the information rate. Therefore, there exists a peak sensing power at which the delay is minimized.
In Figure 8, we show the spectral efficiency versus the number of users in the THz and Sub-6G frequency ranges for user PPP intensities $\lambda_u = 0.2$ and $\lambda_u = 0.3$. As the number of users increases, the spectral efficiency increases, because the sum information rate grows with the number of users. For the same number of users, a higher user intensity leads to a lower spectral efficiency, since a higher intensity produces stronger interference between users, which reduces the information rate. Therefore, the spectral efficiency of THz wireless communication is sensitive to the user density.

Conclusions
This paper investigated the joint optimization of the UAV trajectory, frequency association, and power, aiming to minimize the sum delay in the terahertz band. The sum delay minimization was formulated as a non-convex mixed-integer optimization problem, which was decomposed into a power allocation sub-problem and a trajectory and frequency association sub-problem. The first sub-problem was solved with the Lagrange multiplier method to obtain the sensing and communication powers, and a PPO algorithm was devised to obtain the UAV trajectory and frequency association. Our results showed that the proposed algorithm achieved good performance, with a significant reduction in the sum delay compared with the greedy algorithm and the Sub-6G frequency scenario, indicating its potential in practical designs. However, the proposed method has not yet been tested on a real UAV. This gap between theory and practice provides a direction for future research.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to legal restrictions.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Proof. The second-order derivatives of objective function (13) with respect to $p_k^{C,n}$ and $p_k^{S,n}$ can be obtained, respectively, where
$$\lambda_1 = \frac{h_k^n(f_{k,i}^n)}{N_0 + p_k^{S,n} h_k^n(f_{k,i}^n) - p_{k-1}^{S,n} h_{k-1}^n(f_{k,i}^n) + \sum \cdots}$$