Deep Reinforcement Learning-Based 3D Trajectory Planning for Cellular Connected UAV

To address the limited application scenarios of connectivity assurance based on two-dimensional (2D) trajectory planning, this paper proposes an improved deep reinforcement learning (DRL)-based three-dimensional (3D) trajectory planning method for cellular-connected unmanned aerial vehicle (UAV) communication. By considering the 3D space environment and integrating factors such as UAV mission completion time and connectivity, we develop an objective function for path optimization and utilize the dueling double deep Q network (D3QN) to optimize it. Additionally, we introduce the prioritized experience replay (PER) mechanism to enhance learning efficiency and expedite convergence. To further aid trajectory planning, our method incorporates a simultaneous navigation and radio mapping (SNARM) framework that generates simulated 3D radio maps and simulates flight processes by utilizing measurement signals from the UAV during flight, thereby reducing actual flight costs. The simulation results demonstrate that the proposed approach effectively enables UAVs to avoid weak coverage regions in space, thereby reducing the weighted sum of flight time and expected interruption time.


Introduction
In recent years, with the rapid advancement of unmanned aerial vehicle (UAV) technology and the increasing maturity of wireless communication technology, UAVs have demonstrated extensive potential in applications such as aerial photography, logistics distribution, and search and rescue. Some researchers have conducted detailed studies in areas such as spectrum for air-to-ground (A2G) communication and wireless communication environments [1,2]. However, existing UAV communication is still inevitably limited by factors such as controller or WiFi connection modes, resulting in restricted communication range, low data transmission rates, and susceptibility to interference. The cellular network is a widely distributed mobile communication network with high capacity. Integrating UAVs into the cellular network can extend communication distance, achieve higher data transmission rates and lower latency, and supplement positioning accuracy in adverse weather conditions or when obstacles affect GPS signals, thereby mitigating environmental impacts on communications. Consequently, cellular-connected UAV communication has emerged as a promising research area [3].
Despite the aforementioned advantages of cellular-connected UAV communication, several challenges remain. Firstly, in order to serve a larger number of ground users, the antenna orientation of the ground base station (GBS) is typically optimized for ground coverage, which may result in inadequate aerial coverage [4]. Secondly, three-dimensional (3D) obstacles such as buildings may obstruct the communication link [5]. Additionally, as prior research on the A2G channel model indicates [6][7][8], cellular-connected UAVs may encounter significant signal interference because of line-of-sight channels between the UAVs and non-associated base stations (BSs), as shown in Figure 1. To address these issues, the authors in [9], building on an analysis of uplink/downlink 3D coverage performance, introduced the generalized Poisson multinomial distribution to model interference and demonstrated the impact of different GBS antenna downtilt angles on 3D coverage. Additionally, Ref. [10] employed deep reinforcement learning to train aerial BS layout decision strategies, thereby enhancing coverage in complex environments. Ref. [11] investigated the performance of cellular-connected UAVs under actual antenna configurations and revealed how the number of antenna elements influences coverage probability and handover rate. Ref. [12] optimized the downtilt angle of GBS antennas to maximize received signal quality for UAVs while ensuring throughput performance for ground users and reducing the number of handovers. To mitigate strong ground-to-air interference, various anti-interference techniques have been proposed [13][14][15][16]. For instance, Ref. [13] presented a cooperative interference cancellation strategy for multi-beam UAV uplink communication that aims to eliminate co-channel interference at each occupied GBS while maximizing the sum rate over the available GBSs.
Efficient path planning should ensure good air-ground communication conditions, high data transmission rates, and reliable connectivity while minimizing unnecessary UAV movements, thereby enhancing energy efficiency [17,18]. In [19], the problem of shortest path planning under a minimum received SNR constraint was investigated. In [20,21], the authors employed graph theory to design the shortest path under a minimum SINR constraint and derived an optimal UAV path by solving an equivalent shortest path problem. Ref. [22] proposed the construction of a received signal strength map using a distributed recursive Gaussian process regression framework. This approach achieves higher positioning accuracy with lower complexity and storage requirements, making it an efficient solution for positioning applications. Similar problems have been addressed in [23][24][25]. Some traditional trajectory optimization schemes simplify the channel model in various environments for ease of solution. However, environmental models such as those assuming a simplified channel path loss or isotropic antenna radiation are not applicable to real-world scenarios [26]. Moreover, the trajectory optimization problem is non-convex, and its complexity increases dramatically with the number of optimization variables, which makes it difficult to solve effectively. Fortunately, machine learning techniques have emerged as another solution for non-convex optimization problems. For instance, Ref. [27] presents a two-dimensional (2D) radio map-based approach for path planning in conjunction with machine learning techniques. Nevertheless, 2D path planning is limited in applicability and is susceptible to local optima; thus, further research should focus on 3D path planning.
Some recent studies, such as [28], have proposed a Multi-Layer Trajectory Planning (MTTP) method, addressing the challenges of ensuring air-to-ground communication services and avoiding collisions in complex urban environments. The work in [29] introduces a two-step centralized development system for 3D path planning of drone swarms. Additionally, both [30] and [31] take into account energy consumption during the 3D trajectory planning process for UAVs. Ref. [32] proposed collaborative UAV trajectory optimization using federated learning to overcome challenges in ensuring reliable connectivity in 3D space. In this paper, we propose a novel synchronous path planning approach based on an improved deep reinforcement learning (DRL) algorithm, integrated with radio mapping techniques, to optimize the 3D trajectory of the UAV. This method aims to navigate UAVs efficiently by avoiding areas with weak communication coverage and reaching the destination in minimal time. The major contributions and novelties of this paper are summarized as follows:

• We propose a 3D path optimization strategy that aims to minimize the weighted sum of task completion time and communication interruption time, thereby enhancing the efficiency and reliability of the system.

• We employ a multi-step dueling double deep Q network (D3QN) method combined with the prioritized experience replay (PER) mechanism to efficiently optimize the proposed objective function and acquire the optimal path.

• We propose a simultaneous navigation and radio mapping (SNARM) framework that leverages 3D radio mapping and simulates flight processes to reduce the cost of real flights while enhancing learning accuracy.

The remainder of the article is organized as follows. Section 2 introduces the system model and problem formulation. Section 3 presents the improved DRL-based 3D path planning strategy. The simulation results and analysis are provided in Section 4. The conclusions are drawn in Section 5.

3D Flight Environment Model
In this paper, the UAV operates in the airspace above a dense urban area measuring 2 km × 2 km. The heights and locations of urban buildings are generated using the statistical model recommended by the International Telecommunication Union (ITU). This model involves three parameters: $\alpha_{bd}$, which represents the ratio of land area covered by buildings to the total land area; $\beta_{bd}$, which denotes the average number of buildings per unit area; and $\gamma_{bd}$, a variable determining the distribution of building heights, which follows a Rayleigh distribution with mean value $\sigma_{bd}$ ($\sigma_{bd} > 0$). Figure 2 shows the 2D views of one particular realization of the building locations and heights with $\alpha_{bd} = 0.3$, $\beta_{bd} = 300$ buildings/km², and $\sigma_{bd} = 50$ m. For convenience, the building height is clipped so as not to exceed 70 m.
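As a rough illustration of how one realization of such a building layout might be drawn, the sketch below samples building positions and Rayleigh-distributed heights. The uniform random placement, the square footprints, and the function name `generate_buildings` are assumptions made for this example; the paper does not state how positions are laid out.

```python
import math
import random


def generate_buildings(alpha_bd=0.3, beta_bd=300.0, mean_height=50.0,
                       area_km=2.0, max_height=70.0, seed=0):
    """Return a list of (x_m, y_m, side_m, height_m) square buildings.

    Placement is uniform at random (an assumption, not the paper's layout).
    """
    rng = random.Random(seed)
    # beta_bd buildings per km^2 over an area_km x area_km region.
    n_buildings = round(beta_bd * area_km * area_km)
    # alpha_bd / beta_bd is the mean footprint per building in km^2;
    # take its square root and convert to metres for a square footprint.
    side_m = math.sqrt(alpha_bd / beta_bd) * 1000.0
    # A Rayleigh variable with scale s has mean s * sqrt(pi / 2).
    scale = mean_height / math.sqrt(math.pi / 2.0)
    buildings = []
    for _ in range(n_buildings):
        x = rng.uniform(0.0, area_km * 1000.0)
        y = rng.uniform(0.0, area_km * 1000.0)
        # Inverse-CDF sampling of the Rayleigh distribution.
        h = scale * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        # Clip the height as described in the text.
        buildings.append((x, y, side_m, min(h, max_height)))
    return buildings


buildings = generate_buildings()
```

With the parameter values quoted above, this yields 1200 buildings with a footprint side of roughly 31.6 m each.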
The UAV's flight parameters include a flying height ranging from $h_{min}$ to $h_{max}$, a constant flight speed of $V$ m/s, and the UAV's position at any given moment, denoted as $q(t) = (x_t, y_t, h_t)$. The starting and ending points of the UAV's flight are represented as $q_s = (x_s, y_s, h_s)$ and $q_f = (x_f, y_f, h_f)$, respectively.
Within the target area, a total of 7 GBSs are distributed in a honeycomb pattern, as indicated by black star markers in Figure 2. The GBS antenna stands at a height of $h_{bs}$, and each GBS site comprises 3 sectors, resulting in a total of $M = 21$ sectors. The GBS antenna is a vertically oriented 8-element uniform linear array (ULA) with a half-power beamwidth of 65° in both horizontal and vertical directions. The main lobe is tilted 10° toward the ground, forming a directional antenna array.

Reception Signal Model
In the system model, we simulate path loss using the Urban Microcell (UMi) model specified by 3GPP. It is worth noting that the statistical building model has been widely used to estimate the line-of-sight (LoS) probability of ground-to-air links [33]. However, this model only reflects the average characteristics of large-scale geographic areas with similar types of terrain. For each local area with given building positions and heights, the presence or absence of LoS links with cellular base stations can be accurately determined by examining whether the communication path between the base stations and the UAV is obstructed by any building. The path loss for the LoS link between the UAV and sector $m$ is represented as follows

$$h^{\mathrm{LoS}}_m(t) = \max\left\{ h^{\mathrm{FSPL}}_m,\; 30.9 + (22.25 - 0.5\log_{10} h_t)\log_{10} d_m(t) + 20\log_{10} f_c \right\}$$

where $h^{\mathrm{FSPL}}_m$ represents the free-space path loss, $h_t$ represents the altitude of the UAV at time $t$, $d_m(t)$ represents the distance between the UAV and sector $m$, and $f_c$ is the carrier frequency. When the communication path between base station sector $m$ and the UAV is obstructed by obstacles, a non-line-of-sight (NLoS) channel is formed, characterized by a path loss denoted as $h^{\mathrm{NLoS}}_m(t)$.

The channel gain between the UAV and sector $m$, denoted as $h_m(t)$, is primarily determined by three factors: GBS antenna gain, large-scale channel fading, and small-scale fading. According to [34], the received instantaneous signal power at the UAV from sector $m$ can be mathematically expressed as

$$p_m(t) = P_m\,\beta(q(t))\,\bar{h}_m(q(t))\,\tilde{h}_m(t)$$

where the constant $P_m$ represents the transmit power of the GBS in sector $m$, while $\beta(q(t))$ and $\bar{h}_m(q(t))$ respectively denote the GBS antenna gain and the large-scale channel fading. The variable $\tilde{h}_m(t)$ signifies the channel gain under small-scale fading, and $\bar{h}_m(q(t))$ is determined by whether a building lies between the UAV and the GBS:

$$\bar{h}_m(q(t)) = \begin{cases} h^{\mathrm{LoS}}_m(q(t)), & \text{LoS} \\ h^{\mathrm{NLoS}}_m(q(t)), & \text{NLoS.} \end{cases}$$
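The LoS path-loss expression can be evaluated directly, as in the minimal sketch below. The units are assumptions made for this example (distance and altitude in metres, carrier frequency in GHz), as is the free-space term `fspl_db`, which uses the standard free-space formula with the constant 32.45 that corresponds to those units; the function names are hypothetical.

```python
import math


def fspl_db(d_m, fc_ghz):
    """Free-space path loss in dB (assumed units: metres, GHz)."""
    return 32.45 + 20.0 * math.log10(fc_ghz) + 20.0 * math.log10(d_m)


def los_path_loss_db(d_m, h_m, fc_ghz):
    """LoS path loss: the larger of free-space loss and the altitude-dependent model."""
    model = (30.9
             + (22.25 - 0.5 * math.log10(h_m)) * math.log10(d_m)
             + 20.0 * math.log10(fc_ghz))
    return max(fspl_db(d_m, fc_ghz), model)


# Example: UAV at 100 m altitude, 100 m and 200 m from the sector, 2 GHz carrier.
pl_near = los_path_loss_db(100.0, 100.0, 2.0)
pl_far = los_path_loss_db(200.0, 100.0, 2.0)
```

The `max` keeps the modelled loss from ever dropping below the free-space bound, which is the role the free-space term plays in the expression.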
The sector associated with the UAV at time $t$ is denoted as $b(t) \in \{1, \cdots, M\}$. Consequently, the downlink instantaneous SIR can be mathematically formulated as

$$\gamma(t) = \frac{p_{b(t)}(t)}{\sum_{m \neq b(t)} p_m(t)}$$

where $p_m(t)$ denotes the instantaneous signal power received at the UAV from sector $m$. The small-scale fading $\tilde{h}_m(t)$ introduces randomness to the variable $\gamma(t)$ at any given location $q(t)$ and associated cell $b(t)$. To assess the reliability of the link between the UAV and its associated cell, we introduce the interrupt probability function as follows

$$P_{out}(q(t), b(t)) = \Pr\{\gamma(t) < \gamma_{th}\} \quad (6)$$

That is, the GBS-UAV connection is considered interrupted when the SIR $\gamma(t)$ falls below the interruption threshold $\gamma_{th}$, and $\Pr\{\cdot\}$ denotes the probability of the event.
Since $P_{out}(q(t), b(t))$ cannot be obtained directly, we reformulate the instantaneous $\gamma(t)$ as a function of $q(t)$, $b(t)$, and the small-scale fading $\tilde{h}_{b(t)}$. Subsequently, we define the interrupt indicator function as follows

$$c(q(t), b(t), \tilde{h}_{b(t)}) = \begin{cases} 1, & \gamma(q(t), b(t), \tilde{h}_{b(t)}) < \gamma_{th} \\ 0, & \text{otherwise.} \end{cases}$$

Then, the interrupt probability function in (6) can be expressed as the expectation over the small-scale fading $\tilde{h}_{b(t)}$, i.e.,

$$P_{out}(q(t), b(t)) = \mathbb{E}_{\tilde{h}_{b(t)}}\left[ c(q(t), b(t), \tilde{h}_{b(t)}) \right].$$

The interruption probability at each time point $t$ is estimated by having the UAV conduct $J$ signal measurements on the $M$ sectors within a short duration. The $j$-th measurement of the small-scale fading is denoted as $\tilde{h}_{b(t)}[t, j]$, and the corresponding SIR and interrupt indicator function are denoted as $\gamma(q(t), b(t), \tilde{h}_{b(t)}[t, j])$ and $c(q(t), b(t), \tilde{h}_{b(t)}[t, j])$. The empirical communication interruption probability can therefore be expressed as

$$\hat{P}_{out}(q(t), b(t)) = \frac{1}{J} \sum_{j=1}^{J} c(q(t), b(t), \tilde{h}_{b(t)}[t, j]).$$

According to the law of large numbers, $\hat{P}_{out}(q(t), b(t))$ provides an accurate approximation of the actual interruption probability at $q(t)$ when $J$ is sufficiently large. The optimal associated cell can be denoted as

$$b^*(t) = \arg\min_{b \in \{1, \cdots, M\}} \hat{P}_{out}(q(t), b)$$

where $\arg\min$ signifies the argument that minimizes the corresponding function, and the estimate of the interruption probability at any given location can be calculated by

$$\hat{P}_{out}(q(t)) = \hat{P}_{out}(q(t), b^*(t)).$$

According to the aforementioned analysis, the expected interruption probability of the UAV at any given location can be derived, enabling the construction of a 3D coverage probability graph (CPG). The constructed coverage probability map is shown in Section 4, where the coverage probability is $\hat{P}_{coverage}(q(t)) = 1 - \hat{P}_{out}(q(t))$.
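The measurement-based estimate described above amounts to a Monte Carlo average of the interrupt indicator, followed by choosing the cell with the lowest estimated interruption probability. The sketch below illustrates this under stated assumptions: small-scale fading is modelled as a unit-mean exponential power gain (Rayleigh fading), the mean received powers are given in linear scale, and `estimate_outage` is a hypothetical name. None of these choices is taken from the paper.

```python
import random


def estimate_outage(mean_rx_power, gamma_th=1.0, J=1000, seed=0):
    """Empirical interruption probability per candidate cell.

    mean_rx_power[m]: average received power from sector m (linear scale).
    gamma_th: SIR threshold in linear scale (1.0 corresponds to 0 dB).
    Returns (b_star, p_out), where p_out[b] is the fraction of the J
    measurements in which the SIR with serving cell b fell below gamma_th.
    """
    rng = random.Random(seed)
    M = len(mean_rx_power)
    outages = [0] * M
    for _ in range(J):
        # Assumed fading model: unit-mean exponential power gain.
        p = [mean_rx_power[m] * rng.expovariate(1.0) for m in range(M)]
        total = sum(p)
        for b in range(M):
            interference = total - p[b]
            sir = p[b] / interference if interference > 0 else float("inf")
            if sir < gamma_th:
                outages[b] += 1
    p_out = [count / J for count in outages]
    # The associated cell is the one with the smallest estimated probability.
    b_star = min(range(M), key=lambda b: p_out[b])
    return b_star, p_out


b_star, p_out = estimate_outage([10.0, 1.0, 1.0])
```

In this toy setting the first sector is ten times stronger on average than the others, so it is selected as the associated cell.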

UAV Motion Model
The rotor UAV utilized in this experiment primarily consumes energy in two ways. The first is communication, encompassing signal processing, radiation, and circuitry. The second is propulsion, which is essential for sustaining the UAV's flight and movement. The communication-related energy consumption of UAVs is considered negligible because it is typically much smaller than the propulsion energy [35]. According to [36], the instantaneous propulsion power of a rotor UAV with a velocity of $V$ can be expressed as

$$P(V) = P_0\left(1 + \frac{3V^2}{U_{tip}^2}\right) + P_i\left(\sqrt{1 + \frac{V^4}{4v_0^4}} - \frac{V^2}{2v_0^2}\right)^{1/2} + \frac{1}{2} d_0 \rho s A V^3$$

where $P_0$ and $P_i$ are constants, representing the UAV's blade profile power and induced power in the hovering state, respectively. $v_0$ represents the mean rotor induced velocity in hover, $U_{tip}$ signifies the tip speed of the rotor blade, and $d_0$ and $s$ denote the fuselage drag ratio and rotor solidity, respectively. $\rho$ and $A$ denote air density and rotor disc area, respectively. In a given environment, with all environmental parameters and UAV settings held constant, the power required for UAV flight remains constant for a given speed. Therefore, the energy consumed by the rotary-wing UAV during time $T$ can be expressed as $E = \int_0^T P(V)\,dt = P(V)T$. It follows that the energy consumption of the UAV is directly proportional to its flight time: longer flights consume more energy.
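To make the proportionality between energy and flight time concrete, the sketch below evaluates a rotary-wing propulsion power model with blade-profile, induced, and parasite terms, which is the form commonly used in the literature and assumed here. The default parameter values are illustrative placeholders chosen for this example and are not taken from this paper.

```python
import math


def propulsion_power(V, P0=79.86, Pi=88.63, U_tip=120.0, v0=4.03,
                     d0=0.6, rho=1.225, s=0.05, A=0.503):
    """Propulsion power in watts at speed V (m/s); parameter values are illustrative."""
    blade = P0 * (1.0 + 3.0 * V ** 2 / U_tip ** 2)
    induced = Pi * math.sqrt(
        math.sqrt(1.0 + V ** 4 / (4.0 * v0 ** 4)) - V ** 2 / (2.0 * v0 ** 2)
    )
    parasite = 0.5 * d0 * rho * s * A * V ** 3
    return blade + induced + parasite


def flight_energy(V, T):
    """Energy in joules for flying at constant speed V for T seconds."""
    return propulsion_power(V) * T


hover_power = propulsion_power(0.0)      # equals P0 + Pi
cruise_power = propulsion_power(10.0)
```

At constant speed the power is a fixed number, so doubling the flight time doubles the energy, which is why flight time can stand in for energy in the objective.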
In this study, we use flight time rather than energy consumption as the performance metric. By imposing a maximum flight time constraint, we ensure the safe operation of UAVs. Additionally, we introduce the concept of communication interruption time, denoted as $\int_0^T \hat{P}_{out}(q(t))\,dt$, to represent the communication quality of UAVs within a given time period. The main objective of our study is to train UAVs to acquire optimal flight strategies. If UAVs focused solely on energy consumption, they would choose the shortest path from the starting point to the destination, inevitably compromising the communication quality between UAVs and associated ground stations. Similarly, if only communication quality were prioritized, the energy consumption of UAVs would increase significantly. To address this trade-off, we introduce a weighting coefficient, denoted as $\mu$, which combines the flight time and the estimated interruption time of the UAV. By minimizing the weighted sum of both factors, we aim to balance energy consumption against communication quality between UAVs and ground stations.
Based on the obtained CPG of the 3D space, the optimization problem can be formulated as follows

$$\min_{T,\,\{q(t)\}} \; T + \mu \int_0^T \hat{P}_{out}(q(t))\,dt \quad (13)$$

subject to $q(0) = q_s$ (14), $q(T) = q_f$ (15), the velocity constraint (16), and the constraints (17)-(19) on the 3D motion space of the UAV. Here $\mu$ represents the trade-off between the flight time and the expected outage time. A higher value of $\mu$ indicates a greater emphasis on maintaining connectivity between the UAV and the GBS, but at the cost of potentially increased travel distance for the UAV. The constraints on the starting and ending positions are represented by (14) and (15), while the limitation on the UAV velocity is denoted by (16).
The path planning problem can be formulated as a Markov decision process (MDP) that is amenable to solution using DRL. However, the continuous optimization in (13) is challenging because of the complexity arising from continuous state and action spaces, which often leads to instability or non-convergence during DRL training. To mitigate these issues, we convert problem (13) into the discrete-time problem (20) by dividing the flight duration into $N$ steps of length $\Delta t$, with the constraints discretized accordingly, starting from $q(0) = q_s$ in (21). Here $T = N\Delta t$, $\Delta s = a\Delta t$, and the time interval should be sufficiently small so that within each time step the distance between the UAV and any GBS in the target area remains approximately constant, and both the antenna gain and the channel state parameters between the UAV and the GBS remain nearly constant.
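One way to read the discretization is as a Riemann sum: with step length Δt, the flight time becomes NΔt and the interruption-time integral becomes Δt times the sum of the estimated interruption probabilities along the path. The sketch below computes this discrete weighted cost; the function name `weighted_time` and the callable `p_out` are hypothetical stand-ins for the radio map lookup.

```python
def weighted_time(path, p_out, mu, dt):
    """Discrete approximation of T + mu * integral of P_out over the flight.

    path: sequence of visited positions q_1, ..., q_N (one per time step).
    p_out: callable returning the estimated interruption probability at q.
    """
    return dt * sum(1.0 + mu * p_out(q) for q in path)


# Example: four steps of 0.5 s through a region with interruption probability 0.5.
example_path = [(0, 0, 100), (10, 0, 100), (20, 0, 100), (30, 0, 100)]
cost = weighted_time(example_path, lambda q: 0.5, mu=2.0, dt=0.5)
```

Each step contributes Δt of flight time plus µΔt times the local interruption probability, so a detour is worthwhile only when the coverage it gains outweighs the extra steps.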

3D Path Planning Based on Improved DRL
To address problem (20), we employ the multi-step D3QN model in DRL to optimize the objective function, and use the PER mechanism instead of the conventional random experience replay (RER) approach to enhance learning efficiency and expedite convergence. Moreover, to assist path planning, a radio mapping network is incorporated to generate simulated 3D radio maps and simulate flight processes. This not only reduces flight costs but also enhances the accuracy of the D3QN network model. The improved DRL-based framework is then applied to UAV path planning, enabling efficient identification of an optimal route that satisfies all constraints from any given starting point.

Multi-Step D3QN Model
In this section, we will briefly introduce the relevant knowledge of DRL and provide an overview of the specific components of the multi-step D3QN model employed in this paper.
In the reinforcement learning model, the agent and the environment play crucial roles. The agent selects an action $a_n$ based on the current state $s_n$ provided by the environment; its state then changes to $s_{n+1}$ according to the state transition function, and a reward $r_{n+1}$ is fed back by the environment. By iterating this process, the agent can converge toward an optimal strategy within a specific environment. Executing this optimal policy maximizes the cumulative reward $G_n$, which can be defined as

$$G_n = \sum_{k=0}^{\infty} \gamma^k r_{n+k+1}$$

where $\gamma \in [0, 1]$ represents the discount factor, i.e., the rate at which future rewards are discounted. A higher value of $\gamma$ signifies greater emphasis on long-term gains, while a lower value indicates prioritization of short-term benefits.
Since the cumulative reward $G_n$ is unknown prior to the completion of the agent's trajectory, we use its expected value rather than its actual value to derive the action value function

$$Q^{\pi}(s_n, a_n) = \mathbb{E}_{\pi}\left[ G_n \mid s = s_n, a = a_n \right]$$

where $\pi(a_n, s_n) = P[a = a_n \mid s = s_n]$ represents the policy, denoting the probability of action $a_n$ being performed by the agent while in state $s_n$. The action value function $Q^{\pi}$ signifies the expected return obtained by adhering to a given policy $\pi(a_n, s_n)$. If there exists a strategy function capable of selecting the optimal action for the agent at each state during its trajectory, it is referred to as the optimal strategy $\pi^*(s)$.
Under this guidance, the optimal action value function $Q^*(s, a)$ can be expressed as follows

$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a).$$

In principle, by exhaustively traversing all possible sequences $(s_n, a_n, r_{n+1}, s_{n+1})$ and iteratively optimizing, we can obtain the optimal value $Q^*(s, a)$ and subsequently determine the optimal strategy $\pi^*(s)$. However, to address the limitation of Q-learning in dealing with continuous high-dimensional state or action spaces, we employ the classical DQN network model instead of the Q table as a function approximator, and update the network parameters by minimizing the loss function

$$L(\theta) = \mathbb{E}\left[\left( r_{n+1} + \gamma \max_{a'} Q(s_{n+1}, a' \mid \theta) - Q(s_n, a_n \mid \theta) \right)^2\right] \quad (31)$$

where $\theta$ denotes the neural network parameter vector. However, the direct utilization of (31) in the standard training algorithm may overestimate the Q value, leading to learning instability and inefficiency. To address this challenge, we introduce Double DQN to mitigate overestimation. This approach separates action selection from value estimation in the target: the policy network determines the optimal action, and the target network estimates the corresponding Q value. In accordance with the Double DQN model, we can reformulate the loss function as

$$L(\theta) = \mathbb{E}\left[\left( r_{n+1} + \gamma\, Q\big(s_{n+1}, \arg\max_{a'} Q(s_{n+1}, a' \mid \theta) \,\big|\, \theta'\big) - Q(s_n, a_n \mid \theta) \right)^2\right]$$

where $\theta'$ denotes the parameter vector of the target network. Additionally, to learn state value information more effectively and to address the bias-variance trade-off in training, this study introduces the dueling network and n-step bootstrapping techniques to improve the Double DQN model; the improved model is referred to as the multi-step D3QN model. The dueling network models the state value function $V(s)$ and the action advantage function $A(s, a)$ separately, enabling the network to learn the relative value of each state as well as the advantages of different actions. By decomposing the network's output into state value and action advantages and combining the two, we obtain the Q value, which can be expressed as

$$Q(s, a \mid \theta, \alpha, \beta) = V(s \mid \theta, \beta) + \left( A(s, a \mid \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a' \mid \theta, \alpha) \right)$$

where $\alpha$ and $\beta$ are the parameters of the advantage stream and the value stream, respectively, and $|A|$ represents the size of the action space. Multi-step bootstrapping is an improved learning scheme in reinforcement learning that aims to improve learning efficiency by considering the rewards of $N_1$ future time steps, which can be expressed as

$$R_{n:n+N_1} = \sum_{k=0}^{N_1 - 1} \gamma^k r_{n+k+1}.$$

It is worth noting that the return accumulates over at most $N_1$ steps. The loss function of the D3QN model incorporating multi-step bootstrapping can be summarized as follows

$$L(\theta) = \mathbb{E}\left[\left( R_{n:n+N_1} + \gamma^{N_1} Q\big(s_{n+N_1}, \arg\max_{a'} Q(s_{n+N_1}, a' \mid \theta) \,\big|\, \theta'\big) - Q(s_n, a_n \mid \theta) \right)^2\right] \quad (35)$$
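The three ingredients of the multi-step D3QN target (dueling aggregation, the multi-step return, and double-Q action selection) can be illustrated with plain lists, as in the sketch below. This is a minimal sketch of the standard forms of these techniques, not the paper's implementation; the function names and the `terminal` flag are assumptions for this example.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q values.

    The mean advantage is subtracted so that V and A are identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def n_step_return(rewards, gamma):
    """Discounted sum of the next len(rewards) rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def multi_step_double_target(rewards, gamma, q_online_next, q_target_next,
                             terminal=False):
    """Multi-step target: n-step return plus a discounted bootstrap value.

    The online network picks the action; the target network evaluates it.
    """
    ret = n_step_return(rewards, gamma)
    if terminal:
        return ret
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return ret + (gamma ** len(rewards)) * q_target_next[best]


q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
target = multi_step_double_target([-1.0, -1.0], 0.5, [0.2, 0.9], [5.0, 3.0])
```

In the example, the online network prefers action 1, so the bootstrap term uses the target network's value for action 1 (3.0) rather than its own maximum (5.0), which is how the double-Q scheme limits overestimation.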

Priority Experience Replay
Experience replay is a crucial technique in DRL. Its fundamental concept involves storing the experiences acquired through agent-environment interactions and sampling them randomly for learning, thereby reducing sample correlation. However, uniformly random sampling may overlook crucial experiences, which degrades learning efficacy and, consequently, the effectiveness of UAV path planning. To address this issue, we employ PER instead of traditional RER by assigning a priority to each experience. During sampling, experiences with higher priority are more likely to be selected, thus enhancing training efficiency.
The PER mechanism assigns sampling weights based on the absolute value of the temporal difference error (TD-error). In this mechanism, the priority of each experience is set to $p_i = |\delta_i| + \sigma$, where $|\delta_i|$ represents the absolute TD-error, and the parameter $\sigma$ is a constant greater than 0, which ensures that all $p_i > 0$. Notably, higher TD-errors correspond to greater experience priorities. Consequently, the sampling probability for each experience can be defined as follows

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

where the hyperparameter $\alpha \geq 0$ controls the intensity of prioritized replay, and $\alpha = 0$ corresponds to traditional random experience replay, in which each experience is sampled with equal probability. Additionally, $\sum_k p_k^{\alpha}$ denotes the sum of all experience priorities in the buffer. The sampling probability $P(i)$ can be utilized for calculating the loss function.
To mitigate the computational complexity arising from priority sampling as the number of experiences increases, we employ a sum-tree data structure to store priorities and conduct sampling operations. Given a sample size of $k$, the priority range $[0, \sum_i p_i^{\alpha}]$ is divided evenly into $k$ intervals. A random value is generated in each interval, and the corresponding transition sample is extracted from the sum-tree. However, prioritized sampling changes the data distribution and therefore introduces bias. To compensate for this bias, importance sampling weights are introduced and can be expressed as follows

$$w_i = \left( \frac{1}{|D| \, P(i)} \right)^{\beta}$$

where $|D|$ is the number of experiences stored in the replay buffer $D$ and $\beta$ is a hyperparameter that determines how much PER affects the convergence result. The loss function in (35) can then be rewritten as

$$L(\theta) = \mathbb{E}\left[ w_j \left( y_j - Q(q_j, a_j \mid \theta) \right)^2 \right]$$

where $y_j$ denotes the multi-step D3QN target of the $j$-th sampled sequence.
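A sum-tree stores priorities at the leaves of a binary tree whose internal nodes hold the sum of their children, so that sampling proportional to priority takes logarithmic time. The sketch below is a minimal illustration, assuming the capacity is a power of two, that stored priorities are already raised to the power α, and that weights are normalized by the largest weight in the batch (a common practice, not stated in the paper). The names `SumTree` and `sample_batch` are hypothetical.

```python
import random


class SumTree:
    """Array-backed sum-tree; capacity must be a power of two in this sketch."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # leaves live at [capacity, 2*capacity)
        self.data = [None] * capacity
        self.next_slot = 0
        self.size = 0

    def total(self):
        return self.tree[1]

    def update(self, slot, priority):
        i = slot + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:  # refresh the sums on the path to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def add(self, item, priority):
        self.data[self.next_slot] = item
        self.update(self.next_slot, priority)
        self.next_slot = (self.next_slot + 1) % self.capacity  # overwrite oldest
        self.size = min(self.size + 1, self.capacity)

    def find(self, mass):
        """Return the slot whose cumulative priority interval contains mass."""
        i = 1
        while i < self.capacity:
            left = 2 * i
            if mass <= self.tree[left]:
                i = left
            else:
                mass -= self.tree[left]
                i = left + 1
        return i - self.capacity


def sample_batch(tree, k, beta, rng):
    """Stratified sampling: one draw from each of k equal priority intervals."""
    segment = tree.total() / k
    batch = []
    for j in range(k):
        mass = min(rng.uniform(j * segment, (j + 1) * segment), tree.total())
        slot = tree.find(mass)
        prob = tree.tree[slot + tree.capacity] / tree.total()
        weight = (1.0 / (tree.size * prob)) ** beta  # importance-sampling weight
        batch.append((slot, tree.data[slot], weight))
    max_w = max(w for _, _, w in batch)
    return [(s, d, w / max_w) for s, d, w in batch]


tree = SumTree(4)
for name, priority in [("e0", 1.0), ("e1", 2.0), ("e2", 3.0), ("e3", 4.0)]:
    tree.add(name, priority)
batch = sample_batch(tree, k=4, beta=0.4, rng=random.Random(0))
```

After training on a batch, the priorities of the sampled slots would be refreshed with `tree.update(slot, new_priority)` using the new TD-errors.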

SNARM Framework
Due to the lack of prior environmental knowledge, relying solely on actual UAV flights not only incurs high training costs and a slow learning process, but also poses a significant risk of accidents. To address this issue, we propose the SNARM framework, which utilizes UAV measurement signals during flight to generate a simulated 3D radio map and create virtual flight trajectories. In this way, the UAV can predict the expected outcome of each path without physically traversing it, thereby reducing the cost of measured flights and mitigating potential risks. Furthermore, we employ the Dyna framework to integrate simulated experience with real-world experience when updating the UAV flight strategy, thus enhancing the accuracy of the neural network. The Dyna framework is shown in Figure 3. Notably, simulated trajectories are used more frequently than actual ones: for each real episode flown by the UAV, Ñ = min([n/200], 5) simulated episodes are performed. Initially, the map model is inaccurate, so simulated experience has little reference value and contributes less to neural network updates. As the accuracy of the learned map model improves, a larger proportion of simulated experience is involved in network updates. Since acquiring simulated experience does not require actual UAV measurements, its proportion can be increased appropriately without additional UAV operating cost.
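The schedule for the number of simulated episodes per real episode can be written as a one-line function. The sketch below interprets the bracket in Ñ = min([n/200], 5) as rounding down, which is an assumption; the function name and keyword arguments are hypothetical.

```python
def simulated_episodes(n, period=200, cap=5):
    """Number of simulated episodes run after real episode n.

    Assumes [n / period] means floor division; grows by one every
    `period` real episodes and saturates at `cap`.
    """
    return min(n // period, cap)


schedule = [simulated_episodes(n) for n in (0, 199, 200, 400, 1000, 5000)]
```

Under this reading, no simulated episodes are run during the first 200 real episodes, when the learned radio map is least reliable, and the ratio reaches its cap of five from episode 1000 onward.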

Path Planning Based on Improved DRL
In the enhanced DRL model, the UAV functions as an autonomous agent that selects an action based on its current state, receives a reward from the environment, and transitions to the next state. The state space, action space, and reward function are defined as follows:

• State: The state serves as the input of the neural network, representing the UAV's 3D position. The state space $S$ encompasses all potential UAV positions within the terrain of interest, $S = \{q : q_s \leq q_n \leq q_f\}$. For each episode, the initial location of the UAV is randomly generated, while the final location is predetermined.

• Action: The action space $A$ corresponds to the UAV flying direction. Considering the limited vertical range of the UAV's flying area, the action space consists of 10 directions, including 8 horizontal directions spaced 45 degrees apart, as well as the upward and downward directions, as shown in Figure 4. The selection of the UAV motion direction relies on the model's estimate of the Q value for each direction at a specific position.

• Reward: The reward is defined as $R(q) = -1 - \mu \hat{P}_{out}(q)$. The UAV incurs a penalty of 1 for each step taken before reaching the endpoint $q_f$. Additionally, if it enters an area with weak coverage, it is penalized by the interruption probability weighted by $\mu$. This encourages the UAV to consider both flight time and interruption time when determining the optimal path toward the endpoint.
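The action set and reward described above can be sketched as follows. The unit direction vectors, the fixed step length, and the function names are assumptions for illustration; the paper's actual step size and coordinate conventions are not given in this section.

```python
import math


def build_actions():
    """Ten unit direction vectors: 8 horizontal (45 degrees apart), up, and down."""
    actions = []
    for k in range(8):
        angle = math.radians(45.0 * k)
        actions.append((math.cos(angle), math.sin(angle), 0.0))
    actions.append((0.0, 0.0, 1.0))    # upward
    actions.append((0.0, 0.0, -1.0))   # downward
    return actions


def step(q, action, step_len):
    """Move from position q along a direction vector by step_len metres."""
    return tuple(c + step_len * d for c, d in zip(q, action))


def reward(p_out_next, mu):
    """Per-step reward: -1 for the elapsed step, minus mu times the interruption probability."""
    return -1.0 - mu * p_out_next


ACTIONS = build_actions()
q_next = step((0.0, 0.0, 100.0), ACTIONS[8], 10.0)
```

Because every step costs at least 1, the return of an episode is the negative of the step count plus the µ-weighted sum of interruption probabilities, which mirrors the weighted-time objective.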
2: Initialize the sliding window $W$ of size $N$, the actual starting position $q_s$, the simulated starting position $\tilde{q}_s$, and the flight steps $n = 0$, $\tilde{n} = 0$
3: Select the action $a_n$ with the ε-greedy policy
4: Perform action $a_n$ to obtain the next state $q_{n+1}$, measure the probability of communication interruption $\hat{P}_{out}(q_{n+1})$, and save it in map network $E$
5: Update the map network $E$ with $(q_{n+1}, \hat{P}_{out}(q_{n+1}))$
6: Set the single-step reward $R_n = -1 - \mu \hat{P}_{out}(q_{n+1})$ and store the sequence $(q_n, a_n, R_n, q_{n+1})$ in the sliding window $W$
7: Extract $B$ sequences $(q_j, a_j, R_{j:j+N_1}, q_{j+N_1})$ and their priorities $w_j$ from $D$ according to the PER mechanism
9: Set the target value $y_j$
10: Perform a gradient descent step on $w_j (y_j - Q(q_j, a_j \mid \theta))^2$ with respect to the network parameters $\theta$
11: Perform steps (3-4, 6-10) for the simulated experience, where the interrupt probability of $q_{n+1}$ is predicted by map network $E$
17: Repeat steps 3-16 until $\|q_n - q_f\| \leq D_{tol}$ and $h_n = h_{target}$; $q_n \notin S$; or $n = N_{step}$
18: After every $\delta$ episodes, set the target network parameters $\theta^{-} = \theta$
19: end for

In the multi-step D3QN model, the UAV obtains the state from the state space and selects the action from the action space according to the ε-greedy strategy, i.e.,

$$a = \begin{cases} \text{randomly selected from } A, & p = \varepsilon \\ \arg\max_{a \in A} Q(q, a \mid \theta), & p = 1 - \varepsilon \end{cases}$$

where $\varepsilon \geq 0$ represents the random exploration rate, $\theta$ denotes the multi-step D3QN network parameters, and the value of $\varepsilon$ gradually decreases as the number of iterations increases.
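The ε-greedy rule can be sketched as below. The exponential decay schedule in `decayed_epsilon` and its default values are assumptions chosen for illustration, since the text only states that ε decreases with the number of iterations.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(episode, eps_start=0.5, eps_min=0.01, decay=0.998):
    """Assumed schedule: exponential decay with a lower bound."""
    return max(eps_min, eps_start * decay ** episode)


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0, rng=random.Random(0))
```

With ε = 0 the rule always returns the action with the largest Q value, and with ε = 1 it explores uniformly, so the decay moves the agent from exploration toward exploitation.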
During the initial learning stage, the UAV conducts random exploration with a high probability to gather sufficient environmental information. As the UAV accumulates more experience, it becomes more inclined to select the directions corresponding to the maximum Q values. To enhance algorithm convergence, the parameter $\theta$ is initialized based on the distance between the UAV and the endpoint. After initialization, $Q(q, a \mid \theta) = -\|q' - q_f\|$ is obtained, where $q'$ represents the next state of the UAV after action $a$ is performed in state $q$, and $q_f$ signifies the endpoint coordinate. This encourages selection of the shortest path while knowledge of the radio environment is still limited during the early stages. Additionally, $\theta_{radio}$ serves as the parameter of radio map network $E$ and undergoes random initialization. The 3D path planning framework of the UAV is shown in Figure 5.
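One plausible way to realize such a distance-based initialization is to pre-train the network by regression toward the targets −‖q′ − q_f‖ for each action. The sketch below only computes those targets; the regression step itself, the step length, and the function name are assumptions, as the paper does not describe the initialization procedure in detail here.

```python
import math


def init_q_targets(q, q_f, actions, step_len):
    """Distance-based initial Q targets: minus the distance from the next state to the goal."""
    targets = []
    for direction in actions:
        q_next = [c + step_len * u for c, u in zip(q, direction)]
        targets.append(-math.dist(q_next, q_f))
    return targets


# Example: the goal lies 100 m away along +x; moving toward it scores higher.
targets = init_q_targets((0.0, 0.0, 100.0), (100.0, 0.0, 100.0),
                         [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)], 10.0)
```

A network fitted to these targets initially prefers the action that moves straight toward the endpoint, and later training adjusts the values where coverage is poor.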

Simulation Verification and Analysis
In order to validate the efficacy of the proposed approach, this section conducts simulations on radio mapping and path planning based on the enhanced DRL algorithm. Furthermore, we conduct a comparative analysis between 2D and 3D trajectories in the path planning simulation to substantiate the necessity of 3D path planning for UAVs under connectivity constraints. In the simulations, each GBS has a transmit power of $P_m = 20$ dBm, with the interrupt SIR threshold set at $\gamma_{th} = 0$ dB. Other simulation parameters are presented in Table 1.

Table 1 (only one entry is recoverable): height of GBSs, 25 m.

Radio Mapping Based Environmental Learning
For radio mapping, we employ an artificial neural network (ANN) for map learning, trained with the Adam optimizer to minimize the mean square error (MSE) loss. The radio map network comprises five hidden layers with 512, 256, 128, 64, and 32 neurons, respectively. The input consists of the UAV's 3D coordinates $q_n$, while the output represents the predicted probability of interruption $\hat{P}_{out}(q_n)$ at that location. The objective of network learning is to align the 3D radio map with the real environment as closely as possible, thereby providing accurate interruption probabilities for each spatial point during simulated flight and enhancing the accuracy of the multi-step D3QN algorithm.
The actual global coverage of the 3D region under consideration is depicted in Figure 6a, which is obtained through numerical computer simulations based on the aforementioned model of the 3D environment and channel. This map serves only as a reference and is not directly available to the algorithm. As shown in Figure 6a, due to the combined influence of GBS antenna tilt and building occlusion, the coverage map exhibits irregularities in high-altitude areas, while low-altitude areas demonstrate a more regular pattern. Figure 6b illustrates the 3D coverage probability map acquired from radio mapping. Comparing Figure 6a with Figure 6b, the learned CPG aligns closely with the actual one, which supports the effective application of SNARM in path learning. The effectiveness of the algorithm was further verified through simulations that evaluated the MSE and mean absolute error (MAE) of the learned radio map against the episode count, as depicted in Figure 7. MSE and MAE are obtained by comparing the outage probabilities predicted by the radio map network against the actual outage probabilities at randomly selected locations. Episodes 0 to 500 correspond to an initial learning stage in which large MSE and MAE values indicate the poor quality of the initially learned radio map. However, as the episode count increases, the accumulation of signal measurement data leads to a gradual decline in both MSE and MAE, indicating that the learned radio map increasingly approximates the real map.
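The two error metrics used to assess the learned radio map can be computed as in the short sketch below; the function name and the sample values are hypothetical.

```python
def mse_mae(predicted, actual):
    """Mean square error and mean absolute error between two equal-length sequences."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return mse, mae


# Example: predicted versus actual outage probabilities at two test locations.
mse, mae = mse_mae([0.5, 0.0], [0.0, 0.0])
```

MSE penalizes large local errors more heavily than MAE, so tracking both indicates whether the remaining error is spread evenly or concentrated in a few regions of the map.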

UAV Path Planning
The UAV operates within the 3D environment established in Section 2.1. The UAV's starting point is assumed to be randomly generated, while its endpoint is located at coordinates [1400, 1600, 100] and is marked by the large blue triangle in the simulation figures. The UAV is considered to have reached the endpoint when it reaches the specified altitude and its distance d to the endpoint satisfies d ≤ D_tol.
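The arrival condition can be sketched as a simple check. Here d is taken to be the horizontal distance to the endpoint, and the values of D_tol and the altitude tolerance are hypothetical; the text does not state them in this section.

```python
import math

Q_END = (1400.0, 1600.0, 100.0)  # endpoint coordinates given in the text


def has_arrived(q, q_end=Q_END, d_tol=30.0, alt_tol=5.0):
    """Return True if the UAV at position q counts as having reached q_end.

    d_tol and alt_tol are hypothetical values; d is interpreted as the
    horizontal distance to the endpoint (an assumption).
    """
    at_altitude = abs(q[2] - q_end[2]) <= alt_tol
    d = math.dist(q[:2], q_end[:2])
    return at_altitude and d <= d_tol


near_and_level = has_arrived((1390.0, 1590.0, 100.0))  # about 14 m away
near_but_low = has_arrived((1390.0, 1590.0, 80.0))     # wrong altitude
far_away = has_arrived((1000.0, 1600.0, 100.0))        # 400 m away
```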
The multi-step D3QN network consists of five hidden layers with 512, 256, 128, 128, and 11 neurons, respectively. In the last hidden layer, one neuron represents the estimated state value, while the remaining ten neurons represent action advantages. These action advantages capture the difference between each action's value in a given state and the corresponding state value. By aggregating these differences in the output layer, we obtain ten estimates of action values. The objective of multi-step D3QN network learning is to estimate the Q value of each action accurately, enabling the UAV to determine an optimal flight strategy that minimizes the weighted sum of cumulative flight time and interruption time.
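The aggregation step can be sketched as below. The mean-subtracted form is the standard dueling-network aggregation and is assumed here; the text does not state which variant the authors use, and the numeric values are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine one state value and per-action advantages into Q values.

    Uses Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), the common
    mean-subtracted dueling aggregation (an assumption here).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]


def greedy_action(q_values):
    """Return the index of the action with the largest Q value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


# Hypothetical outputs of the 11-neuron layer: 1 state value + 10 advantages.
state_value = 2.0
advantages = [0.0, 1.0, -1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
q_values = dueling_q_values(state_value, advantages)
best = greedy_action(q_values)
```

Subtracting the mean advantage makes the split between state value and advantages identifiable: the ten Q values always average to the state value.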
The authors of [27] thoroughly investigate the 2D trajectory planning of UAVs subject to connectivity constraints. Nonetheless, overlooking the 3D attributes of the environment and neglecting the vertical movement of UAVs may lead to missed opportunities for optimal connectivity points and improved communication pathways. In this section, we simulate both 3D and 2D trajectories for UAVs, as outlined below.
Comparing the 2D and 3D motion trajectories of the UAVs in Figure 8, it is evident that the UAV prioritizes descending during 3D motion to seek better communication conditions. When the UAV is at the lowest altitude of 80 m, its trajectory converges less tightly than the 2D trajectory because the spatial connectivity constraints are weaker there. Nevertheless, comparing the weighted time of 2D and 3D trajectories from the same starting point in Figure 9 shows that the UAV's flight with 3D motion has a lower weighted time. At starting point 6, its weighted time is only half that of the 2D trajectory. It can therefore be inferred that, under connectivity constraints, 3D motion is superior to 2D motion for UAVs.
To further emphasize the merits of the proposed SNARM-PER technique in path planning, a comparative analysis is conducted with other approaches such as SNARM-RER [27] and D3QN-PER [34].
Figure 10 illustrates the final 20 episodes of UAV flight paths under different simulation conditions. The SNARM-PER algorithm integrates a multi-step D3QN algorithm with a radio map network and a PER mechanism. The SNARM-RER and D3QN-PER algorithms serve as comparative algorithms: the former combines a multi-step D3QN algorithm with a radio map network and an RER mechanism, and the latter uses a PER mechanism without a radio map network. Figure 10a,b depict flight trajectories obtained with the proposed algorithm under different weight values. When the weight value is sufficiently large, UAVs tend to prioritize avoiding areas with weak communication coverage by descending to seek better communication conditions. Conversely, when the weight value is small, UAVs tend to follow more direct paths towards the destination and give less consideration to communication connectivity. This demonstrates the influence of the weight coefficient µ in the objective function (20) on UAV flight paths. Specifically, a higher weight coefficient directs the UAV's focus more towards maintaining connectivity with the base station, consequently diminishing its emphasis on seeking the shortest route. Following the principle of controlling variables, we compare Figure 10a with Figure 10c,d. Under the same weighting coefficients, UAVs exhibit significant differences in their trajectories. It is evident that UAVs using the proposed SNARM-PER algorithm exhibit more convergent flight paths, which allows them to avoid areas of weak communication coverage precisely and to complete flight missions over shorter distances. In contrast, UAVs using the comparative algorithms show only partial convergence in their flight paths, along with oscillations in trajectory altitude, because their Q-networks are insufficiently trained.
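The effect of the weight coefficient µ can be illustrated with a small worked example. The form "flight time plus µ times expected interruption time" is assumed from the description of the objective as a weighted sum, and the two candidate paths and their times are hypothetical.

```python
def weighted_time(flight_time, outage_time, mu):
    """Weighted mission time: flight time plus mu times interruption time."""
    return flight_time + mu * outage_time


def best_path(paths, mu):
    """Return the name of the path with the smallest weighted time."""
    return min(paths, key=lambda name: weighted_time(*paths[name], mu))


# Hypothetical (flight time, expected interruption time) pairs in seconds.
paths = {
    "direct": (100.0, 10.0),  # shorter, but crosses weak-coverage regions
    "detour": (120.0, 2.0),   # longer, but mostly well covered
}

large_mu_choice = best_path(paths, mu=40.0)  # 500 vs 200: detour wins
small_mu_choice = best_path(paths, mu=1.0)   # 110 vs 122: direct wins
```

A large µ makes each second of interruption far more costly than a second of flight, so the detour is preferred; a small µ reverses the ranking.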
Figure 11 illustrates how the average return of the UAV flight path changes with the number of episodes across various algorithms. The average return is calculated as the mean of the returns from the previous 200 episodes, which smooths the data and makes the trend more visible through averaging within a moving window. The average return serves as a pivotal index for assessing the overall efficacy of the UAV's movement. Given the return settings outlined in Section 3.4, a higher average return indicates lower cumulative flight and interruption times. Figure 11 clearly shows that, in the learning phase after 1000 episodes, UAVs using the SNARM-PER algorithm proposed in this paper achieve notably higher average returns and better motion performance than those using the comparative algorithms.
Figure 12 illustrates the total mission time of different algorithms during the last 20 episodes, representing a weighted sum of actual flight time and estimated interruption time. The weighted time of the last 20 episodes serves as an indicator of learning outcomes. As depicted in Figure 12, compared to SNARM-RER and direct-line approaches, the SNARM-PER algorithm excels in minimizing the weighted sum of UAV flight time and interruption time. Consequently, the UAV achieves a better balance between flight energy conservation and the avoidance of areas with weak communication coverage. These findings numerically validate the superiority of the proposed algorithm over the comparative methods. The UAV employing the SNARM-PER algorithm demonstrates enhanced capability in path planning under connectivity constraints while minimizing the weighted sum of flight time and interruption time.

Conclusions
The quality of communication between the UAV and the BS in cellular-network-assisted UAV communication can be enhanced by strategically planning the 3D path of UAVs. Based on

Figure 1. Schematic representation of UAV communication channel for cellular connectivity in an urban setting.

Figure 2. Top view of building and base station distribution.

Figure 5. The framework of the 3D path planning algorithm. The PER mechanism and the radio map network are utilized to assist the DQN network in learning Q-values. The specific algorithm structure of dueling DQN is presented in the dotted box.

Figure 8. Comparison of 2D and 3D trajectories. The red route represents the 2D trajectory, while the blue route represents the 3D trajectory. The starting height of the 12-episode route is set at 100 m and the weight coefficient is set as µ = 40.

Figure 10. UAV flight trajectories. These trajectories are all derived from the final 20 training episodes.

Figure 12. Weighted time of the last 20 episodes.
Algorithm 1 N_1-Step D3QN with PER for Connectivity-Aware UAV Path Planning
Initialize: number of episodes N_epi, maximum number of steps per episode N_step, number of multi-step learning steps N_1, experience buffer D of size C, initial exploration rate ε_0, exploration decay rate α, and experience extraction number B
Initialize: Q network with parameter θ, target network with parameter θ^−, network update rate δ, and radio map network E with parameter θ_radio
1: