Trajectory Planning for UAV-Assisted Data Collection in IoT Network: A Double Deep Q Network Approach

Abstract: Unmanned aerial vehicles (UAVs) are becoming increasingly valuable as a new type of mobile communication device and autonomous decision-making device in many application areas, including the Internet of Things (IoT). UAVs offer greater flexibility than stationary devices. However, a UAV, as a mobile device, still faces challenges in optimizing its trajectory for data collection. Firstly, the movement actions and state space of a UAV's 3D trajectory are highly complex. Secondly, in unknown urban environments, a UAV must avoid obstacles accurately in order to ensure a safe flight. Furthermore, without a priori wireless channel characterization and ground device locations, a UAV must reliably and safely complete the data collection from the ground devices under the threat of unknown interference. All of these challenges call for intelligent and automatic onboard trajectory optimization techniques. This paper transforms the trajectory optimization problem into a Markov decision process (MDP), and deep reinforcement learning (DRL) is applied to the data collection scenario. Specifically, the double deep Q-network (DDQN) algorithm is designed to address intelligent UAV trajectory planning that enables energy-efficient and safe data collection. The DDQN algorithm performs much better than the traditional Q-learning algorithm, and its network training time is shorter than that of the deep Q-network (DQN) algorithm.


Introduction

Background
In recent years, the use of unmanned aerial vehicles (UAVs) as airborne base stations to assist in offloading hotspots in existing ground communication infrastructures and cellular networks has been recognized as a promising candidate technology. UAV-assisted communication, when combined with technologies such as the fifth-generation (5G) networks [1][2][3] and airborne self-organizing networks [4][5][6], has the potential to provide Internet of Things (IoT) services from high altitudes, creating an airborne domain for the IoT. In certain geographic regions where operators may not be able to afford to build a cellular infrastructure (e.g., a base station), UAVs can serve as an alternative that lowers communication costs while performing tasks such as collecting data from or transmitting data to ground-based IoT devices [7,8]. For instance, UAV-assisted cellular communication technology can efficiently restore wireless services after unexpected damage to facilities, such as from natural disasters (e.g., earthquakes, volcanic eruptions, and floods), or supplement coverage in hotspot areas (e.g., sports stadiums and outdoor events) where ground-based cellular stations are insufficient [9,10]. Compared to traditional satellite relays, UAVs fly at lower altitudes when acting as wireless communication providers, resulting in a higher optical resolution than that of traditional satellites. UAVs have lower maintenance costs and are typically tens or even hundreds of times less expensive than satellites [11]. Furthermore, UAVs are more flexible than traditional satellites, which can only fly according to preset orbits. UAVs can be deployed in a specific area according to demand to fulfill wireless communication tasks [12].
However, as IoT networks grow in scale and system design complexity, collecting data from IoT devices while maintaining stable and superior network performance becomes increasingly challenging. In response to this urgent need, UAVs may be an effective solution, due to their high mobility and flexibility. According to the literature [13], UAVs are increasingly being used to collect data from remote sensors. In cases where UAVs assist in information dissemination or data collection, they can collect data sustainably and cost-effectively when working with ground-based wireless sensor networks [14]. However, UAVs encounter several challenges, such as a restricted battery capacity, a limited flight time [15,16] and flight altitude, and the impact of malevolent external interference on the communication link between the UAV and the user [17]. Therefore, it is important to revisit how to ensure that UAVs efficiently perform data collection tasks in complex and realistic environments while avoiding obstacles and interference.

Related Work and Contributions
With the rapid advancement of UAV-assisted communication technology, traditional mathematical planning algorithms have demonstrated notable efficacy [18]. For instance, the authors of [19] leveraged cellular-connected UAVs to ensure stable connections during missions while minimizing the time taken to reach the destination, which was achieved through convex optimization and graph theory techniques. Their results indicated reduced mission completion times and an enhanced signal-to-noise ratio (SNR) throughout the missions. Additionally, the researchers in [20] identified the following three distinct phases in the trajectory control process: trajectory generation, trajectory correction, and trajectory smoothing. They proposed an ant-colony-based algorithm for initial trajectory generation and an effective collision avoidance scheme for the flight trajectory of UAVs. Another study [21] proposed a method to address the non-convex problem of task assignment, power allocation, and UAV trajectory in wireless communication services. By employing the block coordinate descent method, the original problem was decomposed into two sub-problems, which were subsequently solved iteratively using Lagrange bifurcation and successive convex approximation techniques. However, it is important to note that the computation time of these algorithms may grow exponentially with the scenario size, and they may not be fully adaptable to increasingly complex and scalable wireless network environments.
The application of machine learning techniques to UAV communications has recently gained attention. Reinforcement learning, a model-free algorithmic framework, has been proposed as an alternative to traditional algorithms. This approach does not require the modeling of specific environment feature parameters and can learn strategies by trial and error. It has practical significance for UAV trajectory planning and wireless communication system optimization. For instance, in [22], the authors optimized UAV trajectories and power allocation to maximize the fairness of throughput between sensor nodes. In [23], the authors proposed a deep reinforcement learning (DRL)-based framework that uses convolutional neural networks for feature extraction and deep Q-network (DQN) algorithms for decision making to design energy-efficient remote sensing routes for UAVs. The experimental results from [24] demonstrate that the algorithm is significantly more efficient. That study utilizes a network model of the central layer and environmental information and processes the environmental layer through convolution. However, these works only examine the flight trajectory of UAVs at a fixed altitude, ignoring the complexity of the 3D environment and unknown external interference.
In [25], the authors proposed a DRL approach to minimize the task completion time of cellular-connected UAVs while maintaining good cellular network connectivity. In [26], the authors proposed a dual Q-learning approach to solve optimization problems involving UAV trajectories under continuous time constraints. The authors of [27] coordinated multiple UAVs to avoid collisions. This coordination was achieved through a sense-transmit protocol. The main objective was to determine the optimal motion trajectory through a decentralized Q-learning algorithm. This algorithm reduces the convergence time and ensures an efficient transmission of sensor data. However, in practical scenarios, variations in building height can obstruct the flight path of a UAV, necessitating adjustments in altitude. Similarly, heightened interference is encountered when UAVs approach jamming devices, prompting maneuvers to different altitudes in order to mitigate jamming effects and improve the communication environment. It is worth noting that few published articles have considered the impact of obstacles and jammers on channel quality when planning 3D UAV trajectories.
In contrast to the preceding research, our study achieves a higher level of realism by incorporating a more detailed environmental model, in which multiple obstacles and jammers jointly degrade the communication link between the UAV and the ground-based IoT devices. Our objective is to maximize throughput by optimizing the UAV's three-dimensional flight trajectory, at a low UAV power consumption, while effectively accomplishing data acquisition from the devices and ensuring the safety of UAV flight operations. The proposed algorithm's performance is verified through extensive simulations.
The main contributions of this paper are as follows:
• Our work considers a complex and realistic urban environment in order to study the effects of obstacles and jammers on 3D UAV trajectory planning. Particularly, during the simulation phase, we randomly generate the positions of jammers and ground devices for each iteration, which makes the scenario more uncertain and complicates the design of the Markov decision process (MDP).
• In this paper, the environmental information is not predetermined, and the UAV dynamically senses and navigates around the obstacles in real time using onboard sensors such as cameras. It also learns from historical environmental information obtained from a memory bank to speed up its decision making.
• To address the problem of the limited computing power of UAVs, we developed a DDQN-based UAV trajectory optimization algorithm. The algorithm sets the reward value according to the scene and converges faster. We also provide flight results under different scene parameters and comparison experiments of various reinforcement learning algorithms to support our view. The simulation experiments and algorithm comparisons validate the effectiveness and superiority of our approach.
The remainder of this paper is organized as follows: In Section 2, the system model and problem description are outlined. Section 3 details the DDQN-based algorithm for trajectory optimization in data collection scenarios. The experimental and simulation results for trajectory optimization are presented in Section 4. The conclusions and future work are discussed in Sections 5 and 6, respectively.

System Model and Problem Formulation
We consider a smart urban setting in Figure 1, where a UAV operates within an unlicensed spectrum band to collect data from a set $\mathcal{U} = \{1, ..., U\}$ of stationary ground-based IoT devices dispersed across a designated area. In this area, there may be a set $\mathcal{J} = \{1, ..., J\}$ of static ground directional jammers (for example, Wi-Fi that shares an unlicensed spectrum band with the UAV). The maximum duration of a UAV mission is denoted by $T$, during which the UAV's optimal trajectory is designed to maximize data collection from ground IoT devices. For ease of illustration, we assume that $T$ is discretized into $N$ equal time intervals. The UAV position at time step $n$ is denoted by $q_n = [x_n, y_n, z_n]^{\top} \in \mathbb{R}^3$, while the $u$-th device is located at $q_n^u = [x^u, y^u, 0]^{\top} \in \mathbb{R}^3$, and the position of the $j$-th jammer is $q_n^j = [x^j, y^j, 0]^{\top} \in \mathbb{R}^3$. Moreover, various obstacle heights are incorporated to simulate a realistic environment.
The action space of the UAV at time step $n$ is defined as follows:

$$a_n = [a_x, a_y, a_z]^{\top} \in \mathcal{A}, \quad (1)$$

where $a_x, a_y, a_z \in \{-1, 0, 1\}$ and $\mathcal{A}$ defines the set of feasible actions on the UAV's position. Given the executed action, the UAV moves from its current position $q_n$ to its next position $q_{n+1}$.
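The action set and position update described above can be illustrated with a minimal Python sketch. It assumes a unit-grid update of the form q_{n+1} = q_n + a_n and scene bounds taken from the 50 m × 50 m × 10 m simulation scene; the names `ACTIONS`, `SCENE_MAX`, and `step_position` are hypothetical and are not taken from the paper.

```python
from itertools import product

import numpy as np

# All 27 displacement vectors [a_x, a_y, a_z] with components in {-1, 0, 1}.
ACTIONS = np.array(list(product((-1, 0, 1), repeat=3)))

# Assumed scene bounds in metres (x, y, z); an illustrative choice.
SCENE_MAX = np.array([50, 50, 10])


def step_position(q_n, action_index):
    """Return (q_next, inside), assuming the unit-step update q_next = q_n + a_n."""
    q_next = np.asarray(q_n) + ACTIONS[action_index]
    # Report whether the new position is still inside the bounded region.
    inside = bool(np.all(q_next >= 0) and np.all(q_next <= SCENE_MAX))
    return q_next, inside
```

Returning an `inside` flag instead of clipping the position keeps the sketch compatible with a reward design that penalizes leaving the region.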

Channel Model
This paper employs a simplified path loss model in decibels (dB). The channel gain between the UAV and device $u$ at time step $n$ is modeled following [28] in terms of the distance $d_n^u = \|q_n - q_n^u\|_2$ between the UAV and device $u$, a path loss constant $\alpha$, the average channel gain $\beta$ at the reference distance $d_0 = 1$ m, and a shadowing component $\eta$ that follows a Gaussian distribution $\mathcal{N}(0, \sigma^2)$.
The communication link between the UAV and the ground devices is affected by the altitude of the UAV, the characteristics of the urban environment, and the interference from other wireless devices. We assume that the communication between the UAV and the ground devices follows a time-division multiple access mode. The rule is that, at each communication time step, the UAV can collect data from only one ground device, and only the device with remaining data and the highest signal-to-interference-plus-noise ratio (SINR) can establish a communication link with the UAV at the current time step $n$. The SINR of the signal received by the UAV from ground device $u$ at time step $n$ is the ratio of the received signal power to the sum of the noise and interference powers, where $P^u$ is the transmission power of ground device $u$, $\sigma^2$ is the white Gaussian noise power at the receiver, and $I_n$ is the received interference power accumulated from the jammers, with $P^j$ representing the transmission power of jammer $j$.
Furthermore, the channel throughput can be calculated by Shannon's formula, where $B$ is the bandwidth of the channel in Hz.
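To make the chain from distance to throughput concrete, the following Python sketch strings the three quantities together. The specific formulas are assumptions chosen to match the definitions in the text, not the paper's exact equations: a log-distance gain in dB, a dB-to-linear conversion before forming the SINR, and Shannon capacity for the throughput. All function and parameter names are hypothetical.

```python
import numpy as np


def channel_gain_db(q_uav, q_dev, alpha, beta_db, sigma_shadow=0.0, rng=None):
    """Assumed log-distance model: beta - 10*alpha*log10(d / d0) + eta, with d0 = 1 m."""
    d = np.linalg.norm(np.asarray(q_uav, float) - np.asarray(q_dev, float))
    # Shadowing is only sampled when a random generator is supplied.
    eta = 0.0 if rng is None else rng.normal(0.0, sigma_shadow)
    # Distances below the 1 m reference are clamped to the reference distance.
    return beta_db - 10.0 * alpha * np.log10(max(d, 1.0)) + eta


def sinr(p_dev, gain_dev_db, p_jam, gain_jam_db, noise_power):
    """Linear-scale SINR: signal power over noise plus summed jammer interference."""
    signal = p_dev * 10.0 ** (gain_dev_db / 10.0)
    interference = np.sum(np.asarray(p_jam) * 10.0 ** (np.asarray(gain_jam_db) / 10.0))
    return signal / (noise_power + interference)


def throughput(bandwidth_hz, sinr_linear):
    """Shannon capacity in bits per second for a bandwidth given in Hz."""
    return bandwidth_hz * np.log2(1.0 + sinr_linear)
```

Under the time-division rule, an environment could evaluate `sinr` for every device that still holds data and serve the one with the largest value.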

Throughput Maximization Problem Formulation
Here, we denote the link status by $l_n^u \in \{0, 1\}$, where $l_n^u = 1$ indicates the collection of data by the UAV from the $u$-th device at time step $n$, and $l_n^u = 0$ otherwise. The channel access constraint (5) permits the UAV to collect data from only one device in each time step. Our algorithm aims to optimize the trajectory of the UAV in order to maximize the amount of data collected from the ground devices during the mission time $T$. This data collection problem can be formulated as optimization problem (6), which maximizes the collected data over the actions subject to constraints (6a)-(6c), where $a_n$ is the action of the UAV at step $n$. Equation (6a) ensures that the UAV avoids collisions with the obstacles $\mathcal{B}$. Equation (6b) limits the operation time of the UAV, forcing it to end its mission before its battery $b_n$ has run out. Equation (6c) indicates that the communication is interrupted when the SINR at the UAV is lower than the SINR threshold. This optimization problem is challenging due to its non-convexity and the unknown environment at the decision-making moment. Consequently, conventional model-based approaches are inapplicable.
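One plausible way to write the channel access constraint and the optimization problem, consistent with the verbal description above, is sketched below. This is a reading under stated assumptions rather than the authors' exact equations: the throughput symbol $R_n^u$ and the threshold symbol $\mathrm{SINR}_{\mathrm{th}}$ are names introduced here for illustration, and the precise form of each constraint may differ in the original.

```latex
\begin{align}
& \sum_{u \in \mathcal{U}} l_n^u \le 1, \quad \forall n, \tag{5}\\
\max_{\{a_n\}} \; & \sum_{n=0}^{N-1} \sum_{u \in \mathcal{U}} l_n^u R_n^u \tag{6}\\
\text{s.t.} \; & q_n \notin \mathcal{B}, \quad \forall n, \tag{6a}\\
& b_n \ge 0, \quad \forall n, \tag{6b}\\
& l_n^u = 0 \ \text{ if } \ \mathrm{SINR}_n^u < \mathrm{SINR}_{\mathrm{th}}, \quad \forall n, u. \tag{6c}
\end{align}
```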

Markov Decision Process
In reinforcement learning problems, the MDP is regarded as an idealized form that provides a theoretical framework for achieving goals through interactive learning. In the UAV-assisted communication model, we can use the MDP to model the interaction between entities. The complete MDP can be represented by a four-tuple ⟨S, A, R, P⟩ [29]. The MDP for trajectory optimization in the data collection scenario is shown in Figure 2, which details the interaction process between the UAV and the environment. Additionally, it provides a detailed description of the process of generating the state space for an arbitrary time slot, where process ⑤ is interchangeable with process ⑥.

A: The action space is defined in (1).

S: The state of the UAV at time step $n$ is denoted by $s_n = (s_{n,1}, s_{n,2}, s_{n,3})$. To be specific, $s_{n,1} = \{q_n, b_n, L_n\}$ includes the characteristics of the UAV at time step $n$, namely the UAV's current position $q_n$, remaining power $b_n$, and the amount of data that has been collected $L_n$. $s_{n,2} = \{q_n^u, l_n^u, \mathrm{SNR}_n^u, d_n^u, D_n^u\}$ characterizes the UAV with respect to each ground device, in which $d_n^u$ represents the distance of the UAV from the device and $D_n^u$ represents the amount of data remaining for each device. $s_{n,3} = \{o_n, o_{n+1}\}$ represents the observation space $o_n$ of the UAV at time step $n$ and the predicted observation space $o_{n+1}$ at the next time step $n + 1$ in the case of a wide observable range of the UAV camera.

P: The transition probability matrix P represents the state transition process, i.e., the probability of moving to the next state $s_{n+1}$ after taking action $a_n$ when the agent is in state $s_n$.

R: The reward space represents the information about the gains made by the UAV in the process of choosing an action and reaching the next state, which can be denoted as $r_n = \{r_{n,1}, r_{n,2}, r_{n,3}\}$. In the expression for $r_n$, $\delta_n$ is defined as the total amount of data collected by the UAV at time step $n$, and $r_{n,2}$ is the power penalty consumed by the UAV's movement; in addition, a penalty of $r_{n,3}$ is imposed if there is an obstacle in the current observation space $o_n$, or if the UAV's current position $q_n$ is outside of the given region.
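As a rough illustration of how the three reward components could be combined, consider the Python sketch below. It assumes that the components are simply summed and that the data term is proportional to the data collected in the step; the weights and the function name `reward` are hypothetical placeholders, not values or names from the paper.

```python
def reward(data_collected, moved, obstacle_in_view, inside_region,
           w_data=1.0, move_penalty=0.1, safety_penalty=1.0):
    """Combine the three reward terms; all weights are illustrative assumptions."""
    r1 = w_data * data_collected              # r_{n,1}: data collected at this step
    r2 = -move_penalty if moved else 0.0      # r_{n,2}: power penalty for movement
    unsafe = obstacle_in_view or not inside_region
    r3 = -safety_penalty if unsafe else 0.0   # r_{n,3}: obstacle or out-of-region penalty
    return r1 + r2 + r3
```

Keeping the penalties as separate named terms makes it straightforward to tune the trade-off between throughput, energy use, and flight safety.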

DDQN-Based UAV Trajectory Optimization Algorithm
The DQN algorithm is built on top of the standard Q-learning framework [29] and utilizes deep neural networks to train action-value functions, with the target network parameters updated periodically. However, Q-learning and DQN may overestimate the Q-values, due to the use of the max operation. To avoid the bias caused by this situation, we utilize the double deep Q-network (DDQN) algorithm [30]. As in the DQN algorithm, the online network parameters are used to generate the agent's trajectory strategy, i.e., to select actions. The difference is that the DDQN algorithm uses a different set of network parameters to assess how good the selected actions are, i.e., the selection of actions and the evaluation of actions are realized using two different sets of network parameters. Thus, the objective function can be expressed as follows:

$$y_n = r_n + \gamma \, Q\left(s_{n+1}, \arg\max_{a} Q(s_{n+1}, a; \theta); \theta'\right),$$

where $\gamma$ is the discount factor, $\theta$ denotes the online network parameters, and $\theta'$ denotes the target network parameters. We use the ε-greedy policy to choose the actions $a_n$, i.e., randomly selecting actions with a probability of ε, and selecting actions based on the Q-values with a probability of 1 − ε, to ensure that the UAV is somewhat exploratory. Figure 3 shows the architecture of the DDQN. Compared with the DQN, the DDQN avoids overestimation to some extent and improves the stability and speed of training. The implementation of the DDQN algorithm in the data collection scenario is presented in Algorithm 1.
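The difference between the DQN and DDQN targets can be seen in a few lines of Python. The sketch below operates on plain arrays of next-state Q-values rather than on neural networks, and the `done` flag for terminal states is an assumption added for completeness; the function names are hypothetical.

```python
import numpy as np


def dqn_target(reward, q_next_target, gamma, done):
    """DQN: the target network both selects and evaluates the next action."""
    return reward + (0.0 if done else gamma * np.max(q_next_target))


def ddqn_target(reward, q_next_online, q_next_target, gamma, done):
    """DDQN: the online network selects the action, the target network evaluates it."""
    if done:
        return reward
    best_action = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_action]
```

For example, with online Q-values `[1.0, 3.0, 2.0]`, target Q-values `[5.0, 0.5, 4.0]`, a reward of 1 and a discount of 0.9, the DQN target uses the largest target value (5.0), whereas the DDQN target uses the target value of the action preferred by the online network (0.5), which is how the decoupling tempers overestimation.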

Algorithm 1: DDQN-Based UAV Trajectory Optimization Algorithm
Initialize replay memory $D$, the online network parameters $\theta$, the target network parameters $\theta' = \theta$, and the target network update period $N_{freq}$.
1: for episode = 0, 1, ..., M − 1 do
2:   Randomly generate the locations of the IoT devices $q_0^u$, the location of the UAV $q_0$, the locations of the jammers $q_0^j$, the transmission power of the UAV $P$, and the interference power $P^j$.
3:   Set time step n = 0; initialize the environment and the state $s_n$ of the UAV
4:   while $b_n \geq 0$ do
5:     choose action $a_n$ with the ε-greedy policy
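Two building blocks named in Algorithm 1, the ε-greedy action choice and the replay memory, can be sketched in Python as follows. This is a minimal illustration under the assumption that transitions are stored as tuples and sampled uniformly; the names `epsilon_greedy` and `ReplayMemory` are hypothetical, and the remaining steps of the algorithm are not reproduced here.

```python
import random
from collections import deque

import numpy as np


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))


class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        # The oldest transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

A training loop would typically push one transition per time step and begin sampling mini-batches once the buffer holds at least one batch of transitions.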

Simulation Results and Discussion
In this section, simulation experiments are conducted, and the performance of the proposed algorithms in different scenarios is explored. To evaluate the efficacy of the proposed method in UAV data collection, the following algorithms are compared: (1) Q-learning; (2) proximal policy optimization (PPO); (3) DQN; (4) dueling DQN; and (5) DDQN. The simulation experiments are conducted with different numbers of devices and obstacles. The convergence performance of the algorithms is compared. In this study, a computer equipped with a 3.60GHz NVIDIA GPU RTX 2080 was used as the experimental platform, and the Adam optimizer was used to update the neural network. The specific simulation experiments are shown in Section 4.1.

Parameter Initialization
To reduce the training time, the simulation scene in this paper is set at 50 m × 50 m × 10 m, due to the complexity of the algorithm's action and state space [31][32][33][34]. The algorithm presented in this paper applies to scenes of arbitrary size, and the simulation parameters are shown in Table 1. Throughout the testing phase, the two-dimensional coordinates of the UAV are positioned at the center of the scene, while its altitude is randomly generated within the range of 3 m to 10 m. The positions of the jammers and devices are randomly distributed within the designated area. Moreover, the data transmission rate of each device is randomly assigned between 15,000 bps and 20,000 bps. When comparing the performance of the different algorithms under identical scenarios, the total fixed data volume remains consistent. Additionally, the parameters for the DDQN algorithm are set as shown in Table 2. The hyperparameters are set according to the simulation experience. We compare the performance of the DDQN algorithm with other traditional algorithms across various scenarios. During the training phase, each algorithm saves the model that achieved the highest cumulative reward. Subsequently, during the testing phase, 80,000 episodes are run using the best-saved models, and the average reward for each algorithm is computed.

Complexity analysis:
The training process has a computational complexity of $O(|\mathcal{A}| \times |\mathcal{S}| \times N^{L})$, where $|\cdot|$ refers to the cardinality of a set, $N$ is the maximum number of neurons in the hidden layers in $\theta$, and $L$ is the number of layers of $\theta$.

Result Analysis with Different Numbers of IoT Devices
In this section, the results of the trajectory optimization problem for the UAV data collection task using the DDQN algorithm are analyzed for various numbers of devices. The UAV explores a bounded three-dimensional space, while ground IoT devices and jammers are randomly distributed across a two-dimensional plane. The objective is to optimize the UAV's movement trajectory in order to obtain the highest cumulative reward.

1. The result analysis for the scenario with three devices and five obstacles (3D+5O, in short) is as follows: An example of a UAV trajectory involving three devices and three jammers is shown in Figure 4. The UAV follows a strategy of bypassing the jammers and obstacles to mitigate interference during the data collection mission and to reduce the likelihood of collisions. Furthermore, the UAV tends to approach devices closely, thereby reducing the distance between them in order to increase the amount of data collected. The simulation results indicate that the overall trajectory of the UAV aligns with the optimization goal. Figure 5 shows the convergence performance of the different algorithms in this scenario. The experimental results show that the DDQN algorithm outperforms the other algorithms in terms of the final cumulative average rewards.

2. The result analysis for the scenario with five devices and five obstacles (5D+5O, in short) is as follows: The scenario with five devices and three jammers is depicted in Figure 6. From observing the UAV's trajectory, it is evident that the UAV successfully executes its mission while navigating around the jammers and obstacles. Figure 7 shows the convergence curves of the UAV. The mean reward of each agent trends upward until convergence is reached. Notably, the reward curve of the DDQN algorithm surpasses that of the traditional algorithm, revealing the good performance of the DDQN algorithm.

Result Analysis with Different Numbers of Obstacles
In this section, we increase the number of obstacles from five in 3D+5O to eight to assess the generalizability of the DDQN algorithm (3D+8O). As can be seen from Figure 8, as the density of the obstacles increases and the environment becomes more intricate, the UAV strategically selects an altitude characterized by lower obstacle density to navigate and complete the data collection task. Figure 9 provides a comparative analysis of the average reward attained by each algorithm, reaffirming the superior performance of DDQN in comparison to the other algorithms.

Figure 10 presents a comparison of the average rewards across the different scenarios under the DDQN algorithm. We obtain the following conclusions: (1) with throughput designated as the reward metric, scenario 5D+5O, characterized by an increased number of ground devices, achieves the maximum reward value compared to 3D+5O; and (2) likewise, in scenario 3D+8O, the higher obstacle density relative to 3D+5O imposes more penalties on the UAVs, resulting in a lower average reward value.
We also compared the convergence rates of the algorithms. The 90% confidence interval of the average reward value of each algorithm is used as the convergence interval to calculate the convergence rate of each algorithm in each scenario, and then the convergence rates of the algorithms in the three scenarios are used to obtain the results shown in Figure 11. The experimental results indicate that, while ensuring the optimal value of the average reward, both the DDQN and the dueling DQN exhibit a better convergence rate compared to the PPO and DQN algorithms. However, the Q-learning algorithm has the fastest convergence rate but the lowest average reward value, resulting in a poor convergence performance.
Electronics 2024, 13, x FOR PEER REVIEW
To further highlight the advantages of the DDQN algorithm, we took 3D+5O as an example. After training the network, we fixed the initial position of the UAV, the ground equipment, and the jammer position, and ran test flights under the different algorithms to observe the data collection process of the UAV. The experimental results are shown in Figure 12. Although the Q-learning algorithm eventually converged in the previous experiments, it is still inferior to the dueling DQN and DDQN in terms of data collection speed. The test results show that the DDQN algorithm learns the unknown environment better and collects the data in the fewest steps.

Table 3 compares the average training time of each algorithm in different cases. The DDQN algorithm used in this paper has a shorter training time while ensuring convergence performance. The dueling DQN algorithm is second-best in terms of convergence performance, surpassed only by the DDQN algorithm, but it is slower than the DDQN algorithm in terms of training time. Furthermore, the proposed algorithm can meet real-time requirements, as its execution time on the system is significantly shorter than its training time.

The simulation results indicate that Q-learning has the worst performance, with an average reward that is only 10% of that of DDQN. This is because Q-learning requires storing the value of each state-action pair, which can be very difficult, or even infeasible, in a high-dimensional state space. The PPO algorithm outperforms only Q-learning: it does not use experience replay when updating the policy network, which results in low utilization of the collected samples. Additionally, its performance is sensitive to hyperparameters. On average, DQN achieves only 50% of the average reward of DDQN because overestimation of the Q-values leads to training instability and performance degradation. It is important to note that this is an objective evaluation and not a subjective one. Despite potential drawbacks such as implementation complexity and longer training times, dueling DQN may still be a better choice in certain problems and scenarios.
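The overestimation argument can be illustrated with a minimal sketch, assuming the standard DQN and double DQN target definitions. The function names, the Q-value lists, the reward, and the discount factor below are hypothetical illustration values and are not taken from the implementation used in this paper.

```python
from typing import Sequence


def dqn_target(reward: float, gamma: float,
               q_target_next: Sequence[float], done: bool) -> float:
    """Standard DQN target: the target network both selects and
    evaluates the next action (max over its own estimates)."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def ddqn_target(reward: float, gamma: float,
                q_online_next: Sequence[float],
                q_target_next: Sequence[float], done: bool) -> float:
    """Double DQN target: the online network selects the next action,
    and the target network evaluates that action."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)),
                      key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q-value estimates for the next state (assumed numbers).
q_online_next = [1.0, 3.0, 2.0]   # online network prefers action 1
q_target_next = [1.0, 2.0, 4.0]   # target network overrates action 2

reward, gamma = 1.0, 0.5          # assumed reward and discount factor

y_dqn = dqn_target(reward, gamma, q_target_next, done=False)
y_ddqn = ddqn_target(reward, gamma, q_online_next, q_target_next, done=False)

print(f"DQN target:  {y_dqn}")    # 1.0 + 0.5 * 4.0 = 3.0
print(f"DDQN target: {y_ddqn}")   # 1.0 + 0.5 * 2.0 = 2.0
```

Because the double DQN target evaluates an action chosen by a different network, it can never exceed the DQN target computed from the same target-network estimates, which is one way to see why it is less prone to overestimation.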

Conclusions
We studied the problem of 3D trajectory planning for UAVs in data collection network scenarios with jammers and flying obstacles. To address it, we proposed a DDQN-based UAV trajectory optimization algorithm that utilizes appropriate reward values. This algorithm enabled the UAVs to efficiently perform data collection tasks in complex and changing environments without prior knowledge of the channel information. We conducted simulations to analyze the impact of different numbers of IoT devices and

Figure 1. Scenario of UAV data collection.

Figure 2. The MDP for trajectory optimization in the data collection scenario.

Electronics 2024, 13
Figure 5 shows the convergence performance of the different algorithms in this scenario. The experimental results show that the DDQN algorithm outperforms the other algorithms in terms of the final cumulative average rewards.

Figure 4. Example of UAV data collection under the 3G+5O scenario. (a) Three-dimensional (3D) UAV trajectory; (b) top view of the scenario. (A pentagram represents the ground devices, while a triangular shape symbolizes the trajectory of the UAV.)

Figure 7. The convergence performance of the different algorithms in 5D+5O.

Figure 8. Example of UAV data collection under the 3G+8O scenario. (a) 3D UAV trajectory; (b) top view of the scenario.

Figure 9. The convergence performance of the different algorithms in 3D+8O.

Figure 10. Comparison of average rewards for the scenarios proposed under the DDQN algorithm.

Figure 11. Comparison of convergence probability of the different algorithms.

Figure 12. Comparison of probability of data collection of the different algorithms.

Table 3. Average training time (seconds per episode) of the UAV.
