Deep Reinforcement Learning-Driven UAV Data Collection Path Planning: A Study on Minimizing AoI

Abstract: As highly efficient and flexible data collection platforms, Unmanned Aerial Vehicles (UAVs) have gained widespread application with the continuous proliferation of the Internet of Things (IoT). Addressing the high timeliness demands of practical communication scenarios, this paper investigates multi-UAV collaborative path planning, focusing on minimizing the weighted average Age of Information (AoI) of IoT devices. To address this challenge, the multi-agent twin delayed deep deterministic policy gradient algorithm with dual experience pools and particle swarm optimization (DP-MATD3) is presented. The objective is to train multiple UAVs to autonomously search for optimal paths that minimize the AoI. Firstly, considering the relatively slow learning speed and susceptibility to local minima of neural network algorithms, an improved particle swarm optimization (PSO) algorithm is used to optimize the parameters of the multi-agent twin delayed deep deterministic policy gradient (MATD3) neural network. Secondly, with the introduction of the dual experience pools mechanism, the efficiency of network training is significantly improved. Experimental results show that DP-MATD3 outperforms MATD3 in weighted average AoI: the weighted average AoI is reduced by 33.3% and 27.5% for UAV flight speeds of v = 5 m/s and v = 10 m/s, respectively.


Introduction
In the current age of digital revolution, the advancement of the Internet of Things (IoT) has exerted a profound influence across various sectors of society [1][2][3]. The widespread deployment of sensor networks has enabled a continuous influx of real-time data into systems [4], encompassing parameters such as temperature, humidity, and light intensity, among others. These data sources are of pivotal significance in supporting real-time management systems, including traffic supervision, industrial control, and so forth [5][6][7]. In this context, as data continue to grow in scale and complexity, the freshness of collected data becomes paramount for ensuring the quality of real-time decision-making. To evaluate data freshness, the Age of Information (AoI) is adopted as an evaluation index [8,9].
Currently, AoI has gained popularity as a metric to assess data timeliness in IoT networks [10][11][12][13][14][15]; it underscores the "freshness" of the data held at target nodes. In reference [10], a balance between high reliability and information freshness was achieved through fine-tuning jointly encoded packets; simultaneous transmission over multiple sub-channels effectively reduced AoI. Chen et al. [11] introduced an algorithm from a charging perspective, addressing the optimization of information delay in wireless-powered edge networks and concurrently reducing both the average and the maximum peak AoI. Pu et al. [12] proposed a novel AoI-bounded scheduling algorithm, ensuring that the peak AoI of each data packet is maintained within a finite range. To optimize average AoI, Zhao et al. [13] studied the impact of interference on information delay in large-scale wireless networks and presented a novel method; the research revealed similarities in channel access probabilities and differences in packet arrival rates, particularly under low node density conditions. Zhou and Saad [14] introduced a threshold-based optimization strategy to minimize average AoI, achieving near-optimal performance in noisy channels through a low-complexity suboptimal strategy. Gu et al. [15] assessed the performance of average peak AoI in IoT systems by introducing AoI metrics and evaluating coverage and underlying schemes. These studies primarily conduct theoretical analyses of AoI performance or adopt conventional approaches that minimize AoI through resource allocation optimization.
When facing challenges such as limited transmission power and unavailable uplink channels in the IoT, one widely recognized solution is the use of Unmanned Aerial Vehicles (UAVs) for auxiliary remote sensing. Their high maneuverability, ease of deployment, and capability to cover hard-to-reach areas make them efficient data collectors, significantly enhancing the efficiency of data acquisition in the IoT [16,17]. Despite these advantages, UAV-assisted IoT networks face problems stemming from restricted energy and from the effect of the UAV trajectory on both data acquisition effectiveness and power consumption. Additionally, the uplink transmission time and power consumption of IoT devices are influenced by the choice of hover points. Hence, this work investigates the UAV path planning problem of minimizing AoI, enabling the IoT to collect data in a timely and effective manner.

Related Works
This section revisits research on UAV path planning related to AoI. In reference [18], to minimize the average AoI, dynamic programming and an ant colony algorithm are employed to decompose the problem into energy transmission and data collection time sequence allocation. Gao et al. [19] employed an end-to-end strategy to successfully enhance data collection efficiency and reduce the AoI. To minimize AoI, Xiong et al. [20] utilized a genetic algorithm (GA) to optimize the data collection process. Lu et al. [21] utilized mobile unmanned vehicles to cooperate with UAVs in data collection, achieving a balance between information freshness and energy replenishment and thereby reducing AoI. Liu and Zheng [22] utilized continuous approximation and a genetic algorithm to optimize the UAVs' speed and path, as well as the AoI and onboard energy for the monitoring area data, thereby reducing task completion time.
However, the aforementioned works often employ heuristic algorithms to address trajectory optimization problems, facing challenges such as insufficient adaptability, susceptibility to local optima, and high computational complexity. These issues typically hinder their ability to adapt to increasingly complex and scalable wireless networks. As artificial intelligence continues to advance, deep reinforcement learning (DRL) can enable UAVs to make intelligent localized decisions and accomplish tasks. Leveraging DRL algorithms and trained models, UAVs can autonomously and rapidly optimize flight trajectories through interaction with the environment. Hu et al. [23] proposed the compound CA2C algorithm to address path design problems arising from UAVs' collaborative perception and transmission, thereby reducing the AoI. Zhou et al. [24] introduced the A-TP algorithm, which employs DRL to optimize UAV paths, thereby enhancing the efficiency of IoT data collection. Abd-Elmagid et al. [25] proposed an improved DRL algorithm to address the AoI of different processes. Peng et al. [26] employed the double deep Q-learning network algorithm to intelligently plan paths for UAVs, aiming to reduce UAVs' energy consumption and thereby enhance information freshness. Yin et al. [27] focused on UAV-assisted vehicular communication, using a deep Q network to optimize transmission power and offloading ratios to minimize AoI. Chen et al. [28] conducted research on resource allocation for AoI perception using online DRL.

Contributions
This paper investigates the path planning problem for UAV-assisted IoT network data collection and introduces a DRL algorithm to achieve intelligent decision-making for UAV flight trajectories, thereby reducing the AoI. The key contributions are as follows.
(1) To minimize AoI, the problem is transformed into a Markov game, and the multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm is designed to optimize the UAV flight trajectories.
(2) To address the relatively slow learning speed and susceptibility to local minima associated with neural network algorithms, we apply an improved Particle Swarm Optimization (PSO) algorithm to optimize the parameters of the MATD3 neural network. Additionally, to further expedite network training, we incorporate an additional experience replay buffer to store superior experience information. This enhancement facilitates quicker convergence of the network towards the optimal policy.
(3) The proposed approach is evaluated through a comparison with the traditional MATD3 algorithm. Simulation results indicate that it effectively reduces AoI, demonstrating the effectiveness of the designed algorithm.
The remainder of this work is structured as follows. The system model and the multi-UAV path planning algorithm based on DP-MATD3 are introduced in Sections 2 and 3, respectively. Simulation analysis is conducted in Section 4. The conclusion is provided in Section 5.

Network Model
In the scenario of UAVs assisting data collection for the IoT, the UAVs perform data acquisition and transmission, as illustrated in Figure 1. Within the service range, there are M UAVs, each with battery energy E, and N randomly distributed IoT devices, each with a task of size l_n, where m ∈ M = {1, 2, . . ., M} and n ∈ N = {1, 2, . . ., N}. A model is then established to represent the connectivity between UAVs and IoT devices, using the decision variable χ_m^n(t) ∈ {0, 1} to indicate the connection status between the mth UAV and the nth IoT device at time t. When χ_m^n(t) = 1, the mth UAV is accessing the nth IoT device; otherwise, there is no communication link between them. It is assumed that in each time slot the communication between IoT devices and UAVs is a one-to-one correspondence. Therefore, the constraint ∑_n χ_m^n(t) ≤ 1 indicates that each UAV can communicate with only one IoT device at any given time. Additionally, to enhance collaboration among multiple UAVs, any nth IoT device can be serviced by at most one UAV at the same time, represented by the constraint ∑_m χ_m^n(t) ≤ 1. Moreover, since multiple UAVs serve the same area, collisions between UAVs must be avoided. The distance constraint is ‖q_i(t) − q_j(t)‖ ≥ o_min for all i ≠ j ∈ M, where q_m(t) represents the two-dimensional position of UAV m ∈ M flying at a constant altitude of H meters, and o_min is the minimum inter-UAV collision-avoidance distance.
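The two access constraints above can be checked mechanically. Below is a minimal sketch (not from the paper) that validates a binary assignment matrix χ against the one-device-per-UAV and one-UAV-per-device constraints:

```python
import numpy as np

def valid_assignment(chi):
    """Check the one-to-one access constraints of the connectivity model.

    chi: (M, N) 0/1 matrix; chi[m, n] = 1 means UAV m accesses IoT device n.
    Each UAV serves at most one device (row sums <= 1) and each device
    is served by at most one UAV (column sums <= 1).
    """
    chi = np.asarray(chi)
    return bool((chi.sum(axis=1) <= 1).all() and (chi.sum(axis=0) <= 1).all())
```

A scheduler would call this once per time slot before committing the slot's access decisions.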

Data Collection Model
In practical scenarios, given the typically low power of IoT devices, their long-range communication capabilities are notably constrained. To address this challenge, UAVs need to reach the communication zone of these devices and stay within range for some time to facilitate efficient data collection. Drawing on experimental findings regarding channel gain from references [29,30], it is observed that at moderate altitudes communication primarily relies on Line-of-Sight (LoS) propagation. Consequently, the channel gain can be approximated as [31] g_m^n(t) = h_0 / (H^2 + ‖q_m(t) − c(n)‖^2), where q_m(t) and c(n) denote the locations of the mth UAV at time t and of the nth IoT device, respectively, and h_0 denotes the channel gain at the reference distance d = 1 m. The wireless transmission rate then follows the Shannon formula R_m^n(t) = B log_2(1 + P_n g_m^n(t) / σ^2), where B is the channel bandwidth, P_n is the transmit power of the nth device, and σ^2 is the noise power. To collect data from the nth IoT device, it is assumed that the UAV must hover above the device for a minimum duration. Therefore, the information transmission delay T_m^n must satisfy the constraint T_m^n ≥ l_n / R_m^n(t).
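The LoS channel model above can be sketched in a few lines. The following is an illustrative implementation; the reference gain h0, bandwidth B, transmit power P_tx, and noise power sigma2 are placeholder values, not the paper's parameters:

```python
import numpy as np

def channel_gain(q_m, c_n, H, h0=1e-4):
    """LoS channel gain g = h0 / d^2, with d the 3-D UAV-to-device distance
    for a UAV at altitude H. h0 is the gain at d = 1 m (illustrative value)."""
    d2 = H ** 2 + np.sum((np.asarray(q_m, dtype=float) - np.asarray(c_n, dtype=float)) ** 2)
    return h0 / d2

def rate(q_m, c_n, H, P_tx=0.1, B=1e6, sigma2=1e-13, h0=1e-4):
    """Shannon-type uplink rate R = B * log2(1 + P * g / sigma^2).
    Bandwidth, transmit power, and noise power are illustrative assumptions."""
    g = channel_gain(q_m, c_n, H, h0)
    return B * np.log2(1.0 + P_tx * g / sigma2)
```

With such a rate function, the minimum hover time to drain a task of size l_n is simply l_n / rate(...).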

Energy-Consuming Model
UAV energy consumption comprises two parts: flight energy consumption and communication energy consumption. The power consumption P_m(v) of the mth UAV when moving horizontally at speed v can be expressed as a function of the airframe and rotor parameters, where M and A_e represent the weight and frontal area of the UAV, respectively, ω_p is the angular velocity, C_D0 denotes the drag coefficient, n_p signifies the number of blades, r_p indicates the rotor disk radius, c_p is the blade chord, and ρ represents the air density. The total power of the mth UAV at a hovering position is P_m(0) + P_c,m, where P_m(0) represents the hover power and P_c,m is the communication power of the mth UAV. Therefore, the energy consumption in time slot k consists of the flight energy over the distance travelled plus, when data are collected, the hover-and-communication energy, where L_k,m represents the flight distance of the mth UAV in time slot k and the binary indicator c_k,m denotes whether the mth UAV collects data in time slot k. The remaining battery power E_k,m of the mth UAV in time slot k is obtained by subtracting this slot's energy consumption from E_{k−1,m}.
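The per-slot energy bookkeeping described above (flight energy plus, when collecting, hover-and-communication energy) can be sketched as follows. The function signatures and the numeric power values used below are illustrative assumptions, not the paper's Table 1 parameters:

```python
def slot_energy(P_v, L_k, v, c_k, P_hover, P_comm, T_hover):
    """Per-slot energy of one UAV: flight energy P(v) * (L_k / v) plus,
    if the UAV collects data in this slot (c_k = 1), the hovering and
    communication energy (P(0) + P_c) * T_hover."""
    e_fly = P_v * (L_k / v)                          # flying over distance L_k
    e_collect = c_k * (P_hover + P_comm) * T_hover   # hover + transmit energy
    return e_fly + e_collect

def remaining_energy(E_prev, *slot_args):
    """Battery recursion E_k = E_{k-1} - e_k described in the text."""
    return E_prev - slot_energy(*slot_args)
```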

AoI
To guarantee timely data collection, AoI is a critical performance index. As shown in Figure 2, starting from the initial value A_0, the AoI increases at a constant rate of 1 until an update is received.
The AoI of the nth IoT device at time t is A_n(t) = t − o(t), where o(t) signifies the timestamp at which the data were generated. It is worth noting that when an IoT device has no data stored, or its data have already been collected, the AoI is 0.
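The slotted evolution of this quantity can be sketched as a one-line update. This is an illustrative reading of the definition, with the AoI dropping to 0 on collection as stated above:

```python
def step_aoi(aoi, collected):
    """One-slot AoI evolution consistent with A_n(t) = t - o(t): the age
    grows by one per slot and drops to 0 when the device's data has just
    been collected (the text sets AoI to 0 once data are collected)."""
    return 0 if collected else aoi + 1
```

For example, a device whose data is collected in the fourth slot traces the sawtooth 1, 2, 3, 0, 1, ... shown in Figure 2.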

Problem Modeling
This paper aims to design a collaborative path-planning method for multiple UAVs to minimize the AoI of the collected data. The process encompasses two distinct stages: the flight stage, where UAVs move, and the hovering collection stage, where UAVs hover to collect information from IoT devices. Therefore, the AoI A_n(t) collected by the UAVs can be divided into two parts: the AoI of the information itself at the beginning of transmission, and the information transmission delay T_m^n, i.e., the time it takes for information to travel from the nth IoT device to the mth UAV. Accordingly, the optimization task is briefly outlined as follows: the UAVs must plan paths collaboratively for IoT data collection, and their locations and movements affect one another, forming a multi-agent cooperation problem. Therefore, this problem can be modeled and solved as a Markov game.
(1) State Space: In time slot k, the position of the mth UAV is denoted as q_k,m, its remaining energy is E_k,m, and its rangefinder readings form a vector of length f, where f is the number of rangefinders equipped on the UAV. The positions of the IoT devices are c = {c(1), . . ., c(N)}, the pending transmission tasks of the IoT devices are w_k = (w_k,1, . . ., w_k,N), and the AoI values are A_k = (A_k,1, . . ., A_k,N). Therefore, the system state is defined as the collection of the per-UAV states s_k,m, m ∈ [1, M], where s_k,m represents the state of the mth UAV.
(2) Action Space: For the collaborative scenario of multiple UAVs, the joint action of the UAVs is represented as A_k = (a_k,1, . . ., a_k,M), where a_k,m, m ∈ [1, M], is the decision made by the mth UAV in time slot k. Using β_k,m ∈ [0, 2π] to denote the flying angle of the mth UAV, the decision is a_k,m = β_k,m. Therefore, in time slot k+1, the position of the mth UAV is q_{k+1,m} = q_k,m + vτ(cos β_k,m, sin β_k,m), with τ the slot duration.
(3) Reward Function: In DRL, to minimize the AoI of the data and ensure real-time accuracy, the reward can be designed as a function of AoI, where R is a positive constant and A_k,n is the AoI value when the mth UAV chooses to communicate with the nth IoT device. As the AoI value decreases, the reward increases, encouraging the UAV to reach that device sooner to reduce the AoI. When the UAV is in flight and not communicating with IoT devices, the reward is set to 0.
To ensure that IoT devices are not accessed repeatedly, a penalty term is designed, where g_c is a positive constant and O is the set of nodes that the UAVs have visited. If the mth UAV revisits a node it has visited in the past, it receives a penalty, as repeated visits to the same node are discouraged.
To prevent collisions between UAVs during path planning, a collision penalty is designed, where b denotes a positive constant, d_i,j is the distance between two UAVs, and r_p is the rotor radius.
To avoid collisions with obstacles, a further penalty is designed, where b_blk is a positive constant and d_m,blk is the distance between the UAV and the obstacle.
Simultaneously, to ensure that the battery power of the UAV remains above zero during operation, an energy penalty is included, where g_e also denotes a positive constant and E_k,m is the battery power of the mth UAV. The cumulative reward r_k is then formulated as the sum of these terms.
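The composite reward described above can be sketched as a single function. All constants (R, g_c, b, b_blk, g_e) and the safety thresholds below are illustrative assumptions, and since the paper's exact functional form of each term is not reproduced here, a simple shaping is used for the AoI term:

```python
def reward(aoi_n=None, revisited=False, d_uav=None, d_obs=None, energy=1.0,
           R=100.0, g_c=10.0, b=10.0, b_blk=10.0, g_e=10.0,
           d_safe=1.0, d_obs_safe=1.0):
    """Hedged sketch of the composite reward: an AoI-decreasing bonus when
    collecting, minus penalties for revisits, inter-UAV proximity,
    obstacle proximity, and battery depletion. All constants illustrative."""
    r = 0.0
    if aoi_n is not None:                 # collecting: reward grows as AoI shrinks
        r += R / (1.0 + aoi_n)
    if revisited:                         # penalize re-visiting a served device
        r -= g_c
    if d_uav is not None and d_uav < d_safe:        # inter-UAV collision penalty
        r -= b
    if d_obs is not None and d_obs < d_obs_safe:    # obstacle collision penalty
        r -= b_blk
    if energy <= 0.0:                     # battery must stay above zero
        r -= g_e
    return r
```

When the UAV is in flight and not communicating, no argument is set and the reward is 0, matching the text.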

DP-MATD3 Algorithm: Concept and Workflow
Traditional value-based and policy-based reinforcement learning algorithms encounter limitations when addressing multi-UAV trajectory planning. Value-based algorithms focus on learning the value functions of states or actions but often overlook the details of policies, making it difficult to adequately consider interactions and cooperation among agents in complex multi-agent collaborative scenarios, which impacts system performance. Policy-based algorithms directly learn policy functions; however, in multi-agent problems the policy space is typically large, leading to high computational and sample complexities.
To overcome these limitations, the MATD3 algorithm, based on the Actor-Critic framework, is adopted. It integrates the learning of policy functions and value functions, considering both the importance of policies and the information from value functions. The multi-agent stochastic game is solved through centralized training and decentralized execution, providing a comprehensive approach to the challenges of multi-UAV path planning.
(1) TD3 Algorithm: The TD3 algorithm is designed for reinforcement learning problems in continuous action spaces, building upon the DDPG algorithm. It introduces several improvements to mitigate issues present in DDPG, including the overestimation of Q-values. Through twin critic networks, delayed updates, and target policy noise, TD3 achieves better learning efficiency and stability than DDPG.
By precisely defining the boundaries of each action, the TD3 algorithm ensures the rationality of actions. In contrast to the DQN algorithm, TD3 is more flexible in handling multidimensional variables and can optimize multiple continuous actions simultaneously. Through techniques such as mini-batch sampling and neural network estimation, TD3 obtains value estimates over multiple actions, allowing it to derive solutions together with the associated actions.
TD3's core structure comprises one Actor and two Critic networks, each consisting of two sub-networks: an online network for real-time decision-making and a target network for stable training. To store the experience accumulated during training, TD3 uses a replay buffer. When the buffer reaches full capacity, the oldest experiences are discarded to make room for the latest ones, and sampling in small random batches helps break the correlation between consecutive experiences.
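The FIFO replay buffer described above has a compact idiomatic form; this sketch uses a deque so that, once the buffer is full, the oldest transitions are evicted automatically:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO replay buffer as described in the text: when full, the oldest
    transition is dropped, and small random batches break correlations."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # deque evicts the oldest item

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```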
Compared to the DDPG algorithm, TD3 demonstrates advantages in the following three aspects: (1) Dual Critic Networks: TD3 introduces two critic networks, effectively reducing the overestimation of Q-values.
(2) Delayed Updates: By implementing a delayed mechanism for updating the target networks, TD3 reduces their update frequency, which slows the learning process, makes it more stable, and prevents inaccurate Q-values from prematurely affecting the policy network.
(3) Policy Noise: TD3 adds noise to the output of the target policy network to enhance the exploratory nature of the policy. This mechanism effectively improves the algorithm's exploration capability, showing superior performance particularly in complex environments.
(2) MATD3 Algorithm: The MATD3 algorithm extends TD3 by combining the theoretical framework of DRL with the concept of multi-agent cooperation. It is suitable for scenarios in which multiple UAVs collaborate to plan paths that minimize performance metrics such as the AoI. During training, the Critic networks can obtain information from all agents; each agent i learns two centralized evaluation functions Q^π_{i,θ1}(s, a_1, . . ., a_N) and Q^π_{i,θ2}(s, a_1, . . ., a_N). To address the overestimation of Q-values, the smaller of the two evaluations is used when computing the target Q-value. In addition, to ensure that the learned policy is exploratory and robust, clipped Gaussian noise is added to the target action, as outlined in Equation (21). The Critic networks are updated at a higher frequency, with the Actor network updated once for every d Critic updates. This makes the Actor updates more targeted and effective, since the Critic networks have already been updated multiple times and thus provide more accurate value function estimates. Additionally, the target networks also adopt a delayed updating strategy, enhancing the stability of the algorithm.
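The clipped double-Q target described above can be sketched as follows. This is a generic (MA)TD3-style computation written framework-agnostically with NumPy; the noise and clipping hyperparameters are common defaults, not the paper's values:

```python
import numpy as np

def td3_target(q1_fn, q2_fn, actor_fn, next_s, r, done, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, act_limit=1.0, rng=None):
    """Clipped double-Q target: add clipped Gaussian noise to the target
    action, evaluate both target critics, and bootstrap from the minimum.
    q1_fn/q2_fn/actor_fn stand in for target networks; hyperparameters
    are illustrative defaults."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = actor_fn(next_s)
    noise = np.clip(rng.normal(0.0, noise_std, size=np.shape(a)),
                    -noise_clip, noise_clip)          # clipped Gaussian noise
    a = np.clip(a + noise, -act_limit, act_limit)     # keep action in bounds
    q = np.minimum(q1_fn(next_s, a), q2_fn(next_s, a))  # min of twin critics
    return r + gamma * (1.0 - done) * q
```

In MATD3 the same computation is done per agent, with the centralized critics receiving the joint state and joint action.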
(3) DP-MATD3 Algorithm: The MATD3 algorithm fuses reinforcement learning with neural networks, exploiting the latter's deep learning advantages and parametric representation ability. In constructing a neural network and determining its parameters, the key lies in choosing the network's structure and weights. However, neural network training is relatively slow and prone to getting stuck in local minima, posing challenges for MATD3.
The Critic network plays a central role in MATD3, being responsible for estimating the value of actions in the environment. This involves a significant amount of weight adjustment, and the slow learning and susceptibility to local minima of neural network training make the optimization of the weights particularly challenging. To overcome these challenges, the PSO algorithm is introduced, which offers fast convergence, simplicity, and global search capability. The performance of the MATD3 network can be enhanced by combining PSO with the Critic network and leveraging its global search capability to train the weights.
In the PSO algorithm, each particle represents a possible solution: its position encodes the values of the Critic network parameters, and its velocity indicates the direction and stride of the particle's movement. A particle updates its velocity and position by comparing against its individual and group best positions, namely p_best and g_best. The update process involves the following steps: (1) Initialize the Particle Swarm: First, G particles are initialized in the D-dimensional search space, where X_g = (x_g1, . . ., x_gD) represents the position of the gth particle, i.e., one set of weights and biases for the Critic network, and V_g = (V_g1, . . ., V_gD) represents its velocity. The particles are randomly distributed in the parameter space.
(2) Compute the Fitness Value: For each particle, the performance of the Critic network under the particle's parameter settings is evaluated on the training set. The Critic network is trained by gradually adjusting its parameters to minimize the absolute TD error, ultimately enhancing the accuracy of the Q-value estimation. Consequently, the magnitude of the TD error serves as the fitness metric, guiding the particles toward optimal parameter configurations. Here, the TD error measures the difference between the Q-value estimated by the agent in a given state under the current policy and the target Q-value calculated from experience.
(3) Determine the Individual and Global Optimal Positions: The individual and global optimal positions refer to the best fitness values, and the corresponding positional parameters, achieved over their history by each individual particle and by the whole particle population, respectively.
(4) Update Velocity and Position: Each particle adjusts its velocity according to its distance from p_best and g_best and then moves to a new position. (5) Check the Termination Condition: If the termination condition is not satisfied, return to step (2); otherwise, stop. In this way the particle swarm adjusts its velocities and positions according to the fitness values, gradually converging towards the global optimal solution.
Each time an update is performed, all particles execute simultaneously, and the state of each particle is continuously updated, avoiding any waiting issues. During this process, particles adapt their individual states at every step until the global optimal solution is found. The implementation steps are illustrated in Figure 3.
In the PSO updating process, the particle swarm searches the entire parameter space, gradually converging to the global optimum. With the absolute value of the TD-error employed as the fitness value, the Critic network can learn the value function more accurately and then guide the Actor network's policy updates, thus improving the performance of the algorithm.
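Steps (1)-(5) above can be sketched as a compact PSO loop. Here the fitness would be, for example, the Critic's absolute TD-error under the particle's weight vector; the inertia and acceleration coefficients below are typical textbook values, not the paper's:

```python
import numpy as np

def pso(fitness, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO sketch: positions are candidate parameter vectors
    (e.g., Critic weights) and fitness is a scalar to minimize
    (e.g., absolute TD-error). Coefficients are illustrative."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))   # (1) init positions
    v = np.zeros_like(x)                             #     and velocities
    f = np.apply_along_axis(fitness, 1, x)           # (2) fitness values
    p_best, p_val = x.copy(), f.copy()               # (3) personal bests
    g_best = x[np.argmin(f)].copy()                  #     and global best
    for _ in range(iters):                           # (5) loop back to (2)
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # (4)
        x = x + v
        f = np.apply_along_axis(fitness, 1, x)
        better = f < p_val
        p_best[better], p_val[better] = x[better], f[better]
        g_best = p_best[np.argmin(p_val)].copy()
    return g_best, p_val.min()
```

On a simple quadratic objective this loop converges quickly to the minimum, which is the behavior exploited to pre-shape the Critic weights.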
In addition, to further enhance the convergence speed of the MATD3 algorithm, a dual experience pools technique is introduced. The target region is relatively vast and the capacity of the experience pool is limited, so under limited exploration the pool may store a large number of suboptimal actions, affecting training efficiency. Therefore, two experience pools, B_1 and B_2, are used: B_1 indiscriminately stores samples, while B_2 stores experiences with single-step rewards r_t > v, and its size is much smaller than that of B_1. During sampling, a subset of samples is randomly selected from B_1, and an additional 10% of the total samples are randomly chosen from B_2. The DP-MATD3 algorithm thus combines the PSO algorithm, which optimizes the parameters of the Critic network, with dual experience pools to overcome the shortcomings of the MATD3 algorithm: the additional experience pool lets the network find the optimal strategy faster and speeds up training. The DP-MATD3 algorithm is illustrated in Figure 4, and the key procedures are summarized in Algorithm 1.
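The dual-pool sampling rule can be sketched as follows; the interpretation that the extra ~10% from B_2 is added on top of the main batch from B_1 is an assumption about the text's wording:

```python
import random

def sample_dual(B1, B2, batch_size, extra_frac=0.10):
    """Dual experience pool sampling: draw the main batch from the
    ordinary pool B1 and top it up with ~10% extra transitions drawn
    from the small high-reward pool B2 (single-step reward r_t > v)."""
    n2 = min(len(B2), max(1, int(extra_frac * batch_size)))
    n1 = min(len(B1), batch_size)
    return random.sample(B1, n1) + random.sample(B2, n2)
```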

Environmental Configuration
This section simulates the designed multi-UAV trajectory planning strategy. The simulation scenario is a square area measuring 1.5 km × 1.5 km. As shown in Figure 5, the orange circles represent IoT devices, while three UAVs (green circles) depart from (0.1 km, 0.1 km), (1.3 km, 0.1 km), and (1.3 km, 1.3 km). The UAV parameters are presented in Table 1. The networks in the DP-MATD3 algorithm are built with two hidden, fully connected layers. The experimental platform uses Python 3.7 and PyTorch 1.5.1. The remaining parameters are listed in Table 2.

Experimental Results and Analysis
The DP-MATD3 algorithm optimizes the MATD3 algorithm's Critic network by introducing PSO, using the absolute value of the TD-error as the PSO fitness value to minimize the TD-error faster. To assess convergence, the trend of the absolute TD-error over training steps is used as the evaluation criterion. Figures 6 and 7 show that, for the same number of training steps, the DP-MATD3 algorithm reaches a stable state faster than the MATD3 algorithm. This is attributed to the PSO algorithm's ability to rapidly explore the solution space, optimizing the Critic network parameters and quickly reducing the absolute TD-error, thus expediting the overall convergence process.
The weighted average AoI of DP-MATD3 and MATD3 at UAV speeds of 5 m/s and 10 m/s is depicted in Figures 9 and 10, respectively. The graphs show that the convergence and stability of the DP-MATD3 algorithm are better at both speeds. For instance, when v = 10 m/s, the weighted average AoI achieved by the DP-MATD3 algorithm is approximately 121.59 s, while that of the MATD3 algorithm is around 155.03 s. The reduction in the weighted average AoI achieved by DP-MATD3 thus improves system performance. This is because the DP-MATD3 algorithm can generate more reasonable trajectories based on the actual information update patterns. For example, since N1 undergoes a second data update at t = 100 s, UAV1 collects its data after collecting N2's data, resulting in a smaller AoI. In contrast, the MATD3 algorithm completes the collection before N1's second update, leading to a larger AoI and falling into a local optimum.

Conclusions
This paper investigated the path planning problem for multi-UAV collaborative data collection and proposed the DP-MATD3 path planning algorithm to minimize the weighted average AoI. The algorithm optimizes the Critic network by introducing the PSO algorithm, overcoming the MATD3 algorithm's proneness to local optima. Additionally, a dual experience pool structure is introduced to enhance training efficiency. Simulation experiments demonstrating the path planning results for multiple UAVs validated the feasibility of the DP-MATD3 algorithm, and a comparative analysis with baseline strategies indicated a significant improvement in performance.

Figure 1. A scene of UAV-assisted data collection.

Figure 5. The map of the environment.

Figure 6. The convergence of the absolute value of TD-error in the MATD3 algorithm.

Figure 7. The convergence of the absolute value of TD-error in the DP-MATD3 algorithm.

Figure 8 shows the trajectory diagram under the DP-MATD3 algorithm designed in this paper. At t = 0, all IoT devices perform a data update. Subsequently, IoT devices

Table 1. Parameters of the aerial vehicle.