The UAV Trajectory Optimization for Data Collection from Time-Constrained IoT Devices: A Hierarchical Deep Q-Network Approach

: Recently, using unmanned aerial vehicles (UAVs) to collect information from distributed sensors has become one of the hotspots in the Internet of Things (IoT) research. However, previous studies on the UAV-assisted data acquisition systems focused mainly on shortening the acquisition time, reducing the energy consumption, and increasing the amount of collected data, but it lacked the optimization of data freshness. Moreover, we hope that UAVs can perform long-term data collection tasks in dynamic scenarios within a constantly changing age of information (AoI) and within their own power levels. Therefore, we aim to maximize the quality of service (QoS) based on the freshness of data, while considering the endurance of the UAVs. Since our scenario is not an inertial order decision process with uniform time slots, we ﬁrst transform the optimization problem into a semi-Markov decision process (SMDP) through modeling, and then we propose a hierarchical deep Q-network (DQN)-based path-planning algorithm to learn the optimal strategy. The simulation results show that the algorithm is better than the benchmark algorithm, and the tradeoff between the system QoS and the safe power state can be achieved by adjusting the parameter β e .


Introduction
Unmanned aerial vehicles (UAVs) have received extensive attention in wireless communication networks due to their small size, flexibility, mobility, and controllability [1][2][3][4]. One of the most important applications is using UAVs as flying base stations to collect data from ground nodes, especially in the area of the Internet of Things (IoT). Using UAVs to assist data collectio, can not only save energy by reducing communication distances and hops, but it also enables data collection tasks to be performed in areas lacking communication coverage. In [5][6][7], UAVs are used to transmit data for the ground nodes, and the trajectory planning goal is set to balance the transmission delay of the ground nodes, as well as reducing the energy consumption of the UAVs. However, UAVs consume additional propulsive energy during movement. Thus, a tradeoff is required between the transmission energy consumption of the ground nodes and the propulsion energy consumption of the UAVs [8]. Because of the limited capacity and coverage of cellular networks, it is challenging to accommodate massive IoT devices. UAVs can be used to provide additional radio access services [9]. Moreover, UAVs can fly close to each ground node and can establish air-to-ground channels dominated by their line of sight (LoS), which can greatly reduce the transmission energy consumption, and can significantly improve the throughput of the ground nodes. These advantages have made UAV-assisted IoT networks attract extensive attention in recent years. A study by Abdulla et al. [10] demonstrated how to maximize the acquisition efficiency of the unmanned aircraft system (UAS), modeling the problem as a potential game between UAS and sensor nodes, and proposing a data acquisition method that maximizes energy efficiency under fairness constraints. Zhan et al. [11] considered jointly optimizing the wake-up time of the sensor nodes and the trajectory of UAVs. While ensuring the required amount of data is collected, the energy consumption is minimized, and the suboptimal solution is obtained by using the iterative convex optimization technique. Wang et al. [12] considered minimizing the completion time of UAV acquisition tasks, which builds on their previous research on joint optimization. Moataz et al. [13] considered the problem of UAV-assisted data acquisition in delay-sensitive application scenarios, and they maximized the available quantity of devices serving the IoT by optimizing the trajectory and resource allocation of UAVs.
Most of the existing works of UAV-assisted data acquisition systems focus mainly on shortening the acquisition time, reducing energy consumption, and increasing the amount of collected data, which is aimed at either maximizing system throughput or minimizing delay via designing the trajectory, deployment position, or resource allocation of UAVs [14][15][16][17][18]. However, for real-time IoT applications, such as air quality monitoring, precision agriculture, and post-disaster early warning, the quality of service (QoS) highly depends on the freshness of information collected by UAVs from the ground nodes. In this paper, we introduce the age of information (AoI) to indicate the freshness of collected data. Different from the delay metric, the AoI is defined as the time elapsed since the latest received status update packet at a destination node [19]. The continuous data collection task is also a research hotspot. Due to its learning ability and generalization ability, deep reinforcement learning (DRL) is very suitable for such dynamic scenarios in which the AoI is constantly changing [20,21]. In [15,22], the authors formulated the fresh data collection problem in UAV-assisted IoT networks as a Markov decision process (MDP) and used a DRL-based approach to minimize the AoI. Zhou et al. [23] considered the unknown and time-varying traffic generation pattern of ground nodes, and proposed a novel online AoI-based trajectory planning approach using a deep deterministic policy gradient (DDPG). However, the energy of the UAV is limited, and each UAV needs to recharge in continuous data collection tasks. None of these works take recharging the UAV into account. In addition, the MDP-based UAV path-planning method requires the environment to be divided into discrete areas with separate action spaces, which results in a lot of additional propulsion power consumption. To solve these problems, we consider, in our work, that UAVs can perform long-term data collection tasks in dynamic scenarios where the AoI and their own power levels are constantly changing. Therefore, we formulate the trajectory design problem of the UAVs as a semi-MDP (SMDP) and use a hierarchical deep Q-network (DQN)-based method. The contributions can be summarized as follows: • We study the path-planning of rechargeable UAVs in continuous data collection tasks, in which a UAV moves towards the ground nodes to collect data during a period of time and returns to recharge at the right time. We introduce the AoI to indicate the freshness of the collected data; • To optimize the freshness of collected data and to ensure that the remaining charge of the UAVs is sufficient, we formulate the path-planning problem as an SMDP and propose a hierarchical DQN (H-DQN) method to deal with it. The agent learns to select options at a high level and the action depends on the selected option.
The rest of the paper is organized as follows. We review related work regarding the AoI and the typical algorithm of DRL in Section 2. Then, we describe our system model in Section 3. In Section 4, the proposed algorithm is explained in detail. Afterwards, we present simulation experiments and analyses to validate the performance and efficiency of our proposed approach in Section 5. Moreover, we compare our proposed method with the baseline methods to show its advantages. Finally, we give the conclusions in Section 6.

Related Work
Traditional data communications pay attention to the time delay of information transmission. However, after the data is generated, the staleness of the data information will increase with the one-way passage of time. So, real-time update applications need to pay more attention to the freshness of information. The concept of the age of information (AoI) is for scenarios that require a high freshness of collected data, and it is used to describe the freshness of the data collected by the system [24,25]. The biggest difference between the AoI and time delay is that the AoI includes not only the transmission time delay of the information, but also the waiting time of the information at the source node and the residence time at the destination node. The AoI is defined as the difference between the current time and the generation time of the latest packet available at the receiving end. Both the average age and the average age distribution can be used to characterize the freshness of the information. There are some studies that have already explored the data freshness in the UAV-assisted IoT network for data collection. Cao et al. [26] considered a scenario where a UAV aims at a source node to transmit data packets to a remote destination node, minimizing the peak AoI by jointly optimizing the flight trajectory and power allocation. Yi et al. [27] proposed a dynamic programming approach to minimize the AoI at the data center by redesigning the path-planning of UAVs. In these works, the UAV collects the data from each ground node only once, and then flies back to the depot.
Reinforcement learning is a learning mechanism that obtains the long-term maximum reward by learning the mapping relationship between the environment state space and the agent action space [28]. Reinforcement learning can be divided into two categories: value-based and policy-based. Value-based methods mainly use the state-value function V and the state-action value function Q for policy selection, where the expressions of V and Q are in a recursive form. In the case of limited state and action space, all strategies can be evaluated to obtain the optimal strategy pi * , but that is difficult to achieve in reality. A feasible method is to continuously optimize the policy through iterations. Firstly, this involves randomly initializing the value function of a policy, then selecting the action according to the value function and updating it according to the Bellman optimal equation value function until its convergence. Q-Learning is a common algorithm of value-based functions. However, traditional reinforcement learning methods can only be applied in some very simple scenarios due to the difficulty of convergence, the curse of dimensionality, and the low generalization. Deep learning is very suitable for regression problems. Using artificial neural networks to fit value functions can not only overcome the problem of a dimensional explosion of state-action space, but it can also give reinforcement learning a certain generalization ability, which makes deep reinforcement learning possible [29]. The introduction of deep learning into reinforcement learning also brings some new challenges. First, in deep learning, the samples used for training are often independent and identically distributed, while decision-making is inertial in reinforcement learning. There is a correlation between samples, which makes it difficult for the neural network to converge. Besides, in the original Q-Learning, the target values and the current evaluation values, as the sample labels, are generated by the same Q-table. However, in deep reinforcement learning, this approach causes the model to oscillate and diverge, so the network that generates the labels is constantly changing.
In order to solve the above problems, the deep Q-network (DQN) came into being [30]. The DQN is a classic representative of value-based algorithms. There is only one value function network in this algorithm, without a policy network. In [31], the authors believe that its biggest feature is that it can directly use the original high-dimensional sensor data for reinforcement learning, and can use the stochastic gradient descent for training, making the training process more stable. The DQN uses the experience reply mechanis; that is, it constructs a memory called the experience playback pool to store the samples collected each time, and it then removes the independent and identically-distributed samples through the random sampling for training, breaking the association of the data. For the second problem, a network identical to the Q-network is constructed in the DQN, called the target Q-network. The network is updated after a certain training step, and the updated method involves copying the parameters in the Q-network. This method can reduce the possibility of oscillation divergence during training, and can make training more stable.
Liu et al. [17] proposed a QoS-based action selection strategy based on the double deep Q-network algorithm. UAVs can make real-time moving and unloading decisions according to the current system state. Wang et al. [32] transformed the path-planning problem of the UAV base stations into a constrained Markov decision process (CMDP), and used Q-Learning to solve the problem. Due to the large state dimension, an adversarial deep Q-Network (dueling deep Q-network) is proposed, which introduces neural networks and adversarial architectures. Yi et al. [33] considered throughput and user fairness, a deep Q-network based learning method is proposed to maximize the overall network utility in the downlink of heterogeneous networks. However, the current research rarely conducts a unified study of the UAV's energy consumption and the AoI-based quality of service.
In this paper, we propose an improved method based on the DQN algorithm, and we use it in our study.

Scenario
As shown in Figure 1, we consider a UAV-assisted post-disaster early warning scenario with a set N of N ground nodes randomly distributed on the ground, a hovering UAV, and a ground control station (GCS). The ground nodes are responsible for sensing the environmental information data of the target area, and the UAV takes off from the GCS and collects the data from each ground node. During the flying and collection process, if the remaining energy is insufficient, the UAV should return to the control station for recharging to ensure the completion of the mission. For the ease of deployment, we assume that the UAV flies at a fixed flight velocity v u and a fixed altitude h. In addition, we require the UAV to fly directly above the ground node to collect data. Therefore, the UAV has N + 1 predetermined waypoints, which are denoted by the set P = p u 0 , p u 1 , ..., p u N , where p u 0 is the position of GCS and p u n , n ∈ N is the position of the n-th ground node. The trajectory of the UAV is approximated by the sequence p u = p u 0 , ..., p u (i), ..., p u 0 , where p u (i), p u (i) ∈ P denotes the hover point of the i-th decision.

Communication Model
Similar to [27], the channels between the UAV and the ground nodes are assumed to be dominated by the LoS links. Therefore, the channel power gain between the UAV and the m-th ground node can be given by where β 0 is the channel gain at a reference distance of 1 m, and d is the distance between the UAV and the ground sensing device. Since this paper assumes that the UAV collects data directly above the sensing device, d = h. Then, the uplink throughput between the UAV and the m-th ground node can be calculated as where B is the channel bandwidth, γ n = p n g σ 2 is the signal-to-noise ratio (SNR) between the UAV and the n-th ground node, p n is the transmitting power of the n-th ground node, and σ 2 denotes the Gaussian white noise power perceived at the UAV. The flight time of the i-th decision is then given by where v u is the flying speed of the UAV. This paper assumes that the UAV is flying at a constant speed. The collecting time of the i-th decision can be given by where s n is the size of the data generated by the n-th ground node.

Energy Model
The energy consumption of the UAVs consists of the communication energy and the propulsion energy. Since the communication energy consumption is relatively small, we only consider the propulsion energy in this paper. According to [34], the propulsion energy is used for keeping the UAV hovering, flying, and accelerating, which is related to the speed of the UAV. The propulsion energy of the rotary-wing UAV can be expressed as where P 0 and P 1 represent the blade profile power and derived power, respectively, V t is the flying speed of the UAV in time slot t, U tip is the tip speed of the rotor blade of the UAV, v 0 is the mean rotor induced velocity in the hovering state, d 0 is the fuselage drag ratio, ρ is the density of air, s 0 is the rotor solidity, and A r represents the area of the rotor disk. In particular, the energy consumption, when hovering, is E u = P 0 + P 1 . Therefore, the energy consumption generated of the i-th decision can be expressed as where E(v) is the UAV flight energy consumption and E(0) is the hovering energy consumption. In this paper, it is assumed that E m ax is the initial energy consumption of the UAV.

Freshness of Collected Data
We employ the AoI to measure the freshness of collected data, which is defined as the time that elapsed since the generation of the latest status update received by the UAV. The AoI of the n-th ground node at time slot t can be expressed by where U n (t) denotes the time at which the latest status update of the n-th ground node, successfully received by the UAV, was generated. Figure 2 describes the characteristics of the AoI. It can be seen that when there is no data collection action, the AoI of the sensing device increases in a stepwise manner. When data is successfully collected, the AoI will decrease to the time it takes to generate the latest data. Moreover, the data freshness should be evaluated based on how valuable the fresh data is in the specific application. Therefore, data freshness can be characterized by a non-increasing utility function. Our optimization objective is the AoI-based system utility. The AoI-based system utility at the time slot t can be expressed as according to [35], we define the function u(∆) = a w∆ , where 0 < a < 1 is a constant, and w is the time-sensitive weight.

Optimization Problem
We consider a discrete-time system, where the entire time is discretized into the t time slots of equal length, denoted as T = {1, 2, . . . , T}, and the unit length is denoted as δ. On the basis of equal-length time slots, the time system is abstracted into multiple execution cycles of unequal length. The duration of one cycle is the time it takes to execute a decision, and the length of each time slot depends on the task performed in this time slot. For example, in the y-th cycle, the UAV performs the data acquisition task of the n-th device, n ∈ N, then the execution time is where T i f represents the flight time to the target sensing device in (3), and T i c represents the time of the data transmission in (4). According to the model and formulas described above, the central goal of the multi-UAV path-planning problem can be modeled as the following maximization problem s.t. p u i ∈ {p 1 , ..., p N }, ∀i ∈ I, e i > 0, ∀i ∈ I.
The first constraint in (10) indicates that the target point of the UAV can only be selected from the set of specified locations for each decision. The second constraint in (11) indicates that the energy of the UAV during the mission will not be exhausted prematurely, where e i stands for the remaining energy.

Proposed Trajectory Optimization Approach
Different from common reinforcement learning problems, the scenario modeled in this paper is not an inertial order decision process with evenly divided time slots. Therefore, a time abstraction strategy is adopted, using a semi-Markov decision process (SMDP) to describe this problem. At first, we abstract the UAV as an agent, and build the system model as an SMDP problem by designing the system state space, the action space, and the reward function. Then, a path-planning algorithm based on the hierarchical DQN is proposed, which enables the UAVs to perform long-term data collection tasks in dynamic scenarios with constantly changing AoI and their own power.

SMDP
An MDP is often used to simplify the modeling of scenarios in reinforcement learning. It assumes that the state transition of the environment has Markov properties; that is, the probability of transitioning to the next state is only related to the previous state. The agent continuously observes the environment with Markov characteristics and makes habitual decisions in an MDP. An MDP with M agents consists of a quadruple S, O, A, R , where S represents the state space of the environment, which is used to describe the state of each entity. In addition, O represents the observable state space of each agent, A represents the action space of each agent, and R represents the reward function. Common MDP is mainly used to model the traditional decision-making process. However, the time between decision-making stages is not constant in the actual scene. Thus, it can be hard for the traditional MDP to model such problems. The SMDP, as an extension of the MDP, is an option-based decision-making process that can model this kind of control problem with variable decision-making times.
An SMDP is mainly composed of a quintuple S, A, O, P, R , where O represents the set of options, and other elements in the tuple have the same meaning as in the MDP. The ption can be represented by a triplet I o , π o , β o , where I o ∈ S is the initial state set of the option, π o represents the strategy adopted within the option, and β o represents the termination condition of the option. In other words, the option I o , π o , β o is available only when the current state s t ∈ I o . If the option is chosen, the agent has to choose the next behavior according to the strategy π, and β o is responsible for judging whether it reaches the termination state. If so, the agent chooses the next option to enter. The MDP can be converted into an SMDP by defining the set O of options on the basis of MDP. The original MDP is more like a special case of the SMDP, in which each action in the MDP is an option that only lasts for a moment.
The upper-level strategy in the SMDP µ : S × O → [0, 1] applies to the reinforcement learning of the selection of an option; that is, to choose an option from O according to the current state, and then proceed to the next round of selection after the option reaches the terminate state. As for the strategy π in each option, it can include some pre-defined rules, or a mapping function from a state to an action space learned through a reinforcement learning algorithm.

Proposed Model
We consider that the UAV can return to the control station for charging when the battery is about to run out, and then continue its data collection task. This paper uses the SMDP to model the UAV-assisted data collection process, and regards the UAV responsible for data collection as an agent, which is designed as follows.

•
State space model S: The system state space includes the state of the UAV and the AoI of the sensor equipment. At each time t, the system state is expressed as s t = {p u (t), e(t), (t)}. Among them, p u (t), e(t) are the own state of the UAV, representing the current position and remaining energy of the UAV, respectively, and (t) = { 1 (t), . . . , n (t)} represents the current AoI of each sensing device. • Work space model A: In this paper, the no-fly areas due to terrain and other reasons are not considered, and the UAV can reach the target area by flying in a straight line. Therefore, the action space includes three categories: a f , a c , and a r , where a f = {a f ,0 , a f ,1 , ..., a f ,n , ..., a f ,N }, n ∈ N represents the flight action of the UAV, and a f ,n is the action of the UAV flying straight from the current position to the n-th sensing device, in which a f ,0 represents the UAV flying in a straight line above the control station. a c indicates that the UAV performs a data collection action, and a r indicates that the UAV performs a charging action.
for e(∑ i j=1 T j ) < e min .
It can be seen from the above formula (12) that there are two main situations for instantaneous rewards. e min represents the minimum energy consumption that can guarantee the return of the UAV. After an option is executed, it will bring a penalty value p if the energy consumption of the UAV is less than the threshold. Otherwise, the instantaneous reward is mainly composed of two parts: the first item represents the average QoS during the execution of the option, and the second item represents the reward brought by the current remaining energy consumption. β e represents the weight of energy consumption rewards, which is used to balance the proportion of the energy consumption rewards with the overall rewards.

Proposed H-DQN Algorithm
We use the SMDP to model the data collection and flight process of the UAV. In fact, the entire decision-making process is layered. The upper-level strategy u is used to select an option from the option set, and the strategy π o in the option set is used to perform a series of bottom-level actions to complete the goal of the option. In the more typical deep reinforcement learning methods used to solve the SMDP problems, such as the hierarchical DQN (H-DQN) algorithm, both u and π o need to be trained. For example, in a scene where there is a no-fly zone, the high-level strategy u is used to learn which option should be executed in the current state; the low-level strategy π o learns the flight rules in the target area, bypassing the no-fly zone to reach the target point in the option. However, in the modeling of the SMDP in the above section, in order to simplify the scene, it is assumed that there is no no-fly zone, so the low-level strategy π o can reach the target point in a straight flight. This is a pre-defined strategy that does not need to be learned. This chapter uses the H-DQN architecture to solve the SMDP problem.
The H-DQN algorithm needs to first initialize the Q network parameters, experience the playback pool, and initialize the environment state and decision index in each training round. For each decision cycle, the agent selects an option o i based on the current system state and a Q network based on the greedy strategy ε. It randomly selects an option from the option space with the probability of ε, and selects an option according to the Q network with the probability of 1 − ε. Then, it executes the strategy of the option. After the execution is complete, the system reaches the new state s i+1 and gets the reward r o,i , saving the four-tuple s i , o i , s i+1 , r o,i into the experience replay pool. This algorithm does not use a fixed greedy coefficient, but allows itself to be continuously updated during the training process. The update formula can be given by where n epi represents the current episode round number, and ω d is the decay weight. ε will continue to decrease during the training process, together with the probability of exploration. When ω d is smaller, ε decreases faster. After updating the greedy coefficient, the time stamp is updated according to (3) and (4). After that, K quadruple samples are randomly sampled from the empirical playback pool and the target value is calculated. The calculation formula for the target value can be expressed as Then, ∑ K k=1 (y k − Q(s k , o k | θ)) is used as the objective loss function to update the parameters of the Q network. The H-DQN algorithm flow is shown as Algorithm 1. Every C updates cycles, and the parameters of the Q network are assigned to the target network for updating. When the remaining energy of the UAV is less than 0 or the mission execution time is reached, the current training round ends and the next round of training is entered.

Algorithm 1 The H-DQN algorithm
Input: initial Q network parameter θ and target network parameter θ Output: trained network parameters 1: initialize the experience playback pool D and the greedy coefficient ε = 1 2: for episode∈ {1, 2, 3...} do 3: initialize the system state t = 0 4: for i ∈ {1, 2, ..., I} do 5: the UAV agent gets the current system state s i calculate the target value y k of each sample according to (14) 13: ) as the objective function 14: use the gradient descent method to update θ 15: if i%C = 0 then 16: θ ← θ 17: else if e(t) < 0 t ≥ T then 18: episode←episode+1,end this round of training 19: else 20: continue training

Performance Evaluation
In this section, the performance of the proposed H-DQN method is numerically evaluated. The simulation program runs on the Ubuntu16.04 operating system. The deep reinforcement learning environment is realized based on the Gym toolkit, andit implements the H-DQN algorithm code based on PyTorch. We consider using a 500 m × 500 m area as a simulation scenario. Due to natural disasters or other reasons, UAVs are needed to assist in real-time data collection tasks in this area, and a certain number of sensor devices are distributed. The data convergence center is located at the edge of the area [0, 0], and the UAVs take off or land from here. During the execution of the data collection task, the UAV can return to the data convergence center for charging. In the simulation environment, the UAV flies at a fixed speed of 20 m/s and the maximum energy it can carry is 2 ×10 4 J. The experimental analysis is carried out according to the percentage of residual energy. More detailed environmental parameters and UAV parameters are shown in Table 1. In our H-DQN method, both the Q network and the target Q network use four-layer artificial neural network architecture. The activation function is unified using the ReLu function, the output layer is added with a layer of the Softmax function to obtain the probability distribution of the output action, and the learning rate is set to 0.001. Other hyperparameters related to the H-DQN training are shown in Table 2.

Training Results
First, we verify the convergence effect of the algorithm when the learning rate is 0.001, and we compare it with the traditional DQN algorithm based on rasterization processing and the Markov decision process modeling. As shown in Figure 3, it can be seen that the layered deep reinforcement learning algorithm H-DQN proposed in this paper converges around the 100th training round, and the reward remains between 140-150. In comparison, the traditional DQN algorithm has an upward trend, but the convergence process is much slower than the H-DQN method proposed in this paper. This is due to the large action state space of the traditional DQN algorithm, while the H-DQN method is modeled through the SMDP and the option, which greatly reduces the action state space, making it easier to learn the optimal strategy. In addition, in the last 200 training rounds, the DQN method has a certain convergence trend, but the reward is still lower than the H-DQN method. This is because the rasterized flight mode of the DQN method causes more energy loss. The energy consumption weight in the reward function will affect the performance of the algorithm. Figure 4 represents the remaining energy change curve of the H-DQN algorithm when the energy consumption weight β e = 1, 1.5, 2. The horizontal axis represents the decision sequence, and the vertical axis represents the current remaining energy consumption as a percentage of the initial energy consumption. It can be seen that when β e = 1, the percentage of the lowest remaining energy is 2-10%,while 2% of the remaining energy is a relatively dangerous situation, which is close to energy exhaustion, and the safety factor is low, which may lead to the UAV not being able to complete the return flight, making it difficult to deal with emergencies. With the increase in β e , it can be seen that the charging frequency of the UAV also increases, and the remaining energy can also be maintained at a safer level. When β e = 1.5, the remaining energy of the UAV is always maintained above 18%. When β e = 2, the remaining energy of the UAV is basically maintained at more than 40%, but frequent charging may cause more QoS losses. As shown in Figure 5, when β e = 1, the average QoS of the system is maintained at around 0.25, which is significantly higher than the QoS when β e = 1.5 and β e = 2. Therefore, users can set β e according to the application requirements to achieve a tradeoff between the QoS and the safe power state.

Comparing with Baselines
In this section, we test the effect of the H-DQN algorithm and other benchmark algorithms. The evaluation metric is the average QoS. As shown in (7), the QoS is a utility function about the AoI. The benchmark algorithms include the DQN algorithm and the only AoI selection method. The DQN algorithm uses the traditional MDP to model the path-planning problem, in which the target area is divided into 20 × 20 square sub-areas of equal area, and the center point of each sub-area is specified as the waypoint where the UAV can hover. The action space of the UAV is to move to an adjacent sub-area, and perform data collection tasks and charging tasks in this area. The instantaneous reward of the UAV is set to the AoI QoS only, and the energy reward of the time slot. In the AoI only method, each time the UAV selects the sensor device with the largest AoI in the current system for data collection, it then returns for recharging when the remaining power is less than a certain threshold.   Figure 6 describes the performance of the H-DQN and other benchmark algorithms with different numbers of sensing devices. It can be seen that when the number of ground sensors is 10, 20, and 30, the performance of the H-DQN algorithm is better than the other two benchmark algorithms. First of all, the selection algorithm based on the AoI is a greedy strategy, and the effect in long-term tasks is not as good as reinforcement learning methods, while reinforcement learning methods focus on long-term cumulative benefits. In addition, the performance of the H-DQN algorithm, based on the SMDP modeling, is also better than the traditional DQN method which is based on the MDP modeling. When the number of ground sensors is 10, 20, and 30, the performance of the H-DQN method is improved by 18.36%, 37%, and 21%, respectively, compared with the DQN method. This is because the H-DQN method, based on the SMDP modeling, uses the concept of time abstraction, which greatly reduces the state-action space, making it easier to learn better strategies. However, as the number of sensing devices increases, the performance of the H-DQN algorithm gradually declines. Therefore, when the number of sensing devices is large, the use of the multi-UAV collaboration for data collection should be considered.

Conclusions
In this paper, we have studied the path-planning problem in UAV-assisted data collection. In order to enable UAVs to perform long-term data collection tasks in dynamic scenarios with a constantly changing AoI and their own power levels, we propose a method based on the QoS maximization problem of the AoI, and we use a semi-Markov decision process (SMDP) to describe this problem. In addition, we also propose a path-planning algorithm based on hierarchical DQN to learn the optimal strategy, and conduct simulation experiments to verify the convergence of the proposed algorithm. We test the effect of the energy consumption weight β e on the remaining power and the QoS influence. Experiments show that the tradeoff between the system QoS and the safe power state can be achieved by adjusting the parameters β e . At the same time, it is compared with two benchmark algorithms, i.e., the DQN and the AoI only algorithm. The results show that this algorithm is better than the other two benchmark algorithms. In the H-DQN method in this paper, the bottom-level strategy option uses the established strategy. In future work, further improvements can be made, such as using the DQN for high-level strategies and the DDPG for low-level strategies.