Radio Map-Based Trajectory Design for UAV-Assisted Wireless Energy Transmission Communication Network by Deep Reinforcement Learning

Abstract: In this paper, we consider a UAV-assisted wireless energy transmission communication network. In this network, the internet of things (IoT) devices are powered by batteries whose energy is gradually consumed over time. The UAV adopts the full-duplex working mode and the flight hover protocol; that is, it can power the target device from its hover position and collect base station data at the same time. Unlike traditional approaches to wireless energy transmission, this paper adopts deep reinforcement learning. On the one hand, deep reinforcement learning solves the dynamic programming problem without a model. Traditional methods often require a prior channel model in order to formulate the variables for convex optimization, while reinforcement learning only requires interaction between the agent and the environment; the strategy is then optimized according to the feedback of the reward function. On the other hand, traditional optimization methods generally solve static programming problems. Since IoT devices constantly collect information from the surrounding physical environment and their power supply requirements change dynamically, traditional methods become complex and incur a large computational overhead, while deep reinforcement learning performs well on complex problems. Our goal is for the UAV, with the assistance of the radio map, to find the best hover positions so as to maximize the energy it supplies, maximize the data throughput it collects, and minimize its energy consumption. The simulation results show that the proposed algorithm finds the best hovering positions of the UAV and significantly improves the performance of the UAV-assisted wireless energy transmission communication network.


Introduction
Drones have been used for a variety of applications for nearly a century. UAVs have many advantages, such as low cost, a clear line-of-sight (LoS) air-ground communication link, high mobility, and deployment flexibility [1]. UAV communication has attracted wide attention in industry and academia. In June 2016, the Federal Aviation Administration (FAA) released rules for daily civilian operations of small unmanned aerial systems, and in 2017, the Third Generation Partnership Project (3GPP) conducted a study investigating UAV support in long-term evolution (LTE) cellular networks [2]. These commercial directions and scientific research projects illustrate the importance of the development of the drone industry. In addition, drones can be controlled remotely and do not require a pilot, so they are widely used in military applications such as remote surveillance and armed attacks to save pilots' lives [3]. At present, drones are widely employed in areas such as reconnaissance, surveillance, public safety, traffic management, search and rescue, and data acquisition [4]. The miniaturization of communication devices and the flexible deployment and cost-effectiveness of unmanned aerial vehicles (UAVs) have paved the way for UAV-assisted communication to become a promising technology for future 6G networks [5]. As a new addition to the 6G network, UAV-assisted communication systems differ from traditional ground systems in several ways. Operating at higher altitudes, UAVs benefit from unobstructed wireless links [6], reduced signal scattering, and lower path loss, thanks to a higher occurrence of LoS channels [7]. However, LoS-dominant channels can introduce interference challenges, necessitating strategic UAV positioning in three-dimensional space [8]. Moreover, the high mobility of UAVs, which enables rapid deployment and dynamic trajectory planning, is particularly advantageous in emergencies. Their maneuverability allows them to closely approach target users, enhancing channel gain and obstacle avoidance. Nevertheless, energy constraints due to limited onboard power are a critical consideration [9]. These distinctive features collectively enhance the capabilities of UAV-assisted communication systems for various applications, including disaster relief and military operations. In recent years, 6G technology has become a hot research direction, and the combination of UAVs and 6G technology can further promote the large-scale application of UAVs. This combination provides high-speed, low-latency communication; supports a variety of UAV applications, including high-definition image transmission, remote control, autonomous flight, and collaborative missions; and integrates satellite communications, intelligent network management, and advanced security, opening up new prospects for UAV applications, from agriculture to logistics and from scientific research to smart healthcare [10]. It has promoted the wide application and development of UAV technology.
At the same time, with the continuous rise of artificial intelligence in recent years, different machine learning algorithms have been applied in all walks of life, and UAVs have also been pushed to the forefront of the era of intelligence. Many scholars combine the communication characteristics of UAVs with machine learning algorithms to make UAVs more autonomous and intelligent. UAV trajectory design based on machine learning algorithms has been extensively examined in various studies, and significant results have been achieved. By taking full advantage of the high mobility of UAVs, various communication efficiencies (e.g., throughput, coverage, and energy efficiency) are maximized under different UAV-ground channel models [11].
The internet of things (IoT) plays a significant role in UAV applications. The advancements in wireless connectivity, such as 5G and even 6G, have provided ubiquitous and reliable mobile broadband with high speed and low latency [12,13]. This connectivity is crucial for enabling the IoT. Wireless power transmission (WPT) technology, utilizing radio frequency (RF) transmission, has emerged as a potential solution to the energy supply challenge faced by large-scale IoT devices [14]. Compared with obtaining energy from renewable sources, WPT offers several advantages, such as extending the battery life of widely used devices and ensuring a stable and continuous energy supply through wireless connectivity. Additionally, WPT provides the benefits of low maintenance costs and high flexibility [15]. WPT is paired with wireless information transfer (WIT) technology to transfer power and information synchronously across the RF spectrum. In [16], the authors studied integrated design schemes for WPT and WIT and identified three types of methods, namely, simultaneous wireless information and power transmission, the wireless powered communication network, and wireless powered backscatter communication. The basic problem of a WPT-based communication system is to meet the needs of data transmission and energy harvesting (EH) at the same time. A UAV combined with WPT can conduct data acquisition and energy transmission, and it has become an important part of the IoT network [17].
In recent years, UAV-assisted wireless power transmission has attracted a great deal of research interest. In [18,19], the authors studied a full-duplex UAV IoT model and proposed the hover and flight energy collection (HF-EH) scheme, in which energy is collected both while the UAV hovers and while it flies. In [20], a UAV-assisted wireless powered communication network is investigated, in which ground nodes receive a constant power supply from UAVs and the UAVs collect data from the ground nodes using a time division multiple access (TDMA) scheme. The study optimizes the position and time allocation of the UAVs to enhance the transmission rate of the ground nodes, and it highlights the potential of UAVs acting as mobile power sources and data collectors. In [21], UAVs are used in a mobile edge computing architecture. The ground node offloads part of its computing tasks to the UAV and harvests energy through the WPT module integrated on the UAV, and the flight trajectory of the UAV is optimized to improve the computing capability of the ground node. In [22], the authors propose a highly energy-efficient UAV cooperative relay scheme that guarantees the successful transmission rate of the UAV. On the premise of ensuring the bit error rate, a data relay schedule is developed to reduce the energy consumption of the UAV. In [23], an energy consumption model is adopted to analyze the tradeoff between energy consumption and task completion time, providing insights into how optimizing energy usage affects the overall efficiency of UAV applications. Furthermore, in [24], a multi-UAV control strategy based on deep reinforcement learning (DRL) is proposed to achieve effective and fair communication coverage for ground users in a specific target area. Specifically, the authors employ the deep deterministic policy gradient (DDPG) algorithm [25] for continuous control tasks, enabling efficient decision-making for the UAVs. Building upon this work, Ref. [26] presents a multi-agent distributed solution that refines the control strategy proposed in [24] and explores how multiple UAVs can collaborate and coordinate with each other to achieve communication coverage goals. Some recent works [27-30] have begun to use DRL to optimize the design of IoT networks with wireless power transmission. In [27], the authors studied the trajectory planning and time resource allocation of UAVs to maximize the minimum throughput in a multi-UAV wireless powered communication network. In [28], the authors proposed a novel UAV communication scheme assisted by a reconfigurable intelligent surface (RIS), in which IoT devices collect energy in the downlink and transmit information to the UAV in the uplink. In scenarios characterized by dynamic environments, real-time data processing is crucial to maintain optimal performance. Therefore, in [29], the authors considered the freshness of the collected data and minimized the age of information based on the DDQN algorithm. In [30], the authors analyze the coverage probability of UAV-supported mobile networks in scenarios with clustered users. In [31], the authors use the DDQN algorithm to solve the resource allocation problem in a UAV-assisted cellular network with interference, maximizing the energy efficiency and total network throughput, but they do not consider the flight energy consumption of UAVs. In [32], the authors use federated learning to construct a radio map and, on this basis, plan the trajectory of the UAV so that it can maintain communication with the base station. In [33], the authors consider the communication quality of a cargo UAV during flight. To enable logistics drones to report their real-time status to ground operators or base stations during flight, the authors apply deep reinforcement learning on the basis of radio maps.
In the research of UAV-assisted wireless energy transmission, how to maximize the total data throughput, maximize the total collected energy, and minimize the energy consumption of UAVs are all questions worth studying. However, many of the studies mentioned above consider only one of these optimization objectives, or do not consider them fully [34], even though there are tradeoffs among the three. In this paper, we give full consideration to the tradeoffs among the three optimization objectives in UAV-assisted wireless energy transmission. With the aid of radio maps, the UAV uses the flight hover protocol to continuously visit various IoT devices. We consider both energy transmission and information transmission: in the downlink, IoT devices collect the energy transmitted by the UAV, and in the uplink, the UAV receives the data sent by the base station. The DDPG algorithm is employed to optimize the flight decisions of the UAV, aiming to maximize total data throughput, maximize collected energy, and minimize energy consumption. To achieve this, the reward is designed as a four-dimensional vector: three elements correspond to the three optimization goals, while an additional element ensures the completion of the fundamental task. The simulation results, which cover two benchmark schemes and two optimal strategies, demonstrate the feasibility and effectiveness of the scheme.
The rest of the paper is organized as follows. Section 2 presents the system model. This section provides a detailed explanation of the system model adopted in the study. It covers the key components, parameters, and assumptions that form the basis of the proposed approach. Section 3 introduces our proposed algorithm. In this section, the paper presents the novel algorithm developed by the authors. The algorithm's methodology, steps, and implementation details are described, highlighting its contribution and potential improvements over existing approaches. Section 4 shows the numerical results. This section showcases the numerical results obtained from the experiments or simulations conducted in the study. It includes relevant metrics, graphs, or tables to demonstrate the algorithm's performance and effectiveness. The results are analyzed and discussed in relation to the research objectives. Finally, we summarize the conclusion in Section 5.

System Model and Problem Formulation
As shown in Figure 1, we consider a wireless powered internet of things system. The system consists of K randomly distributed IoT devices with limited battery power, a rotary-wing UAV, and ground base stations (GBSs). The UAV needs to provide wireless power to the IoT devices in order to prevent them from running out of battery power. In this paper, the flight hover protocol is adopted. In this protocol, the UAV does not communicate with ground equipment during flight and only carries out data collection and energy transmission when it is hovering. The UAV is equipped with a hybrid access point (HAP) and two antennas. It uses one antenna to transmit power in the downlink and the other antenna to receive data from the base station.
We assume that the number of IoT devices is $K$, and $\mathcal{K} = \{1, 2, 3, \ldots, K\}$ denotes the set of IoT devices, which are randomly distributed on the buildings of the city model. For convenience of analysis and calculation, we assume that all IoT devices have the same height $h_u$, and the position of the $k$-th IoT device is expressed as $[x_k, y_k]$. Because IoT devices continuously consume battery power in actual operation, we use $\beta_k^e(t)$ to denote the remaining energy in the battery of the $k$-th IoT device at time $t$ and $\mu_k(t)$ to denote the rate at which it consumes energy. The energy consumption rate $\mu_k(t)$ differs across IoT devices: we assume that $\mu_k(t)$ follows a Gaussian distribution with the same variance but a different mean for each device. The battery energy of the $k$-th IoT device is updated once per time division interval $\Delta t$. The upper limit of $\beta_k^e(t)$ is the battery capacity, denoted by $\beta_{\max}^e$; it is limited by hardware and is the same for every IoT device. If an IoT device exhausts its battery, it stops working. Therefore, it is very important for the UAV to supply power to IoT devices in time. We use $N_e(t)$ to denote the number of devices that have run out of battery at time $t$. Once $\beta_k^e(t) \le 0$, we set $\beta_k^e(t) = 0$ and increase the value of $N_e(t)$. In addition, we assume that the flying altitude of the UAV is $H$ and the downlink transmission power is $P_d$ when the UAV is at position $[x_u, y_u, H]$. Because the energy consumption rate differs across IoT devices, the priority with which the UAV supplies energy to them also differs. We use $w_k^u$ to represent the energy supply priority of the $k$-th device. The priority depends not only on the ratio of the energy required by the device to the total energy but also on its energy consumption rate, which serves as a prediction of the future priority.
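The battery dynamics described above can be illustrated with a minimal sketch. The update rule used here (subtract $\mu_k(t)\Delta t$ and clip the result to $[0, \beta_{\max}^e]$) is an assumption made for illustration, as are the function name `step_batteries` and the choice to treat negative Gaussian draws as zero consumption; charging by the UAV is not modeled.

```python
import random

def step_batteries(levels, mean_rates, rate_std, dt, capacity, rng):
    """Advance every battery by one interval dt.

    Assumed update (not taken from the paper):
        level <- clip(level - mu * dt, 0, capacity),
    where mu ~ N(mean_rate_k, rate_std^2) and negative draws count as zero.
    Returns the new levels and the number of devices depleted in this step.
    """
    new_levels = []
    newly_depleted = 0
    for level, mean_rate in zip(levels, mean_rates):
        mu = max(0.0, rng.gauss(mean_rate, rate_std))  # consumption rate
        nxt = min(capacity, max(0.0, level - mu * dt))
        if level > 0.0 and nxt == 0.0:
            newly_depleted += 1  # contributes to N_e(t)
        new_levels.append(nxt)
    return new_levels, newly_depleted
```

Each device gets its own mean rate while `rate_std` is shared, mirroring the statement that the devices have the same variance and different expectations.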

Communication Model between UAV and GBSs
During the flight, the UAV establishes communication links with ground base stations that are distributed in the area. The quality of the communication signals between the UAV and these base stations depends on various factors, including the large-scale path loss, small-scale fading caused by multipath propagation, and base station antenna gain. These factors collectively determine the end-to-end channel coefficient between the UAV and a specific sector of a base station. The channel coefficient between the UAV and sector $l$ involves three quantities. The large-scale channel power gain between the UAV and sector $l$ is represented by $\gamma_l(q(t))$, where $q(t)$ is the position of the UAV. The transmit and receive power gain of the base station antenna is denoted as $G_l(q(t))$. The variable $h(t)$ represents small-scale fast fading, which is a random factor affecting the channel quality. Once the position of the UAV is given, the large-scale channel power gain $\gamma_l(q(t))$ and the antenna gain $G_l(q(t))$ are constants.
At each moment, the UAV selects only one signal source for communication, so signals emitted by other base stations act as interference. In this paper, the total number of base stations is $M$, and after standard three-sector partitioning, the total number of sectors is $L = 3M$. We define a binary variable $\alpha_l(t) \in \{0, 1\}$: $\alpha_l(t) = 1$ means that sector $l$ provides data communication services for the UAV at time $t$; otherwise, $\alpha_l(t) = 0$. During its flight mission, the UAV connects to only one base station sector at every moment. Therefore, $\alpha_l(t)$ should satisfy $\sum_{l=1}^{L} \alpha_l(t) = 1$. The transmitting power of sector $l$ is denoted as $p_l(t)$, and it is assumed that all base stations have the same transmitting power. Using Shannon's formula, the transmission rate $R(t)$ of the UAV at time $t$ is determined by the ratio of the signal received from the associated sector to the interference, where the interference term collects the signals from all base stations not associated with the UAV and from the other sectors of the associated base station.
In addition, note that the rate expression does not include the impact of noise. The communication link between the UAV and the base station suffers stronger interference than terrestrial links, so its performance is usually interference-limited. Moreover, we consider the worst case of frequency reuse, in which all of the unassociated base stations contribute to the interference term, so for simplicity we ignore the effect of noise.
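A minimal sketch of an interference-limited rate computation follows. It assumes that the power received from sector $l$ is the product $p_l \gamma_l G_l |h|^2$ and that the rate is the spectral efficiency $\log_2(1 + \mathrm{SIR})$ in bit/s/Hz; both the product form and the function names are assumptions for illustration, not the paper's exact expression.

```python
import math

def sir_rate(rx_powers, serving):
    """Spectral efficiency log2(1 + SIR) with noise ignored.

    rx_powers[l] is the power received from sector l (assumed to equal
    p_l * gamma_l * G_l * |h|^2). `serving` is the index of the single
    sector with alpha_l = 1; every other sector is treated as interference.
    """
    signal = rx_powers[serving]
    interference = sum(p for l, p in enumerate(rx_powers) if l != serving)
    if interference <= 0.0:
        return math.inf  # no interferer: the noise-free rate is unbounded
    return math.log2(1.0 + signal / interference)

def best_sector(rx_powers):
    """Index of the sector with the strongest received power."""
    return max(range(len(rx_powers)), key=rx_powers.__getitem__)

def is_connected(rate, r_min):
    """Connection test against the minimum rate R_min."""
    return rate > r_min
```

For example, with received powers `[3.0, 1.0, 2.0]` and sector 0 serving, the SIR is 3 / (1 + 2) = 1, giving a rate of 1 bit/s/Hz.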
In this paper, we take the transmission rate $R(t)$ as the index of connection quality of the cellular-connected UAV wireless communication link, and we define a minimum rate $R_{\min}$ to judge whether the UAV meets the connection requirements. When the transmission rate of the UAV is greater than the minimum rate $R_{\min}$, the UAV and the base station are in the state of information transmission. Otherwise, the UAV is outside the coverage range of the base station and its connection is interrupted. However, due to the randomness of small-scale fading, $R(t)$ in Formula (5) is a random variable. Therefore, we obtain the average rate by taking the expectation over small-scale fading: $\bar{R}(t) = \mathbb{E}_{h}[R(t)]$. However, the small-scale fading $h(t)$ cannot be described by a single probability density function. In this paper, small-scale fading is modeled differently depending on whether the channel between the UAV and the base station is line-of-sight (LoS) or non-line-of-sight (NLoS). For LoS links, the small-scale fading is modeled as Rician fading with a Rician factor of 15 dB. For NLoS links, small-scale fading is modeled as Rayleigh fading. Therefore, one of the simplest ways to compute the expectation over small-scale fading is to measure the signal transmitted to the UAV from sector $l$ multiple times within a short period and then average the measured values. To achieve this, the UAV can utilize measurements such as reference signal received power (RSRP) and reference signal received quality (RSRQ). Assuming that the UAV performs $Z$ measurements of the transmission rate, we denote the transmission rate between the UAV and the base station for the $z$-th measurement as $R_z(t)$. From the collected data, the empirical transmission rate is obtained as $\hat{R}(t) = \frac{1}{Z} \sum_{z=1}^{Z} R_z(t)$. According to the law of large numbers, when $Z$ is large enough, $\bar{R}(t)$ in (6) can be replaced by the empirical average rate $\hat{R}(t)$, i.e., $\lim_{Z \to \infty} \hat{R}(t) = \bar{R}(t)$.
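The averaging procedure can be sketched as follows. The sampler below is a stand-in for real RSRP/RSRQ-based measurements: it assumes an NLoS link whose fading power $|h|^2$ is exponentially distributed (Rayleigh fading) and a hypothetical mean signal-to-interference ratio `mean_sir`; all names are illustrative.

```python
import math
import random

def empirical_rate(measure, num_measurements):
    """Average Z independent rate measurements (empirical mean rate)."""
    total = sum(measure() for _ in range(num_measurements))
    return total / num_measurements

def rayleigh_rate_sampler(mean_sir, rng):
    """Return a callable producing one rate sample under Rayleigh fading.

    Assumption: the fading power |h|^2 is Exp(1) distributed and scales a
    fixed mean SIR, so each sample is log2(1 + mean_sir * |h|^2).
    """
    def measure():
        fading_power = rng.expovariate(1.0)
        return math.log2(1.0 + mean_sir * fading_power)
    return measure
```

Two independent runs with a large number of measurements should give nearly the same value, which is the law-of-large-numbers argument used in the text.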

Channel Model between UAV and IoT Device
The air-to-ground channel exhibits unique characteristics compared to the ground channel, primarily due to the increased likelihood of a line-of-sight (LoS) connection. The behavior of the air-to-ground channel is heavily influenced by factors such as elevation and propagation environment. To accurately model the air-to-ground channel, we employ a probabilistic path loss model that combines both LoS and non-line-of-sight (NLoS) components. The probability of establishing an LoS connection between the UAV and the $k$-th IoT device at time $t$ depends on the elevation angle and on two constants, $\eta_a$ and $\eta_b$, related to the propagation environment of the UAV flying area. $\theta_k(t)$ represents the elevation angle of the air-to-ground channel between the UAV and the $k$-th IoT device, which is calculated from the height difference $H - h_u$ between the UAV and the IoT devices and the Euclidean distance between the UAV and the $k$-th IoT device. The NLoS probability is the complement of the LoS probability. According to [35], the path losses of the LoS and NLoS channels between the UAV and the IoT devices are expressed in terms of the following quantities: $\gamma_0 = \left(\frac{4\pi f_c}{c}\right)^{-2}$ represents the channel power gain at the reference distance of 1 m, where $f_c$ and $c$ are the carrier frequency and the speed of light, respectively; $\alpha$ is the path loss exponent; and $\mu_{NLoS}$ is the additional attenuation coefficient of the NLoS link. $h_k(t)$ represents the power gain of the downlink channel between the UAV and the $k$-th IoT device.
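As an illustration of a probabilistic LoS/NLoS model of this kind, the sketch below uses the widely used sigmoid form $1 / (1 + \eta_a e^{-\eta_b(\theta - \eta_a)})$ for the LoS probability and a distance power law for the path loss. These functional forms, the arcsine elevation formula, the probability-weighted average gain, and all default parameter values are assumptions made for the example and are not confirmed by the text.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def elevation_deg(height_diff, dist_3d):
    """Elevation angle in degrees (assumed: arcsin of height over 3D distance)."""
    return math.degrees(math.asin(min(1.0, height_diff / dist_3d)))

def p_los(theta_deg, eta_a, eta_b):
    """Assumed sigmoid LoS probability as a function of elevation angle."""
    return 1.0 / (1.0 + eta_a * math.exp(-eta_b * (theta_deg - eta_a)))

def mean_channel_gain(horizontal_dist, height_diff, f_c=2.0e9, alpha=2.3,
                      mu_nlos=0.2, eta_a=9.61, eta_b=0.16):
    """Probability-weighted downlink power gain (illustrative defaults)."""
    d = math.hypot(horizontal_dist, height_diff)
    gamma0 = (4.0 * math.pi * f_c / SPEED_OF_LIGHT) ** -2  # gain at 1 m
    los_gain = gamma0 * d ** -alpha
    nlos_gain = mu_nlos * los_gain  # extra attenuation on NLoS links
    p = p_los(elevation_deg(height_diff, d), eta_a, eta_b)
    return p * los_gain + (1.0 - p) * nlos_gain
```

Under these assumptions, the gain is largest when the UAV hovers directly above the device, where the distance is smallest and the elevation angle is 90 degrees.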

UAV Energy Consumption Model
The energy consumption of UAVs can be categorized into two main parts: communication energy consumption and propulsion energy consumption. Communication energy consumption is primarily used for information transmission and signal processing between UAVs and ground base stations (GBSs). It encompasses the energy required for maintaining communication links, processing data, and managing the transmission of information. Propulsion energy consumption, on the other hand, is directly tied to the UAV's flight and hover operations. It encompasses the energy needed to support the UAV's movement, including take-off, landing, and maneuvering during flight, and it is influenced by factors such as speed, acceleration, and changes in flight altitude. To simplify the analysis, the influence of acceleration on energy consumption is ignored in this paper. This is reasonable for typical communication applications, because in these applications the maneuvering time of the UAV accounts for only a small part of the whole operation time. The velocity of the UAV at time $t$ is the vector $V(t)$, whose magnitude is denoted by $v_t$. The relationship between the propulsion power consumption and the speed of the UAV follows the model in [36], whose right-hand side has three terms: the blade profile power, the induced power, and the parasitic power. In addition, the power value at speed $v_t = 0$ is the power consumption of the UAV in the hovering state, denoted by $P_{hov}$, so that $P_{hov} = P(v_t = 0) = P_0 + P_i$. The parameter description in (12) can be found in [36].
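For readers who want to experiment with the three-term structure, the sketch below implements a commonly used rotary-wing propulsion power model with a blade profile term, an induced term, and a parasitic term. It is assumed, not confirmed, that this is the form the text refers to; the default parameter values are illustrative placeholders and should be replaced by the values of the actual platform.

```python
import math

def propulsion_power(v, p0=79.86, p_i=88.63, u_tip=120.0, v0=4.03,
                     d0=0.6, rho=1.225, s=0.05, area=0.503):
    """Propulsion power in watts at speed v (m/s), assumed rotary-wing model.

    p0, p_i : blade profile and induced power in hover (W)
    u_tip   : rotor blade tip speed (m/s)
    v0      : mean rotor induced velocity in hover (m/s)
    d0, rho, s, area : fuselage drag ratio, air density, rotor solidity,
                       rotor disc area
    """
    blade_profile = p0 * (1.0 + 3.0 * v ** 2 / u_tip ** 2)
    induced = p_i * math.sqrt(
        math.sqrt(1.0 + v ** 4 / (4.0 * v0 ** 4)) - v ** 2 / (2.0 * v0 ** 2)
    )
    parasitic = 0.5 * d0 * rho * s * area * v ** 3
    return blade_profile + induced + parasitic

def leg_energy(distance, v):
    """Energy (J) to fly `distance` metres at constant speed v > 0."""
    return propulsion_power(v) * distance / v
```

At `v = 0` the function returns `p0 + p_i`, matching the hovering power $P_{hov} = P_0 + P_i$ stated above. With these placeholder values, flying at a moderate speed costs less power than hovering, which is one reason the speed choice matters for the energy objective.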

Problem Formulation
In our work, the main task of the UAV is to maximize the total transmitted energy and the received throughput through real-time path planning while minimizing its flight energy consumption. The flight speed, flight direction, and hovering position chosen by the UAV should take into account not only the service experience of the IoT devices and the UAV's own energy consumption but also the fact that IoT devices stop working when their power fails. For this purpose, the UAV first selects the device with the highest energy supply priority as its target. Because the UAV has limited downlink transmission power, it can only serve IoT devices within its coverage range. It is assumed that the maximum charging coverage radius of the UAV is $D$. After selecting the target device, the UAV flies towards it. When the target device falls within the coverage range of the UAV, i.e., $d_k(t) \le D$, the UAV hovers at the corresponding position to power the target device in the downlink and, at the same time, receive data transmitted by the base station. We use $i$ to index the $i$-th hover of the UAV, and the corresponding target device is denoted by $k_i$. The power received by the target device at time $t$ is determined by the downlink transmission power $P_d$ and the channel power gain $h_{k_i}(t)$. In this paper, we adopt the nonlinear energy transmission model [31]. Compared with the simple linear model, the nonlinear energy transmission model is better suited to actual communication scenarios. In this model, $P_m$ represents the maximum output DC power, and $a$ and $b$ are two constants related to the circuit characteristics. Once the target device is within the energy transmission coverage of the UAV, the UAV hovers at its current position for energy transmission and data collection until the battery of the target device is full and then moves on to the next energy supply target. The energy supply time of the UAV at the $i$-th hover position, given by Formula (16), is therefore the time needed to fill the battery of the target device $k_i$. Meanwhile, the UAV keeps receiving data from the base station with the highest transmission rate. The data transmission rate between the UAV and base station $j$ at the $i$-th hover position is given by Formula (17), where $W$ is the bandwidth of wireless communication and $\sigma^2$ is the channel noise power. Based on Formulas (16) and (17), the throughput of the UAV at the $i$-th hover position is the product of the hover time and the data transmission rate. We assume that the UAV hovers a total of $I$ times in the total mission time $T$ and consider the total transmitted energy, the total throughput, and the flying and hovering energy consumption of the UAV during the whole mission period. Based on the above discussion, the multi-objective optimization problem is denoted (P0). In (P0), the flight speed $v_t$ and the elevation angle $\psi_t$ represent the flight decision of the UAV. The maneuverability of the UAV is controlled by adjusting these two parameters to optimize the throughput, the wirelessly transmitted energy, and the total energy consumption, where the maximum speed is $v_{\max} = 20$ m/s.

The three optimization objectives must be balanced against each other. Maximizing the total transmitted energy has two parts. First, the number of hovers $I$ should be increased as much as possible within the total task time, which requires increasing the speed of the UAV so that it has more time to visit more target devices and can reduce the number of devices that stop working due to power failure. Second, the closer the hovering position of the UAV is to the target device, the faster the UAV can charge the IoT device, which also shortens each hover and leaves more time to visit more target devices. Therefore, hovering directly above the target device is optimal for maximizing the total transmitted energy. The total throughput can also be improved in two ways. First, the number of hovers can be increased, because the UAV adopts full-duplex communication and receives data while hovering. Second, the closer the UAV is to the base station when hovering, the higher the data transmission rate, and thus the higher the total throughput. This contradicts hovering directly above the target device to maximize power transfer: one objective requires being close to the base station, and the other requires being close to the target device. In addition, although flying fast helps to optimize throughput and energy transmission, it also causes excessive flight energy consumption, while flying too slowly increases the number of devices that run out of power. The three optimization objectives therefore trade off against each other. Such multi-objective problems are difficult to solve with traditional convex optimization methods, and this problem has not appeared in the current research literature. Since the power consumption of IoT devices changes dynamically over time, finding the optimal hover position with traditional methods incurs a large computational overhead. In contrast, deep reinforcement learning shows great potential in solving complex problems. Its core idea is to let the UAV learn in the environment by itself and find an optimal flight decision based on experience. In this paper, we use the deep deterministic policy gradient (DDPG) algorithm and consider the influence of the weights of the three optimization objectives.
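The per-hover quantities can be sketched in a few lines. The logistic harvesting curve below is a commonly used nonlinear energy harvesting model with a saturation level $P_m$ and circuit constants $a$ and $b$; it is assumed, not confirmed, to be the form used here. The hover time ignores the energy the device consumes while charging, and all function names are hypothetical.

```python
import math

def select_target(priorities):
    """Index of the device with the highest energy supply priority."""
    return max(range(len(priorities)), key=priorities.__getitem__)

def harvested_power(p_in, p_max, a, b):
    """Assumed logistic nonlinear harvesting model.

    Returns 0 for zero input power and saturates at p_max for large input.
    """
    omega = 1.0 / (1.0 + math.exp(a * b))
    logistic = p_max / (1.0 + math.exp(-a * (p_in - b)))
    return (logistic - p_max * omega) / (1.0 - omega)

def hover_time(level, capacity, p_harvest):
    """Time to fill the battery at constant harvested power (assumption)."""
    if p_harvest <= 0.0:
        raise ValueError("harvested power must be positive")
    return (capacity - level) / p_harvest

def hover_throughput(rate, t_hover):
    """Data received during one hover: rate multiplied by hover time."""
    return rate * t_hover
```

With the illustrative values `p_max = 0.024`, `a = 150`, and `b = 0.014`, the harvested power grows with the received power and levels off near `p_max`, which a linear model cannot capture.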

Trajectory Design as a Markov Decision Process
In this section, to solve the multi-objective optimization problem, we first establish the environment model with which the UAV interacts and then transform the problem into a Markov decision process (MDP). The UAV, as an agent, constantly adjusts its behavior and learns a flight strategy during the interaction, finally obtaining the optimal solution. Next, we design the core elements of the Markov decision process: the state space, the action space, and the reward function.

• State: Considering the actual situation, the UAV can observe itself and part of the network information. Firstly, the UAV should fly towards the IoT device with the highest energy supply priority. In order to make better flight decisions, the state space of the UAV should first include its position $x_u(t), y_u(t)$ so that it knows its flight direction. Secondly, in order to power the IoT device in time, the UAV should take into account the distance $d_x(t), d_y(t)$ between itself and the target device, so that a more appropriate speed can be selected to save energy and avoid power failure of the device. Therefore, the state of the UAV should also include the distance from the device. To keep the UAV from flying out of the area and wasting computing resources, the state also includes the cumulative number of times the UAV has hit the boundary, $N_f(t)$. In addition, in order to enable the UAV to serve power-lacking devices as soon as possible, we put the number of IoT devices that have failed due to power failure, $N_e(t)$, into the state space. Thus, the state space is set to $S = \{x_u(t), y_u(t), d_x(t), d_y(t), N_f(t), N_e(t)\}$.
• Action: In this paper, the UAV adopts continuous actions and makes real-time flight decisions based on the observed state. The action space is set to $A = \{v \cos\psi, v \sin\psi\}$.
• Reward: The purpose of our work is to maximize the total transmitted energy and received throughput and minimize the flight energy consumption of the UAV by planning its flight path. Therefore, we set the reward as the sum of $r_{ET}(t)$ for maximizing transmission energy, $r_{th}(t)$ for maximizing throughput, $r_{ec}(t)$ for minimizing flight energy consumption, and the auxiliary reward function $r_{aux}(t)$.
We reward transmitted energy and received throughput and penalize the UAV's own energy consumption. $\mu_{ET}$, $\mu_{th}$, and $\mu_{ec}$ represent the weights of the three optimization objectives, respectively. Different weights lead the UAV to emphasize different objectives. In addition, the weight of the auxiliary reward function is set to 1.
In summary, the reward function of the whole optimization objective is set as $r(t) = \mu_{ET} r_{ET}(t) + \mu_{th} r_{th}(t) + \mu_{ec} r_{ec}(t) + r_{aux}(t)$.

Multi-Objective Path Planning Algorithm Based on DDPG
In the last section, the reward function was designed so that the UAV achieves the maximum energy transmission, maximum data throughput, and minimum energy consumption within the specified task time. The multi-objective DDPG algorithm in this section builds on the DDPG framework and modifies the design of the reward function; its framework is shown in Figure 2. We maintain a behavior (actor) network $\mu(s|\theta^{\mu})$ to improve the target policy, which establishes a mapping from state to action: given the current environment state, the output of the behavior network is an action in the action space. The value (critic) network $Q(s, a|\theta^{Q})$ is used to evaluate the value of the action. It provides a signal in the form of the temporal-difference (TD) error, which criticizes the action of the agent in the current environment state. $\theta^{\mu}$ and $\theta^{Q}$ are the parameters of the behavior network and the value network, respectively. The weights of both networks are initialized with a normal distribution whose mean is 0 and variance is $f$, where $f$ is the number of input units of the weight vector. The parameters of the target behavior network $\mu'(s|\theta^{\mu'})$ and of the target value network $Q'(s, a|\theta^{Q'})$ are copied from the main behavior network and the main value network at certain frequency intervals. In the update stage of the multi-objective DDPG algorithm, a batch of samples is randomly selected from the experience replay buffer to train the networks. The calculation of the target value in this stage differs from the classic DDPG: there, the reward given by a single optimization objective is a scalar, whereas the reward of our multi-objective DDPG algorithm is a four-dimensional vector, i.e., $\mathbf{r} = [r_{ET}, r_{th}, r_{ec}, r_{aux}]$. We use $\mathbf{w} = [\mu_{ET}, \mu_{th}, \mu_{ec}, \mu_{aux}]$ to represent the reward weights and combine all rewards by linear summation, i.e., $R = \mathbf{r}\mathbf{w}^{\top}$.
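As a rough illustration of how a vector reward can flow through experience replay, the sketch below stores the four-component reward and applies the linear summation only when a batch is sampled. The class name, the choice to scalarize at sampling time, and the seeding are assumptions made for this example; the paper does not specify these details.

```python
import random
from collections import deque


def scalarize(reward_vec, weights):
    """Linear summation R = r . w over the reward components."""
    return sum(r * w for r, w in zip(reward_vec, weights))


class VectorRewardReplayBuffer:
    """Replay buffer that keeps [r_ET, r_th, r_ec, r_aux] per transition (hypothetical helper)."""

    def __init__(self, capacity, weights, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped when full
        self.weights = tuple(weights)         # [mu_ET, mu_th, mu_ec, mu_aux]
        self.rng = random.Random(seed)

    def add(self, state, action, reward_vec, next_state):
        self.buffer.append((tuple(state), tuple(action), tuple(reward_vec), tuple(next_state)))

    def sample(self, batch_size):
        """Draw a random batch and return (s, a, R, s') with the reward already scalarized."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        return [(s, a, scalarize(r, self.weights), s2) for s, a, r, s2 in batch]
```

Keeping the raw vector in the buffer means the same stored experience could be re-weighted with a different $\mathbf{w}$ without collecting new transitions, which is one possible reason to defer the summation.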
The main network is updated by gradient descent. First, we calculate the difference between the target value and the current value. In the target network, the target value is calculated as $y = R + \gamma Q'(s', a'|\theta^{Q'})$, where $\gamma$ is the reward discount factor and $a' = \mu'(s'|\theta^{\mu'})$ is the action obtained by feeding the next state $s'$ into the target behavior network; $a'$ is then fed into the target value network to obtain $Q'(s', a')$. The target value thus equals the discounted $Q'(s', a')$ plus the reward obtained by the agent, and the TD error equals the difference between the target value and the current value $Q(s, a|\theta^{Q})$. The main value network is updated by minimizing the loss $L(\theta^{Q}) = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i|\theta^{Q})\right)^2$, where $N$ is the batch size. We update the behavior network using the Q value given by the value network, and the update formula is $\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_i}$.
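The target value, TD error, and periodic target-network copy described above can be written out numerically as follows. This is a minimal sketch with plain Python lists standing in for network outputs and parameters; the function names are hypothetical, the default $\gamma = 0.9$ is the discount factor listed in the simulation parameters, and the mean-squared form of the loss is the standard DDPG choice assumed here.

```python
def td_targets(rewards, next_q_values, gamma=0.9):
    """Target y = R + gamma * Q'(s', a') for each sample in the batch."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def critic_loss(targets, current_q_values):
    """Return the mean squared TD error and the per-sample TD errors."""
    errors = [y - q for y, q in zip(targets, current_q_values)]
    return sum(e * e for e in errors) / len(errors), errors


def hard_update(target_params, main_params):
    """Copy the main-network parameters into the target network in place."""
    target_params[:] = main_params


# Toy batch: scalarized rewards R and target-critic outputs Q'(s', a').
y = td_targets([1.0, 0.0], [10.0, 5.0])
loss, td_errors = critic_loss(y, [9.0, 5.5])
```

In a real implementation the Q values would come from the value networks and the loss would be minimized with an automatic-differentiation framework; the arithmetic of the target is unchanged.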

Simulation Result
In this section, we give the relevant parameters of the whole optimization model and present numerical results to verify the performance of the proposed DDPG algorithm in the communication model. First, the allowable flying space of the UAV is a square area of 500 m × 500 m, and the altitude of the UAV is 100 m. The maximum flight speed of the UAV is $v_{max} = 20$ m/s. The maximum task completion time is 600 s. At the beginning of each training episode, the starting position of the UAV is randomly initialized. The downlink transmission power $p_d$ of the UAV is 5 W. The charging coverage radius of the UAV is set from 15 m to 35 m in steps of 5 m. Secondly, the number of IoT devices is set to 150; they are deployed on buildings at a height of 90 m. Their energy consumption rates are divided into four types, {0.2, 0.5, 0.8, 1.0}, and the energy consumption rates obey the Gaussian distribution. Each device has a maximum storage energy of 120 µW. In addition, there are five base stations deployed in the UAV flight area, and the positions of base stations are [80, 80, 25], [80, 420, 25], [420, 420, 25], and [420, 80, 25]. Some important simulation parameters are shown in Table 1. It should be noted that the activation function of the neural networks is ReLU, and the final output layer of the actor network is tanh to constrain the action.

Table 1. Simulation parameters.
Reward discount factor: 0.9
Batch update size: 64
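To make the role of the tanh output layer concrete, the following sketch maps two unbounded actor outputs to a velocity command that respects $v_{max} = 20$ m/s. The function name and the additional norm clipping are assumptions for illustration: scaling each tanh component by $v_{max}$ alone would allow a diagonal speed of up to $\sqrt{2}\,v_{max}$, so the sketch rescales the vector when that happens. The paper does not state how this case is handled.

```python
import math

V_MAX = 20.0  # maximum flight speed in m/s, from the simulation setup


def bounded_velocity(raw_x, raw_y, v_max=V_MAX):
    """Squash raw actor outputs with tanh and keep the speed within v_max."""
    vx = v_max * math.tanh(raw_x)  # each component now lies in (-v_max, v_max)
    vy = v_max * math.tanh(raw_y)
    speed = math.hypot(vx, vy)
    if speed > v_max:              # assumed fix for the diagonal case
        scale = v_max / speed
        vx, vy = vx * scale, vy * scale
    return vx, vy
```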
We run the DDPG algorithm and obtain the training curve of the agent. Figure 3a,b show the convergence process of the algorithm, and the convergence trend verifies its effectiveness. The cumulative reward reflects the trade-off among the three objective functions. The DDPG algorithm accounts for the conflict between the objectives and jointly optimizes the three objective functions to maximize the reward sum. The corresponding weight parameters are set as $\mu_{ET} = 1$, $\mu_{th} = 1$, and $\mu_{ec} = 1$. The convergence curves of the three objective functions are shown in Figure 4. As can be seen from Figure 3a, the UAV quickly converges to a high reward and then stabilizes. In addition, Figure 3b shows that the loss value is 0 for about the first 10 training rounds, and the reward curve fluctuates slightly. The reason is that at the beginning of training, the UAV is not yet familiar with the environment and is still in the initial exploration stage: it has not accumulated enough experience, and its actions are selected randomly. As the UAV gains experience, the experience replay buffer fills up, and batches are periodically sampled from it to train the networks; the loss then rises steeply at about 50 training rounds, quickly drops to a lower level, and stabilizes. This happens because the first batch of data is fed to the neural networks once the experience replay buffer first becomes full. The network training error is relatively large at the beginning and gradually decreases as more training data become available. At about 400 episodes, the TD loss tends to be stable. During this period the UAV is learning: between 50 and 400 training rounds, the energy transmitted and the throughput received by the UAV gradually increase, while the energy consumption of the UAV decreases. In addition, the TD loss increases sharply at about 1400 training rounds. As can be seen from Figure 4, the energy consumption and throughput of the UAV increase at this point because the UAV has learned relatively complete knowledge of the environment and is able to obtain more throughput at the hovering position. The adjusted strategy therefore spends more energy to improve throughput. After this adjustment, the TD loss decreases again and remains at a stable low level. This indicates that the proposed multi-objective DDPG algorithm can converge to a better action strategy.
In addition, the UAV optimizes the objectives by adjusting its speed and direction. To evaluate the performance of the algorithm, we compare it with two other action strategies: (1) flying at the maximum speed $v_{max} = 20$ m/s and hovering directly above the device to be charged; (2) flying at the most energy-efficient speed $v_{ME} = 9.8$ m/s and hovering directly above the IoT device.
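The time cost of the two fixed-speed benchmarks can be illustrated with a small calculation. The sketch below assumes straight-line flight between consecutive devices and ignores hover time; the function name and the example coordinates are hypothetical, while the two speeds are the values given above.

```python
import math

V_MAX = 20.0  # maximum speed benchmark (m/s)
V_ME = 9.8    # most energy-efficient speed benchmark (m/s)


def total_flight_time(start_xy, device_xys, speed):
    """Sum the straight-line flight times when visiting devices in the given order."""
    t = 0.0
    x, y = start_xy
    for dx, dy in device_xys:
        t += math.hypot(dx - x, dy - y) / speed  # leg length divided by constant speed
        x, y = dx, dy                            # hover directly above the device
    return t


# One 500 m leg: 25 s at V_MAX, roughly twice as long at V_ME.
t_fast = total_flight_time((0.0, 0.0), [(300.0, 400.0)], V_MAX)
t_slow = total_flight_time((0.0, 0.0), [(300.0, 400.0)], V_ME)
```

Within a fixed task time, the shorter legs at $v_{max}$ leave more time for hovering and charging, at the price of higher propulsion energy.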
Flying at the maximum speed $v_{max}$ saves time, so more users can be powered. This benchmark therefore corresponds to the optimization goal of maximum total transmission energy, while flying at the most energy-efficient speed $v_{ME}$ corresponds to the goal of minimum total energy consumption. Figure 5 shows the multi-objective optimization results under different behavior strategies. The results are averages over 200 rounds measured after neural network training. The horizontal axis is the charging coverage radius of the UAV, and the comparison references are $v_{max}$ and $v_{ME}$. These two benchmarks have good channel conditions, and Figure 5c shows that their average energy transmission is the highest. In addition, Figure 5a shows that the total transmitted energy of the proposed scheme at radii of 15 m and 20 m is higher than that of the two benchmark schemes. As the charging coverage radius $R_{ET}$ increases, the total transmitted energy decreases, because the UAV hovers as soon as the IoT device falls within its coverage range. As a result, the distance between the UAV and the target device increases, the energy supply rate drops, and the hover time grows, reducing the total energy transmission. Figure 5b shows that the number of devices visited by the UAV increases with the coverage radius $R_{ET}$, which is reasonable because the probability of the UAV encountering a target device is higher with a larger coverage range. Moreover, if the UAV covers multiple target devices at once, it can stay in the same hovering position and serve them simultaneously. In contrast, the maximum speed $v_{max}$ benchmark must hover directly above the device, so it can hardly cover multiple targets, and some flight time is required between two consecutive IoT devices. Therefore, the proposed DDPG algorithm serves more devices than the two benchmark schemes.

Figure 5d shows that the throughput of our proposed optimization strategy is also better than that of the other schemes: the benchmark schemes have no information about the environment and receive no feedback, so they can hardly adjust their strategy. The proposed DDPG algorithm gives the UAV more flexibility in selecting its position, so the UAV can select the position with the highest transmission rate in the radio map, thus maximizing throughput. As for Figure 5e, the two benchmark schemes flying at $v_{max}$ and $v_{ME}$ correspond to the highest and lowest energy consumption, respectively; each maximizes one optimization objective at the expense of the others. The energy consumption obtained by the DDPG algorithm lies between those of the two benchmark schemes, with all optimization objectives considered. The numerical results thus show that the DDPG algorithm can realize a trade-off between multiple objectives.

Figure 6 shows part of the flight path of the UAV and compares the hovering points and flight tracks obtained with the maximum flight speed strategy with those obtained with the DDPG algorithm. In Figure 6, the white pentacle is the hovering position selected above the device when the UAV flies at the speed $v_{max}$, and its charging coverage is the yellow area. The black pentacle is the hovering position selected by the DDPG algorithm, and its charging coverage is the red area. The radio map in Figure 6 shows the signal intensity received from the base stations at the flight height of the UAV: the darker the color of an area, the worse the signal intensity there, which hinders the UAV from collecting data from the base station and thus reduces the throughput of the entire system. Combining the radio map with the hover points, we find that the hover points obtained by the DDPG algorithm are mostly located in the yellow area, i.e., the area with high signal strength, while many of the hover points obtained at the maximum flight speed are located in the black area, i.e., the area with low signal strength, which hinders data collection from the base station. The reason is that in the DDPG algorithm, data throughput is part of the reward function, so the communication quality between the UAV and the base station is considered when the UAV selects a hovering point, whereas the maximum flight speed strategy ignores throughput when selecting hovering points.

To better illustrate that the hover points selected by DDPG improve the total data throughput of the system, we visualize the data transmission rate of the UAV at the hover points obtained by the two methods, as shown in Figure 7. The horizontal coordinate "loc1" represents the position of the first hovering point, which is approximately (320, 400), and "locx" represents the x-th hovering point. Combined with the trajectory diagram, it can be observed that the UAV has a higher energy transmission rate when it flies at the speed $v_{max}$, while under the control strategy of the DDPG algorithm it obtains a higher data transmission rate at the cost of reduced energy transmission performance. This shows that the DDPG algorithm can achieve multi-objective optimization and obtain a better trade-off.

To further verify that the action strategy of the DDPG algorithm is better than that of the benchmark schemes, we generated a set of results by adjusting the weight parameters so that each of the three objectives is optimized separately: to maximize the sum of transmitted energy, the weights are set as $S_{ET}$: $\mu_{ET} = 1$, $\mu_{th} = 0$, $\mu_{ec} = 0$; to maximize the total data throughput received by the UAV, the weights are set as $S_{th}$: $\mu_{ET} = 0$, $\mu_{th} = 1$, $\mu_{ec} = 0$; and to minimize the UAV energy consumption, the weights are set as $S_{ec}$: $\mu_{ET} = 0$, $\mu_{th} = 0$, $\mu_{ec} = 1$. In multi-objective optimization, all weight parameters are set to 1, i.e., $Mo$: $\mu_{ET} = 1$, $\mu_{th} = 1$, $\mu_{ec} = 1$. The results of the optimization are shown in Figure 8. As the charging radius of the UAV increases, the UAV has a wider range of hovering locations to choose from, and better throughput can be obtained by taking advantage of good channel conditions. Comparing single-objective and joint optimization, Figure 8a shows that the total energy transfer obtained by the joint strategy $Mo$ is higher than that obtained by the energy consumption strategy $S_{ec}$ and the throughput strategy $S_{th}$ and lower only than that obtained by the energy transmission strategy $S_{ET}$. This is reasonable because strategies $S_{ec}$ and $S_{th}$ do not consider energy transmission ($\mu_{ET} = 0$), so their energy transmission performance is not as good as that of the proposed strategy $Mo$. Strategy $S_{ET}$, however, considers only energy transmission, whereas strategy $Mo$ also takes throughput and energy consumption into account. Therefore, strategy $Mo$ has worse energy transmission performance than strategy $S_{ET}$. Figure 8b,c show that strategy $S_{ET}$ is also better than the other strategies in terms of the total number of charged devices and the energy transmission rate. Figure 8d,e show that strategies $S_{th}$ and $S_{ec}$ perform best in maximizing the data throughput received by the UAV and minimizing the energy consumption of the UAV, respectively, for reasons similar to those given for Figure 8a. In terms of data transmission rate, strategy $Mo$ is better than strategies $S_{ET}$ and $S_{ec}$ but worse than strategy $S_{th}$. In terms of energy consumption, strategy $Mo$ is second only to strategy $S_{ec}$ and better than the other two strategies. This shows that the DDPG algorithm can comprehensively optimize multiple objectives.
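The advantage of a flexible hover position, discussed with Figures 5d and 6, can be illustrated with a simple lookup on a gridded radio map. This sketch is not the DDPG policy, which learns its hover points from the reward; it is a greedy illustration under assumed conventions: the radio map is a 2D list of achievable rates, the device is identified by its grid cell, and the coverage radius is measured horizontally in cells. All names are hypothetical.

```python
import math


def best_hover_cell(rate_map, device_rc, radius_cells):
    """Return ((row, col), rate) of the highest-rate cell within the coverage radius."""
    best_rate, best_cell = None, None
    for r, row in enumerate(rate_map):
        for c, rate in enumerate(row):
            # Only cells from which the device is still inside the charging coverage.
            if math.hypot(r - device_rc[0], c - device_rc[1]) <= radius_cells:
                if best_rate is None or rate > best_rate:
                    best_rate, best_cell = rate, (r, c)
    return best_cell, best_rate


# Toy 3 x 3 radio map; the device sits under the centre cell.
toy_map = [[1, 2, 3],
           [4, 5, 9],
           [7, 8, 6]]
cell, rate = best_hover_cell(toy_map, (1, 1), 1)
```

With radius 0 the only candidate is the cell directly above the device, which mirrors the benchmark strategies; a larger radius admits neighbouring cells with better signal, at the cost of a longer charging distance.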

Conclusions
This paper studies a UAV path planning problem based on the radio map. We optimize multiple objectives, namely wireless energy transmission, data collection, and energy consumption; analyze the UAV's energy consumption model; and build a radio map that, once the UAV has gained a certain understanding of it through training, can be used to select the best hover position. We use a deep reinforcement learning algorithm to realize the control strategy of the UAV, transform the problem into a Markov decision process, and set an appropriate state space and action space for the reinforcement learning problem. The numerical results verify the effectiveness of the DDPG algorithm, which obtains a better strategy than the other schemes. For multi-objective optimization, weight parameters are defined for each optimization index, and the DDPG algorithm obtains a comprehensive trade-off according to the combined weight parameters of the optimization objectives.

Figure 1. UAV-assisted wireless energy transmission communication network system.

Figure 3. Plot of training parameters versus number of training sessions.
(a) Total transmission energy. (b) Total number of charging devices. (c) Average transmission of energy. (d) Throughput. (e) Energy consumption. (f) The number of power-off devices.

Figure 4. The training curve of the optimization objective.
(a) Total transmission energy. (b) Total number of charging devices. (c) Average transmission of energy. (d) Throughput. (e) Energy consumption.

Figure 5. Optimized results under different behavioral strategies.

Figure 6. The trajectory of the UAV under different strategies.

Figure 7. Comparison of throughput at seven hovering points of UAVs under different strategies.

Figure 8. Optimization results of different weight parameters.
r ET (t), r th (t), r ec (t), and r aux (t) are