In Situ MIMO-WPT Recharging of UAVs Using Intelligent Flying Energy Sources

Abstract: Unmanned Aerial Vehicles (UAVs), used in civilian applications such as emergency medical deliveries, precision agriculture, and wireless communication provisioning, face the challenge of limited flight time due to their reliance on the on-board battery. Developing efficient mechanisms for in situ power transfer to recharge UAV batteries therefore holds potential to extend their mission time. In this paper, we study the use of the far-field wireless power transfer (WPT) technique from specialized transmitter UAVs (tUAVs) carrying Multiple Input Multiple Output (MIMO) antennas for transferring wireless power to receiver UAVs (rUAVs) in a mission. The tUAVs can fly and adjust their distance to the rUAVs to maximize the energy transfer gain. The use of MIMO antennas further boosts energy reception by narrowing the energy beam toward the rUAVs. The complexity of this dynamic operating environment increases with the growing number of tUAVs and rUAVs with varying levels of energy consumption and residual power. We propose an intelligent trajectory selection algorithm for the tUAVs based on a deep reinforcement learning model, Proximal Policy Optimization (PPO), to optimize the energy transfer gain. The simulation results demonstrate that the PPO-based system achieves about a tenfold increase in flight time for realistic settings of transmit power, distance, number of sub-bands, and number of antennas. Further, PPO outperforms the benchmark movement strategies of "Traveling Salesman Problem" and "Low Battery First" when used by the tUAVs.


Introduction
The recent years have seen increasing advancements and decreasing costs of low-altitude UAVs, commonly known as drones. Drones carrying a range of technologies for sensing and communication are becoming popular with service providers as innovative service delivery platforms, such as for emergency medical deliveries, precision agriculture, and aerial imagery. This popularity contributes directly to the growth of the global market for drone-delivered commercial services to an estimated value of USD 127 billion [1]. Drones are also employed in 5G networks, either as aerial base stations providing wireless Hotspot or mobile relaying services to ground nodes [2,3], or as aerial nodes of cellular UAV networks [4,5].
With such a staggering market value, reliability through service continuity becomes a critical success factor [6]. However, drones have short flying times due to their dependency on on-board, limited-capacity batteries for power supply. For example, the typical flight time of the DJI Spreading Wings S900 drone is about 18 min when the battery is fully charged [3]. This implies that drones need to make frequent trips to the ground charging stations so their batteries can be replaced or recharged, which creates significant service interruptions. To the best of our knowledge, this is the first attempt to consider such a wireless charging architecture for UAVs using multiple dedicated and coordinating aerial energy sources.
The main contributions of this paper can be summarized as follows: (i) we propose a system of multiple tUAVs to facilitate aerial wireless charging of rUAVs using multi-band MIMO beamforming, (ii) we propose a PPO-based movement decision algorithm for the tUAVs in selecting the next rUAVs to recharge as per their battery levels, and (iii) we compare the PPO-based system performance, using simulations, with two benchmark movement decision strategies for the tUAVs: Traveling Salesman Problem (TSP) and Low Battery First (LBF). Our results demonstrate that with PPO, the system achieves a tenfold flight time extension compared to no WPT. Further, this strategy outperforms the benchmark movement strategies of TSP and LBF when used with WPT.
The rest of the paper is organized as follows. Related works are discussed in Section 2. The system description and the DRL model are presented in Section 3, followed by the performance evaluation of the proposed model in Section 4. We conclude the paper and discuss future work in Section 5.

Related Works
Researchers commonly address the drone energy-limitation issue by designing energy-efficient operating mechanisms, including flight path (trajectory) planning and communication methods. UAVs consume energy through both mechanical (flying, hovering) and electronic (wireless communication) functions, offering scope for improving the energy efficiency of both. However, since the mechanical energy consumption is significantly higher than that of the electronic functions, researchers mostly focus on optimizing the trajectory to shorten flight paths and thereby reduce mechanical energy consumption, e.g., in [7,8,24]. As our current work is on energy replenishment solutions for deployed UAVs, we omit the details of work on energy-efficient UAV operations.
One approach to the energy replenishment of deployed UAVs is using tethered UAVs [25], wherein the UAVs are connected via a cable to the ground station to receive a continuous supply of power. As the ground station has an unlimited power supply, tethered UAVs can operate perpetually. However, this approach restricts the UAVs' mobility and confines their deployment to areas with existing ground stations. Another solution is UAV battery swapping [26], which requires the deployed UAVs to fly back to the ground station for "hot-swapping" the battery, whereby an external power source keeps the UAV powered on while the battery replacement takes place. This is quicker than recharging the on-board battery at the ground station, but it still leads to service interruptions as the UAVs leave the serving area for the battery swap. To allow greater freedom of deployment locations and mobility of the UAVs, we consider a far-field wireless powering approach in our current work, for which we discuss the related work below.
As previously mentioned, many past trials have demonstrated the viability of far-field WPT techniques. Further development and practical use of such techniques had somewhat stalled; however, we see a renewed interest, as evident in recent industry needs, activities and trials. A New Zealand-based startup company, Emrod, was recently reported to have developed a long-range, high-power WPT technology to deliver wireless electricity to end users without needing copper power lines [27]. This follows the country's second-largest power company, Powerco, planning to trial the technology in 2021 [28]. In 2020, a US-based company, PowerLight Technologies (formerly known as LaserMotive), demonstrated a wireless power receiver for drones. In an earlier demonstration, the company used a laser beam to fly a drone for more than 12 h [29].
On the other hand, the increasing demand for autonomous wireless recharging of UAVs is clearly evident from their numerous commercial operations, such as power line monitoring (e.g., [30]), food delivery (e.g., [31]) and law enforcement (e.g., [32]), to name a few. The short flight times of these drones are causing serious deployment hindrances in these industries. According to a 2020 Bloomberg report, the global autonomous wireless charging and infrastructure market for drones is predicted to reach USD 249.3 million by 2024 [33]. This justifies the unified research and industry efforts required to develop practical solutions for wireless charging of drones and, more importantly, in situ solutions.
The utility of far-field WPT using EM radiation is well established: it provides placement flexibility and mobility of transmitters and receivers, can work even in non-LoS conditions, and can deliver power over a distance. Due to its limited energy conversion efficiency, this technique is generally suitable for low-power devices. Despite this limitation, various works suggest that far-field WPT can also be used for recharging UAV batteries (e.g., [15]). To this end, wireless recharging of UAVs was proposed using RF WPT in [20] and optical energy transfer in [21], both from ground base stations in a simultaneous wireless information and power transfer system. Using a power-splitting and time-switching architecture, the authors of [20] proposed a relaying system in which the UAV harvests energy and information from the base station and relays the information to a ground node. With the objective of prolonging the network lifetime and maximizing throughput, they optimized the system parameters along with the UAV deployment location; however, no explicit results on received power were reported. The authors of [21] studied a similar system but with an optical transmitter at the ground base station casting an optical beam, carrying both data and energy, to the UAV, providing simultaneous communication and charging. The numerical results showed that the system achieved a high network throughput and 25% extra hovering time for the drones. However, both proposals require the UAVs to be in the proximity of the terrestrial base station to achieve LoS and receive power, which limits where the UAVs can be deployed due to the fixed terrestrial base stations. Therefore, flexible in situ wireless charging of UAVs remains a challenging open problem.
In our previous works [34,35], we studied different modes of dedicated aerial WPT chargers, with tUAVs carrying omnidirectional antennas. In [34], we utilized aerial, stationary (i.e., hovering at fixed locations) tUAVs to study their optimal placement with respect to the rUAVs, so as to maximize the total received power at the rUAVs. In [35], we utilized one flying tUAV to power all rUAVs; for that single tUAV to recharge all rUAVs, a single-agent optimization of the tUAV's trajectory was performed via Q-Learning to enhance power delivery. However, Q-Learning comes with a scalability issue, and both the observation and action spaces must be limited; as such, Q-Learning poses limitations for multi-agent systems like the one in our current work. Further, omnidirectional antennas waste energy, since energy is radiated all around the antenna rather than only toward the energy receiver. The fundamental difference between our current and prior works is that the current work uses multi-band MIMO antennas at the flying energy sources, which distribute power over a wider spectrum and exploit targeted energy beams through beamforming to recharge chosen UAVs, boosting energy delivery at the rUAVs. Further, we employ a multi-agent optimization model for multiple tUAVs in the network to optimize the tUAVs' movement decisions using the PPO algorithm. The advantages of PPO over Q-Learning are discussed in the next section.

System Description
In this section, we present our UAV recharging architecture involving multiple tUAVs and rUAVs. We also present the deep reinforcement learning algorithm using the PPO technique to control the movements of the tUAVs in targeting the next rUAVs to recharge. The optimization aims to enhance the MIMO-WPT efficiency to achieve longer flying times of the rUAVs in the presence of multiple coordinated tUAVs serving multiple rUAVs with dynamic battery levels.

UAVs Recharging Architecture
Our proposed UAV recharging architecture consists of specialized flying UAVs equipped with multiple high-gain RF antennas (tUAVs) that transmit wireless power to recharge the rUAVs' batteries. We assume that the rUAVs are deployed in an area to provide Hotspot wireless communication services to the ground users (Figure 1). The tUAVs are assumed to have a significantly greater power supply than the commodity rUAVs, e.g., by carrying a larger battery or having hybrid power sources; as such, the tUAVs are expected to be costlier and bulkier than the rUAVs. Further, a tUAV that recharges several rUAVs can itself be replaced with another tUAV when its energy is depleted; however, such a tUAV replacement does not interrupt the services of the rUAVs. It is to be noted that our aim is to extend each rUAV's operating time as much as possible, thus reducing the number of times the rUAVs must return to the ground station once their batteries eventually deplete. The tUAVs fly and position themselves to minimize their distance to the rUAVs and to improve the line-of-sight RF links for the target rUAVs, which enhances the power transfer effectiveness.
To increase energy transfer efficiency, we propose a MIMO system to perform energy beamforming and focus energy toward the receiver [36][37][38]. Hence, we consider a point-to-point MIMO system with $m_t$ antennas installed on the tUAVs and $m_r$ antennas on the rUAVs. Without loss of generality, we assume a uniform square antenna array on each side. We use the system model from [14,37], where a total of $N$ orthogonal sub-bands are used to transmit energy. On each sub-band, a sine-wave signal $\mathbf{s}_n(t)$ is emitted at carrier frequency $f_n$ by the $m_t$ tUAV antennas as

$$\mathbf{s}_n(t) = [s_{1n}(t), \dots, s_{m_t n}(t)]^T, \quad n = 1, 2, \dots, N, \tag{1}$$

where $s_{mn}(t)$ is the beamforming component of $\mathbf{s}_n(t)$ at antenna $m$ and frequency $n$. The total received power at all $m_r$ receiver antennas is

$$P_r = \sum_{n=1}^{N} \sum_{i=1}^{m_r} \mathbf{h}_{in}^H \mathbf{S}_n \mathbf{h}_{in} = \sum_{n=1}^{N} \operatorname{tr}\!\left(\mathbf{H}_n \mathbf{S}_n \mathbf{H}_n^H\right), \tag{2}$$

where $\mathbf{h}_{in}^H = [h_{i1n}^{*}, \cdots, h_{i m_t n}^{*}]$ is the channel vector from the transmitter antennas to receiver antenna $i$, $\mathbf{H}_n$ is the channel matrix between the $m_t$ transmitter antennas and $m_r$ receiver antennas, and $\mathbf{S}_n$ is the transmit covariance matrix, all at sub-band $n$. Similarly, the total transmit power at frequency $f_n$ is $\operatorname{tr}(\mathbf{S}_n)$. The maximum transmit power at each sub-band is constrained by regulation and hardware limits; thus, we assume

$$\operatorname{tr}(\mathbf{S}_n) \le P_s, \quad \forall n. \tag{3}$$

Based on [14], and assuming the maximum sum-power $P_s$ is transmitted on each sub-band, the received power at sub-band $n$ is obtained as

$$P_n = P_s \, \lambda_{\max,n}, \tag{4}$$

where $\lambda_{\max,n} = \lambda_{\max}(\mathbf{H}_n^H \mathbf{H}_n)$ denotes the maximum eigenvalue of $\mathbf{H}_n^H \mathbf{H}_n$ for sub-band $n$. As a result, the total harvested power is

$$P_r = \sum_{n=1}^{N} P_s \, \lambda_{\max,n}. \tag{5}$$

Since the channel between a pair of tUAV and rUAV during the recharging process is almost a pure LoS MIMO channel (the tUAV adjusts its position to achieve this), $\mathbf{H}_n$ is a rank-one matrix, and optimal beamforming can be achieved by an SVD-based beamformer [39,40] in which the transmitter antennas form a single strong beam, yielding the optimal energy beamforming gain [37,41]

$$\lambda_{\max,n} = m_t \, m_r \, a_n^2, \tag{6}$$

where $a_n$ is the signal attenuation along the LoS path at frequency $n$, assumed to be the same for all antenna pairs.
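The SVD-based beamforming gain for a rank-one LoS channel can be checked numerically. The sketch below is our own illustration (not the paper's code): it builds a phase-aligned LoS channel matrix in which every antenna pair sees the same attenuation, and verifies that the maximum eigenvalue of the channel Gram matrix equals the product of the antenna counts times the squared attenuation.

```python
import numpy as np

def max_eigenvalue(H):
    """lambda_max(H^H H): the squared largest singular value of H."""
    return np.linalg.svd(H, compute_uv=False)[0] ** 2

# Phase-aligned, pure-LoS rank-one channel: every tx-rx antenna pair sees
# the same attenuation a_n (the simplifying assumption made in the text).
m_t, m_r, a_n = 16, 16, 1e-3
H = a_n * np.ones((m_r, m_t))

lam = max_eigenvalue(H)
P_s = 1.0                 # sum power transmitted on this sub-band
P_n = P_s * lam           # received power on the sub-band

# Beamforming gain is m_t * m_r over a single-antenna link.
assert np.isclose(lam, m_t * m_r * a_n ** 2)
```

With any phase misalignment across antenna pairs, the largest eigenvalue drops below this bound, which is why CSI at the transmitter is needed to realize the full gain.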
This rank-one LoS assumption is valid when the distance between transmitter and receiver is much larger than the antenna array size [39]. For this purpose, the Channel State Information (CSI) should be available at the transmitter side. In contrast with a conventional MIMO communication channel, the energy transfer channel in our system is strong, stable and relatively time-invariant; hence, measuring the CSI via feedback is not a challenging task. The attenuation follows the free-space model

$$a_n^2 = G_t \, G_r \left( \frac{c}{4 \pi f_n d} \right)^2, \tag{7}$$

where $G_t$ and $G_r$ represent the per-antenna gains at the transmitter and receiver, respectively, $c$ is the speed of light, and $d$ is the distance between the transmitter and receiver. Thanks to mechanical alignment, high-gain antennas can be employed to boost the MIMO gain [41,42]. We applied a limit of 90% efficiency [43] to the RF gain in (5) to model the non-ideal implementation of the MIMO system, e.g., mutual coupling. There is also an RF-to-DC conversion efficiency at the receiver, which represents how much of the received wireless energy can be converted to usable energy by the rUAVs; in this work, we assume a constant RF-to-DC efficiency of 80% [44,45], denoted by $\gamma$. Additionally, we assume that the receiver antennas are installed on the top of the rUAVs to minimize blockage by the rUAV's frame or blades. Equation (7) shows that the energy transfer is significant only at short distances; therefore, the tUAV should hover above the rUAV to maximize energy transfer, which also minimizes blockage. Furthermore, small movements of both the tUAV and rUAV do not reduce the beam-alignment efficiency, since the CSI can be measured several times per second to update the beam direction. As discussed earlier, we limit the RF efficiency in our proposed scheme to model non-ideal implementation; a more accurate model could be obtained from a prototype system.
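Putting (4)-(7) and the two efficiency factors together yields a simple link-budget sketch. The function name, parameter defaults and the example numbers below are our own illustrative assumptions, not values taken from the paper.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def harvested_power(P_s, freqs_hz, d, m_t, m_r, G_t=1.0, G_r=1.0,
                    rf_eff=0.9, rf_dc_eff=0.8):
    """Total DC power harvested by one rUAV under rank-one LoS beamforming.

    Per sub-band: a_n^2 = G_t * G_r * (c / (4*pi*f_n*d))^2    (Eq. 7),
    received RF power = P_s * m_t * m_r * a_n^2               (Eqs. 4, 6),
    scaled by the 90% RF and 80% RF-to-DC efficiencies assumed in the text.
    """
    a2 = G_t * G_r * (C / (4 * np.pi * freqs_hz * d)) ** 2
    return rf_eff * rf_dc_eff * (P_s * m_t * m_r * a2).sum()

# 200 sub-bands of 10 MHz spanning 25-27 GHz; tUAV hovering 2 m above the rUAV
freqs = 25e9 + 10e6 * np.arange(200)
p_dc = harvested_power(P_s=1.0, freqs_hz=freqs, d=2.0, m_t=256, m_r=256)
```

With these illustrative numbers the harvested DC power comes out on the order of a couple of watts, and, as (7) implies, it falls off quadratically with the tUAV-rUAV distance.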

Proposed Trajectory Selection Algorithm
Proximal Policy Optimization (PPO) is a model-free, online, on-policy reinforcement learning method from the policy-gradient family [46,47]. It supports both discrete and continuous observation and action spaces. A PPO agent transitions from one state to another by taking actions sampled from its policy. The learning space is defined by a set of states S, based on observations from the environment, and a set of actions A. By performing an action a ∈ A and observing the resulting state, a revenue function calculates a numeric reward. The learner's goal is to find the policy that maximizes the discounted long-term reward accumulated over state-action pairs from the initial state to the goal state, i.e., the optimal policy. The optimal policy indicates which action is best to take in each state, resulting in a maximized overall gain. PPO selects actions based on a probability distribution; we define the optimal policy so that, after training, the action with the maximum likelihood is chosen as deterministic exploitation. PPO thus finds the best location and movement for the tUAVs given an observation of the entire network of rUAVs.
Reinforcement learning has been widely used in UAV-related research recently, with applications ranging from military threat avoidance [48] and obstacle avoidance [49] to trajectory optimization for improving wireless communication services [50]. In our previous work [35], we used Q-Learning, which is not scalable to large observation spaces and multi-agent systems. In our current work, we utilize DRL to solve this scalability issue, where deep neural networks augment reinforcement learning. Among several DRL methods, such as deep Q-network (DQN), deep deterministic policy gradient (DDPG), PPO and Twin-Delayed Deep Deterministic Policy Gradient (TD3) agents, we found PPO to perform the best in terms of faster learning, relatively little hyperparameter tuning and simplicity [51]. Hence, we employ PPO with discrete observation and action spaces. PPO also allows us to use fine-grained discrete observation values; in contrast, discretizing observation values is an implementation issue in Q-Learning, as it sharply increases the Q-table size. The PPO components in our solution are defined as follows:
• Agent (tUAV): observes the current state and takes actions. There are multiple agents in our scenario; to keep the model simple, we implemented the multiple-tUAV system as a single-agent PPO with multiple actions.
• Revenue (R): the combination of rewards and penalties received after taking action a in state S and moving to state S'. It rewards the energy that all rUAVs receive from the tUAVs and applies a penalty if an rUAV has to move to a terrestrial charging station due to low battery. R is formulated as the weighted sum

$$R = w_1 P_r T - w_2 N_o - w_3 n_l + w_4 n_f - w_5 Q,$$

where $w_1, \dots, w_5$ are adjusting weights, $P_r$ is the total power harvested by the rUAVs as given in (5), $T$ is the time step, and $N_o$ is the number of out-of-charge rUAVs that must be replaced, resulting in service interruptions. Here, $n_l$ counts the rUAVs whose battery level falls below the low-battery threshold $B_l$, defined as a fixed fraction of the battery capacity $B_{max}$ of an rUAV, and $n_f$ counts the rUAVs whose level exceeds the full-battery threshold $B_f$, defined as more than 97% charge. Finally, $Q$ indicates a conflict between tUAVs whenever their distance from each other falls below a threshold; this penalty keeps them from charging the same rUAV at the same time and also avoids collisions.
The second PPO function approximator (besides the actor) is the critic V(S), which takes observation S and returns the expectation of the discounted long-term reward [46,47].
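A minimal sketch of such a revenue function is given below. The weight values and the exact functional form here are illustrative assumptions consistent with the terms listed above, not the paper's tuned reward.

```python
def revenue(p_r, t_step, n_out, n_low, n_full, n_conflict,
            w=(1.0, 50.0, 5.0, 2.0, 20.0)):
    """Illustrative revenue R: reward the harvested energy, penalize
    replacements, low batteries and tUAV conflicts, and give a small
    bonus for fully charged rUAVs. The weights w1..w5 are placeholders,
    not the paper's values."""
    w1, w2, w3, w4, w5 = w
    return (w1 * p_r * t_step     # energy delivered during the time step
            - w2 * n_out          # rUAVs out of charge (service interruption)
            - w3 * n_low          # rUAVs below the low-battery threshold B_l
            + w4 * n_full         # rUAVs above the full-battery threshold B_f
            - w5 * n_conflict)    # tUAV pairs closer than the conflict distance

# A step that delivers 2 W for 120 s with no penalties scores higher than
# one that forces a replacement and leaves rUAVs on low battery.
good = revenue(p_r=2.0, t_step=120, n_out=0, n_low=0, n_full=0, n_conflict=0)
bad = revenue(p_r=2.0, t_step=120, n_out=1, n_low=2, n_full=0, n_conflict=1)
assert good > bad
```

Because the replacement penalty dominates the per-step energy reward, the agent learns to keep batteries above the low-battery threshold rather than to maximize raw harvested power alone.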
In the above model, each agent (tUAV) needs to observe all rUAVs' geographical locations and their remaining battery levels. We assume that the rUAVs remain in the same geo-cell in our considered area; therefore, only their battery status needs to be sent to the tUAVs at each time step. Hence, our tUAVs and rUAVs require only lightweight periodic signaling to exchange information.
Considering the discussed reinforcement learning components, we follow Algorithm 1 to obtain an optimal flying trajectory (i.e., movement decisions) for the tUAVs and a recharging mechanism that maximizes the overall flying duration of all rUAVs. In this algorithm, each tUAV receives updated information on the rUAVs at each time step. The observation includes the tUAV's current location, indicating the current state. The agent makes a movement decision based on the actor output. Recharging is considered only when the tUAV arrives and hovers above the chosen rUAV, because recharging is assumed inefficient while the tUAV is flying. The PPO details are not presented in Algorithm 1, as they can be found in [46,47]. The algorithm can be executed centrally at a ground control station or by the tUAVs individually.
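The decision loop above can be sketched as a simplified, self-contained stand-in: grid positions are reduced to target indices, and a greedy policy replaces the trained PPO actor. All names and numeric values below are our own illustrative assumptions, not the paper's implementation.

```python
def step(tuav_targets, batteries, policy, dt_min=2.0,
         drain=0.5, charge=5.0, cap=100.0):
    """One time step of the trajectory-selection loop: each tUAV observes
    all rUAV battery levels, picks the next rUAV (the actor output), and
    recharges only once it hovers above it; recharging while travelling
    is assumed inefficient and is skipped."""
    new_targets = []
    for current in tuav_targets:
        chosen = policy(batteries)
        if chosen == current:                 # already hovering: recharge
            batteries[chosen] = min(cap, batteries[chosen] + charge * dt_min)
        new_targets.append(chosen)            # otherwise travel this step
    for r in range(len(batteries)):           # every rUAV drains each step
        batteries[r] = max(0.0, batteries[r] - drain * dt_min)
    return new_targets

# Greedy low-battery-first function standing in for the trained PPO actor.
# Note both tUAVs may pick the same rUAV here; in the learned policy, the
# conflict term Q in the revenue discourages exactly this behavior.
lbf = lambda b: min(range(len(b)), key=b.__getitem__)

targets, batt = [0, 1], [60.0, 40.0, 80.0, 20.0, 90.0, 55.0]
for _ in range(10):
    targets = step(targets, batt, lbf)
```

Swapping `lbf` for a function that queries a trained actor network turns this skeleton into the PPO-driven loop of Algorithm 1.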

Performance Evaluation
In this section, we first describe the simulation set-up including the baseline algorithms that we compare the PPO's performance against. We then present and discuss key results of this research.

Simulation Setup
In our scenarios, we consider six rUAVs and two tUAVs located in an environment modeled as a 100 × 100 grid (Figure 2). To simplify our simulation design, we assume all rUAVs can be located only at the centers of cells, as illustrated in Figure 2. Each tUAV sends the recharging beam toward the target rUAV selected by the algorithm. We selected an arbitrary frequency band of 25-27 GHz, which can be adjusted as per the spectrum regulations in the region. Note that increasing the frequency increases free-space path loss, but more antennas can be installed within the same antenna aperture, since the proper MIMO inter-element spacing is proportional to the wavelength. For example, the wavelength of frequencies below 1 GHz is very large for MIMO, and the 1-7 GHz range is highly saturated by current wireless communications [40]. On the other hand, the high path loss at higher frequencies helps to minimize the interference of WPT with ground stations. Hence, we propose the mmWave spectrum for our scheme. We assume a maximum power of 1 W is transmitted in each sub-band of 10 MHz width. This is reasonable in terms of regulations, as in most countries mobile devices operating in the millimeter-wave spectrum are permitted to operate in the 83 dBm/100 MHz range [23]. There are 256 antenna elements installed on each tUAV and rUAV in a uniform square array; since the EM wavelength is about 1.2 cm, the array can be readily fitted on a small drone. For a uniform square array, the number of antenna elements should be a perfect square, e.g., 256 = 16 × 16. All simulation parameters are defined in Table 1.
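The claim that the array fits on a small drone can be sanity-checked with a quick calculation, assuming the common half-wavelength element spacing (an assumption on our part; the paper does not state the spacing):

```python
import math

C = 3e8                           # speed of light (m/s)
f = 26e9                          # mid-band of the 25-27 GHz range
wavelength = C / f                # about 1.15 cm
side = math.isqrt(256)            # 16 x 16 uniform square array
spacing = wavelength / 2          # typical half-wavelength element spacing
aperture = (side - 1) * spacing   # physical side length of the array

assert side * side == 256
assert aperture < 0.10            # under 10 cm per side: fits a small drone
```

At lower frequencies the same 256-element array would not fit: at 2.6 GHz the aperture grows tenfold, which is the trade-off motivating the mmWave choice above.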

Table 1 lists all simulation parameters, including the transmit power of each sub-band, P_s = 1 W, and the antenna element gains G_t and G_r.
In order to evaluate our algorithm's performance, we used the MATLAB Reinforcement Learning Toolbox to simulate the environment and implement the PPO algorithm. Additionally, we simulated the following two benchmark schemes for the tUAVs' movement decisions:
• Traveling Salesman Problem (TSP): each tUAV recharges a group of three rUAVs periodically and in order. The groups and orders are selected so that the traveling times of the tUAVs are minimized; we solve the TSP using an iterative approach to find the best two groups to be served by the two tUAVs.
• Low Battery First (LBF): each tUAV moves toward and recharges the rUAV with the lowest remaining battery level.
To compare the performance of PPO and the above baseline schemes, and also to show the effect of WPT recharging, we counted the number of times an rUAV battery reaches the minimum threshold and the rUAV is replaced with a fully charged rUAV after a few seconds. An rUAV replacement can result in a service interruption to the nodes served by the respective rUAV (e.g., in the Hotspot service scenario); as such, replacements should be minimized. Additionally, we calculated the average flying time of all rUAVs. We assumed that the WPT recharging in our scenario is not sufficient to keep all rUAVs in service indefinitely, because the total recharging power is less than the consumed power. A period of 10 h was simulated to study the impact of the recharging.
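At this scale (six rUAVs split between two tUAVs), the TSP baseline's grouping step can be solved by brute force. The sketch below is our own illustration of that idea; the helper names are not from the paper.

```python
import math
from itertools import combinations, permutations

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def tour_length(points):
    """Shortest closed tour over a tiny point set (brute force)."""
    start, *rest = points
    best = float("inf")
    for order in permutations(rest):
        tour = [start, *order, start]
        best = min(best, sum(dist(p, q) for p, q in zip(tour, tour[1:])))
    return best

def best_split(ruav_cells):
    """Partition six rUAV cells into two groups of three that minimize
    the summed closed-tour lengths of the two tUAVs."""
    best_cost, best_groups = float("inf"), None
    for g1 in combinations(range(6), 3):
        g2 = tuple(i for i in range(6) if i not in g1)
        cost = sum(tour_length([ruav_cells[i] for i in g]) for g in (g1, g2))
        if cost < best_cost:
            best_cost, best_groups = cost, (g1, g2)
    return best_groups, best_cost

# Two well-separated clusters of cells: the best split separates them.
cells = [(0, 0), (1, 0), (0, 1), (90, 90), (91, 90), (90, 91)]
groups, cost = best_split(cells)
```

With only C(6,3) = 20 candidate partitions and 2-point tour permutations per group, exhaustive search is instant here; larger fleets would need the iterative TSP heuristics the text mentions.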

Results
In this section, we present the simulation results based on the above system model and algorithm.
First, we ran simulations with and without the WPT capability to assess the viability of our proposed model, evaluating both systems based on the number of times the rUAVs need to be replaced due to battery depletion. As illustrated in Figure 3, using the MIMO-based WPT-enabled tUAVs significantly improves the system performance by reducing the number of rUAV replacements from 108 (when no WPT recharging is used) to fewer than 20 during a 10 h simulation period. In the same simulation, we also evaluated the performance of the proposed algorithm against the benchmark schemes with WPT-enabled tUAVs. As can be seen from Figure 3, PPO outperforms the benchmark schemes: over the 10 h duration, the number of rUAV replacements is only 11, in comparison with 14 and 17 for the LBF and TSP schemes, respectively. We assumed 120 s time steps for the tUAVs to make new decisions in this simulation. Second, since a tUAV's recharging ability is not used while it is traveling, increasing the time step may improve the results for the benchmarks. For this purpose, we simulated the scenario with different time step durations to compare the models' performances. The result is plotted in Figure 4. As can be seen, the PPO's performance can also be improved with longer time steps of about 150-200 s. It is further observed that TSP can be as good as PPO for some time steps; however, the figure shows that PPO's superiority is maintained for all time step values, even though the gap narrows at 150-200 s. The best overall performance is recorded by PPO for a time step of 150 s, with only 10 replacements over 10 h. Furthermore, we recorded the flight duration of all rUAVs in our simulation. Figure 5 shows their average flying times, using the best time step for each scheme based on Figure 4.
As is shown, while the average flight duration is only 33 min without the WPT recharging mechanism, the proposed PPO-based WPT increases the rUAVs' flight duration up to 390 min for the studied scenario. Moreover, the low-complexity schemes of LBF and TSP achieve an approximately 240 min flying duration, which is still significant. This result demonstrates the merit of our proposed MIMO-WPT-based UAV recharging architecture irrespective of the specific movement strategy of the tUAVs; nonetheless, PPO achieved a notable gain of about 60% over the benchmark movement schemes. Additionally, the 95% confidence intervals shown in Figure 5 indicate the fairness of energy distribution among the rUAVs; among the three WPT schemes, TSP distributes energy most equally. Finally, we present the average flight duration of the rUAVs in Figure 6 for different numbers of antennas, where we set m_r = m_t. Clearly, increasing the number of antenna elements increases the beamforming gain and improves the WPT efficiency. However, this is limited by the total transmit power and by the WPT (RF-RF) efficiency, which cannot exceed 100%. Note that we assumed a maximum of 90% RF-RF efficiency to account for non-ideal implementation factors such as mutual impedance between antenna elements in the array [43].

Conclusions and Future Work
We studied the concept of using dedicated flying chargers equipped with MIMO antennas for in situ recharging of UAV batteries using wireless power transfer. We formulated the movement decisions of the aerial chargers in recharging the UAVs as a multi-agent optimization problem, using Proximal Policy Optimization (PPO) to optimize the energy transfer gain and enhance the UAVs' flying times. Using simulation studies, we demonstrated that MIMO-WPT provided a tenfold increase in flight time for the deployed system compared to no wireless recharging of the UAVs. The maximum gain was achieved when PPO was employed to place and move the wireless energy sources intelligently. Although we extracted the simulation parameters and assumptions from practical works, implementation challenges may affect the gain of MIMO-WPT UAV recharging.
Although we simulated a scenario of Hotspot UAVs that hover above fixed locations, the approach generalizes to all applications in which the power-receiving UAVs hover above certain locations. Future work could consider scenarios with mobility and dynamic positioning of the power-receiver UAVs, and the use of hybrid power sources at the flying chargers. Additionally, CSI measurement at the tUAVs is not a challenge for the rotor-based rUAVs assumed in this work; for winged rUAVs, however, this is not the case, since they cannot hover in place. Thus, in practical implementations of our system with winged rUAVs, CSI acquisition and dynamic beamforming will be challenging. Future work could investigate fast beam switching under changing CSI using ML or codebook-based beamforming.