Cache-Enabled Data Rate Maximization for Solar-Powered UAV Communication Systems

Currently, deploying fixed terrestrial infrastructure is not cost-effective in temporary circumstances, such as natural disasters, hotspots, and so on. Thus, we consider a system of caching-based UAV-assisted communications between multiple ground users (GUs) and a local station (LS). Specifically, a UAV is exploited to cache data from the LS and then serve the GUs' requests, handling the issue of unavailable or damaged links from the LS to the GUs. The UAV can harvest solar energy for its operation. We investigate joint cache scheduling and power allocation schemes using the non-orthogonal multiple access (NOMA) technique to maximize the long-term downlink rate. Two scenarios for the network are taken into account. In the first, the harvested energy distribution of the UAV is assumed to be known, and we propose a partially observable Markov decision process framework such that the UAV can allocate optimal transmission power to each GU based on proper content caching over each flight period. In the second scenario, where the UAV does not know the environment's dynamics in advance, an actor-critic-based scheme is proposed to learn a solution in a dynamic environment. Finally, simulation results verify the effectiveness of the proposed methods compared to baseline approaches.


Introduction
Lately, wireless communication has been evolving not only toward high throughput, but also toward ultra-reliability, efficient energy consumption, and support for highly diversified applications with heterogeneous quality of service (QoS) requirements [1]. To this end, extensive research efforts have mainly been devoted to fixed terrestrial infrastructure such as ground base stations (BSs), access points, and relays, whose capability to cost-effectively meet the ever-increasing, multifarious traffic demand is generally restricted. To address this problem, there is growing interest in providing wireless connectivity from the sky via various airborne platforms, such as unmanned aerial vehicles (UAVs) [2], balloons [3], and helikites [4]. UAVs are highly diverse and can be classified by weight, altitude and range, wing and rotor configuration, and application [5]. By leveraging low-altitude UAVs (i.e., less than about one kilometer above the ground [5]), a wireless communication system can achieve swift deployment and high mobility [2]. UAVs can be used as flying base stations to handle short-term, erratic traffic demand in hotspots, such as concerts and sports events, or for data offloading to mitigate congestion [6,7]. In other words, the UAV can provide additional aid either as a stand-alone aerial BS [4,8], or as part of a heterogeneous network in a multi-tier

Motivations and Contributions
To the best of our knowledge, most existing works on EH-powered wireless communication systems assume that information about the energy harvesting arrival is known. However, such information is not always available when designing solutions for real wireless UAV communication problems. Moreover, the propulsion energy consumption models of UAVs are quite sophisticated and critically depend on many factors, such as the UAV's trajectory, velocity, and acceleration [29,34,35]. This can increase the complexity of developing practical schemes. On the other hand, schemes that rely only on harvested energy for both wireless communications and flight operations might not optimize EH-powered UAV communication performance if the UAV carries a high-energy-consumption aerial base station or if the harvested energy arrival rate is small [34][35][36][37]. Considering all these issues, devising a method to optimize the service performance of an EH-powered UAV serving multiple ground users is still a very challenging task, especially in unexpected circumstances such as temporary disaster areas or complex terrains.
Motivated by the above analysis, in this paper, we propose two joint caching and power allocation schemes for solar-powered, UAV-enabled NOMA communication systems under two scenarios. In the first scenario, the system has prior knowledge of the harvested energy distribution of the UAV. In the second scenario, we consider the case in which the system does not know the harvested energy distribution of the UAV. The GUs request a number of data items stored in the local station. Nevertheless, there are no direct links available between the local station and the GUs due to unexpected or emergency circumstances such as natural disasters, obstacles, and long transmission distances. The deployment of terrestrial infrastructure can be infeasible and challenging owing to complicated environments as well as high operational costs. Thus, the UAV is employed to cache part of the content from the local station and deliver data to the GUs. In this work, the UAV can harvest solar energy from the ambient environment. However, the solar panel equipped on the UAV cannot by itself sustain long-term operation, because of the UAV's large mass and its high mobility and communication energy consumption. To address this problem, the battery is fully recharged at the local station (LS) by grid power whenever the UAV returns to the station.
There are two portions in the battery: the mobility capacity, used for flight operation, and the transmission capacity, used for data transmission. The mobility capacity, which holds the flight energy, occupies a large portion of the battery. Therefore, the remaining space reserved for data transmission (i.e., the transmission capacity) is significantly limited. The initial amount of transmission energy in the battery is not enough to provide a high data rate to the GUs over the long term. The UAV is assumed to harvest energy continuously during its flight. Hence, during the serving time of each round, the UAV can leverage the harvested solar energy to transmit data to the GUs. The mobility energy is assumed to be sufficiently preserved in each round; thus, the harvested energy is prioritized for data transmission during the serving time. This means the harvested energy replenishes the transmission capacity before it charges the mobility capacity during the serving time. Besides, the battery is always recharged by the harvested energy during the non-serving time, which reduces the grid power consumption and the additional charging time required when the UAV is at the LS. In other words, the harvested energy is stored in the on-board battery, which can be used not only for providing data transmission services to the GUs during the serving time (i.e., the duration in which the UAV flies around the circular trajectory), but also for recharging the battery for flight operation during the non-serving time (the time when the UAV approaches the LS and the time when the UAV travels to the serving area). Therefore, it is worthwhile to apply solar harvesting to the UAV-based communication system.
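The charging priority described above can be sketched as a simple per-slot energy-accounting routine. This is an illustrative interpretation of the text, not the paper's formulation; all names and the unit-free quantities are assumptions.

```python
def replenish(e_tx, e_mob, e_harvested, e_tx_cap, e_mob_cap, serving):
    """Distribute one slot's harvested energy between the two battery portions.

    During the serving time, the transmission-capacity portion is topped up
    first; any overflow goes to the mobility portion. During the non-serving
    time, all harvested energy recharges the mobility portion. Returns the
    updated (transmission, mobility) energy levels, both clipped to capacity.
    """
    if serving:
        room_tx = e_tx_cap - e_tx
        to_tx = min(e_harvested, room_tx)       # transmission capacity has priority
        e_tx += to_tx
        e_mob = min(e_mob + (e_harvested - to_tx), e_mob_cap)  # overflow to mobility
    else:
        e_mob = min(e_mob + e_harvested, e_mob_cap)
    return e_tx, e_mob
```

For instance, during the serving time, an 8-unit harvest with only 5 units of free transmission capacity fills the transmission portion and routes the remaining 3 units to the mobility portion.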
Instead of conventional orthogonal multiple access (OMA) (e.g., TDMA, FDMA, CDMA), which has low spectral efficiency, the NOMA technique is applied to enhance the data rate of the system, allowing the UAV to transmit data to the GUs simultaneously. In this paper, the UAV's operation consists of three phases: (1) performing the caching update process and then approaching the serving area, (2) flying along the circular trajectory while performing the communication process, and (3) returning to the LS to re-cache the files and recharge the battery, as shown in Figure 1a. The caching update process is implemented at the local station, where the UAV pre-caches part of the content and replenishes its battery for the next round. Then, it approaches its serving area and starts flying along the predefined circular trajectory where the GUs can be served. Next, the UAV executes the communication process, transmitting data according to the GUs' content requests while following the predefined circular trajectory. After finishing a circular trajectory flight period, the communication process is temporarily terminated, and the UAV returns to the LS for content re-caching and battery recharging. These processes repeat until the GUs' requests are satisfied. Using solar harvesting helps relieve the burden of grid-power-based energy consumption. Furthermore, finding a proper solution for the solar-powered UAV to provide energy-efficient communication remains a challenging task given the limits of current energy harvesting technology. Addressing it makes the solar-powered UAV system more applicable to real wireless scenarios. In a nutshell, the main contributions can be summarized as follows.

• Firstly, we study a model of a cache-enabled downlink UAV communication network. Ground users request data items stored in the library of a local station, but direct links are not available. Thus, the solar-powered UAV is employed to cache content from the local station and then approach the distant users to execute data transmissions using NOMA technology. However, the UAV has both limited battery capacity and limited cache capacity. Therefore, we aim to efficiently allocate the harvested energy to the GUs for long-term operation.
• Secondly, we formulate the sum data rate maximization problem in the framework of a partially observable Markov decision process (POMDP). A value iteration-based dynamic programming approach is proposed to obtain the optimal policy for the UAV in order to maximize the system data rate under the assumption that the UAV has prior environment information. With this method, the UAV can efficiently cache content from the local station at the beginning of each flight period and allocate an appropriate portion of transmission power to the GUs in every time slot under energy and cache constraints.
• Thirdly, we present another approach using an actor-critic-based reinforcement learning algorithm to deal with the scenario in which the UAV does not have information on the environment dynamics in advance. With the actor-critic-based method, the UAV interacts with the environment and gradually learns the optimal policy over time through trial and error, without prior environment knowledge.
• Lastly, extensive numerical results are provided to validate the proposed algorithms' performance across various network parameters. We show that, with joint caching and power allocation, the two proposed schemes are superior to benchmark schemes in which the UAV greedily utilizes transmission power without long-term considerations.

The remainder of this paper is organized as follows. The model for the EH-powered UAV downlink communication system is presented in Section 2. Next, we describe the proposed POMDP-based joint cache scheduling and power allocation scheme in Section 3, and the proposed actor-critic-based learning framework is presented in Section 4. The discussions on the simulation results are elaborated in Section 5. Finally, we conclude this work in Section 6.

System Model
We consider a caching-based UAV-enabled downlink wireless transmission system adopting non-orthogonal multiple access and content caching technologies, where a UAV, F, is employed as a mobile base station to serve a group of I ground users, denoted by I = {1, 2, ..., I}. We assume the GUs have no direct links to the local station (LS), where all content that the GUs request is stored. This network scenario is a practical instance of suburban environments, where the deployment of communication infrastructure is still restricted, or of urban environments, where the infrastructure may be damaged by natural disasters. Thus, remote users may not get services from the local station. For that reason, the UAV is dispatched to obtain cached content from the LS, and it then flies along a predefined trajectory to transmit the requested data to the GUs. In the existing works, given the user distribution in the network, many effective methods have been proposed to optimize the UAV's placement and trajectory [19,20]. Besides, an approach to maximize the coverage area of UAVs was well studied in [8]. Therefore, in this paper, we do not optimize the flight trajectory or coverage region of the UAV. Instead, we aim to maximize the long-term throughput based on a predefined trajectory. For that reason, we assume that the circular trajectory of the UAV's flight is known based on the locations of the GUs, and that the coverage region of F is large enough to guarantee the connection to all GUs when following the predefined circular trajectory with a reasonable radius and altitude. This means that the GUs remain within the UAV's coverage during the UAV's circular flight, so that the GUs always receive data delivery from F. It is noteworthy that the system can still be applied to other UAV flight trajectories.
Our main goal is to allocate the appropriate transmission power to the GUs and schedule data caching of the UAV under a predefined trajectory to obtain the maximal long-term data rate of the system.
Each data transmission is executed in every time slot t, while each caching action is executed at the beginning of a flight period, which is defined as a round in which the UAV flies to the serving area, flies along its predefined circular trajectory, and returns to the LS. However, due to its limited cache capacity, the UAV can only periodically cache part of the content from the LS at the beginning of every flight period. The GUs are assumed to have a fixed power supply, whereas the UAV has a limited-capacity battery. Hence, UAV F is equipped with an energy harvester to scavenge solar energy from the ambient environment to replenish its battery. We assume the UAV operates in an ideal environment without disturbances (e.g., wind). Suppose that the UAV continuously flies at a constant velocity, v_F, in a circular trajectory with radius r_F, at altitude h_F, and that the UAV's position repeats every T_F seconds. Thus, the flight period of the circular trajectory is T_F = 2πr_F / v_F, and the number of time slots discretized in each circular trajectory period is N_F = T_F / T, where T is the time slot duration. Note that the UAV's location is assumed to be unchanged during each time slot when T is chosen sufficiently small [35].
Without loss of generality, we consider three-dimensional (3D) Cartesian coordinates (x, y, z), where (x, y, 0) represents the ground plane and z is the altitude. The location of GU i is denoted as p_i = (x_i, y_i, 0), i ∈ I. In fact, when disasters occur, the network infrastructure may be corrupted. However, the GUs can still easily determine their positions thanks to the GPS receivers integrated into most current mobile devices. Thus, the GUs can report their locations to the UAV so that the UAV can calculate the flight trajectory to serve the GUs' requests. For devices without GPS, the UAV can still estimate the GUs' locations based on the received signal strength indicator (RSSI), which is well studied in the literature [38,39]. Furthermore, when the locations of the users are known, an approach to determining the flight trajectory of the UAV was proposed in [20]. In this paper, we do not focus on obtaining the GUs' locations or the UAV's trajectory. Instead, we mainly focus on power allocation with data caching at the UAV to maximize the long-term data rate of the system. Therefore, it is assumed that the GUs' locations and the UAV's trajectory are known in advance. Herein, we establish the formulation for the circular trajectory of the UAV in the serving area, which is defined as the region where the GUs are located. The 3D setup of the considered network, consisting of the LS, the UAV, and multiple GUs, is illustrated in Figure 1a. Point O', located at p_O' = (0, 0, h_F), is the center of the circular trajectory with radius r_F along which F flies. Let ω denote the angle of F's location on the circle with respect to the x-axis. The location of F at time slot t can be determined as p_F(t) = (x_F(t), y_F(t), z_F(t)) = (r_F cos ω(t), r_F sin ω(t), h_F). The time frame structure of the system is illustrated in Figure 1b.
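The circular-trajectory position model above can be sketched in a few lines. Here the angle is assumed to advance by 2π/N_F per time slot (constant speed, N_F slots per period), and the initial angle omega0 is an assumed parameter not specified in the text.

```python
import math

def uav_position(t, r_F, h_F, N_F, omega0=0.0):
    """Return p_F(t) = (r_F*cos(w), r_F*sin(w), h_F) on the circular trajectory.

    t: time slot index; r_F, h_F: trajectory radius and altitude;
    N_F: number of time slots per circular trajectory period.
    """
    omega = (omega0 + 2.0 * math.pi * t / N_F) % (2.0 * math.pi)
    return (r_F * math.cos(omega), r_F * math.sin(omega), h_F)
```

By construction, the position repeats every N_F slots, matching the statement that the UAV's position repeats every T_F seconds.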
The time frame is divided into four phases: GUs' requests (t re ), UAV's decision (t de ), data transmission (t tr ), and information update (t up ). At the start of a time slot, the GUs will send data item requests to F. Then, a decision will be determined at F by allocating the transmission power to the GUs based on the current state of the system. Subsequently, data transmission will be conducted according to the assigned power portions for the GUs in the data transmission phase. Finally, the system will update its state at the end of the time slot.

Channel and Transmission Models
According to the above network setup, the time-dependent distance between F and GU i can be calculated as:

d_i(t) = ||p_F(t) − p_i||, (1)

where ||·|| denotes the Euclidean norm. In practice, the air-to-ground wireless channels from the UAV to the GUs are normally dominated by LoS links, whose quality depends only on the communication distance [40]. Moreover, UAV-assisted information dissemination is more necessary in rural regions than in urban regions [2]. In rural regions, building density is very low, and thus, the probability of non-line-of-sight links is also low. Therefore, in this paper, the wireless channels from F to the GUs are assumed to follow a free-space path loss model. As a consequence, the channel power gain from F to GU i at time slot t can be expressed as [41]:

h_i(t) = β_0 d_i^{−α}(t), (2)

where β_0 represents the channel power gain at the reference distance d_0 = 1 m, which depends on the carrier frequency, antenna gain, etc., and α is the path loss exponent. Suppose that F has access to the flight control information and the locations of the GUs for power allocation. Besides, it is worth noting that the channel gains between F and the GUs vary over period T_F due to the movement of F. Given the location of F at time slot t, the channels of the GUs are sorted at F to apply NOMA. Typically, a NOMA scheme enables a base station to serve multiple users simultaneously over the same frequency band. The power portions are assigned inversely to the users' channel conditions: a user with low channel gain is allocated higher transmission power, and vice versa. We assume the GUs' channel gains in time slot t are sorted in ascending order, i.e., h_o(1)(t) ≤ h_o(2)(t) ≤ ... ≤ h_o(I)(t), where o(n) denotes the index of the GU with the n-th smallest channel gain.
According to the downlink NOMA principle, UAV F will transmit a combined signal, s_F(t), to all GUs with the assigned power portions in time slot t. Specifically, given the content requests of the GUs in time slot t, the signal transmitted by UAV F can be written as:

s_F(t) = Σ_{i∈I} √(λ_i(t) P_F(t)) s_i(t), (3)

where s_i(t) is the normalized information signal for GU i in time slot t with E[|s_i|²] = 1; P_F(t) = e_tr(t)/t_tr represents the total transmission power that F uses to transmit data to the GUs, in which e_tr(t) is the amount of transmission energy used by F in the time slot; and λ_i(t) denotes the power portion assigned to GU i. The received signal at GU i in time slot t can be given by:

y_i(t) = √(h_i(t)) s_F(t) + n_i(t), (4)

where n_i(t) is the additive white Gaussian noise at GU i with power σ². The GU with the largest assigned power portion (i.e., the weakest-channel user, GU o(1)) treats all signals of the other GUs as interference and directly decodes its own information without using SIC. Nevertheless, the other GUs need to employ the SIC process, in which they first decode the signals that are stronger (i.e., those of the GUs with higher assigned portions) than their own desired signals. Those signals are then subtracted from the received signal, and this process continues until each GU's own signal is decoded. In other words, each GU decodes its own information by treating the other GUs' signals (those with smaller power portions) as interference.
As explained above, assume that all the signals of GU o(l), for l < n, have been perfectly decoded by GU o(n). Thus, the signal-to-interference-plus-noise ratio (SINR) at GU o(n) for decoding its own information is given as:

γ_o(n)(t) = λ_o(n)(t) P_F(t) h_o(n)(t) / ( h_o(n)(t) P_F(t) Σ_{l=n+1}^{I} λ_o(l)(t) + σ² ). (5)

Consequently, the achievable rate at GU o(n) in (b/s/Hz) to decode its own information in time slot t can be calculated as:

R_o(n)(t) = log_2(1 + γ_o(n)(t)). (6)

Additionally, the SINR at GU o(n') to decode the information of GU o(n), for n < n', can be expressed as:

γ_{o(n')→o(n)}(t) = λ_o(n)(t) P_F(t) h_o(n')(t) / ( h_o(n')(t) P_F(t) Σ_{l=n+1}^{I} λ_o(l)(t) + σ² ). (7)

Similarly, the achievable rate at GU o(n') in (b/s/Hz) to decode the information of GU o(n), for n < n', in time slot t can be calculated as:

R_{o(n')→o(n)}(t) = log_2(1 + γ_{o(n')→o(n)}(t)). (8)

Finally, the sum rate of the system in time slot t can be expressed as follows:

R_sum(t) = Σ_{i∈I} R_GU_i(t), (9)

where R_GU_i(t) represents the achievable rate at GU i in time slot t, with o(n) = i ∈ I. More specifically, for a better understanding, let us take an example with I = 2: if h_1(t) > h_2(t), then λ_1(t) < λ_2(t) and o(t) = [2, 1]. At GU 1, using SIC, it first decodes s_2(t) and then cancels it out from (4) to decode its own signal, s_1(t). Meanwhile, at GU 2, s_2(t) is directly decoded without performing SIC. As a result, the achievable data rates at GU 1 and GU 2 can be respectively calculated by:

R_GU_1(t) = log_2( 1 + λ_1(t) P_F(t) h_1(t) / σ² ), (10)

and:

R_GU_2(t) = log_2( 1 + λ_2(t) P_F(t) h_2(t) / ( λ_1(t) P_F(t) h_2(t) + σ² ) ). (11)

Eventually, the sum rate of the system in time slot t can be given as follows:

R_sum(t) = R_GU_1(t) + R_GU_2(t). (12)

Data Request Behavior of the Ground Users
In this paper, library K in the LS contains K different finite data items for the requests of the GUs. Data items are essentially an abstraction of application data, which might range from database records and web pages to FTP files. We consider the content requests of the GUs to be independent of each other. Let us assume that the probability that a GU requests the same data item in two consecutive time slots is fairly high, while the probability of switching to another data item is smaller. This is realistic, since users tend to frequently access the same data source of interest for a long duration. Thus, we model the request of each GU as a discrete-time Markov chain, whose state transition probabilities for GU i over two adjacent time slots are illustrated in Figure 2a. P_mm,i and P_mm',i (with m' ∈ K \ {m}) represent the probabilities that GU i requests the same data item, m, or another data item, m', respectively, in two adjacent time slots; switching to any particular different item is assumed to be equally likely. It is assumed that if the request of GU i in time slot t is item m, then the probability that GU i requests item m' ≠ m in time slot t + 1 can be computed as:

P_mm',i = (1 − P_mm,i) / (K − 1),

where K is the total number of data items in library K. It is worth noting that when GU i requests an item that is not among the cached data items in the UAV, it cannot receive that requested data from the UAV, and thus, no transmission power is allocated to GU i in this time slot, i.e., λ_i(t) = 0.
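One step of the request Markov chain above can be simulated as follows: repeat the current item with the "stay" probability, otherwise pick one of the other K − 1 items uniformly. The parameter p_same (playing the role of P_mm,i) is an assumed input.

```python
import random

def next_request(current_item, K, p_same, rng=random):
    """Sample GU i's next requested item (items labelled 1..K).

    With probability p_same the GU repeats its current item; otherwise each
    of the other K-1 items is chosen with probability (1 - p_same)/(K - 1).
    """
    if rng.random() < p_same:
        return current_item
    others = [m for m in range(1, K + 1) if m != current_item]
    return rng.choice(others)
```

Note that each row of the transition matrix sums to one by construction: p_same + (K − 1) · (1 − p_same)/(K − 1) = 1.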

Content Caching Model of UAV
This paper adopts a traditional caching technique at UAV F for serving the requests of the GUs in the network. Since the number of data items that can be cached by F is restricted by the caching capacity, C_F, the UAV needs to cache new data items from K after each flight period j to replace the old cached items. With periodical caching, performance can be enhanced according to the GUs' requests. In this paper, the non-serving time, which includes the duration for the UAV to cache the items, approach the serving area, and return to the LS, is approximately constant and does not affect the data rate maximization during the serving time. Therefore, the non-serving time is ignored in this paper, and the term flight period henceforth refers to the circular trajectory period of the UAV. Let c_j = [c_j,1, c_j,2, ..., c_j,C_F] denote the cache content vector of UAV F in period j. Based on the data request behavior of the GUs, the cache content vector c_j is divided into two parts, the request-based cache vector, c_req_j, and the random cache vector, c_ran_j, and can be expressed as c_j = [c_req_j, c_ran_j]. The former consists of the items cached based on the latest requests of the GUs, while the latter is determined by randomly caching items from the library, excluding the items in the request-based cache. In particular, at the start of a new flight period j, F caches the data items most recently requested by the GUs (i.e., the items requested in the last time slot of the previous period, j − 1), and the rest of the space in c_j is filled by randomly selecting other items from library K in the LS, such that each item cached in c_j is unique in the current period.
The reason for this caching model is that the probability that GU i requests the same item in two adjacent time slots is assumed to be much greater than the probability of requesting a different item, i.e., P_mm,i ≫ P_mm',i, as presented in the previous subsection. We use q(t) = [q_1(t), q_2(t), ..., q_I(t)] to denote the item request vector of the GUs, where q_i(t) ∈ {1, 2, ..., K} represents the item request of GU i at the start of time slot t; meanwhile, N_F denotes the total number of time slots in each circular trajectory period. If the GUs request data items different from each other in the last time slot of period j, i.e., q_i(jN_F) ≠ q_i'(jN_F) for i ≠ i', the request-based cache vector, c_req_{j+1}, and the random cache vector, c_ran_{j+1}, for the next period, j + 1, can be respectively determined as follows:

c_req_{j+1} = [q_1(jN_F), q_2(jN_F), ..., q_I(jN_F)],

and:

c_ran_{j+1} = [c_ran_{j+1,1}, ..., c_ran_{j+1,C_F−I}], with each c_ran_{j+1,i} randomly selected from K \ c_req_{j+1},

where c_req_{j+1,i} ∈ {1, 2, ..., K} is the i-th cached item of c_req_{j+1}. It is worth noting that if there are identical requested items among the GUs' requests in the last time slot of period j, then UAV F only caches each such item once in c_req_{j+1} for use in the next period, j + 1, to save cache space in c_{j+1}. An example of the caching process of UAV F is illustrated in Figure 2b with N_F = 30, C_F = 5, I = 3, and K = 10. In time slot t = 30, which belongs to period j = 1, the requests of the GUs are q_1(30) = 5, q_2(30) = 8, and q_3(30) = 3; then, c_req_2 = [5, 8, 3], and the remaining two entries of c_2 are filled with items randomly selected from K \ {5, 8, 3}.
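The periodic cache update described above can be sketched as follows: deduplicate the last-slot requests (the request-based part), then fill the remaining cache slots with distinct items drawn at random from the rest of the library. Items are labelled 1..K for illustration; the function names are assumptions.

```python
import random

def update_cache(last_requests, K, C_F, rng=random):
    """Build the next period's cache vector c_{j+1} of length C_F.

    last_requests: items requested by the GUs in the last slot of period j.
    Duplicate requests are cached only once; the remaining slots are filled
    with distinct random items from the library excluding the request-based
    part, so every cached item is unique.
    """
    c_req = list(dict.fromkeys(last_requests))          # dedupe, keep order
    pool = [m for m in range(1, K + 1) if m not in c_req]
    c_ran = rng.sample(pool, C_F - len(c_req))          # distinct random fill
    return c_req + c_ran
```

Running it with the text's example (requests 5, 8, 3, C_F = 5, K = 10) yields a cache whose first three entries are [5, 8, 3] plus two distinct random items from the remaining library.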

Energy Harvesting Model of the UAV
In this paper, UAV F is assumed to have a limited-capacity battery, E_Bat, and it is equipped with an energy harvesting circuit to harvest solar energy for its operation. UAV F can simultaneously harvest solar energy and perform other operations such as forward movement, climbing up and down, and data transmission. In this work, we aim at efficiently using the harvested solar energy in the UAV in order to allocate proper transmission power to the GUs during the serving duration. Since the amount of flight energy consumed for a round trip of the UAV can be approximately estimated, for simplicity, the energy portion for the mobility of the UAV is not shown in the formulation. Thus, we only consider the battery capacity portion required for data transmission (i.e., the transmission capacity), which is also denoted as E_Bat for our simplified formulation. If E_Bat is full during the serving time (i.e., the maximum value of the transmission capacity portion is reached), the rest of the harvested energy is stored in the mobility capacity portion used for the UAV's flight. Herein, the amount of energy harvested by F in time slot t, denoted as e_h(t), takes a value in {0, 1, ..., ξ} (in energy units) and is assumed to follow a Poisson distribution [42]. The authors in [42] carried out empirical measurements for the modeling of a solar-powered wireless sensor node in time-slotted operation and showed that the stored energy characteristics depend on many factors such as the time slot duration, light intensity, power level, and the deployment environment. As a result, the Poisson distribution model provided a close fit to the collected measurements. The probability distribution of the energy harvested by F can be given by:

Pr(e_h(t) = k) = (E_h,avg)^k e^{−E_h,avg} / k!, k = 0, 1, ...,

where E_h,avg represents the mean energy harvested by F.
For tractability in the simulation, the amount of harvested energy can be approximated, and the maximum harvested energy can be determined according to network parameters such that the cumulative distribution function is close enough to one.
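The truncation described above can be implemented by extending the Poisson support until the cumulative probability is close enough to one, then renormalizing. The threshold cdf_target is an assumed parameter; the text does not specify its value.

```python
import math

def truncated_poisson_pmf(e_avg, cdf_target=0.9999):
    """PMF of the harvested energy units per slot, e = 0, 1, ..., xi.

    xi is chosen as the smallest value whose cumulative Poisson probability
    reaches cdf_target; the truncated PMF is renormalized to sum to one.
    e_avg corresponds to the mean harvested energy E_h,avg.
    """
    pmf, cdf, k = [], 0.0, 0
    while cdf < cdf_target:
        p = math.exp(-e_avg) * e_avg ** k / math.factorial(k)
        pmf.append(p)
        cdf += p
        k += 1
    return [p / cdf for p in pmf]  # renormalize over the truncated support
```

The returned list can then be used directly as the finite harvested-energy distribution in a slot-by-slot simulation.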

Sum Rate Maximization Formulation
In this paper, we aim to optimize the transmission power allocated to the GUs and the content caching by UAV F such that the cumulative sum data rate of the ground users is maximized over long-term operation. Thus, the problem formulation can be expressed as follows:

max_{c_j, {λ_i(t)}, P_F(t)}  Σ_t R_sum(t)
s.t. (a) Σ_{i∈I} λ_i(t) = 1,
     (b) 0 ≤ P_F(t) ≤ P_F^max,
     (c) c_j,i ≠ c_j,i', ∀ i ≠ i', (17)

where c_j is the cache content vector of UAV F in flight period j, and P_F^max represents the upper bound of the transmission power that F can use to transmit data to the GUs. Constraint (a) specifies that the UAV assigns its total transmission power, P_F(t), to the GUs that request items from the UAV's cache in time slot t. Constraint (b) guarantees that the total transmission power for the GUs in each time slot is no greater than the maximum transmission power that the UAV can use without becoming inactive owing to an energy shortage. Finally, Constraint (c) ensures that every cached item is unique in the cache content vector for period j, where c_j,i represents the i-th item of cache content vector c_j.
It is worth noting that although maximizing the energy utilization in the current time slot can optimize the temporal data rate of the system, it may cause inactivity upon data transmission in the subsequent time slots due to an energy shortage in F. Consequently, it can significantly degrade the long-term sum rate of the network. Furthermore, dynamic data requests of the GUs will also affect the performance of the system, since the caching constraint on F is taken into account. Therefore, according to the system state, finding an optimal policy for joint cache scheduling and power allocation in F to obtain the maximum long-term sum rate of the system is the main goal of this study.

Proposed Solution Using the POMDP Framework
In this section, we propose a joint optimal cache scheduling and power allocation scheme using a POMDP framework for F over the long run, based on prior information for the harvested energy distribution and the request model for the GUs. To be more specific, after receiving the requests by the GUs, F will allocate the optimal transmission power for each GU in order to obtain the maximized long-term sum data rate for the system. The problem of sum data rate maximization is first formulated as the framework of a partially observable Markov decision process where the effect of the decision in the current time slot on the subsequent time slots is taken into account [43]. Subsequently, the optimal policy can be obtained by adopting the approach of value iteration-based dynamic programming [44].

Markov Decision Process
The Markov decision process (MDP) is generally defined as a tuple ⟨S, A, P, ϕ⟩, where S, A, and P are the state space, action space, and state transition probability space, respectively, and ϕ : S × A → R represents the reward function. We define the system state as s(t) = [e_rm(t), ω(t), θ(t), t_in(t), c_j] ∈ S, where e_rm(t) is the remaining energy in F; 0 ≤ ω(t) ≤ 2π is the angle of F's location on the circle with respect to the x-axis; θ(t) = [θ_1(t), θ_2(t), ..., θ_I(t)] is the belief vector, with θ_i(t) as the belief (probability) that the requested content of GU i is in the current cache content vector, c_j, in time slot t; and t_in(t) ∈ {1, 2, ..., N_F} is the index of time slot t within flight period j. Note that c_j is only updated based on the requests of the GUs at the end of time slot t when t_in(t) = N_F in each flight period, whereas s(t) is always updated based on the action selected by F and the amount of harvested energy at the end of each time slot. The set of actions can be denoted as A = {a_1, a_2, ..., a_|A|}, where a_υ = [e_tr,υ, λ_1,υ, λ_2,υ, ..., λ_I,υ], υ ∈ {1, 2, ..., |A|}, is the υ-th action in A; e_tr,υ (0 ≤ e_tr,min ≤ e_tr,υ ≤ e_tr,max) is the transmission energy of UAV F, and 0 ≤ λ_i,υ ≤ 1 is the power portion assigned to GU i. The notations e_tr,min and e_tr,max represent the minimum and maximum transmission energy of the UAV. We further define the reward for the system as the sum data rate of the network. Thus, given state s(t) and action a(t), the corresponding reward, denoted by R(s(t), a(t)), is computed by using Equation (9).
The operation of UAV F can be expressed as follows. At a given time instant t, F executes action a(t) based on the system state and the content requests of the GUs, and the reward of the system, R(s(t), a(t)), is obtained at the end of the time slot. Action a(t) causes the system to transition from state s(t) to a new state, s(t + 1). Thus, the state of the system is updated for the next operation when the data transmission in time slot t is finished.
In this paper, we aim to find the optimal transmission power allocation policy based on the cache scheduling discussed in Section 2.3 for UAV F in each slot t in order to maximize the accumulated reward from the current time slot to the time horizon. In addition, transmission power is determined by the transmission energy e_tr and the data transmission duration t_tr, i.e., P_F(t) = e_tr(t)/t_tr. Therefore, according to the above MDP formulation, Equation (17) can be rewritten as follows: where 0 ≤ β ≤ 1 is the discount factor, which indicates the effect of the current action on future rewards. According to the dynamic item requests of the GUs, an observation is defined as one of the possible cases indicating whether the item requests of the GUs are in the cached items of F in a given time slot, as discussed in the next subsection.
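The objective and the power relation above can be sketched in a few lines. The function names are illustrative; `discounted_return` accumulates ∑_k β^k R_k, and `tx_power` applies P_F(t) = e_tr(t)/t_tr:

```python
def discounted_return(rewards, beta):
    """Cumulative discounted reward G = sum_k beta^k * R_k, the quantity
    maximized by the optimal policy."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (beta ** k) * r
    return g

def tx_power(e_tr, t_tr):
    """Transmission power from transmission energy and data duration."""
    return e_tr / t_tr

assert discounted_return([1.0, 1.0, 1.0], beta=0.5) == 1.75  # 1 + 0.5 + 0.25
assert tx_power(e_tr=2e-6, t_tr=1e-3) == 2e-3                # 2 uJ over 1 ms -> 2 mW
```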

Observation Description
This section introduces the possible observations, the respective rewards, and the ways to update the system state for the next time slot according to the action selected in a given time slot. Let us consider a network with two GUs (I = 2) connecting to UAV F to acquire data according to their requests. In the given state, s(t), the requests of the GUs are q_1(t) and q_2(t), and the UAV executes action a(t). Note that, for all possible observations, the angle of UAV F in the next time slot is updated to ω_next(t), which denotes the next angle of the UAV on its predefined circular flight trajectory. In the following, we present how to update the other information, namely the remaining energy, the belief vector, the transition probability, the time slot index, and the cache content vector, in each observation for this particular circumstance.
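Assuming F advances by one equal angular step per slot along its circular trajectory (this equal-step motion is our assumption; the text only introduces ω_next(t)), the angle update can be sketched as:

```python
import math

def next_angle(omega, N_F):
    """Advance the UAV by one equal angular step per time slot on its
    predefined circular path, wrapping around at 2*pi.
    Equal-step motion over N_F slots per flight period is an assumption."""
    return (omega + 2 * math.pi / N_F) % (2 * math.pi)

w = 0.0
for _ in range(12):          # one full flight period with N_F = 12 slots
    w = next_angle(w, 12)
assert abs(w) < 1e-9          # back at the starting angle after N_F slots
```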

Observation 1 (O 1 )
The requests of both GU_1 and GU_2 are in the cached items in c_j of UAV F. The probability that this event happens can be calculated as: The reward can be obtained as follows: where R_GU_i can be obtained by using Equation (6). The belief vector can be updated as follows: where τ_i = [(1 − P_mm,i)/(K − 1)](C_F − 1), i ∈ {1, 2, ..., I}. Next, the remaining energy in F for the next time slot is: with transition probability: where Pr[E_h(t) = E_h,z] can be calculated as in Equation (16). To explain Equations (22) and (23) in the case of t_in(t) = N_F, the remaining energy (i.e., the energy in the transmission capacity) of F is always full, because UAV F finishes one circular trajectory and returns to the LS to recharge its battery. The index of the next time slot in terms of flight period j can be updated as: Finally, the cache content vector can be updated by: where c_req,j+1 and c_ran,j+1 can be determined with Equations (14) and (15), respectively. It is important to note that the UAV only updates the cache content vector when it is in the last time slot of period j.

Observation 2 (O 2 )
The request of GU_1 is in the cached items in c_j, but that of GU_2 is not in c_j of UAV F. The probability that this event happens can be calculated as: The reward can be obtained as follows: where R_GU_1 can be calculated with Equation (6). The belief vector can be updated as follows: where τ_i is calculated in a way similar to Equation (21). The remaining energy, the transition probability, the index of the time slot, and the cache content vector can be updated with Equations (22)-(25), respectively.

Observation 3 (O 3 )
The request of GU_1 is not in the cached items in c_j, but that of GU_2 is in c_j. The probability that this event occurs can be calculated as: The reward can be obtained as follows: where R_GU_2 can be computed with Equation (6). The belief vector can be updated as follows: where τ_i is calculated as in Equation (21). The remaining energy, the transition probability, the index of the time slot, and the cache content vector can be updated with Equations (22)-(25), respectively.

Observation 4 (O 4 )
The requests of both GU_1 and GU_2 are not in the cached items in c_j of UAV F. The UAV stays silent, and hence, there is no reward in this case, i.e., R(s(t), a(t) | O_4) = 0. The probability that this event occurs can be calculated as: The belief vector can be updated as follows: The remaining energy in F for the next time slot is: with the transition probability being the same as in Equation (23). Similarly, the index of the time slot and the cache content vector can be updated with Equations (24) and (25), respectively.
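The four observations above partition the joint request outcome for the I = 2 case. A minimal sketch of this case split (the function name is ours):

```python
def classify_observation(cache, q1, q2):
    """Map the two GUs' requests to observation O1..O4 for the I = 2 case."""
    hit1, hit2 = q1 in cache, q2 in cache
    if hit1 and hit2:
        return "O1"   # both requests cached: transmit to both GUs via NOMA
    if hit1:
        return "O2"   # only GU_1's request is cached
    if hit2:
        return "O3"   # only GU_2's request is cached
    return "O4"       # neither cached: the UAV stays silent, zero reward

cache = {3, 7, 11}
assert classify_observation(cache, 7, 11) == "O1"
assert classify_observation(cache, 7, 99) == "O2"
assert classify_observation(cache, 99, 11) == "O3"
assert classify_observation(cache, 1, 2) == "O4"
```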

Value Iteration-Based Dynamic Programming Solution
According to the POMDP principle, the value function is defined as the maximum cumulative discounted system reward starting from the current time slot over an infinite time horizon, and it is used to select the optimal action for the UAV. Thus, given a state s(t), the value function can be given as follows: where Pr[O_m] represents the probability that observation O_m occurs; Pr[e_rm(k + 1) | e_rm(k), O_m] is the probability that the remaining energy of the UAV transfers from e_rm(k) to e_rm(k + 1) under the corresponding observation O_m; and R(s(k), a(k)) indicates the reward of the system when it takes action a(k) in state s(k).
The value function in Equation (35) can be obtained by using value iteration-based dynamic programming [44]. Owing to the dynamic item requests of the GUs and the harvested energy, the expected reward of the possible actions in the current time slot is considered in each time slot. Accordingly, the optimal decision of the UAV in time slot t can be obtained as follows: where R_im(s(t), a(t)) is the expected immediate reward of the system under action a(t), which can be obtained with Equation (9). The term ∑ Pr[e_rm(t + 1) | e_rm(t)] V_s(t+1), summed over the possible next states, is the expected future reward of action a(t) in time slot t + 1, where V_s(t+1) can be obtained by solving the problem in Equation (35). With the above setup, the MDP problem in Equation (18) is transformed into Equation (36), and the optimal policy for long-term data rate maximization can be obtained by using the POMDP framework. The flowchart of the proposed POMDP-based approach is given in Figure 3. For further details, the slot-by-slot operation of the system when using this scheme is presented in Algorithm 1.
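A generic value iteration routine matching the structure of Equations (35) and (36) can be sketched as follows. This is an illustrative implementation with our own names; the `transition` callback is assumed to aggregate the observation and energy-arrival probabilities into next-state probabilities, and `reward` returns the expected immediate sum rate:

```python
def value_iteration(states, actions, transition, reward, beta=0.9, tol=1e-6):
    """Tabular value iteration: repeatedly apply the Bellman optimality
    backup V(s) <- max_a [ R(s, a) + beta * sum_{s'} P(s'|s, a) V(s') ]
    until the largest per-sweep change falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                reward(s, a) + beta * sum(p * V[s2] for s2, p in transition(s, a))
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

# Toy self-loop MDP: reward 1 when the action index matches the state index,
# so V(s) = 1 / (1 - beta) = 10 for beta = 0.9.
V = value_iteration(
    states=[0, 1], actions=[0, 1],
    transition=lambda s, a: [(s, 1.0)],
    reward=lambda s, a: 1.0 if a == s else 0.0,
    beta=0.9,
)
assert abs(V[0] - 10.0) < 1e-3
```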

Algorithm 1
Operation of the UAV when using the proposed POMDP-based scheme to obtain the maximum long-term data rate in N time slots.
1: Input: d_FU_i, K, σ², T, T_re, T_de, T_up, C_F, E_Bat, E_h,avg, e_tr,min, e_tr,max, P_mm, P_m'm, h_F, r_F, T_F, α, β_0, β.
2: Output: Optimal action a_opt(t).
3: Define S, A, and P.
4: Apply value iteration-based dynamic programming to obtain the value function for every possible state of S in Equation (35).
5: for t = t_0 // Start from time slot t = t_0
6:     Define the current system state, s(t).

18:         Transmit data to that GU.
19:         else
20:             Transmit data to the GUs by using NOMA.
21:         end if
22:         Obtain the immediate reward for the system.
23:     end if
24: end if
25: Update c_j when t_in(t) = N_F with Equations (14) and (15).
26: Update system state s(t + 1).
27: end for

Proposed Solution Using the Actor-Critic Learning Framework
In the previous section, we elaborated on the POMDP-based solution to the joint cache scheduling and power allocation problem of the UAV-enabled communication system, where prior knowledge of the energy harvesting distribution of the GUs is assumed. Nevertheless, it is hard to identify an evolution model in some network scenarios due to the complexity of network dynamics; hence, acquiring prior information regarding the harvested energy arrivals of a user may be impractical in some circumstances. Moreover, the value iteration technique requires extensive formulation and much computational overhead. For this reason, we formulate and propose a model-free reinforcement learning method (namely, an actor-critic-based scheme) to deal with the MDP problem when there is no prior information on the energy harvesting distribution. Although applying the actor-critic learning approach may lead the system to a locally optimal policy [45], it helps the system learn about the dynamic wireless environment by interacting with it directly to generate a policy, without having information on the essential network models a priori. Hence, this model-free learning approach benefits from less formulation and less computational effort, compared to the POMDP-based algorithm.
Subsequently, we present the classic actor-critic learning-based scheme to obtain the solution to the MDP problem described in the previous section.

Actor-Critic Framework Formulation
Generally, the actor-critic framework is composed of three main components: an actor, a critic, and the environment. The actor is responsible for taking an action according to a policy; meanwhile, the critic evaluates the quality of the action and adjusts the policy through the temporal difference (TD) error [46]. The generalized actor-critic framework is illustrated in Figure 4. The value function for the actor-critic-based framework in this paper is the total discounted reward from the current time slot, and it is modified according to policy Ω during the training phase; it can be obtained as follows [47]: where Pr[s′ | s, Ω(s)] represents the transition probability that the system transfers to state s′ after taking an action based on policy Ω(s) in state s. Similar to the POMDP-based scheme, the actor-critic framework is in charge of determining the optimal policy, Ω*(s), and thus, the problem in Equation (18) can be rewritten as: In time slot t, the UAV selects and then executes an action, a(t), based on the current state, s(t), and the current policy, Ω, which is determined by applying a Gibbs soft-max function [47] as follows:

Ω(a(t) | s(t)) = Pr[a(t) ∈ A | s(t)] = e^Θ(a(t)|s(t)) / ∑_{a∈A} e^Θ(a|s(t)),

where Θ(a(t) | s(t)) is the tendency of the UAV to select action a(t) when the system is in state s(t).
Note that this parameter can be adjusted over time such that the UAV can select the best action for each state when the training phase finishes. After the action is executed, the system transits to a new state, s(t + 1), with transition probability:

Pr[s′ ∈ S | s(t), a(t)] = 1 if s′ = s(t + 1), and 0 otherwise, (40)

and the corresponding immediate reward, R(s(t), a(t)), is obtained as expressed in Equation (9). Applying Equation (40) to Equation (38) implies that the actor-critic-based scheme does not need information on the energy arrival distribution in advance, since it actually explores the next state, s(t + 1), at the end of time slot t after performing action a(t). As a result, at the end of the time slot, the critic component evaluates the quality of the action performed by the UAV by using the TD error. In other words, determining the change in the value function of current state s(t) at the end of each time step helps the UAV gradually find the maximum value function that maps state s(t) to optimal action a_opt(t). Consequently, the TD error in time slot t, defined as the difference between the left and right sides of the Bellman equation [47], is computed as follows: Then, the value function for state s(t) is updated by: where α_c denotes the critic step size. Furthermore, the actor component modifies the policy according to the tendency as: where α_a represents the actor step size. According to Equations (42) and (43), the training stage terminates when convergence occurs, and the convergence rate depends significantly on the values of both α_c and α_a. Therefore, the optimal values of these parameters can be tuned empirically for various applications.
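The Gibbs soft-max policy and the TD updates in Equations (41)-(43) can be sketched as a tabular routine. All names here are our own; in particular, we assume the actor update in Equation (43) adds α_a ∆(t) to the tendency Θ(a(t)|s(t)):

```python
import math

def softmax_policy(theta, s, actions):
    """Gibbs soft-max over the tendencies Theta(a|s), as in the policy above."""
    prefs = [theta[(a, s)] for a in actions]
    m = max(prefs)                       # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic_step(V, theta, s, a, r, s_next, beta=0.9,
                      alpha_c=0.1, alpha_a=0.1):
    """One TD step: the critic adjusts V(s), the actor adjusts Theta(a|s)."""
    td_error = r + beta * V[s_next] - V[s]   # Delta(t), the Bellman residual
    V[s] += alpha_c * td_error               # critic update, Equation (42)
    theta[(a, s)] += alpha_a * td_error      # actor (tendency) update
    return td_error

# One illustrative update on a two-state, two-action table.
V = {0: 0.0, 1: 0.0}
theta = {(a, s): 0.0 for a in (0, 1) for s in (0, 1)}
delta = actor_critic_step(V, theta, s=0, a=1, r=1.0, s_next=1)
assert abs(delta - 1.0) < 1e-12 and abs(V[0] - 0.1) < 1e-12
probs = softmax_policy(theta, 0, [0, 1])
assert abs(sum(probs) - 1.0) < 1e-9 and probs[1] > probs[0]
```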

Actor-Critic Training Description
The details of the training process for the proposed actor-critic-based scheme, presented in Algorithm 2, can be summarized as follows. At the start of time slot t, the UAV executes action a(t) based on current state s(t) and the item requests of the GUs, q(t). The UAV has to stay silent when none of the GUs' requests are in the content cached in the UAV, and it transmits the corresponding data when at least one GU's request is in c_j. The corresponding immediate reward, R(s(t), a(t)), and the information of the next state, s(t + 1), are obtained based on the observations presented in Section 3.2. The UAV then updates its parameters, such as ∆(t), V_s(t), Θ(a(t) | s(t)), and Ω(a(t) | s(t)), at the end of each time slot. In addition, it is worth noting that the UAV only re-caches the LS items into c_j when it finishes a flight period. Unlike the proposed POMDP-based scheme, where the optimal policy is obtained through an offline formulation that requires energy harvesting distribution information, the proposed actor-critic-based scheme determines the policy from a practical learning process, and thus, it converges to a locally optimal policy [45]. In other words, by applying the actor-critic solution, we do not need to know the energy harvesting distribution in advance for the transition probability calculation in order to achieve the optimal policy, as in the POMDP-based solution. As a result, this scheme is more practical in network scenarios where no prior knowledge regarding the environment dynamics is available.

Algorithm 2
The detailed training process of the UAV using the proposed actor-critic-based scheme.
1: Input: d_FU_i, K, σ², T, T_re, T_de, T_up, C_F, E_Bat, e_tr,min, e_tr,max, P_mm, P_m'm, h_F, r_F, T_F, α, β_0, β.
2: Output: Optimal policy Ω*(s).
3: Define S and A.
4: Define the total number of time slots for training, N_t.
5: Initialize Θ(a|s), V_s, and Ω(s), where a ∈ A, s ∈ S.
6: repeat
7:     Define the current system state, s(t).

8:     Receive the requests of the GUs, q(t).
9:     if no request by the GUs is in c_j
10:         Stay silent.
11:     else
12:         Choose an action a(t) ∈ A according to Ω(s(t)).
13:         if the action is "stay silent" (i.e., e_tr(t) = 0)
14:             Stay silent.
15:         else
16:             if only one GU's request is in c_j
Regarding the complexity of the two proposed methods, the main computational difference between them is that the POMDP-based scheme needs to find the value function for the state space through an offline approach. This leads to higher computational complexity when using Algorithm 1. Specifically, the complexity of each iteration in the POMDP scheme can be computed as O(|A| |S|² O_obs |P|), where O_obs is the number of possible observations. Let the computational complexity for the UAV in each state during training in Algorithm 2 be O(1). Then, the total complexity of Algorithm 2 depends on the system state and action spaces and can be calculated as O(|A| |S|). Furthermore, the convergence rate of the actor-critic scheme depends considerably on the actor and critic step sizes. As a consequence, these values should be carefully chosen according to the other system parameters. We further provide a summary of the most frequently used symbols and notations in Table 1 to make the paper more readable.

Simulation Results
In this section, we present numerical simulation results regarding the performance of the two proposed schemes and of benchmark schemes based on the Myopic method [48]: a Myopic-NOMA scheme, a Myopic-NOMA-RC scheme, and a Myopic-OMA scheme. The term "Myopic" denotes a solution in which the optimal decision is made only for the current time slot, without considering the future evolution. In the Myopic-NOMA scheme, the UAV always transmits data with the optimal transmission power to the GUs by using NOMA whenever two or more GUs' requests are in the cached content of the UAV. Similarly, in the Myopic-NOMA-RC scheme, the UAV randomly caches items from the LS and always transmits data to the GUs with the optimal transmission power by using NOMA. Lastly, in the Myopic-OMA scheme, OMA data transmission is always used with the optimal transmission power. In particular, with this scheme, the data transmission phase is divided into I_oma(t) equal sub-slots, where I_oma(t) is the number of GUs involved in the data transmissions in time slot t, and the UAV transmits the corresponding data to each GU in its own sub-slot; the sum data rate of the Myopic-OMA scheme in time slot t, R_OMA(t), is then the sum of the per-sub-slot rates. Nevertheless, these benchmark schemes only consider the current time slot when maximizing the sum rate. Thus, their policy uses the maximum level of transmission energy available in the battery in the current time slot, which can lower system performance over a long operation owing to energy shortages for data transmissions in subsequent time slots. Meanwhile, the proposed schemes consider not only the current reward but also the future reward, as thoroughly presented in Sections 3 and 4. In the following, we verify the effectiveness of the two proposed schemes under changes in the network parameters. Table 2 shows the parameter setup, and the network topology with I = 3 is illustrated in Figure 5.
Unless otherwise stated, the transmission energy in the UAV is divided into five equal levels, 0 ≤ LV1 ≤ LV2 ≤ ... ≤ LV5 ≤ E_Bat, and there are eight levels in the UAV's battery, from zero to E_Bat. The step of power portion λ is 0.025. In this paper, the simulation results were obtained by averaging over N = 2 × 10⁵ time slots. Besides, the harvested energy was stochastically generated in each slot by a Poisson distribution with a mean harvested energy of E_h,avg = 75 µJ. During the serving time, there might be no energy for data transmissions by the UAV, which is referred to as an energy shortage. In that case, it has to stay silent and wait for harvested energy in subsequent time slots before transmitting data to the GUs. We first examine the convergence rate of the actor-critic-based scheme during the training process under various values of α_c and α_a for a mean harvested energy of E_h,avg = 75 µJ, based on the achievable sum rate calculated every 1000 time steps, as shown in Figure 6. Besides, the optimal value line is plotted according to the policy obtained by the POMDP-based approach. Note that the convergence condition of the algorithm is defined on the sum data rate: during the training process, the sum data rate is averaged after every batch of 1000 training time slots, and then the difference between two adjacent updates, ∆_c, is calculated. In the simulation, we set the convergence condition at |∆_c| < 7 × 10⁻³. It is observed from Figure 6 that the sum rate of the system after each iteration of 1000 slots increases sharply in the first 100,000 time slots and then gradually converges to a locally optimal policy that depends on the values of α_c and α_a. Therefore, in the simulation, we repeated the training process a number of times and then selected the policy with the actor and critic step size values that provide the maximum average rate.
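The batch-averaged convergence test and the per-slot Poisson energy arrivals described above can be sketched as follows. The function names are ours, and the unit-level quantization of harvested energy (integer values in µJ) is our assumption:

```python
import math
import random

def has_converged(batch_means, tol=7e-3):
    """Convergence test used above: compare the sum rate averaged over two
    adjacent batches of 1000 training slots against the threshold |Delta_c| < tol."""
    if len(batch_means) < 2:
        return False
    return abs(batch_means[-1] - batch_means[-2]) < tol

def harvested_energy(mean_uj=75):
    """One Poisson draw with mean E_h,avg (in uJ), via Knuth's method,
    modeling the stochastic per-slot harvested energy."""
    L, k, p = math.exp(-mean_uj), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

random.seed(0)
assert has_converged([1.000, 1.004]) and not has_converged([1.0, 1.1])
```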
In particular, with step sizes greater than 0.1, the proposed scheme converges faster; however, it leads to a lower data rate after 200,000 time slots of training. We can also see that if we keep decreasing the step sizes below 0.1, the algorithm might converge to a worse policy. Besides, with the network parameters in this paper, the proposed scheme with critic and actor step sizes α_c = α_a = 0.1 provides better performance, in which the data rate mostly converges to the optimal value, given by the POMDP-based scheme, after 200,000 time slots of training. Therefore, we chose the actor-critic step sizes α_c = α_a = 0.1 for the rest of the simulations. Figure 7 shows the sum rate according to the mean value of harvested energy in the UAV. It can be seen that the throughput of the system increases as the mean value of harvested energy goes up, because the UAV can harvest more energy from the environment, and thus higher transmission powers can be used for data transmissions during its flight period. We can see that the system rates of the proposed schemes dominate those of the conventional schemes: the actor-critic-based method is approximately as good as the POMDP-based method, and the two proposed schemes provide a system data rate about 10% higher than the Myopic approaches. Next, we compare the energy efficiency of the schemes with respect to the mean value of harvested energy in Figure 8. In this study, we aim to efficiently utilize the solar harvested energy of the UAV in long-term operation. When the transmission capacity is full during the serving time, the rest of the harvested energy can also be stored in the mobility capacity portion to support the UAV's flight. Moreover, the overflow energy of the battery is considered wasted energy of the system.
For that reason, in the simulation, the energy consumption is calculated as the total harvested energy during the UAV's operation. In Figure 8, all schemes have the same total energy consumption over N = 2 × 10⁵ time slots for each mean value of harvested energy. In this paper, energy efficiency is defined as the sum data rate over the total harvested energy during the UAV's operation. As a consequence, the curves in Figure 8 can be interpreted as the sum rate relative to energy consumption.
In order to explore the behavior in terms of the transmission power of the UAV, in Figure 9, we plot the statistics of the actions in the POMDP scheme, the actor-critic scheme, the Myopic-NOMA scheme, and the Myopic-OMA scheme over 200,000 time slots. The notation TM-LVx represents the transmission mode at level LVx, where LVx ∈ {LV1, LV2, ..., LV5} is the level of transmission energy. We can see in Figure 9 that the Myopic-NOMA scheme and the Myopic-OMA scheme tend to choose the highest transmission power for the purpose of maximizing the instant reward. The statistics of the selected actions in these myopic schemes are similar, but the achievable reward of the NOMA scheme is higher than that of the OMA scheme owing to the effective utilization of the NOMA technique. However, due to the limited harvested energy, using too much energy in one time slot may cause an energy shortage, in which the UAV has to stay silent for many future time slots, lowering the data rate of the system. In contrast, assigning an appropriate amount of transmission energy gives the UAV more chances to stay active and transmit data to the GUs under the environment dynamics, such that the maximum long-term data rate can be guaranteed. In Figure 10, we plot the sum data rate according to different values of caching capacity. The curves show that the system performance is enhanced if the UAV has a higher caching capacity. With a larger value of C_F, the UAV can store more items from the LS, so the probability that the GUs' requests are in the cached content of the UAV increases, which leads to a higher data transmission rate. We can also see that a higher P_mm brings higher system performance, because the GUs more frequently request their own items of interest over the time slots.

Figure 10. The sum data rate with respect to caching capacity.

Figures 11 and 12, respectively, show the impact of the noise variance at the GUs and the effect of the altitude of the UAV on the system reward. We can see that system performance declines notably as the noise power at the ground users (as well as the altitude of the UAV) grows: higher noise power lowers the throughput at each GU's receiver, and a greater distance between F and the GUs increases the path loss during data transmissions. Finally, we further investigated the joint effect of the number of items, K, in the library and caching capacity C_F in the UAV on the system data rate. Figure 13 indicates that the system reward increases with the ratio of C_F to K. For example, if the number of items is K = 300, the data rate of the system goes up when caching capacity C_F increases. Furthermore, the results of the POMDP-based and actor-critic schemes are superior to those of the Myopic-NOMA scheme. The reason is that the proposed POMDP scheme exploits prior information on the harvested energy distribution and on the request model of the GUs, and then calculates the possible situations and corresponding probabilities, whereas the actor-critic method explores this information by interacting directly with the environment and learns the optimal policy through trial and error. Consequently, the next state of the system can be predicted, and the UAV can efficiently allocate transmission power to the GUs based on NOMA and caching technologies under long-term operation considerations. Overall, the presented numerical results validate the effectiveness of the proposed approaches under various network parameters.

Conclusions
In this paper, we investigated non-orthogonal multiple access with data caching for UAV-enabled downlink transmissions under energy and caching capacity constraints in a solar-powered UAV. Two approaches, based on the POMDP and actor-critic frameworks, were proposed for the joint cache scheduling and resource allocation problem to maximize the long-term data rate of the system in the cases with and without prior information on the energy arrival distribution. The optimal policy can be obtained by using the two proposed schemes, such that the UAV efficiently uses harvested solar energy to transmit data to a group of ground users requesting service for their items of interest. Finally, the numerical results from MATLAB simulations verified the superiority of the proposed schemes, under diverse network conditions, over baseline alternatives that do not take long-term data rate maximization into account. A shortcoming of this work is that the formulation and computational complexity may become considerable in multi-UAV systems, where the coverage region for data communications is extended by deploying multiple UAVs to meet surging data transmission demands. In this regard, deep reinforcement learning is a promising solution to optimization problems in UAV systems with large state and action spaces for 5G and beyond-5G networks, which we consider in our future research directions.