Deep Reinforcement Learning for Computation Offloading and Resource Allocation in Unmanned-Aerial-Vehicle Assisted Edge Computing

Computation offloading technology extends cloud computing to the edge of the access network close to users, bringing many benefits to terminal devices with limited battery and computational resources. Nevertheless, existing computation offloading approaches are difficult to apply in certain scenarios, such as those with a dense distribution of end-users and a sparse distribution of network infrastructure. The technological revolution in the unmanned aerial vehicle (UAV) and chip industries has granted UAVs more computing resources and promoted the emergence of UAV-assisted mobile edge computing (MEC) technology, which could be applied to those scenarios. However, in an MEC system with multiple users and multiple servers, making reasonable offloading decisions and allocating system resources remains a severe challenge. This paper studies the offloading decision and resource allocation problem in a UAV-assisted MEC environment with multiple users and servers. To ensure the quality of service for end-users, we set the weighted total cost of delay, energy consumption, and the size of discarded tasks as our optimization objective. We further formulate the joint optimization problem as a Markov decision process and apply the soft actor-critic (SAC) deep reinforcement learning algorithm to optimize the offloading policy. Numerical simulation results show that the offloading policy optimized by our proposed SAC-based dynamic computing offloading (SACDCO) algorithm effectively reduces the delay, energy consumption, and size of discarded tasks for the UAV-assisted MEC system. Compared with the fixed local-UAV scheme in the specific simulation setting, our proposed approach reduces system delay and energy consumption by approximately 50% and 200%, respectively.


Introduction
In the past decade, the exponential growth and diversity of Internet of Things (IoT) devices have changed the way we live [1]. Advances in 5G technology and the IoT have made many emerging applications possible, such as autonomous driving, smart cities, virtual reality (VR)/augmented reality (AR), real-time video analysis, and cloud gaming. Nevertheless, the traditional cloud-based computing paradigm is not suitable for IoT terminal devices with limited computing and battery resources. The emergence of mobile edge computing (MEC) technology is expected to improve this situation. MEC servers deployed at the edge of the access network can provide terminal devices with computing and communication resources, bringing many benefits, such as reduced computing workload, delay, network congestion, and energy consumption. Traditional MEC servers are usually deployed on cellular base stations (BS) or Wi-Fi access points (AP). However, considering the construction cost and the gradual pace of infrastructure updates, not all BSs and APs can deploy edge servers. Therefore, mobile platforms, such as vehicles and UAVs, are regarded as alternative candidates for MEC servers. In this paper, we study a UAV-assisted MEC system in which the UAV serves end-users that cannot reach the MEC servers directly: it receives their tasks within its communication range and either executes them onboard or forwards them for execution by the MEC servers. We also consider that the computing task is partially offloaded, which differs from the binary offloading model of most existing studies and is closer to the actual situation [20]. We take the minimization of the weighted total cost of delay, energy consumption, and the size of discarded tasks as the optimization objective and further formulate the offloading decision problem as a Markov decision process. To this end, we propose a dynamic computation offloading approach based on the soft actor-critic (SAC) DRL algorithm. The SAC algorithm introduces entropy into the traditional actor-critic algorithm, improving decision-making performance and helping to obtain a globally optimal policy. Numerical simulation results demonstrate the effectiveness of our proposed SAC-based dynamic computing offloading (SACDCO) algorithm compared with other baseline schemes. The differences between our work and the existing literature are summarized in Table 1.
The remainder of this paper is organized as follows. Section 2 describes the system model and problem formulation. Section 3 introduces the UAV-assisted edge computation offloading approach we proposed. The performance evaluation of our proposed approach is achieved through a series of simulations, and numerical results are given in Section 4. Section 5 summarizes the paper.

System Model
The UAV-assisted MEC system is composed of multiple edge servers, multiple end-users, and a UAV, as shown in Figure 1. We consider a set of end-users, each of which periodically executes compute-intensive and delay-sensitive tasks during the decision episode. Due to signal congestion and the limited communication distance of the end-user, stable wireless communication cannot be established between the end-user and the MEC server. The UAV is equipped with antennas to communicate with end-users and MEC servers in the coverage area. In UAV-assisted MEC systems, end-users can offload computing tasks to UAVs. Compared with end-users, the UAV has stronger computing power, but it still cannot match the computing power of the MEC servers. Considering the limited battery capacity and computing power of the UAV, if the UAV cannot complete the computing task well, it will further consider offloading the computing task to a distant MEC server. The follow-me cloud (FMC) [27] controller is used in our proposed UAV-assisted MEC system, which can obtain the global information of end-users, MEC servers, and the UAV. Therefore, the proposed dynamic computing offloading algorithm is executed on the FMC controller. Without loss of generality, the set of end-users is denoted by $\mathcal{N} = \{1, 2, \cdots, n, \cdots, N\}$, the set of MEC servers is denoted by $\mathcal{S} = \{1, 2, \cdots, s, \cdots, S\}$, and the UAV is denoted by $\mathcal{U} = \{u\}$. The entire decision episode is divided into multiple time slots, where $\mathcal{T} = \{1, 2, \cdots, t, \cdots, T\}$ denotes the corresponding set. The UAV stays at a fixed altitude $h_u(t) = H, \forall t \in \mathcal{T}$. We define the three-dimensional Cartesian coordinates of the UAV as $L_u(t) = [x_u(t), y_u(t), h_u(t)]$, the coordinates of end-user $n_k$ as $L_{n_k} = [x_{n_k}, y_{n_k}, 0]$, and the coordinates of MEC server $s_k$ as $L_{s_k} = [x_{s_k}, y_{s_k}, 0]$, where $k \in \{1, 2, \dots, K\}$ denotes the corresponding serial number. Unless otherwise stated, the important notations used in this paper are summarized in Table 2.
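For concreteness, the following is a minimal sketch of this scenario setup under the notation above; all variable names and numeric values are our own illustrative choices rather than the paper's implementation.

```python
import numpy as np

NUM_USERS, NUM_SERVERS, NUM_SLOTS = 4, 4, 100   # |N|, |S|, |T| (illustrative)
H = 100.0                                       # fixed UAV altitude h_u(t) = H, in meters

# Fixed ground coordinates [x, y, 0] for end-users and MEC servers (placeholders)
user_pos = np.array([[100, 100, 0], [100, 400, 0],
                     [400, 100, 0], [400, 400, 0]], dtype=float)
server_pos = np.array([[0, 250, 0], [250, 0, 0],
                       [500, 250, 0], [250, 500, 0]], dtype=float)

# UAV position L_u(t): deployed at the area center at altitude H
uav_pos = np.array([250.0, 250.0, H])
```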

Table 2. Summary of important notations.

| Notation | Description |
|---|---|
| $\mathcal{N}$ | The set of end-users $n$ |
| $\mathcal{S}$ | The set of MEC servers $s$ |
| $\mathcal{U}$ | The set containing the UAV $u$ |
| $\mathcal{T}$ | The set of time slots $t$ |
| $K$ | The maximum number of end-users or MEC servers |
| $L_{n_k}(t)$ | The location of end-user $n_k$ |
| $L_{s_k}(t)$ | The location of MEC server $s_k$ |
| $L_u(t)$ | The location of the UAV |
| $g_{n_k,u}(t)$ | The channel gain between end-user $n_k$ and the UAV $u$ |
| $r_{n_k,u}(t)$ | The transmission rate between end-user $n_k$ and the UAV $u$ |
| $g_{u,s_k}(t)$ | The channel gain between the UAV $u$ and MEC server $s_k$ |
| $r_{u,s_k}(t)$ | The transmission rate between the UAV $u$ and MEC server $s_k$ |
| $t_{fly}(t)$ | The flight delay of the UAV $u$ |
| $t_{tr}^{u,n_k}(t)$ | The transmission delay between end-user $n_k$ and the UAV |
| $t_{cal}^{u}(t)$ | The calculation delay of the UAV $u$ |
| $t_{cal}^{n_k}(t)$ | The calculation delay of end-user $n_k$ |
| $t_{tr}^{u,s}(t)$ | The transmission delay between the UAV and MEC server $s_k$ |
| $D_{n_k}(t)$ | The computing task that end-user $n_k$ needs to complete |
| $R_{uav}(t)$ | The offloading ratio to the UAV |
| $R_{s_k}(t)$ | Whether to further offload to MEC server $s_k$ |
| $S(t)$ | The total size of the discarded tasks in time slot $t$ |

Communication Model
The UAV offers services to all end-users but serves only one end-user in each time slot. We assume that all end-users are fixed at certain coordinates $L_{n_k} = [x_{n_k}, y_{n_k}, 0]$. At the beginning of the whole decision episode, the UAV $u$ is deployed at the initial position $L_u(0) = [x_u(0), y_u(0), h_u(0)]$. When a certain end-user needs to be served, the UAV flies to a position directly above the end-user and establishes the communication link. Similar to [18], the communication links are presumed to be dominated by LOS channels. Thus, the channel gain between end-user $n_k$ and the UAV $u$ could be denoted as

$$g_{n_k,u}(t) = \frac{\alpha_0}{d_{n_k,u}^2(t)} = \frac{\alpha_0}{\left\| L_u(t) - L_{n_k} \right\|^2}$$

where $d_{n_k,u}(t)$ denotes the Euclidean distance between end-user $n_k$ and the UAV $u$, $\|\cdot\|$ denotes the Euclidean norm, and $\alpha_0$ denotes the received power at a reference distance of 1 m for a transmission power of 1 W. Considering the blocking of the communication signal by buildings, the wireless transmission rate can be denoted as

$$r_{n_k,u}(t) = B \log_2\left(1 + \frac{P_{down}\, g_{n_k,u}(t)}{\sigma^2 + f_{n_k,u}(t)\, P_{NLOS}}\right)$$

where $B$ denotes the assigned communication bandwidth, $P_{down}$ denotes the received power of the UAV, $\sigma^2$ denotes the noise power, $P_{NLOS}$ denotes the transmission loss, and $f_{n_k,u}(t)$ denotes whether there is a communication block between end-user $n_k$ and the UAV in time slot $t$ (that is, 0 means no blocking, and 1 means blocking). Similarly, when the UAV needs to further send the computing tasks to the remote MEC server, the channel gain between the UAV $u$ and MEC server $s_k$ could be denoted as

$$g_{u,s_k}(t) = \frac{\alpha_0}{\left\| L_u(t) - L_{s_k} \right\|^2}$$

and the wireless transmission rate between the UAV $u$ and MEC server $s_k$ could be denoted as

$$r_{u,s_k}(t) = B \log_2\left(1 + \frac{P_{up}\, g_{u,s_k}(t)}{\sigma^2}\right)$$
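As a sanity check on the communication model, the following sketch implements the gain and rate expressions as reconstructed above; all constants are placeholders rather than the paper's Table 3 values, and the NLOS handling follows our reading of the blocking indicator $f_{n_k,u}(t)$.

```python
import numpy as np

ALPHA_0 = 1e-4     # alpha_0: reference channel power gain at d = 1 m (placeholder)
BANDWIDTH = 1e6    # B: assigned bandwidth in Hz (placeholder)
P_DOWN = 0.1       # P_down: received power of the UAV in W (placeholder)
SIGMA2 = 1e-10     # sigma^2: noise power (placeholder)
P_NLOS = 1e-8      # additional transmission loss when blocked (placeholder)

def channel_gain(pos_a, pos_b):
    """LOS channel gain g = alpha_0 / d^2 between two 3-D positions."""
    d = np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b))
    return ALPHA_0 / d ** 2

def transmission_rate(gain, power=P_DOWN, blocked=0):
    """Shannon-style rate; `blocked` is the 0/1 NLOS indicator f(t)."""
    snr = power * gain / (SIGMA2 + blocked * P_NLOS)
    return BANDWIDTH * np.log2(1.0 + snr)

# Example: user at (100, 200, 0), UAV hovering directly above at 100 m
g = channel_gain([100, 200, 0], [100, 200, 100])
r = transmission_rate(g)   # bits per second
```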

Computation Model
Due to the limited computing resources of the end-user, our proposed offloading decision optimization algorithm is applied in each time slot. According to the offloading policy, the end-user offloads part of its tasks to the UAV, and the UAV then determines whether to process them locally or to further offload them to the MEC server. It should be noted that, compared with the entire communication and calculation delay, the time to divide the task is very short, so this part of the delay is ignored in our model. In addition, in some computing-intensive applications, such as video analysis, the output data size of the computing results is often much smaller than the input data size. Therefore, the delay of the downlink is also ignored. The key components of the total delay during the offloading process are as follows:
• The flight delay from the previous location to the position directly above the end-user;
• The transmission delay from the end-user to the UAV;
• The calculation delay of the UAV;
• The calculation delay of the end-user;
• The transmission delay from the UAV to the MEC server;
• The calculation delay of the MEC server.
The flight delay from the previous location of the UAV $u$ to the position directly above the end-user could be described as

$$t_{fly}(t) = \frac{\left\| L_u(t) - L_u(t-1) \right\|}{v_u}$$

where $v_u$ is the average flight speed of the UAV $u$. The transmission delay from end-user $n_k$ to UAV $u$ could be described as

$$t_{tr}^{u,n_k}(t) = \frac{R_{uav}(t)\, D_{n_k}(t)}{r_{n_k,u}(t)}$$

where $R_{uav}(t) \in [0, 1]$ is the offloading ratio of end-user $n_k$ to the UAV, and $D_{n_k}(t)$ is the computing task size of end-user $n_k$ in time slot $t$. The calculation delay of the UAV $u$ could be described as

$$t_{cal}^{u}(t) = \frac{R_{uav}(t)\, D_{n_k}(t)\, s}{f_{uav}}$$

where $s$ denotes the CPU cycles required to process each byte, and $f_{uav}$ denotes the calculation frequency of the UAV's CPU. Similarly, the local calculation delay of end-user $n_k$ in time slot $t$ could be denoted as

$$t_{cal}^{n_k}(t) = \frac{\left(1 - R_{uav}(t)\right) D_{n_k}(t)\, s}{f_{n_k}}$$

where $f_{n_k}$ denotes the calculation frequency of end-user $n_k$. According to the offloading policy, it is decided whether to offload the computing tasks to the MEC servers. Due to the UAV's limited battery capacity, we consider offloading all the computing tasks received by the UAV to the MEC servers in that case. Therefore, the transmission delay from the UAV $u$ to MEC server $s_k$ could be denoted as

$$t_{tr}^{u,s_k}(t) = \frac{R_{uav}(t)\, D_{n_k}(t)}{r_{u,s_k}(t)}$$

The calculation delay of MEC server $s_k$ could be denoted as

$$t_{cal}^{s_k}(t) = \frac{R_{uav}(t)\, D_{n_k}(t)\, s}{f_{s_k}}$$

To define the service delay of each time slot, we assume that the UAV and the MEC servers can only start executing the computing tasks after the transmission is completed, to ensure the reliability of the calculation result. We also assume that the end-users execute locally and transmit computing tasks at the same time. Based on the above assumptions, the service delay of each time slot could be denoted as

$$t(t) = \begin{cases} \max\left\{ t_{cal}^{n_k}(t),\; t_{fly}(t) + t_{tr}^{u,n_k}(t) + t_{cal}^{u}(t) \right\}, & \text{for end-user and the UAV,} \\ \max\left\{ t_{cal}^{n_k}(t),\; t_{fly}(t) + t_{tr}^{u,n_k}(t) + t_{tr}^{u,s_k}(t) + t_{cal}^{s_k}(t) \right\}, & \text{for end-user, the UAV, and the MEC server.} \end{cases}$$
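Putting the pieces together, a minimal sketch of the piecewise service-delay computation is shown below, reusing the placeholder constants from the earlier sketches; the frequencies and cycles-per-byte value are illustrative assumptions.

```python
CYCLES_PER_BYTE = 1000    # s: CPU cycles required per byte (placeholder)
F_USER = 1e9              # f_nk: end-user CPU frequency in Hz (placeholder)
F_UAV = 3e9               # f_uav: UAV CPU frequency (placeholder)
F_MEC = 8 * 3.0e9         # f_sk: 8-core 3.0 GHz MEC server (from Section 4)
V_UAV = 15.0              # v_u: average UAV flight speed in m/s (placeholder)

def service_delay(task_bits, r_uav, offload_to_mec, fly_dist,
                  rate_user_uav, rate_uav_mec):
    """max{local path, UAV/MEC path}, mirroring the piecewise delay model above."""
    task_bytes = task_bits / 8
    t_local = (1 - r_uav) * task_bytes * CYCLES_PER_BYTE / F_USER
    t_fly = fly_dist / V_UAV
    t_tx_uav = r_uav * task_bits / rate_user_uav
    if offload_to_mec:                   # R_sk(t) = 1: forward to the MEC server
        t_tx_mec = r_uav * task_bits / rate_uav_mec
        t_exec = r_uav * task_bytes * CYCLES_PER_BYTE / F_MEC
        return max(t_local, t_fly + t_tx_uav + t_tx_mec + t_exec)
    t_exec = r_uav * task_bytes * CYCLES_PER_BYTE / F_UAV   # R_sk(t) = 0: on board
    return max(t_local, t_fly + t_tx_uav + t_exec)
```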

Energy Model
Battery capacity has always been a bottleneck in UAV applications. The battery capacity of the UAV is denoted as $E_{battery}$. At the beginning of the decision episode, the UAV is fully charged. The UAV continues to serve the end-users until the battery capacity is exhausted. Our study mainly focuses on the calculation and transmission energy consumption of the UAV while ignoring other energy consumption that is unrelated to our decision-making. The key components of the energy consumption during the offloading process are as follows:
• The flight energy consumption of the UAV;
• The transmission energy consumption when the UAV receives tasks from end-users;
• The calculation energy consumption of the UAV;
• The transmission energy consumption from the UAV to the MEC server.
The flight energy consumption of the UAV could be denoted as

$$E_{fly}(t) = F\, v_u\, t_{fly}(t)$$

where $F = m_{uav}\, g$ is related to the weight of the UAV. The transmission energy consumption when the UAV receives tasks from end-users could be denoted as

$$E_{tr}^{n_k,u}(t) = P_{down}\, t_{tr}^{u,n_k}(t)$$

where $P_{down}$ denotes the received power of the UAV, and $t_{tr}^{u,n_k}(t)$ denotes the transmission delay. Similar to [28], we model the calculation power as positively correlated with computing capacity, i.e., $\kappa f_{uav}^3$, where $\kappa$ denotes the energy consumption factor. The UAV calculation energy consumption is denoted as

$$E_{cal}^{u}(t) = \kappa f_{uav}^3\, t_{cal}^{u}(t)$$

The sending power of the UAV is denoted as $P_{up}$, and the transmission energy consumption of the UAV could be denoted as

$$E_{tr}^{u,s_k}(t) = P_{up}\, t_{tr}^{u,s_k}(t)$$

According to the above analysis, the total energy consumption of the UAV could be denoted as

$$E(t) = \begin{cases} E_{fly}(t) + E_{tr}^{n_k,u}(t) + E_{cal}^{u}(t), & \text{for the UAV and end-user,} \\ E_{fly}(t) + E_{tr}^{n_k,u}(t) + E_{tr}^{u,s_k}(t), & \text{for the UAV, end-user, and MEC server.} \end{cases}$$
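The per-slot UAV energy accounting can be sketched as follows, reusing the placeholder constants introduced earlier; the flight-power form $F \cdot v_u$ is our reading of the model above, and all values are assumptions.

```python
KAPPA = 1e-28      # kappa: energy consumption factor (placeholder)
P_UP = 0.5         # P_up: UAV sending power in W (placeholder)
UAV_MASS = 0.6     # m_uav in kg (placeholder; cf. DJI Air 2S specs [32])
GRAVITY = 9.8      # g: gravitational acceleration in m/s^2

def uav_energy(t_fly, t_rx, t_exec=0.0, t_tx_mec=0.0):
    """Per-slot UAV energy: flight + receiving + (onboard execution | forwarding)."""
    e_fly = UAV_MASS * GRAVITY * V_UAV * t_fly   # E_fly = F * v_u * t_fly
    e_rx = P_DOWN * t_rx                         # receiving from the end-user
    e_cal = KAPPA * F_UAV ** 3 * t_exec          # onboard computation, if any
    e_tx = P_UP * t_tx_mec                       # forwarding to the MEC server, if any
    return e_fly + e_rx + e_cal + e_tx
```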

Problem Formulation
Our objective is to minimize the weighted total cost of the service delay, the energy consumption of the UAV, and the size of the discarded tasks by optimizing the offloading policy. The joint optimization problem could be denoted as

$$\min_{\{R_{uav}(t),\, R_{s_k}(t)\}} \; \sum_{t \in \mathcal{T}} \left[ t(t) + \rho_1 E(t) + \rho_2 S(t) \right] \tag{17}$$

subject to

$$L_u(t) \in [0, L] \times [0, W] \times \{H\}, \quad \forall t \in \mathcal{T} \tag{18}$$

$$\sum_{t \in \mathcal{T}} E(t) \le E_{battery} \tag{19}$$

$$R_{uav}(t) \in [0, 1], \quad \forall t \in \mathcal{T} \tag{20}$$

$$R_{s_k}(t) \in \{0, 1\}, \quad \forall t \in \mathcal{T} \tag{21}$$

$$\sum_{t \in \mathcal{T}} D_{n_k}(t) = D \tag{22}$$

$$0 \le S(t) \le D_{n_k}(t), \quad \forall t \in \mathcal{T} \tag{23}$$

where $\rho_1, \rho_2 > 0$ in (17) are parameters that define the relative weights, and $S(t)$ denotes the total size of the discarded tasks in time slot $t$. Constraint (18) limits the UAV's range of movement. Constraint (19) means that the total energy consumption during the decision episode cannot exceed the maximum battery capacity of the UAV. Constraint (20) denotes the value range of the offloading ratio. Constraint (21) denotes whether to further offload to the MEC server. In (22), $D$ denotes the total size of the computing tasks that should be executed during the decision episode. Constraint (23) denotes that the size of the discarded tasks does not exceed the total size of computing tasks in each time slot.

Soft Actor-Critic Based Dynamic Computation Offloading Algorithm
Our study objective is to obtain the optimal offloading policy by minimizing the weighted total cost of the service delay, the energy consumption of the UAV, and the size of the discarded tasks during the entire decision episode. We consider the standard reinforcement learning framework [29] and formulate the UAV-assisted edge computing offloading decision-making and resource allocation problem as a Markov decision process (MDP).

Markov Decision Process
The MDP is usually described as a quintuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, whose elements denote the state space, action space, state transition probability, reward, and discount factor, respectively. The FMC controller used in our proposal can obtain all global information of the end-users, the UAV, and the MEC servers. Therefore, the DRL-based algorithm we propose runs on the FMC controller in each time slot. The state space, action space, and reward are defined as follows.

• State space: We consider the current location of the UAV $L_u(t)$, the remaining UAV battery capacity $E_{battery}(t)$, and the size of the computing tasks $D_{n_k}(t)$ as the current state. Therefore, the state space can be denoted as
$$s_t = \left[ L_u(t),\; E_{battery}(t),\; D_{n_k}(t) \right]$$
• Action space: We consider the offloading ratio $R_{uav}(t)$ and whether to further offload to the MEC servers $R_{s_k}(t)$ as the current action of the agent. Therefore, the action space can be denoted as
$$a_t = \left[ R_{uav}(t),\; R_{s_k}(t) \right]$$
• Reward: We define the reward so that maximizing the cumulative reward minimizes the weighted sum of the service delay, energy consumption, and size of discarded tasks. Thus, the reward can be denoted as
$$r_t = -\left[ t(t) + \rho_1 E(t) + \rho_2 S(t) \right]$$
where $\rho_1, \rho_2 > 0$ denote the relative weights.
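A minimal sketch of how this MDP interface might be wired up is shown below; the negative-cost reward form is our reading of the definition above, and the default weights are those selected in Section 4.

```python
import numpy as np

def get_state(uav_pos, battery_j, task_bits):
    """State s_t = [L_u(t), E_battery(t), D_nk(t)], flattened for the agent."""
    return np.array([*uav_pos, battery_j, task_bits], dtype=np.float32)

def get_reward(delay, energy, discarded_bits, rho1=0.01, rho2=0.1):
    """Negative weighted cost: maximizing the reward minimizes the total cost."""
    return -(delay + rho1 * energy + rho2 * discarded_bits)
```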

Soft Actor-Critic DRL Algorithm
Previous studies have shown that DRL algorithms can solve offloading decision problems in the MEC environment [21,22,30]. However, those DRL algorithms suffer from two main problems: high sample complexity, i.e., the large amount of data required for learning, and brittleness with respect to learning rates, exploration constants, and other hyperparameters. Algorithms such as DDPG and twin delayed DDPG (TD3) are used to tackle the challenge of high sample complexity in actor-critic frameworks with continuous action spaces. However, they still remain brittle with respect to their hyperparameters.
The soft actor-critic (SAC) algorithm is an actor-critic framework for settings with continuous action spaces wherein the standard objective of reinforcement learning, i.e., maximizing the expected cumulative reward, is augmented with an additional objective of entropy maximization, which provides a substantial improvement in exploration and robustness. Thus, the optimization objective of the SAC algorithm is described as

$$\pi^* = \arg\max_\pi \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}\left( \pi(\cdot \mid s_t) \right) \right]$$

where $\alpha > 0$ is the temperature parameter, which determines the relative importance of the entropy term against the reward, and $\mathcal{H}$ represents the entropy function. The entropy of a random variable $x$ following a probability distribution $P$ is defined as $\mathcal{H}(P) = \mathbb{E}_{x \sim P}[-\log P(x)]$. Similar to the traditional actor-critic algorithm, the value function $V^\pi(s)$ and the action-value function $Q^\pi(s, a)$ can be defined in the SAC algorithm, which are given as follows

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha \mathcal{H}\left( \pi(\cdot \mid s_t) \right) \right) \,\middle|\, s_0 = s \right]$$

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^t \mathcal{H}\left( \pi(\cdot \mid s_t) \right) \,\middle|\, s_0 = s, a_0 = a \right]$$

According to the above analysis, $V^\pi$ and $Q^\pi$ are connected by

$$V^\pi(s) = \mathbb{E}_{a \sim \pi} \left[ Q^\pi(s, a) \right] + \alpha \mathcal{H}\left( \pi(\cdot \mid s) \right)$$

SAC concurrently learns a policy $\pi_\theta$, two Q-functions $Q_{\phi_1}, Q_{\phi_2}$, and their target networks. The two Q-functions are learned in a fashion similar to TD3, where a common target is considered for both Q-functions, and clipped double Q-learning is used to train the networks. The action value of one state-action pair can be approximated as

$$Q^\pi(s, a) \approx r + \gamma \left( Q^\pi(s', \tilde{a}') - \alpha \log \pi(\tilde{a}' \mid s') \right)$$

where $\tilde{a}'$, the action taken in the next state, is sampled from the policy. Like other off-policy algorithms, SAC also uses a replay buffer: the quintuple $(s, a, r, s', d)$ from each step is stored in the replay buffer $\mathcal{D}$, and batches of these transitions are sampled when updating the network parameters.
Just like TD3, SAC uses clipped double Q-learning to calculate the target values for the Q-value networks. The target is given by

$$y(r, s', d) = r + \gamma (1 - d) \left( \min_{j=1,2} Q_{\phi_{targ,j}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s') \right)$$

where $\tilde{a}'$ is sampled from the policy. The loss function for each Q-network can be defined as

$$L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}} \left[ \left( Q_{\phi_i}(s, a) - y(r, s', d) \right)^2 \right]$$

The main objective of policy optimization is to maximize the value function, which, in this case, can be defined as

$$V^\pi(s) = \mathbb{E}_{a \sim \pi} \left[ \min_{j=1,2} Q_{\phi_j}(s, a) - \alpha \log \pi_\theta(a \mid s) \right]$$

In SAC, a reparameterization trick is used to sample actions from the policy so that sampling is a differentiable process. The policy is now parameterized as

$$\tilde{a}_\theta(s, \xi) = \tanh\left( \mu_\theta(s) + \sigma_\theta(s) \odot \xi \right), \quad \xi \sim \mathcal{N}(0, I)$$

The maximization objective is now defined as

$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}} \left[ \min_{j=1,2} Q_{\phi_j}\left(s, \tilde{a}_\theta(s, \xi)\right) - \alpha \log \pi_\theta\left( \tilde{a}_\theta(s, \xi) \mid s \right) \right]$$

The pseudocode of the soft actor-critic algorithm is given in Algorithm 1 [31]. We optimize the computation offloading policy via the soft actor-critic DRL algorithm in each time slot, thereby minimizing the optimization objective. The pseudocode of the SAC-based UAV-assisted computation offloading algorithm is given in Algorithm 2.
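Since Algorithms 1 and 2 are referenced rather than reproduced here, the following is a minimal PyTorch sketch of the two core SAC updates described above: the squashed-Gaussian policy with the reparameterization trick, and the critic loss with clipped double Q-learning. All network shapes and names are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))

state_dim, action_dim = 5, 2     # e.g., [x, y, h, battery, task] and [R_uav, R_sk]
alpha, gamma = 0.2, 0.99         # temperature and discount factor (placeholders)

policy = mlp(state_dim, 2 * action_dim)           # outputs [mu, log_std]
q1, q2 = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
q1_targ = mlp(state_dim + action_dim, 1); q1_targ.load_state_dict(q1.state_dict())
q2_targ = mlp(state_dim + action_dim, 1); q2_targ.load_state_dict(q2.state_dict())

def policy_sample(s):
    """Squashed-Gaussian action via the reparameterization trick."""
    mu, log_std = policy(s).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mu, log_std.clamp(-20, 2).exp())
    u = dist.rsample()                            # differentiable sampling
    a = torch.tanh(u)                             # squash to (-1, 1)
    # log-prob correction for the tanh squashing: log(1 - tanh(u)^2)
    logp = dist.log_prob(u).sum(-1) - (2 * (torch.log(torch.tensor(2.0))
                                            - u - F.softplus(-2 * u))).sum(-1)
    return a, logp

def critic_loss(s, a, r, s2, d):
    """MSE to the clipped double-Q target y = r + gamma(1-d)(min Q_targ - alpha*logp)."""
    with torch.no_grad():
        a2, logp_a2 = policy_sample(s2)
        q_targ = torch.min(q1_targ(torch.cat([s2, a2], -1)),
                           q2_targ(torch.cat([s2, a2], -1))).squeeze(-1)
        y = r + gamma * (1 - d) * (q_targ - alpha * logp_a2)
    q1_pred = q1(torch.cat([s, a], -1)).squeeze(-1)
    q2_pred = q2(torch.cat([s, a], -1)).squeeze(-1)
    return ((q1_pred - y) ** 2).mean() + ((q2_pred - y) ** 2).mean()

def actor_loss(s):
    """Maximize min-Q minus the entropy penalty, i.e., minimize its negation."""
    a, logp = policy_sample(s)
    q = torch.min(q1(torch.cat([s, a], -1)),
                  q2(torch.cat([s, a], -1))).squeeze(-1)
    return (alpha * logp - q).mean()
```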

Performance Evaluation
In this section, a detailed numerical evaluation is presented to study the performance of our proposed SACDCO algorithm compared with other baseline schemes. All algorithms and the corresponding simulations are implemented in Python and executed on a desktop computer with a 6-core Intel Core i7-8700 CPU and 32 GB of RAM.

Simulation Settings
As mentioned above, our proposed UAV-assisted MEC system consists of three entities: end-users, a UAV, and edge servers, which differs from previous studies. The offloading approaches in the existing literature are therefore not directly applicable to our setting, so we consider the following intuitive schemes as baselines: the local-only scheme, the UAV-only scheme, and the fixed local-UAV scheme. We consider a two-dimensional square area, in which four end-users are distributed at fixed positions in an area of L × W = 500 × 500 m². Furthermore, four MEC servers are deployed at the edge of the area, and each MEC server is equipped with an 8-core 3.0 GHz CPU. At the beginning of the decision episode, we assume that the UAV is deployed at the initial position $L_u = (250, 250)$ at a height of H = 100 m. For the UAV, we refer to the parameters of the DJI Air 2S [32]. Unless otherwise stated, the simulation parameters are summarized in Table 3.
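For reproducibility, the settings above can be collected in a single configuration object; this is an illustrative sketch, and any value not stated in the text (Table 3 itself is not reproduced here) is a placeholder.

```python
SIM_CONFIG = {
    "area_m": (500, 500),                   # L x W square area
    "num_users": 4,                         # end-users at fixed positions
    "num_mec_servers": 4,                   # deployed at the edge of the area
    "mec_cpu": {"cores": 8, "freq_hz": 3.0e9},
    "uav_init_pos": (250.0, 250.0, 100.0),  # initial position with height H = 100 m
    "learning_rate": 3e-4,                  # selected via Figure 2
    "rho1": 0.01,                           # weight rho_1, selected via Figure 3
    "rho2": 0.1,                            # weight rho_2, selected via Figure 3
}
```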

Simulation Result
In this section, we first verify the convergence of the SACDCO algorithm and examine the effect of different hyperparameters on its performance in order to select the optimal hyperparameters. To demonstrate the importance of offloading policy optimization, we then compare our proposed SACDCO approach with the baseline schemes regarding the delay, energy consumption, and size of discarded tasks. Subsequently, we study the influence of different UAV parameters on offloading decision-making. The specific simulation results and corresponding charts are given as follows.
Firstly, we study the impact of different learning rates on the cumulative reward of the SACDCO algorithm. As shown in Figure 2, higher cumulative rewards are obtained when the learning rate is set to 0.003 or 0.0003. Compared with a learning rate of 0.003, a learning rate of 0.0003 allows the cumulative reward of the SACDCO algorithm to converge faster. Therefore, the following experiments use a learning rate of 0.0003.
Secondly, we study the impact of the relative weights $\rho_1, \rho_2$ on the cumulative reward of the SACDCO algorithm. Considering the value ranges of the delay, energy consumption, and size of discarded tasks, we design four weighting schemes and conduct the corresponding simulations. According to the simulation results shown in Figure 3, the SACDCO algorithm obtains the highest cumulative reward when $\rho_1$ is set to 0.01 and $\rho_2$ is set to 0.1. Accordingly, the following experiments use $\rho_1 = 0.01$ and $\rho_2 = 0.1$. To demonstrate the importance of offloading policy optimization, we compared our proposal with the baseline schemes in terms of delay, as shown in Figure 4. Since it is not affected by the local calculation delay, the UAV-only scheme has the lowest delay, about 32 s. After the SACDCO algorithm optimizes the offloading policy, the delay converges to about 40 s. The delays of the other two schemes are considerably higher, around 80 s and 90 s, respectively.
The comparison of the different schemes in terms of UAV energy consumption is shown in Figure 5. Since the local-only scheme does not consume any UAV energy, it does not appear in the figure. We find that the energy consumption of our proposed SACDCO approach is at a low level, about 2900 kJ (only about 60% of that of the highest-energy scheme). As shown in Figure 6, we compare our proposal with the UAV-only and fixed local-UAV schemes in terms of the size of discarded tasks. When the UAV's battery runs out, the remainder of the tasks offloaded to the UAV is discarded. The simulation results show that the size of the discarded tasks under our proposed SACDCO algorithm is at a low level, about 35 Mbit (approximately 43% of the total task size).
We then study the impact of the UAV's computing capability and bandwidth on the weighted sum of delay, energy consumption, and size of discarded tasks. We adjust the UAV's computing power by changing the number of its CPU cores, and the simulation results are shown in Figure 7. They show that the proposed SACDCO algorithm obtains the highest cumulative reward when the UAV is equipped with two cores. Then, we study how the cumulative reward behaves as the UAV bandwidth increases from 5 GHz to 30 GHz. We observe that as the bandwidth increases, the cumulative reward of the SACDCO algorithm also increases. Compared with the other baseline algorithms, the SACDCO algorithm maintains good performance. It should be noted that such high bandwidths are usually impractical. The simulation results are given in Figure 8.

Conclusions
Considering the stochastic computation tasks generated by end-users, the mobility of the UAV, and the UAV's limited battery capacity, we have studied the computation offloading decision problem in a UAV-assisted MEC environment with multiple users and multiple MEC servers. To obtain the globally optimal offloading policy, we set the minimization of the weighted total cost of system delay, energy consumption, and the size of discarded tasks as the optimization objective. We propose the soft actor-critic dynamic computation offloading approach to optimize the computation offloading and resource allocation policy. Unlike previous studies, we let the UAV and MEC servers work collaboratively to provide computing services for end-users. To this end, we consider three intuitive schemes as baselines in the simulation, i.e., the local-only scheme, the UAV-only scheme, and the fixed local-UAV scheme. Extensive simulations have demonstrated the superiority of our proposal in terms of delay, energy consumption, and the size of discarded tasks. In particular, compared with the fixed local-UAV scheme in the specific simulation setting, our proposed approach reduces system delay and energy consumption by approximately 50% and 200%, respectively.
In the future, we will extend our research on offloading decision optimization to multi-UAV collaborative edge computing scenarios. Existing literature has shown the superiority of multi-UAV collaborative assisted edge computing. In [23], multiple UAVs work collaboratively to serve IoT devices, and a DQN-based optimization approach was proposed to improve the efficiency of each UAV while maintaining the QoS of ground IoT devices. Nevertheless, the DQN algorithm cannot handle high-dimensional action spaces well, as mentioned before. Savkin et al. [33] proposed a multi-UAV collaborative approach for improving the network performance between UAVs and base stations. Chen [34] proposed an approach for improving the pairing rate by optimizing power allocation, UAV locations, and node scheduling. However, their optimization objective only focuses on system delay, which differs from ours. We will also introduce multi-agent reinforcement learning algorithms into our proposal, which could further improve its performance. We believe that multi-agent reinforcement learning algorithms will play an indispensable role in complex decision problems.