Decentralized Offloading Strategies Based on Reinforcement Learning for Multi-Access Edge Computing

Abstract: Researchers have applied reinforcement learning to learn offloading strategies for multi-access edge computing systems. However, reinforcement learning scales poorly to large systems because of their huge state spaces and offloading action spaces. For this reason, this work introduces the centralized training and decentralized execution mechanism and designs a decentralized reinforcement learning model for multi-access edge computing systems. Considering a cloud server and several edge servers, we separate training from execution in the reinforcement learning model. Execution happens on the edge devices of the system, and the edge servers need no communication with each other. Training, in contrast, takes place on the cloud server, which lowers the transmission latency. The developed method uses a deep deterministic policy gradient algorithm to optimize the offloading strategies. The simulated experiment shows that our method can learn the offloading strategy for each edge device efficiently.


Introduction
With the emergence of the fifth-generation mobile communication system (5G), many computing-intensive services have been developed, such as face recognition [1], self-driving cars [2], and real-time translation [3]. However, existing consumer devices cannot provide sufficient computing resources for these computing-intensive tasks. A feasible method is cloud computing [4], which transfers tasks to a high-performance cloud server and returns the computing results to the user devices. The drawback is that a distant cloud server incurs a higher transmission latency. Therefore, edge computing is required, deploying several services at the edges of the network [5].
Generally, edge services are closer to the user device than the cloud service, but because their distances to the user device differ, an edge computing network needs an efficient offloading strategy to achieve low latency [6]. Two types of latency are generated in the offloading process. One is the transmission latency, which results from sending tasks to services over the uplink and returning the computing results to the user device over the downlink [7]. The other is the computing latency, which depends on the task and on the computation ability of the corresponding service. In a multi-access edge computing (MEC) system, a good offloading strategy assigns edge services according to the service state and the tasks.
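The two latency components described above can be illustrated with a minimal sketch. The names `Server`, `total_latency_ms`, and `best_server`, as well as the units and the numbers in the usage example, are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Server:
    """Hypothetical description of a service that can execute a task."""
    name: str
    cycles_per_ms: float    # computation ability (assumed unit: CPU cycles per ms)
    link_latency_ms: float  # uplink + downlink transmission latency


def total_latency_ms(workload_cycles: float, server: Server) -> float:
    """Transmission latency plus computing latency for one task on one server."""
    computing_latency = workload_cycles / server.cycles_per_ms
    return server.link_latency_ms + computing_latency


def best_server(workload_cycles: float, servers: Sequence[Server]) -> Server:
    """Pick the server with the lowest total latency for this task."""
    return min(servers, key=lambda s: total_latency_ms(workload_cycles, s))


if __name__ == "__main__":
    edge = Server("edge", cycles_per_ms=1_000.0, link_latency_ms=5.0)
    cloud = Server("cloud", cycles_per_ms=10_000.0, link_latency_ms=60.0)
    for workload in (20_000.0, 1_000_000.0):
        print(workload, best_server(workload, [edge, cloud]).name)
```

With these assumed numbers, a light task is served faster by the nearby edge server, while a heavy task is served faster by the distant but more powerful cloud server. This trade-off is what an offloading strategy has to resolve.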
In early designs, MEC systems offloaded tasks through fixed rules or algorithms, such as Round Robin and first-come-first-served. These methods were enough to offload a small number of tasks and avoid blocking. However, once more user devices exist, or the MEC system has only finite services, the traditional methods cannot satisfy the user devices' needs. Hence, heuristic algorithms were developed that regard task offloading as an optimization problem.
As deep reinforcement learning has achieved great success in sequential decision-making scenarios [8], more and more researchers have introduced reinforcement learning into MEC task offloading. Benefiting from deep neural networks, reinforcement learning extends to high-dimensional spaces, and its outstanding generalization ability is what MEC task offloading needs [9]. By modeling task offloading as a Markov Decision Process (MDP), researchers have designed reinforcement learning models for MEC task offloading in different ways. For a small-scale MEC system, model-based reinforcement learning methods can control the task offloading by modeling the MEC environment [10]. In large-scale MEC systems or mobile edge computing scenarios, model-free reinforcement learning technologies are more feasible [11].
The existing reinforcement learning-based offloading strategies adopt a centralized strategy for all edge servers, which incurs an additional communication cost. Although these methods can achieve low-latency resource allocation after training, the total cost is barely reduced. For this reason, decentralized offloading strategies are desirable for MEC systems, especially large-scale ones. In this work, we regard the edge servers as a multi-agent system and build a multi-agent reinforcement learning model for MEC task offloading. Considering a high-powered cloud server and several individual edge servers, we introduce the centralized training and decentralized execution (CTDE) mechanism [12] to learn the offloading strategies. Specifically, every edge server executes its individual offloading strategy, and the corresponding experiences are sent to the cloud server for centralized training. With that, the edge servers need no communication channel after training, so no additional cost occurs.
The main contributions of this work are summarized as follows:
• We discuss some shortcomings of the existing reinforcement learning-based offloading strategies in MEC. To address these shortcomings, we develop a new framework that improves task offloading by introducing the CTDE mechanism.
• We are the first to introduce the centralized training and decentralized execution mechanism into MEC systems, building a more feasible reinforcement learning model for MEC task offloading.
• We conduct several experiments on simulation platforms to compare our framework with several existing methods. The results show that our framework outperforms the baseline methods.
The rest of the paper is organized as follows: We first give the background in Section 2. In Section 3, we present the proposed reinforcement learning-based framework. In Section 4, we present and discuss the results of our experiments. Finally, we conclude in Section 5 and give directions for future research.

Motivation
The persistent development of the Internet of Things is driven by the support of 5G communication, cloud computing, multimedia, mobile computing, and the use of big data technologies to generate smart analytics value. To reduce network stress, edge computing shifts resources to the edge of the network, resulting in a lower transmission latency. Multi-access edge computing (MEC) was developed to achieve high quality of service (QoS) with low energy consumption for computation-intensive applications [13]. The multi-access system aims to address transmission latency and computing delay for video processing services, the Internet of Things, augmented reality, virtual reality, optimized local data caching, and many other use cases for Smart Cities [14].
Unlike traditional cloud computing, a MEC architecture offers several servers for offloading tasks, so the offloading strategy cannot be modeled as a single-objective optimization problem. One direct method is to regard all servers as a single server. However, such a paradigm does not decrease the optimization complexity. Some researchers modeled task offloading in MEC as an NP-hard Knapsack Problem, calculating the optimal solution by dynamic programming [15], the genetic algorithm [16], or other heuristic algorithms [17]. These methods require a vast number of iterations or matrix manipulations, so they can hardly satisfy the real-time requirement. For this problem, researchers sought a feasible method to learn a fixed strategy for task offloading. There are three requirements for the learning: the learned strategy is efficient enough for the complicated MEC architecture; the learned strategy responds to offloading requests as quickly as possible; and the learned strategy generates little additional cost in the execution. For these purposes, reinforcement learning has attracted intense scholarly interest. Reinforcement learning can execute real-time decision-making after training, and the learned policy needs no additional calculation beyond evaluating the policy network.
A further problem is that the state space and action space of a reinforcement learning model grow with the number of edge servers. In a general or small-scale MEC architecture, reinforcement learning-based methods perform well. However, once these methods are used to learn offloading strategies for a large-scale MEC architecture, the training cost becomes unaffordable.
When using multi-agent reinforcement learning methods to learn offloading strategies, the coupling among edge servers should be taken into account. Independent learning ignores the effect of other agents, resulting in misconvergence or suboptimal solutions. Centralized strategies lead to additional communication costs during the execution process. For these reasons, the existing MEC systems need a decentralized offloading strategy that preserves the coupling among edge servers.

Task Offloading in MEC
The development of task offloading methods in MEC can be divided into three phases. In earlier research, resource scheduling in MEC was modeled as a nondeterministic polynomial problem, and heuristic algorithms and subtask optimization played important roles. For example, Xiao et al. [18] divided the task into multiple subtasks, optimizing the placement of the edge computing server; Samanta et al. [19] developed a distributed and latency-optimal microservice scheduling mechanism, enhancing the quality of service (QoS) for real-time (mobile) applications; and Xu et al. [20] targeted an offloading framework for deep learning edge services, leveraging heuristic search to find an appropriate offloading strategy.
Considering that task offloading in MEC is continuous and the offloading process has the Markov property, researchers turned to reinforcement learning-based task offloading, leveraging the stability and generalization of deep reinforcement learning technologies. Chen et al. [11] utilized a deep Q-network to learn the optimal offloading strategy, minimizing the long-term cost; Wang et al. [21] leveraged a resource allocation scheme to improve the reinforcement learning model, learning the offloading strategy under changing MEC conditions; and Liu et al. [22] considered the channel conditions between the end devices and the gateway, using a reinforcement learning approach to solve the resource allocation problem.
As MEC systems became more and more complicated, the excessively large state space and action space drove researchers to look for a new paradigm. Some researchers have turned to multi-agent reinforcement learning, looking for a convenient MDP model. Liu et al. [23] regarded task offloading as a stochastic game, using independent learning agents to learn an offloading strategy for each edge server; Wang et al. [24] leveraged Multi-Agent Deep Deterministic Policy Gradient to train UAVs in a MEC architecture; and Munir et al. [25] considered a microgrid-enabled MEC network, using the asynchronous advantage actor-critic algorithm to train agents.

Centralized Training and Decentralized Execution
In the training process of a multi-agent reinforcement learning model, agents continuously interact with the environment to generate experiences, which are used to optimize their policy networks. Mainstream training algorithms use the actor-critic network [26], in which execution is separated from training. Building on this separation, some researchers developed the CTDE mechanism [12]. Specifically, when agents execute their policies, every agent makes decisions according to only its own observation or perception, while the optimization process takes into account all agents' perceptions and behaviors.
Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [12] is the typical multi-agent reinforcement learning model based on the CTDE mechanism. A multi-agent reinforcement learning experience with $N$ agents can be written as $\langle S, A, S', R \rangle$, where $S = [s_1, s_2, \ldots, s_N]$ is the joint state of all agents, $A = [a_1, a_2, \ldots, a_N]$ is the joint action under joint state $S$, $S' = [s'_1, s'_2, \ldots, s'_N]$ denotes the next states of these agents, and $R = [r_1, r_2, \ldots, r_N]$ is the rewards returned from the environment. Based on the actor-critic model, MADDPG uses an actor network to fit the policy of each agent, while a critic network is built to evaluate that policy. Let $\pi_i(\cdot \mid \theta_i^a)$ denote the policy network of the $i$-th agent, and let $Q_i(\cdot \mid \theta_i^c)$ denote the corresponding critic network. The goal of each agent is to maximize its expected reward $J(\theta_i^a) = \mathbb{E}_{s_i} R[s_i, a_i]$. In the sampling process, the $i$-th agent perceives the environment, obtains its state $s_i$, and then selects a behavior according to its policy, $a_i = \pi_i(s_i \mid \theta_i^a)$. After all agents have selected their actions, they execute the joint action $A = [a_1, a_2, \ldots, a_N]$, the environment returns the rewards $R = [r_1, r_2, \ldots, r_N]$, and the agents move to the next joint state $S' = [s'_1, s'_2, \ldots, s'_N]$. Then, the experience $\langle S, A, S', R \rangle$ is sent to the replay buffer.
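Once the replay buffer has received enough experiences, the policies can be optimized. For the $i$-th agent, its critic network estimates the Q-value of the joint state and joint action as $Q_i(S, A \mid \theta_i^c)$. The optimization objective of the critic network is to fit the discounted reward $\sum_{t=t_0}^{T} \gamma^{t-t_0} r_{i,t}$, where $\gamma$ is the discount factor. The terms $\gamma r_{i,t+1}, \gamma^2 r_{i,t+2}, \ldots, \gamma^{T-t_0} r_{i,T}$ can be replaced with $\gamma Q_i(S', A' \mid \theta_i^c)$, where $A' = [a'_1, \ldots, a'_N]$ and $a'_i = \pi_i(s'_i \mid \theta_i^a)$. Hence, the loss function of the $i$-th critic network is as follows:

$$L(\theta_i^c) = \mathbb{E}_{S, A, S', R}\left[\left(r_i + \gamma Q_i(S', A' \mid \theta_i^c) - Q_i(S, A \mid \theta_i^c)\right)^2\right].$$

As for the corresponding actor network, MADDPG uses the deterministic policy gradient in place of the stochastic policy gradient; that is, the actor network outputs a deterministic action rather than a distribution. With that, the gradient of the objective $J(\theta_i^a) = \mathbb{E}_{s_i} R[s_i, a_i]$ can be rewritten as follows:

$$\nabla_{\theta_i^a} J(\theta_i^a) = \mathbb{E}_{S, A}\left[\nabla_{\theta_i^a} \pi_i(s_i \mid \theta_i^a)\, \nabla_{a_i} Q_i(S, A \mid \theta_i^c)\big|_{a_i = \pi_i(s_i \mid \theta_i^a)}\right].$$

MADDPG allows every agent to make decisions according to only its own state, but the evaluation of the policy needs to consider the joint state and joint action. With the centralized training and decentralized execution mechanism, MADDPG achieves decentralized multi-agent execution, while a centralized communication channel is needed only for training.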
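The critic objective described in this subsection can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the MADDPG reference implementation: the critic and the policies are passed in as ordinary callables, target networks and gradient steps are omitted, and the names `joint_action` and `critic_loss` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
Policy = Callable[[State], float]  # deterministic: own state -> own action
Critic = Callable[[Sequence[State], Sequence[float]], float]  # (S, A) -> Q
Experience = Tuple[Sequence[State], Sequence[float], Sequence[State], Sequence[float]]


def joint_action(policies: Sequence[Policy], joint_state: Sequence[State]) -> List[float]:
    """Decentralized execution: each policy sees only its own state."""
    return [pi(s) for pi, s in zip(policies, joint_state)]


def critic_loss(critic: Critic, policies: Sequence[Policy],
                batch: Sequence[Experience], agent: int, gamma: float) -> float:
    """Mean squared TD error of one agent's centralized critic.

    Each experience is (S, A, S_next, R); the bootstrapped target is
    r_i + gamma * Q_i(S_next, A_next), with A_next produced by the policies.
    """
    total = 0.0
    for S, A, S_next, R in batch:
        A_next = joint_action(policies, S_next)
        target = R[agent] + gamma * critic(S_next, A_next)
        total += (target - critic(S, A)) ** 2
    return total / len(batch)
```

The asymmetry is visible in the signatures: a policy takes a single agent's state, whereas the critic takes the joint state and the joint action.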

The Proposed Method
In this work, we consider a multi-access edge computing architecture with a cloud server and several edge servers, developing a Decentralized Multi-Agent Reinforcement Learning-based offloading strategy (DMARL) for MEC. In this section, we first describe the proposed DMARL framework. Then, the specific learning process with the CTDE mechanism is given.

Learning Model for MEC
As shown in Figure 1, there are three layers in our DMARL model: a user layer with user devices, an edge layer with several edge servers, and a cloud layer with a cloud server. Using $N$ to represent the number of edge servers, we let $s^E_i = \langle w_i, f_i \rangle$ denote the state of the $i$-th server, in which $w_i$ is the work state of the server and $f_i$ represents its computation ability. In the user layer, the tasks published by the user devices constitute a task set $[t_1, t_2, \ldots, t_M]$, where $M$ is the maximal number of tasks. The $j$-th task in the task set is a tuple $\langle d_j, l_j, u_j, k_j \rangle$, where $d_j$ is the importance of the $j$-th task, $l_j$ is the transmission delay caused by the downlink and uplink, $u_j$ denotes the workload for executing the $j$-th task, and $k_j$ is the waiting time.
We develop a decentralized reinforcement learning model for task offloading, using actor-critic networks to learn the offloading strategies for the edge servers. We assign each edge server an actor network $\pi_i(\cdot \mid \theta_i^a)$ in the edge layer, and its critic network is deployed on the cloud server for centralized optimization. When a new task is published or an edge server completes its current task, the task set is encoded as $s^T$ and transmitted to the edge layer. Every edge server receives the encoded task set and combines $s^T$ with its server state $s^E_i$ to form its individual state $s_i = \langle s^T, s^E_i \rangle$. Then an action is selected by the actor network, $a_i = \pi_i(s_i \mid \theta_i^a)$, which is a deterministic policy. Each server has $M + 1$ selectable actions: offloading one of the $M$ tasks, or no offloading. The edge server executes its action while sending $s_i$ and $a_i$ to the cloud server. A reward $r_i$ is then generated by calculating the mean waiting time of the tasks. These processes are decentralized across the edge servers.
When all edge servers have conducted this operation and the next step is done, the cloud server combines the received data into an experience $\langle S, S', A, A', R \rangle$, where $S = (s_1, s_2, \ldots, s_N)$ and $A = (a_1, a_2, \ldots, a_N)$ are the joint state and joint action, $S'$ and $A'$ are the joint state and joint action at the next step, and $R = (r_1, r_2, \ldots, r_N)$. The experience is sent to the replay buffer, which is also in the cloud layer. Once the replay buffer holds enough experiences, it samples experiences to train the actor networks and critic networks. In the optimization process, the critic networks are updated in a centralized manner, while the experiences are transmitted to the edge layer to update the actor networks in a decentralized way. The specific update is described in the next subsection.
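The decentralized sampling step and the experience assembly on the cloud side can be sketched as follows. All function names are hypothetical. The actor is modeled here as a callable returning one score per action, and concatenation is only one possible way of combining the encoded task set with the server state.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Actor = Callable[[Sequence[float]], Sequence[float]]  # state -> M + 1 action scores


def individual_state(task_encoding: Sequence[float],
                     work_state: float, capability: float) -> List[float]:
    """Combine the encoded task set s^T with the server state <w_i, f_i>."""
    return list(task_encoding) + [work_state, capability]


def select_action(scores: Sequence[float]) -> int:
    """Deterministic choice among M + 1 actions; index M means 'no offloading'."""
    return max(range(len(scores)), key=scores.__getitem__)


def edge_step(task_encoding: Sequence[float], work_state: float, capability: float,
              actor: Actor, max_tasks: int) -> Tuple[List[float], int]:
    """One decentralized step: uses only local data, returns (s_i, a_i) for the cloud."""
    state = individual_state(task_encoding, work_state, capability)
    scores = actor(state)
    if len(scores) != max_tasks + 1:
        raise ValueError("actor must score M tasks plus the no-offloading action")
    return state, select_action(scores)


def assemble_experience(states, actions, rewards, next_states, next_actions) -> Dict[str, tuple]:
    """Cloud side: combine per-server data into <S, S', A, A', R>."""
    n = len(states)
    if not all(len(x) == n for x in (actions, rewards, next_states, next_actions)):
        raise ValueError("every component needs one entry per edge server")
    return {
        "S": tuple(tuple(s) for s in states),
        "S_next": tuple(tuple(s) for s in next_states),
        "A": tuple(actions),
        "A_next": tuple(next_actions),
        "R": tuple(rewards),
    }
```

Note that `edge_step` never reads another server's state; coupling between servers enters only through the joint experience assembled on the cloud server.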

Update and Training
Similar to MADDPG, the update of the edge servers' offloading strategies uses the CTDE mechanism. After the replay buffer receives enough experiences, the cloud server optimizes the critic networks, while the corresponding experiences and evaluations are sent to the edge servers to optimize the actor networks. The update of the critic networks is centralized, and that of the actor networks is decentralized.
An experience batch sampled from the replay buffer is denoted by $D$. The critic networks are used to estimate the Q-values, so the purpose of their optimization is to approximate the discounted reward $r + \gamma Q(s', a')$. Let $Q_i(\cdot \mid \theta_i^c)$ denote the critic network of the $i$-th edge server, and let $\pi_i(\cdot \mid \theta_i^a)$ denote the corresponding actor network. For the critic network of the $i$-th edge server, the loss function is calculated as follows:

$$L(\theta_i^c) = \mathbb{E}_{\langle S, S', A, A', R \rangle \sim D}\left[\left(r_i + \gamma Q_i(S', A' \mid \theta_i^c) - Q_i(S, A \mid \theta_i^c)\right)^2\right].$$

Then, the calculated $Q_i(S, A \mid \theta_i^c)$ is packaged with $(S, A)$, and they are sent to the edge layer to update the actor networks. A packaged experience for the actor network is $\langle S, A, Q_i(S, A \mid \theta_i^c) \rangle$. In the update of the actor networks, $Q_i(S, A \mid \theta_i^c)$ represents the reward for executing $a_i \in A$ at state $s_i \in S$. The policy gradient for the $i$-th actor network is as follows:

$$\nabla_{\theta_i^a} J(\theta_i^a) = \mathbb{E}_{D}\left[\nabla_{\theta_i^a} \log P(a_i \mid \pi(s_i \mid \theta_i^a))\, Q_i(S, A \mid \theta_i^c)\right],$$

where $P(a_i \mid \pi(s_i \mid \theta_i^a))$ is the probability of selecting $a_i$ at state $s_i$. With the proposed DMARL, edge servers have decentralized offloading strategies, which are more suitable for distributed systems. Moreover, we deploy the critic networks at the cloud layer while the actor networks are stored on the corresponding edge servers, resulting in a lower training cost and less additional transmission.
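The actor update described above weights the log-probability gradient of the executed action by the Q-value received from the cloud. A minimal sketch follows, assuming a softmax over $M + 1$ action logits as the selection probability $P$. This parameterization is an assumption made for illustration, since the text does not specify how $P$ is obtained from the actor output, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_logits(logits: Sequence[float], action: int, q_value: float) -> List[float]:
    """Gradient of Q * log P(action) with respect to the logits.

    For a softmax policy, d log P(a) / d logit_k = 1[k == a] - P(k).
    """
    probs = softmax(logits)
    return [q_value * ((1.0 if k == action else 0.0) - p) for k, p in enumerate(probs)]


def ascent_step(logits: Sequence[float], action: int, q_value: float,
                lr: float = 0.001) -> List[float]:
    """One gradient-ascent step on the logits using a packaged (action, Q) pair."""
    grad = policy_gradient_logits(logits, action, q_value)
    return [x + lr * g for x, g in zip(logits, grad)]
```

In this sketch, a positive Q-value raises the probability of the executed action and a negative one lowers it, so the edge server needs only its own state, its own action, and the scalar evaluation sent by the cloud.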
Next, we analyze the training cost of the proposed DMARL and other multi-agent reinforcement learning-based methods. Let $D_s$ denote the dimensionality of $s_i$ and $D_a$ the dimensionality of $a_i$. In centralized learning-based methods, the policy network can be regarded as a mapping from $N \cdot D_s$ dimensions to $D_a$ dimensions, and the evaluation network is also fed $(N \cdot D_s)$-dimensional data. In CTDE-based methods, the policy network is decentralized so that the corresponding input is $D_s$-dimensional, but the evaluation network needs an $(N \cdot D_s)$-dimensional input. The proposed DMARL models MEC systems as decentralized multi-agent reinforcement learning models so that the inputs of both the policy network and the evaluation network are $D_s$-dimensional, which results in a lower space cost. As for the computational cost and training time, decentralized networks require fewer floating-point operations than centralized networks. However, the total number of floating-point operations depends on the learning efficiency, which is difficult to quantify theoretically. In the next section, we present the performance of DMARL and the baseline methods under the same number of training episodes to investigate the learning efficiency.
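The input sizes stated in this comparison can be tabulated with a short sketch. The function name `input_dims`, the example value of $D_s$, and the use of the first-layer weight count as a rough proxy for space cost are assumptions for illustration. The dimensionalities follow the statements in the paragraph above.

```python
from typing import Dict


def input_dims(n_agents: int, d_s: int, hidden: int = 128) -> Dict[str, Dict[str, int]]:
    """Input dimensionality and first-layer weight count for each paradigm.

    'hidden' is the width of the first fully-connected layer; the weight
    count n_inputs * hidden is only a rough proxy for the space cost.
    """
    dims = {
        "centralized": {"policy": n_agents * d_s, "evaluation": n_agents * d_s},
        "ctde": {"policy": d_s, "evaluation": n_agents * d_s},
        "dmarl": {"policy": d_s, "evaluation": d_s},
    }
    return {
        name: {
            "policy_input": d["policy"],
            "evaluation_input": d["evaluation"],
            "first_layer_weights": (d["policy"] + d["evaluation"]) * hidden,
        }
        for name, d in dims.items()
    }


if __name__ == "__main__":
    # N = 8 matches the experiment setup; D_s = 12 is a made-up example value.
    for name, row in input_dims(n_agents=8, d_s=12).items():
        print(name, row)
```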

Experiments
In this section, several simulated experiments are conducted to validate our method. We first introduce the baseline methods used in these experiments; then, the experimental setups are given. Finally, we give and analyze the experiment results.

Baseline Methods
In the following paragraph, three task offloading methods are introduced, and they are chosen as the baselines to evaluate the proposed DMARL.
HEFT (Heterogeneous Earliest Finish Time) [27] is a heuristic algorithm that calculates the earliest finish time of executing each task on each server. Round-Robin [28] assigns tasks to servers in turn, ignoring the load and current state of each task. Round-Robin is also a heuristic method. PPO [29] is a reinforcement learning-based method that uses the Proximal Policy Optimization (PPO) algorithm to train the joint offloading strategy for edge servers. Different from our method, PPO regards just the task set as the state for servers.
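The two heuristic baselines can be illustrated with a small sketch. `round_robin` follows the description above directly. `earliest_finish_time` is a simplified greedy rule in the spirit of HEFT and not the full algorithm, which additionally ranks the tasks of a dependency graph before placing them. Both function names and the example numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def round_robin(num_tasks: int, num_servers: int) -> List[int]:
    """Assign task j to server j mod N, ignoring load and task size."""
    return [j % num_servers for j in range(num_tasks)]


def earliest_finish_time(workloads: Sequence[float],
                         speeds: Sequence[float]) -> Tuple[List[int], float]:
    """Greedy placement: each task goes to the server that would finish it first.

    Returns the assignment and the makespan (time at which the last server is done).
    Ties are broken in favor of the lowest server index.
    """
    ready = [0.0] * len(speeds)  # time at which each server becomes free
    assignment = []
    for w in workloads:
        finish = [ready[i] + w / speeds[i] for i in range(len(speeds))]
        best = min(range(len(speeds)), key=finish.__getitem__)
        ready[best] = finish[best]
        assignment.append(best)
    return assignment, max(ready)


if __name__ == "__main__":
    print(round_robin(5, 2))
    print(earliest_finish_time([10.0, 10.0, 10.0], [5.0, 10.0]))
```

Unlike a learned policy, the greedy rule recomputes finish times for every server at each decision, which is the per-decision cost the paper contrasts with evaluating a trained actor network.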

Experiment Setup
In the simulated experiment, we set eight edge servers at the edge layer (N = 8), with a cloud server in the cloud layer. A user device publishes tasks continuously, and the maximal number of tasks is 10 (M = 10). The computing capabilities of the eight servers are 5, 10, 10, 15, 15, 20, 20, and 25 GHz, and the transmission latency is sampled from a uniform distribution over [1, 20] ms. The number of CPU cycles required by a task ranges from 1 to 10 megacycles.
As for the reinforcement learning model, we set the learning rate to 0.001, and the discount factor γ is 0.9. The encoder for the task set is a multi-layer perceptron with one hidden layer (64 nodes). The actor network and critic network consist of fully-connected layers with 128 neurons. The capacity of the replay buffer is set to 10,000, while the batch size is 128.
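For reference, the reported settings can be collected in one configuration object together with a minimal replay buffer. The class and field names are hypothetical, the values are those stated above, and the buffer is a generic sketch rather than the authors' implementation.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Settings reported in the experiment setup (field names are illustrative)."""
    num_servers: int = 8
    max_tasks: int = 10
    server_ghz: Tuple[int, ...] = (5, 10, 10, 15, 15, 20, 20, 25)
    latency_ms_range: Tuple[int, int] = (1, 20)
    task_megacycles_range: Tuple[int, int] = (1, 10)
    learning_rate: float = 0.001
    gamma: float = 0.9
    encoder_hidden: int = 64
    network_hidden: int = 128
    buffer_capacity: int = 10_000
    batch_size: int = 128


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest experience is dropped when full."""

    def __init__(self, capacity: int) -> None:
        self._items: deque = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._items)

    def add(self, experience: Any) -> None:
        self._items.append(experience)

    def ready(self, batch_size: int) -> bool:
        """True once enough experiences are stored to draw a full batch."""
        return len(self._items) >= batch_size

    def sample(self, batch_size: int) -> List[Any]:
        """Uniformly sample a batch without replacement."""
        return random.sample(list(self._items), batch_size)
```

A training loop would typically call `ready(config.batch_size)` before each update so that optimization starts only after enough experiences have arrived from the edge layer.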

Result and Analysis
The experiment results are given in Table 1, which records the average of the total waiting times $r_1, r_2, \ldots, r_N$ after training. From Table 1, we can see that the performance of the reinforcement learning-based methods is related to the number of training episodes. At the beginning of the training (0-10,000 episodes), both the proposed DMARL and the PPO-based method fall behind, and HEFT obtains the best result. As the episodes increase, our method and the PPO-based method can rival the heuristic algorithm. At the end of the training (100,000 episodes), the proposed DMARL achieves the best result: the total waiting time is just 1586 ms. Compared with heuristic methods, our method needs less time to execute the learned policy, and the learned offloading strategies can adapt in most cases, owing to the generalization of neural networks. Moreover, the proposed DMARL utilizes the CTDE mechanism to train offloading strategies, capturing the coupling among edge servers through centralized critic networks, so that it outperforms the PPO-based method. As an independent learning method, the PPO-based method has the advantage of a quick learning speed, benefiting from decoupling the relationship among different agents. However, this decoupling makes it difficult for an edge server to obtain an accurate estimation of the Q-value, so some correct behavior may be punished because of other edge servers' incorrect policies. The policies learned by the PPO-based method tend to converge to a local optimum or misconverge. In contrast, our DMARL uses the CTDE mechanism to preserve the coupling of different agents, resulting in an accurate estimation of the Q-value. That is why our DMARL method outperforms the PPO-based method at the end of the training.
To show the superiority of our DMARL more convincingly, we calculate the average collision rates (a collision means that two or more edge servers offload the same task) for the proposed method and the PPO-based method. To compare the algorithms, we report the performance averaged across 10 runs, which ensures that we are not simply reporting a result due to stochasticity. The results are given in Figure 2. For all edge servers in the MEC system, our DMARL achieves a lower collision rate than the PPO-based method. The reason is that the PPO-based method is an extension of the independent learning mechanism, neglecting the relationship among servers. Our DMARL connects these servers' behaviors in the process of evaluation, which preserves their relationship to some extent. The relationship among servers ensures their coordination. Once a server ignores others' policies, it cannot accurately judge others' behaviors when making a decision. Likewise, other servers cannot avoid collisions by predicting that server's behavior. However, the CTDE mechanism estimates the Q-value in a centralized way, and the coordination of servers arises during the unbiased training, so collisions hardly occur.
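The collision metric can be made concrete with a short sketch. It assumes that the actions of all servers are logged per step as task indices and that one reserved index encodes "no offloading". The function name and the log format are hypothetical, and the per-server rate is taken here as the fraction of steps in which that server chose a task also chosen by at least one other server.

```python
from collections import Counter
from typing import List, Sequence


def collision_rates(action_log: Sequence[Sequence[int]], no_offload: int) -> List[float]:
    """Per-server fraction of steps in which the server collided.

    action_log[t][i] is the action of server i at step t; 'no_offload'
    is the index of the no-offloading action, which never collides.
    """
    num_servers = len(action_log[0])
    collided = [0] * num_servers
    for actions in action_log:
        counts = Counter(a for a in actions if a != no_offload)
        for i, a in enumerate(actions):
            if a != no_offload and counts[a] > 1:
                collided[i] += 1
    return [c / len(action_log) for c in collided]


if __name__ == "__main__":
    # Three servers, three steps; action 3 means "no offloading" (M = 3).
    log = [[0, 0, 2], [1, 2, 2], [3, 3, 3]]
    print(collision_rates(log, no_offload=3))
```

Averaging these per-server rates over several independent runs, as done for Figure 2, reduces the influence of a single lucky or unlucky seed.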

Conclusions
Aiming at a decentralized offloading strategy for MEC systems, this work leverages the CTDE mechanism to train the offloading strategy of each edge server. Modeling task offloading as a reinforcement learning model, we separate the sampling process from the update process. The proposed DMARL method is based on the actor-critic algorithm; however, we deploy the critic networks at the cloud layer, while the actor networks are in the edge layer. With the centralized training and decentralized execution mechanism, edge servers preserve their coupling through the centralized critic networks. The simulated experiment shows that the proposed DMARL outperforms heuristic methods and other reinforcement learning-based methods.

Figure 1. The proposed architecture, in which the centralized critic networks are deployed at the cloud layer, and the actor networks are deployed at the edge layer.

Figure 2. The average collision rates for edge servers.

Table 1. The results for the experiments.