Article

A Task Offloading and Resource Allocation Strategy Based on Multi-Agent Reinforcement Learning in Mobile Edge Computing

1 School of Artificial Intelligence Technology, Guangxi Technological College of Machinery and Electricity, Nanning 530007, China
2 School of Information Engineering, Guangxi Vocational University of Agriculture, Nanning 530007, China
3 School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(9), 333; https://doi.org/10.3390/fi16090333
Submission received: 15 August 2024 / Revised: 3 September 2024 / Accepted: 9 September 2024 / Published: 11 September 2024
(This article belongs to the Special Issue Convergence of Edge Computing and Next Generation Networking)

Abstract: Task offloading and resource allocation is a research hotspot in cloud-edge collaborative computing. Many existing studies adopt single-agent reinforcement learning to solve this problem, which suffers from defects such as low robustness, a large decision space, and neglect of delayed rewards. In view of these deficiencies, this paper constructs a cloud-edge collaborative computing model, together with the associated task queue, delay, and energy consumption models, and formulates a joint optimization problem for task offloading and resource allocation with multiple constraints. To solve this joint optimization problem, the paper designs a decentralized offloading and scheduling scheme based on "task-oriented" multi-agent reinforcement learning. In this scheme, we present an information synchronization protocol and offloading scheduling rules and use edge servers as agents to construct a multi-agent system based on the Actor–Critic framework. To handle delayed rewards, the offloading and scheduling problem is modeled as a "task-oriented" Markov decision process, which abandons the commonly used equidistant time slot model in favor of dynamic, parallel time slots whose length follows the task processing time. Finally, an offloading decision algorithm, TOMAC-PPO, is proposed. The algorithm applies proximal policy optimization to the multi-agent system and combines it with the Transformer neural network model to realize memory and prediction of network state information. Experimental results show that the algorithm converges faster and can effectively reduce service cost, energy consumption, and task drop rate under high load and high failure rates. For example, the proposed TOMAC-PPO can reduce the average cost by 19.4% to 66.6% compared to other offloading schemes under the same network load. In addition, the drop rate of some baseline algorithms for critical tasks reaches 62.5% with 50 users, whereas the proposed TOMAC-PPO achieves only 5.5%.

1. Introduction

Research shows that the number of Internet of Things (IoT) devices is expected to reach 30.9 billion by 2025, and the amount of data generated will exceed 175 zettabytes [1,2]. The number of mobile devices is growing exponentially, and the communication and computing capabilities of devices face serious challenges [3,4]. As one of the candidate solutions, cloud computing usually causes high delay and privacy leakage because it is deployed far from users [5,6]. Mobile edge computing (MEC) aims to make up for these shortcomings of cloud computing by decentralizing computing resources to the network edge, close to users, and providing mobile users with short-distance, low-delay computing services. Although computation offloading has been a research hotspot in the MEC field, the majority of existing schemes have the following problems: (1) Many studies fail to reflect the actual situation, ignoring terminal mobility management and network failures, as well as the information observation and synchronization of network equipment. (2) Some studies use reinforcement learning (RL) to solve the offloading decision problem but do not consider the training difficulties that may be caused by the delayed-reward nature of the MEC environment [7]. (3) Although some studies take the distributed characteristics of MEC into account, they fail to decompose the decision problem and instead apply single-agent RL to the decision problem of the whole system, which may lead to an overly large decision space or difficulty in training convergence.
Given the uncertainties faced in real network environments, such as system failures and users moving across regions, ensuring reliable communication and collaboration among MEC network devices and efficiently handling task offloading and resource allocation on edge servers are challenges currently faced in the MEC field. Therefore, this paper proposes a distributed task offloading and resource allocation scheme based on "task-oriented" multi-agent reinforcement learning, which can effectively reduce the system's average cost, delay, energy consumption, and task drop rate. The main contributions of this paper are as follows:
(1)
This paper constructs a cloud-edge collaborative computing model, together with the associated task queue, delay, and energy consumption models, and formulates a joint optimization problem for task offloading and resource allocation with multiple constraints.
(2)
In order to solve the joint optimization problem, this paper designs a decentralized task offloading and resource allocation scheme based on "task-oriented" multi-agent reinforcement learning. In this scheme, we present an information synchronization protocol and offloading scheduling rules and use edge servers as agents to construct a multi-agent system based on the Actor–Critic framework.
(3)
An offloading decision algorithm TOMAC-PPO (Task-Oriented Multi-Agent Collaborative-Proximal Policy Optimization) is proposed. The algorithm applies the proximal policy optimization to the multi-agent system and combines the Transformer neural network model to realize the memory and prediction of network state information. Experimental results show that this algorithm has better convergence speed, and can effectively reduce the service cost, energy consumption, and task drop rate under high load and high failure rates.
The structure of this paper is as follows. Section 2 introduces the related works; Section 3 establishes a cloud-edge collaborative model; Section 4 proposes a distributed task offloading and resource allocation scheme based on task-oriented multi-agent reinforcement learning; Section 5 verifies the performance of the algorithm proposed in this paper; Section 6 summarizes this paper and gives future research directions.

2. Related Works

Many scholars have conducted extensive research on optimizing the task offloading and scheduling of MEC networks to improve network service quality. Traditional offloading optimization schemes usually establish mathematical programming models to solve the optimization problem. For example, the authors of [8] established a 0-1 integer programming model for the joint optimization of energy consumption and delay, and the authors of [9] designed a service request distribution method based on game theory. However, such schemes ignore device mobility and information observation, making it difficult to adapt to dynamic, highly random, large-scale network environments. The authors of [10] used a filling method based on linear programming to solve the resource allocation problem and jointly optimized it with a dynamic pricing problem. The authors of [11] proposed an asynchronous computing framework and used the generalized Benders decomposition method to decompose and iteratively solve the user scheduling and resource allocation problem. These traditional optimization schemes typically have high operational efficiency and a solid theoretical foundation but often require accurate system state information, which is challenging to obtain in practical MEC systems. Moreover, traditional optimization schemes often struggle to cope with the high-dimensional decision spaces of such problems due to their high computational complexity.
Artificial intelligence solutions utilizing neural networks have proven effective for complex task offloading and resource allocation problems, and RL techniques are often used to solve offloading decision problems. Because the fundamental idea of RL is to let an agent interact with its environment and learn the optimal strategy to achieve its goal, RL is well suited to offloading decision problems.
In recent years, traditional RL methods have gradually been replaced by deep reinforcement learning (DRL). For example, a DRL method was used in [12] to solve offloading decisions while also addressing the allocation of bandwidth, cache, and computing power. The authors of [13] combined DRL for solving the optimal offloading strategy with federated learning methods to address data privacy issues. Compared with the previous optimization methods, this type of scheme does not require building a mathematical model and can autonomously explore the optimal offloading strategy without prior knowledge, making it more suitable for the high-dimensional state spaces of large-scale networks. However, these studies typically require a significant amount of time and data to train neural networks, and they all adopt centralized decision architectures, resulting in a high dependence on the central control node and leading to low robustness, poor scalability, and other issues. In addition, DRL has been applied to unmanned aerial vehicles (UAVs) for navigation, trajectory planning, and radio resource management [14,15].
The performance differences between traditional Q-learning and deep Q network (DQN) algorithms in solving task offloading scheduling problems were compared in [16]. However, such schemes do not consider a distributed architecture, which reduces robustness while increasing training difficulty. Therefore, the authors of [17] adopted the asynchronous advantage Actor–Critic algorithm to achieve a distributed architecture, which accelerates training while reducing the correlation between state transition samples. However, unlike parallel RL architectures such as A3C, the nodes in an MEC network are not in independent environments that do not affect each other. Distributed offloading decision-making is therefore a multi-agent problem, and using traditional single-agent or parallel RL schemes may make it difficult for the algorithm to converge [18]. Consequently, some scholars have begun to use multi-agent reinforcement learning (MARL) to solve task offloading decision problems. For example, reference [19] proposed a multi-agent DQN algorithm based on value decomposition for the task offloading strategy problem, and the authors of [20] used the multi-agent deep deterministic policy gradient (MADDPG) algorithm to solve the multi-agent task offloading problem. However, existing research on DRL-based offloading decisions has not taken into account the training difficulties that may arise from reward delay in MEC network environments.

3. Preliminaries: Network Model and Problem Definition

3.1. Cloud-Edge Collaboration Model for MEC Network

As shown in Figure 1, the cloud-edge collaboration model can be divided into a user layer, an edge layer, and a cloud layer. The user layer is composed of the m mobile devices used by all users in the service area and is denoted by $UD = \{ud_1, ud_2, \ldots, ud_m\}$. User equipment may move and generate computing tasks at any time. The edge layer includes n edge nodes, represented as $EN = \{en_1, en_2, \ldots, en_n\}$. Each node consists of a wireless base station and an edge server.

3.2. Task Queue Description

The queue model of this paper is shown in Figure 2. Each user device maintains a cache, a computing queue, and a transmission queue. After a task is generated, it first enters the cache to wait for an offloading decision. Once the offloading decision is determined, the task either enters the computing queue for local computing or enters the transmission queue to be offloaded to an edge-layer device. The lengths of the cache, computing queue, and transmission queue of $ud_i$ are denoted $q_i^{\mathrm{cache}}$, $q_i^{\mathrm{comp}}$, and $q_i^{\mathrm{tran}}$, respectively, and the available cache capacity of a user device is $R^{\mathrm{ud}}$.
Each edge server maintains a wireless transmission queue, a wired transmission queue, and $num_{\mathrm{core}}$ computing queues, where $num_{\mathrm{core}}$ is the number of CPU cores in the edge server. If a task is offloaded to an edge server, it either enters the server's computing queue or enters the server's wired transmission queue to be forwarded to other nodes or to the cloud. If the task is completed on an edge server or the cloud server, the computing result is sent back to the user through the wireless transmission queue of the edge server. The lengths of the computing queue and transmission queue of $en_j$ are $q_j^{\mathrm{en\text{-}comp}}$ and $q_j^{\mathrm{en\text{-}tran}}$, respectively, the cache capacity of a single edge server is $R^{\mathrm{en}}$, and all queues comply with the FIFO (First In First Out) discipline.
Because different tasks have different priorities, a simple FIFO queue can hardly meet practical needs. Therefore, this paper allows computing devices to insert tasks into queues by priority when receiving them. When the decision node makes an offloading decision for task $x_t^i$, the priority weight $qf_{i,t}$ of the task is determined. Task $x_t^i$ is queued according to weight $qf_{i,t}$ whenever it enters a queue, and the higher the weight, the higher the priority.
To further alleviate queue congestion and improve throughput and delay performance, this paper also introduces an active queue management mechanism, described as follows (a minimal sketch is given after the list):
  • If the average queue length is less than the minimum threshold $th_{min}$, newly arrived tasks are pushed into the queue;
  • If the average queue length is greater than the minimum threshold $th_{min}$ and less than the maximum threshold $th_{max}$, newly arrived tasks are randomly dropped with probability $p_{\mathrm{drop}}$;
  • If the average queue length has reached the maximum threshold $th_{max}$, newly arrived tasks are dropped.
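To make the admission rule concrete, the following is a minimal Python sketch of the threshold check, assuming an exponentially weighted moving average of the queue length; the function names, the smoothing weight, and the use of $p_{\mathrm{drop}}$ from Table 1 are our illustrative choices, not the authors' implementation.

```python
import random

P_DROP = 1 / 50          # p_drop from Table 1 (RED-style drop probability)
EWMA_WEIGHT = 0.2        # smoothing weight for the average queue length (assumed)

def admit_task(avg_qlen: float, th_min: float, th_max: float) -> bool:
    """Return True if a newly arrived task may enter the queue."""
    if avg_qlen < th_min:                 # lightly loaded: always admit
        return True
    if avg_qlen < th_max:                 # between thresholds: drop with probability p_drop
        return random.random() >= P_DROP
    return False                          # at or above th_max: drop the task

def update_avg_qlen(avg_qlen: float, current_qlen: int) -> float:
    """EWMA update of the average queue length, as in RED-style active queue management."""
    return (1 - EWMA_WEIGHT) * avg_qlen + EWMA_WEIGHT * current_qlen
```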

3.3. Task Model

Let $x_t^i$ denote the task generated by mobile device $ud_i$ at time $t$, with deadline $\tau_{i,t}$. The 0-1 variable $D_{i,t}^k$ ($0 \le k \le n+2$) indicates the offloading status of task $x_t^i$, as follows:

$$D_{i,t}^k = \begin{cases} 1, & \text{the offloading target of } x_t^i \text{ is } en_k \\ 0, & \text{the offloading target is not } en_k \end{cases}, \quad 1 \le k \le n$$

Here, if $D_{i,t}^0 = 1$, the task $x_t^i$ is executed on the local device. If $D_{i,t}^{n+1} = 1$, $x_t^i$ is offloaded to the cloud server. If $D_{i,t}^{n+2} = 1$, $x_t^i$ is discarded by the user. The offloading decision vector $D_{i,t} = \left(D_{i,t}^0, D_{i,t}^1, \ldots, D_{i,t}^{n+2}\right)$ uniquely determines the offloading direction of task $x_t^i$. When task $x_t^i$ is offloaded to a node $en_j$, the weight of the CPU computing resources allocated to it is denoted $ff_{i,t}^j$. At run time, the server dynamically allocates its computing resources to each task according to these weights. Specifically, the computing power $f_{i,t}^j$ occupied by task $x_t^i$ when executed on node $en_j$ can be expressed as follows:

$$f_{i,t}^j = \frac{ff_{i,t}^j}{ff_{\mathrm{sum}}^j} f_{\mathrm{edge}}$$

where $f_{\mathrm{edge}}$ is the computing power of a single edge node and $ff_{\mathrm{sum}}^j$ is the total computing-power weight of all tasks being computed on node $en_j$.
This paper categorizes computing tasks into four types based on their requirements for security and delay: high-priority tasks, critical tasks, low-priority tasks, and regular tasks. The specific classification is as follows:
(1)
High-priority tasks have strict delay requirements; they are related to security and have hard requirements on the task completion rate.
(2)
Critical tasks are computing tasks that require extremely high security; their delay requirements can be appropriately relaxed, but they cannot be discarded.
(3)
Low-priority tasks are not related to security but arise in scenarios that require energy savings, or are tasks with excessive data volume and computation time.
(4)
All computing tasks other than the above three types are regular tasks.

3.4. Network Communication and Observation Model

Suppose the radio channel bandwidth resource of a single base station is $B$ and its service radius is $r$. We assume the base station has dynamic channel allocation capability. Let $bf_{i,j}$ denote the bandwidth weight allocated by node $en_j$ to user $ud_i$; the allocated bandwidth $B_{i,j}$ can then be expressed as follows:

$$B_{i,j} = \frac{bf_{i,j}}{\sum_{x=1}^{m} bf_{x,j}} B$$

Here, if user $ud_i$ leaves the service area, then $bf_{i,j} = 0$. Formula (3) gives the proportion of the base station's total bandwidth allocated to a user. The maximum data transmission rate between $ud_i$ and $en_j$ can therefore be expressed as follows:

$$C_{i,j} = B_{i,j} \log_2\!\left(1 + \frac{P\, G_{i,j}}{\sigma^2}\right)$$

where $P$ is the transmission power of the sender's equipment, $G_{i,j}$ is the channel gain between $ud_i$ and $en_j$, and $\sigma^2$ is the Gaussian noise power of the channel.
Electromagnetic waves propagate at the speed of light $V_{\mathrm{space}}$ in the service area. For wired communication, the link transmission rate is $C_{\mathrm{fiber}}$ and the link propagation speed is $V_{\mathrm{fiber}}$. Each link fails with probability $p_{\mathrm{fiber}}$ per minute, and communication capability is recovered after a maintenance time $T_{\mathrm{repair}}$.
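As a concrete illustration of Formulas (3) and (4), the sketch below computes a user's bandwidth share and the resulting maximum data rate; the bandwidth and noise-power values follow Table 1 (the sign of the noise-power exponent is our assumption), and the helper names are ours.

```python
import math

B = 20e6          # base station bandwidth (Hz), Table 1
SIGMA2 = 1.5e-8   # Gaussian noise power (W), Table 1 (exponent sign assumed)

def allocated_bandwidth(bf_ij: float, bf_all_j: list[float]) -> float:
    """Bandwidth B_{i,j}: user i's weight divided by the sum of all weights at node j (Formula (3))."""
    total = sum(bf_all_j)
    return bf_ij / total * B if total > 0 else 0.0

def max_data_rate(b_ij: float, p_tx: float, gain_ij: float) -> float:
    """Shannon capacity C_{i,j} in bit/s (Formula (4))."""
    return b_ij * math.log2(1 + p_tx * gain_ij / SIGMA2)

# Example: a user holding 1 of 4 equal weights, transmitting at 0.2 W with channel gain 1e-6
b = allocated_bandwidth(1.0, [1.0, 1.0, 1.0, 1.0])
print(max_data_rate(b, 0.2, 1e-6) / 1e6, "Mbit/s")
```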
The MEC network model considered in this paper also involves partial observability. A single device can only observe its local state, and the global state of the network can only be obtained through synchronous communication between devices. In addition, the computing density and the size of the computation result of a task cannot be estimated by the user equipment.

3.5. Network Delay and Energy Consumption Model

When task $x_t^i$ generated by $ud_i$ is executed on any computing device at time $t$, the computation delay can be expressed as follows:

$$T_{i,t}^{\mathrm{comp}} = \frac{\lambda_{i,t}\, \rho_{i,t}}{f}$$

where $\lambda_{i,t}$ is the amount of input data of $x_t^i$ and $\rho_{i,t}$ is the computing density of $x_t^i$, that is, the number of CPU cycles required to process each bit of data; $f$ is the CPU frequency of the computing device.
The sum of the transmission delay and propagation delay generated by communication between $ud_i$ and $en_j$ can be expressed as follows:

$$T_{i,j}^{\mathrm{space}} = \frac{\lambda}{C_{i,j}} + \frac{d_{i,j}}{V_{\mathrm{space}}}$$

where $\lambda$ is the size of the sent data and $d_{i,j}$ is the distance between $ud_i$ and $en_j$.
The sum of the transmission delay and propagation delay generated by communication between adjacent edge nodes can be expressed as follows:

$$T^{\mathrm{fiber}} = \frac{\lambda}{C_{\mathrm{fiber}}} + \frac{d_{\mathrm{edge}}}{V_{\mathrm{fiber}}}$$

where $d_{\mathrm{edge}}$ is the distance between adjacent nodes.
The delay caused by communication between an edge node and the cloud server can be expressed as follows:

$$T^{\mathrm{cloud}} = \frac{\lambda}{C_{\mathrm{fiber}}} + \frac{d_{\mathrm{cloud}}}{V_{\mathrm{fiber}}} + T_{\mathrm{delay}}^{\mathrm{cloud}}$$

where $d_{\mathrm{cloud}}$ is the distance between edge nodes and the cloud server, and $T_{\mathrm{delay}}^{\mathrm{cloud}}$ is the forwarding delay of the core network.
The energy consumed by local computation of $x_t^i$ can be expressed as follows:

$$E_{i,t}^{\mathrm{comp}} = \kappa\, \lambda_{i,t}\, \rho_{i,t} \left(f_i^{\mathrm{user}}\right)^2$$

where $\kappa$ is the energy consumption efficiency coefficient of the user equipment and $f_i^{\mathrm{user}}$ is the CPU frequency of $ud_i$. If data are sent from $ud_i$ to $en_j$, the duration $T_{i,j}^{\mathrm{space}}$ can be calculated by (6), and the energy consumed by the user equipment during this period can be expressed as follows:

$$E_{i,j}^{\mathrm{tran}} = P_i\, T_{i,j}^{\mathrm{space}}$$

where $P_i$ is the transmission power of $ud_i$.
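The delay and energy expressions in (5)–(10) compose directly into per-path estimates; the following minimal sketch restates them in Python, with the propagation speeds and the wired rate taken from Table 1 and all argument names being our own.

```python
V_SPACE = 3e8      # m/s, wireless propagation speed (Table 1)
V_FIBER = 2e8      # m/s, propagation speed in fibre (Table 1)
C_FIBER = 1000e6   # bit/s, wired transmission rate (Table 1)

def comp_delay(lam: float, rho: float, f: float) -> float:
    """Computation delay (5): input bits * cycles per bit / CPU frequency."""
    return lam * rho / f

def wireless_delay(lam: float, c_ij: float, d_ij: float) -> float:
    """Transmission + propagation delay between user and edge node (6)."""
    return lam / c_ij + d_ij / V_SPACE

def fiber_delay(lam: float, d_edge: float) -> float:
    """Delay between adjacent edge nodes over the wired link (7)."""
    return lam / C_FIBER + d_edge / V_FIBER

def cloud_delay(lam: float, d_cloud: float, t_core: float) -> float:
    """Delay from an edge node to the cloud, including core-network forwarding (8)."""
    return lam / C_FIBER + d_cloud / V_FIBER + t_core

def local_energy(kappa: float, lam: float, rho: float, f_user: float) -> float:
    """Local computation energy (9): kappa * required CPU cycles * f^2."""
    return kappa * lam * rho * f_user ** 2

def tx_energy(p_i: float, t_space: float) -> float:
    """Uplink transmission energy (10): transmit power * transmission duration."""
    return p_i * t_space
```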

3.6. Offloading Decision and Resource Allocation Problem Modeling

The cost of a computing task in this paper can be expressed as follows:

$$Z\!\left(x_t^i\right) = \varphi_1 \frac{T_{i,t}^{\mathrm{total}}}{\tau_{i,t}} + \varphi_2 \frac{E_{i,t}^{\mathrm{total}}}{\mu_1\, \lambda_{i,t}\, \rho_{i,t}} + drop_{i,t}$$

where $T_{i,t}^{\mathrm{total}}$ is the total processing delay of task $x_t^i$, $E_{i,t}^{\mathrm{total}}$ is its total energy consumption, $\rho_{i,t}$ is its computing density, $\lambda_{i,t}$ is its amount of input data, and $\tau_{i,t}$ is its tolerable delay upper bound. The constant $\mu_1$ is a scaling factor chosen such that $0 < \frac{E_{i,t}^{\mathrm{total}}}{\mu_1 \lambda_{i,t} \rho_{i,t}} \le 1$, which achieves a normalization effect. The variable $drop_{i,t}$ is defined as follows:

$$drop_{i,t} = \begin{cases} 1, & x_t^i \text{ is discarded} \\ 0, & x_t^i \text{ completes the calculation} \end{cases}$$

For each task, $\varphi$ is set as follows. The optimization target of high-priority tasks does not include energy consumption, so $\varphi_2 = 0$ for high-priority tasks. The optimization target of critical tasks only considers the drop rate, so $\varphi_1 = \varphi_2 = 0$ for critical tasks. The optimization goal of low-priority tasks does not include delay, so $\varphi_1 = 0$ for low-priority tasks. The optimization of regular tasks comprehensively considers delay, energy consumption, and drop rate, so $\varphi_1 = \varphi_2 = 1$ for regular tasks.
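A short sketch of the per-task cost $Z(x_t^i)$ in (11) with the per-type weight settings described above; the field names and the default value of $\mu_1$ are placeholders.

```python
from dataclasses import dataclass

# Per-type (phi_1, phi_2) weights as described above:
# high-priority ignores energy, critical only counts drops, low-priority ignores delay.
PHI = {
    "high":     (1.0, 0.0),
    "critical": (0.0, 0.0),
    "low":      (0.0, 1.0),
    "regular":  (1.0, 1.0),
}

@dataclass
class TaskResult:
    task_type: str      # "high", "critical", "low", or "regular"
    t_total: float      # total processing delay T^total
    e_total: float      # total user energy E^total
    lam: float          # input data volume (bits)
    rho: float          # computing density (cycles/bit)
    deadline: float     # tolerable delay tau
    dropped: bool       # whether the task was discarded

def task_cost(r: TaskResult, mu1: float = 1.0) -> float:
    """Cost Z(x_t^i) in (11): normalized delay + normalized energy + drop indicator."""
    phi1, phi2 = PHI[r.task_type]
    drop = 1.0 if r.dropped else 0.0
    return phi1 * r.t_total / r.deadline + phi2 * r.e_total / (mu1 * r.lam * r.rho) + drop
```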
The goal of this paper is to optimize the task offloading decision of each decision-making device and the bandwidth allocation decision of each server to the user equipment so as to minimize the sum of all task costs in the system over a long period of time, thereby reducing network delay, drop rate, and user energy consumption. The optimization problem can be modeled as follows:

$$\begin{aligned}
\min_{D,\, ff,\, qf,\, bf} \quad & \sum_{1 \le i \le m}\ \sum_{t \in T_i^{\mathrm{task}}} Z\!\left(x_t^i\right) \\
\text{s.t.} \quad
& \text{C1:}\ D_{i,t}^k \in \{0,1\}, \quad \forall i \le m,\ t \in T_i^{\mathrm{task}} \\
& \text{C2:}\ \sum_{k=0}^{n+2} D_{i,t}^k = 1, \quad \forall i \le m,\ t \in T_i^{\mathrm{task}} \\
& \text{C3:}\ ff_{i,t}^j \in (0,1], \quad \text{if } x_t^i \text{ is calculated in } en_j,\ \forall i \le m,\ t \in T_i^{\mathrm{task}} \\
& \text{C4:}\ qf_{i,t} \in (0,1], \quad \forall i \le m,\ t \in T_i^{\mathrm{task}} \\
& \text{C5:}\ bf_{i,j} \in (0,1], \quad \text{for communication between } ud_i \text{ and } en_j,\ \forall i \le m,\ j \le n \\
& \text{C6:}\ q_i^{\mathrm{cache}} + q_i^{\mathrm{comp}} + q_i^{\mathrm{tran}} \le R_i^{\mathrm{ud}}, \quad \forall i \le m \\
& \text{C7:}\ q_j^{\mathrm{en\text{-}comp}} + q_j^{\mathrm{en\text{-}tran}} \le R^{\mathrm{en}}, \quad \forall j \le n \\
& \text{C8:}\ T_{i,t}^{\mathrm{total}} \le \tau_{i,t}, \quad \forall i \le m,\ t \in T_i^{\mathrm{task}} \\
& \text{C9:}\ d_{i,j} \le r, \quad \text{for communication between } ud_i \text{ and } en_j,\ \forall i \le m,\ j \le n
\end{aligned}$$

where $D$ is the set of offloading decision vectors $D_{i,t}$ of all tasks, $ff$ is the set of computing-power allocation weights $ff_{i,t}^j$ of all tasks, $qf$ is the set of queue priority weights of all tasks, and $bf$ is the set of bandwidth allocation weights $bf_{i,j}$ of all users. $T_{i,t}^{\mathrm{total}}$ is the total delay of task processing, and the set $T_i^{\mathrm{task}}$ contains all the moments at which tasks are generated within the system runtime. Constraint C1 indicates that each task has only two states with respect to each candidate target device: offloaded to it or not. Constraint C2 indicates that each task has exactly one offloading decision result. Constraints C6 and C7 indicate that the sum of the queue and cache lengths of a user device or an edge node does not exceed its cache capacity. Constraint C8 indicates that the total processing delay of a task does not exceed its tolerable delay upper bound. Constraint C9 indicates that the distance between a user and an edge node during communication does not exceed the service radius of the node's base station.

4. A Joint Optimization Scheme for Task-Oriented Multi-Agent PPO Offloading Decision and Resource Allocation

4.1. Network Information Synchronization Protocol

At the user layer, users periodically report their location coordinates to the edge nodes. After receiving this information, a node regards itself as connected to the user and records the location information. At the edge layer, each edge node sends synchronization data to all adjacent nodes, so that after receiving the data a node can obtain the topology of the edge layer, construct a routing table, and store the status information of the other nodes in its own database. At the cloud layer, the cloud server periodically sends synchronization information to all edge nodes to confirm link connectivity. As shown in Figure 3, all synchronization information is sent periodically. If no synchronization message is received within a timeout period, the node considers itself disconnected from the corresponding device and does not retransmit.
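The liveness rule of this protocol (periodic messages, disconnect on timeout, no retransmission) can be sketched as follows; the period and timeout values, as well as the class and method names, are illustrative assumptions.

```python
import time

SYNC_PERIOD = 1.0      # seconds between synchronization messages (assumed)
SYNC_TIMEOUT = 3.0     # silence after which a neighbour is considered disconnected (assumed)

class NeighbourTable:
    """Tracks the last time a synchronization message was received from each neighbour."""
    def __init__(self):
        self.last_seen: dict[str, float] = {}

    def on_sync(self, node_id: str) -> None:
        """Record an incoming message (user location report, edge sync, or cloud sync)."""
        self.last_seen[node_id] = time.monotonic()

    def connected_neighbours(self) -> list[str]:
        """Neighbours whose messages have not timed out; no retransmission is attempted."""
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t <= SYNC_TIMEOUT]
```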

4.2. Distributed Offloading Scheduling Rules

Considering the mobility of users in the MEC network and the potential failures of nodes and links, it is necessary to design a set of offloading scheduling rules to deal with a variety of unexpected situations. Under ideal conditions, the offloading scheduling rules are shown in Figure 4.
The description is as follows:
(1)
If the user is within the service area, when they generate a task, they first report the summary information of the task to the nearest edge node they are connected to (referred to as the “decision node”).
(2)
The decision node makes an offloading decision based on this summary information and network conditions and sends it back to the user.
(3)
Users perform local calculations or offload tasks to decision nodes based on the received decision results.
(4)
After receiving task data, the node forwards the task to the offloading target node for calculation.
(5)
After the calculation is completed, the offloading target node determines the node closest to the user as the "return node" and then sends the calculation result to the return node along the shortest path.
(6)
The final calculation result is sent back to the user by the return node.

4.3. Multi-Agent System Based on Actor–Critic Framework

This section proposes a multi-agent system architecture suitable for MEC networks based on the classic strategy learning framework Actor–Critic. The multi-agent system that was designed is shown in Figure 5.
The system adopts a region partitioning method, dividing the entire network into multiple subnets based on hop distance. Representing the node network as an undirected graph, the system selects the node with the highest degree in each subnet as the commentator (critic) node whenever the network topology changes. The commentator node acts as the sole agent in the subnet and makes decisions for all nodes within it. Ordinary nodes only need to submit their observed information to the commentator node as a decision basis and execute the action values issued by the commentator node. The commentator node internally maintains multiple policy networks and one value network, with each policy network corresponding to the offloading strategy of one node in the subnet. The agent makes decisions and executes actions based on the output of the policy networks. Subsequently, the agent uses the value network to evaluate each action based on the environmental reward feedback and the new observation, and the policy networks and value network update their parameters accordingly.
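Selecting the commentator node as the highest-degree vertex of each subnet can be expressed compactly with networkx; below is a minimal sketch under the assumption that the subnet partition is already known.

```python
import networkx as nx

def select_commentator(edge_graph: nx.Graph, subnet_nodes: list[str]) -> str:
    """Return the node with the highest degree inside a subnet; it acts as the subnet's agent."""
    sub = edge_graph.subgraph(subnet_nodes)
    return max(sub.degree, key=lambda pair: pair[1])[0]

# Example: a small edge-layer topology split into two subnets
g = nx.Graph([("en1", "en2"), ("en2", "en3"), ("en3", "en4"), ("en4", "en5")])
print(select_commentator(g, ["en1", "en2", "en3"]))   # en2 (degree 2 inside the subnet)
print(select_commentator(g, ["en4", "en5"]))          # en4 (or en5; tied degree)
```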

4.4. Task-Oriented Markov Decision-Making Process

This paper proposes a task-oriented MDP (TOMDP) model for MEC networks, in which time steps have variable lengths and multiple time steps proceed simultaneously. As shown in Figure 6, the time step corresponding to task $x_{t_1}^i$ starts at its generation time $t_1$ and ends when its processing completes at $t_1 + T_{i,t_1}^{\mathrm{total}}$. Shortly after $t_1$, the agent observes the state $s\!\left(x_{t_1}^i\right)$ and performs an offloading action $a\!\left(x_{t_1}^i\right)$ on the task. When $x_{t_1}^i$ finishes processing, the environment feeds back the reward $r\!\left(x_{t_1}^i\right)$. Multiple tasks are processed in parallel.
The TOMDP model includes elements such as state, action, and reward, defined as follows:
(1)
State
The state information is composed of the observations of the node itself and the observations of all nodes it can connect to. Assuming that node $en_j$ receives the summary information of task $x_t^i$ at a certain moment, the state related to $x_t^i$ is defined as $s\!\left(x_t^i\right)$:

$$s\!\left(x_t^i\right) \triangleq \left( \frac{\lambda_{i,t}}{\mu_2},\ \frac{\tau_{i,t}}{\mu_3},\ TP_{i,t},\ H_i,\ \frac{q_i^{\mathrm{cache}} + q_i^{\mathrm{comp}} + q_i^{\mathrm{tran}}}{R^{\mathrm{ud}}},\ \left\{ \frac{q_j^{\mathrm{en\text{-}comp}} + q_j^{\mathrm{en\text{-}tran}}}{R^{\mathrm{en}}} \right\}_{j \le n},\ Link_j,\ th_{min},\ th_{max} \right)$$

where $\lambda_{i,t}$ is the amount of input data of $x_t^i$ and $\tau_{i,t}$ is its maximum tolerable delay. The constants $\mu_2$ and $\mu_3$ are scaling factors used for normalization. The vector $TP_{i,t}$ represents the type of the task. The matrix $H_i$ represents the user coordinate records stored in $en_j$. The vector $Link_j$ represents the connectivity of $en_j$ with other nodes. The lengths of the cache, computing queue, and transmission queue of $ud_i$ and its total cache capacity are $q_i^{\mathrm{cache}}$, $q_i^{\mathrm{comp}}$, $q_i^{\mathrm{tran}}$, and $R^{\mathrm{ud}}$, respectively; the lengths of the computing queue and transmission queue of $en_j$ and its total cache capacity are $q_j^{\mathrm{en\text{-}comp}}$, $q_j^{\mathrm{en\text{-}tran}}$, and $R^{\mathrm{en}}$, respectively. The minimum and maximum queue thresholds are $th_{min}$ and $th_{max}$. The "completion state" observed by $en_j$ when $x_t^i$ finishes processing is defined as $sc\!\left(x_t^i\right)$.
(2)
Action
The decision vector includes the offloading target selection $D_{i,t}$ of the task, the computing power allocation weight $ff_{i,t}^j$, the queue priority weight $qf_{i,t}$, the bandwidth allocation weight $bf_{i,j}$, and the queue thresholds $th_{min}$ and $th_{max}$. Assuming the offloading target of task $x_t^i$ is $en_j$, the action related to $x_t^i$ is defined as $a\!\left(x_t^i\right)$:

$$a\!\left(x_t^i\right) \triangleq \left( D_{i,t},\ ff_{i,t}^j,\ qf_{i,t},\ bf_{i,j},\ th_{min},\ th_{max} \right)$$
(3)
Rewards
The reward associated with $x_t^i$ is defined as follows:

$$R\!\left(x_t^i\right) \triangleq 1 - \frac{Z\!\left(x_t^i\right)}{\sum_{t \in T_i^{\mathrm{task}}} Z\!\left(x_t^i\right) \Big/ \sum_{t \in T_i^{\mathrm{task}}} 1}\left(1 + \frac{1}{2} PEN\!\left(x_t^i\right)\right)$$

where $\sum_{t \in T_i^{\mathrm{task}}} Z\!\left(x_t^i\right) \big/ \sum_{t \in T_i^{\mathrm{task}}} 1$ is the average cost of all tasks generated by $ud_i$, and $PEN\!\left(x_t^i\right) \in [0,1]$ is the penalty value of $x_t^i$ assigned by the decision constraint module to the initial decision.
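A small sketch of the reward computation follows; note that the placement of the penalty factor reflects our reading of the reconstructed Formula (16) and should be treated as an assumption, and the function and argument names are ours.

```python
def task_reward(cost: float, past_costs: list[float], penalty: float) -> float:
    """Reward R(x_t^i) per (16): 1 minus the task's cost relative to the user's average cost,
    with the cost ratio inflated by the constraint-violation penalty PEN in [0, 1].
    The exact placement of the penalty term is an assumption based on our reconstruction."""
    avg_cost = sum(past_costs) / len(past_costs) if past_costs else 1.0
    return 1.0 - (cost / avg_cost) * (1.0 + 0.5 * penalty)
```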

4.5. Neural Network Structure Used by TOMAC-PPO

To give the agents a degree of memory and to address the partial observability of the network model, the TOMAC-PPO algorithm proposed in this paper combines the Transformer model with a fully connected network. The policy network structure used by TOMAC-PPO is shown in Figure 7. The network input is the state $s$, and the output is the action probability distribution $\pi(a \mid s; \theta_j)$, where $\theta_j$ denotes the policy network parameters of agent $j$. The policy network first feeds $s$ into the Transformer model, then learns the mapping between state information and action probability distribution through three fully connected layers, and finally outputs the probability of each action in that state. The final fully connected (FC) layer uses the Softmax activation function to ensure that the output satisfies the definition of a probability distribution. The value network adopts an almost identical structure, as shown in Figure 8. Its goal is to fit the state value $V_\pi(s)$ of the current state $s$, so only one output value $v(s; \omega_j)$ is needed, where $\omega_j$ denotes the value network parameters of agent $j$. The structure of the target value network is identical to that of the value network, and its parameters are denoted $\hat{\omega}_j$.
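A minimal PyTorch sketch of the policy network in Figure 7 (a Transformer encoder followed by three fully connected layers with a Softmax output) is given below; the embedding width, number of heads and layers, and hidden sizes are illustrative, not the values used in the paper.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a | s; theta_j): Transformer encoder over the state, then three FC layers, Softmax output."""
    def __init__(self, state_dim: int, action_dim: int, d_model: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)                 # project state to model width
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.fc = nn.Sequential(                                   # three fully connected layers
            nn.Linear(d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = self.embed(state).unsqueeze(1)        # (batch, seq_len=1, d_model)
        x = self.encoder(x).squeeze(1)            # Transformer memory over the state sequence
        return torch.softmax(self.fc(x), dim=-1)  # probability of each discrete action

# The value network v(s; omega_j) uses the same backbone with a single linear output
# in place of the Softmax layer.
```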

4.6. TOMAC-PPO Algorithm Process

The PPO algorithm, proposed by OpenAI, is widely regarded as one of the most successful DRL algorithms due to its excellent performance, high efficiency, and stability [21]. PPO is a policy learning method based on the Actor–Critic framework, so it fits well with the multi-agent system proposed in this paper. Therefore, this paper combines PPO with the Transformer and applies it to the TOMDP model and multi-agent system built above, ultimately forming the TOMAC-PPO algorithm framework. During training, $\theta_{\mathrm{now}}$ denotes the current parameters of the policy network, and $\theta$ denotes the parameters of the policy network at the next update, which is the optimization variable of the algorithm. The theoretical objective function of TOMAC-PPO is denoted $J(\theta)$. Because the expression of $J(\theta)$ is difficult to obtain, when $\theta$ lies in the confidence domain $N(\theta_{\mathrm{now}})$, an expression $L(\theta)$ that is sufficiently close to $J(\theta)$ and easy to solve can be constructed, and $L(\theta)$ can be used instead of $J(\theta)$ as the approximate objective function [22].
To satisfy the confidence domain constraint $\theta \in N(\theta_{\mathrm{now}})$, a divergence measure is needed to quantify and limit the difference between the new policy $\pi(A \mid S; \theta)$ and the old policy $\pi(A \mid S; \theta_{\mathrm{now}})$. The TOMAC-PPO algorithm adopts a clipping scheme for the approximate optimization objective: the approximate objective function $L(\theta)$ is clipped so that $\theta$ does not leave the confidence domain $N(\theta_{\mathrm{now}})$ when gradient ascent is run. In this scheme, the explicit constraint is removed from the optimization problem and replaced by a clipping function of $\theta$, which makes the problem easier to solve.
In the following, $s\!\left(x_t^i\right)$, $a\!\left(x_t^i\right)$, and $r\!\left(x_t^i\right)$ are abbreviated as $s_t$, $a_t$, and $r_t$.
Let $ratio_t(\theta) = \frac{\pi(a_t \mid s_t; \theta)}{\pi(a_t \mid s_t; \theta_{\mathrm{now}})}$. Following the derivation in reference [23], the clipped approximate objective function $L^{\mathrm{CLIP}}(\theta)$ used by TOMAC-PPO is

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left\{ \min\!\left[ ratio_t(\theta)\, Adv_t,\ \mathrm{clip}\!\left(ratio_t(\theta),\, 1-\varsigma,\, 1+\varsigma\right) Adv_t \right] + c\, H\!\left(s_t; \theta\right) \right\}$$

where $\varsigma$ and $c$ are hyperparameters. $\mathrm{clip}\!\left(ratio_t(\theta), 1-\varsigma, 1+\varsigma\right)$ truncates $ratio_t(\theta)$ to the interval $[1-\varsigma, 1+\varsigma]$, so that $L^{\mathrm{CLIP}}(\theta)$ attains its maximum within this interval, as shown in Figure 9. $H(s_t; \theta)$ is the entropy bonus introduced in references [17,22], which encourages agents to explore more actions and avoids premature convergence of the policy to a local optimum. The advantage function $Adv_t$ can be expressed as follows:
$$Adv_t = Q_\pi(s_t, a_t) - V_\pi(s_t)$$

It is hard to obtain the exact values of $Q_\pi$ and $V_\pi$ during training, so $Adv_t$ must be approximated. First, TOMAC-PPO has the agent collect a $T$-step trajectory $(s_t, a_t, r_t)$ with the current policy $\pi(A \mid s_t; \theta_{\mathrm{now}})$, from which the discounted return $u_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} r_T$ is obtained. Then, in (18), $Q_\pi(s_t, a_t)$ is approximated by $u_t$ and $V_\pi(s_t)$ is approximated by the value network output $v(s_t; \omega)$, giving the approximate advantage $Adv_t$:

$$Adv_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} r_T - v(s_t; \omega)$$
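A minimal sketch of the approximate advantage in (19): the discounted return of the collected $T$-step trajectory minus the value network's estimate. The default discount rate follows Table 1; the function name is ours.

```python
import torch

def approximate_advantages(rewards: torch.Tensor, values: torch.Tensor,
                           gamma: float = 0.95) -> torch.Tensor:
    """Adv_t = r_t + gamma*r_{t+1} + ... + gamma^{T-t}*r_T - v(s_t; omega), as in (19).
    `rewards` and `values` are 1-D tensors of length T (values come from the value network)."""
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):     # accumulate the discounted return backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - values
```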
With all variables in the approximate objective $L^{\mathrm{CLIP}}(\theta)$ now available, gradient ascent can be used to repeatedly update the policy network parameters $\theta$ of each agent so as to increase $L^{\mathrm{CLIP}}(\theta)$:

$$\theta \leftarrow \theta_{\mathrm{now}} + \beta\, \nabla_\theta L^{\mathrm{CLIP}}(\theta_{\mathrm{now}})$$

where $\beta$ is the learning rate. Next, the value network is trained; its loss function is defined as follows:

$$L^V(\omega) = \sum_{t=1}^{T} \left( Adv_t \right)^2$$

Minimizing this loss by gradient descent drives the value network output $v(s_t; \omega)$ toward the observed discounted return, which in turn yields more accurate advantage estimates and thus better action evaluation for the policy update. The value network parameters $\omega$ are updated as follows to reduce the loss $L^V(\omega)$:

$$\omega \leftarrow \omega_{\mathrm{now}} - \alpha\, \nabla_\omega L^V(\omega_{\mathrm{now}})$$

where $\alpha$ is the learning rate. The target value network parameters $\hat{\omega}$ are updated in a similar (soft) way, as shown in (23):

$$\hat{\omega} \leftarrow \sigma \omega + (1 - \sigma)\hat{\omega}$$

where $\sigma \in (0,1)$ is the update ratio of the target network.
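The clipped surrogate objective (17), the value loss (21), and the soft target update (23) can be combined into a single update step, as sketched below in PyTorch; the clipping parameter, entropy weight, and update ratio defaults follow Table 1, the discrete-action Categorical policy and the placement of the entropy term follow the standard PPO formulation, and all function and argument names are ours.

```python
import torch

def ppo_update(policy, value_net, target_value_net, states, actions, old_log_probs,
               advantages, returns, policy_opt, value_opt,
               clip_eps: float = 0.2, entropy_coef: float = 0.01, tau: float = 0.01):
    """One simplified single-agent TOMAC-PPO-style update:
    clipped surrogate ascent (17), squared value loss (21), soft target update (23)."""
    probs = policy(states)                                       # pi(a | s; theta), shape (batch, action_dim)
    dist = torch.distributions.Categorical(probs)
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)    # ratio_t(theta)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Negate the objective so that minimizing the loss performs gradient ascent on L^CLIP.
    policy_loss = -(torch.min(unclipped, clipped).mean() + entropy_coef * dist.entropy().mean())
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # L^V(omega) in (21): squared approximate advantage, i.e. (return - v(s))^2.
    value_loss = ((returns - value_net(states).squeeze(-1)) ** 2).sum()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Soft update (23): omega_hat <- sigma * omega + (1 - sigma) * omega_hat.
    with torch.no_grad():
        for p, tp in zip(value_net.parameters(), target_value_net.parameters()):
            tp.mul_(1 - tau).add_(tau * p)
```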
The specific steps of the TOMAC-PPO algorithm are shown in Algorithm 1.
Algorithm 1. Task-Oriented Multi-Agent Collaborative-Proximal Policy Optimization (TOMAC-PPO)
Input: Number of training rounds $e_{max}$, clipping parameter $\varsigma$, value network learning rate $\alpha$, policy network learning rate $\beta$, discount rate $\gamma$, target value network update ratio $\sigma$.
Output: Optimal task offloading and resource allocation strategy $\pi(a \mid s; \theta_j)$.
1. Randomly initialize the parameters $\theta$ of each policy network, the parameters $\omega$ of each value network, and the target value network parameters $\hat{\omega}$;
2. For $episode = 1, 2, \ldots, e_{max}$ do
3.   For all agents $j$, where $1 \le j \le n$, do in parallel
4.     According to the current policy $\pi(A \mid s_t; \theta_{\mathrm{now}}^j)$, collect a $T$-step trajectory $(s_t, a_t, r_t)$;
5.     According to (19), calculate the approximate advantage $Adv_t$;
6.     Update the policy network parameters $\theta_j$ of agent $j$ according to (20);
7.     According to (22) and (23), update the value network parameters $\omega_j$ and the target value network parameters $\hat{\omega}_j$ of agent $j$.
End for
End for
We now give a complexity analysis of the TOMAC-PPO algorithm. The main time cost lies in step 2, a training loop with $e_{max}$ rounds and time complexity $O(e_{max})$. In step 3, the agents operate in parallel, with time complexity $O(n)$. The overall time complexity is therefore $O(n \cdot e_{max})$.

5. Experimental Results and Analysis

To evaluate the optimization effect of the proposed scheme on MEC network performance, we compare the following four offloading decision schemes with TOMAC-PPO.
(1)
TOMAC-A2C (Task-Oriented Multi-Agent Cooperative Advantage Actor–Critic). A2C algorithm is one of the classic strategy learning algorithms in the RL field.
(2)
TO-A3C (Task-Oriented Asynchronous Advantage Actor–Critic). TO-A3C belongs to the parallel RL method and does not use multi-agent systems. The A3C algorithm improves its performance by establishing multiple independent single agent A2C training environments, enabling them to train in parallel [24].
(3)
CCP (Cloud Computing Priority). CCP adopts the principle of “deliver tasks to upper level processing as much as possible”, and prioritizes offloading all tasks to the cloud for processing.
(4)
LC (Local Computing). After a task is generated, the user skips the information reporting process and computes the task locally.
We build the algorithm environment for active queue management, offloading scheduling rules, and multi-agent systems described in this paper to run the TOMAC-PPO algorithm. We give convergence analysis, and test average cost, average delay, average energy consumption, and drop rate.

5.1. Experimental Environment and Parameter Settings

The experimental environment is implemented in Python. The real dataset used includes the longitude and latitude of mobile users and the maximum CPU frequency of their devices as user device data. This paper stipulates that the probabilities of a user generating high-priority, critical, low-priority, and regular tasks are 0.2, 0.2, 0.2, and 0.4, respectively. The specific parameters are shown in Table 1.

5.2. Convergence Analysis

With 50 users and failure probabilities $p_{\mathrm{en}} = p_{\mathrm{fiber}} = 0.1\%$, we give convergence results for the TOMAC-A2C and TOMAC-PPO algorithms.
The cumulative reward convergence of TOMAC-A2C and TOMAC-PPO during training is shown in Figure 10. The moving-average reward curves of both algorithms gradually converge as the number of training episodes increases, demonstrating the effectiveness of the TOMAC-A2C and TOMAC-PPO algorithms. TOMAC-PPO not only has a clear advantage in convergence speed over TOMAC-A2C but also improves the final convergence value by about 23.9% compared to TOMAC-A2C, demonstrating its outstanding performance advantage. Regarding the stability of the convergence curve, although the final convergence value of TOMAC-PPO is significantly higher than that of TOMAC-A2C, its oscillation amplitude is also noticeably larger. This may be caused by the entropy term introduced in the approximate objective function $L^{\mathrm{CLIP}}(\theta)$ of TOMAC-PPO, which gives the agents a stronger tendency toward random exploration.

5.3. Optimization Performance Evaluation

To evaluate the performance of the offloading schemes under different network load levels, this experiment tested how the average task cost varies with the number of users under failure probabilities $p_{\mathrm{en}} = p_{\mathrm{fiber}} = 0.1\%$. As shown in Figure 11, the TOMAC-PPO method always maintains the lowest task cost under all load conditions, and its performance advantage continues to grow as the number of users increases. With 50 users, TOMAC-PPO reduces the average cost by 19.4% to 66.6% compared to the other solutions.
To test how the offloading schemes respond to line faults, this experiment measured the average task cost as the failure probability of edge servers and lines varies, with a large number of users (50).
As shown in Figure 12, when the failure rate is below 70%, the task cost of the TOMAC-PPO and TOMAC-A2C methods is significantly lower than that of the other schemes. When the failure rate exceeds 70% and continues to increase, the task cost of TOMAC-PPO and TOMAC-A2C rises sharply and gradually approaches the curve of the LC scheme, and the gap between the two also narrows. When the failure rate reaches 100%, the entire edge layer is effectively disconnected. At this point, except for the LC scheme, the average cost of the other schemes is around 2.3, slightly higher than the LC scheme's average cost of 2.24. This indicates that, at a 100% failure rate, every optimization scheme has no countermeasure other than local computation; because users still attempt to report task information, the incurred cost is slightly higher than that of LC. When the failure rate is below 50%, the cost curve of the CCP scheme shows a downward trend, because line failures force more tasks to be handed to the lower-latency edge layer, reducing unnecessary task migration to the cloud. When the failure rate exceeds 50%, the CCP cost curve gradually rises and approaches the LC scheme, because the reduction in available servers leaves a large number of tasks that cannot be offloaded and can only be processed at the user layer. In the vast majority of failure-rate scenarios, the TOMAC-PPO scheme maintains the lowest average cost, which shows that, compared with the other optimization schemes, TOMAC-PPO is more robust in dealing with various failure situations.
To test whether each offloading scheme handles different types of tasks in a targeted manner, we evaluate the processing effect of each scheme on different task types with 50 users and failure probabilities $p_{\mathrm{en}} = p_{\mathrm{fiber}} = 0.1\%$, as shown in Figure 13.
As shown in the figure, apart from energy consumption, the TOMAC-PPO method achieves the best optimization results on the evaluation indicators that matter for each task type. Because of the significant energy-saving advantage of the LC scheme, TOMAC-PPO falls short of LC in optimizing the energy consumption of low-priority tasks. TOMAC-A2C, meanwhile, considers not only the energy consumption of this type of task but also the optimization of the task drop rate: if too many low-priority tasks are delegated to local processing, the weak local computing power may lead to a high drop rate, so TOMAC-A2C incurs more energy consumption by balancing these two indicators. Except for LC, TOMAC-PPO still has the best energy optimization effect on low-priority tasks. The results demonstrate that, compared with the other schemes, TOMAC-PPO responds more accurately to the diverse requirements of different task types.

5.4. Experiment Results Discussion

In the above experiments, we compared the four baseline offloading decision schemes with TOMAC-PPO in terms of average cost, average delay, average energy consumption, and drop rate. The proposed TOMAC-PPO can effectively reduce the average task cost in MEC network environments with different numbers of users and system failure probabilities. Compared to the four baseline methods, the proposed scheme has better optimization performance in the vast majority of cases, and its advantage is particularly prominent under high load. For example, the proposed TOMAC-PPO can reduce the average cost by 19.4% to 66.6% compared to other offloading schemes under the same network load. In addition, the drop rate of some baseline algorithms for critical tasks reaches 62.5% with 50 users, whereas the proposed TOMAC-PPO achieves only 5.5%. The main reasons are that (1) multi-agent parallel offloading and task allocation are introduced in TOMAC-PPO and (2) active queue management and offloading scheduling rules are applied to different task types. Nevertheless, the proposed TOMAC-PPO has some limitations: (1) As the network scale continues to grow, the scalability of the multi-agent system will be challenged; the information synchronization protocol used within the system needs to further reduce communication overhead, and the information lag of large-scale systems is also a major issue to be addressed. (2) Due to various limitations, this paper could only use Python code to implement a simulation system for a simplified simulation of MEC networks. In future research, professional simulation software can be used to achieve more realistic simulations, and MEC computation offloading in more scenarios, such as drones and connected vehicles, can be considered.

6. Conclusions and Further Works

Task offloading and resource allocation is a research hotspot in cloud-edge collaborative computing. This paper constructs a cloud-edge collaborative computing model, together with the associated task queue, delay, and energy consumption models, and formulates a joint optimization problem for task offloading and resource allocation with multiple constraints. Furthermore, it designs a decentralized task offloading and resource allocation scheme based on "task-oriented" multi-agent reinforcement learning. In this scheme, we present an information synchronization protocol and offloading scheduling rules and use edge servers as agents to construct a multi-agent system based on the Actor–Critic framework. The proposed TOMAC-PPO applies proximal policy optimization to the multi-agent system and combines it with the Transformer neural network model to realize memory and prediction of network state information. Experimental results show that the algorithm converges faster and can effectively reduce service cost, energy consumption, and task drop rate under high load and high failure rates. For example, the proposed TOMAC-PPO can reduce the average cost by 19.4% to 66.6% compared to other offloading schemes under the same network load. In addition, the drop rate of some baseline algorithms for critical tasks reaches 62.5% with 50 users, whereas the proposed TOMAC-PPO achieves only 5.5%.
In future work, because the TOMDP dynamic time slot model proposed in this paper departs from existing DRL code patterns, it introduces certain implementation difficulties; making TOMDP compatible with existing DRL code frameworks is therefore an area to be optimized. In addition, the design of states, actions, and rewards in TOMDP is a decisive factor in the optimization effect of the algorithm and can be further improved.

Author Contributions

Conceptualization, G.J.; methodology and validation, R.H.; investigation and writing—original draft preparation, G.W. and Z.B.; writing—review and editing, R.H. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China. No. 62062007, 2022 Guangxi Vocational Education Teaching Reform Research Key Project, and Research on the Six in One Talent Training Model of “Post, Course, Competition, Certificate, Training, and Creation” in the College of Artificial Intelligence Industry, No. GXGZJG2022A011. Design and Application of Computer Experiment Site Management System Based on Information Technology, 2023YKYZ001.

Data Availability Statement

Data are available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

TOMAC-PPO	Task-Oriented Multi-Agent Collaborative-Proximal Policy Optimization
PPO	Proximal Policy Optimization
IoT	Internet of Things
MEC	Mobile Edge Computing
RL	Reinforcement Learning
DRL	Deep Reinforcement Learning
UAVs	Unmanned Aerial Vehicles
DQN	Deep Q Network
MARL	Multi-Agent Reinforcement Learning
MADDPG	Multi-Agent Deep Deterministic Policy Gradient
FIFO	First In First Out
FC	Fully Connected
TOMAC-A2C	Task-Oriented Multi-Agent Cooperative Advantage Actor–Critic
CCP	Cloud Computing Priority
LC	Local Computing

References

  1. IoT and Non-IoT Connections Worldwide 2010–2025. Available online: https://www.statista.com/statistics/1101442/iot-number-of-connected-devices-worldwide (accessed on 28 August 2024).
  2. IoT Is Not a Buzzword but Necessity. Available online: https://www.3i-infotech.com/iot-is-not-just-a-buzzword-but-has-practical-applications-even-in-industries/ (accessed on 28 August 2024).
  3. Zhang, Y.-L.; Liang, Y.-Z.; Yin, M.-J.; Quan, H.-Y.; Wang, T.; Jia, W.-J. Survey on the Methods of Computation Offloading in Mobile Edge Computing. J. Comput. Sci. Technol. 2021, 44, 2406–2430. [Google Scholar]
  4. Duan, S.; Wang, D.; Ren, J.; Lyu, F.; Zhang, Y.; Wu, H.; Shen, X. Distributed artificial intelligence empowered by end-edge-cloud computing: A survey. IEEE Commun. Surv. Tutor. 2022, 25, 591–624. [Google Scholar] [CrossRef]
  5. Hua, H.; Li, Y.; Wang, T.; Dong, N.; Li, W.; Cao, J. Edge computing with artificial intelligence: A machine learning perspective. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  6. Kar, B.; Yahya, W.; Lin, Y.-D.; Ali, A. Offloading using traditional optimization and machine learning in federated cloud-edge-fog systems: A survey. IEEE Commun. Surv. Tutor. 2023, 25, 1199–1226. [Google Scholar] [CrossRef]
  7. Arjona-Medina, J.A.; Gillhofer, M.; Widrich, M.; Unterthiner, T.; Brandstetter, J.; Hochreiter, S. RUDDER: Return decomposition for delayed rewards. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 13544–13555. [Google Scholar]
  8. Zhang, X.-C.; Ren, T.-S.; Zhao, Y.; Rui, F. Joint Optimization Method of Energy Consumption and Time Delay for Mobile Edge Computing. J. Univ. Electron. Sci. Technol. China 2022, 51, 737–742. [Google Scholar]
  9. Wu, H.-Y.; Chen, Z.-W.; Shi, B.-W.; Deng, S.; Chen, S.; Xue, X.; Feng, Z. Decentralized Service Request Dispatching for Edge Computing Systems. Chin. J. Comput. 2023, 46, 987–1002. [Google Scholar]
  10. Ma, L.; Wang, X.; Wang, X.; Wang, L.; Shi, Y.; Huang, M. TCDA: Truthful combinatorial double auctions for mobile edge computing in industrial internet of things. IEEE Trans. Mob. Comput. 2021, 21, 4125–4138. [Google Scholar] [CrossRef]
  11. Cang, Y.; Chen, M.; Pan, Y.; Yang, Z.; Hu, Y.; Sun, H.; Chen, M. Joint user scheduling and computing resource allocation optimization in asynchronous mobile edge computing networks. IEEE Trans. Commun. 2024, 72, 3378–3392. [Google Scholar] [CrossRef]
  12. Peng, Z.; Wang, G.; Nong, W.; Qiu, Y.; Huang, S. Task offloading in multiple-services mobile edge computing: A deep reinforcement learning algorithm. Comput. Commun. 2023, 202, 1–12. [Google Scholar] [CrossRef]
  13. Li, J.; Yang, Z.; Wang, X.; Xia, Y.; Ni, S. Task offloading mechanism based on federated reinforcement learning in mobile edge computing. Digit. Commun. Netw. 2023, 9, 492–504. [Google Scholar] [CrossRef]
  14. Li, Y.; Aghvami, A.H.; Dong, D. Path Planning for Cellular-Connected UAV: A DRL Solution with Quantum-Inspired Experience Replay. IEEE Trans. Wirel. Commun. 2022, 21, 7897–7912. [Google Scholar] [CrossRef]
  15. Li, Y.; Aghvami, A.H. Radio Resource Management for Cellular-Connected UAV: A Learning Approach. IEEE Trans. Commun. 2023, 71, 2784–2800. [Google Scholar] [CrossRef]
  16. Kuang, Z.-F.; Chen, Q.-L.; Li, L.-F.; Deng, X.H.; Chen, Z.G. Multi-user edge computing task offloading scheduling and resource allocation based on deep reinforcement learning. Chin. J. Comput. 2022, 45, 812–824. (In Chinese) [Google Scholar]
  17. Tuli, S.; Ilager, S.; Ramamohanarao, K.; Buyya, R. Dynamic scheduling for stochastic edge-cloud computing environments using A3C learning and residual recurrent neural networks. IEEE Trans. Mob. Comput. 2020, 21, 940–954. [Google Scholar] [CrossRef]
  18. Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: New York, NY, USA, 2021; pp. 321–384. [Google Scholar]
  19. Zhang, P.; Tian, H.; Zhao, P.; He, S.; Tong, Y. Computation offloading strategy in multi-agent cooperation scenario based on reinforcement learning with value-decomposition. J. Commun. 2021, 42, 1–15. (In Chinese) [Google Scholar]
  20. Cao, Z.; Zhou, P.; Li, R.; Huang, S.; Wu, D. Multiagent deep reinforcement learning for joint multichannel access and task offloading of mobile-edge computing in industry 4.0. IEEE Internet Things J. 2020, 7, 6201–6213. [Google Scholar] [CrossRef]
  21. Wang, Y.; He, H.; Tan, X. Truly proximal policy optimization. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference (UAI), Tel Aviv, Israel, 23–25 July 2019; PMLR: New York, NY, USA, 2020; Volume 115, pp. 113–122. [Google Scholar]
  22. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; PMLR: New York, NY, USA, 2015; Volume 37, pp. 1889–1897. [Google Scholar]
  23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  24. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.P.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York City, NY, USA, 19–24 June 2016; PMLR: New York, NY, USA, 2016; Volume 48, pp. 1928–1937. [Google Scholar]
Figure 1. Cloud-edge collaboration model for MEC network.
Figure 2. Task computing and transmission queues.
Figure 3. Synchronization timeout due to disconnection (the blue arrow is the direction of information transmission, and the red cross indicates that the sender and receiver are not connected).
Figure 4. Offloading and scheduling when a user moves across cells (the numbers ①–⑥ represent the order in which the scheduling rules are executed).
Figure 5. Multi-agent system based on the Actor–Critic framework.
Figure 6. Task-oriented Markov decision process.
Figure 7. Policy network structure.
Figure 8. Value network structure.
Figure 9. Clipped surrogate objective schematic diagram.
Figure 10. Convergence of the offloading and allocation algorithms: cumulative reward versus episodes.
Figure 11. Average task cost versus user number.
Figure 12. Average task cost versus failure probability.
Figure 13. Optimization effect of schemes on the key performance metrics of different types of tasks.
Table 1. Parameter settings.
Symbol | Meaning | Value
n | Number of edge nodes | 10
R_ud | User device cache capacity | 49,152 Mbit
R_en | Edge node cache capacity | 262,144 Mbit
B | Base station bandwidth resources | 20 MHz
r | Base station service radius | 100 m
P_edge | Base station transmission power | 200 W
P_user | User device transmission power | 0.2 W
κ | User device energy efficiency coefficient | U(4.13, 66.16) × 10^-27
σ² | Gaussian noise power | 1.5 × 10^-8 W
V_space | Propagation speed of electromagnetic waves in the air | 3 × 10^8 m/s
V_fiber | Propagation speed of electromagnetic waves in a circuit | 2 × 10^8 m/s
C_fiber | Transmission rate of wired communication | 1000 Mbit/s
T_repair | Fault repair duration | U(10, 60) s
λ | Task input data volume | N(5, 100) Mbit
λ_out | Task calculation result data volume | N(1, 50) Mbit
ρ | Task computing density | N(0.297, 0.1) G.c./Mbit
num_core | Number of CPU cores on edge servers | 14
f_edge | Edge server CPU frequency | 2.4 GHz
f_cloud | Cloud server CPU frequency | 10 GHz
T_delay^cloud | Core network forwarding delay | N(50, 15) ms
d_edge | Distance between adjacent nodes | 150 m
p_drop | RED task drop probability | 1/50
α, β | Learning rate | 3 × 10^-4
e_max | Training epochs | 1000
γ | Discount rate | 0.95
σ | Target network update ratio | 0.01
ς | Clipping parameter | 0.2
c | Entropy weight | 0.01
