Applied Sciences
  • Article
  • Open Access

26 April 2023

A Meta Reinforcement Learning-Based Task Offloading Strategy for IoT Devices in an Edge Cloud Computing Environment

1 SINOPEC Research Institute of Petroleum Processing Co., Ltd., Beijing 100083, China
2 School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
3 School of Information Technology, Shanghai Jian Qiao University, Shanghai 201306, China
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Future Internet of Things: Applications, Protocols and Challenges

Abstract

Developing an effective task offloading strategy to improve the task processing speed of IoT devices has been a focus of research in recent years. Reinforcement learning-based policies can reduce the model dependence of heuristic algorithms through continuous interactive exploration of the edge environment; however, when the environment changes, such reinforcement learning algorithms cannot adapt and must spend time on retraining. This paper proposes an adaptive task offloading strategy based on meta reinforcement learning, with task latency and device energy consumption as optimization targets, to overcome this challenge. An edge system model with a wireless charging module is developed to improve the ability of IoT devices to provide service continuously. A Seq2Seq-based neural network is built as the task strategy network to address the difficulty of training a network on task sequences of different dimensions. A first-order approximation method is proposed to accelerate the meta-strategy training of the Seq2Seq network, which otherwise involves second-order gradients. The experimental results show that, compared with existing methods, the proposed algorithm performs better across different tasks and network environments, effectively reduces task processing delay and device energy consumption, and quickly adapts to new environments.

1. Introduction

Innovative mobile applications on a large number of IoT devices (e.g., face recognition, smart transportation, and AR/VR) are becoming an important part of daily life as network communication technologies develop. According to an industry report by Cisco [1], about 3.5 billion people will have access to Internet services by the end of 2023, meaning that more than half of the world's population will use one or more devices to connect to IP networks; according to the European Telecommunications Standards Institute [2], the number of connected devices will be roughly three times the global population. The age of the Internet of everything has arrived. With the increasing number of devices and the explosive growth of network traffic, the cloud computing resources required by innovative mobile applications on IoT devices have skyrocketed, and the network latency caused by heavy traffic and the computational demand generated by offloaded tasks make it difficult to meet even the basic needs of applications. Mobile edge computing (MEC) [3] is a key technology for addressing this problem: it deploys edge servers with a certain amount of computing power to move cloud computing and storage capacity from centralized data centers to the network edge [4].
Innovative IoT applications in the mobile edge cloud computing environment are frequently developed using modular programming [5]; they contain many internal task modules that can be partially offloaded, and these modules and their interdependencies can be abstracted as a directed acyclic graph (DAG) [6], where nodes represent tasks and edges represent task dependencies. Compared with the total offloading model, this model offers more offloading flexibility and can fully exploit the heterogeneous parallel environment of MEC, but the resulting offload scheduling decisions are more difficult, mainly in the following respects:
(1)
There are many subtasks, some of which can be processed concurrently, so there are far more candidate strategies. This makes the problem several orders of magnitude larger than the total offloading model and places higher demands on the algorithm.
(2)
The effect of the offloading strategy may be affected by the application type and the dependency characteristics of its subtasks, and the number of subtasks varies between applications, so conventional reinforcement learning algorithms cannot adapt to all situations.
Because mobile devices (MDs) have limited computational power and battery capacity, providing long-term stable services for various smart scenarios is difficult. Wireless power transfer (WPT) can provide a stable power supply, but the hybrid access point (HAP) that implements it provides both data transfer and energy transfer for MDs and can only perform one of these operations at a time; hence, a strategy is needed to decide which subtasks to offload and when to charge. By scheduling subtask offloading and WPT service times, the edge server and the MD can collaborate to handle DAG tasks and improve service quality.
Much research now applies deep reinforcement learning (DRL), in which a DAG task offloading model is trained for a period of time and then used to select the optimal offloading solution; this outperforms greedy algorithms, heuristic algorithms, and metaheuristics. However, this approach requires a substantial quantity of sample data and suffers from low sample efficiency, slow learning, and poor adaptability. Different applications correspond to different DAG tasks of the kind considered in this paper, and, as users switch applications, the MEC environment changes, invalidating the learned network parameters. If the task offloading module is to serve multiple DAG task types, the network must be retrained for each type, which is very time-consuming and difficult to do in practice.
The meta reinforcement learning (MRL) framework can accelerate the learning of new tasks by using prior experience from different tasks to achieve quick adaptation. MRL learns in two stages: an outer loop, which learns common experience from multiple tasks to obtain the meta-strategy parameters and requires more computational resources, and an inner loop, which starts from the meta-strategy parameters and adapts to a new task with a small amount of task-specific learning [7]. Using MRL to solve the computational offloading problem has several advantages for our MEC system: meta-strategy learning can be performed on the edge server, while inner-loop training can be performed on the MD, because the inner loop needs only a few simple steps and a small amount of sampled data and can therefore run on an MD with limited computing power and data. In this way, the resources of both the edge server and the MD are utilized.
Considering the above, this paper studies dynamic edge scenarios and proposes MTD3CO, a meta reinforcement learning-based task offloading strategy that takes into account DAG task types, subtask decisions, device power, WPT service time, task completion time, and other factors, to address the problem that conventional reinforcement learning algorithms are inefficient and cannot adapt to various types of mobile applications. We first describe the offloading scheduling decision process for each type of DAG task as a separate MDP, then modify the network of the TD3 algorithm to design policy and value networks suitable for offloading problems with different numbers of subtasks, and finally transform the DAG subtask decision process into a sequential prediction process. Each RNN in the algorithm is trained on the basis of TD3, with the training split into inner-loop and outer-loop phases. Outer-loop training is performed by the outer-loop learner on the edge server, which trains uniformly over the MDP environments of all applications in the system and obtains the meta-strategy. The inner-loop learner on each edge device then downloads the meta-strategy parameters, initializes its own network, and performs a small number of training iterations to fine-tune the strategy to its own task characteristics, thereby adapting to different tasks without time-consuming retraining from scratch. The following is a summary of this paper's primary contributions:
  • An edge system model, including a wireless charging module, is designed for the complicated edge computing environment for various types of mobile applications; on the basis of this architecture, a delay model, energy consumption model, and power model are built to quantify strategy performance.
  • To address the problem that new tasks cannot be adapted to quickly enough, a meta reinforcement learning-based task offloading decision algorithm (MTD3CO) is proposed. A Seq2Seq-based deep network suited to the task offloading process turns the offloading decision into a sequential prediction process, so that applications with different numbers of subtasks can be handled. To handle the second-order gradients present in the meta-strategy solution, a first-order approximate optimization algorithm is proposed to accelerate the computation.
  • Simulation experiments were built on the basis of practical applications for tasks with different DAG topologies, varying the DAG topology, the number of tasks, and the transmission speed between the MD and edge servers. Compared with several baseline methods (an improved DRL algorithm, a HEFT-based heuristic algorithm, and a greedy algorithm), MTD3CO achieved the best results after only a few training steps.
The remainder of this paper is organized as follows: Section 2 introduces related academic research; Section 3 constructs the edge system model and models the task offloading problem; Section 4 describes the MTD3CO algorithm’s design and implementation; Section 5 presents simulated experiments and results; Section 6 concludes and reviews the paper’s work.

3. System Model

3.1. Problem Description

A face recognition program, whose submodules can be expressed as a DAG and which is computation-intensive, serves as a convenient example of task offloading in MEC. Modules such as stitching, detection, and feature merging work together to perform a single face recognition task. Each module is in charge of its own service, which runs independently as an application subtask according to the business process. Through the mobile device's decision module, some subtasks are uploaded to the edge server for processing, and the results are then returned to the mobile device over the network, while the remaining subtasks run on the mobile device. Each edge server runs k virtual machines to process the tasks uploaded by the mobile devices, and the computational capability of the edge server is denoted by f_s (the number of CPU cores multiplied by the clock speed of each core). In this paper, we assume that virtual machine resources are allocated equally; thus, each virtual machine's computing power is f_{vm} = f_s / k. Many innovative applications, including face recognition, can be executed collaboratively by a number of subtasks split along the business process. In this paper, the application is represented as a DAG, denoted G = (T, E), where the vertex set T is the set of subtasks and the edge set E records the execution-order requirements of the business process. The edge set consists of directed edges e = (t_i, t_j), where e denotes that task t_j is a successor of task t_i and task t_i is a predecessor of task t_j. According to this constraint, a subtask cannot begin execution until all of its predecessor tasks have been completed. Finally, in the graph G = (T, E), tasks without successors are called exit tasks and mark the end of the application.
As shown in Figure 1, each subtask can be offloaded to an edge server or executed locally on the mobile device, and the two options have different processing latencies. If the task is executed locally, the task latency is T_i^{MD} = C_i / f^{MD}, where f^{MD} is the mobile device's CPU frequency and C_i is the number of computation cycles required by the task. If the task is executed on an edge server, the task latency consists of three parts: task upload, task execution, and result download. The latency of these three parts is determined by the task's execution metrics together with the system's upload and download speeds. Taking task t_i as an example, the relevant metrics include the task data size data_i^u, the size of the returned result data_i^d, the instantaneous wireless uplink transmission rate R^u, the wireless downlink transmission rate R^d in the edge environment, and the CPU frequency f_{vm} of the edge-server virtual machine processing the task. When task t_i is offloaded to an edge server for execution, the task delay can therefore be expressed as
T_i^{EDGE} = \frac{data_i^u}{R^u} + \frac{C_i}{f_{vm}} + \frac{data_i^d}{R^d}.
Figure 1. Example of face recognition in MEC.
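For concreteness, the sketch below evaluates the two latency formulas for a single hypothetical subtask; the parameter values (task size, CPU frequencies, link rates) are illustrative only and are not taken from the paper's experiments.

```python
# Minimal sketch of the per-task latency model above; all values are illustrative.
def local_latency(cycles, f_md):
    """T_i^MD = C_i / f^MD: latency of local execution."""
    return cycles / f_md

def edge_latency(up_bits, down_bits, cycles, r_u, r_d, f_vm):
    """T_i^EDGE = data_i^u / R^u + C_i / f_vm + data_i^d / R^d."""
    return up_bits / r_u + cycles / f_vm + down_bits / r_d

# Hypothetical subtask: 5e7 CPU cycles, 20 kB uploaded, 5 kB of results returned.
cycles = 5e7
up_bits, down_bits = 20e3 * 8, 5e3 * 8
t_local = local_latency(cycles, f_md=1.5e9)                    # 1.5 GHz device
t_edge = edge_latency(up_bits, down_bits, cycles,
                      r_u=8e6, r_d=8e6, f_vm=10e9)             # 8 Mbps links, 10 GHz VM
print(f"local: {t_local * 1e3:.1f} ms, edge: {t_edge * 1e3:.1f} ms")
```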
The goal of the edge task offloading strategy is to make appropriate offloading decisions for all subtasks so as to minimize the total task processing latency while ensuring stable service. To that end, Section 3.2 models the latency, energy consumption, and device power in a single-edge-server, multidevice MEC scenario.

3.2. Problem Modeling

The execution plan generated for the subtasks of a mobile application represented by G = (T, E) is the ordered sequence A_{1:n} = (a_1, a_2, ..., a_n), where a_i consists of the offloading decision d_i of subtask t_i and the mobile device's WPT service time T_i^{WPT}. The subtask offloading decisions and the WPT service of the mobile device are executed sequentially according to the generated plan, and, because of the dependencies between tasks, all predecessor tasks in the plan are completed before a subtask starts. Therefore, the completion time of task t_i depends both on the completion times of its predecessors and on the times at which resources become available. To model the delay of local and offloaded processing, the upload completion time, edge server processing completion time, result download completion time, and local processing completion time of task t_i are denoted FT_i^{upload}, FT_i^{compute}, FT_i^{download}, and FT_i^{MD}, respectively. Moreover, since the WPT service shares a module with the uplink and downlink, their availability interacts; the available times of the uplink and downlink resources are denoted AT_i^{upload} and AT_i^{download}, and AT_i^{compute} and AT_i^{MD} denote the available times of the edge server and mobile device computing resources. The availability of each resource is determined by the finish time of the preceding task that used it, and is set to 0 if the resource has not yet been used. On the basis of these definitions, the task execution delay model, device energy consumption model, and device power model are constructed below.

3.2.1. Latency Model

If task t_i is offloaded to the edge server for execution, then t_i cannot start until all of its parent tasks are completed, the WPT charging has been performed, and the required resources are available. Assuming that the set of parent tasks of task t_i is parent(t_i), the task upload completion time FT_i^{upload} and uplink availability time AT_i^{upload} are obtained from the following equations:
FT_i^{upload} = \max\left( AT_i^{upload},\ \max_{t_j \in parent(t_i)} \max\left( FT_j^{MD}, FT_j^{download} \right) \right) + T_i^{upload} + T_i^{WPT},
AT_i^{upload} = \max\left( AT_{i-1}^{upload}, FT_{i-1}^{upload} \right),
T_i^{upload} = \frac{data_i^u}{R^u}.
The edge server computation completion time FT_i^{compute} of task t_i depends on the time AT_i^{compute} at which computing resources become available, the completion times of its parent tasks, and the actual computation time; since parent-task completion and resource availability must both be satisfied, the larger of the two values is taken. Once the computation finishes, the mobile device can start downloading the result only when the downlink is available, so the download start time is the larger of the downlink availability time AT_i^{download} and the computation completion time. In summary, the download completion time FT_i^{download} is given by the following equations:
FT_i^{compute} = \max\left( AT_i^{compute},\ \max\left( FT_i^{upload},\ \max_{t_j \in parent(t_i)} FT_j^{compute} \right) \right) + T_i^{compute},
FT_i^{download} = \max\left( AT_i^{download}, FT_i^{compute} \right) + T_i^{download},
AT_i^{compute} = \max\left( AT_{i-1}^{compute}, FT_{i-1}^{compute} \right),
AT_i^{download} = \max\left( AT_{i-1}^{download}, FT_{i-1}^{download} \right),
T_i^{compute} = \frac{C_i}{f_{vm}}, \quad T_i^{download} = \frac{data_i^d}{R^d}.
If task t_i is executed locally, then its start time depends on the completion times of its parent tasks, the WPT service time, and the availability of the mobile device's computational resources. FT_i^{MD} is obtained from the following equation:
FT_i^{MD} = \max\left( AT_i^{MD},\ \max_{t_j \in parent(t_i)} \max\left( FT_j^{MD}, FT_j^{download} \right) \right) + T_i^{MD} + T_i^{WPT},
AT_i^{MD} = \max\left( AT_{i-1}^{MD}, FT_{i-1}^{MD} \right).
Finally, the total latency T_{A_{1:n}}^{all} of the DAG task completed according to the execution plan A_{1:n} is given by the following equation:
T_{A_{1:n}}^{all} = \max_{t_k \in K} \max\left( FT_k^{MD}, FT_k^{download} \right),
where K is the set of exit tasks of the DAG task G; the application is complete when the last exit task finishes. Overall, the purpose of the execution plan is to guarantee the task completion rate and to minimize the execution delay and energy consumption while maintaining the power of the mobile device.
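A compact sketch of these recurrences is given below. It assumes the subtasks are visited in plan order, that the per-task times (T_i^{upload}, T_i^{compute}, T_i^{download}, T_i^{MD}, T_i^{WPT}) have been precomputed, and that the uplink, edge VM, downlink, and local CPU are each a single resource whose availability time advances with use; the function names and data layout are our own simplifications, not the paper's implementation.

```python
# Sketch of the finish-time recurrences of Section 3.2.1 (simplified, plan-ordered).
def plan_makespan(plan, parents, exit_tasks, T_up, T_comp, T_down, T_md, T_wpt):
    FT_up, FT_comp, FT_down, FT_md = {}, {}, {}, {}
    at_up = at_comp = at_down = at_md = 0.0          # resource availability times
    for i, d_i in plan:                              # plan: ordered (task id, decision)
        parent_done = max([max(FT_md.get(j, 0.0), FT_down.get(j, 0.0))
                           for j in parents[i]] or [0.0])
        if d_i:                                      # d_i = 1: offload to the edge server
            FT_up[i] = max(at_up, parent_done) + T_up[i] + T_wpt[i]
            at_up = FT_up[i]
            FT_comp[i] = max(at_comp, FT_up[i]) + T_comp[i]
            at_comp = FT_comp[i]
            FT_down[i] = max(at_down, FT_comp[i]) + T_down[i]
            at_down = FT_down[i]
        else:                                        # d_i = 0: execute locally on the MD
            FT_md[i] = max(at_md, parent_done) + T_md[i] + T_wpt[i]
            at_md = FT_md[i]
    # total latency: latest finish time among the exit tasks
    return max(max(FT_md.get(k, 0.0), FT_down.get(k, 0.0)) for k in exit_tasks)
```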

3.2.2. Energy Consumption Model

The WPT service time and the energy consumed by each subtask's offloading decision jointly determine the device power, and WPT enables mobile devices with limited power to maintain enough energy for task execution and data transfer. The energy harvested before task t_i can be described by the following equation:
CE_i = \delta P_t d^{-\theta} g T_i^{WPT},
where δ denotes the energy conversion efficiency and takes a value between 0 and 1, P_t denotes the charging power received by the mobile device, d denotes the distance between the wireless charging modules, θ represents the distance loss, and g represents the channel gain. Let the battery capacity of the device be B^{cap}; the stored energy after charging cannot exceed B^{cap}. Then, after the WPT service time T_i^{WPT} has elapsed, the power of the mobile device B_i^{CE} is
B_i^{CE} = \min\left( B_{i-1}^{remain} + CE_i,\ B^{cap} \right),
where B_{i-1}^{remain} is the remaining battery power of the mobile device after the previous task t_{i-1} has been completed. In accordance with the energy consumption model, the remaining power of the device under the two processing policies, local execution and offloading to the edge server, can be expressed as follows:
B_i^{MD} = \max\left( B_i^{CE} - E_i^{MD},\ 0 \right),
B_i^{offload} = \max\left( B_i^{CE} - E_i^{offload},\ 0 \right).
If the power is insufficient, the task is considered to have failed. With the task offloading decision denoted by d_i ∈ {0, 1}, the remaining power after processing task t_i is expressed as
B_i^{remain} = B_i^{MD} \cdot (1 - d_i) + B_i^{offload} \cdot d_i.
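The battery bookkeeping can be expressed in a few lines. The sketch below assumes the per-task execution energies E_i^{MD} and E_i^{offload} are already known, and the WPT parameter values are placeholders rather than the paper's settings; d_i = 1 selects the offload branch, matching the convention of Section 4.2.

```python
# Sketch of the energy/battery model above; parameter values are placeholders.
def battery_after_task(b_prev, t_wpt, e_md, e_offload, d_i,
                       delta=0.8, p_t=3.0, dist=1.0, theta=2.0, g=1.0, b_cap=10.0):
    ce = delta * p_t * dist ** (-theta) * g * t_wpt     # harvested energy CE_i
    b_ce = min(b_prev + ce, b_cap)                      # charging clipped at capacity
    b_md = max(b_ce - e_md, 0.0)                        # battery if executed locally
    b_off = max(b_ce - e_offload, 0.0)                  # battery if offloaded
    return b_md * (1 - d_i) + b_off * d_i               # d_i = 1 selects the offload branch

print(battery_after_task(b_prev=4.2, t_wpt=0.05, e_md=0.3, e_offload=0.1, d_i=1))
```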
In the edge environment, the offloading of DAG tasks is highly flexible, but most tasks have strict latency requirements. Solving the resulting mixed-integer linear programming problem with standard heuristic algorithms demands a large amount of processing power, consumes energy, and takes too long for edge devices to make decisions in real time. Deep reinforcement learning improves the strategy by learning over a large state and action space, which helps solve such complex problems while meeting real-time requirements. However, deep reinforcement learning is sensitive to changes in the environment and in the DAG type; when the environment changes, relearning the strategy takes considerable time and computation, which hinders practical application. This paper therefore proposes MTD3CO, a task offloading algorithm based on meta reinforcement learning, to solve this problem. Table 2 summarizes the main notation used in the text.
Table 2. Summary of main notations.

4. Design of Proposed Algorithm

In this section, we propose the edge task offloading method MTD3CO for MEC systems, which is based on meta reinforcement learning and uses the TD3 reinforcement learning algorithm. It explains the system components and the training process of the algorithm, then designs the MDP model under this architecture based on the model proposed in Section 3, and finally explains the algorithm’s design, principles, and pseudocode.

4.1. System Architecture

The MTD3CO algorithm makes full use of the MEC system capabilities by network training the algorithm on both mobile devices and edge servers. It splits the training process into two loops, with the inner-loop training for task-specific policies and the outer-loop training for meta-policies, with the inner-loop training performed on the mobile device and the outer-loop training performed on the edge server. Figure 2 shows a task offloading system that includes a mobile device and an edge server.
Figure 2. System architecture of MTD3CO task offloading.
Because different mobile devices run application tasks with different foci, the DAGs describing those applications vary. Firstly, the inner-loop learner downloads the meta-strategy from the outer-loop learner, initializes its network with the meta-strategy's parameters, and then attempts some offloading for the specific application the device is responsible for. Secondly, on the basis of this offloading experience, the inner-loop learner's network is trained to obtain its specific offloading strategy. Lastly, the edge server receives this device's inner-loop learner strategy, which completes the inner-loop learning.
At the level of the edge server, it is responsible for collecting the inner-loop training experience of all inner-loop learners and extracting their shared characteristics. Using the collected data, the edge server trains the outer-loop learner network, gets the new meta-strategy, and performs the next round of training. Once a more stable meta-strategy has been obtained, outer-loop training can be stopped. The inner-loop learner can use the learning experience contained in this meta-strategy to quickly learn the specific offloading strategy for the task it is responsible for, and, because the meta-strategy contains experience shared by all inner-loop learners, only a few loops are required to achieve better results. With this fast iterative learning, the algorithm can adapt to different DAG tasks and environments.

4.2. MDP Modeling

For the different types of tasks considered in this paper, we model them as multiple MDPs; each MDP corresponds to an individual task in meta-learning, and the aim of each task is to learn an effective offloading strategy for its MDP. We define the task distribution as ρ(T), and the MDP of each task T_i ~ ρ(T) is denoted T_i = (S, A, P, P_0, R, γ). To accommodate different task types, the learning process has two parts: in the first, each MDP learns its own specific strategy starting from the meta-strategy; in the second, the specific policies of all MDPs are distilled into a common meta-strategy, which is used in the next learning loop. According to the system model, the state, action, and reward functions of the MDPs are defined below.
(1)
State Space
When offloading a subtask, its execution latency depends on the CPU cycles required by the task, the sizes of the uploaded and downloaded data, the DAG topology, the decisions of the previous tasks, the transfer rate, the device power, and the MEC resources. The state of subtask t_i is denoted s_i = (G, A_{1:i-1}, data_i^u, C_i, data_i^d, T_i^{max}, B_{i-1}^{remain}), where G denotes the DAG topology of the task to which the subtask belongs; A_{1:i-1} = (a_1, ..., a_{i-1}) is the record of offloading decisions for the predecessors of task t_i; data_i^u, C_i, and data_i^d are the amount of data uploaded by task t_i, the amount of computation of the task, and the amount of data downloaded, respectively; T_i^{max} is the maximum tolerated delay of the task, beyond which the task is considered failed; and B_{i-1}^{remain} is the remaining power of the device before executing task t_i. To convert the set of subtasks represented by G into a sequence of subtasks while preserving the priority relationship, we index the tasks by rank(t_i) and then sort them according to this index. We define rank(t_i) as follows:
rank(t_i) = \begin{cases} T_i^{EDGE}, & t_i \in K \\ T_i^{EDGE} + \max_{t_j \in child(t_i)} rank(t_j), & t_i \notin K \end{cases}
where T_i^{EDGE} is the time interval between the start of offloading task t_i and the return of its result from the edge server, and child(t_i) is the set of direct successor tasks of t_i. According to rank(t_i), G is converted into a direct-parent index vector and a direct-child index vector for each task t_i; the size of these vectors is set to the maximum found over all applications and padded with 0 if a task has fewer entries. A short sketch of this ranking, together with the reward computation, follows this list.
(2)
Action Space
The offload strategy for each task includes the offload decision and the WPT service time, a_i = (d_i, T_i^{WPT}), where d_i denotes the offload decision (d_i ∈ {0, 1}, with 0 meaning local execution and 1 meaning offloaded execution), and T_i^{WPT} is the duration of the WPT service before the task is executed. Here, T_i^{WPT} = t, and, to keep the action discrete, t is restricted to multiples of 0.01. The WPT service duration and the offload decision jointly affect task latency and device energy consumption.
(3)
Reward Function
The optimal task scheduling decision should yield the lowest task processing latency and the lowest energy consumption for the mobile device. If a task fails to execute because of inadequate power or an execution timeout, this is called a task failure, indicated by T_i^{fail}. The energy consumed by executing task t_i is the increment of total energy consumption, \Delta E_i = E_{A_{1:i}}^{total} - E_{A_{1:i-1}}^{total}, and the incremental task latency is \Delta T_i = T_{A_{1:i}}^{all} - T_{A_{1:i-1}}^{all}. Task failure is defined as
T_i^{fail} = \begin{cases} 1, & \Delta E_i > B_{i-1}^{remain} \text{ or } \Delta T_i > T_i^{max} \\ 0, & \text{otherwise} \end{cases}
where B_{i-1}^{remain} denotes the remaining power after the MD has completed task t_{i-1}, and T_i^{max} denotes the task's maximum processing time. To minimize the total energy consumption E_{A_{1:n}}^{total} and the total delay T_{A_{1:n}}^{all} of task processing, and to improve the task success rate, the reward function is designed as the negative of task t_i's delay and energy increments plus a penalty for task failure, which can be expressed as
r_i = -\left[ \left( \varphi \Delta E_i + (1 - \varphi) \Delta T_i \right) \left( 1 - T_i^{fail} \right) + \omega T_i^{fail} \right],
where ω is the penalty factor for task failure, and φ controls the relative weight of the two optimization objectives according to demand.
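The sketch below illustrates two pieces of this MDP on a toy DAG: the rank(t_i)-based serialization of the subtasks from the state-space definition, and the failure indicator and reward just defined. The data layout and constants are our own illustrative choices.

```python
from functools import lru_cache

# Sketch of the rank(t_i) ordering and the reward function; values are illustrative.
def make_rank(children, t_edge):
    @lru_cache(maxsize=None)
    def rank(i):
        kids = children[i]                      # exit tasks have no children
        return t_edge[i] if not kids else t_edge[i] + max(rank(j) for j in kids)
    return rank

def reward(delta_e, delta_t, b_remain, t_max, phi=0.5, omega=10.0):
    failed = 1 if (delta_e > b_remain or delta_t > t_max) else 0    # T_i^fail
    return -((phi * delta_e + (1 - phi) * delta_t) * (1 - failed) + omega * failed)

# Toy DAG: task 0 precedes tasks 1 and 2, which are exit tasks.
children = {0: (1, 2), 1: (), 2: ()}
t_edge = {0: 0.4, 1: 0.2, 2: 0.3}               # T_i^EDGE per task
rank = make_rank(children, t_edge)
order = sorted(children, key=rank, reverse=True)   # higher rank is scheduled first
print(order, reward(delta_e=0.2, delta_t=0.5, b_remain=1.0, t_max=2.0))
```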

4.3. Algorithm Design

4.3.1. Seq2Seq

On the basis of the MDP defined in Section 4.2, the strategy for offloading task t_i is defined as π(a_i | s_i): when task t_i arrives at the decision module, the module makes the offloading decision a_i according to the current state s_i. Assuming that a particular DAG task G consists of n subtasks, the strategy for these n subtasks can be expressed as π(A_{1:n} | G). Because each task's offloading decision depends on the tasks before it, the chain rule can be used to express π(A_{1:n} | G) in terms of π(a_i | s_i):
\pi(A_{1:n} \mid G) = \prod_{i=1}^{n} \pi(a_i \mid s_i).
The decisions taken by each of a task's n subtasks affect how the task is executed, and the number of subtasks differs from task to task; this variation makes training difficult for traditional neural networks. The Seq2Seq deep network can accept a varying number of decisions as input: it processes this chained strategy with recurrent neurons, and its structure, consisting of an encoder and a decoder, is shown in Figure 3 and Figure 4. Both parts are implemented with recurrent neural networks; the encoder compresses the subtask sequence into uniform context vectors, and the decoder then decodes these vectors to output the policy.
Figure 3. Task offloading network based on Seq2Seq.
Figure 4. Training process of MTD3CO.
To prevent performance degradation when the context vector is long, an attention mechanism is used, which makes the decoder focus on the encoder outputs most relevant to the current output; during decoding, context vector elements closer to the current position are given greater weight. If the encoder's input is the subtask state sequence (s_1, s_2, ..., s_n), the decoder's output is the corresponding policy (a_1, a_2, ..., a_n), the encoder function is f_{encoder}, and the decoder function is f_{decoder}, then the encoder's hidden output for the i-th task can be represented as
e_i = f_{encoder}(s_i, h_{i-1}).
The context vector can be expressed as c = (e_1, e_2, ..., e_n); the decoder then decodes step j using the previous action a_{j-1} and the previous decoder output d_{j-1}, and its decoding formula can be expressed as
d_j = f_{decoder}(c_j, d_{j-1}, a_{j-1}),
where c_j is the attention-weighted sum of the context vector elements, calculated as follows:
c_j = \sum_{i=1}^{n} \mu_{ji} e_i,
where μ_{ji} is a probability distribution, which is calculated as follows:
\mu_{ji} = \mathrm{softmax}\left( \mathrm{score}(e_i, d_{j-1}) \right),
where score(e_i, d_{j-1}) is a function that measures the degree of matching between e_i and d_{j-1}; in the literature, this function is implemented as a trainable feedforward neural network. Lastly, the TD3 algorithm is adapted to the task offloading model by transforming the decoder output d = (d_1, d_2, ..., d_n) into the policy and value networks through two fully connected layers.
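The attention step can be written in a few lines of NumPy; the sketch below uses a simple dot-product score purely for illustration, whereas the paper's score function is a small trainable feedforward network.

```python
import numpy as np

# Sketch of the attention-weighted context c_j; dot-product score is illustrative only.
def attention_context(encoder_outputs, d_prev):
    scores = encoder_outputs @ d_prev            # score(e_i, d_{j-1}) for every i
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # mu_{ji} = softmax(score)
    return weights @ encoder_outputs             # c_j = sum_i mu_{ji} e_i

e = np.random.randn(5, 16)      # five encoded subtask states, 16-dimensional
d_prev = np.random.randn(16)    # previous decoder output
print(attention_context(e, d_prev).shape)        # -> (16,)
```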

4.3.2. MTD3CO (Meta TD3 Computation Offloading) Implementation

In order to make DAG task offloading adaptable to different tasks and thus achieve generalization, this section combines meta-learning with the TD3 algorithm [24] to improve the algorithm's adaptability to the environment. After the initial training is complete, the algorithm initializes the inner-loop learner on the mobile device with the meta-strategy; the strategy is then fine-tuned with a small number of task-specific samples so that the inner-loop learner's strategy fits the specific tasks of that mobile device.
For the actor–critic method, there is an unavoidable problem of overestimation due to cumulative errors. First, the TD3 algorithm uses the idea of double Q-learning by using two independent value functions, and using the smaller value between them in the update to reduce the deviation caused by overestimation. Second, when performing TD updates, the errors made at each step add up and make the estimation variance too high; thus, the TD3 algorithm uses the target network and delayed updates to solve this problem.
Lastly, the TD3 algorithm enhances exploration by adding noise to the target actions, which smooths the value estimates in the region around the selected actions and reduces error accumulation. On the basis of these improvements, let the policy network in the TD3 algorithm be π_ϕ(s) and the two value networks be Q_{θ_1}(s, a) and Q_{θ_2}(s, a), with parameters ϕ, θ_1, and θ_2, respectively; then the target action and expected return of the task offloading policy at each training step can be expressed as
\tilde{a} = \pi_{\phi'}(s') + \mathrm{clip}\left( \mathcal{N}(0, \tilde{\sigma}), -c, c \right), \qquad y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a}),
where ϕ′, θ′_1, and θ′_2 denote the parameters of the target policy network and the two target value networks, respectively, and clip(N(0, σ̃), −c, c) is the clipped noise that keeps the action perturbation within a small range; γ ∈ (0, 1) denotes the discount factor. Using the shared target y and the outputs of the current value networks Q_{θ_1}(s, a) and Q_{θ_2}(s, a), the TD error is computed and each current value network is updated by minimizing it, with the objective function defined as
J_{critic}^{TD3}(\theta_i) = N^{-1} \sum \left( y - Q_{\theta_i}(s, a) \right)^2, \quad i = 1, 2.
For the update of the policy network, the objective is to find the policy parameters ϕ with the largest expected return. In the actor-critic framework, this is done with the deterministic policy gradient algorithm, which uses gradient ascent on the expected return. For the actor network in this section, the objective function can therefore be expressed as
J_{actor}^{TD3}(\phi) = N^{-1} \sum Q_{\theta_1}(s, a) \big|_{a = \pi_\phi(s)}.
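The update arithmetic can be summarized as below, with the actor, critics, and their target copies stubbed out as plain callables; this is a sketch of the TD3 target and loss equations only, not of the full Seq2Seq networks used in MTD3CO.

```python
import numpy as np

# Sketch of the TD3 target and loss computations above (networks are stubbed callables).
def td3_target(r, s_next, actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma=0.2, c=0.5):
    noise = np.clip(np.random.normal(0.0, sigma, size=r.shape), -c, c)
    a_tilde = actor_target(s_next) + noise                           # smoothed target action
    q_min = np.minimum(critic1_target(s_next, a_tilde),
                       critic2_target(s_next, a_tilde))              # clipped double Q
    return r + gamma * q_min                                         # y

def critic_loss(y, q_pred):
    return np.mean((y - q_pred) ** 2)                                # J_critic^TD3

def actor_objective(q1_at_pi):
    return np.mean(q1_at_pi)                                         # J_actor^TD3 (maximized)

batch = 4
y = td3_target(np.zeros(batch), np.zeros((batch, 8)),
               actor_target=lambda s: np.zeros(batch),
               critic1_target=lambda s, a: np.ones(batch),
               critic2_target=lambda s, a: 2 * np.ones(batch))
print(critic_loss(y, q_pred=np.ones(batch)))
```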
According to the model established in Section 4.2, we model the offloading tasks corresponding to different types of applications as multiple MDPs, each of which is responsible for generating an offloading policy for one class of tasks. Formally, the task distribution is defined as ρ(T), and each task follows this distribution, T_i ~ ρ(T). MTD3CO shares the structure of gradient-based meta reinforcement learning and likewise has two parts: inner-loop learning and outer-loop learning.
The inner-loop learning combines the Seq2Seq network with the TD3 algorithm, enabling its actor-critic networks to be trained in environments with different numbers of subtasks; compared with the VPG algorithm in the literature [25], it has better exploration ability and training stability. For each learning task T_i, we define the value network objective function as J_{critic}^{TD3}(θ_{1,2}^{T_i}), minimized with respect to the TD error, and, as shown in Algorithm 1, use its gradient to update θ_1 and θ_2; each task is updated a certain number of times to obtain its set of value network parameters. Similarly, for the policy network, we define the objective function J_{actor}^{TD3}(ϕ^{T_i}), maximized with respect to the expected return, and update the parameters ϕ^{T_i} with its gradient.
In outer-loop learning, following the model-agnostic meta-learning theory proposed in [7], the outer-loop objective function can be obtained; for the meta-critic network of the outer loop, it is defined as follows:
J_{critic}^{MTD3CO}(\theta_{1,2}) = \mathbb{E}_{T_i \sim \rho(T)} \left[ J_{critic}^{TD3}\left( U(\theta_{1,2}^{T_i}, T_i) \right) \right],
where J_{critic}^{TD3} is the objective function of the inner-loop critic network for task T_i, and U(\theta_{1,2}^{T_i}, T_i) denotes the parameters after the inner-loop gradient updates for that task, defined as U(\theta_{1,2}^{T_i}, T_i) = \theta_{1,2}^{T_i} + \alpha \sum_{t=1}^{k} \nabla_{\theta_{1,2}^{T_i}} J_{critic}^{TD3}(\theta_{1,2}^{T_i, t}), where k is the number of inner-loop gradient updates. The outer loop optimizes this objective by gradient updates of the critic meta-parameters, but the objective involves gradients of gradients, which are very costly to compute for a complex network such as Seq2Seq. To solve this problem, we adopt the first-order approximation method of [26]; the gradient of J_{critic}^{MTD3CO}(\theta_{1,2}) can then be expressed as
\nabla_{\theta_{1,2}} J_{critic}^{MTD3CO}(\theta_{1,2}) = grad_{critic}^{MTD3CO} = \frac{1}{n} \cdot \frac{1}{k} \sum_{i=1}^{n} \sum_{t=1}^{k} \nabla_{\theta_{1,2}^{T_i}} J_{critic}^{TD3}(\theta_{1,2}^{T_i, t}),
where n denotes the number of sampled tasks and k denotes the number of inner-loop gradient updates. Similarly, for the meta-actor network of the outer loop, the objective function and its approximate gradient are
J_{actor}^{MTD3CO}(\phi) = \mathbb{E}_{T_i \sim \rho(T)} \left[ J_{actor}^{TD3}\left( F(\phi^{T_i}, T_i) \right) \right],
F(\phi^{T_i}, T_i) = \phi^{T_i} + \alpha \sum_{t=1}^{k} \nabla_{\phi^{T_i}} J_{actor}^{TD3}(\phi^{T_i, t}),
\nabla_{\phi} J_{actor}^{MTD3CO}(\phi) = grad_{actor}^{MTD3CO} = \frac{1}{n} \cdot \frac{1}{k} \sum_{i=1}^{n} \sum_{t=1}^{k} \nabla_{\phi^{T_i}} J_{actor}^{TD3}(\phi^{T_i, t}).
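The first-order rule reduces the meta-update to averaging the stored inner-loop gradients; a minimal sketch follows, assuming the gradients have been collected as flat NumPy vectors per task and per inner step.

```python
import numpy as np

# Sketch of the first-order meta-gradient: average the recorded inner-loop gradients
# over n tasks and k inner steps, then take one meta step of size beta.
def meta_update(meta_params, inner_grads, beta=1e-3):
    n = len(inner_grads)                 # number of sampled tasks
    k = len(inner_grads[0])              # inner-loop gradient steps per task
    avg = sum(g for task in inner_grads for g in task) / (n * k)
    return meta_params + beta * avg      # ascent step, matching Algorithm 1

theta = np.zeros(8)
grads = [[np.random.randn(8) for _ in range(6)] for _ in range(10)]   # n = 10, k = 6
theta = meta_update(theta, grads)
print(theta.shape)
```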
According to the objective function, we describe the overall training procedure in Algorithm 1.
Algorithm 1. Meta TD3 computation offloading
Input: task distribution ρ(T) over DAG tasks. Output: meta-policy parameters ϕ, θ_1, θ_2
  • Randomly initialize the parameters ϕ, θ_1, and θ_2 of the policy network π_ϕ and the two value networks Q_{θ_1} and Q_{θ_2}
  • Initialize the target network parameters ϕ′ ← ϕ, θ′_1 ← θ_1, θ′_2 ← θ_2
  • for iteration k = 1, 2, ..., K do
  • Randomly sample n tasks {T_1, T_2, ..., T_n} from the task distribution ρ(T)
  • for task T_i, i = 1, 2, ..., n do
  • Initialize ϕ^i ← ϕ, θ_1^i ← θ_1, θ_2^i ← θ_2 and ϕ′^i ← ϕ′, θ′_1^i ← θ′_1, θ′_2^i ← θ′_2
  • Initialize the experience pool D_i
  • Sample task T_i following a ~ π_{ϕ^i}(s) + ε, ε ~ N(0, σ), and store the trajectories in D_i
  • for inner-loop step t = 1, 2, ..., k do
  • Sample N trajectories from the experience pool D_i
  • Update θ_1^i and θ_2^i with a mini-batch gradient step on the objective J_{critic}^{TD3}
  • if t mod d == 0 then
  • Update ϕ^i with a mini-batch gradient step on the objective J_{actor}^{TD3}
  • Update the target network parameters:
  • ϕ′^i ← τ ϕ^i + (1 − τ) ϕ′^i
  • θ′_1^i ← τ θ_1^i + (1 − τ) θ′_1^i
  • θ′_2^i ← τ θ_2^i + (1 − τ) θ′_2^i
  • end if
  • Accumulate the inner-loop gradients for grad_{actor}^{MTD3CO} and grad_{critic}^{MTD3CO}
  • end for
  • end for
  • Update the meta-parameters ϕ ← ϕ + β ∇_ϕ J_{actor}^{MTD3CO}(ϕ)
  • Update the meta-parameters θ_{1,2} ← θ_{1,2} + β ∇_{θ_{1,2}} J_{critic}^{MTD3CO}(θ_{1,2})
  • end for
In Algorithm 1, ϕ denotes the meta-policy parameters and θ_1 and θ_2 denote the meta value network parameters; the algorithm trains these meta-parameters through inner-loop and outer-loop learning. Learning tasks are first sampled, and inner-loop training is then performed for each sampled task. When all inner-loop training is completed, the meta-parameters are updated using the two meta-update steps at the end of Algorithm 1, after which the next round of inner- and outer-loop training proceeds.

4.4. Analysis of Algorithm Time Complexity

For the proposed MTD3CO offloading strategy, the main computational cost lies in the inner loop and the outer loop. In the inner loop, the computational complexity is determined by the size of the state space and action space and by the network. The computational complexity of the outer loop is O(nK), where n is the number of tasks and K is the number of iterations.

5. Experimental Evaluation

In order to evaluate the proposed model and algorithm, this section designs a simulated experimental environment, introduces the algorithm's hyperparameters, and presents a set of experiments to evaluate its effectiveness. In the simulation experiments, we implemented an edge system simulator of the proposed model and generated a number of different applications represented by DAGs in this simulator to train and test the proposed algorithm.

5.1. Experimental Setup

5.1.1. Parameter Settings

The MTD3CO algorithm was implemented in TensorFlow. The encoder-decoder network consisted of two LSTM (long short-term memory) layers with 256 hidden units each, followed by fully connected layers forming the strategy network π_ϕ and the two value networks Q_{θ_1} and Q_{θ_2}. Both the inner-loop and the outer-loop learning rates were set to 5 × 10^{-4}. The algorithm's noise parameter σ̃ was set to 0.2, and the clipping parameter c was set to 0.5, so the noise was confined to the range (−0.5, 0.5). In inner-loop training, the number of gradient updates k was set to 6, the delayed-update parameter d of the target network was set to 2, and the target network's learning rate τ was set to 0.005. See Table 3 for the related parameter settings.
Table 3. Parameters used in the MTD3CO algorithm.
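For reference, the settings above can be collected in a single configuration object; the key names below are our own shorthand, while the values follow the text and Table 3.

```python
# Hyperparameters from the text and Table 3; the key names are our own shorthand.
MTD3CO_CONFIG = {
    "lstm_layers": 2,             # encoder-decoder: two LSTM layers
    "lstm_hidden_units": 256,
    "inner_lr": 5e-4,             # inner-loop learning rate
    "outer_lr": 5e-4,             # outer-loop (meta) learning rate
    "target_noise_sigma": 0.2,    # sigma~ of the target-action noise
    "noise_clip": 0.5,            # c, so the noise lies in (-0.5, 0.5)
    "inner_gradient_steps": 6,    # k
    "policy_delay": 2,            # d, delayed actor/target updates
    "target_tau": 0.005,          # tau for soft target-network updates
}
```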

5.1.2. Types of Tasks

Many real-world applications can be represented by DAGs with different topologies and different numbers of subtasks. When the topological features and the number of subtasks are comparable, the DAGs can represent similar applications, and their offloading strategies share many similarities. In this paper, a DAG task generator is implemented in accordance with the literature [25] to generate different DAG datasets that simulate a variety of application tasks.
Four primary parameters control the DAG's topology and characteristics: the number of subtasks N, the DAG width, the DAG density, and the task's computation-communication ratio. The DAG width indicates the number of concurrent subtasks: for the same number of subtasks, a greater width means more subtasks can be executed concurrently. The DAG density reflects how strongly subtasks depend on each other: higher values mean more links between subtasks. The computation-communication ratio controls the task's characteristics: the delay of task offloading is composed of network communication time and computation time, and the larger this ratio, the larger the proportion of computation time. A rough sketch of such a generator is given below.
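The sketch below is our own simplified, layer-based construction driven by the structural parameters (N, width, density); it is not the exact generator of the cited literature.

```python
import random

# Simplified layered random-DAG generator controlled by N, width, and density.
def random_dag(n_tasks, width, density, seed=0):
    rng = random.Random(seed)
    layer_size = max(1, round(width * n_tasks ** 0.5))   # wider -> more parallel tasks
    layers, t = [], 0
    while t < n_tasks:
        size = min(layer_size, n_tasks - t)
        layers.append(list(range(t, t + size)))
        t += size
    edges = []
    for upper, lower in zip(layers, layers[1:]):
        for child in lower:
            parents = [p for p in upper if rng.random() < density] or [rng.choice(upper)]
            edges.extend((p, child) for p in parents)     # denser -> more dependencies
    return edges

print(len(random_dag(n_tasks=20, width=0.8, density=0.6)))
```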
For the MEC environment, the uplink rate R^u and the downlink rate R^d = 8 Mbps were configured, with some loss as the distance between the device and the signal source increases; the edge server's virtual machine frequency f_{vm} was set to 10 GHz; the clock speed of the mobile device was set to 1.5 GHz; the upload and download powers were P_{upload} = 0.5 W and P_{download} = 0.6 W; the charging power of the WPT service was P_t = 3 W; and the battery capacity of the mobile device was 10 Wh.

5.1.3. Experimental Environment Settings

To evaluate the effectiveness of the proposed MTD3CO algorithm, this section simulates a comparison of task offloading strategies when the application and the MEC environment change. Three scenarios were created to evaluate the energy consumption and latency of all algorithms. To simplify the problem, the data size and required computation of each subtask were assumed to lie within fixed ranges: each subtask's data size is between 5 kB and 50 kB, and the CPU cycles required per subtask are between 10^7 and 10^8. The computation-communication ratio of the tasks was set to a random value between 0.5 and 0.8, since most mobile applications are computation-intensive; the higher this ratio, the longer the computational part of a task takes relative to communication.
In this paper, a task offloading model for MEC was developed by considering the computational performance, signal range, and geographical location of the edge server. The simulation environment was deployed on Ubuntu 18, and the meta reinforcement learning model was implemented with the mainstream TensorFlow machine learning framework, using datasets from [23]. Throughout the experiments, we divided the data produced by the DAG task generator into training and test datasets, with each set of DAG parameters differing in width, density, and number of subtasks. Each dataset contained 100 DAG tasks with the same parameters but different topologies, simulating the subtask relationships of one application. The ultimate goal was to find the optimal offloading strategy for all applications being learned. The MTD3CO algorithm performed inner- and outer-loop training on multiple training datasets using Algorithm 1: the DAG tasks in each dataset were trained with the same inner loop to learn the optimal offloading strategy for that application, while the outer loop summarized the commonality of the inner-loop training to obtain the meta-strategy, which was then used as the initial strategy for the next round of inner-loop training on specific DAG datasets; outer-loop training continued until the meta-strategy converged. Lastly, the converged meta-strategy was used to initialize the network parameters on a test dataset in order to evaluate the offloading strategy's effectiveness.

5.2. Comparison Study and Discussion

We compared our approach MTD3CO with existing methods: (1) the improved deep reinforcement learning algorithm in [27], (2) the HEFT-based heuristic algorithm, and (3) the greedy algorithm. These methods were chosen because they are common algorithms when solving task offloading and are similar to our study.
To train the MTD3CO strategy, the DAG task generator was used to generate 20 training datasets with different DAG parameters; each set contained 100 DAG tasks with the same width and density, each with 20 subtasks and density and width values drawn from {0.5, 0.6, 0.7, 0.8}. Each dataset represented one mobile device's application preference, and finding an effective offloading strategy for each dataset was treated as a learning task in the MTD3CO algorithm. The training datasets were then used to train the MTD3CO algorithm. In Algorithm 1, the number of sampled tasks n was set to 10, i.e., 10 of the 20 training datasets were selected; each dataset sampled N trajectories for gradient updates with k = 6 gradient steps, and the target network and the strategy network were updated once for every two value network updates.
Figure 4 shows the average reward during the training process. When the number of training iterations reached about 200, the average reward increased substantially, indicating that the strategy started to work and moved in a better direction; it finally stabilized around −6 and converged. When the meta-strategy converges, the meta-strategy network summarizes the commonality of offloading across different mobile applications, so effective task offloading can be performed for different applications. However, because all applications are considered jointly, the meta-strategy is not yet optimal for any specific application; for a specific application, only a small number of inner-loop learning iterations are needed to reach the best strategy. Below, several changes are made to the environment to compare how well the meta-strategy adapts to new environments.
When meta-strategy training was completed, some test datasets different from the training datasets were randomly generated in order to verify the adaptability of the MTD3CO algorithm to new tasks and new environments. For a task that needs to be scheduled, the metrics users care most about are latency and mobile device energy consumption; the experiments therefore mainly compare the algorithms on these two metrics under new tasks. Three relevant experiments are described below.

5.2.1. Task Scenario Description

In the first experiment, to test the algorithm's performance when the dependency structure between application subtasks changes, the DAG task generator produced a test dataset with a density different from the training datasets, simulating the offloading performance when a mobile device encounters this novel application. The performance of MTD3CO and the baseline algorithms was then compared after a small number of offloading iterations on the test dataset. The parameters of this test dataset were: number of subtasks N = 20, density = 0.4, and width = 0.8; the density did not appear in the training datasets, while both the number of subtasks and the width did. The energy consumption and latency over a small number of iterations are shown in Figure 5. HEFT performed well at the beginning because it predicts the partial offloading and selects a relatively good strategy, but it cannot improve its strategy with more iterations, so its performance remained fixed and suboptimal. In this application, MTD3CO outperformed all algorithms in both energy consumption and latency after five iterations, indicating that the meta-strategy had adapted to this application. The greedy algorithm, because it considers only one factor, had the highest latency and relatively high energy consumption. The MTD3CO algorithm outperformed the HEFT method in latency and energy consumption after only a small amount of training, whereas the DRL method remained inferior to HEFT on both metrics, and its energy consumption was less stable because it does not consider energy consumption. This shows that MTD3CO is more flexible and adapts better to new tasks and applications. The results of the greedy and HEFT schemes do not depend on the number of iterations, so their energy consumption and latency values remain constant as the iterations increase.
Figure 5. Scenario 1 comparison of delay and energy consumption after a few iterations.
The second experiment tested the performance of the algorithms when subtask concurrency varies. The DAG width of the test dataset differed from the training datasets, with a width of 0.9 in this scenario, simulating the extreme case of offloading an application with high subtask concurrency. As shown in Figure 6, HEFT no longer matched DRL and MTD3CO in terms of latency; such a fixed heuristic strategy tends to fall into suboptimal solutions when new applications are encountered. In terms of energy consumption, the DRL algorithm did not improve because it does not consider device energy consumption and takes only latency as its optimization objective. When faced with new applications, both MTD3CO and DRL outperformed HEFT after a small number of updates, because both adjust the strategy online using reinforcement learning's ability to adapt to the environment and are therefore more adaptable than fixed-policy algorithms such as HEFT. MTD3CO adapted to new tasks faster than the conventional DRL algorithm because its inner- and outer-loop learning lets it use prior learning experience to guide the learning of new tasks.
Figure 6. Scenario 2 comparison of delay and energy consumption after a few iterations.

5.2.2. Task Type Description

When the number of subtasks changes, the scale of the problem changes and the algorithm is tested more severely. To compare the performance of each algorithm as the problem scale increases, a scenario with N = 30 subtasks, a density of 0.6, and a width of 0.6 was generated. This combination of width and density appeared in the training datasets, but the number of tasks increased; although the characteristics were the same, the problem was more complex. As shown in Figure 7, in terms of latency, the average latency of the DRL algorithm improved by only 20 s after 20 iterations, indicating that it was still exploring the environment, with many poorly performing strategies and no good strategy found yet, whereas MTD3CO surpassed DRL after two iterations, indicating that it successfully used the meta-strategy to guide learning and adapted well to the increased problem complexity.
Figure 7. Comparison of delay and energy consumption when the number of subtasks changes.

6. Concluding Remarks and Future Directions

The task offloading problem of IoT devices in complex edge environments running various applications was investigated in this paper, assuming dependencies between tasks and using a DAG to represent subtask offloading. We proposed MTD3CO, a task offloading strategy based on meta reinforcement learning, to improve the algorithm's adaptability to the environment and to new applications. First, we built the system model for mobile applications with different DAG types, including the data and energy transfer models, as well as the system's latency model, energy consumption model, and MDP model. Using the concept of meta reinforcement learning, the task offloading process in MEC was modeled as multiple MDPs, the offloading decision was transformed into a sequential prediction process based on the characteristics of subtask execution, a Seq2Seq-based parameter-sharing network was designed to fit the optimal offloading decision, this network was used to improve the TD3 algorithm, and a meta reinforcement learning algorithm was proposed. Training consists of inner-loop and outer-loop phases: the outer loop produces meta-parameters that initialize the inner-loop network, and the inner loop fine-tunes the parameters to adapt quickly to each specific application. Lastly, comparison experiments assessed the algorithm's ability to adjust quickly in a variety of MEC environments and applications, and the results show that the proposed algorithm can quickly adjust its strategy to adapt to the environment. Because the model in this paper ignores some real-world signal effects and losses and uses a simplified problem model, future research will focus on designing a more realistic problem model. Furthermore, the deep reinforcement learning algorithm employed is not cutting-edge; new algorithms are under active research, and applying more advanced reinforcement learning algorithms will be a future focus. Lastly, achieving unified system adaptation for heterogeneous IoT devices is a challenging direction for future work.

Author Contributions

W.D. and C.G. conceptualized the original idea and completed the theoretical analysis; Q.J. designed the technique road and supervised the research; H.Y. and Z.D. completed the numerical simulations and improved the system model and algorithm of the article and drafted the manuscript; Q.M. designed and performed the experiments. All authors provided useful discussions and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the Shanghai Sailing Program (No. 20YF1410900), the Shanghai Natural Science Foundation (23ZR1414900), the National Natural Science Foundation (No. 61472139), the Shanghai Automobile Industry Science and Technology Development Foundation (No. 1915), and the Shanghai Science and Technology Innovation Action Plan (No. 20dz1201400). Any opinions, findings, and conclusions are those of the authors, and do not necessarily reflect the views of the above agencies.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created.

Acknowledgments

The authors sincerely thank the School of Information Science and Engineering, East China University of Science and Technology for providing the research environment. The authors would like to thank all anonymous reviewers for their invaluable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cisco. Cisco Annual Internet Report (2018–2023) White Paper; Cisco: San Jose, CA, USA, 2020. [Google Scholar]
  2. Kekki, S.; Featherstone, W.; Fang, Y.; Kuure, P.; Li, A.; Ranjan, A.; Purkayastha, D.; Feng, J.; Frydman, D.; Verin, G.; et al. MEC in 5G Networks. ETSI White Pap. 2018, 28, 1–28. [Google Scholar]
  3. Ullah, M.A.; Alvi, A.N.; Javed, M.A.; Khan, M.B.; Hasanat, M.H.A.; Saudagar, A.K.J.; Alkhathami, M. An Efficient MAC Protocol for Blockchain-Enabled Patient Monitoring in a Vehicular Network. Appl. Sci. 2022, 12, 10957. [Google Scholar] [CrossRef]
  4. Zhang, H.; Guo, J.; Yang, L.; Li, X.; Ji, H. Computation offloading considering fronthaul and backhaul in small-cell networks integrated with MEC. In Proceedings of the IEEE INFOCOM 2017-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Atlanta, GA, USA, 1–4 May 2017. [Google Scholar]
  5. Alvi, A.N.; Javed, M.A.; Hasanat, M.H.A.; Khan, M.B.; Saudagar, A.K.J.; Alkhathami, M.; Farooq, U. Intelligent Task Offloading in Fog Computing Based Vehicular Networks. Appl. Sci. 2022, 12, 4521. [Google Scholar] [CrossRef]
  6. Liang, J.; Li, K.; Liu, C.; Li, K. Joint offloading and scheduling decisions for DAG applications in mobile edge computing. Neurocomputing 2021, 424, 160–171. [Google Scholar] [CrossRef]
  7. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  8. Liu, M.; Yu, F.R.; Teng, Y.; Leung, V.C.; Song, M. Distributed resource allocation in blockchain-based video streaming systems with mobile edge computing. IEEE Trans. Wirel. Commun. 2018, 18, 695–708. [Google Scholar] [CrossRef]
  9. Lin, J.; Chai, R.; Chen, M.; Chen, Q. Task execution cost minimization-based joint computation offloading and resource allocation for cellular D2D systems. In Proceedings of the 2018 IEEE 29th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Bologna, Italy, 9–12 September 2018; pp. 1–5. [Google Scholar]
  10. Bi, S.; Zhang, Y.J. Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading. IEEE Trans. Wirel. Commun. 2018, 17, 4177–4190. [Google Scholar] [CrossRef]
  11. Fan, W.; Liu, Y.; Tang, B.; Wu, F.; Wang, Z. Computation offloading based on cooperations of mobile edge computing-enabled base stations. IEEE Access 2017, 6, 22622–22633. [Google Scholar] [CrossRef]
  12. Tareen, F.N.; Alvi, A.N.; Malik, A.A.; Javed, M.A.; Khan, M.B.; Saudagar, A.K.J.; Alkhathami, M.; Abul Hasanat, M.H. Efficient Load Balancing for Blockchain-Based Healthcare System in Smart Cities. Appl. Sci. 2023, 13, 2411. [Google Scholar] [CrossRef]
  13. Liu, J.; Zhang, Q. Code-partitioning offloading schemes in mobile edge computing for augmented reality. IEEE Access 2019, 7, 11222–11236. [Google Scholar] [CrossRef]
  14. Samy, A.; Elgendy, I.A.; Yu, H.; Zhang, W.; Zhang, H. Secure Task Offloading in Blockchain-Enabled Mobile Edge Computing with Deep Reinforcement Learning. IEEE Trans. Netw. Serv. Manag. 2022, 19, 4872–4887. [Google Scholar] [CrossRef]
  15. Arkian, H.R.; Diyanat, A.; Pourkhalili, A. MIST: Fog-based data analytics scheme with cost-efficient resource provisioning for IoT crowdsensing applications. J. Netw. Comput. Appl. 2017, 82, 152–165. [Google Scholar] [CrossRef]
  16. Ma, Y.; Wang, H.; Xiong, J.; Diao, J.; Ma, D. Joint allocation on communication and computing resources for fog radio access networks. IEEE Access 2020, 8, 108310–108323. [Google Scholar] [CrossRef]
  17. Alhelaly, S.; Muthanna, A.; Elgendy, I.A. Optimizing Task Offloading Energy in Multi-User Multi-UAV-Enabled Mobile Edge-Cloud Computing Systems. Appl. Sci. 2022, 12, 6566. [Google Scholar] [CrossRef]
  18. Zhang, C.; Liu, Z.; Gu, B.; Yamori, K.; Tanaka, Y. A deep reinforcement learning based approach for cost-and energy-aware multi-flow mobile data offloading. IEICE Trans. Commun. 2018, 101, 1625–1634. [Google Scholar] [CrossRef]
  19. Lu, H.; Gu, C.; Luo, F.; Ding, W.; Liu, X. Optimization of lightweight task offloading strategy for mobile edge computing based on deep reinforcement learning. Future Gener. Comput. Syst. 2020, 102, 847–861. [Google Scholar] [CrossRef]
  20. Li, X.; Xu, Z.; Fang, F.; Fan, Q.; Wang, X.; Leung, V.C.M. Task Offloading for Deep Learning Empowered Automatic Speech Analysis in Mobile Edge-Cloud Computing Networks. IEEE Trans. Cloud Comput. [CrossRef]
  21. Botvinick, M.; Ritter, S.; Wang, J.X.; Kurth-Nelson, Z.; Blundell, C.; Hassabis, D. Reinforcement learning, fast and slow. Trends Cogn. Sci. 2019, 23, 408–422. [Google Scholar] [CrossRef] [PubMed]
  22. Qu, G.; Wu, H.; Li, R.; Jiao, P. Dmro: A deep meta reinforcement learning-based task offloading framework for edge-cloud computing. IEEE Trans. Netw. Serv. Manag. 2021, 18, 3448–3459. [Google Scholar] [CrossRef]
  23. Li, J.; Gao, H.; Lv, T.; Lu, Y. Deep reinforcement learning based computation offloading and resource allocation for MEC. In Proceedings of the 2018 IEEE Wireless Communications and Networking Conference (WCNC), Barcelona, Spain, 15–18 April 2018; pp. 1–6. [Google Scholar]
  24. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  25. Arabnejad, H.; Barbosa, J.G. List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel Distrib. Syst. 2013, 25, 682–694. [Google Scholar] [CrossRef]
  26. Nichol, A.; Achiam, J.; Schulman, J. On first-order meta-learning algorithms. arXiv 2018, arXiv:1803.02999. [Google Scholar]
  27. Wang, J.; Hu, J.; Min, G.; Zhan, W.; Ni, Q.; Georgalas, N. Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning. IEEE Commun. Mag. 2019, 57, 64–69. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
