Article

A DRL-Based Task Offloading Scheme for Server Decision-Making in Multi-Access Edge Computing

Department of Computer Software, Hanyang University, Seoul 04763, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3882; https://doi.org/10.3390/electronics12183882
Submission received: 18 August 2023 / Revised: 7 September 2023 / Accepted: 11 September 2023 / Published: 14 September 2023
(This article belongs to the Special Issue Intelligent IoT Systems with Mobile/Multi-Access Edge Computing (MEC))

Abstract

Multi-access edge computing (MEC), based on hierarchical cloud computing, offers abundant resources to support the next-generation Internet of Things network. However, several critical challenges, including offloading methods, network dynamics, resource diversity, and server decision-making, remain open. Regarding offloading, most conventional approaches have neglected or oversimplified multi-MEC server scenarios, fixating on single-MEC instances. This myopic focus fails to adapt to computational offloading during MEC server overload, rendering such methods sub-optimal for real-world MEC deployments. To address this deficiency, we propose a solution that employs a deep reinforcement learning-based soft actor-critic (SAC) approach for computation offloading and MEC server decision-making in multi-user, multi-MEC server environments. Numerical experiments were conducted to evaluate the performance of our proposed solution. The results demonstrate that our approach significantly reduces latency, enhances energy efficiency, and achieves rapid and stable convergence, thereby highlighting the algorithm’s superior performance over existing methods.

1. Introduction

In recent years, advances in wireless technology combined with the widespread adoption of the Internet of Things have paved the way for innovative computation-intensive applications, which include augmented reality (AR), mixed reality (MR), virtual reality (VR), online gaming, intelligent transportation, and industrial and home automation. Consequently, the demand for these applications has surged [1,2,3]. The number of IoT devices was projected to reach 24 billion by 2020. This tremendous increase signifies that many smart devices (SDs) and sensors have been responsible for generating and processing an immense volume of data [4].
To cater to these computationally intensive applications, substantial computing resources and high performance are required. Addressing the escalating need for energy efficiency and managing the swift influx of user requests has emerged as significant challenges [5]. Initially, mobile cloud computing (MCC) was considered a viable solution for processing computationally intensive tasks. However, as the demand for real-time processing increased, the limitations of MCC became apparent [6], resulting in the introduction of mobile edge computing (MEC) as a potential solution to meet this burgeoning demand [7].
Multi-access edge computing [8] is effective at deploying computing resources close to SDs and at supporting collaborative radio resource management (CRRM) and collaborative signal processing (CRSP). Conversely, a cloud radio access network (C-RAN) employs centralized signal processing and resource allocation, efficiently catering to user requirements [9]. Collectively, the attributes of these technologies have the potential to fulfill diverse requirements in upcoming artificial intelligence (AI)-based wireless networks [10].
Leveraging MEC to offload tasks is a promising approach to curtail network latency and conserve energy. Specifically, MEC addresses the computational offloading requirements of IoT devices by processing tasks closer to the edge rather than relying solely on a central cloud [11]. However, since the task offloading problem is recognized as a non-deterministic polynomial-time hard (NP-hard) problem [12], addressing it is challenging. Although most research in this area has leaned toward heuristic or convex optimization algorithms, the increasing complexity of MEC coupled with varying radio channel conditions makes it difficult to consistently guarantee optimal performance using these conventional methods. Given that optimization problems often require frequent resolution, meticulous planning is imperative for designing and managing future MEC networks.
In recent years, deep reinforcement learning (DRL), a subset of AI, has gained significant attention owing to its ability to tackle complex challenges across various sectors. As IoT networks become more distributed, the need for decentralized decision-making to enhance throughput and reduce power consumption increases, with DRL serving as a key tool. The emergence of the multi-access edge computing (MEC) paradigm has added complexity to multi-user, multi-server environments, bringing data-offloading decision-making to the forefront [13]. This MEC landscape necessitates addressing both user behavioral aspects and server pricing policies. A recent study combined prospect theory and the tragedy of the commons to model user satisfaction and potential server overexploitation, highlighting the intricate nature of the problem. In the context of MEC, while some research has explored DRL for task offloading, the focus has been predominantly on holistic offloading, overlooking the advantages of partial offloading, such as reduced latency and improved quality of service (QoS). Collaborative efforts among MEC servers, especially within a multi-server framework, have been significantly useful in enhancing overall system performance.
In this study, we address the pressing demands and challenges in MEC environments by proposing a computational offloading and resource allocation method based on the soft actor-critic (SAC) algorithm. Our choice of SAC, although commonly used in other works, was carefully considered for its unique objective function and compatibility with Markov decision process (MDP) modeling in a multi-user, multi-server MEC environment. What sets our approach apart from other studies is the context in which SAC is applied and its dynamic interaction with the environment to make optimal offloading decisions. Our method advocates a collaborative computation offloading and resource allocation strategy across multiple MECs to improve both latency and energy efficiency. By employing SAC for task offloading, our system benefits from real-time learning, resulting in noticeable performance enhancements. The key contributions of our research include the following:
  • Acknowledging the frequent utilization of SAC in related literature, our implementation innovatively models interrelated task division problems within the MDP framework, incorporating the state space, action space, and reward function. We introduce a discrete SAC-based intelligent computing offloading algorithm that prioritizes stability and sample efficiency. Notably, this variant of the algorithm is proficient at exploration within a continuous action space, setting it apart from conventional applications.
  • Our research explores the complexities encountered when multiple SD users offload tasks to multiple MEC servers through a base station (BS). Given the increasing number and varied distribution of edge servers around SDs, we argue that task offloading transcends simple binary decisions. Our approach addresses both the offloading decision and the intricate selection of a collaborative MEC server.
  • We stress the importance of inter-MEC collaboration in enhancing task processing within distributed systems. Through the integration of various technologies, inter-MEC collaboration not only improves task processing efficiency but also enhances the user experience.
The remainder of this paper is organized as follows: Section 2 revisits prior research that underpins the aim of this study. Section 3 elucidates the core components of the system, along with a mathematical model tailored for offloading. Then, Section 4 highlights the imperatives of the DRL-based offloading scheme and delves into its associated challenges and goals, and Section 5 details the cost function, optimization challenges, architecture, and mechanics of the proposed DRL-based offloading scheme. Then, Section 6 presents an evaluation of the performance of the scheme through empirical analysis. Finally, Section 7 rounds off the conclusions of the study and outlines potential avenues for future work.

2. Related Work

Recent research in the field of MEC has aimed to reduce latency and energy consumption through computation offloading and resource allocation techniques. A heuristic offloading algorithm designed to efficiently manage computationally intensive tasks was introduced [14]. This algorithm can achieve high throughput and minimize latency when transferring tasks from an SD to an MEC server. However, despite its critical role in enhancing overall system performance, the decision-making process for offloading under the algorithm is overly focused on task priorities.
A collaborative method between fog and cloud computing to curtail service delays on IoT devices was explored [15]. This study focused on strategies for optimizing computing offloading, allocating computing resources, managing wireless bandwidth, and determining transmission power within a combined cloud/fog computing infrastructure. The overarching goal of these optimization strategies was to reduce both latency and energy consumption. Notably, the authors in both [16,17] employed sub-optimal methods, favoring minimal complexity, and they highlighted the significance of practical and efficient approaches.
The dynamics of energy link selection and transmission scheduling, particularly when processing applications that demanded optimal energy within a network linking SDs and MEC servers, were investigated [18]. Relying on an energy consumption model, the authors formulated an algorithm for energy-efficient link selection and transmission scheduling. An integrated algorithm that facilitated adaptive long-term evolution (LTE)/Wi-Fi link selection and data transmission scheduling was presented to enhance the energy efficiency of SDs in MCC systems [19]. Upon evaluation, the proposed algorithm outperformed its counterparts in terms of energy efficiency. Furthermore, it demonstrated proficiency in managing battery life, especially when considering the unpredictable nature of wireless channels. While these two studies prioritized energy efficiency and the proposed algorithms showed commendable performances, the studies did not address the adaptability required under varying network conditions.
The challenges of processing vast amounts of data and computational tasks using deep Q-network (DQN)-based edge intelligence within the MEC framework [20] were addressed. The authors of this study focused on the distribution of computational tasks and the allocation of resources between edge devices and cloud servers. Meanwhile, the authors of [21] addressed the performance degradation and energy imbalances in SDs with a deep reinforcement learning-based offloading scheduler (DRL-OS). Notably, as the number of wireless devices grows, the costs associated with DQN-based methodologies also increase.
Several studies have leveraged actor-critic-based offloading in MEC environments to optimize service quality by analyzing agent behaviors and policies [22,23]. The authors of [24] delved into the offloading challenges in multi-server and multi-user settings, whereas the authors of [25] integrated the proximal policy optimization (PPO) algorithm for task offloading decisions. Implementing the PPO in practical scenarios can be challenging because of its extensive sampling requirements. Wang et al. [26] conducted a study centered on task offloading decisions using the PPO algorithm, and Li et al. [27] addressed the offloading issues within a multi-MEC server and in multi-user contexts.
Furthermore, several investigations have focused on using the deep deterministic policy gradient (DDPG) algorithm to counteract the offloading issues inherent in the MEC domain. Notably, DDPG outperforms PPO in terms of continuous action space, data efficiency, and stability, making it pivotal for reinforcement learning endeavors in the MEC space and offering effective solutions to offloading challenges. However, within specific environments, the random-search nature of a network may pose hurdles in identifying the optimal policy. By contrast, SAC boasts greater stability than deterministic policies and exhibits excellent sampling efficiency. Modern research is now leveraging SAC to address computational offloading challenges. Liu et al. [28] enhanced data efficiency and stability using SAC, where multiple users collaboratively execute task offloading in an MEC setting. Similarly, Sun et al. [29] harnessed SAC within 6G mobile networks, achieving heightened data efficiency and reliability in MEC settings. The advantages and disadvantages of some existing approaches are listed in Table 1.
Regarding MCC servers, they possess significantly greater computing capacities than those of MEC servers and are well-equipped to manage peak user request demands. Therefore, task offloading can be effectively achieved through cooperation among MEC servers. Nonetheless, tasks with high computational complexity should be delegated to cloud servers. He et al. [30] explored a multi-layer task offloading framework within the MEC environment, facilitating collaboration between MCC and MEC and enabling task offloading to other SDs. Furthermore, Akhlaqi et al. [31] pointed out that the increasing use of cloud services by devices has highlighted congestion problems in centralized clouds. This situation has prompted the emergence of multi-access edge computing (MEC) to decentralize processing. Chen et al. [32] and Mustafa et al. [33] address offloading decisions in MEC systems. The former focuses on assessing the reliability of multimedia data from IoT devices using a game-based approach, while the latter introduces a reinforcement learning framework for making real-time computation task decisions in dynamic networks.

3. Problem Statement

The task offloading challenge can be depicted using a directed acyclic graph (DAG), which is denoted as ξ = (V, E), where the vertices symbolize individual tasks and the directed edges indicate task dependencies. Consequently, a subsequent task can only commence when the preceding task has been completed. It has been posited that every task can either be offloaded to an MEC server or processed locally on the SD of a user. When a task is offloaded to an MEC server, the following three distinct phases occur: sending, executing, and receiving. Conversely, for tasks executed locally, there is no data transmission between an SD and an MEC server.
The task set τ is represented as N = {1, 2, …, n}, and the set of MEC servers is denoted as M = {1, 2, …, m}. The offloading destination of each task is represented by $a_i$, which indicates whether a task is processed locally or on the specific MEC server to which it is offloaded. This modeling approach streamlines the intricacies of the offloading dilemma, furnishes a mathematical schema of the situation, and aids in determining the optimal resolution.
The devised framework was tailored to fine-tune offloading decisions while also considering the constraints of computing resources in a setting with multiple users and multiple MEC servers. Such decisions can expedite the processing time of tasks and reduce the energy expenditure of the system. The objective function J is elucidated in Equation (1):
$$ J = \beta_1 \frac{T_{local}}{\sum_{i=1}^{N} T_{ofl}^{i}} + \beta_2 \frac{E_{local}}{\sum_{i=1}^{N} E_{ofl}^{i}}, \tag{1} $$
where $T_{local}$ represents the total time required for all the tasks to be executed locally, $\sum_{i=1}^{N} T_{ofl}^{i}$ denotes the total time required for the offloaded tasks, $E_{local}$ is the energy consumed when tasks are processed locally, and $\sum_{i=1}^{N} E_{ofl}^{i}$ represents the total energy consumed during offloading. The objective of this function is to optimize both the time and energy efficiency through offloading. By adjusting the weights $\beta_1$ and $\beta_2$, the system can decide whether to prioritize time or energy efficiency. For instance, if the battery level is low, $\beta_2$ could be elevated to emphasize energy conservation. Ultimately, the overarching aim of this approach is to maximize the objective function J.
$$ \max_{A,F} J = \beta_1 \frac{T_{local}}{\sum_{i=1}^{N} T_{ofl}^{i}} + \beta_2 \frac{E_{local}}{\sum_{i=1}^{N} E_{ofl}^{i}} \tag{2} $$
$$ A = \{a_1, a_2, \ldots, a_N\} $$
$$ F = \{f_1, f_2, \ldots, f_N\} $$
$$ f_i = \begin{cases} f_n^{loc}, & \text{if } a_i = 0 \\ \sum_{m \in SelectedMECs} f_{i,m}^{mec}, & \text{if } a_i \neq 0 \end{cases} $$
$$ \text{s.t.} \quad C1{:}\; 0 \le a_i \le |M|, \qquad C2{:}\; T_{start}(\tau_j) \ge T_{end}(\tau_i) + delay_{ij}, \qquad C3{:}\; \sum_{i=1}^{N} E_{ofl}^{i} \le E_{available}, $$
where A denotes the offloading decision for each task and the vector F signifies the amount of computing resources designated for each task. Constraint $C1$ restricts each decision $a_i$ to either local processing ($a_i = 0$) or one of the $|M|$ MEC servers, and $C2$ enforces the task dependencies, ensuring that a successor task starts only after its predecessor has finished and any inter-task transfer delay has elapsed. The tasks can be partitioned and processed concurrently across multiple locations. Constraint $C3$ ensures that the cumulative energy consumed by offloading does not exceed the available energy budget $E_{available}$.
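As a concrete illustration, the following minimal Python sketch evaluates the objective J of Equation (1) for a given offloading plan. It assumes the ratio form reconstructed above, and all numeric values and variable names are hypothetical placeholders rather than quantities from the paper.

import numpy as np

def objective_J(T_local, E_local, T_ofl, E_ofl, beta1=0.5, beta2=0.5):
    # Weighted ratio of the all-local cost to the post-offloading cost (Equation (1)):
    # a larger J means the offloading plan saves more time and energy.
    return beta1 * (T_local / np.sum(T_ofl)) + beta2 * (E_local / np.sum(E_ofl))

# Hypothetical example: total local cost versus per-task costs after offloading.
T_local, E_local = 12.0, 30.0            # seconds and joules if every task runs locally
T_ofl = np.array([1.2, 0.8, 1.5, 0.9])   # per-task completion times after offloading (s)
E_ofl = np.array([2.0, 1.5, 2.5, 1.8])   # per-task energy after offloading (J)
print(objective_J(T_local, E_local, T_ofl, E_ofl, beta1=0.6, beta2=0.4))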

4. System Architecture

In this study, we introduced a task offloading strategy for a multi-server MEC in a multi-user setting with the primary goal of minimizing service delays and terminal energy consumption by directing each task to an appropriate MEC server under centralized control. The soft actor-critic task offloading scheme (SACTOS) is a DRL-based task offloading framework built on the MEC platform, as defined by the European Telecommunications Standards Institute (ETSI) [34].
The system model presented in this paper addresses scenarios with multiple MEC servers and multi-user applications, encompassing M MEC servers and N user tasks. The data for the tasks to be offloaded are relayed between the MEC server and the SDs via a wireless communication link. Each device can either process the task locally or offload it to an MEC server. All the SDs are situated in an area where wireless connectivity is available. Given the finite computing capacity of an MEC server, there is a constraint on the number of offloading requests that it can concurrently handle. The notation used in the mathematical representation of this system model is provided in Table 2 and Table 3.

4.1. MEC Architecture

The proposed system architecture, which is shown in Figure 1, incorporates both an SD and a graph parser (GP) at the user device level. User requests take the form of a graph that encapsulates the dependencies among the tasks. The GP evaluates the dependent tasks to determine the feasibility of offloading.
In the MEC layer, several servers act as computational resources. Computationally intensive tasks are transformed into a directed acyclic graph (DAG) and relayed to the MEC server for DRL processing. An algorithm that facilitates cooperation among MEC servers is employed to learn the optimal distribution of the DAG tasks.
Once the network parameters are learned, they are transmitted from the MEC server to an SD. The trained neural network of the device determines the offloading decision through forward propagation. Based on this decision, the task is either processed by MEC or executed locally. Such an architecture ensures the streamlined processing of user requests, graph-centric task distribution, and resource optimization via offloading. Continuously refining the algorithm through DRL will contribute to consistent enhancements in system performance. This approach is intended to efficiently utilize resources in a smart-device-centric MEC system.

4.2. System Model

Different SDs execute different tasks, each of which incurs a different delay. In this study, we considered two scenarios: the local execution of the current task and its offloading to the MEC server. Our strategy determines offloading decisions for tasks with dependencies, meaning that the completion time of the preceding tasks influences subsequent processing. This subsection provides the definitions for computation delay, transmission delay, and task completion time. First, the computation delay for local execution and for the MEC server is given in Equation (3):
$$ T_{i,n}^{loc} = \frac{\chi_i \,(d_i \times \eta_i)}{f_i^{loc}}, \qquad T_{i,m}^{mec} = \frac{(1-\chi_i)\,(d_i \times \eta_i)}{f_i^{mec}}, \tag{3} $$
where $T_{i,n}^{loc}$ denotes the local computation delay of task $\tau_i$; $T_{i,m}^{mec}$ signifies the computation delay of the task when processed by the MEC server; and $f_{i,m}^{mec}$ and $f_{i,n}^{loc}$ represent the CPU clock rates of the MEC server and the SD, respectively, with m and n as identifiers. For scenarios involving partial offloading, a portion of the task is processed locally, and the remainder is offloaded to the MEC server. The weight $\chi_i$ is introduced to quantify this distribution. This weight, which ranges from 0 to 1, indicates the fraction of data processed locally for task $\tau_i$. Conversely, $(1-\chi_i)$ represents the proportion processed by the MEC server. Moreover, $d_i$ denotes the data size of task $\tau_i$, $\eta_i$ refers to the number of clock cycles necessary to process each data bit, and $d_i \times \eta_i$ therefore gives the total number of clock cycles required for the entire task $\tau_i$. When addressing SDs, Shannon’s theory, which is delineated in Equation (4), needs to be employed before computing the data offloading delay. This equation is crucial for estimating the maximum channel transmission rate.
$$ R_i^{\varphi} = B \times \log_2\left(1 + \frac{\left[\varphi\,\rho_{tran} + (1-\varphi)\,\rho_{recv}\right] \times g_i^t}{\vartheta^2}\right), \tag{4} $$
where B denotes the radio channel bandwidth between the SD and the MEC server, signifying the bandwidth available for use during wireless communication; and the variable $\varphi$ represents the transmission probability of each sub-channel and is characterized as a discrete random variable with the possible values $\varphi \in \{0, 1\}$. When $\varphi = 1$ and $\varphi = 0$, $R_i^{\varphi}$ represents the data transmission and reception rates, respectively. This suggests that the data rate for a given sub-channel can fluctuate depending on whether data are transmitted or received. The term $g_i^t$ designates the wireless channel gain between the SD and the BS at the time slot t; $\vartheta^2$ represents the noise power; the variables $\rho_{tran}$ and $\rho_{recv}$ correspond to the power expended for data transmission and reception, respectively; $\rho_{tran}$ is the transmission power utilized to relay data from an SD to an MEC server; and $\rho_{recv}$ is the reception power employed by an SD to obtain data from an MEC server. The latency associated with data transmission is delineated according to Equation (5):
$$ T_i^{tran}(\varphi) = \frac{d_i}{R_i^{\varphi}}, \tag{5} $$
where $d_i$ denotes the data size of task $\tau_i$, $R_i^{\varphi}$ represents the data transmission rate during data transmission, and $\varphi$ is a variable accounting for various parameters or conditions that may influence the transmission speed, enabling the requisite time for data transmission to be calculated.
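The delay model of Equations (3)-(5) can be sketched in a few lines of Python. The parameter values below (bandwidth, channel gain, noise power, task size) are illustrative assumptions, not figures from the paper.

import math

def local_delay(chi, d, eta, f_loc):
    # Local share of the computation delay in Equation (3); chi is the fraction kept on the SD.
    return chi * (d * eta) / f_loc

def mec_delay(chi, d, eta, f_mec):
    # MEC share of the computation delay in Equation (3) for the offloaded fraction (1 - chi).
    return (1 - chi) * (d * eta) / f_mec

def channel_rate(B, phi, p_tran, p_recv, gain, noise_power):
    # Shannon rate of Equation (4); phi = 1 selects the transmit power, phi = 0 the receive power.
    power = phi * p_tran + (1 - phi) * p_recv
    return B * math.log2(1 + power * gain / noise_power)

def transmission_delay(d, rate):
    # Uplink/downlink delay of Equation (5): data size divided by the achievable rate.
    return d / rate

# Hypothetical task: 40 KB of data, 800 cycles per bit, half of it processed locally.
d_bits, eta, chi = 40e3 * 8, 800, 0.5
rate = channel_rate(B=20e6, phi=1, p_tran=2.5, p_recv=1.8, gain=1e-6, noise_power=1e-9)
print(local_delay(chi, d_bits, eta, 2e9), mec_delay(chi, d_bits, eta, 8e9), transmission_delay(d_bits, rate))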
Given the constraints of limited computing resources, we postulate that an SD cannot finalize a computing task within a designated timeframe, necessitating offloading in task processing. Under these circumstances, the agent scrutinizes the task and decides on its execution within the MEC server to minimize latency. Beyond merely determining the processing location for the task, the agent also assists in selecting the most suitable MEC server based on the requirements of the SD. The cumulative latency when offloading a task to an MEC server is significantly influenced by computational and uplink transmission latency [31]. The criteria for task processing are expressed in Equation (6):
$$ T_i^{complete} = \begin{cases} T_i^{loc}, & \text{if } a_k = 0 \\ T_i^{tran}, & \text{if } a_k \in \{1, 2, \ldots, |M|\}, \end{cases} \tag{6} $$
where $T_i^{complete}$ denotes the completion time of task $\tau_i$; $T_i^{loc}$ represents the processing time of the task when executed locally; $T_i^{tran}$ represents the processing duration of a task that has been offloaded to an MEC server; and $a_k$ denotes the offloading location of the task, where $a_k = 0$ if the task is processed locally and otherwise indicates the index of the MEC server to which the task is offloaded. Subsequently, the completion time of task $\tau_i$, i.e., $T_i^{complete}$, is the summation of the local processing time and the waiting period before initiation, as described by Equation (7):
$$ T_i^{complete} = T_i^{loc} + T_i^{loc\_wait} = T_i^{loc} + \max\left(T_i^{loc},\ \max_{(\tau_i, \tau_j) \in \varepsilon} T_j^{complete}\right), \tag{7} $$
where $T_i^{loc\_wait}$ represents the greater of two values: the local execution time $T_i^{loc}$ of task $\tau_i$ and the longest completion time among all the preceding tasks $\tau_j$. It is ensured that $\tau_i$ does not commence until all preceding tasks have been completed. The offloading of a task to the MEC server involves three primary stages. The first stage captures the time required to relay the task to the server; its completion time is the aggregate of the upload time and the waiting period for the upload, where the waiting time for the upload is contingent on the larger value between the maximum completion times of the preceding tasks and the local completion time of the current task. The second stage involves the time required to process a task on an MEC server, with the duration varying depending on the complexity of the task and the computational capacity of the MEC server. The third stage considers the time required to retrieve the processed task results from the MEC server. The completion time associated with the first (transmission) stage is expressed in Equation (8):
$$ T_i^{tran\_complete} = T_i^{tran} + T_i^{tran\_wait} = T_i^{tran} + \max\left(T_i^{loc},\ \max_{(\tau_i, \tau_j) \in \varepsilon} T_j^{complete}\right). \tag{8} $$
In the execution phase of an MEC server, two primary scenarios arise regarding the previous state of a task. The first scenario involves tasks for which the transfer and upload phases have already been completed, and the second scenario involves tasks executed on the same MEC server as the current task. Based on this, the completion time for the previous task is defined in Equation (9):
$$ T_i^{pre\_complete} = \begin{cases} T_{i,m}^{complete}, & \text{if } (\tau_j, \tau_i) \in \varepsilon \text{ and both tasks run on the same MEC server} \\ T_i^{tran}, & \text{if the task has completed the transmission phase.} \end{cases} \tag{9} $$
The total time to complete a task on an MEC server is the cumulative sum of the execution and waiting times. The waiting time encompasses both the quickest task execution time specific to an MEC server and the waiting duration for the completion of the preceding task. This waiting time is influenced by the completion times of other tasks on the same MEC server, and it is defined in Equation (10):
$$ T_{i,m}^{MEC} = T_{i,m}^{MEC\_exec} + T_{i,m}^{MEC\_wait} = T_{i,m}^{MEC\_exec} + \max\left(T_{i,m}^{MEC\_avail},\ T_{i,m}^{complete\_pre}\right). \tag{10} $$
Regarding the download and reception processes, the completion time comprises two parts: the completion time on the MEC server from the preceding step of the task and the quickest available download link. The download-completion time of a task that is processed on an MEC server is defined in Equation (11):
$$ T_{i,m}^{down} = T_{i,m}^{tran} + T_{i,m}^{down\_wait} = T_{i,m}^{tran} + \max\left(T_{i,m}^{down\_avail},\ T_{i,m}^{complete\_MEC}\right), \tag{11} $$
where $T_{i,m}^{down}$ denotes the completion time of the download and reception phase for task $\tau_i$, $T_{i,m}^{down\_avail}$ indicates the quickest available download link time, and $T_{i,m}^{complete\_MEC}$ represents the processing completion time on MEC server m.
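The phase-specific waiting terms in Equations (7)-(11) all reduce to the same pattern: a task waits for the latest of its predecessors before its own phase begins. The simplified Python sketch below propagates completion times through a DAG in topological order; it collapses the upload, execution, and download phases into a single per-task processing time, and the task data are hypothetical.

def completion_times(order, preds, proc_time):
    # order:     task ids in topological order
    # preds:     dict mapping each task to its predecessor ids (the DAG edges)
    # proc_time: dict mapping each task to its processing time at its assigned location
    # A task starts only after all of its predecessors have completed.
    complete = {}
    for t in order:
        wait = max((complete[p] for p in preds.get(t, [])), default=0.0)
        complete[t] = wait + proc_time[t]
    return complete

# Hypothetical four-task DAG: task 1 feeds tasks 2 and 3, which both feed task 4.
order = [1, 2, 3, 4]
preds = {2: [1], 3: [1], 4: [2, 3]}
proc = {1: 0.4, 2: 0.6, 3: 0.3, 4: 0.5}
print(completion_times(order, preds, proc))  # task 4 finishes last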
Energy consumption is a critical factor for SDs. The overall energy consumption for a task can be categorized into two components: computation and transmission. If a task operates locally, only computational energy costs come into play because no transmission is involved. Conversely, when a task is offloaded to an MEC server, an SD incurs transmission energy costs for both the upload and download. Hence, the cumulative energy consumption of an SD can be expressed as follows:
$$ E_{ofl}^{total} = \sum_{\tau_i \in V,\, a_i = 0} E_i^{loc} + \sum_{\tau_i \in V,\, a_i \in \{1, \ldots, m\}} E_i^{tran} = \sum_{\tau_i \in V,\, a_i = 0} K_u \times \left(f_n^{loc}\right)^2 \times T_i^{loc} + \sum_{\tau_i \in V,\, a_i \in \{1, \ldots, m\}} \left(P_{tran} \times T_i^{up} + P_{recv} \times T_i^{down}\right), \tag{12} $$
where $K_u$ represents the energy constant [35], $f_n^{loc}$ is the clock frequency of the local processor, $T_i^{loc}$ denotes the time required to process a task locally, $P_{tran}$ is the power utilized during transmission, $T_i^{up}$ indicates the upload time, $P_{recv}$ is the power consumed during reception, and $T_i^{down}$ is the download time.
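A compact Python sketch of the device-side energy model in Equation (12) is given below. The energy constant and task values are placeholders chosen only for illustration.

def total_energy(tasks, K_u=1e-27, f_loc=2e9, P_tran=2.5, P_recv=1.8):
    # Equation (12): dynamic CPU energy (K_u * f^2 * t) for locally executed tasks,
    # and radio energy (power * time) for the upload/download of offloaded tasks.
    # Each task is a dict with keys: a (0 = local, >0 = MEC id), t_loc, t_up, t_down.
    energy = 0.0
    for task in tasks:
        if task["a"] == 0:
            energy += K_u * f_loc ** 2 * task["t_loc"]
        else:
            energy += P_tran * task["t_up"] + P_recv * task["t_down"]
    return energy

tasks = [
    {"a": 0, "t_loc": 0.8, "t_up": 0.0, "t_down": 0.0},    # executed locally
    {"a": 2, "t_loc": 0.0, "t_up": 0.05, "t_down": 0.01},  # offloaded to MEC server 2
]
print(total_energy(tasks))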

5. Proposed Task Offloading Scheme

This section discusses the implementation methodology of the proposed SACTOS. First, the SAC algorithm, which manages a continuous action space, is introduced. Subsequently, an offloading model that focuses on the optimization problem is examined. Finally, the steps involved in implementing the algorithm are described.

5.1. SAC Algorithm for Continuous Actions

The SAC algorithm is a DRL algorithm that leverages the maximum entropy in a continuous action space. While most online algorithms estimate gradients using new samples at each iteration, this algorithm is different in that it reuses past experiences. Grounded in a maximum entropy framework, it aims to maximize action diversity while also maximizing the expected reward. Consequently, in terms of exploration capability, stability, and robustness in continuous action spaces, SAC tends to outperform deterministic policies [35].
Typically, SAC is tailored to continuous action spaces. This continuity facilitates nuanced action choices within search spaces, thereby paving the way for agile responses under various environmental conditions. The primary distinction between continuous and discrete actions under SAC is their respective outputs and representations. In the continuous action space of SAC, the policy $\pi_\Phi(a_t \mid s_t)$ is represented as a density function, whereas in a discrete action space, it is represented as a probability. Regarding entropy, it represents the uncertainty in a random variable [36]. The entropy value of policy $\pi_\Phi(a_t \mid s_t)$ is determined using Equation (13):
$$ H\left(\pi(\cdot \mid s_t)\right) = -\,\mathbb{E}\left[\log \pi(\cdot \mid s_t)\right], \tag{13} $$
where H is the entropy, $\mathbb{E}$ is the expected value, and $\log \pi(\cdot \mid s_t)$ represents the log probability of an action in state $s_t$; the negated expected value of this log probability is the entropy. Equation (14) determines the policy that maximizes the balance between the expected reward $R(s_t, a_t)$ and the policy entropy H [36]:
$$ \pi^{*} = \arg\max_{\pi}\ \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\sum_t R(s_t, a_t) + \alpha\, H\left(\pi(\cdot \mid s_t)\right)\right], \tag{14} $$
where $\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}$ denotes the expected value of the state-action pair $(s_t, a_t)$ as determined by the policy $\pi$, and the coefficient $\alpha$ modulates the balance between reward and entropy. This formula aims to maximize the expected return in each state while preserving the diversity or exploratory nature of the policies. Moreover, instead of fixing $\alpha$ as a static hyperparameter, it can be adaptively fine-tuned using a neural network by backpropagating the error in the entropy normalization coefficient [36]. The objective of the entropy normalization coefficient is given by Equation (15):
$$ J(\alpha) = \pi_t(s_t)^{T}\left[-\alpha\left(\log \pi_t(s_t) + \bar{H}\right)\right], \tag{15} $$
where $\pi_t(s_t)$ represents the action probability distribution for state $s_t$ at time t; $\log \pi_t(s_t)$ represents the log probability of the policy; and $\bar{H}$ represents the target entropy. The disparity between these two values influences the increase or decrease in entropy. The accumulated reward for entropy in state $s_t$ is characterized by a soft state value function. Owing to the continuous nature of the task set, estimating this value requires intricate computations. To address this issue, value estimation is performed using sampling techniques, such as the Monte Carlo technique; it is defined in Equation (16):
$$ V(s_t) := \pi_t(s_t)^{T}\left[Q(s_t) - \alpha \log \pi_t(s_t)\right]. \tag{16} $$
This equation integrates both the anticipated value of future rewards and the entropy of the policy through a soft-state value function. Given the continuous nature of tasks, it is possible to represent actions in intricate environments. This incorporation permits a richer inclusion of information and fosters the application of exploratory policies.
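For the discrete-action form used by SACTOS, the soft state value of Equation (16) and the temperature objective of Equation (15) can be written in a few lines. The Python sketch below assumes a softmax policy over one local action and three MEC servers; the Q-values, temperature, and target entropy are arbitrary illustrative numbers.

import numpy as np

def soft_state_value(q_values, probs, alpha):
    # Discrete-action soft value of Equation (16): V(s) = pi(s)^T [Q(s) - alpha * log pi(s)].
    return np.dot(probs, q_values - alpha * np.log(probs + 1e-8))

def temperature_loss(probs, alpha, target_entropy):
    # Entropy-coefficient objective of Equation (15); its gradient with respect to alpha
    # pushes the policy entropy toward the target entropy H_bar.
    return np.dot(probs, -alpha * (np.log(probs + 1e-8) + target_entropy))

# Hypothetical 4-way offloading decision (local execution plus three MEC servers).
q = np.array([1.0, 1.4, 0.9, 1.2])    # critic estimates per action
pi = np.exp(q) / np.exp(q).sum()      # softmax policy, for illustration only
print(soft_state_value(q, pi, alpha=0.2))
print(temperature_loss(pi, alpha=0.2, target_entropy=0.98 * np.log(4)))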
$$ Q_\theta(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim \rho}\left[\max_{a}\left(Q(s_{t+1}, a) - \alpha \log \pi(a \mid s_{t+1})\right)\right]. \tag{17} $$
At this juncture, incorporating the entropy normalization term reflects the uncertainty in action selection and promotes the use of explorative policies. The approach for maximizing the objective function using the soft policy gradient method is defined as follows (refer to Appendix A for details):
$$ J_\pi(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[-\alpha \log \pi(a_t \mid s_t) + Q_\theta(s_t, a_t)\right]. \tag{18} $$
This method employs a Q-function to estimate the value of an action and promotes exploration using an entropy term. Gradient-based optimization methods are utilized to maximize the objective function, as follows (refer to Appendix B for details):
$$ L_Q(\theta_i) = \mathbb{E}_{(s_t, a_t) \sim D}\left[\left(Q_{\theta_i}(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[\min\left(Q_{\theta_1}\left(s_{t+1}, \pi(s_{t+1})\right),\ Q_{\theta_2}\left(s_{t+1}, \pi(s_{t+1})\right)\right) - \alpha \log \pi(a \mid s_{t+1})\right]\right)\right)^2\right]. \tag{19} $$
This equation denotes the loss function of the Q function. By employing gradient descent to minimize this loss function, a gradient akin to the aforementioned estimated gradient can be obtained [36].
$$ J_\pi(\Phi) = \mathbb{E}_{s_t \sim D}\left[\pi_t(s_t)^{T}\left(\alpha \log \pi_\Phi(s_t) - Q(s_t)\right)\right]. \tag{20} $$
When SAC is applied to continuous action spaces, the reparameterization technique is utilized to minimize the policy loss J π Φ . This method was introduced to address the gradient instability arising from stochastic behavioral sampling in the continuous action space of SAC. The reparameterization technique stabilizes the gradient calculations by decoupling the generation of a stochastic action from an independent noise variable; owing to this, SAC demonstrates superior performance even in intricate environments.
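The following short Python sketch illustrates the reparameterization trick described above for a Gaussian policy with a tanh squashing function; the mean, log-standard-deviation, and random seed are arbitrary illustrative values.

import numpy as np

def reparameterized_action(mu, log_std, rng):
    # Sample noise independently of the policy parameters, then transform it, so that
    # gradients can flow through mu and log_std instead of through the sampling step.
    eps = rng.standard_normal(mu.shape)    # noise drawn independently of the policy
    pre_tanh = mu + np.exp(log_std) * eps  # differentiable transform of the noise
    return np.tanh(pre_tanh)               # squash the action into a bounded range

rng = np.random.default_rng(0)
print(reparameterized_action(np.array([0.1, -0.3]), np.array([-1.0, -0.5]), rng))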

5.2. Markov Decision Process

Markov prediction models are based on memory-less, discrete-time stochastic processes known as Markov chains [37]. The essential feature of Markov chains is their memory-less nature, meaning that the next state relies solely on the current state and not on the sequence of states that occurred before it. This aspect is especially relevant in the context of task offloading. Here, the choice to offload a specific task does not necessarily depend on the historical sequence of tasks but rather on the current system state and the task’s inherent characteristics.
Given these properties, we recognized that the task offloading challenge closely aligns with the dynamics portrayed by Markov chains. To systematically and efficiently tackle this challenge, we framed it as an MDP characterized by states (s), actions (a), and rewards (r). This approach empowered us to leverage the structured decision-making abilities of MDPs, enabling strategic offloading of decisions grounded in current states to maximize cumulative future rewards. The specific definitions of the state space, action space, and reward function for the MDP are elaborated below.
The decision to use MDP offers a structured framework for making sequential decisions amidst uncertainties, aligning seamlessly with the stochastic nature of the task offloading problem. It establishes a systematic approach to determining the optimal offloading strategy, considering not only the immediate reward but also the long-term cumulative reward.

5.2.1. State Space

A state space is defined as a combination of the task DAG and the associated offloading decisions. Regarding the parameters in a state space, $\Omega_{1:i}$ denotes the offloading choices from the initial task up to the current one; and $\zeta$, which comprises five vectors [P, Q, U, S, and T], encapsulates the encoding of a DAG. Here, P represents the characteristics of a task and its current state, containing information such as the type of task, estimated execution time, required resources, and estimated time to completion; Q is a metric that indicates the priority or importance of a task; the vector U reflects the current state of system resources, including information such as CPU utilization, memory status, input/output (I/O) latency, and network latency; S represents the current status of the offload queue and includes data such as the length of the queue, the average waiting time, and the number of recently offloaded jobs; and the vector T represents the information related to the task load of the system, including the average processing time, processing speed, and failure rate of the offloaded tasks. Therefore, the state space can be expressed as:
$$ S = \left\{\, s_i \mid s_i = \left(\zeta,\ \Omega_{1:i}\right) \,\right\}. $$
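As an illustration of how such a state could be assembled in code, the Python sketch below groups the DAG encoding $\zeta$ = [P, Q, U, S, T] and the offloading history $\Omega_{1:i}$ into one container. The field types and dimensions are assumptions, since the paper does not specify them.

from dataclasses import dataclass, field
from typing import List

@dataclass
class OffloadingState:
    # DAG encoding zeta = [P, Q, U, S, T], with field meanings taken from the text;
    # the exact vector lengths are not specified in the paper and are assumed here.
    P: List[float]   # task characteristics (type, estimated execution time, resources)
    Q: float         # task priority/importance
    U: List[float]   # system resource status (CPU, memory, I/O, network latency)
    S: List[float]   # offload queue status (length, average wait, recent jobs)
    T: List[float]   # workload statistics (avg. processing time, speed, failure rate)
    offload_history: List[int] = field(default_factory=list)  # Omega_{1:i}: decisions so far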

5.2.2. Action Space

Tasks can be executed locally on a device or offloaded to an MEC server, with each choice influenced by various factors. Local execution may offer faster response times but can be limited by resource constraints, rendering it unsuitable for complex operations. In contrast, offloading to an MEC server relieves the local device’s burden and leverages greater computing capabilities. However, this approach can come with potential drawbacks, such as network delays or data transmission costs.
The task’s location is determined by the variable $a_i$. If $a_i = 0$, the task is executed locally. Conversely, any value other than zero identifies an MEC server, indicating the task should be processed there. Different MEC servers may have distinct performance metrics and available resources. Consequently, the offloading decision must consider multiple factors, such as server status, task requirements, and network conditions. The action space, denoted as A = {0, 1, 2, …, m}, with m representing the number of accessible MEC servers and 0 indicating local execution, plays a crucial role. To ensure optimal performance, this action space requires dynamic management through the integration of diverse offloading strategies and optimization methods.
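A minimal Python sketch of how an action from A = {0, 1, ..., m} could be interpreted is given below; the callback functions and server names are hypothetical stand-ins for the actual execution and transmission logic.

def dispatch(task, a_i, mec_servers, run_local, run_on_mec):
    # a_i = 0 keeps the task on the SD; any other value k offloads it to MEC server k.
    if a_i == 0:
        return run_local(task)
    return run_on_mec(task, mec_servers[a_i - 1])

servers = ["mec-1", "mec-2", "mec-3"]
print(dispatch("task-7", 2, servers,
               run_local=lambda t: f"{t} runs locally",
               run_on_mec=lambda t, s: f"{t} offloaded to {s}"))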

5.2.3. Reward Function

Several strategies can improve QoS, with adjustments based on key variables. Notably, system latency and energy consumption hold significant importance. In our study, we introduced a more intuitive and transparent reward function. We refined the formula calculations to precisely capture real-time changes in both delay and energy consumption at each step.
In existing strategies, the reward function primarily focuses on system delay and energy consumption as key factors for optimizing QoS. However, this approach is both computationally demanding and fails to capture real-time changes in delay and energy consumption. To address this, we integrated increments that accurately represent variations in actual delay and energy consumption at every stage.
$$ \Delta T_{ofl} = T_{ofl}^{N} - T_{ofl}^{N-1}, $$
$$ \Delta E_{ofl} = E_{ofl}^{N} - E_{ofl}^{N-1}, $$
where $\Delta T_{ofl}$ denotes the variation in system delay between successive time intervals, and $\Delta E_{ofl}$ signifies the alteration in energy consumption over the same period. These incremental changes are pivotal in promptly discerning shifts in the performance and efficacy of a system and serve as central metrics in both the optimization and evaluation processes.
The reward function imposes penalties on incremental delays and increases in energy consumption at each step. By normalizing these values against the overall local delay and energy consumption, the function ensures a balance between penalizing inefficiencies and assessing system performance; it is defined as follows:
$$ R(s_i, a_i) = -\left(\beta_1 \times \frac{\Delta T_{ofl}}{T_{local}} + \beta_2 \times \frac{\Delta E_{ofl}}{E_{local}}\right), $$
where $\beta_1 \times \Delta T_{ofl} / T_{local}$ normalizes the increase in delay relative to the overall local delay, and $\beta_2 \times \Delta E_{ofl} / E_{local}$ normalizes the increment of energy consumption relative to the total local energy consumption. Both elements are structured to mirror the efficiency and responsiveness of the system.
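The per-step reward can be sketched directly from the definitions above. The Python sketch assumes the penalty (negative) form of the reward reconstructed here, and the numeric values are placeholders.

def step_reward(T_ofl_now, T_ofl_prev, E_ofl_now, E_ofl_prev,
                T_local, E_local, beta1=0.5, beta2=0.5):
    # Penalize the increments in offloading delay and energy, each normalized by
    # the corresponding all-local baseline, so inefficient decisions lower the reward.
    dT = T_ofl_now - T_ofl_prev
    dE = E_ofl_now - E_ofl_prev
    return -(beta1 * dT / T_local + beta2 * dE / E_local)

# Hypothetical values observed after scheduling one more task.
print(step_reward(T_ofl_now=3.2, T_ofl_prev=2.9, E_ofl_now=7.5, E_ofl_prev=7.0,
                  T_local=12.0, E_local=30.0))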

5.2.4. Design of SACTOS

The offloading process comprises three steps. In step 1, tasks within a DAG are topologically sorted and then arranged in descending order based on their rank values, resulting in the following sequence:
$$ \text{Score}(v_i) = \begin{cases} T_{total}^{i}, & \text{if } v_i \in exit \\ \max_{(v_i, v_j) \in \varepsilon} \text{Score}(v_j) + T_{total}^{i}, & \text{otherwise,} \end{cases} $$
where $T_{total}^{i}$ represents the total time required to complete processing within the defined set. In step 2, the tasks are transformed into a sequence of vectors that serve as inputs to the neural network. During this transformation, feedback, such as network and device status (e.g., battery level and CPU usage), is incorporated to enhance offloading efficiency. In step 3, the decision to offload a task is made by selecting the most likely offloading action based on probability. Then, MEC servers collaborate to complete selected tasks in accordance with this action. Moreover, potential overloads are distributed through cooperation between multiple MEC servers or SDs, thereby optimizing resource utilization. The comprehensive procedure for the proposed SACTOS algorithm is detailed in Algorithm 1.
Algorithm 1: SACTOS
Input: Episode, environment, batch size, replay buffer size
1   Initialize neural network parameters θ
2  Initialize Q-function with random weights θ Q
3  Initialize policy π with random weights θ π
4  Initialize replay buffer: R
5   for each episode do:
6          Topological sorting of the DAG:
7          Sort tasks in descending order based on their priority values (computed
8          based on T t o t a l )
9          Convert tasks to a neural network input:
10    for each task in DAG do:
11          Convert the task to a vector sequence
12          Incorporate feedback from the network and device state
13          (e.g., battery level, CPU usage)
14    end for
15      Offloading the decision:
16         while task not done do:
17          Select action a with the highest offloading probability from policy π given
18           the current state s using:
19          a ~ π(.|s; θ)
20          Execute action a and observe the next state s’ and reward r
21          Store (s, a, r, s’) in R
22          Sample a random mini-batch from R
23          Update the Q-function using the SAC loss
24          Update policy π using the SAC objective
25          Update the target networks
26      end while
27        Collaborate among the MEC servers selected by action a
28        to complete the task.
29        Distribute the load and optimize the resource usage among multiple MEC hosts or SDs.
30 end for
The SACTOS algorithm optimizes task offloading for the MEC server. After initializing the neural network and memory buffers, it prioritizes tasks and transforms them into neural network inputs. The algorithm continually selects and executes tasks based on the offloading probability, ensuring collaboration between the MEC server and SD to accomplish these tasks.
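To make step 1 of the scheme concrete, the Python sketch below computes the rank scores used to order the DAG tasks before the offloading decisions are made; the DAG, helper names, and processing times are hypothetical.

from functools import lru_cache

def rank_scores(succ, T_total):
    # Score(v_i) = T_total[i] for exit tasks; otherwise T_total[i] plus the largest
    # score among successors, computed by memoized recursion over the DAG.
    @lru_cache(maxsize=None)
    def score(v):
        if not succ.get(v):                       # exit task: no successors
            return T_total[v]
        return T_total[v] + max(score(u) for u in succ[v])
    return {v: score(v) for v in T_total}

# Hypothetical DAG: task 1 feeds tasks 2 and 3, which both feed task 4.
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
T_total = {1: 0.4, 2: 0.6, 3: 0.3, 4: 0.5}
scores = rank_scores(succ, T_total)
order = sorted(scores, key=scores.get, reverse=True)  # descending rank = scheduling order
print(order)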
In Figure 2, we present the detailed structure of the SAC scheme, a cutting-edge deep reinforcement learning algorithm known for its adaptability and efficiency. The SAC architecture uniquely combines an off-policy method with the advantages of actor-critic paradigms and a maximum entropy framework. At its core, the actor network generates actions based on the input state. Unlike deterministic strategies, SAC’s actor produces actions from a stochastic policy, facilitating a comprehensive exploration of potential actions. SAC stands out with its dual-critic system. Each critic network independently predicts the Q-value for a given state-action pair, mitigating overestimation biases—common in deep Q-learning algorithms. For policy optimization, SAC uses the smaller Q-value from its two critics, ensuring a cautious and robust policy that accounts for the variability inherent in single-critic evaluations. SAC’s distinctive feature is its entropy regularization, which encourages extensive exploration by introducing an entropy term to the reward. This nudges the actor towards more exploratory action choices. Lastly, the critics are updated using a ‘soft’ Bellman equation, which blends estimated and target Q-values, greatly enhancing learning stability. Overall, Figure 2 provides a comprehensive visual overview of SAC’s intricate architecture, showcasing its ability to derive optimal policies in complex environments by effectively balancing exploration and exploitation.

6. Performance Evaluation

This section discusses the experimental results of SACTOS and the methodology used to assess its performance. First, various parameters employed in the DAG generator are presented. Next, the simulation environment and hyperparameters are described. Then, the analysis of the average reward value to gauge the convergence attributes of SACTOS is presented. Finally, a comparison of SACTOS with four other benchmark algorithms, meant to highlight its efficacy, is presented.

6.1. Fundamental Approaches

This section describes the assessment of the performance of the proposed offloading scheme under various parameter settings to validate both its effectiveness and convergence. The following four offloading schemes were evaluated:
  • Local-only: All the computational tasks were performed exclusively on local devices.
  • Random: Tasks were offloaded based on a random selection [38].
  • PPO-based scheme (PPOS): This scheme applies the proximal policy optimization algorithm, an advancement of the policy gradient method; however, it still has constraints in terms of sampling efficiency [39].
  • Dueling double deep Q-network (D3QN)-based scheme (D3QNS): This method combines the strengths of both the dueling DQN and double DQN, further enhancing the learning algorithm and structure of the conventional DQN.

6.2. Simulation and Results

Although real-world application programs can be depicted using DAGs with diverse topologies, currently available datasets provide limited application information [34]. To implement the proposed offloading method, a graph theory-based approach is required, even for applications for which the topological information is not present in an actual dataset. By analyzing the interactions and data flows between applications, a graph theory-based approach can facilitate more efficient offloading decisions.
For the simulation experiment, we considered both the channel gain and transmission rate based on the distance between the SD and the MEC server. The baseline transmission rate was assumed to be 100 Mbps.
The CPU clock frequency of the local device, i.e., $f_i^{loc}$, was set to 2 GHz, while that of the MEC server, i.e., $f_i^{mec}$, was set to 8 GHz. The transmission and reception powers, $P_{tran}$ and $P_{recv}$, were set to 2.5 and 1.8 W, respectively. The task sizes ranged from 10 to 50 KB. Additionally, the number of clock cycles required for a single task ranged between $10^7$ and $10^8$ cycles. Each agent consisted of an actor and a critic. Both contained two hidden, fully connected layers, each with 256 neurons. The essential hyperparameters for SAC implementation are listed in Table 4.
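For reference, the simulation settings stated above can be collected into a single configuration object, as in the Python sketch below; only the dictionary layout and key names are ours, while the values are those given in the text.

sim_config = {
    "baseline_rate_bps": 100e6,       # baseline transmission rate: 100 Mbps
    "f_loc_hz": 2e9,                  # SD CPU clock frequency
    "f_mec_hz": 8e9,                  # MEC server CPU clock frequency
    "p_tran_w": 2.5,                  # transmission power
    "p_recv_w": 1.8,                  # reception power
    "task_size_bytes": (10e3, 50e3),  # task sizes: 10-50 KB
    "cycles_per_task": (1e7, 1e8),    # clock cycles required per task
    "hidden_layers": (256, 256),      # actor/critic fully connected layer widths
}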
The results depicted in Figure 3 show that the rapid convergence and impressive average reward of SAC are immediately noticeable. While the initial performances of all the algorithms appeared comparable, SAC began to gradually distinguish itself via superior performance enhancements over time. Both D3QN and PPO exhibited faster convergence rates than that of SAC in the initial stages, but SAC surpassed them in convergence speed midway through the experiments. This trend highlighted the more efficient learning strategy of SAC in complex offloading scenarios compared to the strategies of the other two methods. Regarding D3QN, its performance steadily improved, ultimately settling at a reward lower than that of SAC and PPO. On the other hand, PPO displayed a rapid improvement in the initial phase but exhibited a decelerated growth rate in the later stages, finally resulting in a performance lower than that of SAC. These findings underscored the effectiveness of SAC as a primary reinforcement learning algorithm in handling intricate offloading settings.
Figure 4 presents a comparison of the time and energy conservation scales (TECS) across the four offloading methods as the task count increased. The TECS quantifies the reduction in local task execution costs relative to the total post-offloading costs, effectively highlighting the trade-off between latency and energy consumption. For the experiment, the aspect ratio of the graph was fixed at 0.45, three MEC servers were considered, and a communication-to-computation ratio (CCR) of 0.5 was used. Compared to PPOS, D3QNS, and Random, SACTOS exhibited average improvements in TECS of 2.24%, 52.52%, and 74.25%, respectively. This indicated the superior optimization of SACTOS for offloading tasks. Although an increase in task count invariably led to an increase in delay and energy consumption, the gradual increases in post-offloading costs led to higher TECS values for SACTOS.
Figure 5 and Figure 6 provide an in-depth view of the service waiting time and energy consumption for each offloading strategy, influenced by the progressive increase in the number of tasks. As the workload increased, we observed a corresponding rise in service waiting time and energy consumption across all strategies. This consistent increase highlights the inherent challenges associated with managing higher workloads in offloading systems. Notably, the ‘local-only’ strategy exhibited a significant spike in both latency and energy consumption. Such a rapid increase can be interpreted as indicating inherent scalability limitations in this strategy, possibly due to the absence of an adaptive mechanism for efficiently managing and distributing growing tasks.
On the other end of the spectrum, DRL-based offloading strategies such as PPO, D3QN, and SACTOS demonstrated a relatively stable trend. Their consistent performance, even as the number of tasks increased, suggests their ability to effectively balance the dual objectives of conserving energy and reducing waiting time. Among these strategies, SACTOS emerged as a top performer. It significantly reduced standby time by an average range of 1.6–62.19% and demonstrated commendable energy efficiency, reducing consumption by 6.7–88.23%. These statistics highlight SACTOS’s robust learning mechanism, which continuously refines its offloading decisions in response to changes in task volumes.
It is important to note the growth pattern exhibited by the DRL-based strategies. Unlike the sudden increase observed with the ‘local-only’ approach, these strategies demonstrated a more gradual growth. This pattern suggests their capacity to adapt and allocate resources strategically, preventing spikes in demand from causing disproportionate rises in waiting times or energy expenses.
Essentially, while all strategies encountered challenges as the number of tasks increased, the DRL-based approaches, particularly SACTOS, demonstrated superior capabilities in handling these challenges. Their performance not only surpassed that of other strategies, but also highlighted the significance of adaptive learning in optimizing offloading decisions, thereby reducing both latency and energy consumption.
Figure 7 illustrates the variation in the TECS with respect to the CCR. The figure clearly shows that the optimization performance of SACTOS increased when the CCR was low, indicating the dominance of computational tasks. These results also highlighted the effectiveness of SACTOS in environments where the workload was more computationally intensive. Moreover, SACTOS outperformed the other methods, exhibiting a performance that was 6.36, 40.2, and 48.6% superior to that of PPO, D3QN, and random, respectively. Regarding PPO, D3QN, and random, they displayed inconsistent performance shifts with varying CCRs. This observation underscored the necessity for an offloading strategy adept at handling the surges in latency and energy expenditure associated with localized computing in a computation-centric setting. The exceptional efficiency of SACTOS is pivotal, especially in intricate scenarios where a proper balance between communication and computation is essential.

7. Conclusions and Future Work

This study investigated the scheduling of dependent task offloading, aiming to optimize both latency and energy consumption within multi-server and multiple-SD settings. To adeptly navigate the ever-changing MEC landscape, we incorporated a collaborative architecture among MECs, framing it within an MDP context. We articulated the dependent tasks using a DAG and honed the MEC system using a DRL. The proposed SACTOS, which operates under centralized control, significantly reduces service delays and conserves energy. The experimental outcomes confirmed that SACTOS not only converged swiftly and stably but also outperformed existing methodologies such as local-only, random, D3QN, and PPO.
Our future research points toward the integration of multi-agent reinforcement learning. Our goal is to increase task offloading efficiency by judiciously tapping into the shared resource reservoir inherent in the MEC setup, wherein each SD is perceived as an autonomous agent, and to explore the synergies of multi-agent collaboration to reduce costs and increase the QoS values of systems.

Author Contributions

Conceptualization, D.L.; Methodology, D.L. and I.J.; Software, D.L.; Validation, D.L.; Formal Analysis, D.L.; Investigation, D.L.; Resources, I.J.; Data Curation, D.L. and I.J.; Writing—Original Draft Preparation, D.L.; Writing—Review and Editing, D.L. and I.J.; Visualization, D.L.; Supervision, I.J.; Project Administration, I.J.; Funding Acquisition, I.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & communication Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00107, Development of the technology to automate the recommendation for big data analytic models that define data characteristics and problems).

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author, I.J. The data are not publicly available because they include information that could compromise the privacy of the study participants.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AR Augmented Reality
BS Base Station
CRRM Collaborative Radio Resource Management
CRSP Collaborative Signal Processing
DAG Directed Acyclic Graph
D3QN Dueling Double Deep-Q Network
DRL Deep Reinforcement Learning
IoT Internet of Things
MR Mixed Reality
PPO Proximal Policy Optimization
SAC Soft Actor-Critic
SD Smart Device
TECS Time and Energy Conservation Scale
MDP Markov Decision Process
MCC Mobile Cloud Computing
MEC Multi-Access Edge Computing
VR Virtual Reality

Appendix A

The objective in Equation (18) represents an objective function within the context of entropy-regularized reinforcement learning. Specifically, it combines the expected Q-value $Q_\theta(s_t, a_t)$ with an entropy term $-\alpha \log \pi(a_t \mid s_t)$, where $\alpha$ serves as a temperature coefficient. This combination aims to promote effective exploration of the state-action space while optimizing expected rewards. This objective is derived from the maximum entropy reinforcement learning framework, where the goal is to find a policy that maximizes both expected return and entropy. In traditional reinforcement learning (RL), the standard objective is to maximize the expected return, often represented by the Q-value: $\mathbb{E}_{(s_t, a_t) \sim \pi}\left[Q(s_t, a_t)\right]$. To encourage more exploration in the state-action space, an entropy-regularization term can be introduced. This entropy term, denoted as $H(\pi) = \mathbb{E}_{a \sim \pi}\left[-\log \pi(a \mid s)\right]$, measures the randomness of a policy distribution $\pi$, thus incentivizing a more exploratory policy. By combining these objectives, we derive the entropy-regularized objective $J_\pi(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[-\alpha \log \pi(a_t \mid s_t) + Q_\theta(s_t, a_t)\right]$. This composite objective ensures that the policy strikes a balance between efficient exploration (governed by the entropy term) and leveraging learned strategies (represented by the Q-value), with the coefficient $\alpha$ serving as the balancing factor.

Appendix B

The function $L_Q(\theta_i)$ represents the loss function of the Q-network parameterized by $\theta_i$; its main objective is to minimize the difference between the predicted and target Q-values.
  • Predicted Q-value
$Q_{\theta}(s_t, a_t)$
This term is the Q-value predicted for the current state-action pair $(s_t, a_t)$ using the current Q-network parameters $\theta$.
  • Target Q-value
$r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[\min\!\left(Q_{\theta_1}\!\left(s_{t+1}, \pi(s_{t+1})\right),\, Q_{\theta_2}\!\left(s_{t+1}, \pi(s_{t+1})\right)\right)\right]$
This term captures the expected cumulative reward. It consists of the immediate reward $r(s_t, a_t)$ observed after taking action $a_t$ in state $s_t$, plus the discounted future reward, i.e., the expected value of the minimum of the two Q-functions $Q_{\theta_1}$ and $Q_{\theta_2}$ for the next state $s_{t+1}$ and the action chosen by policy $\pi$.
  • Entropy Regularization
$\alpha \log \pi(a \mid s_{t+1})$
This term encourages exploration. Here, $\pi(a \mid s_{t+1})$ is the probability of executing action $a$ in state $s_{t+1}$ under policy $\pi$, and $\alpha$ is the temperature parameter that balances the importance of the entropy term against the reward.
Combining all these terms, we get:
$L_Q(\theta_i) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\!\left[\left(Q_{\theta_i}(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\left[\min\!\left(Q_{\theta_1}\!\left(s_{t+1}, \pi(s_{t+1})\right),\, Q_{\theta_2}\!\left(s_{t+1}, \pi(s_{t+1})\right)\right) - \alpha \log \pi\!\left(a \mid s_{t+1}\right)\right]\right)\right)^{2}\right]$
This loss minimizes the mean squared difference between the predicted and target Q-values using data sampled from the replay buffer $\mathcal{D}$.
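For reference, the following is a minimal PyTorch-style sketch of this twin-Q (clipped double-Q) critic loss. The network handles `q1`, `q2`, `target_q1`, `target_q2`, and `policy.sample` are hypothetical interfaces assumed for illustration, not the paper's code, and episode-termination masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Sketch of the critic loss L_Q(theta_i): squared error between the predicted
# Q-values and the soft Bellman target built from the minimum of the two target
# critics minus the entropy term. All network objects are assumed callables.
def critic_loss(q1, q2, target_q1, target_q2, policy,
                states, actions, rewards, next_states,
                gamma=0.99, alpha=0.2):
    with torch.no_grad():
        # a_{t+1} ~ pi(.|s_{t+1}) and its log-probability.
        next_actions, next_log_probs = policy.sample(next_states)
        # min(Q_theta1, Q_theta2) mitigates Q-value overestimation.
        min_q_next = torch.min(target_q1(next_states, next_actions),
                               target_q2(next_states, next_actions))
        # Soft Bellman target: r + gamma * (min Q - alpha * log pi).
        target = rewards + gamma * (min_q_next - alpha * next_log_probs)
    # Both critics regress toward the same target (mean squared error).
    return F.mse_loss(q1(states, actions), target) + \
           F.mse_loss(q2(states, actions), target)
```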

References

  1. Hao, W.; Zeng, M.; Sun, G.; Xiao, P. Edge cache-assisted secure low-latency millimeter-wave transmission. IEEE Internet Things J. 2019, 7, 1815–1825. [Google Scholar] [CrossRef]
  2. Nguyen, Q.-H.; Dressler, F. A smartphone perspective on computation offloading—A survey. Comput. Commun. 2020, 159, 133–154. [Google Scholar] [CrossRef]
  3. Min, M.; Xiao, L.; Chen, Y.; Cheng, P.; Wu, D.; Zhuang, W. Learning-based computation offloading for IoT devices with energy harvesting. IEEE Trans. Veh. Technol. 2019, 68, 1930–1941. [Google Scholar] [CrossRef]
  4. Merenda, M.; Porcaro, C.; Iero, D. Edge machine learning for ai-enabled iot devices: A review. Sensors 2020, 20, 2533. [Google Scholar] [CrossRef] [PubMed]
  5. Hamdan, S.; Ayyash, M.; Almajali, S. Edge-computing architectures for internet of things applications: A survey. Sensors 2020, 20, 6441. [Google Scholar] [CrossRef]
  6. Zheng, J.; Cai, Y.; Wu, Y.; Shen, X. Dynamic computation offloading for mobile cloud computing: A stochastic game-theoretic approach. IEEE Trans. Mob. Comput. 2018, 18, 771–786. [Google Scholar] [CrossRef]
  7. Kekki, S.; Featherstone, W.; Fang, Y.; Kuure, P.; Li, A.; Ranjan, A.; Scarpina, S. MEC in 5G networks. ETSI White Pap. 2018, 28, 1–28. [Google Scholar]
  8. Porambage, P.; Okwuibe, J.; Liyanage, M.; Ylianttila, M.; Taleb, T. Survey on multi-access edge computing for internet of things realization. IEEE Commun. Surv. Tutor. 2018, 20, 2961–2991. [Google Scholar] [CrossRef]
  9. Peng, M.; Zhang, K. Recent advances in fog radio access networks: Performance analysis and radio resource allocation. IEEE Access 2016, 4, 5003–5009. [Google Scholar] [CrossRef]
  10. Zhao, Z.; Bu, S.; Zhao, T.; Yin, Z.; Peng, M.; Ding, Z.; Quek, T.Q. On the design of computation offloading in fog radio access networks. IEEE Trans. Veh. Technol. 2019, 68, 7136–7149. [Google Scholar] [CrossRef]
  11. Samanta, A.; Chang, Z. Adaptive service offloading for revenue maximization in mobile edge computing with delay-constraint. IEEE Internet Things J. 2019, 6, 3864–3872. [Google Scholar] [CrossRef]
  12. Wang, B.; Song, Y.; Cao, J.; Cui, X.; Zhang, L. Improving Task Scheduling with Parallelism Awareness in Heterogeneous Computational Environments. Future Gener. Comput. Syst. 2019, 94, 419–429. [Google Scholar] [CrossRef]
  13. Mitsis, G.; Tsiropoulou, E.E.; Papavassiliou, S. Price and risk awareness for data offloading decision-making in edge computing systems. IEEE Syst. J. 2022, 16, 6546–6557. [Google Scholar] [CrossRef]
  14. Xiang, X.; Lin, C.; Chen, X. Energy-efficient link selection and transmission scheduling in mobile cloud computing. IEEE Wirel. Commun. Lett. 2014, 3, 153–156. [Google Scholar] [CrossRef]
  15. Zhang, W.; Wen, Y.; Guan, K.; Kilper, D.; Luo, H.; Wu, D.O. Energy-optimal mobile cloud computing under stochastic wireless channel. IEEE Trans. Wirel. Commun. 2013, 12, 4569–4581. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Niyato, D.; Wang, P. Offloading in mobile cloudlet systems with intermittent connectivity. IEEE Trans. Mob. Comput. 2015, 14, 2516–2529. [Google Scholar] [CrossRef]
  17. Guo, F.; Zhang, H.; Ji, H.; Li, X.; Leung, V.C.M. An efficient computation offloading management scheme in the densely deployed small cell networks with mobile edge computing. IEEE ACM Trans. Netw. 2018, 26, 2651–2664. [Google Scholar] [CrossRef]
  18. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Levine, S. Soft actor-critic algorithms and applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  19. Lim, D.; Lee, W.; Kim, W.-T.; Joe, I. DRL-OS: A Deep Reinforcement Learning-Based Offloading Scheduler in Mobile Edge Computing. Sensors 2022, 22, 9212. [Google Scholar] [CrossRef]
  20. Sartoretti, G.; Paivine, W.; Shi, Y.; Wu, Y.; Choset, H. Distributed learning of decentralized control policies for articulated mobile robots. IEEE Trans. Robot. 2019, 35, 1109–1122. [Google Scholar] [CrossRef]
  21. Wang, J.; Hu, J.; Min, G.; Zhan, W.; Ni, Q.; Georgalas, N. Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning. IEEE Commun. Mag. 2019, 57, 64–69. [Google Scholar] [CrossRef]
  22. Wang, Z.; Li, M.; Zhao, L.; Zhou, H.; Wang, N. A3C-based Computation Offloading and Service Caching in Cloud-Edge Computing Networks. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Virtual, 2–5 May 2022; pp. 1–2. [Google Scholar]
  23. Li, S.; Hu, S.; Du, Y. Deep Reinforcement Learning and Game Theory for Computation Offloading in Dynamic Edge Computing Markets. IEEE Access 2021, 9, 121456–121466. [Google Scholar] [CrossRef]
  24. Sun, Y.; He, Q. Computational offloading for MEC networks with energy harvesting: A hierarchical multi-agent reinforcement learning approach. Electronics 2023, 12, 1304. [Google Scholar] [CrossRef]
  25. Yong, D.; Liu, R.; Jia, X.; Gu, Y. Joint Optimization of Multi-User Partial Offloading Strategy and Resource Allocation Strategy in D2D-Enabled MEC. Sensors 2023, 23, 2565. [Google Scholar] [CrossRef]
  26. Liu, K.-H.; Hsu, Y.-H.; Lin, W.-N.; Liao, W. Fine-Grained Offloading for Multi-Access Edge Computing with Actor-Critic Federated Learning. In Proceedings of the 2021 IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 3 March–1 April 2021; pp. 1–6. [Google Scholar]
  27. Sun, C.; Wu, X.; Li, X.; Fan, Q.; Wen, J.; Leung, V.C.M. Cooperative Computation Offloading for Multi-Access Edge Computing in 6G Mobile Networks via Soft Actor Critic. IEEE Trans. Netw. Sci. Eng. 2021. [Google Scholar] [CrossRef]
  28. He, W.; Gao, L.; Luo, J. A Multi-Layer Offloading Framework for Dependency-Aware Tasks in MEC. In Proceedings of the ICC 2021-IEEE International Conference on Communications, Montreal, QC, Canada, 14–18 June 2021; pp. 1–6. [Google Scholar]
  29. Akhlaqi, M.Y.; Hanapi, Z.B.M. Task offloading paradigm in mobile edge computing-current issues, adopted approaches, and future directions. J. Netw. Comput. Appl. 2023, 212, 103568. [Google Scholar] [CrossRef]
  30. Chen, Y.; Zhao, J.; Zhou, X.; Qi, L.; Xu, X.; Huang, J. A distributed game theoretical approach for credibility-guaranteed multimedia data offloading in MEC. Inf. Sci. 2023, 644, 119306. [Google Scholar] [CrossRef]
  31. Mustafa, E.; Shuja, J.; Bilal, K.; Mustafa, S.; Maqsood, T.; Rehman, F.; Khan, A.U.R. Reinforcement learning for intelligent online computation offloading in wireless powered edge networks. Clust. Comput. 2023, 26, 1053–1062. [Google Scholar] [CrossRef]
  32. Nath, S.; Wu, J.; Yang, J. Delay and energy efficiency tradeoff for information pushing system. IEEE Trans. Green Commun. Netw. 2018, 2, 1027–1040. [Google Scholar] [CrossRef]
  33. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1352–1361. [Google Scholar]
  34. Sabella, D.; Sukhomlinov, V.; Trang, L.; Kekki, S.; Paglierani, P.; Rossbach, R.; Li, X.; Fang, Y.; Druta, D.; Giust, F.; et al. Developing software for multi-access edge computing. ETSI White Pap. 2019, 20, 1–38. [Google Scholar]
  35. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  36. Prianto, E.; Kim, M.; Park, J.H.; Bae, J.H.; Kim, J.S. Path planning for multi-arm manipulators using deep reinforcement learning: Soft actor–critic with hindsight experience replay. Sensors 2020, 20, 5911. [Google Scholar] [CrossRef] [PubMed]
  37. Venieris, S.I.; Panopoulos, I.; Venieris, I.S. OODIn: An optimised on-device inference framework for heterogeneous mobile devices. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP), Irvine, CA, USA, 23–27 August 2021; Volume 2021, pp. 1–8. [Google Scholar]
  38. Wang, Y.; Friderikos, V. A survey of deep learning for data caching in edge network. Informatics 2020, 7, 43. [Google Scholar] [CrossRef]
  39. Zou, J.; Hao, T.; Yu, C.; Jin, H. A3c-do: A regional resource scheduling framework based on deep reinforcement learning in edge scenario. IEEE Trans. Comput. 2021, 70, 228–239. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed system.
Figure 2. Structure of the soft actor-critic (SAC) scheme.
Figure 3. Convergence analysis. D3QN: Dueling double DQN.
Figure 4. Impact of the number of tasks on the time and energy conservation scale (TECS).
Figure 5. Impact of the number of tasks on the latency.
Figure 6. Impact of the number of tasks on energy consumption.
Figure 7. Impact of the communication to computation ratio (CCR) on the TECS.
Table 1. Comparison of existing approaches.

Categorization | Advantage | Disadvantage
Actor critic-based algorithm [18,19] | More stable and robust in dynamic environments | The optimization problem needs to be re-solved when the environment changes
DQN-based algorithm [20,21] | Suitable for dynamic environments | Expensive when the number of wireless devices grows exponentially
PPO [25,26,27] | Good overall performance; allows discrete and continuous control | Low sample efficiency; unsuitable for actual application scenarios
DDPG [28] | High efficiency | Insufficient discoverability, stability, and robustness
Soft actor critic-based algorithm | Robust in continuous-action spaces; better exploration capabilities due to entropy regularization | Might be more computationally intensive due to twin Q and actor networks
DQN: deep Q-network; PPO: proximal policy optimization; DDPG: deep deterministic policy gradient.
Table 2. Mathematical notations of the system model.

Notations | Definition
$\rho_{tran},\ \rho_{recv}$ | Transmitted and received power
$\varphi$ | Transmission probability of a sub-channel
$\xi$ | DAG
$\omega$ | Parameter that determines the transmission or receiving rate
$\pi(\cdot \mid \cdot)$ | Offloading policy
$\eta_i$ | Number of clock cycles required to process each bit of data
$H$ | Entropy
$B$ | Wireless channel bandwidth
$\beta_1,\ \beta_2$ | Weighting coefficients of the energy consumption and latency ratio
$\theta_1,\ \theta_2$ | Parameters of SACTOS
$\mathcal{D}$ | Replay buffer
DAG: Directed acyclic graph; SACTOS: soft actor-critic task offloading scheme.
Table 3. Mathematical notations of task offloading.

Notations | Definition
$M$ | Collection of MEC servers
$a_i$ | Offloading action of task $\tau_i$
$T_{i,n}^{loc},\ T_{i,m}^{mec}$ | Local and MEC server computation latency of task $\tau_i$
$f_i^{loc}$ | CPU clock speed of the SD where task $\tau_i$ is located
$f_i^{mec}$ | CPU clock speed of MEC server $m$
$R_i^{\varphi}$ | Transmission or receiving rate of task $\tau_i$
$d_i$ | Data size of task $\tau_i$
$T_i^{complete}$ | Time taken to complete task $\tau_i$
$T_i^{tran\_complete}$ | Time taken to upload and download task $\tau_i$
$E_i^{loc}$ | Energy consumed by the local computing model for task $\tau_i$
$E_i^{mec}$ | Energy consumed in computational offloading for task $\tau_i$
$E_{ofl}^{total}$ | Total energy consumed in computational offloading
$\beta_1,\ \beta_2$ | Weighting coefficients
CPU: Central processing unit; SD: smart device.
Table 4. Hyperparameters for algorithm implementation.

Parameters | Value
Replay memory size | 5000
Minibatch size | 128
Discount factor | 0.99
Optimizer | Adam
Learning rate | 0.001
Initial value of the temperature parameter | 0.2
Learning rate of the temperature parameter | 0.001
Update interval of the target network | 1000
Epsilon decay rate | 0.99
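As a usage illustration, the Table 4 settings might be gathered into a single training configuration as in the following sketch; the key names are illustrative assumptions rather than identifiers from the SACTOS code.

```python
# Hedged sketch: Table 4 hyperparameters collected into a training config.
# Key names are illustrative; they are not taken from the paper's implementation.
sactos_config = {
    "replay_memory_size": 5000,          # transitions kept in replay buffer D
    "minibatch_size": 128,
    "discount_factor": 0.99,             # gamma
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "initial_temperature": 0.2,          # alpha, the entropy temperature
    "temperature_learning_rate": 1e-3,   # used when alpha is tuned automatically
    "target_update_interval": 1000,      # steps between target-network updates
    "epsilon_decay_rate": 0.99,
}
```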
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
