A DRL-Based Task Offloading Scheme for Server Decision-Making in Multi-Access Edge Computing

Multi-access edge computing (MEC), based on hierarchical cloud computing, offers abundant resources to support next-generation Internet of Things networks. However, several critical challenges, including offloading methods, network dynamics, resource diversity, and server decision-making, remain open. Regarding offloading, most conventional approaches have neglected or oversimplified multi-MEC server scenarios, fixating on single-MEC instances. This myopic focus fails to adapt computational offloading during MEC server overload, rendering such methods sub-optimal for real-world MEC deployments. To address this deficiency, we propose a solution that employs a deep reinforcement learning-based soft actor-critic (SAC) approach to perform computation offloading and facilitate MEC server decision-making in multi-user, multi-MEC server environments. Numerical experiments were conducted to evaluate the performance of our proposed solution. The results demonstrate that our approach significantly reduces latency, enhances energy efficiency, and achieves rapid and stable convergence, thereby highlighting the algorithm's superior performance over existing methods.


Introduction
In recent years, advances in wireless technology combined with the widespread adoption of the Internet of Things (IoT) have paved the way for innovative computation-intensive applications, including augmented reality (AR), mixed reality (MR), virtual reality (VR), online gaming, intelligent transportation, and industrial and home automation. Consequently, the demand for these applications has surged [1][2][3]. By 2020, the number of IoT devices was projected to skyrocket to 24 billion. This tremendous increase signifies that many smart devices (SDs) and sensors have been responsible for generating and processing an immense volume of data [4].
To cater to these computationally intensive applications, substantial computing resources and high performance are required. Addressing the escalating need for energy efficiency and managing the swift influx of user requests have emerged as significant challenges [5]. Initially, mobile cloud computing (MCC) was considered a viable solution for processing computationally intensive tasks. However, as the demand for real-time processing increased, the limitations of MCC became apparent [6], resulting in the introduction of mobile edge computing (MEC) as a potential solution to meet this burgeoning demand [7].
Multi-access edge computing [8] is effective at deploying computing resources close to SDs and at supporting collaborative radio resource management (CRRM) and collaborative signal processing (CRSP). Conversely, a cloud radio access network (C-RAN) employs centralized signal processing and resource allocation, efficiently catering to user requirements [9]. Collectively, the attributes of these technologies have the potential to fulfill diverse requirements in upcoming artificial intelligence (AI)-based wireless networks [10].
Leveraging MEC to offload tasks is a promising approach to curtail network latency and conserve energy. Specifically, MEC addresses the computational offloading requirements of IoT devices by processing tasks closer to the edge rather than relying solely on a central cloud [11]. However, since the task offloading problem is recognized as a nondeterministic polynomial-time hard (NP-hard) problem [12], addressing it is challenging. Although most research in this area has leaned toward heuristic or convex optimization algorithms, the increasing complexity of MEC coupled with varying radio channel conditions makes it difficult to consistently guarantee optimal performance using these conventional methods. Given that optimization problems often require frequent resolution, meticulous planning is imperative for designing and managing future MEC networks.
In recent years, deep reinforcement learning (DRL), a subset of AI, has gained significant attention owing to its ability to tackle complex challenges across various sectors. As IoT networks become more distributed, the need for decentralized decision-making to enhance throughput and reduce power consumption increases, with DRL serving as a key tool. The emergence of the multi-access edge computing (MEC) paradigm has added complexity to multi-user, multi-server environments, bringing data-offloading decision-making to the forefront [13]. This MEC landscape necessitates addressing both user behavioral aspects and server pricing policies. A recent study combined prospect theory and the tragedy of the commons to model user satisfaction and potential server overexploitation, highlighting the intricate nature of the problem. In the context of MEC, while some research has explored DRL for task offloading, the focus has been predominantly on holistic offloading, overlooking the advantages of partial offloading, such as reduced latency and improved quality of service (QoS). Collaborative efforts among MEC servers, especially within a multi-server framework, have proven significantly useful in enhancing overall system performance.
In this study, we address the pressing demands and challenges in MEC environments by proposing a computational offloading and resource allocation method based on the soft actor-critic (SAC) algorithm. Our choice of SAC, although commonly used in other works, was carefully considered for its unique objective function and compatibility with Markov decision process (MDP) modeling in a multi-user, multi-server MEC environment. What sets our approach apart from other studies is the context in which SAC is applied and its dynamic interaction with the environment to make optimal offloading decisions. Our method advocates a collaborative computation offloading and resource allocation strategy across multiple MECs to improve both latency and energy efficiency. By employing SAC for task offloading, our system benefits from real-time learning, resulting in noticeable performance enhancements. The key contributions of our research include the following:

•
Acknowledging the frequent utilization of SAC in the related literature, our implementation innovatively models interrelated task division problems within the MDP framework, incorporating the state space, action space, and reward function. We introduce a discrete SAC-based intelligent computing offloading algorithm that prioritizes stability and sample efficiency. Notably, this variant of the algorithm is proficient at exploration within a continuous action space, setting it apart from conventional applications.

•
Our research explores the complexities encountered when multiple SD users offload tasks to multiple MEC servers through a base station (BS). Given the increasing number and varied distribution of edge servers around SDs, we argue that task offloading transcends simple binary decisions. Our approach addresses both the offloading decision and the intricate selection of a collaborative MEC server.

•
We stress the importance of inter-MEC collaboration in enhancing task processing within distributed systems. Through the integration of various technologies, inter-MEC collaboration not only improves task processing efficiency but also enhances the user experience.
The remainder of this paper is organized as follows: Section 2 revisits prior research that underpins the aim of this study. Section 3 elucidates the core components of the system, along with a mathematical model tailored for offloading. Then, Section 4 highlights the imperatives of the DRL-based offloading scheme and delves into its associated challenges and goals, and Section 5 details the cost function, optimization challenges, architecture, and mechanics of the proposed DRL-based offloading scheme. Next, Section 6 presents an evaluation of the performance of the scheme through empirical analysis. Finally, Section 7 rounds off the conclusions of the study and outlines potential avenues for future work.

Related Work
Recent research in the field of MEC has aimed to reduce latency and energy consumption through computation offloading and resource allocation techniques. A heuristic offloading algorithm designed to efficiently manage computationally intensive tasks was introduced [14]. This algorithm can achieve high throughput and minimize latency when transferring tasks from an SD to an MEC server. However, despite its critical role in enhancing overall system performance, the decision-making process for offloading under the algorithm is overly focused on task priorities.
A collaborative method between fog and cloud computing to curtail service delays on IoT devices was explored [15]. This study focused on strategies for optimizing computing offloading, allocating computing resources, managing wireless bandwidth, and determining transmission power within a combined cloud/fog computing infrastructure. The overarching goal of these optimization strategies was to reduce both latency and energy consumption. Notably, the authors in both [16,17] employed sub-optimal methods, favoring minimal complexity, and they highlighted the significance of practical and efficient approaches.
The dynamics of energy link selection and transmission scheduling, particularly when processing applications that demanded optimal energy within a network linking SDs and MEC servers, were investigated [18]. Relying on an energy consumption model, the authors formulated an algorithm for energy-efficient link selection and transmission scheduling. An integrated algorithm that facilitated adaptive long-term evolution (LTE)/Wi-Fi link selection and data transmission scheduling was presented to enhance the energy efficiency of SDs in MCC systems [19]. Upon evaluation, the proposed algorithm outperformed its counterparts in terms of energy efficiency. Furthermore, it demonstrated proficiency in managing battery life, especially when considering the unpredictable nature of wireless channels. While these two studies prioritized energy efficiency and the proposed algorithms showed commendable performances, the studies did not address the adaptability required under varying network conditions.
The challenges of processing vast amounts of data and computational tasks using deep Q-network (DQN)-based edge intelligence within the MEC framework [20] were addressed. The authors of this study focused on the distribution of computational tasks and the allocation of resources between edge devices and cloud servers. Meanwhile, the authors of [21] addressed the performance degradation and energy imbalances in SDs with a deep reinforcement learning-based offloading scheduler (DRL-OS). Notably, as the number of wireless devices skyrockets, the expenses associated with DQN-based methodologies also increase.
Several studies have leveraged actor-critic-based offloading in MEC environments to optimize service quality by analyzing agent behaviors and policies [22,23]. The authors of [24] delved into the offloading challenges in multi-server and multi-user settings, whereas the authors of [25] integrated the proximal policy optimization (PPO) algorithm for task offloading decisions. Implementing PPO in practical scenarios can be challenging because of its extensive sampling requirements. Wang et al. [26] conducted a study centered on task offloading decisions using the PPO algorithm, and Li et al. [27] addressed the offloading issues within a multi-MEC server and multi-user context.
Furthermore, several investigations have focused on using the deep deterministic policy gradient (DDPG) algorithm to counteract the offloading issues inherent in the MEC domain. Notably, DDPG outperforms PPO in terms of continuous action space handling, data efficiency, and stability, making it pivotal for reinforcement learning endeavors in the MEC space and offering effective solutions to offloading challenges. However, within specific environments, the random-search nature of a network may pose hurdles in identifying the optimal policy. By contrast, SAC boasts greater stability than deterministic policies and exhibits excellent sampling efficiency. Modern research is now leveraging SAC to address computational offloading challenges. Liu et al. [28] enhanced data efficiency and stability using SAC, where multiple users collaboratively execute task offloading in an MEC setting. Similarly, Sun et al. [29] harnessed SAC within 6G mobile networks, achieving heightened data efficiency and reliability in MEC settings. The advantages and disadvantages of some existing approaches are listed in Table 1. Regarding MCC servers, they possess significantly greater computing capacities than MEC servers and are well-equipped to manage peak user request demands. Therefore, task offloading can be effectively achieved through cooperation among MEC servers; nonetheless, tasks with high computational complexity should be delegated to cloud servers. He et al. [30] explored a multi-layer task offloading framework within the MEC environment, facilitating collaboration between MCC and MEC and enabling task offloading to other SDs. Furthermore, Akhlaqi, M.Y. [31] pointed out that the increasing use of cloud services by devices has highlighted congestion problems in centralized clouds, a situation that prompted the emergence of multi-access edge computing (MEC) to decentralize processing. Chen, Y. et al. [32] and Mustafa, E. et al. [33] addressed offloading decisions in MEC systems: the former focuses on assessing the reliability of multimedia data from IoT devices using a game-based approach, while the latter introduces a reinforcement learning framework for making real-time computation task decisions in dynamic networks.

Problem Statement
The task offloading challenge can be depicted using a directed acyclic graph (DAG), which is denoted as ξ = (V, E), where the vertices symbolize individual tasks and the directed edges indicate task dependencies. Consequently, a subsequent task can only commence when the preceding task has been completed. It is posited that every task can either be offloaded to an MEC server or processed locally on the SD of a user. When a task is offloaded to an MEC server, the following three distinct phases occur: sending, executing, and receiving. Conversely, for tasks executed locally, there is no data transmission between an SD and an MEC server.
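As a toy illustration of the DAG model ξ = (V, E), the sketch below builds a small dependency graph and derives an execution order in which every task follows its predecessors; the task identifiers and edges are illustrative, not taken from the paper.

```python
# Minimal sketch of the DAG task model xi = (V, E): vertices are tasks,
# directed edges are dependencies. Task ids and edges are illustrative.
from collections import defaultdict, deque

def topological_order(tasks, edges):
    """Return an execution order where every task follows its predecessors."""
    indeg = {t: 0 for t in tasks}
    succ = defaultdict(list)
    for u, v in edges:          # edge (u, v): task v depends on task u
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle: not a valid DAG")
    return order

tasks = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (2, 4), (3, 4)]  # task 4 needs 2 and 3; both need 1
print(topological_order(tasks, edges))
```

Any order produced this way respects the constraint that a subsequent task starts only after its preceding tasks complete.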
The task set τ is represented as N = {1, 2, . . ., n}, and the set of MEC servers is denoted as M = {1, 2, . . ., m}. The offloading destination of each task is represented by a_i, which indicates whether a task is processed locally or on the specific MEC server to which it is offloaded. This modeling approach streamlines the intricacies of the offloading dilemma, furnishes a mathematical schema of the situation, and aids in determining the optimal resolution.
The devised framework was tailored to fine-tune offloading decisions while also considering the constraints of computing resources in a setting with multiple users and multiple MEC servers. Such decisions can expedite the processing time of tasks and reduce the energy expenditure of the system. The objective function J is elucidated in Equation (1):

J = β_1 (T_all^loc − T^ofl) / T_all^loc + β_2 (E_all^loc − E^ofl) / E_all^loc, (1)

where T_all^loc represents the total time required for all the tasks to be executed locally, T^ofl denotes the total time required for the offloaded tasks, E_all^loc is the energy consumed when tasks are processed locally, and E^ofl represents the total energy consumed during offloading. The objective of this function is to optimize both time and energy efficiency through offloading. By adjusting the weights β_1 and β_2, the system can decide whether to prioritize time or energy efficiency. For instance, if the battery level is low, β_2 could be elevated to emphasize energy conservation. Ultimately, the overarching aim of this approach is to maximize the objective function J:

max_{A, F} J, A = {a_1, a_2, . . ., a_N},

where A denotes the offloading decision for each task, the vector F signifies the amount of computing resources designated for each task, and constraints C1 and C2 determine whether a task is processed locally or offloaded, respectively, to an MEC server. The tasks can be partitioned and processed concurrently across multiple locations. Constraint C3 ensures that the cumulative computing resources allocated to τ_i on the m-th MEC server do not exceed the total available computing resources of that particular server.
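A hedged sketch of the objective in Equation (1): J rewards the relative time and energy saved by offloading, traded off by the weights β_1 and β_2. The normalized-savings form below is an assumption consistent with the surrounding text, and all numeric values are illustrative.

```python
# Hedged sketch of Equation (1): weighted relative time and energy savings.
# The normalization by the all-local baselines is an assumption.
def objective_J(T_all_loc, T_ofl, E_all_loc, E_ofl, beta1=0.5, beta2=0.5):
    time_gain = (T_all_loc - T_ofl) / T_all_loc      # relative latency saved
    energy_gain = (E_all_loc - E_ofl) / E_all_loc    # relative energy saved
    return beta1 * time_gain + beta2 * energy_gain

# Offloading halves latency and cuts energy by 40%:
J = objective_J(T_all_loc=10.0, T_ofl=5.0, E_all_loc=2.0, E_ofl=1.2)
print(J)  # 0.5*0.5 + 0.5*0.4 = 0.45
```

Raising β_2 (e.g., when the battery is low) shifts the maximizer toward energy-saving offloading decisions, as described in the text.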

System Architecture
In this study, we introduced a task offloading strategy for a multi-server MEC in a multi-user setting with the primary goal of minimizing service delays and terminal energy consumption by directing each task to an appropriate MEC server under centralized control. The soft actor-critic task offloading scheme (SACTOS) is a DRL-based task offloading framework built on the MEC platform, as defined by the European Telecommunications Standards Institute (ETSI) [34].
The system model presented in this paper addresses scenarios with multiple MEC servers and multi-user applications, encompassing M MEC servers and N user tasks. The data for the tasks to be offloaded are relayed between the MEC server and the SDs via a wireless communication link. Each device can either process the task locally or offload it to an MEC server. All the SDs are situated in an area where wireless connectivity is available. Given the finite computing capacity of an MEC server, there is a constraint on the number of offloading requests that it can concurrently handle. The notations forming the mathematical representation of this system model are provided in Table 2 and summarized in Table 3.

M — Collection of MEC servers
a_i — Offloading action of task τ_i
T_i^loc, T_i^mec — Local and MEC server computation latency of task τ_i
f_i^loc — CPU clock speed of the SD where task τ_i is located
R_i — Transmission or receiving rate
d_i — Data size of task τ_i


MEC Architecture
The proposed system architecture, which is shown in Figure 1, incorporates both an SD and a graph parser (GP) at the user device level. User requests take the form of a graph that encapsulates the dependencies among the tasks. The GP evaluates the dependent tasks to determine the feasibility of offloading. In the MEC layer, several servers act as computational resources. Computationally intensive tasks are transformed into a directed acyclic graph (DAG) and relayed to the MEC server for DRL processing. An algorithm that facilitates cooperation among MEC servers is employed to learn the optimal distribution of the DAG tasks.
Once the network parameters are learned, they are transmitted from the MEC server to an SD. The trained neural network of the device determines the offloading decision through forward propagation. Based on this decision, the task is either processed by the MEC or executed locally. Such an architecture ensures the streamlined processing of user requests, graph-centric task distribution, and resource optimization via offloading. Continuously refining the algorithm through DRL will contribute to consistent enhancements in system performance. This approach is intended to efficiently utilize resources in a smart-device-centric MEC system.

System Model
Different SDs execute different tasks, each of which results in various delays. In this study, we considered two scenarios: the local execution of the current task and its offloading to the MEC server. Our strategy determines offloading decisions for tasks with dependencies, meaning that the completion time of the preceding tasks influences subsequent processing. This sub-section provides the definitions for computation delay, transmission delay, and task completion time. First, the computational delay for both local and MEC execution is given in Equation (3):

T_{i,n}^loc = χ_i η_i / f_{i,n}^loc, T_{i,m}^mec = (1 − χ_i) η_i / f_{i,m}^mec, (3)

where T_{i,n}^loc denotes the local computation delay of task τ_i; T_{i,m}^mec signifies the computation delay of the task when processed by the MEC server; and f_{i,m}^mec and f_{i,n}^loc represent the CPU clock rates of the MEC server and the SD, respectively, with m and n as identifiers. For scenarios involving partial offloading, a portion of the task is processed locally, and the remainder is offloaded to the MEC server. The weight χ_i is introduced to quantify this distribution. This weight, which ranges from 0 to 1, indicates the fraction of data processed locally for task τ_i; conversely, (1 − χ_i) represents the proportion processed by the MEC server. Moreover, each task is characterized by the number of clock cycles necessary to process each data bit, and η_i denotes the total number of clock cycles required for the entire task τ_i. When addressing SDs, Shannon's theory, which is delineated in Equation (4), needs to be employed before computing the data offloading delay. This equation is crucial for estimating the maximum channel transmission rate.
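A hedged sketch of the partial-offloading delay in Equation (3): a fraction χ_i of the η_i total CPU cycles of task τ_i runs locally, the rest on the MEC server. The linear cycle split is an assumption consistent with the text, and the clock rates and cycle counts are illustrative.

```python
# Sketch of Equation (3): partial offloading splits task tau_i's eta_i cycles
# between the SD (f_loc cycles/s) and the MEC server (f_mec cycles/s).
def computation_delays(eta_i, chi, f_loc, f_mec):
    t_loc = chi * eta_i / f_loc            # local computation delay
    t_mec = (1.0 - chi) * eta_i / f_mec    # MEC-side computation delay
    return t_loc, t_mec

# 2e9 cycles, 25% processed locally at 1 GHz, 75% on a 5 GHz server:
t_loc, t_mec = computation_delays(eta_i=2e9, chi=0.25, f_loc=1e9, f_mec=5e9)
print(t_loc, t_mec)  # 0.5 s locally, 0.3 s on the MEC server
```

With χ_i = 1 the whole task runs locally; with χ_i = 0 it is fully offloaded, matching the two extreme cases described above.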
R_i(ϕ) = B log_2(1 + (ϕ ρ^tran + (1 − ϕ) ρ^recv) g_i^t / ϑ²), (4)

where B denotes the radio channel bandwidth between the SD and the MEC server, signifying the bandwidth available for use during wireless communication, and the variable ϕ indicates the transmission state of each sub-channel, characterized as a discrete random variable with the possible values ϕ ∈ {0, 1}. When ϕ = 1 and ϕ = 0, R_i(ϕ) represents the data transmission and reception rates, respectively. This suggests that the data rate for a given sub-channel can fluctuate depending on whether data are transmitted or received. The term g_i^t designates the wireless channel gain between the SD and the BS at time slot t; ϑ² represents the noise power; and the variables ρ^tran and ρ^recv correspond to the power expended for data transmission and reception, respectively: ρ^tran is the transmission power utilized to relay data from an SD to an MEC server, and ρ^recv is the reception power employed by an SD to obtain data from an MEC server. The latency associated with data transmission is delineated in Equation (5):

T_i^tran = d_i / R_i(ϕ), (5)

where d_i denotes the data size of task τ_i, R_i(ϕ) represents the data transmission rate during data transmission, and ϕ accounts for the conditions that influence the transmission speed, enabling the requisite time for data transmission to be calculated. Given the constraints of limited computing resources, we postulate that an SD cannot finalize a computing task within a designated timeframe, necessitating offloading in task processing. Under these circumstances, the agent scrutinizes the task and decides on its execution within the MEC server to minimize latency. Beyond merely determining the processing location for the task, the agent also assists in selecting the most suitable MEC server based on the requirements of the SD. The cumulative latency when offloading a task to an MEC server is significantly influenced by computational and uplink transmission latency [31]. The criteria for task processing are
expressed in Equation (6):

T_i = T_i^loc, if a_k = 0; T_i = T_i^tran, if a_k ∈ M, (6)

where T_i denotes the completion time of task τ_i; T_i^loc represents the processing time of the task when executed locally; T_i^tran represents the processing duration of a task that has been offloaded to an MEC server; and a_k denotes the offloading location of the task, where a_k = 0 if the task is processed locally and otherwise indicates the index of the MEC server to which the task is offloaded. Subsequently, the completion time of task τ_i, i.e., T_i^complete, is the summation of the local processing time and the waiting period before initiation, as described by Equation (7):

T_i^complete = T_i^loc + T_i^{loc,wait}, T_i^{loc,wait} = max(T_i^loc, max_{τ_j ∈ pred(τ_i)} T_j^complete), (7)

where T_i^{loc,wait} represents the greater of two values: the local execution time T_i^loc of task τ_i and the longest completion time among all the preceding tasks τ_j. This ensures that τ_i does not commence until all preceding tasks have been completed. The offloading of a task to the MEC server involves three primary stages. The first stage captures the time required to relay the task to the server; its completion time is the aggregate of the upload time and the waiting period for the upload, where the waiting time for the upload is the larger of the maximum completion time of the preceding tasks and the local completion time of the current task. The second stage involves the time required to process the task on an MEC server, with the duration varying depending on the complexity of the task and the computational capacity of the MEC server. The third stage considers the time required to retrieve the processed task results from the MEC server. The completion time of the upload stage is expressed in Equation (8):

T_i^{up,complete} = T_i^{up} + max(max_{τ_j ∈ pred(τ_i)} T_j^complete, T_i^loc), (8)

In the execution phase of an MEC server, two primary scenarios arise regarding the previous state of a task. The first scenario involves tasks for which the transfer and upload phases have already been completed, and the second scenario involves tasks executed on the same MEC server as the current task. Based on this, the completion time for
the previous task is defined in Equation (9):

T_j^pre = T_j^{mec,complete}, if τ_j and τ_i are executed on the same MEC server; T_j^pre = T_i^tran, if the task has completed the transmission phase. (9)
The total time to complete a task on an MEC server is the cumulative sum of the execution and waiting times. The waiting time encompasses both the quickest task-execution slot available on the MEC server and the waiting duration for the completion of the preceding task. This waiting time is influenced by the completion times of other tasks on the same MEC server, and the total is defined in Equation (10):

T_i^{mec,complete} = T_i^{mec} + T_i^{mec,wait}, (10)

Regarding the download and reception processes, the completion time comprises two parts: the completion time on the MEC server from the preceding step of the task and the quickest available download link. The download-completion time of a task that is processed on an MEC server is defined in Equation (11):

T_i^{down,complete} = T_i^{down} + max(T_i^{down,link}, T_{i,m}^{mec,complete}), (11)

where T_i^{down,complete} denotes the completion time of the download and reception phase for task τ_i, T_i^{down,link} indicates the quickest available download link time, and T_{i,m}^{mec,complete} represents the processing completion time on MEC server m.
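The completion-time recursion described above can be sketched as follows. This is a simplified toy version under stated assumptions: it keeps only the predecessor-waiting and upload terms, and ignores per-server queueing and the download phase; all task values are illustrative.

```python
# Hedged sketch of the completion-time recursion: a task starts only after all
# of its predecessors finish, and offloaded tasks additionally pay upload time.
# Server queueing and the download phase are simplified away in this toy model.
def completion_times(exec_time, upload_time, offloaded, preds):
    done = {}
    for i in sorted(exec_time):  # assumes ids are already topologically sorted
        ready = max((done[j] for j in preds.get(i, [])), default=0.0)
        start = ready + (upload_time[i] if offloaded[i] else 0.0)
        done[i] = start + exec_time[i]
    return done

exec_time = {1: 1.0, 2: 0.5, 3: 2.0}      # seconds per task
upload_time = {1: 0.0, 2: 0.2, 3: 0.1}
offloaded = {1: False, 2: True, 3: True}
preds = {2: [1], 3: [1, 2]}               # task 3 waits for tasks 1 and 2
print(completion_times(exec_time, upload_time, offloaded, preds))
```

Task 3 cannot start before the slower of its two predecessors, mirroring the max terms in Equations (7) and (8).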
Energy consumption is a critical factor for SDs. The overall energy consumption for a task can be categorized into two components: computation and transmission. If a task operates locally, only computational energy costs come into play because no transmission is involved. Conversely, when a task is offloaded to an MEC server, an SD incurs transmission energy costs for both the upload and download. Hence, the cumulative energy consumption of an SD can be expressed as follows:

E_i = K_u (f_n^loc)³ T_i^loc + P^tran T_i^up + P^recv T_i^down, (12)

where K_u represents the energy constant [35], f_n^loc is the clock frequency of the local processor, T_i^loc denotes the time required to process a task locally, P^tran is the power utilized during transmission, T_i^up indicates the upload time, P^recv is the power consumed during reception, and T_i^down is the download time.
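A hedged sketch of the energy model in Equation (12): local computation energy plus upload and download radio energy. The cubic frequency term (K_u · f³ · T) is a common CMOS energy model and an assumption here; all parameter values are illustrative.

```python
# Sketch of Equation (12): SD energy = local computation energy + radio energy.
# The cubic frequency dependence is a standard CMOS model, assumed here.
def device_energy(K_u, f_loc, t_loc, P_tran, t_up, P_recv, t_down):
    e_comp = K_u * f_loc**3 * t_loc   # local computation energy
    e_tx = P_tran * t_up              # upload (transmission) energy
    e_rx = P_recv * t_down            # download (reception) energy
    return e_comp + e_tx + e_rx

E = device_energy(K_u=1e-27, f_loc=1e9, t_loc=0.5,
                  P_tran=0.5, t_up=0.2, P_recv=0.1, t_down=0.05)
print(E)  # 0.5 + 0.1 + 0.005 = 0.605 J
```

For a fully local task, t_up = t_down = 0 and only the computation term remains, matching the case distinction in the text.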

Proposed Task Offloading Scheme
This section discusses the implementation methodology of the proposed SACTOS. First, the SAC algorithm, which manages a continuous action space, is introduced. Subsequently, an offloading model that focuses on the optimization problem is examined. Finally, the steps involved in implementing the algorithm are described.

SAC Algorithm for Continuous Actions
The SAC algorithm is a DRL algorithm that leverages the maximum entropy in a continuous action space. While most online algorithms estimate gradients using new samples at each iteration, this algorithm is different in that it reuses past experiences. Grounded in a maximum entropy framework, it aims to maximize action diversity while also maximizing the expected reward. Consequently, in terms of discoverability, stability, and robustness in continuous action spaces, SAC tends to outperform deterministic policies [35].
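The experience reuse mentioned above is typically realized with a replay buffer. The minimal sketch below stores transitions and serves uniform mini-batches for gradient updates; the capacity, batch size, and transition contents are illustrative.

```python
import random
from collections import deque

# Sketch of the off-policy experience replay that lets SAC reuse past samples:
# the buffer stores (s, a, r, s_next, done) tuples and serves uniform batches.
class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(10):
    buf.push(s=t, a=0, r=1.0, s_next=t + 1, done=False)
batch = buf.sample(4)
print(len(batch))  # 4
```

Because updates draw from this buffer rather than only the newest rollout, SAC attains the sample efficiency contrasted here with purely online gradient estimation.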
Typically, SAC is tailored to continuous action spaces. This continuity facilitates nuanced action choices within search spaces, thereby paving the way for agile responses under various environmental conditions. The primary distinction between continuous and discrete actions under SAC lies in their respective outputs and representations. In the continuous action space of SAC, the policy π_Φ(a_t|s_t) is represented as a density function, whereas in a discrete action space, it is represented as a probability. Entropy represents the uncertainty in a random variable [36]. The entropy value of policy π_Φ(a_t|s_t) is determined using Equation (13):

H(π(·|s_t)) = E[−log π(·|s_t)], (13)

where H is the entropy, E is the expected value, and −log π(·|s_t) represents the negative log probability of an action in state s_t; the expected value of this negative log probability is the entropy. Equation (14) determines the policy that maximizes the balance between the expected reward R(s_t, a_t) and the policy entropy H [36]:

π* = argmax_π Σ_t E_{(s_t, a_t)∼ρ_π}[R(s_t, a_t) + α H(π(·|s_t))], (14)

where E_{(s_t, a_t)∼ρ_π} denotes the expected value of the state-action pair (s_t, a_t) as determined by the policy π, and the coefficient α modulates the balance between reward and entropy. This formula aims to maximize the expected return in each state while preserving the diversity, or exploratory nature, of the policies. Moreover, instead of fixing α as a static hyperparameter, it can be adaptively fine-tuned using a neural network by backpropagating the error in the entropy normalization coefficient [36]. The objective of the entropy normalization coefficient is given by Equation (15):

J(α) = E_{a_t∼π_t}[−α(log π_t(a_t|s_t) + H̄)], (15)

where π_t(a_t|s_t) represents the probability of action a_t in state s_t at time t, and H̄ represents the target entropy. The disparity between the current policy entropy and the target entropy influences the increase or decrease in α. The accumulated reward for entropy in state s_t is characterized by a soft state value function. Owing to the continuous nature of the
task set, estimating this value requires intricate computations. To address this issue, value estimation is performed using sampling techniques, such as the Monte Carlo method; the soft state value function is defined in Equation (16):

V(s_t) = E_{a_t∼π}[Q(s_t, a_t) − α log π(a_t|s_t)], (16)

This equation integrates both the anticipated value of future rewards and the entropy of the policy through a soft state value function. Given the continuous nature of tasks, it is possible to represent actions in intricate environments. This incorporation permits a richer inclusion of information and fosters the application of exploratory policies.
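The entropy in Equation (13) can be computed directly for a discrete policy, as in the sketch below; the two action distributions are illustrative, not drawn from the paper.

```python
import math

# Sketch of Equation (13): entropy of a discrete policy pi(.|s_t) is the
# expected negative log-probability of its actions.
def policy_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory over 4 actions
greedy = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic
print(policy_entropy(uniform))  # log(4), the maximum for 4 actions
print(policy_entropy(greedy))   # close to 0
```

A uniform policy attains the maximum entropy log(4), while a near-deterministic policy has entropy close to zero, which is exactly the exploration pressure the α-weighted entropy term exerts in Equation (14).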
At this juncture, incorporating the entropy normalization term reflects the uncertainty in action selection and promotes the use of explorative policies. The soft policy gradient method optimizes the following objective (refer to Appendix A for details):

$J_\pi(\Phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[\mathbb{E}_{a_t \sim \pi_\Phi}\left[\alpha \log \pi_\Phi(a_t \mid s_t) - Q_\theta(s_t, a_t)\right]\right].$

This method employs a Q-function to estimate the value of an action and promotes exploration through the entropy term. Gradient-based optimization is likewise used to train the Q-function (refer to Appendix B for details):

$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\right)^{2}\right].$

This equation denotes the loss function of the Q-function. By employing gradient descent to minimize this loss, a gradient akin to the aforementioned estimated gradient can be obtained [36].
When SAC is applied to continuous action spaces, the reparameterization technique is utilized to minimize the policy loss $J_\pi(\Phi)$. This technique was introduced to address the gradient instability arising from stochastic action sampling in the continuous action space of SAC. It stabilizes gradient calculations by decoupling the generation of a stochastic action from an independent noise variable; owing to this, SAC demonstrates superior performance even in intricate environments.
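The reparameterization trick can be sketched in one dimension: the action becomes a deterministic function of the policy parameters (mean and log standard deviation) and an externally drawn noise sample, so gradients can flow through the parameters while the randomness stays isolated in the noise. The function name and the tanh squashing convention below are illustrative:

```python
import math
import random

def reparameterized_action(mu, log_std, eps):
    """a = tanh(mu + sigma * eps): the action is a deterministic, differentiable
    function of the policy parameters; randomness lives only in eps ~ N(0, 1)."""
    sigma = math.exp(log_std)
    return math.tanh(mu + sigma * eps)

random.seed(0)
eps = random.gauss(0.0, 1.0)                  # noise drawn independently of the policy
a1 = reparameterized_action(0.5, -1.0, eps)
a2 = reparameterized_action(0.5, -1.0, eps)   # same eps -> identical action
```

Because the same noise sample always yields the same action for fixed parameters, the gradient of the action with respect to mu and sigma is well defined, which is precisely what stabilizes the policy update.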

Markov Decision Process
Markov prediction models are based on memoryless, discrete-time stochastic processes known as Markov chains [37]. The essential feature of Markov chains is their memoryless nature: the next state relies solely on the current state and not on the sequence of states that preceded it. This aspect is especially relevant in the context of task offloading, where the choice to offload a specific task does not depend on the historical sequence of tasks but rather on the current system state and the task's inherent characteristics.
Given these properties, we recognized that the task offloading challenge closely aligns with the dynamics portrayed by Markov chains. To tackle this challenge systematically and efficiently, we framed it as an MDP characterized by states (s), actions (a), and rewards (r). This approach empowered us to leverage the structured decision-making abilities of MDPs, enabling strategic offloading decisions grounded in current states to maximize cumulative future rewards. The specific definitions of the state space, action space, and reward function of the MDP are elaborated below.
The decision to use an MDP offers a structured framework for making sequential decisions amidst uncertainty, aligning seamlessly with the stochastic nature of the task offloading problem. It establishes a systematic approach to determining the optimal offloading strategy, considering not only the immediate reward but also the long-term cumulative reward.
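The memoryless dynamics can be illustrated with a toy two-state offloading chain. The states, actions, and transition probabilities below are invented for illustration only; the point is that the next state is sampled from P(s' | s, a) alone, with no history consulted:

```python
import random

# Hypothetical two-state offloading MDP: the server is 'idle' or 'overloaded'.
TRANSITIONS = {
    ("idle", "offload"):       [("idle", 0.9), ("overloaded", 0.1)],
    ("idle", "local"):         [("idle", 0.7), ("overloaded", 0.3)],
    ("overloaded", "offload"): [("idle", 0.6), ("overloaded", 0.4)],
    ("overloaded", "local"):   [("idle", 0.2), ("overloaded", 0.8)],
}

def step(state, action, rng):
    """Sample the next state from P(s' | s, a); no earlier states are used."""
    outcomes = TRANSITIONS[(state, action)]
    r = rng.random()
    acc = 0.0
    for next_state, p in outcomes:
        acc += p
        if r < acc:
            return next_state
    return outcomes[-1][0]
```

The `step` signature takes only the current state and action, which is the Markov property in code form.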

State Space
A state is defined as a combination of the task DAG encoding and the associated offloading decisions. Regarding the parameters of the state space, $\Omega_{1:i}$ denotes the offloading choices from the initial task up to the current one, and $\zeta$, which comprises the five vectors $[P, Q, U, S, T]$, encapsulates the encoding of the DAG. Here, $P$ represents the characteristics of a task and its current state, containing information such as the task type, estimated execution time, required resources, and estimated time to completion; $Q$ is a metric indicating the priority or importance of a task; $U$ reflects the current state of system resources, including CPU utilization, memory status, input/output (I/O) latency, and network latency; $S$ represents the current status of the offload queue, including the queue length, the average waiting time, and the number of recently offloaded jobs; and $T$ represents information related to the task load of the system, including the average processing time, processing speed, and failure rate of the offloaded tasks. Therefore, the state space can be expressed as:

$s_i = \{\zeta, \Omega_{1:i}\}.$
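For illustration, the state described above could be assembled as a flat feature vector concatenating the five DAG vectors with the decision history. The field names and numeric values below are hypothetical placeholders, not the paper's actual encoding:

```python
# Hypothetical flattening of s_i = (zeta, Omega_1:i), where zeta = [P, Q, U, S, T].
def build_state(P, Q, U, S, T, omega):
    zeta = P + Q + U + S + T   # flattened DAG encoding
    return zeta + omega        # state = DAG encoding + offloading-decision history

state = build_state(
    P=[0.3, 12.0],     # task type code, estimated execution time
    Q=[0.8],           # task priority
    U=[0.55, 0.4],     # CPU utilization, memory status
    S=[3.0, 1.2],      # queue length, average waiting time
    T=[0.9, 0.02],     # average processing speed, failure rate
    omega=[0, 2, 1],   # decisions Omega_1:i for the first three tasks
)
```

A fixed ordering like this is what lets the vector serve directly as a neural-network input.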

Action Space
Tasks can be executed locally on a device or offloaded to an MEC server, with each choice influenced by various factors. Local execution may offer faster response times but can be limited by resource constraints, rendering it unsuitable for complex operations. In contrast, offloading to an MEC server relieves the local device's burden and leverages greater computing capabilities. However, this approach can come with potential drawbacks, such as network delays or data transmission costs.
The task's location is determined by the variable $a_i$. If $a_i = 0$, the task is executed locally. Conversely, any nonzero value identifies an MEC server, indicating that the task should be processed there. Different MEC servers may have distinct performance metrics and available resources. Consequently, the offloading decision must consider multiple factors, such as server status, task requirements, and network conditions. The action space, denoted as $A = \{0, 1, 2, \ldots, m\}$, where $m$ represents the number of accessible MEC servers and 0 indicates local execution, plays a crucial role. To ensure optimal performance, this action space requires dynamic management through the integration of diverse offloading strategies and optimization methods.
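A minimal sketch of how an action in A = {0, 1, ..., m} maps to an execution target (the function name is ours):

```python
def interpret_action(a_i, m):
    """Map a_i in A = {0, 1, ..., m} to an execution target:
    0 -> local execution, k > 0 -> MEC server k."""
    if not 0 <= a_i <= m:
        raise ValueError(f"action {a_i} outside A = {{0..{m}}}")
    return "local" if a_i == 0 else f"mec-server-{a_i}"
```

Validating the action against m matters because the set of reachable servers can change as the environment evolves.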

Reward Function
Several strategies can improve QoS, with adjustments based on key variables; system latency and energy consumption are of particular importance. In existing strategies, the reward function focuses on the absolute system delay and energy consumption as the key factors for optimizing QoS. However, this approach is computationally demanding and fails to capture real-time changes in delay and energy consumption. To address this, we introduce a more intuitive and transparent reward function whose calculation integrates increments that precisely capture the variations in actual delay and energy consumption at each step:
$\Delta T_{ofl} = T_{ofl}(t) - T_{ofl}(t-1), \qquad \Delta E_{ofl} = E_{ofl}(t) - E_{ofl}(t-1),$

where $\Delta T_{ofl}$ denotes the variation in system delay between successive time intervals, and $\Delta E_{ofl}$ signifies the alteration in energy consumption over the same period. These incremental changes are pivotal in promptly discerning shifts in the performance and efficacy of a system and serve as central metrics in both the optimization and evaluation processes.
The reward function imposes penalties on the incremental delay and energy consumption at each step. By normalizing these increments against the overall local delay and energy consumption, the function balances the penalization of inefficiencies with the assessment of system performance; it is defined as follows:

$r_t = -\left( \frac{\Delta T_{ofl}}{T_{loc}^{all}} + \frac{\Delta E_{ofl}}{E_{loc}^{all}} \right),$

where $T_{loc}^{all}$ normalizes the increase in delay relative to the overall local delay, and $E_{loc}^{all}$ normalizes the increment of energy consumption relative to the total local energy consumption. Both elements are structured to mirror the efficiency and responsiveness of the system.
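The step reward described above (penalized delay and energy increments, each normalized by its all-local baseline) can be sketched as follows; the function and argument names are ours:

```python
def reward(delta_t, delta_e, t_all_loc, e_all_loc):
    """Step reward: penalize the delay and energy increments, each normalized
    by the corresponding all-local baseline."""
    return -(delta_t / t_all_loc + delta_e / e_all_loc)
```

The normalization keeps the two penalty terms on a comparable scale, so neither latency nor energy dominates the learning signal by units alone.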

Design of SACTOS
The offloading process comprises three steps. In step 1, the tasks within a DAG are topologically sorted and then arranged in descending order of their rank values, yielding a scheduling sequence in which $T_i^{total}$ represents the total time required to complete processing within the defined set. In step 2, the tasks are transformed into a sequence of vectors that serve as inputs to the neural network. During this transformation, feedback such as the network and device status (e.g., battery level and CPU usage) is incorporated to enhance offloading efficiency. In step 3, the offloading decision is made by selecting the most probable offloading action, and the MEC servers collaborate to complete the selected tasks in accordance with this action. Moreover, potential overloads are distributed through cooperation between multiple MEC servers or SDs, thereby optimizing resource utilization. The comprehensive procedure of the proposed SACTOS algorithm is detailed in Algorithm 1.
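Step 1 can be sketched with a HEFT-style upward rank, under the assumption (common in list scheduling, though not stated explicitly here) that a task's rank is its cost plus the maximum rank of its successors; parents then always outrank their children, so sorting by descending rank also respects precedence. The four-task DAG and its costs are invented for illustration:

```python
def upward_rank(costs, children, task, memo=None):
    """rank(t) = cost(t) + max rank over t's successors (HEFT-style).
    Parents always outrank their children, so descending-rank order
    is simultaneously a valid topological order."""
    if memo is None:
        memo = {}
    if task not in memo:
        succ = children.get(task, [])
        memo[task] = costs[task] + (
            max(upward_rank(costs, children, c, memo) for c in succ) if succ else 0.0
        )
    return memo[task]

# Hypothetical 4-task DAG: t0 -> {t1, t2} -> t3
children = {"t0": ["t1", "t2"], "t1": ["t3"], "t2": ["t3"], "t3": []}
costs = {"t0": 2.0, "t1": 3.0, "t2": 1.0, "t3": 4.0}
order = sorted(costs, key=lambda t: -upward_rank(costs, children, t))
```

Here the ranks come out as 9, 7, 5, and 4, so the schedule starts at the DAG entry and never places a task before its predecessors.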
The SACTOS algorithm optimizes task offloading for the MEC server. After initializing the neural networks and memory buffers, it prioritizes tasks and transforms them into neural-network inputs. The algorithm then repeatedly selects and executes tasks based on the offloading probability: guided by the chosen action a, the selected MEC server and the SD collaborate to complete each task, distributing the load and optimizing resource usage among multiple MEC hosts or SDs.

In Figure 2, we present the detailed structure of the SAC scheme, a deep reinforcement learning algorithm known for its adaptability and efficiency. The SAC architecture combines an off-policy method with the advantages of the actor-critic paradigm and a maximum-entropy framework. At its core, the actor network generates actions based on the input state. Unlike deterministic strategies, SAC's actor produces actions from a stochastic policy, facilitating a comprehensive exploration of potential actions. SAC also stands out with its dual-critic system: each critic network independently predicts the Q-value for a given state-action pair, mitigating the overestimation biases common in deep Q-learning algorithms. For policy optimization, SAC uses the smaller Q-value of its two critics, yielding a cautious and robust policy that accounts for the variability inherent in single-critic evaluations. A distinctive feature of SAC is its entropy regularization, which encourages extensive exploration by adding an entropy term to the reward, nudging the actor toward more exploratory action choices. Lastly, the critics are updated using a 'soft' Bellman equation, which blends estimated and target Q-values, greatly enhancing learning stability. Overall, Figure 2 provides a comprehensive visual overview of SAC's architecture, showcasing its ability to derive optimal policies in complex environments by effectively balancing exploration and exploitation.
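The dual-critic 'soft' Bellman target described above can be sketched as a scalar computation (real implementations operate on batched tensors; the function name is ours):

```python
import math

def soft_td_target(reward_t, gamma, q1_next, q2_next, log_pi_next, alpha):
    """'Soft' Bellman target used to update both critics:
    y = r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s')).
    Taking the minimum of the two critics curbs Q-value overestimation."""
    v_next = min(q1_next, q2_next) - alpha * log_pi_next
    return reward_t + gamma * v_next
```

Because the entropy term -alpha * log pi enters the target, exploratory (low-probability) actions receive a bonus, which is how the maximum-entropy objective shapes the critics.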

Performance Evaluation
This section discusses the experimental results of SACTOS and the methodology used to assess its performance. First, the parameters employed in the DAG generator are presented. Next, the simulation environment and hyperparameters are described. The average reward is then analyzed to gauge the convergence behavior of SACTOS. Finally, SACTOS is compared with four benchmark algorithms to highlight its efficacy.

Fundamental Approaches
This section describes the assessment of the performance of the proposed offloading scheme under various parameter settings to validate both its effectiveness and convergence. The following four offloading schemes were evaluated:

• Local-only: All computational tasks were performed exclusively on local devices.
• Random: Tasks were offloaded based on a random selection [38].
• Proximal policy optimization (PPO)-based scheme: PPO is an advancement of the policy gradient method; however, it still has constraints in terms of sampling efficiency [39].
• Dueling double deep Q-network (D3QN)-based scheme: This method combines the strengths of the dueling DQN and the double DQN, further enhancing the learning algorithm and structure of the conventional DQN.

Simulation and Results
Although real-world application programs can be depicted using DAGs with diverse topologies, currently available datasets provide limited application information [34]. To implement the proposed offloading method, a graph theory-based approach is required, even for applications for which the topological information is not present in an actual dataset. By analyzing the interactions and data flows between applications, a graph theory-based approach can facilitate more efficient offloading decisions.
For the simulation experiment, we considered both the channel gain and transmission rate based on the distance between the SD and the MEC server. The baseline transmission rate was assumed to be 100 Mbps.
The CPU clock frequency of the local device, $f_i^{loc}$, was set to 2 GHz, while that of the MEC server, $f_i^{mec}$, was set to 8 GHz. The transmission and reception powers, $P_{tran}$ and $P_{recv}$, were set to 2.5 and 1.8 W, respectively. The task sizes ranged from 10 to 50 KB, and the number of clock cycles required for a single task ranged between $10^7$ and $10^8$. Each agent consisted of an actor and a critic, each containing two hidden, fully connected layers with 256 neurons. The essential hyperparameters for the SAC implementation are listed in Table 4.

The results depicted in Figure 3 show that the rapid convergence and impressive average reward of SAC are immediately noticeable. While the initial performances of all the algorithms appeared comparable, SAC gradually distinguished itself through superior performance enhancements over time. Both D3QN and PPO exhibited faster convergence than SAC in the initial stages, but SAC surpassed them midway through the experiments. This trend highlights the more efficient learning strategy of SAC in complex offloading scenarios compared with those of the other two methods. The performance of D3QN improved steadily but ultimately settled at a reward lower than those of SAC and PPO, while PPO improved rapidly in the initial phase, slowed in the later stages, and finished below SAC. These findings underscore the effectiveness of SAC as a primary reinforcement learning algorithm for handling intricate offloading settings.
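Under the stated link settings (100 Mbps, transmission power 2.5 W, reception power 1.8 W), per-task transfer time and radio energy can be sketched as below. The function name is ours, and the 5 KB result size in the usage example is an assumption, not a figure from the paper:

```python
def transfer_cost(size_kb, rate_mbps, power_w):
    """Time (s) and radio energy (J) to move size_kb over the link
    at rate_mbps while drawing power_w."""
    bits = size_kb * 1024 * 8
    t = bits / (rate_mbps * 1e6)
    return t, power_w * t

t_up, e_up = transfer_cost(50, 100, 2.5)   # upload the largest 50 KB task
t_dn, e_dn = transfer_cost(5, 100, 1.8)    # hypothetical 5 KB result download
```

Even the largest task uploads in about 4 ms at roughly 0.01 J, which illustrates why, under these settings, computation rather than transmission tends to dominate the offloading cost.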
Figure 4 presents a comparison of the time and energy conservation scales (TECS) across the four offloading methods as the task count increased. The TECS quantifies the reduction in local task execution costs relative to the total post-offloading costs, effectively highlighting the trade-off between latency and energy consumption. For this experiment, the aspect ratio of the graph was fixed at 0.45, three MEC servers were considered, and a communication-to-computation ratio (CCR) of 0.5 was used. Compared with PPOS, D3QNS, and Random, SACTOS exhibited average TECS improvements of 2.24%, 52.52%, and 74.25%, respectively, indicating its superior optimization of offloading tasks. Although an increase in task count invariably led to increases in delay and energy consumption, the gradual growth of the post-offloading costs led to higher TECS values for SACTOS.
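The text characterizes the TECS as the reduction in local execution cost relative to the post-offloading cost without spelling out the formula; one plausible reading, expressed as a fraction of the all-local cost, is sketched below (the function name and exact formulation are our assumption):

```python
def tecs(cost_local, cost_offload):
    """Time and energy conservation scale (one plausible formulation):
    the saving achieved by offloading, as a fraction of the all-local cost."""
    return (cost_local - cost_offload) / cost_local
```

Under this reading, a TECS of 0 means offloading saved nothing, and values approach 1 as the post-offloading cost shrinks relative to all-local execution.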
Figures 5 and 6 provide an in-depth view of the service waiting time and energy consumption of each offloading strategy as the number of tasks progressively increased. As the workload grew, we observed a corresponding rise in service waiting time and energy consumption across all strategies. This consistent increase highlights the inherent challenges of managing higher workloads in offloading systems. Notably, the local-only strategy exhibited a significant spike in both latency and energy consumption. Such a rapid increase indicates inherent scalability limitations in this strategy, possibly owing to the absence of an adaptive mechanism for efficiently managing and distributing growing tasks.
At the other end of the spectrum, DRL-based offloading strategies such as PPO, D3QN, and SACTOS demonstrated a relatively stable trend. Their consistent performance, even as the number of tasks increased, suggests their ability to effectively balance the dual objectives of conserving energy and reducing waiting time. Among these strategies, SACTOS emerged as the top performer: it reduced waiting time by 1.6-62.19% on average and demonstrated commendable energy efficiency, reducing consumption by 6.7-88.23%. These statistics highlight SACTOS's robust learning mechanism, which continuously refines its offloading decisions in response to changes in task volumes.
It is important to note the growth pattern exhibited by the DRL-based strategies. Unlike the sudden increase observed with the local-only approach, these strategies demonstrated more gradual growth. This pattern suggests their capacity to adapt and allocate resources strategically, preventing spikes in demand from causing disproportionate rises in waiting times or energy expenses.
Essentially, while all strategies encountered challenges as the number of tasks increased, the DRL-based approaches, particularly SACTOS, demonstrated superior capabilities in handling these challenges. Their performance not only surpassed that of the other strategies but also highlighted the significance of adaptive learning in optimizing offloading decisions, thereby reducing both latency and energy consumption.
Figure 7 illustrates the variation in the TECS with respect to the CCR. The figure clearly shows that the optimization performance of SACTOS increased when the CCR was low, that is, when computational tasks dominated. These results highlight the effectiveness of SACTOS in environments with more computationally intensive workloads. Moreover, SACTOS outperformed the other methods, exhibiting performance 6.36%, 40.2%, and 48.6% superior to that of PPO, D3QN, and Random, respectively. In contrast, PPO, D3QN, and Random displayed inconsistent performance shifts with varying CCRs. This observation underscores the necessity for an offloading strategy adept at handling the surges in latency and energy expenditure associated with localized computing in a computation-centric setting. The exceptional efficiency of SACTOS is pivotal, especially in intricate scenarios where a proper balance between communication and computation is essential.

Conclusions and Future Work
This study investigated the scheduling of dependent task offloading, aiming to optimize both latency and energy consumption within multi-server, multi-SD settings. To adeptly navigate the ever-changing MEC landscape, we incorporated a collaborative architecture among MECs, framing it within an MDP context. We articulated the dependent tasks using a DAG and honed the MEC system using DRL. The proposed SACTOS, which operates under centralized control, significantly reduces service delays and conserves energy. The experimental outcomes confirmed that SACTOS not only converged swiftly and stably but also outperformed existing methodologies such as local-only, random, D3QN, and PPO.
Our future research trajectory points toward the integration of multi-agent reinforcement learning. Our goal is to increase task offloading efficiency by judiciously tapping into the shared resource pool inherent in the MEC setup, wherein each SD is perceived as an autonomous agent, and to explore the synergies of multi-agent collaboration to reduce costs and increase the QoS of the system.
which is the expected value of the minimum Q-value between the two Q-functions $Q_{\theta_1}$ and $Q_{\theta_2}$ for the next state $s_{t+1}$ and the action chosen by policy $\pi$.

$-\alpha \log \pi(a \mid s_{t+1})$

This term encourages exploration. Here, $\pi(a \mid s_{t+1})$ represents the probability of executing action $a$ in state $s_{t+1}$ according to policy $\pi$, and $\alpha$ is the temperature parameter that balances the importance of the entropy term against the reward.

Combining all of these terms, we obtain:

$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{a \sim \pi}\left[\min_{i=1,2} Q_{\theta_i}(s_{t+1}, a) - \alpha \log \pi(a \mid s_{t+1})\right]\right)\right)^{2}\right].$

This equation aims to minimize the mean squared difference between the predicted and target Q-values using data sampled from replay buffer $\mathcal{D}$.
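Putting the appendix terms together, the minibatch Q-loss can be sketched with scalar stand-ins for network outputs; each batch entry bundles the quantities a real implementation would compute with the target critics and the policy (the function name and tuple layout are ours):

```python
def q_loss(batch, q_theta, gamma, alpha):
    """Mean squared error between predicted Q-values and soft Bellman targets
    over a replay-buffer minibatch (a scalar sketch of J_Q(theta))."""
    total = 0.0
    for (s, a, r, q1_next, q2_next, log_pi_next) in batch:
        # Soft target: reward plus discounted min-critic value with entropy bonus.
        y = r + gamma * (min(q1_next, q2_next) - alpha * log_pi_next)
        total += 0.5 * (q_theta[(s, a)] - y) ** 2
    return total / len(batch)
```

Gradient descent on this loss (with respect to the critic parameters behind `q_theta`) is what the appendix derivation justifies.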
Time taken to upload and download task $\tau_i$
$E_i^{loc}$: Energy consumed by a local computing model for task $\tau_i$
$E_i^{mec}$: Energy consumed in computational offloading for task $\tau_i$
$E_{ofl}^{total}$: Total energy consumed in computational offloading for task $\tau_i$
$\beta_1, \beta_2$: Weighting coefficients
CPU: central processing unit; SD: smart device.

Figure 1. Architecture of the proposed system.


Figure 4. Impact of the number of tasks on the time and energy conservation scales (TECS).

Figure 5. Impact of the number of tasks on the latency.


Figure 6. Impact of the number of tasks on energy consumption.


Figure 7. Impact of the communication-to-computation ratio (CCR) on the TECS.

Table 1. Comparison of existing approaches.

Table 2. Mathematical notations of the system model.

Table 3. Mathematical notations of task offloading.