Article

Efficient Delay-Sensitive Task Offloading to Fog Computing with Multi-Agent Twin Delayed Deep Deterministic Policy Gradient

by Endris Mohammed Ali 1, Frezewd Lemma 1, Ramasamy Srinivasagan 2,* and Jemal Abawajy 3
1 Department of Computer Science and Engineering, College of Electrical Engineering and Computing, Adama Science and Technology University, Adama P.O. Box 1888, Ethiopia
2 Computer Engineering, CCSIT, King Faisal University, Al Hufuf 31982, Saudi Arabia
3 Faculty of Science, Engineering and Built Environment, Deakin University, Geelong, VIC 3220, Australia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2169; https://doi.org/10.3390/electronics14112169
Submission received: 19 April 2025 / Revised: 22 May 2025 / Accepted: 22 May 2025 / Published: 27 May 2025

Abstract: Fog computing presents a significant paradigm for extending the computational capabilities of resource-constrained devices executing increasingly complex applications. However, effectively leveraging this potential critically depends on efficient task offloading to proximal fog nodes, particularly under high resource contention. To address this challenge, we introduce MAFCPTORA (multi-agent fully cooperative partial task offloading and resource allocation), a decentralized multi-agent deep reinforcement learning algorithm for cooperative task offloading and resource allocation. We evaluated MAFCPTORA against recent approaches, and it demonstrated superior performance, achieving a significantly higher average reward (0.36 ± 0.01), substantially lower average latency (0.08 ± 0.01), and reduced energy consumption (0.76 ± 0.14).

1. Introduction

The rapid proliferation of Internet of Things (IoT) and consumer electronics (CE) devices, projected to reach a staggering scale by the end of the decade [1], has spurred the development of real-time, low-latency applications in domains such as industrial IoT [2], voice recognition [3], and medical IoT [4]. The inherent resource limitations of these devices prevent them from executing such demanding applications, necessitating efficient task offloading to other computing resources with minimal communication latency. Consequently, fog computing has received significant attention as a promising platform for processing these IoT applications [5,6,7,8,9].
The distributed, dynamic, and resource-constrained nature of fog devices makes effective resource allocation for diverse, delay-sensitive IoT tasks a fundamental challenge. This is further complicated by dynamic network conditions and the potential lack of effective Fog-to-Fog (F2F) cooperation. To tackle these complexities, prior research has explored various techniques, including machine learning algorithms [7,10] for adaptive decision-making, agent-based systems [9,11,12,13] for distributed control, metaheuristic algorithms [8] for near-optimal solutions, deep reinforcement learning [14], and traditional optimization techniques [15].
While existing approaches commonly focus on offloading tasks to fog nodes, simply enabling this is insufficient. Effectively balancing resource allocation across heterogeneous fog nodes, addressing the need for distributed execution of intensive tasks, and enabling efficient F2F cooperation remain critical. Furthermore, optimizing for a single scheduling parameter often fails to simultaneously ensure high QoS for users and low energy costs for providers [15]. Therefore, finding an optimal solution for balancing task offloading and resource allocation within the dynamic and resource-constrained fog computing environment, particularly for demanding applications, remains a fundamental challenge.
To address this challenge, this paper aims to answer the following fundamental question: How can we minimize system-wide communication and computation latency while simultaneously reducing the energy consumption of fog nodes in the context of executing demanding IoT applications? To this end, we propose a novel approach leveraging multi-agent deep reinforcement learning (MADRL) for efficient task offloading and scheduling on heterogeneous and resource-limited fog computing resources. Integral to this approach is Fog-to-Fog (F2F) cooperation, allowing fog nodes to communicate and coordinate their actions to optimize both local and global performance. Our key contributions include:
  • Formulate the task offloading and resource allocation problem as a Markov decision process (MDP) with a stochastic continuous action space to address computational and transmission delays in a multi-task, multi-node environment.
  • Develop a Multi-Agent Fully Cooperative Partial Task Offloading and Resource Allocation (MAFCPTORA) algorithm that enables parallel task execution across adjacent fog nodes in an F2F architecture, leading to faster execution, reduced latency and energy consumption, and balanced workload distribution.
  • Evaluate the performance of our algorithm, demonstrating its effectiveness in improving the deadline fulfillment rate for IoT tasks while minimizing the total energy consumption and makespan of the system, thereby benefiting fog service providers.
The rest of this paper is organized as follows: Related research is presented in Section 2, followed by Section 3, which describes the system architecture, including the task model, dynamic task offloading, and resource allocation models; Section 3.6 presents the problem formulation. The proposed TD3-based task offloading and resource allocation algorithms and key parameters are detailed in Section 4 and Section 5. A discussion of simulation results and performance analysis is presented in Section 6. Finally, the conclusion and future recommendations are provided in Section 7.

2. Related Work

In this section, we review related work. Table 1 describes the abbreviations used throughout the paper. Table 2 provides a summary of research that focuses primarily on task offloading across IoT, edge, fog, and cloud environments. Machine learning (ML) techniques, particularly deep reinforcement learning (DRL), are increasingly being explored to tackle the complexities of task offloading and resource allocation in fog computing, aiming to improve quality of service (QoS). These approaches include multi-agent soft actor-critic (MASAC) [16], multi-agent deep deterministic policy gradient (MADDPG) [13], multi-agent proximal policy optimization (MAPPO) [17], and multi-agent twin delayed DDPG (MATD3) [18].

2.1. Machine Learning-Based Approaches

Zhu et al. [19] employed deep learning to balance task completion time and energy consumption by minimizing a weighted sum, leveraging network information like resource status and channel conditions. However, their approach relies on a pre-trained model with a fixed dataset, potentially limiting its adaptability to dynamic network conditions. Ghanavati et al. [8] introduced a novel task offloading and scheduling algorithm with emphasis on enhancing the efficiency of fog computing while simultaneously minimizing energy consumption. Their approach leverages the Ant Colony Optimization for Mining Outliers (AMO) algorithm, a biologically inspired optimization technique. A key aspect of their proposed method involves strategically distributing tasks among geographically proximate fog nodes. This localized task allocation aims to reduce communication overhead and latency, contributing to both improved performance and energy savings. Furthermore, their work addresses the critical trade-off between the system’s operational lifetime and the energy consumed by fog computing services, allowing end-user devices to influence this balance based on their specific requirements and priorities. The authors conducted a comprehensive empirical performance evaluation, comparing their AMO-based scheduling algorithm against established optimization techniques such as Particle Swarm Optimization (PSO), the Bug Life Algorithm (BLA), and the Genetic Algorithm (GA). The results of this evaluation reportedly demonstrate the superiority of their proposed approach in terms of both energy efficiency and processing speed, suggesting a significant advancement in task scheduling for fog computing environments.
Table 2. Related Research Work.

| Domain | Problem | RL Method | QoS Objectives | Agent Type | Network | Papers |
|---|---|---|---|---|---|---|
| VFC | V2V partial computation offloading | SAC | Minimize task execution delay and energy consumption | Multi-Agent (UAV) | Dynamic | [16] |
| IoT-edge-UAV | Task offloading and resource allocation | MADDPG | Minimize computation cost | Multi-Agent (UAV) | Dynamic | [20] |
| Vehicular Fog | Optimize ratio-based offloading | DQN | Minimize average system latency, and lower decision and convergence time | Single-Agent | Dynamic | [21] |
| VFC | Minimize waiting time and end-to-end delay of tasks | PPO | Minimize waiting time, end-to-end delay, packet loss; maximize percentage of in-time completed tasks | Single-Agent | Dynamic | [22] |
| Multi-Fog Networks | Joint task offloading and resource allocation | DRQN | Maximize processed tasks and minimize task drops from buffer overflows | Multi-Agent | Dynamic | [23] |
| Edge-UAV | Task offloading | MADDPG | Minimize energy consumption and delay | Multi-Agent UAV | Unstable | [11] |
| IoMT | Computation offloading and resource allocation | MADDPG | Latency reduction | Multi-Agent | Unstable | [13] |
| Multi-cloud (Fog) | Dynamic computation offloading | CMATD3 | Minimize long-term average total system cost | Multi-Agent (Cooperative) | Dynamic | [24] |
| MEC | Multi-channel access and task offloading | DPPG | Minimize task computation delay | MTA | Stable | [25] |
| Vehicular Fog | Load balancing problem | TPDMAAC | Minimize system cost | MV | Hybrid | [14] |
| Vehicular Fog | Task offloading and computation load | AC | Maximize resource trading | MA-GAC | Hybrid | [9] |
| Fog-RAN | Computation offloading and resource allocation | DDPG | Minimize task computation delay and energy consumption | Multi-Agent (Federated) | Dynamic | [26] |
| MEC and Vehicular Fog | Traffic congestion and MEC overload | TD3 | Minimize latency and energy consumption | Multi-Agent | Hybrid | [18] |

2.2. Reinforcement Learning

The user device computation requests and the fog resource scenario are dynamic: node resources in the network change continuously. This network topology with variable computing and network resources, explored in [27,28], underlines the challenge of designing an optimal model for task offloading and resource allocation decisions that improve response time. To deal with this instability, [29] designed a distributed task priority (TPRA) and maximal resource utilization (MaxRU) allocation scheme. Furthermore, the application of RL to handle changes in the workload status and resource availability of fog nodes is explored in [28,30]. These resource allocation mechanisms are based on the actual response time, measured as the average delay of unit task offloading, which changes over time in the system. This primarily affects task offloading and resource allocation decisions among the F_{m,j} nodes and their cooperation approaches. However, offloading tasks to a specific F_j with limited resources results in a long response time.

2.3. Deep Reinforcement Learning (DRL)

Recognizing that task offloading in dynamic fog environments, characterized by stochastic computing tasks, varying resources, and fluctuating network channels [31], aligns well with DRL’s capabilities, researchers have explored its application. Zhang et al. [32] proposed a DRL-based approach for multi-user, multi-fog environments with dynamic workloads and server capacities, aiming to minimize task execution delay and energy consumption, albeit requiring coordination for their hybrid action space. Qiu et al. [33] addressed the limitations of traditional long-term performance optimization by introducing a distributed and collective training method to maximize cumulative rewards in multi-edge, multi-user offloading scenarios, claiming improved convergence and reduced system cost. Furthermore, Baek et al. [23] presented a deep recurrent Q-network (DRQN) for partially observable, dynamic fog environments with heterogeneous resources, focusing on CPU and memory. Their approach emphasizes local cooperation to maximize aggregated local rewards, with nodes relying solely on local observations. However, the search for a suitable fog node can lead to an exponentially growing search space as task size increases. More recently, Wu et al. [17] proposed a task-offloading algorithm using an improved multi-agent proximal policy optimization (MAPPO) to minimize delay and energy consumption, while Bai et al. [24] applied a DQN-based solution to address user request delays and energy consumption.
From an architectural and management standpoint, a key challenge lies in deciding the optimal placement and management strategy for learning-based offloading algorithms within distributed fog environments, whether centrally managed through the fog gateway or distributed across individual fog nodes. Ghanavati et al. [34] emphasized the importance of fault-aware task scheduling in fog computing platforms for ensuring reliability, efficiency, and resource utilization. Thus, Dynamic Fault Tolerant Learning Automata (DFTLA) is proposed as an approach to fault-tolerant learning automata task scheduling. An active variable learning automaton is embedded in every fog node of a DFTLA-based system, which continuously monitors the process and makes appropriate decisions in response to inputs from the environment. Comparing the proposed DFTLA scheduler with three baseline methods, they evaluated its performance. Using the proposed algorithm, reliable task execution can be accomplished while responding quickly and consuming little energy. All performance evaluation criteria showed that the proposed approach outperformed the baseline algorithms. Cao et al. [25] investigated a distributed deep reinforcement learning (DRL) approach, asserting that multi-agent deep deterministic policy gradient (MADDPG) and multi-agent proximal policy optimization (MAPPO) offer more accurate and faster decision-making compared to single-agent systems in real-time scenarios. Similarly, Hussein et al. [7] considered latency constraints arising from the distributed nature of fog nodes, aiming to identify nodes that meet quality of service (QoS) requirements.
Chen et al. [35] proposed a cooperative multi-agent deep reinforcement learning (CMADRL) framework where IoT devices learn independently but are centrally trained to minimize system-wide costs associated with energy consumption and server rentals. To address these dynamic contexts, multi-agent DRL (MADRL) has emerged as a prominent approach, with agents deployed in both decentralized [11,14] and centralized [13] architectures. Furthermore, Zhang et al. [36] categorized multi-agent settings in stochastic task offloading games based on their reward structures: fully cooperative, fully competitive, and mixed. In this vein, Suzuki et al. [37] and Zhu et al. [38] implemented MADRL techniques employing central training with decentralized execution strategies.

2.4. Multi-Agent Deep Reinforcement Learning

Multi-agent deep reinforcement learning (MADRL) has emerged as a powerful paradigm for tackling complex resource management challenges in distributed systems. Its applicability spans diverse domains, including UAV-based task offloading and resource allocation in air-to-ground (A2G) networks [20], where it optimizes for average local and global maxima in continuous search spaces. MADRL has also been instrumental in enhancing vehicular fog resource utilization for many-to-many task offloading in VFC networks [9] and in managing task offloading within space-aerial-ground integrated networks (SAGIN) [11], while ensuring user device quality of service (QoS). Furthermore, Suzuki et al. [37] reinforce the utility of cooperative MADRL in addressing reinforcement learning problems within multi-cloud edge networks. Building upon this foundation and drawing inspiration from [13], which explores multi-UAV environments with unreliable computational resources, this work formulates the joint task offloading and resource allocation problem as a stochastic game. Multi-agent TD3 (MADRL-based TD3) [39] is proposed to address challenges in large systems by reducing the input and action space for individual agents, leading to faster convergence and more efficient processing than single-agent TD3. A key distinction of our proposed fog network lies in its composition of both reliable and unreliable computational sources and communication networks, enabling adaptation to a wider range of situational scenarios.
In summary, existing research predominantly focuses on vertical task offloading (to edge, fog, or cloud), with less attention given to horizontal offloading among peers or neighbors. Furthermore, simultaneously considering factors like mobility and task dependencies remains a significant challenge. Notably, few prior works in fog computing address task priority, neighbor service availability, or resource contribution incentives. The complexity and robustness of decision-making algorithms, especially their decision time, are also often overlooked.

3. System Model

This section details our proposed two-layer network model, task model, communication model, computation model, and formulation of the problem. For the purpose of this study, we assume reliable fog nodes and communication links. Table 3 shows the lists of notation used throughout this article.

3.1. Network Model

We model a multi-user, multi-fog environment (as illustrated in Figure 2) to handle diverse service demands. The network features multiple heterogeneous and resource-constrained user devices U_n, n = 1, 2, ..., N, connected wirelessly to a hierarchical fog layer (F). There is also a set F = \{F_1, F_2, \ldots, F_M\} of heterogeneous fog nodes, where M ≤ N. Each fog node maintains its resource status and shares updates with its neighbors, enabling collaborative task offloading and resource management. Moreover, processing capacity increases from the bottom to the top layer, as do energy consumption and latency.
Our system model incorporates a hybrid two-stage task execution process: User-to-Fog (U2F) and Fog-to-Fog (F2F). Task execution proceeds in two phases. In the U2F phase, a task originating from a user device U_n is offloaded to a fog node F_m (the task-owning node, m ∈ {1, 2, ..., M}). F_m is selected to minimize task transmission delay and response time; we refer to task execution on F_m as local computation. The second phase, the F2F phase, is invoked when local execution at F_m cannot guarantee the QoS requirements. In this case, we employ a horizontal distributed F2F cooperation mechanism following the approach in [28], depicted in Figure 1: tasks are redistributed from F_m to an adjacent node F_j (j ∈ {1, 2, ..., M}) for further processing or collaborative execution. Figure 2 illustrates the task offloading and execution scenario.
In this fog layer, task allocation to adjacent fog nodes F_j is governed by an offloading probability ratio. This ratio is dynamically determined by factors such as the available resources of each F_j and the requirements of the user's task, including its computation deadline and priority. Within this cooperative peer-to-peer (F2F) framework, each fog node (F_m and its neighbors F_j) can make independent offloading decisions. Consequently, the set of tasks to be offloaded to a specific F_j is denoted as k_{n,j}, j = 1, 2, ..., n. To analyze the offloading decision, time is divided into discrete decision steps (episodes) t = 1, 2, 3, ..., T.

3.2. Task Model

We consider multiple tasks initiated from U n s denoted as K n , where { n = 1 , 2 , , N } . The tasks can be represented as a directed acyclic graph (DAG) shown in Figure 3.
The task profile of the nth user application can be defined in Equation (1):
K_n = \{V_k, C_n, \tau_{max}\}
where V_k, C_n, and \tau_{max} denote the task data size, the CPU cycles required to compute the task, and the maximum tolerable deadline (expiry time) of task K_n, respectively. The computation of task K_n at time t is determined by the total task size V_k and the device computation intensity b_u(t).
The DAG task in Figure 3 demonstrates parallel task execution enabled by task dependencies. Taking task K_n, composed of sub-tasks k_1, k_2, k_3, k_4, k_5, k_6, k_7, as an example: k_1, k_2, and k_3 can run in parallel; k_4 waits for k_2 to finish; k_5 and k_6 wait for k_4; and k_7 depends on k_5 and k_6 and executes only after both are finished.
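The dependency structure above can be captured with a simple predecessor map. The following sketch (illustrative only; the sub-task names and the scheduling helper are not from the paper) lists the successive waves of sub-tasks of Figure 3 that may execute in parallel.

```python
# Illustrative sketch (not from the paper): representing the DAG of Figure 3
# and listing the sub-tasks that are ready to run in parallel at each step.

# predecessors[k] = set of sub-tasks that must finish before k can start
predecessors = {
    "k1": set(), "k2": set(), "k3": set(),   # k1, k2, k3 have no dependencies
    "k4": {"k2"},                            # k4 waits for k2
    "k5": {"k4"}, "k6": {"k4"},              # k5, k6 wait for k4
    "k7": {"k5", "k6"},                      # k7 waits for both k5 and k6
}

def parallel_schedule(predecessors):
    """Return successive waves of sub-tasks that can execute in parallel."""
    done, waves = set(), []
    remaining = dict(predecessors)
    while remaining:
        ready = [k for k, deps in remaining.items() if deps <= done]
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        waves.append(sorted(ready))
        done.update(ready)
        for k in ready:
            remaining.pop(k)
    return waves

print(parallel_schedule(predecessors))
# -> [['k1', 'k2', 'k3'], ['k4'], ['k5', 'k6'], ['k7']]
```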

3.3. Dynamic Task Partitioning

In our F2F task offloading mechanism, dynamic task partitioning is employed. A DRL agent learns to partition tasks effectively by observing real-time system conditions such as resource availability, task requirements, network bandwidth, and computational cost. For partial offloading, where tasks are distributed across multiple adjacent fog nodes, each fog node F_j is assigned an offloading probability ratio p_j, where 0 < p_j < 1. The complete probability distribution for a task K_n offloaded from F_m to its neighbors is represented by the vector P = (p_1, p_2, ..., p_{n+1}). Here, p_j(t) = 0 for all adjacent nodes if node F_m processes all arrived tasks K_n locally, while p_j(t) ≠ 0 implies that F_m offloads the fraction p_j(t) of K_n to the adjacent node F_j. This probability vector guides the decision-making process for distributing portions of the task to the available F_j nodes, considering their computational capabilities, the network latency between F_m and each F_j, and their current resource availability. Equation (2) presents the task offloading probability ratio for each F_j:
p_j = \frac{k_{n,j}}{\sum_{j=1}^{n} k_{n,j}}
where \sum_{j=1}^{n} k_{n,j} is the total amount of tasks offloaded to the adjacent nodes F_j, and Equation (3) gives the task portion executed by F_j:
k_{n,j} = p_j \times K_n
For the local task execution, the remaining partition to be executed at the source ( f m ) is defined as in Equation (4):
p_m = 1 - \sum_{j=1}^{n} p_j
where K_n indicates the total task load being offloaded, defined as K_n = \sum_{j=1}^{n} k_{n,j}. With this, the total probability ratio over F_m and all F_j must satisfy Equation (5):
p_m + \sum_{j=1}^{n} p_j = 1
This ensures that the set of partial offloading ratios forms a valid probability distribution in a continuous action space, where each x_{m,j} ∈ (0, 1).
Following this partitioning scheme, task offloading proceeds dynamically among horizontally connected fog nodes. To meet task execution requirements under fluctuating conditions, an overloaded fog node ( F m ) transfers tasks to less utilized or idle nodes ( F j ). Figure 2 illustrates this dynamic task partitioning and offloading process.
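As a concrete illustration of Equations (2)-(5), the following sketch turns raw per-node allocation scores into a local share p_m and neighbor ratios p_j that sum to one; the function name, inputs, and scoring scheme are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (assumed helper, not the paper's code): turning raw per-node
# allocation scores into the offloading ratios of Equations (2)-(5).
import numpy as np

def partition_task(total_task_K, raw_scores, local_score):
    """raw_scores[j]: unnormalized preference for neighbor F_j;
    local_score: preference for executing at the owning node F_m."""
    scores = np.concatenate(([local_score], np.asarray(raw_scores, dtype=float)))
    ratios = scores / scores.sum()              # valid probability distribution
    p_m, p_j = ratios[0], ratios[1:]            # local share and neighbor shares
    k_nj = p_j * total_task_K                   # Eq. (3): portion sent to each F_j
    k_local = p_m * total_task_K                # remainder executed at F_m
    assert np.isclose(p_m + p_j.sum(), 1.0)     # Eq. (5): ratios sum to 1
    return p_m, p_j, k_local, k_nj

p_m, p_j, k_local, k_nj = partition_task(total_task_K=100.0,
                                         raw_scores=[0.2, 0.5, 0.3],
                                         local_score=1.0)
print(round(p_m, 2), np.round(p_j, 2))   # 0.5 [0.1  0.25 0.15]
```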

3.4. Communication Model

The communication model in our proposed network architecture encompasses two types of communication: User-to-Fog (U2F), involving continuous communication between user devices U_n and fog nodes F_m, and Fog-to-Fog (F2F), between fog nodes F_m and F_j. This section details these communication aspects.

3.4.1. U2F Communication

Following the model in [40], wireless communication between user devices (U_n) and a fog node (F_m) is subject to multi-path fading and path loss. The probability of path loss PL(d_{uf}) considers both line-of-sight (LoS) and non-line-of-sight (NLoS) conditions. The path loss PL between U_n and the neighboring fog node F_m is adopted from [28,41]. To aid offloading decisions, the channel status H_c is classified as busy (1) or idle (0). Therefore, the path loss (\eta) is presented in Equations (6) and (7):
PL(d_{uf})^t = \sigma(d_{uf}) \times PL_{LoS}(d_{uf}) + \big(1 - \sigma(d_{uf})\big) \times PL_{NLoS}(d_{uf})
where \sigma(d_{uf}), PL_{LoS}(d_{uf}), and PL_{NLoS}(d_{uf}) represent the LoS probability, the LoS path loss \eta_1, and the NLoS path loss \eta_2, respectively.
PL = 20\log(d_{km}) + 20\log(B_{kHz}) + 32.45 \;\; (\mathrm{dB})
The task transmission delay is determined by the channel bandwidth, the size of the offloaded task K_n, the transmission power of the user device's channel, and the distance to F_m.
T_{off}^{k,t} = \frac{K_n^t \, d_{m,f}}{\tau_{tr}^t}; \qquad \tau_{tr}^t = \beta_{u,f}^t \cdot \log\!\left(1 + \frac{h_{u,f}^t \cdot \zeta_{u,f}^t}{\sigma}\right)
We model the channel transmission rate \tau_{tr}^t from node F_m to node F_j according to [42], where \beta_{u,f}^t, h_{u,f}^t, \zeta_{u,f}^t, and \sigma are the bandwidth, the channel power gain, the transmission power, and the standard deviation of the noise power gain, respectively. Therefore, the sum of tasks offloaded to adjacent F_j at transmission time is governed by Equation (8). We assume that the maximum available transmission power at a fog node corresponds to the power required for offloading the maximum number of tasks, as presented in [23]. Accordingly, the node can transmit p_j^t tasks within a single time slot such that the transmission energy consumption does not exceed the maximum permissible value, \zeta_{f,(tr)}^t \le \zeta_{f,(tr)}^{(max)}. The signal-to-noise ratio is SNR = P_{signal}/P_{noise}. The relationship between noise power and channel capacity follows Shannon's law, given in Equation (9), to capture the capacity-dependent behavior of the wireless network.
C = \beta \log_2\big(1 + S/N\big)
where C, \beta, S, and N are the channel capacity, channel bandwidth, average received signal power, and average noise power, respectively.
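The sketch below illustrates the U2F link model: Shannon capacity from Equation (9) and the resulting task transmission delay as data size divided by achievable rate. All numeric parameters (bandwidth, SNR, task size) are assumed example values, not settings from the paper.

```python
# Illustrative sketch of the U2F link model: Shannon capacity (Eq. (9)) and the
# resulting task transmission delay; the numeric values are assumptions, not
# parameters from the paper.
import math

def channel_capacity(bandwidth_hz, signal_power_w, noise_power_w):
    """Shannon capacity C = B * log2(1 + S/N) in bits per second."""
    return bandwidth_hz * math.log2(1.0 + signal_power_w / noise_power_w)

def transmission_delay(task_size_bits, bandwidth_hz, signal_power_w, noise_power_w):
    """Time to push the offloaded task K_n through the wireless channel."""
    rate_bps = channel_capacity(bandwidth_hz, signal_power_w, noise_power_w)
    return task_size_bits / rate_bps

# Example: a 2 MB task over a 5 MHz channel with a 20 dB SNR.
snr_linear = 10 ** (20 / 10)
delay_s = transmission_delay(task_size_bits=2 * 8e6,
                             bandwidth_hz=5e6,
                             signal_power_w=snr_linear,
                             noise_power_w=1.0)
print(f"U2F transmission delay: {delay_s:.3f} s")
```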

3.4.2. F2F Communication

Fog-to-Fog (F2F) communication is triggered when the originating fog node F_m cannot satisfy the quality of service (QoS) demands of a task from user U_n. A direct offloading process then occurs between F_m and neighboring fog nodes F_j to enable parallel execution of the partitioned sub-tasks on F_j. Consequently, the transmission and computation delays within the fog layer are considered in our analysis. The network connecting fog nodes F_m and F_j is modeled as a flat, wired, fiber-optic Ethernet-based infrastructure. The communication cost is therefore evaluated in terms of transmission delay and propagation delay, both of which depend on the distance between the nodes and the speed of light. The propagation delay between F_m and F_j is calculated as P_{m,j} = d_{m,j}/C_l, where d_{m,j} is the distance between F_m and F_j and C_l is the speed of light. For task offloading between fog nodes, we assume a negligible noise ratio.
The energy consumed during the transmission of task portions K_n between fog nodes (F_m to F_j) in partial offloading accounts for both transmission and reception energy, as indicated in Equation (10). This consumption is influenced by the channel transmission power, the distance between the nodes, the size of the transferred data, and the energy model of the communication interface.
E_{mj}^{tr}(t) = E_{mj}^{trans} + E_{mj}^{rec}
where E_{mj}^{trans} = \zeta^{trans} \times T^{trans} and E_{mj}^{rec} = \zeta^{rec} \times T^{rec}, with \zeta and T denoting the power and time consumed during transmitting and receiving data, respectively. The transmission power utilized by fog node F_n is calculated from Equations (11) and (12) as follows:
\zeta_{n,(tr)}^t = \beta_{m,j_i^t}^t \cdot \sigma^2 \, \eta_1 \, d_{m,j_i^t}^{\eta_2} \left( 2^{\frac{k_n \, p_j^t}{T \, \beta_{m,f_n}^t}} - 1 \right)
Moreover, the total data transmission rate d k n for the task K n over a set of multiple orthogonal channels from F m is calculated with Shannon’s capacity formula [43] as follows.
\tau_{tr}^{t,total} = \sum_{t=1}^{T} \sum_{n=1}^{N} \beta_n(t) \log_2\!\left(1 + \frac{\zeta_{n,(tr)}^t \, h_n(t)}{\sigma}\right)
where β n ( t ) , h n ( t ) , ζ n , ( t r ) t , and  σ represent the nth channel bandwidth, channel gain, allocated transmission power, and noise power spectral density, respectively.
The transmission and communication delay between F_m and F_j for each individual sub-task can be calculated with Equation (13):
\tau_{m,j}^o = \frac{k_{n,j} \times \tau_{tr}^{t,total}}{\beta_{m,j}} + P_d
where P_d and \beta_{m,j} are the propagation delay and the achievable data transmission rate between F_m and F_j, respectively. Therefore, the total sub-task transmission latency can be computed with Equation (14) from the individual \tau_{m,j}^o values such that
\tau_{total}^o = \sum_{t=1}^{T} \sum_{j=1}^{N} \tau_{m,j}^o
where N is the number of sub-tasks k_{n,j} offloaded to the adjacent nodes F_j.
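A minimal sketch of the F2F communication cost follows, combining a size-over-rate transmission term with the propagation delay P_{m,j} = d_{m,j}/C_l; this reads Equations (13) and (14) in the standard size/rate form, and all numeric values are illustrative assumptions.

```python
# Minimal sketch of the F2F communication cost: transmission delay plus
# propagation delay P_{m,j} = d_{m,j} / C_l, following the standard
# size-over-rate reading of Eq. (13); values are illustrative assumptions.
SPEED_OF_LIGHT_M_S = 3.0e8

def f2f_subtask_delay(subtask_bits, link_rate_bps, distance_m):
    propagation = distance_m / SPEED_OF_LIGHT_M_S      # P_d = d_{m,j} / C_l
    transmission = subtask_bits / link_rate_bps        # push k_{n,j} over the wire
    return transmission + propagation

def f2f_total_delay(subtask_sizes_bits, link_rates_bps, distances_m):
    """Aggregate of the per-neighbor sub-task transfer delays (Eq. (14))."""
    return sum(f2f_subtask_delay(s, r, d)
               for s, r, d in zip(subtask_sizes_bits, link_rates_bps, distances_m))

# Two sub-tasks sent to two neighbors over 100 Mbps fiber links.
print(f2f_total_delay([4e6, 8e6], [100e6, 100e6], [500.0, 1200.0]))
```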

3.4.3. Local Computations

Let \tau_{F_m}^l, V_k^l, and C_n denote the local computing delay for the task portion of K_n, the task size at F_m, and the number of CPU cores, respectively. The delay for the sum of all tasks executed locally at F_m (offloaded from U_n) can then be expressed by Equation (15):
\tau_{f_m}^l = \sum_{t=1}^{T} \sum_{m=1}^{N} \frac{V_k^l}{C_n} \cdot p_m
where C_n depends on the total task size V_k and on b_u(t), which measures the computational intensity required at a single node to complete its assigned portion. Although energy is not a major constraint in the fog layer, energy consumption is still modeled: Equation (16) measures it per CPU cycle \zeta_n over the computed tasks K_n as follows.
E_{F_m}^l = \sum_{t=1}^{T} \sum_{m=1}^{N} (\zeta_n \times K_n^e)

3.4.4. Offloading Computation Delay

As highlighted before, a fog node F_m accepts full tasks from U_n up to its maximum tolerable capacity and stops accepting when its workload queue is full. F_m then offloads tasks to a cooperating adjacent node F_j when it cannot meet the task's maximum tolerable thresholds. The computation delay for each F_j is computed by Equation (17):
\tau_{c,j}^o = \frac{k_{n,j} \times V_{k_n}^o}{C_j}
where V k n represents the total size of task k n and C j denotes the maximum number of CPU cores available on fog node F j . Following the parallel task execution model presented in [28], we consider scenarios where multiple sub-tasks are executed concurrently across several fog nodes. Therefore, the overall task computation delay is determined by the longest execution time among all its sub-tasks computed by Equation (18).
\tau_{C,max}^o = \max(\tau_{C,j_1}^o, \tau_{C,j_2}^o, \ldots, \tau_{C,j_n}^o)
In this case, Equation (19) gives the total delay \tau_{m,j}^t across F_m and F_j as the sum of the total transmission latency \tau_{total}^o, the task execution latency \tau_{c,max}^o, and the local computation delay \tau_{f_m}^l.
\tau_{m,j}^t = \tau_{total}^o + \tau_{c,max}^o + \tau_{f_m}^l
We note that our work primarily focuses on task offloading between fog nodes, assuming that the user device has already offloaded the task to an initial fog node F_m. We therefore assume that tasks in the fog node queue are handled in a first-come, first-served (FCFS) manner, without accounting for contention dynamics or task prioritization.
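The end-to-end delay of Equations (17)-(19) can be sketched as follows: per-neighbor compute delay, the maximum over parallel sub-tasks, plus the transmission and local-computation terms. The helper names, units, and cycle model are simplifying assumptions, not the paper's exact formulation.

```python
# Illustrative computation of the end-to-end delay of Equations (17)-(19):
# per-neighbor compute delay, the max over parallel sub-tasks, plus local and
# transmission delays. Inputs are assumed values, not the paper's parameters.
def offload_compute_delay(subtask_size, cpu_cores_j, cycles_per_unit=1.0):
    """Eq. (17): delay of one sub-task on neighbor F_j (simplified units)."""
    return subtask_size * cycles_per_unit / cpu_cores_j

def total_delay(local_delay, transmission_delay, subtask_sizes, neighbor_cores):
    # Eq. (18): parallel sub-tasks finish when the slowest one finishes.
    compute_max = max(offload_compute_delay(s, c)
                      for s, c in zip(subtask_sizes, neighbor_cores))
    # Eq. (19): total = transmission + slowest offloaded compute + local compute.
    return transmission_delay + compute_max + local_delay

print(total_delay(local_delay=0.05, transmission_delay=0.12,
                  subtask_sizes=[30.0, 50.0, 20.0],
                  neighbor_cores=[4, 8, 2]))   # max(7.5, 6.25, 10.0) -> 10.17
```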

3.5. Energy Consumption

The energy consumption for each F j can be calculated as follows using Equation (20).
E_{f_j}^o = \zeta_n \times k_{n,j}^e
Whereas, the total task execution energy consumption when tasks are executed in multiple F j s is given by Equation (21):
E_{j,total} = \sum_{t=1}^{T} \sum_{j=1}^{N} E_{f_j}^o
Since the partial offloading model is applied in this work, the total system-wide task computation delay \tau_{total} considers both F_m and F_j, as indicated in Equation (22).
\tau_{total} = \tau_{F_m}^l + \tau_{F_j}^o
Moreover, in the partial task offloading approach, the total energy consumption E_{total} accounts for task execution at the local node F_m and the adjacent nodes F_j, as well as the energy consumed in transmitting task K_n. Therefore, E_{total} is defined in Equation (23) as follows:
E_{m,j}^t = E_{F_m}^l + E_{j,total} + E_{m,j}^{tr}

3.6. Problem Formulation

We now formulate the optimization problem with the aim of minimizing the overall system latency and energy consumption in Equation (24a), thereby improving response time and optimizing energy efficiency as follows:
\min_{p_n} \; \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{N} \left( \omega_t \, \tau_{m,j}^t + \omega_e \, E_{m,j}^t \right)
subject to
C1: \; p_m + \sum_{j=1}^{n} p_j = 1
C2: \; 0 \le p_m, p_{j,1}, \ldots, p_{j,n} \le 1, \quad \forall m, j \in F
C3: \; \sum_{t=1}^{T} \sum_{n=1}^{N} p_j \times k_{n,j} \le C_j, \quad \forall j \in F, \; k \in K
C4: \; \tau_{total} \le \tau_{max}^{K_n}, \quad \forall n \in N
C5: \; E_{m,j}^{exc} \times K_n \le E_{threshold}, \quad \forall n \in N
C6: \; \sum_{k=1}^{N} \sum_{p=0}^{1} K_n \le K_{n,max}, \quad \forall m \in M
where N, T, \omega_t, and \omega_e are the number of fog nodes, the number of time steps, and the weight coefficients of computation delay and energy consumption for both the local and offloading cases, respectively. The values of \omega_t and \omega_e lie in the interval [0, 1] and satisfy \omega_t + \omega_e = 1. The objective function is formulated to ensure that a user device offloads a task to a specific fog node in a particular time step t. The constraints of this optimization problem are as follows:
  • C1 (24b) ensures that the total task execution ratio between F_m and the multiple F_j nodes does not exceed the maximum (i.e., it sums to 1).
  • C2 (24c) restricts each offloading probability ratio to the range 0 to 1.
  • C3 (24d) ensures that the size of the task offloaded to a fog node does not exceed that node's maximum capacity.
  • C4 (24e) imposes the latency constraint with respect to the maximum tolerable delay threshold \tau_{max} of offloaded tasks, in terms of transmission and execution deadlines.
  • C5 (24f) verifies that the available energy in the fog node remains above a specified threshold.
  • C6 (24g) ensures that the sum of the offloaded tasks does not exceed the maximum task load.
This problem is in the class of NP-hard problems [34]. Therefore, we will discuss our proposed approach in Section 4.
Historical network data are used to train the DL-based model, and replay buffer (D) data feed the actor-critic RL network to predict resources such as bandwidth, channel condition, task queue, and memory for future states. This MADRL model treats resource availability and utilization as a complex optimization problem with dependencies and interactions among multiple variables. The dynamic network and resource context makes it difficult for the agent to converge to optimal conditions. Following [44], in F2F cooperation the fog nodes share their updated state and maintain a dynamically updated list of the best nodes, as framed in Table 4.
The agent partially observes local states such as workload, available resources, energy status, latency to other fog nodes, QoS metrics, task nature, node status (active, idle, or sleep), and performance metrics from historical data from each fog node. According to [28], this time-varying context is reflected in the topology designed in Figure 2 for interactions of the user device, and the status of the fog node can be characterized in terms of distance, resources, and task computation ability. Therefore, MAFCPTORA is designed to maximize the overall return accumulated over time, rather than the immediate reward associated with a particular decision.
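To make the optimization target concrete, the following hedged sketch evaluates the per-step weighted cost of Equation (24a) for a candidate offloading decision and checks a subset of the constraints (C1/C2, C4, C5); the thresholds, weights, and helper names are assumptions for illustration.

```python
# Hedged sketch of the per-step objective in Eq. (24a): a weighted sum of
# normalized delay and energy, with a couple of the constraints checked.
# Thresholds and weights here are assumptions for illustration only.
def step_cost(delay, energy, w_t=0.5, w_e=0.5):
    assert abs(w_t + w_e - 1.0) < 1e-9           # ω_t + ω_e = 1
    return w_t * delay + w_e * energy

def feasible(offload_ratios, total_delay, deadline, node_energy, energy_threshold):
    c1_c2 = (abs(sum(offload_ratios) - 1.0) < 1e-9
             and all(0 <= p <= 1 for p in offload_ratios))   # C1, C2
    c4 = total_delay <= deadline                  # C4: latency within τ_max
    c5 = node_energy >= energy_threshold          # C5: enough residual energy
    return c1_c2 and c4 and c5

ratios = [0.5, 0.3, 0.2]                          # local share + two neighbors
if feasible(ratios, total_delay=0.08, deadline=0.2,
            node_energy=5.0, energy_threshold=1.0):
    print("cost:", step_cost(delay=0.08, energy=0.76, w_t=0.7, w_e=0.3))
```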

3.7. TD3 Algorithm Overview

TD3 is a DRL algorithm designed for tasks with continuous action spaces. It is based on the Deep Deterministic Policy Gradient (DDPG) algorithm and incorporates improvements such as clipped double-Q learning and target policy smoothing to enhance stability, increase learning efficiency, and reduce the overestimation bias of DDPG [18]. In clipped double-Q learning, two value functions estimate state-action values, and the minimum of the two estimates is used as the target value during updates; clipping the target to a specific range helps mitigate the overestimation problems frequently seen with single Q-value estimators. For the offloading action, target policy smoothing forms the Q-learning target from the target policy \mu_{\theta_{targ}} with clipped noise, so that the resulting action satisfies a_{low} \le a \le a_{high}. Therefore, the target offloading actions are defined as
a'(s') = \mathrm{clip}\big(\mu_{\theta_{targ}}(s') + \mathrm{clip}(\epsilon, -c, c), \, a_{low}, \, a_{high}\big)
Equation (25) adds exploration noise to adapt the deterministic policy into a stochastic behavior during the critic update, which helps stabilize learning. A TD3 agent is an actor-critic DRL agent that aims to find the optimal policy to maximize the expected cumulative rewards over the long term. The actor network determines the action, such as the offloading decision or ratio distribution. In contrast, the critic network evaluates the value (expected reward) of taking that action in a given state.
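A minimal NumPy sketch of the target policy smoothing in Equation (25) is shown below; the stand-in target policy mu_targ and the noise parameters are illustrative assumptions, not the trained network from the paper.

```python
# Minimal NumPy sketch of Eq. (25): target policy smoothing, where clipped
# Gaussian noise is added to the target policy's action and the result is
# clipped back into the valid action range. mu_targ is a stand-in policy.
import numpy as np

def smoothed_target_action(mu_targ, next_state, noise_std=0.2, noise_clip=0.5,
                           a_low=0.0, a_high=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    mu = mu_targ(next_state)                                  # μ_targ(s')
    eps = np.clip(rng.normal(0.0, noise_std, size=mu.shape),
                  -noise_clip, noise_clip)                    # clip(ε, -c, c)
    return np.clip(mu + eps, a_low, a_high)                   # clip to [a_low, a_high]

# Stand-in deterministic target policy: squash the state into offloading ratios.
mu_targ = lambda s: 1.0 / (1.0 + np.exp(-s))
print(smoothed_target_action(mu_targ, np.array([0.3, -1.2, 0.8])))
```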

4. A Decentralized Multi-Agent TD3-Based Task Offloading Architecture

The task offloading problem in Equation (24a) is fundamentally modeled as a Markov decision process (MDP) as presented in [41], implying that the next state depends only on the current state and the chosen action. In more realistic scenarios where the agent only has a limited view, the problem is framed as a Partially Observable MDP (POMDP). This context affects the model’s decisions and long-term performance because it focuses on the immediate maximum value. To address this scenario, the decentralized partially observable MDPs (Dec-POMDP) approach was applied to a multi-agent setting as depicted in Figure 4. For this, a multi-agent system task offloading decision-making process involves independent TD3 agents deployed in each fog node. These agents observe the environment and make their decisions based on their observations. The environment’s state includes dynamic factors relevant to decision-making.
Each agent typically only partially observes local states, which include information like workload, available resources, energy status, latency to other fog nodes, QoS metrics, task nature, node status (active, idle, or sleep), performance metrics from historical data, channel condition, location, and distance between fog nodes.
In cooperative multi-agent environments, policies aim to optimize a joint objective, often framed under Dec-POMDPs. These systems pose challenges such as an exponentially growing joint state-action space as the number of agents increases and non-stationarity due to dynamic network and resource availability [45]. These dynamics are formally framed with the MDP tuples ( s , a , s , r , d ) where:
  • State Space (s): the state of the fog system F at time step t, observed by the agent from the environment. It includes the task node F_m, the resource status, and the channel condition, modeled as s_t = \{F_t, RS_t\}, s_t \in S, where F_t represents the set of fog nodes and RS_t the set of resources at time step t.
  • Action Space (a): the MATD3 agents observe the state and decide an action a_t based on partial offloading, under partial observability of the system state. We consider a continuous action space in which some tasks are executed at F_m and the remainder are offloaded to an adjacent fog node F_j (F_m \neq F_j); a_t \in A is the action at time step t. The set of actions at each time step t is defined as \{(p_0^0, p_0^1, \ldots, p_0^R), \ldots, (p_n^0, p_n^1, \ldots, p_n^R)\}, where p_n^R represents the ratio of task n allocated to available resource R.
  • Reward (r): r(s, a, s') is the reward obtained when action a is taken in state s and the next state is s'. The reward is governed by the state transition function P(s' | s, a), the probability that state s' occurs after action a is taken in state s.
  • Done (d): indicates whether the TD3 agent in the fog environment has reached a terminal state. In this environment, termination occurs after one hundred offloading decisions or when an offloading action causes the F_n resources to enter a busy state.
A multi-user (U_n) and multi-fog (F_{m,j}) network environment is considered to simulate decentralized training, and the agents compute tasks in a distributed manner. In the MADRL setting, each agent explores the environment and stores its experience in the replay buffer D as MDP tuples (s_t, a_t, r_t, s_{t+1}) to train the target network. The current state of the agent is updated with real-time data observed from the environment. Each agent F_{m,j} can then make independent task offloading decisions that contribute to the collective result. Depending on the reward design, policies can be trained using either local or collective feedback. In practice, the agent learns the task offloading policy from the network and observes real-time task and resource information to make online decisions. Decentralized training and execution are used, so each agent updates its actor based on its own observations while the joint actions of the agents improve the collective reward.
The agent’s primary goal is to maximize the overall return accumulated over time, rather than focusing solely on immediate rewards. The collective reward function is often structured to penalize (use negative values for) metrics like latency and energy consumption. In this case, to deal with the instability of the data set, the Min-Max normalization is applied to stabilize reward variation as illustrated by Equation (26).
x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}
Here, x_{norm} is the normalized value, x is the original value, and x_{max} and x_{min} are the maximum and minimum values in the replay buffer data set D, respectively. The TD3 agent is an off-policy algorithm with a deterministic policy \mu(s; \theta) and twin Q-value functions Q_1(s, a; w_1) and Q_2(s, a; w_2) to estimate the expected return. The goal of the reward design is to guide agents toward minimizing normalized latency and energy consumption during task offloading. This is represented by Equation (27) as follows:
r_{m,j} = -(\omega_t \cdot T_{normal} + \omega_e \cdot E_{normal}) + r_{penalty}
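The reward construction of Equations (26) and (27) can be sketched as follows: min-max normalization of latency and energy against replay-buffer statistics, followed by the negatively weighted sum plus a penalty term. The deadline-miss penalty and the sample values are assumptions for illustration.

```python
# Sketch of the reward construction: min-max normalization (Eq. (26)) of the
# raw latency/energy samples in the replay buffer, followed by the weighted,
# penalized reward of Eq. (27). The penalty value is an assumption.
import numpy as np

def min_max_normalize(x, buffer_values):
    x_min, x_max = np.min(buffer_values), np.max(buffer_values)
    return (x - x_min) / (x_max - x_min + 1e-8)

def shaped_reward(latency, energy, latency_buf, energy_buf,
                  w_t=0.7, w_e=0.3, deadline=0.2, miss_penalty=-1.0):
    t_norm = min_max_normalize(latency, latency_buf)
    e_norm = min_max_normalize(energy, energy_buf)
    penalty = miss_penalty if latency > deadline else 0.0
    return -(w_t * t_norm + w_e * e_norm) + penalty      # Eq. (27)

lat_buf, en_buf = np.array([0.05, 0.1, 0.3]), np.array([0.5, 0.9, 1.4])
print(shaped_reward(latency=0.08, energy=0.76,
                    latency_buf=lat_buf, energy_buf=en_buf))
```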
In a Markov decision process (MDP) setting, Equation (28) is used to calculate the cumulative reward at time ( t ) as:
r_t = \sum_{k=0}^{N} \gamma^k \, r_{t+k+1}
where ( γ [ 0 , 1 ] ) is the discount factor that determines the influence of future rewards. Due to the stochastic nature of the environment, a shaped reward is computed in Equation (29) using an exponential moving average to ensure stable learning and consistent offloading decisions:
\tilde{r}_t = (1 - \alpha) \cdot \tilde{r}_{t-1} + \alpha \cdot r_t
In decentralized multi-agent settings, this shaping is done separately for each fog agent i, allowing agents to adapt independently to local dynamics. We adapted MADRL scenarios from [23], where agents are designed to cooperate to achieve a common goal, typically by sharing a single reward function that reflects the performance of the entire system. This means all agents work towards maximizing the same objective of minimizing total system delay and energy consumption as a set common objective for multiple agents. Agents may share their local rewards as feedback [35]. This approach reduces communication overhead compared to sharing observations and actions. The learning mechanism in such environments relies on DRL principles. The Bellman Equation (30) helps approximate the action-value function Q ( s , a ) during value iteration as follows:
Q(s, a) \leftarrow r + \gamma \cdot \max_{a'} Q(s', a')
In our case, we utilize the TD3-based MAFCPTORA algorithm to identify the optimal deterministic policy from stochastic environments. The TD3 method is particularly suited for continuous action spaces [18], where the actor network learns a policy μ ( s ; θ ) and the critic evaluates its quality using twin Q-value networks:
\mu : S \rightarrow A

5. Algorithm Design

To address the aforementioned task offloading problem, we employ a decentralized, Multi-agent, Fully Cooperative Partial Task Offloading and Resource Allocation (MAFCPTORA) algorithm based on the actor-critic network. The MAFCPTORA algorithm is designed in two steps: task offloading decision and resource allocation.

Proposed MAFCPTORA-Algorithm

We designed a novel Multi-agent Fully Cooperative Partial Task Offloading and Resource Allocation (MAFCPTORA) algorithm based on offloading probabilities and the TD3 approach. A decentralized TD3 agent is deployed at each F_n and is responsible for its own independent decisions in finding the optimal offloading policy. The TD3 agent is an off-policy algorithm with a deterministic policy \mu(s; \theta) and twin Q-value functions Q_1(s, a; w_1) and Q_2(s, a; w_2) for estimating the expected return. DDPG was used for this task offloading problem in [13]; however, overestimated Q-values can break the policy because it exploits errors in the Q-function. We therefore employ TD3, which trains two Q-value functions, Q_{\phi_1} and Q_{\phi_2}, by minimizing the Mean Squared Bellman Error (MSBE) to ensure stable value estimation, Equation (31). In clipped double-Q learning, both Q-functions regress to a single target computed from the smaller of the two target Q-values:
y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i,targ}}\big(s', a(s') + \epsilon\big)
In this case, both Q-functions learn by regressing to the target as Equations (32) and (33):
L(\phi_1, D) = \mathbb{E}_{(s,a,r,s',d) \sim D}\Big[\big(Q_{\phi_1}(s, a) - y(r, s', d)\big)^2\Big]
L(\phi_2, D) = \mathbb{E}_{(s,a,r,s',d) \sim D}\Big[\big(Q_{\phi_2}(s, a) - y(r, s', d)\big)^2\Big]
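The following NumPy sketch illustrates the clipped double-Q target of Equation (31) and the critic regression losses of Equations (32) and (33); the toy linear critics and random batch are stand-ins, not the paper's networks.

```python
# NumPy sketch of clipped double-Q learning (Eqs. (31)-(33)): the TD target uses
# the smaller of the two target critics, and each critic regresses to that
# single target with a mean-squared error. The critics here are stand-in
# functions, not the paper's networks.
import numpy as np

def td_target(r, s_next, done, a_next, q1_targ, q2_targ, gamma=0.99):
    q_min = np.minimum(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min               # Eq. (31)

def critic_loss(q, s, a, y):
    return np.mean((q(s, a) - y) ** 2)                    # Eqs. (32)/(33)

# Toy linear critics over concatenated (state, action) features.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=4), rng.normal(size=4)
q1 = lambda s, a: np.concatenate([s, a], axis=-1) @ w1
q2 = lambda s, a: np.concatenate([s, a], axis=-1) @ w2

s, a = rng.normal(size=(8, 2)), rng.uniform(size=(8, 2))
s2, a2 = rng.normal(size=(8, 2)), rng.uniform(size=(8, 2))
r, d = rng.uniform(size=8), np.zeros(8)
y = td_target(r, s2, d, a2, q1, q2)
print(critic_loss(q1, s, a, y), critic_loss(q2, s, a, y))
```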
The policy parameters \theta of \mu_\theta(s) are then updated by gradient ascent on the performance measure Q_{\phi_1}. The MAFCPTORA algorithm is decomposed into two sequential stages, outlined in Algorithm 1 and Algorithm 2, which correspond to task offloading and resource allocation R_{f,k}(t), respectively.
Algorithm 1 Multi-agent Cooperative Partial Task Offloading
1: Input: initial policy parameter \theta, Q-function parameters \phi_1, \phi_2, number of fog nodes F_n, learning rate \alpha, discount factor \gamma, number of training episodes e, agent id i, time \tau, replay buffer size D, exploration rate \epsilon
2: Initialize: \theta_{targ} \leftarrow \theta, \phi_{targ,1} \leftarrow \phi_1, \phi_{targ,2} \leftarrow \phi_2
3: Output: Optimal policy \pi^*
4: for each F_m \in \{F_1, \ldots, F_n\} do
5:     Initialize F_m, \mu_{\theta_m}, Q_{\phi_1,m}, Q_{\phi_2,m}, and targets \hat{\mu}_{\theta_m}, \hat{Q}_{\phi_1,m}, \hat{Q}_{\phi_2,m}
6:     Observe initial state at F_m: S_m(0)
7: end for
8: for each decision step t = 0, \ldots, \tau - 1 do
9:     for all fog nodes F_m in parallel do
10:        Generate exploration noise: a_m(t) = \mu_{\theta_m}(S_m(t)) + N(0, \sigma)
11:        Clip action: a_m(t) \leftarrow \mathrm{clip}(a_m(t), a_{low}, a_{high})
12:        With probability P_m, select random a_m(t) (exploration)
13:        Compute offloading ratio: x_{m,j} \sim \pi(x_{m,j} | s_m(t))
14:        Allocate task portions K \cdot x_{m,j} to neighbor fog nodes F_j
15:        Execute offloaded tasks at F_j and the remaining x_m = 1 - \sum_j x_{m,j} locally
16:        Observe s_{m,j}(t+1) and compute reward using Equation (29)
17:        Store transition (s_{m,j}(t), a_{m,j}(t), r_{m,j}(t), s_{m,j}(t+1), d) in D
18:    end for
19:    for all fog nodes F_j in parallel do
20:        if enough samples in D then
21:            Sample batch B = \{(s, a, r, s', d)\} from D
22:            Compute target actions: a' = \mu_{\theta_{targ}}(s') + N(0, \sigma)
23:            Compute target Q-value using Equation (31)
24:            Update critics \phi_1, \phi_2 via gradient descent: \phi_i \leftarrow \phi_i - \alpha \nabla_{\phi_i} \mathbb{E}_B[(Q_{\phi_i}(s, a) - y)^2]
25:            Update actor \theta via deterministic policy gradient ascent using Equation (34)
26:            Update target networks using Equation (35)
27:        end if
28:    end for
29: end for
The deterministic policy gradient in Equation (34) updates the actor, and the target networks \phi_{targ,i} and \theta_{targ} are updated with Equation (35).
\theta \leftarrow \theta + \alpha \nabla_\theta \, \mathbb{E}_{s \sim B}\big[Q_{\phi_1}(s, \mu_\theta(s))\big]
\phi_{targ,i} \leftarrow \rho \, \phi_{targ,i} + (1 - \rho) \, \phi_i, \qquad \theta_{targ} \leftarrow \rho \, \theta_{targ} + (1 - \rho) \, \theta
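Equation (35) is the familiar Polyak (soft) target update; a minimal sketch is shown below, with toy parameter arrays standing in for the actor and critic weights.

```python
# Minimal sketch of the Polyak (soft) target-network update in Eq. (35),
# applied element-wise to parameter arrays; rho is the smoothing factor.
import numpy as np

def soft_update(target_params, online_params, rho=0.995):
    """phi_targ <- rho * phi_targ + (1 - rho) * phi, per parameter tensor."""
    return [rho * t + (1.0 - rho) * p for t, p in zip(target_params, online_params)]

theta = [np.ones((4, 4)), np.zeros(4)]          # online actor parameters (toy)
theta_targ = [np.zeros((4, 4)), np.ones(4)]     # target actor parameters (toy)
theta_targ = soft_update(theta_targ, theta)
print(theta_targ[0][0, 0], theta_targ[1][0])    # 0.005 0.995
```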
In the TD3 algorithm, a deterministic policy μ ( s ; θ ) is trained to map the observed state s ( t ) to an action a i ( t ) . The policy is learned to maximize the Q-value estimated by the critic network Q ϕ 1 guided by Equation (36):
\max_\theta \, \mathbb{E}_{s \sim D}\big[Q_{\phi_1}(s, \mu_\theta(s))\big]
In this work, maximizing the cumulative reward is the objective of each agent in this decentralized multi-agent system. Furthermore, the cooperation of each TD3 agent (fog node) with the others follows its respective trained policy, denoted as μ = ( μ 1 , μ 2 , , μ n ) . In addition, the expectation ( E ) is defined with the sum over time and the discounted reward by γ at each time step t. Then, the TD3 agent F m , j ’s value function under policy μ i at state s is defined by Equation (37):
V^{\mu_i}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_i(s_t, a_t, s_{t+1})\right],
where R i ( s t , a t , s t + 1 ) is the reward function at time step t.
The value of agent F m , j in this multi-agent system depends on the collective policy μ rather than just its own policy μ i . The relationship is given by Equations (38) and (39):
V_i^{\mu_i^*, \mu_{-i}^*}(s) \ge V_i^{\mu_i, \mu_{-i}^*}(s), \quad \forall s \in S \text{ and } \forall \mu_i,
where \mu_{-i}^* represents the set of policies of all other agents except agent i.
Additionally, the extended ϵ -Nash equilibrium is defined by the following general formula:
V_i^{\mu_i^*, \mu_{-i}^*}(s) \ge V_i^{\mu_i, \mu_{-i}^*}(s) - \epsilon, \quad \forall s \in S \text{ and } \forall \mu_i.
Here, \mu_i^* is the best response of agent F_{m,j} to the policies \mu_{-i}^* of the other agents.
Algorithm 2 Multi-agent Fully Cooperative Resource Allocation
1: Input: Trained actor networks \mu_{\theta_i} for all agents F_{m,j}
2: Initialize task resource requirements: V_k, \theta_t, H, and F_{m,j} resources (C_f, memory, \beta)
3: Initialize \mu_{\theta_i} and Q_{\phi_i} for each agent F_{m,j} with random weights \theta_i, \phi_i
4: Initialize target networks \pi_{\theta'_i} and Q_{\phi'_i} with weights \theta'_i \leftarrow \theta_i, \phi'_i \leftarrow \phi_i
5: for epoch e = 0, 1, \ldots, E - 1 do
6:     for each decision step t = 0, 1, \ldots, \tau - 1 do
7:         Parallelize the following for each agent F_{m,j}:
8:         for each F_{m,j} in parallel do
9:             Use the trained actor network to decide resource allocation by Equation (40)
10:            Allocate C_f, memory, and \beta for task processing based on a_i(t)
11:            Monitor latency, energy consumption, and resource utilization by Equations (24d)-(24f)
12:        end for
13:    end for
14:    Compute feedback metrics using Equation (28)
15:    Update task states for the next decision step using Equation (31)
16: end for
In this TD3-based task offloading, Gaussian noise is added to the deterministic policy to encourage exploration. The action selection formula uses policy noise for exploration:
a_i(t) = \mu_{\theta_i}(S_i(t)) + N(0, \sigma)
  • a i ( t ) : The action chosen by agent F m , j at time t.
  • μ θ i ( S i ( t ) ) : The output of the actor network parameterized by θ i , given the state S i ( t ) .
  • N(0, \sigma): Gaussian noise with mean 0 and standard deviation \sigma, defined in Equation (41).
N(0, \sigma) \sim \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)
  • x: A random variable sampled from the distribution.
  • σ : The standard deviation controls the magnitude of the exploration.
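The sketch below illustrates the exploratory action selection of Equation (40): Gaussian noise N(0, σ) is added to the actor's deterministic output and the noisy ratios are renormalized into a valid offloading distribution. The renormalization step and the stand-in actor are illustrative choices, not specified by the paper.

```python
# Sketch of exploratory action selection (Eq. (40)): Gaussian noise N(0, σ) is
# added to the actor's deterministic output, and the noisy ratios are then
# renormalized so they remain a valid offloading distribution. The
# renormalization is an illustrative choice, not specified by the paper.
import numpy as np

def select_action(actor, state, sigma=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    mu = actor(state)                                  # deterministic action μ_θ(s)
    a = mu + rng.normal(0.0, sigma, size=mu.shape)     # add N(0, σ) exploration noise
    a = np.clip(a, 1e-6, None)                         # keep ratios positive
    return a / a.sum()                                 # renormalize to sum to 1

# Stand-in actor producing raw offloading preferences for [local, F_1, F_2].
actor = lambda s: np.array([0.6, 0.25, 0.15])
print(select_action(actor, state=np.zeros(3)))
```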
The addition of Gaussian noise ensures that the agent explores actions around the deterministic policy \mu_{\theta_i} and avoids premature convergence to suboptimal solutions. Moreover, the critic Q_{\phi_i} is updated by minimizing the loss function:
L(\phi_i) = \mathbb{E}\big[(Q_{\phi_i}(S, A) - y)^2\big]
where y = r_i + \gamma \, Q_{\hat{\phi}_i}(S', A')
In this case, L ( ϕ i ) presents the loss function for the critic network ( Q ϕ i ) parameterized by ϕ . The loss measures the mean squared error (MSE) between the predicted Q value and the target Q value y. The target Q-value y combines the immediate reward r i and the discounted future value of the next state-action pair, estimated using the target critic network Q ϕ ^ i . We use this MSE update to train the critic to better predict the expected cumulative reward for each pair of state actions. The smallest mean square error (MSE) between the predicted reward of the critic model and the target reward is then calculated, and this error is used to update the twin critic models using delayed backpropagation. Minimizing this loss improves the accuracy of value estimates, which aids the actor in policy optimization.
In Algorithms 1 and 2, we update the actor policy at each time step t according to the delayed policy update, estimating the Q-value of the state-action pair using Equations (34) and (35). The computational complexity of the TD3-based MAFCPTORA algorithm is influenced by the number of agents (N), the state and action dimensions (d_s and d_a), the batch size (B), and the hidden layer width (H). For each agent, the actor and critic networks contribute a forward-pass complexity of O(H(d_s + d_a) + H^2). During training, the critic updates dominate the time complexity with O(2B \cdot (H(d_s + d_a) + H^2)) per agent. Including the delayed actor updates, the total training step per agent scales as O(B(d_s + d_a + H)H). Hence, for N agents:
O(N \cdot B \cdot (d_s + d_a + H) \cdot H)
Dynamic task partitioning involves computing and normalizing task distribution vectors across n neighbors, leading to a per-agent cost of O ( n ) and a total of O ( N · n ) per time step. In decentralized environments, communication overhead arises due to state sharing among neighbors, costing O ( N · n · d s ) per step. MATD3 demonstrates faster convergence compared to benchmark solutions, including MAIDDPG, MASAC, and MAPPO. Therefore, our MAFCPTORA algorithm maintains scalable performance with linear growth in agents and manageable quadratic growth in network size.

6. Performance Evaluation

In this section, we evaluate the performance of the proposed approach. We also compared it with several existing approaches. Specifically, we used the MAIDDPG, MATD3, MASAC, and MAPPO DRL baseline algorithms. These algorithms work for decentralized training and execution in a continuous action space based on probability. This implementation represents the state as the task state (task size), resource status, channel conditions of connected nodes (allocated or free), and neighboring nodes. These algorithms follow the actor-critic network, where the actor-critic network trains using exploration and exploitation, updating the network based on the rewards received. Meanwhile, the critic observes the global states and actions of all agents F m , j and evaluates the combined actions of the agents. The assumption is that each user device may connect to multiple fog nodes and select multiple nodes to offload its sub-task at a particular time. This selection process is guided by the collective reward function presented in Equation (24a) as the task completion time against the delay threshold.

6.1. Simulation Setting

In this section, the experimental setting adopted for the simulation is described. We consider a multi-user (U_n) and multi-fog-node (F_n) environment. Each U_n is equipped with a single-core CPU with a frequency between 1 GHz and 2 GHz, while each F_{m,j} features a multi-core CPU with 2 to 8 cores and frequencies ranging from 1.6 GHz to 3 GHz [46]. The computing and communication capabilities of the F_{m,j} nodes are assumed to be heterogeneous. The bandwidth between U_n and F_m ranges from 250 Kbps up to 54 Mbps, depending on the weight of the task [44]. The data transmission rate between the F_{m,j} nodes ranges up to 100 Mbps [44]. The offloading decision is affected by the task arrival rate \lambda (packets per second), as indicated in [18], which measures the workload intensity of F_{m,j}. The task execution rate \mu at F_{m,j} is measured as the total number of tasks executed per episode. A custom fog environment is designed based on OpenAI Gym [47] standards.
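A skeleton of such a custom environment, following the classic OpenAI Gym interface, is sketched below. The observation/action dimensions, the placeholder dynamics, and the reward weights are assumptions for illustration and do not reproduce the paper's exact configuration.

```python
# Skeleton of a custom fog-offloading environment following the classic OpenAI
# Gym interface mentioned in the text. State/action dimensions, ranges, and the
# reward shaping are placeholders, not the paper's exact configuration.
import numpy as np
import gym
from gym import spaces

class FogOffloadEnv(gym.Env):
    """Each step: choose offloading ratios over [local, neighbor_1, ..., neighbor_J]."""

    def __init__(self, num_neighbors=3, episode_len=100):
        super().__init__()
        self.episode_len = episode_len
        # Observation: task size, queue length, and per-neighbor free capacity.
        self.observation_space = spaces.Box(0.0, 1.0, shape=(2 + num_neighbors,), dtype=np.float32)
        # Action: continuous offloading ratios (renormalized inside step()).
        self.action_space = spaces.Box(0.0, 1.0, shape=(1 + num_neighbors,), dtype=np.float32)

    def reset(self):
        self.t = 0
        self.state = self.observation_space.sample()
        return self.state

    def step(self, action):
        ratios = np.clip(action, 1e-6, None)
        ratios = ratios / ratios.sum()
        # Placeholder dynamics: latency/energy grow with the locally kept share.
        latency = 0.1 + 0.5 * ratios[0] * self.state[0]
        energy = 0.5 + 0.8 * ratios[0]
        reward = -(0.7 * latency + 0.3 * energy)
        self.t += 1
        self.state = self.observation_space.sample()
        done = self.t >= self.episode_len
        return self.state, float(reward), done, {}

env = FogOffloadEnv()
obs = env.reset()
obs, r, done, info = env.step(env.action_space.sample())
```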

6.2. Result and Performance Analysis

The simulation results are structured around the key research objectives: improving offloading decisions and minimizing overall system latency and energy consumption in stochastic horizontal fog environments. To compare the algorithms, we train each one for 10,000 episodes using the settings in Table 5. Each model is able to cope with this stochastic network and shows good learning progress, as depicted in Figure 5. From these perspectives, MATD3 outperforms the rest, returning a stable, progressively improving long-term cumulative reward as formulated in Equation (24a). The overall system-wide evaluation metrics are defined as the average reward, latency, and energy per 100 episodes to obtain a more stable and reliable performance estimate. Therefore, the average reward, average latency, and average energy consumption per episode are recorded every 100 steps.
Figure 5a presents the evaluation reward comparison for the MAPPO, MASAC, MAIDDPG, and MATD3 models. Figure 5b presents the latency evaluation, where TD3 quickly converges and maintains a consistently low response time. In contrast, MASAC struggles with higher latency, exhibiting a significantly longer response time compared to the other algorithms. Figure 5c presents the evaluation of energy consumption per episode, highlighting TD3’s efficiency with notably low energy usage. All evaluation results demonstrate progressive learning for each model competitively. According to Figure 5, the MATD3 algorithm delivers a consistent performance across several metrics and manages the trade-off between offloading decisions, response time, and energy consumption quite well.
An ablation study is conducted on MATD3, our main evaluation target, to examine its sensitivity to changes in the weighting parameter ω. The comparative methods are used as baselines with their original configurations and without extensive parameter tuning. Figure 5d shows the effect of the weight coefficients on average reward, latency, and energy consumption. We start from base weight coefficients ω_t = ω_e = 0.5, assigning equal priority to latency and energy consumption in the joint optimization objective; this base setting yields a balanced trade-off. We then reduce ω_e to prioritize minimizing task execution latency over energy consumption. The ablation study indicates that prioritizing latency significantly reduces average latency while energy consumption increases, but it remains within an acceptable range. The best trade-off is observed at ω_t = 0.7, achieving both low latency and minimum energy consumption. Moreover, the average reward exhibits an overall increasing trend; however, a slight decline is observed at the final stage, likely due to the influence of the minimized latency term.
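A minimal sketch of such a weight ablation is shown below, assuming a simplified weighted latency/energy cost in place of the full reward in Equation (24a); evaluate_policy is a hypothetical placeholder standing in for rolling out the trained MATD3 policy and recording measured latency and energy.

# Illustrative weight-ablation loop; the objective and evaluate_policy
# are simplified assumptions, not the paper's exact formulation.
import numpy as np

def weighted_cost(latency, energy, w_t, w_e):
    return w_t * latency + w_e * energy

def evaluate_policy(w_t, w_e):
    # Placeholder: in practice this would run the trained MATD3 agents
    # in the fog environment and return the measured averages.
    rng = np.random.default_rng(0)
    latency = 0.20 - 0.15 * w_t + rng.normal(0, 0.01)
    energy = 0.70 + 0.30 * w_t + rng.normal(0, 0.02)
    return latency, energy

for w_t in (0.5, 0.6, 0.7, 0.8):
    w_e = 1.0 - w_t
    lat, en = evaluate_policy(w_t, w_e)
    print(f"w_t={w_t:.1f}  latency={lat:.3f}  energy={en:.3f}  "
          f"cost={weighted_cost(lat, en, w_t, w_e):.3f}")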
Figure 6 illustrates the performance of MATD3 on the individual metrics: average reward, latency, and energy consumption evaluated per episode. All three curves converge stably while showing progressive learning.
Figure 6a shows the average reward, which remains stable in a range between 0.32 and 0.36 across the episodes. Figure 6b shows promising latency, mostly ranging between 0.08 s and 0.18 s, indicating fast response times. Figure 6c shows energy consumption between 0.6 and 1.4, indicating a tolerable balance between performance and efficiency.
The evaluation findings for MAPPO, MASAC, MAIDDPG, and MATD3 are summarized in Table 6 in terms of average reward, latency, and energy consumption. Among the compared models, MATD3 shows superior performance on all three metrics. It achieves the highest average reward (0.36 ± 0.01), indicating more effective task offloading decisions. It also reduces the average latency to 0.08 ± 0.01, considerably lower than the other three models, suggesting faster task execution. Its energy consumption is likewise the lowest at 0.76 ± 0.14, showcasing the model's efficiency in resource usage. In contrast, MAPPO and MASAC achieve relatively competitive rewards, 0.34 ± 0.01 and 0.35 ± 0.01, respectively, despite higher latency and energy consumption. The highest average latency is observed for MASAC at 0.20 ± 0.03, indicating the slowest response among the models, while MAIDDPG exhibits the highest average energy consumption at 0.999 ± 0.187. The convergence of all models is stable during training, with low standard deviations across metrics, affirming the reliability of the learned policies. These results further validate the effectiveness of the proposed approach, in which parallel F2F task execution enables the best-performing model, demonstrating its suitability for real-time, distributed applications.
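The mean ± standard deviation entries reported in Table 6 can be produced from per-episode evaluation logs as in the sketch below; the logs dictionary here contains synthetic values purely for illustration.

# Sketch of how Table 6 "mean ± std" entries could be computed from
# per-episode evaluation logs; the logs below are synthetic placeholders.
import numpy as np

def summarize(values):
    values = np.asarray(values, dtype=float)
    return f"{values.mean():.2f} ± {values.std():.2f}"

logs = {  # model -> (reward, latency, energy) per-episode logs
    "MATD3": (np.random.normal(0.36, 0.01, 100),
              np.random.normal(0.08, 0.01, 100),
              np.random.normal(0.76, 0.14, 100)),
}
for model, (rew, lat, en) in logs.items():
    print(model, summarize(rew), summarize(lat), summarize(en))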

7. Conclusions and Future Direction

In this paper, we addressed the challenges of task offloading in dynamic fog computing environments, particularly under scenarios with high computational demand. We proposed MAFCPTORA, a decentralized multi-agent deep reinforcement learning approach for partial task offloading in a horizontal fog-to-fog (F2F) architecture. This approach allows fog nodes to make autonomous and coordinated decisions, optimizing sub-task offloading based on real-time resource status, workload, and network conditions. The proposed MAFCPTORA algorithm enhances task execution efficiency by reducing latency and energy consumption while maintaining the quality of service (QoS) of end-user applications. We evaluated our method against state-of-the-art MADRL algorithms, namely MATD3, MASAC, MAIDDPG, and MAPPO. MATD3 demonstrated superior performance, achieving the highest average reward of 0.36 ± 0.01, the lowest average latency of 0.08 ± 0.01, and the lowest energy consumption of 0.76 ± 0.14. These results validate the effectiveness of our decentralized approach for scalable and efficient task offloading in fog computing environments.
In future work, we plan to consider the task success rate under dynamic resource requirements while optimizing response time. Since tasks in our design are split into sub-tasks, handling dependency-heavy tasks is planned as an extension of this work. Implementing the proposed algorithm in practical deployment scenarios is also important, as it will provide valuable insights. Another gap we plan to address is the absence of a multi-objective optimization framework that jointly balances latency, energy usage, bandwidth constraints, and battery health under dynamic and uncertain network conditions. Finally, we plan to extend our work with a fault tolerance approach for fog node failures, since a fallback mechanism is important for mission-critical tasks.

Author Contributions

Conceptualization, E.M.A., R.S., F.L. and J.A.; Methodology, E.M.A., F.L. and R.S.; Formal analysis, E.M.A., F.L., R.S. and J.A.; Investigation, E.M.A. and J.A.; Resources, E.M.A. and R.S.; Writing—original draft, E.M.A.; Writing—review & editing, E.M.A., F.L., R.S. and J.A.; Visualization, E.M.A. and F.L.; Supervision, F.L., R.S. and J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Deanship of Scientific Research, Vice-Presidency for Graduate Studies and Scientific Research, King Faisal University, Ministry of Education, Saudi Arabia, under Grant KFU251733.

Data Availability Statement

The original data presented in this study are openly available in the public GitHub repository at https://github.com/thinkOver87/multi-agentFog (accessed on 20 May 2025).

Acknowledgments

The authors acknowledge the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research at King Faisal University, Saudi Arabia, for financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Statista. Consumer Electronics—Statista Market Forecast. 2023. Available online: https://www.statista.com/outlook/cmo/consumer-electronics/worldwide (accessed on 10 February 2025).
2. Shahidinejad, A.; Abawajy, J. Efficient Provably Secure Authentication Protocol for Multidomain IIoT Using a Combined Off-Chain and On-Chain Approach. IEEE Internet Things J. 2024, 11, 15241–15251.
3. Li, X.; Xu, Z.; Fang, F.; Fan, Q.; Wang, X.; Leung, V.C.M. Task Offloading for Deep Learning Empowered Automatic Speech Analysis in Mobile Edge-Cloud Computing Networks. IEEE Trans. Cloud Comput. 2023, 11, 1985–1998.
4. Abawajy, J.H.; Hassan, M.M. Federated Internet of Things and Cloud Computing Pervasive Patient Health Monitoring System. IEEE Commun. Mag. 2017, 55, 48–53.
5. Okafor, K.C.; Achumba, I.E.; Chukwudebe, G.A.; Ononiwu, G.C. Leveraging fog computing for scalable IoT datacenter using spine-leaf network topology. J. Electr. Comput. Eng. 2017, 2017, 2363240.
6. Wang, Y.; Wang, K.; Huang, H.; Miyazaki, T.; Guo, S. Traffic and Computation Co-Offloading with Reinforcement Learning in Fog Computing for Industrial Applications. IEEE Trans. Ind. Inform. 2019, 15, 976–986.
7. Hussein, M.K.; Mousa, M.H. Efficient task offloading for IoT-based applications in fog computing using ant colony optimization. IEEE Access 2020, 8, 37191–37201.
8. Ghanavati, S.; Abawajy, J.; Izadi, D. An Energy Aware Task Scheduling Model Using Ant-Mating Optimization in Fog Computing Environment. IEEE Trans. Serv. Comput. 2020, 15, 2007–2017.
9. Wei, Z.; Li, B.; Zhang, R.; Cheng, X.; Yang, L. Many-to-Many Task Offloading in Vehicular Fog Computing: A Multi-Agent Deep Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2023, 23, 2107–2122.
10. Ren, Y.; Sun, Y.; Peng, M. Deep Reinforcement Learning Based Computation Offloading in Fog Enabled Industrial Internet of Things. IEEE Trans. Ind. Inform. 2021, 17, 4978–4987.
11. Li, Y.; Liang, L.; Fu, J.; Wang, J. Multiagent Reinforcement Learning for Task Offloading of Space/Aerial-Assisted Edge Computing. Secur. Commun. Networks 2022, 2022, 193365.
12. Shi, W.; Chen, L.; Zhu, X. Task offloading decision-making algorithm for vehicular edge computing: A deep-reinforcement-learning-based approach. Sensors 2023, 23, 7595.
13. Seid, A.M.; Erbad, A.; Abishu, H.N.; Albaseer, A.; Abdallah, M.; Guizani, M. Multi-agent Federated Reinforcement Learning for Resource Allocation in UAV-enabled Internet of Medical Things Networks. IEEE Internet Things J. 2023, 10, 19695–19711.
14. Gao, Z.; Yang, L.; Dai, Y. Fast Adaptive Task Offloading and Resource Allocation via Multi-agent Reinforcement Learning in Heterogeneous Vehicular Fog Computing. IEEE Internet Things J. 2022, 10, 6818–6835.
15. Azizi, S.; Shojafar, M.; Abawajy, J.; Buyya, R. Deadline-aware and energy-efficient IoT task scheduling in fog computing systems: A semi-greedy approach. J. Netw. Comput. Appl. 2022, 201, 103333.
16. Shi, J.; Du, J.; Wang, J.; Yuan, J. Deep reinforcement learning-based V2V partial computation offloading in vehicular fog computing. In Proceedings of the 2021 IEEE Wireless Communications and Networking Conference (WCNC), Nanjing, China, 29 March–1 April 2021; pp. 1–6.
17. Wu, G.; Xu, Z.; Zhang, H.; Shen, S.; Yu, S. Multi-agent DRL for joint completion delay and energy consumption with queuing theory in MEC-based IIoT. J. Parallel Distrib. Comput. 2023, 176, 80–94.
18. Wakgra, F.G.; Kar, B.; Tadele, S.B.; Shen, S.H.; Khan, A.U. Multi-objective offloading optimization in MEC and vehicular-fog systems: A distributed-TD3 approach. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16897–16909.
19. Zhu, X.; Chen, S.; Chen, S.; Yang, G. Energy and Delay Co-aware Computation Offloading with Deep Learning in Fog Computing Networks. In Proceedings of the 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC), London, UK, 29–31 October 2019; pp. 1–6.
20. Seid, A.M.; Boateng, G.O.; Mareri, B.; Sun, G.; Jiang, W. Multi-Agent DRL for Task Offloading and Resource Allocation in Multi-UAV Enabled IoT Edge Network. IEEE Trans. Netw. Serv. Manag. 2021, 18, 4531–4547.
21. Wang, N.; Varghese, B. Context-aware distribution of fog applications using deep reinforcement learning. J. Netw. Comput. Appl. 2022, 203, 103354.
22. Jamil, B.; Ijaz, H.; Shojafar, M.; Munir, K. IRATS: A DRL-based intelligent priority and deadline-aware online resource allocation and task scheduling algorithm in a vehicular fog network. Ad Hoc Networks 2023, 141, 103090.
23. Baek, J.; Kaddoum, G. Heterogeneous Task Offloading and Resource Allocations via Deep Recurrent Reinforcement Learning in Partial Observable Multifog Networks. IEEE Internet Things J. 2021, 8, 1041–1056.
24. Bai, Y.; Li, X.; Wu, X.; Zhou, Z. Dynamic Computation Offloading with Deep Reinforcement Learning in Edge Network. Appl. Sci. 2023, 13, 2010.
25. Cao, Z.; Zhou, P.; Li, R.; Huang, S.; Wu, D. Multiagent Deep Reinforcement Learning for Joint Multichannel Access and Task Offloading of Mobile-Edge Computing in Industry 4.0. IEEE Internet Things J. 2020, 7, 6201–6213.
26. Tong, Z.; Li, Z.; Gendia, A.; Muta, O. Deep Reinforcement Learning Based Computing Resource Allocation in Fog Radio Access Networks. In Proceedings of the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall), Washington, DC, USA, 7–10 October 2024; pp. 1–5.
27. Ren, J.; Zhang, D.; He, S.; Zhang, Y.; Li, T. A Survey on End-Edge-Cloud Orchestrated Network Computing Paradigms. ACM Comput. Surv. 2020, 52, 1–36.
28. Tran-Dang, H.; Kim, D.-S. DISCO: Distributed computation offloading framework for fog computing networks. J. Commun. Networks 2023, 25, 121–131.
29. Tran-Dang, H.; Kim, D.S. FRATO: Fog resource based adaptive task offloading for delay-minimizing IoT service provisioning. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 2491–2508.
30. Tran-Dang, H.; Bhardwaj, S.; Rahim, T.; Musaddiq, A.; Kim, D.S. Reinforcement learning based resource management for fog computing environment: Literature review, challenges, and open issues. J. Commun. Networks 2022, 24, 83–98.
31. Ke, H.; Wang, J.; Wang, H.; Ge, Y. Joint Optimization of Data Offloading and Resource Allocation with Renewable Energy Aware for IoT Devices: A Deep Reinforcement Learning Approach. IEEE Access 2019, 7, 179349–179363.
32. Zhang, J.; Du, J.; Shen, Y.; Wang, J. Dynamic Computation Offloading with Energy Harvesting Devices: A Hybrid-Decision-Based Deep Reinforcement Learning Approach. IEEE Internet Things J. 2020, 7, 9303–9317.
33. Qiu, X.; Zhang, W.; Chen, W.; Zheng, Z. Distributed and Collective Deep Reinforcement Learning for Computation Offloading: A Practical Perspective. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 1085–1101.
34. Ghanavati, S.; Abawajy, J.; Izadi, D. Automata-based Dynamic Fault Tolerant Task Scheduling Approach in Fog Computing. IEEE Trans. Emerg. Top. Comput. 2022, 10, 488–499.
35. Chen, J.; Chen, P.; Niu, X.; Wu, Z.; Xiong, L.; Shi, C. Task offloading in hybrid-decision-based multi-cloud computing network: A cooperative multi-agent deep reinforcement learning. J. Cloud Comput. 2022, 11, 90.
36. Zhang, K.; Yang, Z.; Başar, T. Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms. Stud. Syst. Decis. Control 2021, 325, 321–384.
37. Suzuki, A.; Kobayashi, M.; Oki, E. Multi-Agent Deep Reinforcement Learning for Cooperative Computing Offloading and Route Optimization in Multi Cloud-Edge Networks. IEEE Trans. Netw. Serv. Manag. 2023, 20, 4416–4434.
38. Zhu, X.; Luo, Y.; Liu, A.; Bhuiyan, Z.A.; Zhang, S. Multiagent Deep Reinforcement Learning for Vehicular Computation Offloading in IoT. IEEE Internet Things J. 2021, 8, 9763–9773.
39. Tadele, S.B.; Yahya, W.; Kar, B.; Lin, Y.D.; Lai, Y.C.; Wakgra, F.G. Optimizing the Ratio-Based Offloading in Federated Cloud-Edge Systems: A MADRL Approach. IEEE Trans. Netw. Sci. Eng. 2024, 12, 463–475.
40. Zhang, Y.; Wang, J.; Zhang, L.; Zhang, Y.; Li, Q.; Chen, K.C. Reliable Transmission for NOMA Systems With Randomly Deployed Receivers. IEEE Trans. Commun. 2023, 71, 1179–1192.
41. Hao, H.; Xu, C.; Zhang, W.; Yang, S.; Muntean, G.M. Joint task offloading, resource allocation, and trajectory design for multi-UAV cooperative edge computing with task priority. IEEE Trans. Mob. Comput. 2024, 23, 8649–8663.
42. Sellami, B.; Hakiri, A.; Yahia, S.B.; Berthou, P. Energy-aware task scheduling and offloading using deep reinforcement learning in SDN-enabled IoT network. Comput. Networks 2022, 210, 108957.
43. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
44. Al-Khafajiy, M.; Baker, T.; Al-Libawy, H.; Maamar, Z.; Aloqaily, M.; Jararweh, Y. Improving fog computing performance via fog-2-fog collaboration. Future Gener. Comput. Syst. 2019, 100, 266–280.
45. Raju, M.R.; Mothku, S.K.; Somesula, M.K. DMITS: Dependency and Mobility-Aware Intelligent Task Scheduling in Socially-Enabled VFC Based on Federated DRL Approach. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17007–17022.
46. Fu, X.; Tang, B.; Guo, F.; Kang, L. Priority and dependency-based DAG tasks offloading in fog/edge collaborative environment. In Proceedings of the 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China, 5–7 May 2021; pp. 440–445.
47. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540.
Figure 1. Fog node F_{m,j} cooperation in the form of task node (TN), free node (FN), and busy node (BN) states: three TNs (F1, F2, F7), two BNs (F6, F8), and three FNs (F3, F4, F5), adapted from [28]. All fog nodes are assumed to run the same protocol independently to update their status.
Figure 2. User device-Fog Architecture.
Figure 3. Directed acyclic graph task model.
Figure 4. MATD3-based horizontal F2F task offloading architecture.
Figure 5. Performance evaluation of MAPPO, MAIDDPG, MASAC, and MATD3 for average reward, latency, and energy consumption per episode. (a) Episode-wise evaluation of Reward. (b) Average latency evaluation per episode. (c) Average energy evaluation per episode. (d) Performance with change in weighting parameter ( ω ) .
Figure 6. Performance evaluation of MATD3 for average reward, latency, and energy consumption per episode. (a) MATD3: Average reward per episode. (b) MATD3: Average latency per episode. (c) MATD3: Average energy per episode.
Table 1. Description of abbreviations.
Symbol | Description
MAFCPTORA | Multi-Agent Fully Cooperative Partial Task Offloading and Resource Allocation
MADRL | Multi-Agent Deep Reinforcement Learning
MADDPG | Multi-Agent Deep Deterministic Policy Gradient
MAPPO | Multi-Agent Proximal Policy Optimization
TD3 | Twin Delayed Deep Deterministic Policy Gradient
MATD3 | Multi-Agent Twin Delayed DDPG
DRL | Deep Reinforcement Learning
MDP | Markov Decision Process
Table 3. List of key parameters.
Notation | Description
F_n | Number of fog nodes
U_n | Number of user devices
M_{f,n} | Available memory at each fog node
C_f | Available CPU at each fog node
T_{m,j} | Task processing delay
τ^o_{m,j} | Task offloading delay
β | Transmission bandwidth
E_k | Energy consumption of task k
ζ | Battery lifetime
H_c | Channel condition
φ | Number of available channels
K_n | Task to be executed
V_k | Total size of task K_n
H_i | Channel transmission power
τ_max | Maximum tolerable delay
Q | Waiting tasks in each fog node
S_i | State of each agent
O(t) | Observation of the environment at time t
a_i | Individual agent action
X_n | Offloading decision
R_{f,k}(t) | Resource allocation decision at time t
μ and μ̂ | Optimal and target policy
r_t and r̃_t | Individual and collective average reward
Table 4. Cooperative F2F sample extracted knowledge frame.
Fog ID | Status | CPU Cores | CPU Usage (%) | Memory Usage (%) | Bandwidth | Queue | Idle Time (%) | Tasks Processed | Next Hop
F_1 | Busy | 8 | 70% | 50% | 100 Mbps | [K_1, K_2, K_3] | 30% | 10 | F_2
Table 5. Simulation parameters.
Parameter | Quantity
Number of user devices | 100
Number of fog nodes | 6
Distance between fog nodes (Dist_{m,j}) | 1 km to 5 km
Task arrival rate λ at F_{m,j} | 3 × 10^2 packets per second
Available CPU cores ν_{f_{m,j}} | [2–4]
Bandwidth between fog nodes β | 50 MHz
Channel condition H | [0, 1]
Computational capacity of U_i (min) | [100–1000] MIPS
Computational capacity of F_{m,j} (max) | 50 MIPS
CPU frequency range of F_{m,j} (min–max) | [200 × 10^6 – 15 × 10^8] Hz
Path loss exponent PL(d_{u,f}) | 2.7
Minimum channel transmission power of U_i, τ^u_{t,r}(t) | 10 dBm
Maximum channel transmission power of U_i, τ^u_{t,r}(t) | 30 dBm
Number of episodes | 10,000
Observation steps per episode | 100
Learning rate α | 0.001
Discount factor γ | 0.99
Weight coefficients ω_t and ω_e | 0.7 and 0.3
Table 6. Summary evaluation result for the four models.
Model | Average Reward | Average Latency | Average Energy Consumption
MAPPO | 0.34 ± 0.01 | 0.17 ± 0.02 | 0.96 ± 0.15
MASAC | 0.35 ± 0.01 | 0.20 ± 0.03 | 0.92 ± 0.14
MAIDDPG | 0.32 ± 0.01 | 0.14 ± 0.0143 | 0.9998 ± 0.1873
MATD3 | 0.36 ± 0.01 | 0.08 ± 0.01 | 0.76 ± 0.14
Note: Bold values indicate the best performance achieved by the MATD3 algorithm for each metric.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
