Article

Flash-Attention-Enhanced Multi-Agent Deep Deterministic Policy Gradient for Mobile Edge Computing in Digital Twin-Powered Internet of Things

by Yuzhe Gao, Xiaoming Yuan *, Songyu Wang, Lixin Chen, Zheng Zhang and Tianran Wang
Hebei Key Laboratory of Marine Perception Network and Data Processing, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(13), 2164; https://doi.org/10.3390/math13132164
Submission received: 16 May 2025 / Revised: 13 June 2025 / Accepted: 30 June 2025 / Published: 2 July 2025

Abstract

Offloading decisions and resource allocation problems in mobile edge computing (MEC) emerge as key challenges as they directly impact system performance and user experience in dynamic and resource-constrained Internet of Things (IoT) environments. This paper constructs a comprehensive and layered digital twin (DT) model for MEC, enabling real-time cooperation with the physical world and intelligent decision making. Within this model, a novel Flash-Attention-enhanced Multi-Agent Deep Deterministic Policy Gradient (FA-MADDPG) algorithm is proposed to effectively tackle MEC problems. It enhances the model by equipping the critic network with an attention mechanism to support high-quality decisions, and it restructures the attention matrix computation to accelerate the training process. Experiments are performed in our proposed DT environment, and the results demonstrate that FA-MADDPG has good convergence. Compared with other algorithms, it achieves excellent performance in delay and energy consumption under various settings, with high time efficiency.

1. Introduction

With the rapid development of the Internet of Things (IoT) industry, smart IoT devices have surged both in number and in variety. Many IoT services are computation intensive and delay sensitive, posing significant challenges for resource-constrained IoT devices. The increasing demand for various services also places heavy burdens on networks, requiring real-time control, low-latency response, and intelligent resource allocation [1]. Mobile edge computing (MEC) has emerged as an effective approach to address these issues [2]. By deploying computing resources at edge nodes, MEC enables user equipment (UE) to handle diverse applications, offering the potential to solve the aforementioned problems [3].
To comprehensively understand the MEC-assisted IoT ecosystem, it is essential to jointly model the communication network, computational mechanisms, and IoT scenarios. Digital twin (DT) is a rapidly evolving research topic that offers a feasible solution for building an integrated representation of the physical and digital worlds [4]. DTs are virtual replicas of physical assets or systems that are continuously updated with real-time sensor data and computational models. By bridging the physical and digital domains, DTs enable continuous monitoring, predictive analysis, and proactive decision making. The integration of DT-powered IoT and MEC enhances data processing capabilities and facilitates more intelligent, efficient, and reliable services [5]. A framework that maps services and physical entities into the DT space, enabling a more holistic understanding of the IoT environment and supporting more effective offloading strategies, is needed [6].
Although edge computing can alleviate the limited resources of user devices, how to make offloading decisions remains a core problem. MEC environments are characterized by heterogeneous tasks, dynamic network conditions, and strict constraints on delay and energy consumption. Improper offloading may lead to extra energy consumption and high latency, both of which degrade system performance and user experience. These factors make the task offloading decision process highly complex, requiring intelligent algorithms that can adapt in real time to ensure optimal system performance [7]. To address the offloading decision problem in MEC environments, Multi-Agent Deep Reinforcement Learning (MADRL) has gained attention due to its efficiency in managing dynamic, distributed, and complex scenarios [8]. MADRL can learn optimal strategies in real time by interacting with the environment [9]. Among various MADRL approaches, the MADDPG algorithm stands out by utilizing centralized training with decentralized execution, making it particularly suitable for MEC scenarios involving multiple agents. MADDPG facilitates faster convergence during training and improved decision-making performance.
In MADRL, agents learn from their observations of the environment. However, as the number of agents increases, information redundancy and irrelevance can become significant issues. Inspired by advances in natural language processing and computer vision, the attention mechanism [10] has been introduced into MADRL to help individual agents focus on the most relevant information from their own perspective. This enhances the efficiency and accuracy of offloading decision making.
While attention-augmented MADRL improves decision quality in MEC, the computational complexity of the training process becomes a limiting factor. The training time increases significantly with the number of agents and the scale of observations. To address this challenge, the flash attention mechanism has been proposed to improve computational efficiency. It optimizes the attention operation by using a tiling strategy that divides large matrices into smaller blocks, thereby reducing unnecessary memory access and enhancing computational throughput [11]. By integrating flash attention into the MADRL framework, it is possible to accelerate training without sacrificing performance, which is an advantage that is particularly valuable in highly dynamic MEC environments.
The contributions of this paper can be summarized as follows:
  • To enable accurate virtual representations within the MEC-assisted IoT, this paper designs a comprehensive and layered digital twin model. The proposed model provides valuable insights into the IoT ecosystem and offers data support, decision supervision, and real-time control to tackle MEC problems.
  • To improve performance in a MEC scenario, this paper proposes a Flash-Attention-enhanced MADDPG algorithm (FA-MADDPG) for decision making, and its time complexity is analyzed. The integration of attention mechanisms into a critic network enables agents to focus on relevant information, while a flash mechanism ensures efficient and timely training in complex IoT scenarios.
  • To validate the effectiveness of the proposed FA-MADDPG algorithm in DT, this paper conducts extensive experiments. The DT environment is constructed, and the algorithm is evaluated using reward convergence curves, time efficiency analysis, and performance comparisons against several baseline algorithms.

2. Related Work

2.1. DT-Powered MEC

The integration of digital twin technology has brought new opportunities to many research fields. Some researchers have achieved real-time control by building a DT framework; they map physical entities to the virtual world and simulate services from the real world in [12,13,14]. In the specific field of MEC, for the resource allocation optimization problem, ref. [15] proposes a MEC framework for factory manufacturing based on a digital twin, aiming to reduce the total latency between devices and ensure the reliability of communication. Ref. [16] combines a digital twin and federated learning, proposing a resource allocation framework for heterogeneous cellular networks. In the task offloading field, ref. [17] focuses on collaborative mobile edge computing and proposes a digital twin-driven intelligent task offloading scheme. Ref. [18] considers an unmanned aerial vehicle-assisted MEC emergency network DT that reflects the real-time status to optimize offloading decisions.
DT-powered MEC applications have expanded to various IoT fields. Ref. [19] utilizes DT and MEC technologies to construct a healthcare service architecture for senior safety monitoring, and realizes real-time early warning of the health status of the elderly. Ref. [20] proposes a DT framework named VECO for intelligent transportation systems. This framework effectively improves the efficiency and reliability of vehicular edge computing. Ref. [21] explores MEC-assisted DT environments in smart industrial scenarios. It introduces a framework that captures the dynamic relationships among physical entities. However, DT is rarely used to collaboratively optimize MEC strategies, and existing DT frameworks lack details such as how the DT interacts with the real world and how data are used within it.

2.2. MADRL for Computation Offloading Decisions

In MEC scenarios, traditional deep reinforcement learning (DRL) algorithms face difficulties when dealing with multi-user computation offloading scenarios [22], and they are also unable to adapt well to dynamic environments [23]. Therefore, MADRL has become a research hotspot. Many scholars have proposed a series of computation offloading decision-making methods based on MADRL.
To solve the problem of partial observability in a dynamic environment, ref. [24] utilizes a cooperative MADRL method in a cloud-edge computing scenario to reduce the average latency. Ref. [25] uses the MADDPG algorithm to solve the task offloading problem in air–ground integrated networks through centralized training and decentralized execution. For a more complicated application, a partial offloading and resource allocation algorithm based on MATD3 is designed in [26], reducing vehicle computing overhead and ensuring timely processing. A mobility-aware collaborative MADRL method is proposed in [27]. A two-stage decision framework based on MAPPO and MADDPG is designed to balance the system workload and reduce the task completion delay and failure rate. A mixed MAPPO algorithm is proposed in [28] to solve the task offloading problem in crowd-edge computing, reducing end-to-end transmission delay as a result. Ref. [29] integrates DT and evolutionary selection MADRL for directed acyclic graph task scheduling in large-scale mobile edge networks. Ref. [30] considers the industrial wireless environment where Wi-Fi 6 and 5G coexist. A multi-agent JTSRA algorithm is proposed for task offloading, enabling agents to dynamically adjust schemes according to network conditions and task requirements. However, the situation in which an environment involves many agents and excessive information slows down training is rarely considered.

2.3. Flash-Attention-Advanced DRL

The attention mechanism can extract the most effective part from a large amount of available information, which makes it suitable for tasks such as text generation [31] and translation [32]. Because the attention mechanism emphasizes important and relevant information, many scholars have also integrated it into DRL. Mao et al. add attention to an actor-critic algorithm to reduce the impact of redundant messages in a multi-agent communication scenario [33]. Bono et al. integrate attention mechanisms with deep reinforcement learning to construct an architecture named MARDAM, effectively solving the dynamic and stochastic vehicle routing problem [34]. Wu et al. incorporate an attention mechanism into the MADDPG framework to selectively concentrate on the input state and joint actions, improving the model’s prediction performance in vehicular MEC [35]. In [10], an attention mechanism contributes to faster convergence of an MADDPG algorithm and more precise UAV formation control by focusing on the most relevant information. Ref. [36] introduces a PGA-DRL model that gradually integrates GCN and GAT features through attention, leveraging the advantages of both methods to strengthen feature representation within an actor-critic framework.
However, with the incorporation of an attention mechanism in DRL, the training time also becomes longer as the computational complexity increases. Although this is a significant challenge for power-limited and delay-sensitive MEC, there is relatively little research on it. Flash attention is developed to speed up the attention mechanism. As an I/O-aware method, it seeks to reduce the memory activity caused by the large matrices commonly used in attention mechanisms [37]. It achieves this by applying techniques such as tiling and recomputation, combined with an online softmax approach, allowing the matrix to be computed one tile at a time [11].

3. System Model

In this section, a system model is established for DT-powered MEC. A DT-based IoT model is first described; the communication model and the computation model are then embedded in the operational mechanisms of the three-layer DT architecture, serving as the key supporting mechanisms of the system.

3.1. Digital Twin-Based IoT Model

In this part, a digital twin-powered Internet of Things system is first described, which comprises a physical entities layer, a virtual twin layer, and an application layer. As shown in Figure 1, the DT-based system is established for real-time information gathering, effective computation offloading, and resource allocation in IoT environments. Details are illustrated in the following content.

3.1.1. Physical Entity Layer

The physical entity layer (PEL) is the fundamental constituent of the whole system. In this layer, entities such as the cloud server, base stations, and user devices are the basic elements. Computing task offloading and resource allocation triggered by tasks are the basic actions. Real-time data computation and timely analysis are the functionalities. PEL ensures the normal functioning of hardware so that individual actions can be performed to support the overall functionalities. This layer furthermore promotes the building process of DT by collecting abundant real-world data, and also assists in achieving the goals of sound strategy making and key index optimization by giving accurate feedback.
In the PEL, each user is equipped with a device to obtain a state, and data are uploaded to the edge node in real time. The user set is defined as $\mathcal{U} = \{1, 2, \ldots, M_U\}$, where $M_U$ is the total number of users. The time slots are defined as $\mathcal{T} = \{1, 2, \ldots, M_T\}$, where $M_T$ is the total number of time slots. Users can walk randomly in the activity area. The user's position at a specific time slot is represented as a triplet $(x_u(t), y_u(t), 0)$ in 3-D Cartesian coordinates. In one step, users can either stay in the same place or move in any direction. Data are collected by user devices in every time slot, and a computation task is formed from the data. The tasks are denoted as $\mathcal{C} = \{C_1, C_2, \ldots, C_u\}$, where $C_u$ is the computing task of user $u$. When tasks are generated, they can be processed locally using the computation resources of a user device, or they can be offloaded and processed remotely. $\mathcal{B} = \{1, 2, \ldots, M_B\}$ represents the set of base stations, where $M_B$ is the total number of base stations. Base stations are installed above the ground, and their positions are represented as triplets $(x_b, y_b, z_b)$. Base stations are fixed in a service area. Users' and base stations' positions can be obtained through GPS and are broadcast at every time slot.

3.1.2. Virtual Twin Layer

The virtual twin layer (VTL) is the core constituent of the digital twin system. It has three components inside, which are the simulation and training component, data repository component, and DT supervision component. VTL imitates the actual behaviors and functionalities of the entities in physical world, giving a general comprehension of the physical layer’s running mode. It also helps the decision-making training process and thus can give a high-quality strategy for offloading and resource allocation.
Data Repository Component: The data repository component stores the data collected in real time. It stores environment and entity information, such as the wireless channel environment, task information, and user states. One feature of the data repository is that, to meet users' high demand for privacy, data are processed to erase users' sensitive information. For instance, names and phone numbers are not stored in the database. Moreover, data validity is checked and invalid data are abandoned. This not only reduces storage stress and increases database utility, but also ensures user security. Even if the information is leaked, the corresponding person cannot be identified, so user security and information privacy are guaranteed.
Simulation and Training Component: The simulation and training component is crucial for a virtual twin. On the one hand, it receives real data extracted from the real world to rebuild a simulated world for training; on the other hand, it supports an application layer for functions like decision making and data analysis.
When implementing simulation and training, a virtual world is first constructed with data support from a data repository. The features and specialties are reflected, and the main functions are emphasized during this process in real time with precision. Then the model training part is carried out through replay training in the constructed environment. The model begins to implement different actions and obtain timely feedback on system performance including delay and energy consumption. The parameters for policies are updated, and the model can then interact with the environment for iterative optimization.
DT supervision component: The DT supervision component undertakes managerial function. This component controls the running process of DT, and oversees other components’ behavior to guarantee a system’s effective running.
One feature of a DT supervision component is that it is responsible for dirty data detection. During DT construction, abnormal data collected from malfunctioning devices or non-human users will be marked and deleted to avoid interference to a training model. A reasonableness check is also conducted in this component. The offloading and migration strategy of tasks should be considered with caution. They will be implemented in the real world only if they pass the test. At the same time, every action performed in the virtual twin layer will be recorded by the supervision component for the human maintainer to check.

3.1.3. Application Layer

The application layer (AL) is designed to provide convenient use for all stakeholders, such as users and managerial staff. Without exposure to the complex mechanism of the proposed IoT system, they only need to see and interact with an operation-friendly interface. In this way, AL not only simplifies complex operations by offering direct function utilization to provide a high quality of service, but also ensures that users and workers can access the intangible DT for real-time control in an effective way. For example, users are provided with a request function through which they can directly call specific functions in various environments. Managerial staff can check the base station network at any time to ensure that the IoT system runs well and can monitor the devices to adjust to different scenarios and environments. More importantly, AL is responsible for applying the decisions and strategies generated by the VTL to the real world, enabling dynamic task scheduling and resource allocation. It also supports real-time system monitoring and adaptive adjustments based on current network conditions and task requirements.

3.2. Communication Model

The communication model reflects how data are transmitted between users and edge nodes. When a task is offloaded, wireless links are used to transfer the information. This reflects the operation of the physical layer and provides the necessary data support for DT construction. When a task is offloaded to a base station, the data rate for task $j$ transmitting to base station $b$ can be calculated by the following Shannon formula:
$$v_{j,b} = w \log_2(1 + \mathrm{SINR})$$
$$\mathrm{SINR} = \frac{p^{\mathrm{trans}} g}{p^n}$$
$w$ is the channel bandwidth, $p^{\mathrm{trans}}$ is the transmitting power between a user device and a base station, $g$ is the channel gain, and $p^n$ is the noise power.
The transmission power is a function of the channel gain, where $r$ is the required transmission rate and $w$ is the channel bandwidth. The transmission power is designed as
$$p^{\mathrm{trans}} = \frac{p^n \left(2^{\frac{r}{w}} - 1\right)}{g}$$
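To make the relation concrete, the following Python snippet evaluates the Shannon rate and the inverse power relation. The bandwidth, noise power, channel gain, and maximum transmission power follow Table 2; the required rate $r$ is an assumed value for illustration, not a setting from the paper. This is a minimal sketch, not the simulation code.

```python
import numpy as np

# Parameter values taken from Table 2; the required rate r is an assumption for illustration.
w = 80e6          # channel bandwidth (Hz)
p_trans = 0.4     # maximum transmission power (W)
g = 0.01          # channel gain
p_n = 0.001       # noise power (W)

# Shannon rate for offloading task j to base station b
sinr = p_trans * g / p_n
v_jb = w * np.log2(1.0 + sinr)                  # achievable data rate (bit/s)

# Inverse relation: power needed to sustain a required transmission rate r
r = 50e6                                        # required rate (bit/s), assumed
p_required = p_n * (2.0 ** (r / w) - 1.0) / g
print(f"rate = {v_jb / 1e6:.1f} Mbit/s, required power = {p_required:.4f} W")
```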

3.3. Computation Model

The computation model characterizes how computing tasks are processed, covering both offloading and local processing. Computation task data are collected from the real world. The DT system continuously attempts different offloading decisions for these tasks in the simulation environment and updates the strategy based on the feedback received. The obtained strategies are then used to guide actual task scheduling, achieving the DT closed-loop optimization process from physical to virtual and back to physical.

3.3.1. Local Computation Model

When the task is processed locally, the user device finishes the task by itself. The size of task $j$ is denoted as $s_j$, and the user device's local computation ability is denoted as $c^l$. Let $p^l$ be the local computation power. The computation time $t_j^l$ and energy consumption $e_j^l$ can be calculated as
$$t_j^l = \frac{s_j}{c^l}$$
$$e_j^l = t_j^l \, p^l$$

3.3.2. Offloading Computation Model

Base stations' computation ability is denoted as $c^b$, and their computation power is denoted as $p^b$. The transmission time $t_j^{\mathrm{trans}}$, time cost $t_j^o$, and energy consumption $e_j^o$ can be calculated as
$$t_j^{\mathrm{trans}} = \frac{s_j}{v_{j,b}}$$
$$t_j^o = \frac{s_j}{c^b} + \frac{s_j}{v_{j,b}}$$
$$e_j^o = \frac{s_j}{c^b}\, p^b + \frac{s_j}{v_{j,b}}\, p^{\mathrm{trans}}$$

3.3.3. System Computation Model

The system computation model provides a holistic view of how the entire computation operates. $\alpha$ and $\beta$ are indicator variables whose values are initially set to 0. When a task is processed locally, $\alpha = 1$; when a task is offloaded, $\beta = 1$. Let $t_{i,j}$ denote the latency for user device $i$ to get task $j$ processed in the proposed IoT system, and let $e_{i,j}$ denote the total energy consumption of user $i$ in processing task $j$.
$$t_{i,j} = \alpha t_j^l + \beta t_j^o$$
$$e_{i,j} = \alpha e_j^l + \beta e_j^o$$
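A short sketch of the per-task delay and energy computation implied by the formulas above. The local and edge computing frequencies and the task size follow Table 2; the computing powers $p^l$ and $p^b$ and the achieved data rate $v_{j,b}$ are assumed values, since the paper does not list them explicitly.

```python
def task_delay_energy(s_j, offload, c_l, p_l, c_b, p_b, v_jb, p_trans):
    """Delay and energy of one task, following the local/offloading/system models above.

    `offload` plays the role of the indicator variables: beta = 1 when True, alpha = 1 otherwise.
    """
    if not offload:                               # local processing (alpha = 1)
        t = s_j / c_l                             # local computation time
        e = t * p_l                               # local energy consumption
    else:                                         # offloading (beta = 1)
        t = s_j / c_b + s_j / v_jb                # edge computing time + transmission time
        e = (s_j / c_b) * p_b + (s_j / v_jb) * p_trans
    return t, e

# 120 KB task (Table 2), expressed in bits; p_l, p_b, and v_jb are assumed values.
params = dict(c_l=0.8e9, p_l=0.5, c_b=3.0e9, p_b=2.0, v_jb=1.86e8, p_trans=0.4)
s_j = 120e3 * 8
t_loc, e_loc = task_delay_energy(s_j, offload=False, **params)
t_off, e_off = task_delay_energy(s_j, offload=True, **params)
print(f"local:   t = {t_loc * 1e3:.2f} ms, e = {e_loc * 1e3:.2f} mJ")
print(f"offload: t = {t_off * 1e3:.2f} ms, e = {e_off * 1e3:.2f} mJ")
```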
This section deeply integrates the communication process and computation process into the operation flow of DT. Three layers work together to form a complete DT-IoT system loop. This approach makes DT no longer an independent abstract concept but a core mechanism guiding the efficient operation of the system.

4. Problem Formulation and Proposed FA-MADDPG Solution

In this section, the usage and adaptation of FA-MADDPG is elucidated to provide timely control strategies for the problem in an MEC-based IoT environment. The section begins by formulating the problem and then transforms it into a Markov Decision Process (MDP) by constructing each agent's state space, action space, and reward function. The preliminary knowledge of MADDPG and the flash attention mechanism is then introduced. Finally, our optimization algorithm for the problem is given in Algorithm 1.

4.1. Problem Formulation

It is worth noting that edge computing tasks are delay sensitive and user devices are energy constrained, so a balance problem between delay and energy consumption arises. The cost of computing task $j$ of user $i$ can be defined as
$$\mathrm{Cost}_{i,j} = \theta_1 t_{i,j} + \theta_2 e_{i,j}$$
$\theta_1$ and $\theta_2$ are normalized weight coefficients of the processing delay $t_{i,j}$ and the energy consumption $e_{i,j}$. In the context of an IoT system, they can be determined according to the identity of user $i$ and the properties of task $j$. Critical users and urgent tasks have higher requirements on delay, which means that $\theta_1$ takes higher priority. For a task with weight $TW$, the coefficients can be designed as
$$\theta_1 = \frac{TW}{TW + 1}$$
$$\theta_2 = 1 - \theta_1$$
Weight is defined in Table 1. By introducing task weight, the model can differentiate tasks based on their computational demand or priority, which helps in better resource allocation and more realistic decision making.
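For instance, an emergency task with $TW = 4$ yields $\theta_1 = 4/5 = 0.8$ and $\theta_2 = 0.2$, so its cost is dominated by delay, whereas a background task with $TW = 0$ gives $\theta_1 = 0$ and $\theta_2 = 1$ and is optimized purely for energy consumption.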
As illustrated above, the optional offloading and power allocation strategies exist for every task j to finish the computing process in an IoT system. The target is to design a scheme for every task to be efficiently offloaded and resource to be accurately allocated so that cost can be minimized and reward can be maximized to reach a high quality of service.
By taking into account both delay and energy consumption, the final MEC optimization problem for the DT-based IoT world is formulated as follows, where $p^l_{\max}$ is the computing power limit of a user device, $p^b_{\max}$ is the computing power limit of a base station processing the task, and $p^{\mathrm{trans}}_{\max}$ is the maximum transmission power for user-edge information transmission.
The constraints C1 and C2 guarantee that only a single processing mode can be assigned to each task in the IoT system; C3 ensures that the value of $TW$ can only take one of the proposed weights. C4, C5, and C6 are the power limits.
$$\begin{aligned} \textbf{Problem:} \quad & \min \; \sum_{i=1}^{M_T} \sum_{j=C_1}^{C_U} \theta_1 t_{i,j} + \theta_2 e_{i,j} \\ \text{s.t.} \quad & \mathrm{C1:}\; \alpha, \beta \in \{0, 1\} \\ & \mathrm{C2:}\; \alpha + \beta = 1 \\ & \mathrm{C3:}\; TW \in \{0, 1, 2, 3, 4\} \\ & \mathrm{C4:}\; 0 \le p^b \le p^b_{\max} \\ & \mathrm{C5:}\; 0 \le p^l \le p^l_{\max} \\ & \mathrm{C6:}\; 0 \le p^{\mathrm{trans}} \le p^{\mathrm{trans}}_{\max} \end{aligned}$$

4.2. MDP Problem Construction

In the considered multi-agent system, each base station operates as a DRL decision maker, making decentralized and independent decisions based on its local observation and learned policy. Meanwhile, a central controller possesses access to the global information and performs centralized training based on the aggregated experience from all agents. During execution, each agent follows a distributed policy based on its individual observation.
To leverage deep reinforcement learning algorithms to solve the formulated optimization problem, it is essential to transform the problem into a standard MDP framework. The core components of the reformulation include the definition of the state space, action space, and reward function, which are described below.

4.2.1. State Space

At each discrete time slot $t$, each base station $k \in \mathcal{B}$ collects its own observation $o_k(t)$ from the environment. Let $\mathcal{U}_k$ denote the set of user devices associated with base station $k$. The observation at time $t$ is defined as
$$o_k(t) \triangleq \{ n_k(t), \mathrm{TW}_k(t), X_k(t), Y_k(t), z(t) \},$$
where
  • $n_k(t)$ represents the number of user tasks at the beginning of time slot $t$, which is randomly formed;
  • $\mathrm{TW}_k(t) = \{tw_j(t)\}$ represents the task weight of task $j$ at the beginning of time slot $t$, which is randomly formed;
  • $X_k(t) = \{x_{i,k}(t)\}$ represents the horizontal distance between base station $k$ and user device $i$, where $d_x$ is a random variable sampled from $[-1, 1]$;
    $$x_{i,k}(t+1) = x_{i,k}(t) + d_x$$
  • $Y_k(t) = \{y_{i,k}(t)\}$ represents the vertical distance between base station $k$ and user device $i$, where $d_y$ is a random variable sampled from $[-1, 1]$;
    $$y_{i,k}(t+1) = y_{i,k}(t) + d_y$$
  • $z(t)$ represents the altitude of a base station.
The global state at time $t$ is then defined by
$$s(t) \triangleq \{o_1(t), o_2(t), \ldots, o_{M_B}(t)\},$$
and the state space is denoted as $\mathcal{S} = \{s(t), t \in \mathcal{T}\}$.
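A minimal Python sketch of how one agent's observation $o_k(t)$ could be assembled in the DT simulation, assuming uniform random-walk steps in $[-1, 1]$; the number of users, the service-area size, and the base station altitude are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_users, area = 5, 100.0            # assumed number of users and service-area size

# Initial horizontal/vertical distances between base station k and its users
x = rng.uniform(0.0, area, size=n_users)
y = rng.uniform(0.0, area, size=n_users)

def step(x, y):
    """Random-walk update: x(t+1) = x(t) + d_x, y(t+1) = y(t) + d_y with d in [-1, 1]."""
    return x + rng.uniform(-1.0, 1.0, n_users), y + rng.uniform(-1.0, 1.0, n_users)

def observe(x, y, z=10.0):
    """Local observation o_k(t) = {n_k(t), TW_k(t), X_k(t), Y_k(t), z(t)}."""
    n_k = int(rng.integers(1, n_users + 1))             # number of tasks, randomly formed
    tw_k = rng.integers(0, 5, size=n_k)                 # task weights in {0, ..., 4}
    return {"n": n_k, "TW": tw_k, "X": x, "Y": y, "z": z}

x, y = step(x, y)
o_k = observe(x, y)
print(o_k["n"], o_k["TW"])
```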

4.2.2. Action Space

Based on its observation $o_k(t)$, each base station $k$ selects an action $a_{k,j}(t)$ for task $j$ through its actor network. The action consists of multiple sub-decisions depending on the IoT system scenario.
Define the action of agent $k$ as
$$a_{k,j}(t) \triangleq \{ D_k(t), P_k^{\mathrm{trans}}(t), P_k^{c}(t) \},$$
where
  • $D_k(t) = \{d_{j,k}(t)\}$ represents the task offloading decision for task $j$;
  • $P_k^{\mathrm{trans}}(t) = \{p_{j,k}^{\mathrm{trans}}(t)\}$ stands for the transmission power allocation for task $j$;
  • $P_k^{c}(t) = \{p_{j,k}^{c}(t)\}$ stands for the computation power allocation for task $j$.
The joint action of all agents is expressed as
$$a(t) \triangleq \{a_1(t), a_2(t), \ldots, a_J(t)\},$$
and the action space is denoted as $\mathcal{A} = \{a(t), t \in \mathcal{T}\}$.

4.2.3. Reward Function

After executing the action $a_{k,j}(t)$, each agent receives an instant reward that reflects the quality of its decision. In the context of collaborative optimization, the reward is a function of system-wide performance, including latency and energy consumption.
The reward function is defined as
$$r(t) = \frac{1}{M_U} \cdot \frac{1}{\sum_{i \in \mathcal{U}} \sum_{j \in \mathcal{C}} \mathrm{Cost}_{i,j}(t)}.$$
If the selected actions do not satisfy all operational constraints described in the problem, the agent receives a penalty. This mechanism encourages agents to explore valid action spaces and jointly optimize long-term system performance.
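A small sketch of the reward computation, assuming the reciprocal-cost form reconstructed above and a fixed penalty for constraint violations; the penalty value is an assumption, since the paper does not state it.

```python
import numpy as np

def reward(costs, m_u, constraints_ok, penalty=10.0):
    """Instant reward of Section 4.2.3: reciprocal of the total cost scaled by 1/M_U.

    `costs` holds Cost_{i,j}(t) for every user/task pair; `constraints_ok` indicates
    whether the selected actions satisfy constraints C1-C6. The penalty value is assumed.
    """
    r = (1.0 / m_u) * (1.0 / np.sum(costs))
    return r if constraints_ok else r - penalty

# Example call with fabricated per-task costs for five users and three tasks each
print(reward(np.full((5, 3), 0.01), m_u=5, constraints_ok=True))
```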

4.3. FA-MADDPG Algorithm Model

4.3.1. Attention-Based MADDPG Algorithm

In single agent deep reinforcement learning, agents often suffer from partial observability and non-stationarity in dynamic DT environments. The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm extends the deterministic policy gradient framework to multi-agent settings, enabling efficient learning through centralized training and decentralized execution. During training in a virtual twin layer, each agent has access to the global state and joint actions, while at execution, each agent relies only on its local observation for decision making.
Each agent in MADDPG is associated with an actor network $\mu_k(o_k(t); \theta_k)$, which maps local observations to deterministic actions, and a critic network $Q_k(s(t), a(t); w_k)$, which evaluates the joint state–action pair using global information. To stabilize learning, target networks $\mu_k'(o_k(t); \theta_k')$ and $Q_k'(s(t), a(t); w_k')$ with slowly updated parameters are used to provide smooth temporal-difference (TD) targets.
However, as the number of agents increases, the amount of global DT information available during training can become excessive and potentially irrelevant. This may degrade learning efficiency and slow convergence. To address this, an attention-based extension to MADDPG can be used, which enables agents to focus selectively on more relevant agents. This mirrors the natural tendency of agents in physical environments such as base stations to primarily interact with nearby or cooperating agents, enabling more precise and accurate DT construction and effective training.
Incorporating an attention mechanism into the critic network enables more refined modeling of agent interactions. In each experience tuple, the collective state–action information of all agents is denoted by $(s, a)$, where $s = (o_1, o_2, \ldots, o_{M_B})$ represents observations and $a = (a_1, a_2, \ldots, a_n)$ represents actions. The Q value for agent $k$ is computed as
$$Q_k(s, a) = f_k\big(h_k(o_k, a_k), c_k\big)$$
Here, $h_k$ is a single-layer perceptron encoding agent $k$'s own observation–action pair, and $f_k$ is a two-layer MLP that produces the final Q value. The term $c_k$ represents the contextual attention vector, which captures the influence of the other agents $k' \neq k$ on the decision making of agent $k$. It is computed through a key–value attention mechanism:
$$c_k = \sum_{k' \neq k} p_{k'} v_{k'}$$
$$p_{k'} = \mathrm{softmax}(s_{k'})$$
$$s_{k'} = \frac{q_k^{\top} k_{k'}}{\sqrt{d_k}}$$
Each agent encodes its observation–action pair via a shared embedding and linear projections into query $q_k$, key $k_k$, and value $v_k$ vectors in $\mathbb{R}^{128 \times 1}$. Using the learnable projection matrices $Q, K, V \in \mathbb{R}^{128 \times 128}$, $q_k$, $k_k$, and $v_k$ are obtained as follows:
$$q_k = Q\, h_k(o_k, a_k),$$
$$k_k = K\, h_k(o_k, a_k),$$
$$v_k = V\, h_k(o_k, a_k)$$
This design allows each agent to weigh the influence of other agents dynamically based on relevance, rather than treating all agents equally. The attention mechanism enables context-aware, soft aggregation of other agents’ information, making the critic network more robust and scalable in large multi-agent systems.
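The following NumPy sketch illustrates the attention-based critic described above for a single agent: a shared embedding $h$, learnable projections $Q$, $K$, $V$, scaled dot-product attention over the other agents, and a two-layer MLP $f_k$. The 128-dimensional embedding follows the text; the observation and action sizes, random weights, and activation functions are assumptions for illustration, and in practice the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_act, d_emb = 8, 3, 128      # observation/action sizes assumed; embedding dim 128 as in the text
n_agents = 5

# Shared single-layer embedding h(.) and projection matrices Q, K, V from the equations above
W_h = rng.normal(scale=0.1, size=(d_obs + d_act, d_emb))
W_q = rng.normal(scale=0.1, size=(d_emb, d_emb))
W_k = rng.normal(scale=0.1, size=(d_emb, d_emb))
W_v = rng.normal(scale=0.1, size=(d_emb, d_emb))
W_f1 = rng.normal(scale=0.1, size=(2 * d_emb, d_emb))   # two-layer MLP f_k(.)
W_f2 = rng.normal(scale=0.1, size=(d_emb, 1))

obs = rng.normal(size=(n_agents, d_obs))
act = rng.normal(size=(n_agents, d_act))
h = np.tanh(np.concatenate([obs, act], axis=1) @ W_h)    # h_k(o_k, a_k) for every agent

def q_value(k):
    """Attention-based critic Q_k(s, a): agent k attends over all other agents."""
    others = [i for i in range(n_agents) if i != k]
    q = h[k] @ W_q                                       # query of agent k
    keys = h[others] @ W_k                               # keys of the other agents
    vals = h[others] @ W_v                               # values of the other agents
    scores = keys @ q / np.sqrt(d_emb)                   # scaled dot-product scores
    p = np.exp(scores - scores.max()); p /= p.sum()      # softmax attention weights
    c_k = p @ vals                                       # contextual attention vector
    x = np.concatenate([h[k], c_k])
    return (np.maximum(x @ W_f1, 0.0) @ W_f2)[0]         # f_k(h_k, c_k) -> scalar Q value

print([round(float(q_value(k)), 3) for k in range(n_agents)])
```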

4.3.2. Flash Attention Mechanism

In our proposed DT environment, MADDPG agents undertake centralized training and decentralized execution, which incurs high time and energy cost. Moreover, with the incorporation of the attention mechanism, the computational complexity becomes more significant. However, edge computing devices usually suffer from limitations such as restricted computational capability and strict power constraints, which poses a challenge to DT-real world fusion. Given these drawbacks, ensuring the quality of service is difficult since tasks in an IoT context are time-bounded and power sensitive.
The long training time is partly due to reads and writes between GPU memory levels [11]; GPU SRAM offers high speed but limited capacity, while GPU HBM has larger capacity but lower speed. When SRAM exceeds its storage capacity, a matrix is transferred to HBM, and reloaded into SRAM when needed again. This process costs time.
To decrease training time, a flash attention mechanism is introduced into our paradigm. We redesign the critic network's attention mechanism by coupling flash attention into the Q value estimation process for multi-agent coordination. By using a tiling strategy that splits large matrices into small blocks for computation, matrix movement between HBM and SRAM is reduced; flash attention lowers training time by avoiding full materialization of $s$ and $p$.
The matrices $q, k, v \in \mathbb{R}^{N \times d}$ are computed first and stored in HBM. Let the SRAM capacity be $M$. The column block size is set as $B_c = \left\lceil \frac{M}{4d} \right\rceil$ and the row block size as $B_r = \min\left( \left\lceil \frac{M}{4d} \right\rceil, d \right)$ to divide $q, k, v, c$.
The final output matrix $c$ and the intermediate vectors $\ell$ and $m$ that keep temporary results are stored in HBM; they are initialized as
$$c = (0)_{N \times d} \in \mathbb{R}^{N \times d}, \quad \ell = (0)_{N} \in \mathbb{R}^{N}, \quad m = (-\infty)_{N} \in \mathbb{R}^{N}$$
The matrices are then split: $q$ and $c$ into $\lceil \frac{N}{B_r} \rceil$ blocks of size $B_r \times d$, $\ell$ and $m$ into $\lceil \frac{N}{B_r} \rceil$ blocks of size $B_r \times 1$, and $k$ and $v$ into $\lceil \frac{N}{B_c} \rceil$ blocks of size $B_c \times d$.
When the above preparation is done, the computation process starts: first, one key block $k_\beta$ and the corresponding value block $v_\beta$ are loaded from HBM into SRAM. Next, each query block $q_\alpha$, the current partial output block $c_\alpha$, and the corresponding normalization vectors $\ell_\alpha$ and $m_\alpha$ are sequentially loaded from HBM into SRAM and processed by the following steps.
Computation is performed to calculate the partial attention score matrix, as follows:
$$s_{\alpha\beta} = \frac{q_\alpha k_\beta^{\top}}{\sqrt{d}} \in \mathbb{R}^{B_r \times B_c}$$
For numerical stability during the softmax computation, the row-wise maximum $\tilde{m}_{\alpha\beta}$ is computed, and the matrix is shifted and exponentiated to obtain $\tilde{p}_{\alpha\beta}$. The row-wise sum $\tilde{\ell}_{\alpha\beta}$ is also computed.
$$\tilde{m}_{\alpha\beta} = \mathrm{rowmax}(s_{\alpha\beta})$$
$$\tilde{p}_{\alpha\beta} = \exp(s_{\alpha\beta} - \tilde{m}_{\alpha\beta})$$
$$\tilde{\ell}_{\alpha\beta} = \mathrm{rowsum}(\tilde{p}_{\alpha\beta})$$
Subsequently, updated normalization terms are computed via
$$m_\alpha^{\mathrm{new}} = \max(m_\alpha, \tilde{m}_{\alpha\beta})$$
$$\ell_\alpha^{\mathrm{new}} = e^{\,m_\alpha - m_\alpha^{\mathrm{new}}} \ell_\alpha + e^{\,\tilde{m}_{\alpha\beta} - m_\alpha^{\mathrm{new}}} \tilde{\ell}_{\alpha\beta}$$
The output block $c_\alpha$ is then updated as follows:
$$c_\alpha \leftarrow \mathrm{diag}\left(\ell_\alpha^{\mathrm{new}}\right)^{-1} \left( \mathrm{diag}(\ell_\alpha)\, e^{\,m_\alpha - m_\alpha^{\mathrm{new}}}\, c_\alpha + e^{\,\tilde{m}_{\alpha\beta} - m_\alpha^{\mathrm{new}}}\, \tilde{p}_{\alpha\beta}\, v_\beta \right)$$
The updated $c_\alpha$, $\ell_\alpha$, and $m_\alpha$ are then written back to HBM. Then, the next key block $k_\beta$ and the corresponding value block $v_\beta$ are loaded to repeat the process from (24) to (30), progressively accumulating the final output matrix $c$ without storing the full attention matrix in SRAM. Once all blocks have been processed, the complete output matrix $c$ is obtained.
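A NumPy sketch of the tiled computation described above, looping over key/value blocks in the outer loop and query blocks in the inner loop while maintaining the running statistics $m$ and $\ell$. The block sizes here are arbitrary assumptions (not derived from an actual SRAM capacity), and the final check against the standard, fully materialized softmax attention verifies that the tiled result is exact.

```python
import numpy as np

def flash_attention(q, k, v, b_r=16, b_c=16):
    """Tiled attention with an online softmax, mirroring the block updates above."""
    n, d = q.shape
    c = np.zeros((n, d))                 # output accumulator (kept in HBM in the real kernel)
    ell = np.zeros(n)                    # running softmax denominators
    m = np.full(n, -np.inf)              # running row-wise maxima
    for j in range(0, n, b_c):           # outer loop over key/value blocks
        k_b, v_b = k[j:j + b_c], v[j:j + b_c]
        for i in range(0, n, b_r):       # inner loop over query/output blocks
            s = q[i:i + b_r] @ k_b.T / np.sqrt(d)          # partial score block
            m_t = s.max(axis=1)                            # row-wise maximum
            p_t = np.exp(s - m_t[:, None])                 # shifted exponentials
            l_t = p_t.sum(axis=1)                          # row-wise sum
            m_new = np.maximum(m[i:i + b_r], m_t)          # updated running maximum
            l_new = (np.exp(m[i:i + b_r] - m_new) * ell[i:i + b_r]
                     + np.exp(m_t - m_new) * l_t)          # updated running denominator
            c[i:i + b_r] = (np.exp(m[i:i + b_r] - m_new)[:, None] * ell[i:i + b_r, None] * c[i:i + b_r]
                            + np.exp(m_t - m_new)[:, None] * (p_t @ v_b)) / l_new[:, None]
            m[i:i + b_r], ell[i:i + b_r] = m_new, l_new
    return c

# Exactness check against the standard (fully materialized) attention
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(64, 32)) for _ in range(3))
s = q @ k.T / np.sqrt(q.shape[1])
p = np.exp(s - s.max(axis=1, keepdims=True))
ref = (p @ v) / p.sum(axis=1, keepdims=True)
print(np.allclose(flash_attention(q, k, v), ref))   # True
```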

4.3.3. Joint Optimization Algorithm Framework

Based on the previously defined components in MDP problem construction, an optimization algorithm for a formulated problem in a DT-based IoT world utilizing the Flash Attention Multi-Agent Deep Deterministic Policy Gradient (FA-MADDPG) framework is proposed, as detailed in Algorithm 1. To aid in comprehension, Figure 2 illustrates the overall architecture of our proposed algorithm.   
Algorithm 1: Joint optimization with FA-MADDPG
At the start of each training episode, each agent $k$ observes its local environment and receives the initial state $o_k(t)$, while a central controller deployed on the primary agent gathers the global state $s(t)$. Based on this shared global state and individual policies, each actor network generates its respective action $a_k(t) = \mu_k(s(t); \theta_k) + \Psi(t)$, where $\Psi(t)$ denotes temporally correlated noise sampled from a predefined exploration process, encouraging adequate exploration of the action space.
After executing the selected actions $a_k(t)$, each agent receives an individual reward $r_k(t)$ and updates its local observation to $o_k(t+1)$, while the global environment transitions to a new state $s(t+1)$. The tuple $\langle s(t), a(t), r(t), s(t+1) \rangle$ is then stored in a shared replay buffer for future learning.
Then, a mini-batch of transitions is sampled from the experiences in the replay buffer for training. For each agent $k$, embeddings are computed from both observations and actions using a shared MLP. These embeddings are then transformed into query, key, and value matrices, which are further processed by the flash attention mechanism. This includes partitioning into smaller blocks and performing normalized attention computations through steps such as maximum extraction, softmax approximation, and attention-weighted aggregation, as detailed in Equations (24)–(30).
Following the attention computation, the agent evaluates its Q value using the critic network, and the corresponding target value $y = r_k + \gamma Q_k'(s', a')$ is derived. The critic is updated by minimizing the loss between the predicted and target Q values. Simultaneously, the actor is refined through policy gradient techniques, leveraging the learned critic to guide the updates. Finally, all networks undergo soft updates using an averaging scheme, ensuring stable learning across episodes.
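The sketch below illustrates this update step for a single agent with plain dense networks in TensorFlow 2: the TD target $y = r + \gamma Q'(s', \mu'(s'))$, the critic MSE update, the actor policy-gradient update, and the soft target update. The network sizes, learning rates, and the random mini-batch are assumptions for illustration, and the attention-based critic is replaced here by a simple MLP for brevity.

```python
import tensorflow as tf

obs_dim, act_dim, gamma, tau = 10, 3, 0.99, 0.005   # assumed dimensions and hyperparameters

def mlp(out_dim, out_act=None):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(out_dim, activation=out_act)])

actor, critic = mlp(act_dim, "tanh"), mlp(1)
target_actor, target_critic = mlp(act_dim, "tanh"), mlp(1)
a_opt, c_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-3)

# Build the networks and copy the initial weights into the targets
dummy_o, dummy_oa = tf.zeros((1, obs_dim)), tf.zeros((1, obs_dim + act_dim))
actor(dummy_o); target_actor(dummy_o); critic(dummy_oa); target_critic(dummy_oa)
target_actor.set_weights(actor.get_weights()); target_critic.set_weights(critic.get_weights())

# A fake mini-batch standing in for samples from the shared replay buffer
B = 32
o, a = tf.random.normal((B, obs_dim)), tf.random.normal((B, act_dim))
r, o2 = tf.random.normal((B, 1)), tf.random.normal((B, obs_dim))

# Critic update: y = r + gamma * Q'(s', mu'(s'))
a2 = target_actor(o2)
y = r + gamma * target_critic(tf.concat([o2, a2], axis=1))
with tf.GradientTape() as tape:
    q = critic(tf.concat([o, a], axis=1))
    critic_loss = tf.reduce_mean(tf.square(y - q))
c_opt.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                          critic.trainable_variables))

# Actor update: maximize Q(s, mu(s)) by minimizing its negative
with tf.GradientTape() as tape:
    actor_loss = -tf.reduce_mean(critic(tf.concat([o, actor(o)], axis=1)))
a_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                          actor.trainable_variables))

# Soft (Polyak) update of the target networks
for t_var, var in zip(target_actor.variables + target_critic.variables,
                      actor.variables + critic.variables):
    t_var.assign(tau * var + (1.0 - tau) * t_var)
```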

4.3.4. Complexity Analysis

In each training episode, $M_B$ base stations generate experiences. For training, a batch of $E$ experiences is then sampled for each agent. The actor and critic are neural networks, and the dimensionality of a network has the greatest impact on time complexity. Let $S$ denote the state space dimensionality, $H$ the number of hidden-layer neurons, and $A$ the action space dimensionality. There are two hidden layers in our settings, and $H$ is usually greater than $S$ and $A$, so the time complexity of the actor is $O_a = O(SH + H^2 + HA) = O(H^2)$. In the flash attention computation, the time complexity of computing $s_{\alpha\beta}$ is $O(B_r d B_c)$. The softmax process in (25)–(27) is element-wise, with time complexity $O(B_r B_c)$. The time complexity of (28)–(30) is $O(B_r B_c d)$. Thus, the computation for each $(\alpha, \beta)$ pair has time complexity $O(B_r B_c d)$. The number of iterations is $\lceil \frac{N}{B_r} \rceil \lceil \frac{N}{B_c} \rceil$, so the time complexity of flash attention is $\lceil \frac{N}{B_r} \rceil \lceil \frac{N}{B_c} \rceil \, O(B_r B_c d) = O(N^2 d)$. $h_k$ is a single-layer perceptron encoding an observation–action pair, and $f_k$ is a two-layer MLP that produces the final Q value. Let $M$ denote the dimensionality of the encoding result and $C$ the dimensionality of the attention vector. The time complexity of the critic is then $O_c = O((A+S)M + (M+C)H + H^2 + H) + O(N^2 d) = O(N^2 d + H^2)$. The target networks share the same structure. The overall time complexity of FA-MADDPG is therefore $O(2 M_B E (N^2 d + 2H^2))$.

5. Simulation and Results

In this section, experiments are conducted and the results are analyzed. The whole program is run on an AMD 6800H CPU and an Nvidia 2050 GPU. The DT IoT world is established in Python 3.8, and the FA-MADDPG algorithm is implemented with the TensorFlow framework. The task type is set following Section 4.2.1. If not otherwise specified, the experimental settings are as shown in Table 2.

5.1. Convergence of Proposed Algorithm

The convergence process of the FA-MADDPG algorithm is shown in Figure 3.
At the beginning, the reward of the proposed FA-MADDPG algorithm fluctuates between 140 and 180. As the training episodes increase, the reward rises quickly, and the curve converges after about 1200 training episodes. The reward finally stabilizes at around 260.

5.2. Time Efficiency Analysis

The time efficiency comparison of different algorithms is shown in Figure 4; the experiment settings are listed in Table 2. Training time is plotted on the x-axis and reward on the y-axis. Attention-MADDPG, MADDPG, QMix, MAPPO, and a random scheme are used as comparison algorithms. Compared with Attention-MADDPG, FA-MADDPG achieves a significantly better reward at every point in time, which demonstrates its high time efficiency. Although QMix, MAPPO, and MADDPG obtain better rewards in the first period, FA-MADDPG achieves a significantly higher reward after 15 min. Moreover, their converged rewards lie between 210 and 230, which are inferior to FA-MADDPG's 260.

5.3. Performance Evaluation

Energy consumption and delay are the key performance indices. The local processing capability, task size, number of base stations, and bandwidth are varied to test the performance and robustness of the FA-MADDPG algorithm.
In Figure 5a,b, the relationship between latency, energy, and local processing capability is shown. The CPU processing frequency varies from 0.6 to 1.1 GHz; the other parameters are listed in Table 2. When the local computation capacity increases, all algorithms achieve lower latency and energy cost. This is because local processing time decreases as processing capability improves; since CPU power stays the same, the shorter time also reduces energy cost. In addition, the policies tend to process more tasks locally with a better local CPU, so communication time and energy decrease. FA-MADDPG and Attention-MADDPG perform better than the random scheme and the other MARL algorithms.
In Figure 6a,b, the relationship between latency, energy, and task size is shown. The task size varies from 80 to 180 KB; the other parameters are listed in Table 2. When the task size increases, all algorithms experience higher latency because computing time increases, and higher energy cost because processing larger tasks requires more energy. FA-MADDPG achieves excellent performance in both energy and delay.
In Figure 7a,b, the relationship between latency, energy, and the number of base stations is shown. The number varies from 3 to 8; the other parameters are listed in Table 2. When the number of base stations increases, all algorithms achieve lower latency and energy. This is because, in a fixed area, more base stations shorten the device–base station distance and provide more resources, so a task needs less time and energy to offload. FA-MADDPG and Attention-MADDPG achieve better performance in both energy and delay than the other algorithms. This shows that FA-MADDPG generalizes to heterogeneous networks.
In Figure 8a,b, the relationship between latency, energy, and bandwidth is shown. The bandwidth varies from 40 to 140 MHz; the other parameters are listed in Table 2. When the bandwidth expands, all algorithms achieve lower latency and energy because the communication conditions improve, so a task needs less time and energy to transmit. FA-MADDPG and Attention-MADDPG achieve better performance in both energy and delay than the other algorithms. This shows that the proposed algorithm performs well in both non-stationary and favorable network environments.
Among all algorithms, FA-MADDPG and Attention-MADDPG have the lowest latency and energy cost regardless of how the other conditions change. Their performance is close because they share the same algorithm structure except for the matrix computation method (the flash mechanism) used in FA-MADDPG. This further attests to the validity of the proposed scheme: FA-MADDPG delivers excellent performance while maintaining high time efficiency. The random algorithm incurs the highest latency and energy cost because its actions are chosen randomly without any optimization. MADDPG, QMix, and MAPPO are better than random but inferior to the proposed FA-MADDPG because, although they use observations from multiple agents, they cannot focus on the most important and valuable information without an attention mechanism.

5.4. Flash Attention Benefits Analysis

Of the algorithms used in the above experiments, FA-MADDPG is MADDPG with flash attention, Attention-MADDPG is MADDPG with standard attention, and MADDPG has no attention. Flash attention brings not only better performance but also higher time efficiency. Its benefits are analyzed as follows.
Compared with no attention, attention leads to better performance. In Section 5.2, when training time reaches 35 min (all algorithms have converged), the reward with attention reaches around 260, higher than MADDPG’s 230. Additionally, in Section 5.3, regardless of experimental settings, FA-MADDPG and Attention-MADDPG consistently show better performance in delay and energy consumption. Attention provides stable and robust high performance for DT-powered MEC. Compared with standard attention, flash additionally offers high time efficiency in the training process. In Section 5.2, when training time is the same, FA-MADDPG always gains better reward than Attention-MADDPG. FA-MADDPG converges at around 25 min, and Attention-MADDPG converges at around 30 min, saving around 16% time, which is significant for MEC.

6. Conclusions

The rapid development of mobile edge computing presents significant challenges in offloading decisions and resource allocation, which can influence system performance and user experiences. This paper introduces a comprehensive digital twin model tailored for MEC. This model facilitates real-time interaction with the physical world and supports intelligent decision making. Central to this model is the proposed Flash-Attention Enhanced Multi-Agent Deep Deterministic Policy Gradient algorithm. It is designed to effectively address the challenges in MEC environments.
The integration of attention mechanisms within the FA-MADDPG framework enhances the decision-making quality by allowing agents to focus on relevant information, thereby improving system performance in terms of both delay and energy consumption. Furthermore, the flash mechanism optimizes training efficiency, significantly reducing the time required for convergence compared with traditional methods.
Experimental results demonstrate that FA-MADDPG outperforms other algorithms, achieving superior convergence properties with an average reward of approximately 260. It also maintains high time efficiency, saving around 16% time to converge. The advantages of utilizing FA-MADDPG are evident, as they provide stability and robustness to the decision-making process in dynamic and various settings.
However, the proposed method may face limitations when scaling to extremely large agent populations or under strict latency constraints. Future research may extend the FA-MADDPG framework by incorporating additional system constraints, enhancing scalability, and optimizing implementation. Furthermore, the integration of flash attention mechanisms into alternative reinforcement learning algorithms represents a promising direction for continued investigation.

Author Contributions

Conceptualization, Y.G. and X.Y.; methodology, Y.G.; software, S.W.; validation, S.W., L.C. and Z.Z.; formal analysis, T.W.; investigation, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, X.Y.; visualization, L.C.; funding acquisition, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (62371116).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Symbol   Meaning
$\mathcal{U}$   user set
$\mathcal{T}$   time slot set
$\mathcal{C}$   task set
$\mathcal{B}$   base station set
$n$   number of tasks
$(x_b, y_b, z_b)$   base station Cartesian coordinates
$(x_u, y_u, 0)$   user device Cartesian coordinates
$w$   channel bandwidth
$p^{\mathrm{trans}}$   data transmission power
$p^n$   noise power
$g$   channel gain
$r$   required transmission rate
$s$   task size
$c^l$   local computation ability
$t^l$   local computation time
$p^l$   local computation power
$e^l$   local computation energy consumption
$c^b$   base station computation ability
$p^b$   base station computation power
$t^o$   total offloading time
$e^o$   total offloading energy consumption
$t$   total delay for one task
$e$   total energy consumption for one task
$TW$   task weight
$\theta$   weight coefficient
$o$   observation
$\mathcal{S}$   state space
$\mathcal{A}$   action space
$r$   reward

References

  1. Loutfi, S.I.; Shayea, I.; Tureli, U.; El-Saleh, A.A.; Tashan, W. An overview of mobility awareness with mobile edge computing over 6G network: Challenges and future research directions. Results Eng. 2024, 202, 102601. [Google Scholar] [CrossRef]
  2. Feng, C.; Han, P.; Zhang, X.; Yang, B.; Liu, Y.; Guo, L. Computation offloading in mobile edge computing networks: A survey. J. Netw. Comput. Appl. 2022, 202, 103366. [Google Scholar] [CrossRef]
  3. Yuan, X.; Chen, J.; Yang, J.; Zhang, N.; Yang, T.; Han, T.; Taherkordi, A. Fedstn: Graph representation driven federated learning for edge computing enabled urban traffic flow prediction. IEEE Trans. Intell. Transp. Syst. 2022, 24, 8738–8748. [Google Scholar] [CrossRef]
  4. Hakiri, A.; Gokhale, A.; Yahia, S.B.; Mellouli, N. A comprehensive survey on digital twin for future networks and emerging Internet of Things industry. Comput. Netw. 2024, 244, 110350. [Google Scholar] [CrossRef]
  5. Tang, F.; Chen, X.; Rodrigues, T.K.; Zhao, M.; Kato, N. Survey on digital twin edge networks (DITEN) toward 6G. IEEE Open J. Commun. Soc. 2022, 3, 1360–1381. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Liang, W.; Xu, W.; Xu, Z.; Jia, X. Cost minimization of digital twin placements in mobile edge computing. ACM Trans. Sens. Netw. 2024, 20, 1–26. [Google Scholar] [CrossRef]
  7. Hasan, M.K.; Jahan, N.; Nazri, M.Z.A.; Islam, S.; Khan, M.A.; Alzahrani, A.I.; Alalwan, N.; Nam, Y. Federated learning for computational offloading and resource management of vehicular edge computing in 6G-V2X network. IEEE Trans. Consum. Electron. 2024, 70, 3827–3847. [Google Scholar] [CrossRef]
  8. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  9. Zhu, C.; Dastani, M.; Wang, S. A survey of multi-agent deep reinforcement learning with communication. Auton. Agents Multi-Agent Syst. 2024, 38, 4. [Google Scholar] [CrossRef]
  10. Wu, J.; Li, D.; Yu, Y.; Gao, L.; Wu, J.; Han, G. An attention mechanism and adaptive accuracy triple-dependent MADDPG formation control method for hybrid UAVs. IEEE Trans. Intell. Transp. Syst. 2024, 25, 11648–11663. [Google Scholar] [CrossRef]
  11. Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  12. Han, Y.; Niyato, D.; Leung, C.; Kim, D.I.; Zhu, K.; Feng, S.; Shen, X.; Miao, C. A dynamic hierarchical framework for IoT-assisted digital twin synchronization in the metaverse. IEEE Internet Things J. 2022, 10, 268–284. [Google Scholar] [CrossRef]
  13. Zhang, R.; Xie, Z.; Yu, D.; Liang, W.; Cheng, X. Digital twin-assisted federated learning service provisioning over mobile edge networks. IEEE Trans. Comput. 2023, 73, 586–598. [Google Scholar] [CrossRef]
  14. Qu, Z.; Li, Y.; Liu, B.; Gupta, D.; Tiwari, P. Dtqfl: A digital twin-assisted quantum federated learning algorithm for intelligent diagnosis in 5G mobile network. IEEE J. Biomed. Health. Inf. 2023, early access. [Google Scholar] [CrossRef] [PubMed]
  15. Zhuansun, C.; Li, P.; Liu, Y.; Tian, Z. Generative AI-Assisted Mobile Edge Computation Offloading in Digital Twin-Enabled IIoT. IEEE Internet Things J. 2025, 12, 13248–13258. [Google Scholar] [CrossRef]
  16. He, Y.; Yang, M.; He, Z.; Guizani, M. Resource allocation based on digital twin-enabled federated learning framework in heterogeneous cellular network. IEEE Trans. Veh. Technol. 2022, 72, 1149–1158. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Hu, J.; Min, G. Digital twin-driven intelligent task offloading for collaborative mobile edge computing. IEEE J. Sel. Areas Commun. 2023, 41, 3034–3045. [Google Scholar] [CrossRef]
  18. Wang, B.; Sun, Y.; Jung, H.; Nguyen, L.D.; Vo, N.S.; Duong, T.Q. Digital twin-enabled computation offloading in UAV-assisted MEC emergency networks. IEEE Wirel. Commun. Lett. 2023, 12, 1588–1592. [Google Scholar] [CrossRef]
  19. Qu, Q.; Xu, R.; Sun, H.; Chen, Y.; Sarkar, S.; Ray, I. A Digital Healthcare Service Architecture for Seniors Safety Monitoring in Metaverse. In Proceedings of the 2023 IEEE International Conference on Metaverse Computing, Networking and Applications (MetaCom), Tokyo, Japan, 26–28 June 2023. [Google Scholar]
  20. Lin, L.; Chen, W.; He, Q.; Xiong, J.; Lin, J.; Lin, L. VECO: A Digital Twin-Empowered Framework for Efficient Vehicular Edge Caching and Computation Offloading. IEEE Trans. Intell. Transp. Syst. 2025, 1588–1592. [Google Scholar] [CrossRef]
  21. Li, Y.; Huang, L.; Yu, Q.; Ning, Q. Optimization of Synchronization Frequencies and Offloading Strategies in MEC-Assisted Digital Twin Networks. IEEE Internet Things J. 2025, early access. [Google Scholar] [CrossRef]
  22. Hou, W.; Wen, H.; Song, H.; Lei, W.; Zhang, W. Multiagent deep reinforcement learning for task offloading and resource allocation in cybertwin-based networks. IEEE Internet Things J. 2021, 8, 16256–16268. [Google Scholar] [CrossRef]
  23. Suzuki, A.; Kobayashi, M. Multi-Agent Deep Reinforcement Learning for Cooperative Offloading in Cloud-Edge Computing. In Proceedings of the ICC 2022-IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022. [Google Scholar]
  24. Peng, H.; Shen, X. Multi-agent reinforcement learning based resource management in MEC-and UAV-assisted vehicular networks. IEEE J. Sel. Areas Commun. 2020, 39, 131–141. [Google Scholar] [CrossRef]
  25. Du, J.; Kong, Z.; Sun, A.; Kang, J.; Niyato, D.; Chu, X.; Yu, F.R. MADDPG-based joint service placement and task offloading in MEC empowered air–ground integrated networks. IEEE Internet Things J. 2023, 11, 10600–10615. [Google Scholar] [CrossRef]
  26. Xue, J.; Wang, L.; Yu, Q.; Mao, P. Multi-Agent Deep Reinforcement Learning-based Partial Offloading and Resource Allocation in Vehicular Edge Computing Networks. Comput. Commun. 2025, 234, 108081. [Google Scholar] [CrossRef]
  27. Zhang, X.; Wang, C.; Zhu, Y.; Cao, J.; Liu, T. Multi-Agent Deep Reinforcement Learning with Trajectory Prediction for Task Migration-Assisted Computation Offloading. IEEE Trans. Mob. Comput. 2025, 24, 5839–5856. [Google Scholar] [CrossRef]
  28. Yao, S.; Wang, M.; Ren, J.; Xia, T.; Wang, W.; Xu, K.; Xu, M.; Zhang, H. Multi-Agent Reinforcement Learning for Task Offloading in Crowd-Edge Computing. IEEE Trans. Mob. Comput. 2025. early access. [Google Scholar] [CrossRef]
  29. Huang, J.; Zhou, F.; Feng, L.; Li, W.; Zhao, M.; Yan, X.; Xi, Y.; Wu, J. Digital Twin Assisted DAG Task Scheduling via Evolutionary Selection MARL in Large-Scale Mobile Edge Network. In Proceedings of the 2023 IEEE International Conference on Communications Workshops (ICC Workshops), Rome, Italy, 28 May 2023. [Google Scholar]
  30. Zhou, F.; Feng, L.; Kadoch, M.; Yu, P.; Li, W.; Wang, Z. Multiagent RL aided task offloading and resource management in Wi-Fi 6 and 5G coexisting industrial wireless environment. IEEE Trans. Ind. Inf. 2021, 18, 2923–2933. [Google Scholar] [CrossRef]
  31. Liu, T.; Wang, K.; Sha, L.; Chang, B.; Sui, Z. Table-to-Text Generation by Structure-Aware Seq2seq Learning. In Proceedings of the AAAI Conference on Artificial Intelligence 2018, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Mao, H.; Zhang, Z.; Xiao, Z.; Gong, Z.; Ni, Y. Learning multi-agent communication with double attentional deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2020, 34, 1–34. [Google Scholar] [CrossRef]
  34. Bono, G.; Dibangoye, J.S.; Simonin, O.; Matignon, L.; Pereyron, F. Solving multi-agent routing problems using deep attention mechanisms. IEEE Trans. Intell. Transp. Syst. 2020, 22, 7804–7813. [Google Scholar] [CrossRef]
  35. Wu, L.; Qu, J.; Li, S.; Zhang, C.; Du, J.; Sun, X.; Zhou, J. Attention-Augmented MADDPG in NOMA-Based Vehicular Mobile Edge Computational Offloading. IEEE Internet Things J. 2024, 11, 27000–27014. [Google Scholar] [CrossRef]
  36. Tanveer, J.; Lee, S.W.; Rahmani, A.M.; Aurangzeb, K.; Alam, M.; Zare, G.; Alamdari, P.M.; Hosseinzadeh, M. PGA-DRL: Progressive graph attention-based deep reinforcement learning for recommender systems. Inf. Fusion 2025, 121, 103167. [Google Scholar] [CrossRef]
  37. Pagliardini, M.; Paliotta, D.; Jaggi, M.; Fleuret, F. Fast Attention Over Long Sequences with Dynamic Sparse Flash Attention. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
Figure 1. Layered digital twin-based IoT system model.
Figure 2. FA-MADDPG algorithm visualized process.
Figure 3. Convergence curve of FA-MADDPG 0–2500 episodes; the converged reward is around 260 and needs around 1200 episodes to converge.
Figure 4. Results comparison of training 0–35 min; FA-MADDPG achieves the best reward and high time efficiency.
Figure 5. (a) Latency of 0.6–1.1 GHz local CPU. (b) Energy consumption of 0.6–1.1 GHz local CPU. Latency and energy consumption decrease as local CPU frequency increases. FA-MADDPG receives excellent latency and energy consumption for each local CPU frequency.
Figure 6. (a) Latency of 80–180 KB task. (b) Energy consumption of 80–180 KB task. Latency and energy consumption increase as task size increases. FA-MADDPG receives great latency and energy consumption for each task size.
Figure 7. (a) Latency of 3–8 base stations. (b) Energy consumption of 3–8 base stations. Latency and energy consumption decrease as the number of base stations increases. FA-MADDPG receives great latency and energy consumption for each number of base station.
Figure 8. (a) Latency of 40–140 MHz bandwidth. (b) Energy consumption of 40–140 MHz bandwidth. Latency and energy consumption decrease as the bandwidth increases. FA-MADDPG receives excellent latency and energy consumption for each bandwidth.
Table 1. Task weight for IoT system.
Task Weight   Traffic Throughput
0   Background data or unimportant task
1   Notification data or lightweight task
2   Management or normal task
3   Network control or important task
4   Emergency task
Table 2. Experiment settings.
Parameters   Value
Task size   120 KB
MEC server computing frequency   3 GHz
User device computing frequency   0.8 GHz
Channel bandwidth   80 MHz
Maximum transmission power   0.4 W
Noise power   0.001 W
Channel gain   0.01
Number of base stations   5
