Heterogeneous Computing Resources Scheduling Based on Time-Varying Graphs and Multi-Agent Reinforcement Learning

Yuan, Jinshan; Zhang, Xuncai; Gong, Kexin

doi:10.3390/fi18030168

Open Access Article


by Jinshan Yuan ¹, Xuncai Zhang ¹ and Kexin Gong ²,*

¹ China United Network Communications Group Co., Ltd. Qinghai Branch, Xining 810008, China

² School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(3), 168; https://doi.org/10.3390/fi18030168

Submission received: 28 January 2026 / Revised: 11 March 2026 / Accepted: 16 March 2026 / Published: 20 March 2026

(This article belongs to the Special Issue Collaborative Intelligence for Connected Agents)


Abstract

The evolution toward 6G Computing Power Networks (CPN) aims to deeply integrate multi-tier computing resources across Cloud, Edge, and end devices. However, the significant heterogeneity of computing resources, characterized by varying hardware architectures such as CPUs, GPUs, and NPUs, coupled with the time-varying network topology caused by terminal mobility, poses severe challenges to realizing efficient integrated scheduling that satisfies Quality of Service (QoS). To address spatiotemporal mismatches between task requirements and hardware architectures, this paper proposes an integrated scheduling method combining Discrete Time-Varying Graph (DTVG) construction with Multi-Agent Reinforcement Learning (MARL). Specifically, we model the dynamic interaction between mobile tasks and heterogeneous nodes as a DTVG to capture spatiotemporal evolution and employ a QMIX-based algorithm to enable collaborative decision-making among distributed agents. Simulation results demonstrate that the proposed approach effectively solves the joint optimization problem of heterogeneous resource matching and dynamic path planning, significantly outperforming traditional baselines in terms of resource utilization and average latency. This study confirms that incorporating graph-theoretic modeling with reinforcement learning offers a robust solution for the complex coupling of communication and computation in dynamic 6G networks.

Keywords: computing power network; multi-agent systems; heterogeneous computing; time-varying graph

1. Introduction

With the rapid evolution of 5G and prospective 6G networks, network architecture is shifting from traditional data transmission pipes toward the deep convergence of computing and networking, known as the Computing Power Network (CPN) [1]. This shift is driven by the vision of ubiquitous intelligence, where connectivity and computation are seamlessly integrated [2]. In critical scenarios such as Vehicle-to-Everything (V2X) and the Industrial Internet of Things (IIoT), the exponential growth of data demands ultra-low-latency, high-reliability processing [3,4]. To meet these demands, the multi-tier computing architecture, comprising Cloud, Edge, and End devices, has emerged as a critical infrastructure. By deploying computing resources at the network Edge, for example at base stations and roadside units, computation tasks can be offloaded from resource-constrained mobile terminals to Edge nodes, thereby significantly reducing service latency and backhaul bandwidth consumption [5,6]. A typical system scenario of such a heterogeneous computing power network in a dynamic V2X environment is illustrated in Figure 1.

However, efficient integrated scheduling in CPN is hindered by two fundamental challenges. First, computing resources in 6G networks are inherently heterogeneous at multiple levels. This heterogeneity manifests not only in the hierarchical distribution across Cloud, Edge, and End tiers, but also in the diversity of hardware architectures, including CPUs, GPUs, and NPUs [7]. Critically, different task types have distinct affinities for specific computing resources. For instance, AI inference and Deep Neural Network (DNN) partitioning necessitate GPU acceleration [8], whereas logic control operations primarily rely on general-purpose CPU capabilities. Existing works nevertheless often treat computing resources as generic CPU cycles, ignoring the specific acceleration capabilities (e.g., GPU/NPU) required by AI tasks, which leads to inefficient resource utilization even when load balancing is achieved. Neglecting such heterogeneous attributes during scheduling results in resource mismatches and severe load imbalance, wherein high-performance nodes remain underutilized while low-capacity nodes become overloaded, thereby substantially degrading Quality of Service (QoS). Second, the high mobility of terminals introduces severe spatiotemporal dynamics into the communication links, resulting in a Time-Varying Graph (TVG) topology [9]. As users move, the physical connection between a terminal and an Edge node becomes unstable, potentially causing frequent handovers and link interruptions [10]. A computing node that is optimal in the current time slot may become unreachable in the next due to limited communication coverage. Consequently, the scheduling problem becomes a complex joint optimization of communication resources (bandwidth, channel quality) and heterogeneous computing resources under dynamic topological constraints.

Traditional static optimization methods and meta-heuristic algorithms, represented by Ant Colony Optimization (ACO) [11], Particle Swarm Optimization (PSO) [12], and Genetic Algorithms (GA) [13], struggle to adapt to these real-time variations. These iterative approaches often suffer from slow convergence and high computational complexity, failing to guarantee continuity of service and optimal resource matching in highly dynamic environments.

To tackle these challenges, this paper proposes an integrated scheduling method for heterogeneous computing power based on Discrete Time-Varying Graphs (DTVGs) and Multi-Agent Reinforcement Learning (MARL). We model the dynamic interaction between mobile tasks and heterogeneous nodes as a DTVG, capturing the temporal evolution of network topology and resource states. Furthermore, to address the curse of dimensionality and coordination issues in large-scale distributed networks, we employ the QMIX algorithm [14], a value-based MARL framework. By decomposing the global joint value function into local agent utility functions, QMIX enables centralized training with decentralized execution, effectively learning cooperative strategies among heterogeneous nodes.

The main contributions of this paper are summarized as follows:

(1): We construct a heterogeneous Discrete Time-Varying Graph model. Unlike traditional static graphs, this model integrates the spatiotemporal dynamics of communication links with the specific attributes of heterogeneous hardware (e.g., CPU/GPU capacities), transforming the continuous dynamic scheduling problem into a path optimization problem on a discrete graph.
(2): We propose a spatiotemporal feature extraction mechanism based on Long Short-Term Memory (LSTM). By capturing the historical load trends and topological changes from the DTVG, this mechanism provides accurate state representation for the decision-making agent, enhancing the perception of environmental dynamics.
(3): We design a QMIX-based collaborative scheduling algorithm. We formulate the resource allocation problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). A novel reward function is designed to balance task migration costs, heterogeneous resource matching, and QoS satisfaction. Extensive simulations demonstrate that our method outperforms traditional baselines in terms of resource utilization and average delay.

The remainder of this paper is organized as follows. Section 2 reviews related work on heterogeneous Edge computing and intelligent scheduling. Section 3 introduces the system model and mathematically formulates the integrated scheduling problem based on time-varying graphs. Section 4 presents the technical details of the proposed MARL-based scheduling framework. Section 5 details the experimental settings and analyzes the simulation results, including ablation studies and comparative evaluations against baselines. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Computing Power Network and Heterogeneous Resource Scheduling

The convergence of computing and networking has prompted extensive research into CPN and Mobile Edge Computing (MEC) resource scheduling [15,16]. Early studies primarily focused on offloading decisions in Cloud–Edge collaboration scenarios, using convex optimization [17] and game theory [18] to minimize latency or energy consumption. For instance, Wang et al. proposed a Lagrangian dual decomposition method for joint task offloading in ultra-dense networks [19]. Furthermore, meta-heuristic algorithms, such as Particle Swarm Optimization (PSO) [12] and Ant Colony Optimization (ACO) [11], have been widely adopted to tackle the non-convex nature of resource allocation problems. These iterative approaches often suffer from high computational complexity and slow convergence rates, making them difficult to deploy in real-time, highly dynamic V2X environments where the topology changes rapidly. This limitation has been further evidenced in Vehicular Edge Computing (VEC) scenarios, where MARL-based cooperative offloading strategies have shown superior adaptability over iterative optimization under dynamic link conditions [20].

However, a significant limitation of these works is the assumption of node homogeneity, where computing resources are treated as generic CPU cycles. In the context of 6G and AI-native networks, computing power is inherently heterogeneous. This heterogeneity encompasses diverse hardware architectures, including Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). Recent works have begun to address this challenge. Zhou et al. explored resource allocation in heterogeneous Edge systems but focused mainly on capacity differences rather than architectural diversity [21]. Song et al. investigated NOMA-based heterogeneous MEC [22], yet the fine-grained matching between specific task requirements, ranging from AI inference to logic control, and hardware attributes remains underexplored. More recently, Gao et al. proposed a blockchain-assisted heterogeneous resource configuration framework for CPN, jointly optimizing bandwidth, computation, and storage resources to balance latency and energy consumption [23]. Despite these efforts, a concurrent study on CPN further reveals that tasks composed of sequential subtasks exhibit strong preferences for specific hardware at each processing stage, and that existing schemes fail to account for such fine-grained architectural heterogeneity [24].

2.2. Time-Varying Graph Models and Temporal Feature Extraction

Graph theory provides a fundamental framework for modeling network topologies. To capture the high dynamics of modern networks, the concept of Time-Varying Graphs (TVGs) was formalized by Casteigts et al. [9] and has been widely applied in satellite networks [25,26] and Vehicular Ad-hoc Networks (VANETs) [27]. For instance, Han et al. proposed a weighted time-space evolution graph to construct a time-varying topology model for Low Earth Orbit (LEO) satellite constellation networks, enabling adaptive routing under dynamic link conditions [28].

Despite these advancements, standard TVG models in communication networks predominantly focus on the topological evolution of connectivity. They often lack a mechanism to capture the long-term historical trends of node loads and link qualities. To address this, sequence learning techniques, particularly LSTM networks, have been introduced to analyze temporal dependencies in dynamic networks. Yu et al. demonstrated the effectiveness of integrating graph convolution with LSTM for spatio-temporal traffic forecasting [29]. LSTM has demonstrated superior performance in predicting network traffic and user mobility patterns by mitigating the vanishing gradient problem inherent in traditional Recurrent Neural Networks (RNNs) [30,31]. However, integrating LSTM-based temporal feature extraction directly with graph-structural attributes for CPN scheduling remains an open research direction.

2.3. Reinforcement Learning with Partial Observability

Deep Reinforcement Learning (DRL) has emerged as a powerful tool for solving complex, non-convex scheduling problems [32]. Approaches based on Deep Q-Networks (DQN) [33] and Deep Deterministic Policy Gradient (DDPG) [34] have shown success in single-agent scenarios. Nevertheless, centralized DRL struggles with the curse of dimensionality in large-scale networks.

To enable distributed decision-making, MARL has gained significant attention. While Independent Q-Learning (IQL) offers a straightforward approach, it suffers from non-stationarity in multi-agent environments [35]. Value Decomposition Networks (VDN) [36] and QMIX [14] address this by decomposing the global value function to ensure efficient cooperation. A critical challenge in dynamic scheduling is that agents often operate under partial observability, where a single time-slot snapshot is insufficient to capture the environment’s state. To overcome this, Deep Recurrent Q-Networks (DRQN) integrate LSTM units into DRL agents to maintain an internal memory of historical states [37]. Although QMIX has been applied to scenarios such as traffic light control [38], combining QMIX with LSTM to handle the dual challenges of heterogeneous resource coupling and temporal topology variations is still nascent [39]. This motivates our proposal to use LSTM to extract spatiotemporal features from Time-Varying Graphs and thereby enhance the decision-making capability of MARL agents. A recent MATD3-based MARL framework addresses partial offloading and resource allocation in VEC networks [40], yet it neither models the architectural heterogeneity of compute nodes nor incorporates temporal state history, limitations that our work explicitly addresses. Distinct from prior DRL-based approaches in Edge computing [12,21,32], which rely on static network snapshots, our framework explicitly integrates architectural heterogeneity into the DTVG state space, enabling the agent to learn type-aware scheduling policies that standard methods overlook.

In summary, while existing research has explored various facets of resource scheduling and multi-agent reinforcement learning, most models fail to simultaneously address the spatiotemporal topology variations and the multi-tier hardware architecture heterogeneity of 6G-CPNs. This gap motivates the proposed integrated framework, which leverages graph-theoretic modeling to capture dynamic interactions and employs a specialized MARL algorithm for architecture-aware resource matching.

3. System Model and Problem Formulation

In this section, we formalize the heterogeneous computing network architecture and construct the weighted DTVG model. To ground the following mathematical modeling in a realistic context, we consider a 6G-enabled Vehicle-to-Everything (V2X) application scenario. In this environment, high-mobility vehicles act as dynamic task generators that offload compute-intensive AI workloads, such as autonomous driving inference and real-time traffic perception, to the network. The infrastructure is organized into a multi-tier computing power network (CPN) comprising Edge-based Roadside Units (RSUs) and centralized Cloud servers, each equipped with heterogeneous hardware (CPUs, GPUs, and NPUs). The objective is to efficiently schedule these spatiotemporally varying tasks across the available resources. Subsequently, we formulate the integrated scheduling problem as a joint optimization of communication and heterogeneous computation resources.

3.1. Heterogeneous Computing Network Architecture

We consider a multi-tier computing network consisting of three layers: the Cloud Layer, the Edge Layer, and the User Layer, which comprises the User Equipment (UEs).

3.1.1. Heterogeneous Node Model

Let $S = \{s_{1}, s_{2}, \dots, s_{N}\}$ denote the set of available computing nodes, encompassing Cloud servers and Edge nodes (BSs/RSUs). To capture the heterogeneity of computing resources, we define an attribute vector $\mathrm{Attr}_{k}$ for each node $s_{k} \in S$:

$$\mathrm{Attr}_{k} = [C_{k}^{cpu}, C_{k}^{gpu}, C_{k}^{mem}, B_{k}^{\max}, L_{k}], \tag{1}$$

where $C_{k}^{cpu}$ and $C_{k}^{gpu}$ represent the computing capacities (in FLOPS) of the CPU and GPU units, respectively; $C_{k}^{mem}$ denotes the available memory size; $B_{k}^{\max}$ denotes the maximum network bandwidth capacity of the node; and $L_{k} \in \{Cloud, Edge\}$ indicates the network hierarchy level, which determines the baseline network latency.

3.1.2. Task Diversity Model

Let $M = \{m_{1}, m_{2}, \dots, m_{M}\}$ denote the set of computation tasks. Each task $m_{i}$ is a persistent service flow requiring continuous resource allocation and may migrate across nodes due to UE mobility. Each task is characterized by:

$$\mathrm{Req}_{i} = [D_{i}^{data}, W_{i}^{workload}, Type_{i}, T_{i}^{\max}], \tag{2}$$

where $D_{i}^{data}$ is the input data size (bits), $W_{i}^{workload}$ is the computational workload (FLOPs), and $T_{i}^{\max}$ is the maximum tolerable latency. Crucially, $Type_{i}$ represents the workload profile, ranging from compute-intensive AI inference and bandwidth-heavy video transcoding to lightweight status updates. To quantify the computational affinity between task profiles and hardware architectures, we introduce a processing efficiency factor $\alpha_{i, k} \in (0, 1]$. For instance, for a deep learning task ($Type_{i} = AI$), the efficiency on a GPU node $s_{k}$ is high ($\alpha_{i, k} \approx 1$), whereas on a CPU-only node $s_{k'}$, the efficiency is low ($\alpha_{i, k'} \ll 1$).

3.2. Weighted Discrete Time-Varying Graph

Due to node mobility and load dynamics, the network topology evolves over time. We discretize the operational horizon into $T$ time slots and define the Discrete Time-Varying Graph sequence $G = \{G_{1}, G_{2}, \dots, G_{T}\}$, as illustrated in Figure 2. It is important to note that this sequence does not represent a physical network topology. Rather, it constitutes a logical state-action abstraction designed to formulate the scheduling problem as a Markov Decision Process for the reinforcement learning environment.

At any time slot $t$, the logical graph is defined as $G_{t} = (V_{t}, E_{t})$. The node set $V_{t} = M \cup S$ unifies the task workload demands (task nodes $M$) with the available system resources (computing nodes $S$) within a single graph abstraction. To jointly capture spatial resource mapping and temporal state transitions, the edge set $E_{t}$ is composed of two functionally distinct edge types: offloading edges and migration edges. This bipartite-like structure enables the RL algorithm to simultaneously reason about assignment feasibility and migration overhead from a unified graph representation.

3.2.1. Offloading Edge ($E_{T, S}$)

An offloading edge between task node $m_{i}$ and computing node $s_{k}$ exists when two conditions are simultaneously satisfied: the resource matching constraints are met, and node $s_{k}$ lies within the physical communication range of the requesting user. This edge type encodes the spatial action space available to each scheduling agent, ensuring the logical graph strictly adheres to realistic network reachability. The weight of this edge, $W_{i, k}^{edge}$, represents the total service latency, incorporating both transmission and heterogeneous processing latency:

$$W_{i, k}^{edge} = T_{i, k}^{trans} + T_{i, k}^{comp} = \frac{D_{i}^{data}}{B_{i, k}(t)} + \frac{W_{i}^{workload}}{\alpha_{i, k} \cdot C_{k}}, \tag{3}$$

where $T_{i, k}^{trans}$ denotes the transmission latency required to offload task $i$ to computing node $k$, while $T_{i, k}^{comp}$ represents the heterogeneous processing (computation) latency of task $i$ executed on node $k$. The superscripts 'trans' and 'comp' distinguish the communication and computation components of the total service delay, respectively. $B_{i, k}(t)$ is the dynamic real-time wireless bandwidth constrained by the physical channel state and the node’s communication capacity. This offloading edge weight dynamically couples the communication quality with the computational capability, providing the basis for the joint optimization of networking and computing. The term $\alpha_{i, k} \cdot C_{k}$ represents the effective computing power of node $s_{k}$ for task $m_{i}$. A higher $\alpha_{i, k}$ implies a better match between the task and the hardware, resulting in lower processing latency.
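As a worked illustration of Equation (3), the following minimal sketch computes the offloading edge weight for one task on two candidate nodes. The function name, argument names, and all numeric values are hypothetical and are not taken from the paper's simulation settings; $C_{k}$ is treated here as a single scalar capacity in FLOPS.

```python
def offloading_edge_weight(data_bits, workload_flops, bandwidth_bps,
                           alpha, capacity_flops):
    """Total service latency W_edge = T_trans + T_comp, as in Eq. (3)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if bandwidth_bps <= 0 or capacity_flops <= 0:
        raise ValueError("bandwidth and capacity must be positive")
    t_trans = data_bits / bandwidth_bps                   # transmission latency
    t_comp = workload_flops / (alpha * capacity_flops)    # processing latency
    return t_trans + t_comp


# Hypothetical AI task: 8 Mbit of input data, 2 GFLOPs of work, 40 Mbit/s link.
# Both nodes are assumed to have the same nominal capacity (10 GFLOPS);
# only the efficiency factor alpha differs.
gpu_latency = offloading_edge_weight(8e6, 2e9, 40e6, alpha=1.0, capacity_flops=1e10)
cpu_latency = offloading_edge_weight(8e6, 2e9, 40e6, alpha=0.1, capacity_flops=1e10)
print(gpu_latency, cpu_latency)
```

Under these assumed values the transmission term is 0.2 s on both nodes, while the computation term grows from 0.2 s on the well-matched node to 2.0 s on the mismatched one, which is the effect the efficiency factor is meant to capture.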

3.2.2. Migration Edge ($E_{S, S}$)

In contrast to offloading edges, which model instantaneous demand-to-resource mappings, migration edges capture the temporal continuity of task placement across consecutive time slots. Specifically, a migration edge connects computing node $s_{k}$ at time slot $t$ to node $s_{k'}$ at time slot $t + 1$, and its weight quantifies the overhead incurred by transferring the task state between the two nodes:

$$W_{k, k'}^{mig} = \frac{D_{i}^{mig}}{B_{k, k'}^{backhaul}}, \tag{4}$$

where $D_{i}^{mig}$ denotes the size of the task state data, and $B_{k, k'}^{backhaul}$ represents the backhaul bandwidth between nodes. It is worth noting that the bandwidth capacity varies significantly across hierarchy levels, differentiating between horizontal Edge-to-Edge connections and vertical Edge-to-Cloud links.

3.3. Communication and Computation Energy Model

The energy consumption consists of transmission energy and computation energy. The transmission energy for task $m_{i}$ to node $s_{k}$ is $E_{i, k}^{trans} = P_{tx} \cdot T_{i, k}^{trans}$, where $P_{tx}$ is the user’s transmission power. The computation energy is modeled based on the heterogeneous hardware architecture. We introduce an architecture-specific energy coefficient $\kappa_{k}$:

$$E_{i, k}^{comp} = \kappa_{k} \cdot (C_{k})^{2} \cdot W_{i}^{workload}, \tag{5}$$

where $\kappa_{k}$ varies for CPU and GPU architectures. Thus, the total energy cost is $E_{i, k}^{total} = E_{i, k}^{trans} + E_{i, k}^{comp}$.
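The energy model can be evaluated directly from the two terms above. The sketch below is only an arithmetic illustration of $E_{i,k}^{total}$: the function name and every numeric value (transmit power, coefficient $\kappa_{k}$, capacity, workload) are assumptions chosen for readability, not parameters reported in the paper.

```python
def total_energy(p_tx, t_trans, kappa, capacity, workload):
    """E_total = E_trans + E_comp, with E_comp following Eq. (5)."""
    e_trans = p_tx * t_trans                  # P_tx * T_trans
    e_comp = kappa * capacity ** 2 * workload # kappa_k * C_k^2 * W_i
    return e_trans, e_comp, e_trans + e_comp


# Hypothetical values: 0.5 W transmit power, 0.2 s upload time,
# kappa = 1e-27, capacity 1e9, workload 2e9.
e_trans, e_comp, e_total = total_energy(
    p_tx=0.5, t_trans=0.2, kappa=1e-27, capacity=1e9, workload=2e9
)
print(e_trans, e_comp, e_total)
```

Because the computation term scales with the square of $C_{k}$, the architecture-specific coefficient $\kappa_{k}$ determines how costly it is to run a given workload on a faster node.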

3.4. Problem Formulation

Let $x_{i, k}^{t} \in \{0, 1\}$ be the binary decision variable indicating whether task $m_{i}$ is offloaded to node $s_{k}$ at time $t$. Let $y_{i, k, k'}^{t} \in \{0, 1\}$ indicate whether task $m_{i}$ migrates from $s_{k}$ to $s_{k'}$. Our objective is to maximize the long-term system utility $U$, defined as the weighted sum of task completion rewards, heterogeneous resource matching degree, and migration penalties.

The optimization problem is formulated as:

$$\max_{x, y} \; U = \sum_{t = 1}^{T} \sum_{i = 1}^{M} \left( R_{i, t}^{SLA} + \lambda_{1} \alpha_{i, k} - \lambda_{2} I(k \neq k') W_{k, k'}^{mig} \right) \tag{6a}$$

$$\text{s.t.} \quad \sum_{k = 1}^{N} x_{i, k}^{t} = 1, \quad \forall i, t, \tag{6b}$$

$$\sum_{i = 1}^{M} x_{i, k}^{t} \cdot D_{i}^{data} \leq C_{k}^{mem}, \quad \forall k, t, \tag{6c}$$

$$\sum_{i = 1}^{M} x_{i, k}^{t} \cdot B_{i, k}(t) \leq B_{k}^{\max}, \quad \forall k, t, \tag{6d}$$

where $t \in \{1, \dots, T\}$ denotes the discrete time slot index; $i \in \{1, \dots, M\}$ and $k \in \{1, \dots, N\}$ are the indices of tasks and computing nodes, respectively; $\lambda_{1}$ and $\lambda_{2}$ are positive weighting coefficients for architectural affinity and migration penalties; $I(\cdot)$ is the indicator function that equals 1 if the condition $k \neq k'$ holds and 0 otherwise; $W_{k, k'}^{mig}$ is the migration cost; and $D_{i}^{data}$ represents the input data size. $R_{i, t}^{SLA}$ is the reward for satisfying the QoS requirements (latency deadline $T_{i}^{\max}$). The term $\lambda_{1} \alpha_{i, k}$ incentivizes the allocation of tasks to architecturally compatible nodes, such as mapping AI workloads to GPU-accelerated units, while the final term penalizes excessive service migrations between node $k$ at time $t$ and node $k'$ at time $t + 1$. Notably, maximizing $\alpha_{i, k}$ implicitly reduces energy-per-FLOP by steering tasks toward compatible hardware, while $\lambda_{2}$ controls the trade-off between task completion reward and the overhead of service migration. Constraint (6b) ensures each task is served by exactly one node. Constraints (6c) and (6d) act as strict physical bounds to ensure that the allocated tasks do not exceed the memory and the communication bandwidth capacities ($B_{k}^{\max}$) of the node, ensuring the joint feasibility of computation and communication resource allocation.
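To make the formulation concrete, the sketch below checks constraints (6c) and (6d) for a single time slot and evaluates that slot's contribution to $U$ in (6a). It rests on assumptions that go beyond the text: an assignment is encoded as a list `assign[i] = k` (so (6b) holds by construction), $\alpha_{i,k}$ is read at the node serving the task in the current slot, and the migration term is charged when the serving node differs from the previous slot's node. All names and numbers are hypothetical.

```python
def is_feasible(assign, data_size, bandwidth, mem_cap, bw_cap):
    """Check memory (6c) and bandwidth (6d) limits for one time slot."""
    mem_used = [0.0] * len(mem_cap)
    bw_used = [0.0] * len(bw_cap)
    for i, k in enumerate(assign):
        mem_used[k] += data_size[i]       # sum_i x_ik * D_i
        bw_used[k] += bandwidth[i][k]     # sum_i x_ik * B_ik(t)
    mem_ok = all(used <= cap for used, cap in zip(mem_used, mem_cap))
    bw_ok = all(used <= cap for used, cap in zip(bw_used, bw_cap))
    return mem_ok and bw_ok


def slot_utility(assign, prev_assign, r_sla, alpha, w_mig, lam1, lam2):
    """One slot's term of (6a): SLA reward + affinity - migration penalty."""
    total = 0.0
    for i, k_now in enumerate(assign):
        k_prev = prev_assign[i]
        penalty = w_mig[k_prev][k_now] if k_prev != k_now else 0.0
        total += r_sla[i] + lam1 * alpha[i][k_now] - lam2 * penalty
    return total


# Hypothetical instance: 2 tasks, 2 nodes.
data_size = [4, 6]
bandwidth = [[10, 5], [8, 4]]
mem_cap, bw_cap = [8, 8], [15, 10]
alpha = [[0.9, 0.2], [0.3, 1.0]]
w_mig = [[0.0, 0.5], [0.5, 0.0]]

spread_ok = is_feasible([0, 1], data_size, bandwidth, mem_cap, bw_cap)
packed_ok = is_feasible([0, 0], data_size, bandwidth, mem_cap, bw_cap)
utility = slot_utility([0, 1], [0, 0], [1.0, 1.0], alpha, w_mig, lam1=0.5, lam2=1.0)
print(spread_ok, packed_ok, utility)
```

In this toy instance, placing both tasks on node 0 violates the memory bound, while spreading them is feasible; moving the second task to its better-matched node gains affinity reward but pays the migration penalty once.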

4. LSTM-Enhanced QMIX Framework for Integrated Scheduling

In this section, we elaborate on the proposed integrated scheduling framework. Building upon the system model defined in Section 3, the framework aims to solve the optimization problem in (6a) by learning an optimal policy. The proposed framework consists of two core components: a spatiotemporal feature extraction module based on LSTM and a multi-agent collaborative decision-making module based on QMIX.

4.1. Spatiotemporal Heterogeneous Feature Extraction

Effective scheduling relies on the accurate perception of the DTVG state defined in Section 3.2. However, raw observation snapshots are insufficient to capture the temporal correlations of node loads caused by task dynamics and user mobility. To address this, we design a feature extraction module based on LSTM networks. This module functions as a trajectory encoder, mapping the historical time-series data of heterogeneous nodes into a compact latent representation.

4.1.1. Heterogeneous State Representation

For each Edge node $s_{k} \in S$, we construct a multi-dimensional state vector $e_{t}^{k}$ at time slot $t$, encapsulating both resource utilization and network traffic status. To ensure numerical stability and accelerate convergence, raw measurements are normalized using Z-score standardization. The feature vector is defined as:

$$e_{t}^{k} = [u_{t}^{cpu}, u_{t}^{gpu}, u_{t}^{mem}, B_{t}^{in}, B_{t}^{out}, \Delta L_{t}], \tag{7}$$

where $u_{t}^{cpu}$ and $u_{t}^{gpu}$ denote the utilization rates relative to the capacities $C_{k}^{cpu}$ and $C_{k}^{gpu}$, respectively; $u_{t}^{mem}$ represents memory usage; $B_{t}^{in/out}$ indicates real-time inbound/outbound bandwidth; and $\Delta L_{t}$ represents the rate of load change, providing first-order trend information.
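The normalization step can be sketched as follows. This is a minimal illustration under stated assumptions: only three of the six features in Equation (7) are shown, $\Delta L_{t}$ is approximated by the first difference of CPU utilization, and statistics are computed over the displayed window rather than over a long-run history. The helper names are hypothetical.

```python
from statistics import mean, pstdev


def load_deltas(loads):
    """First-order difference; the first slot has no predecessor, so use 0."""
    return [0.0] + [b - a for a, b in zip(loads, loads[1:])]


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by 0.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(x - mu) / sd for x, mu, sd in zip(row, mus, sigmas)] for row in rows]


# Hypothetical raw measurements for one node over three slots.
u_cpu = [0.2, 0.4, 0.6]
u_gpu = [0.5, 0.5, 0.5]
delta = load_deltas(u_cpu)

raw = [list(row) for row in zip(u_cpu, u_gpu, delta)]
features = zscore_columns(raw)
for row in features:
    print(row)
```

Guarding against zero variance matters in practice because an idle accelerator yields a constant utilization column, which would otherwise produce a division by zero.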

4.1.2. LSTM-Based Temporal Encoding

Standard RNNs suffer from the vanishing gradient problem when processing long sequences. To mitigate this, we employ LSTM units equipped with gating mechanisms to selectively memorize or forget information. The LSTM processes the input sequence $\{e_{t - H}^{k}, \dots, e_{t}^{k}\}$ over a history window $H$. At each time step, the internal cell state $c_{t}^{k}$ and hidden state $h_{t}^{k}$ are updated as follows:

$$f_{t} = \sigma(W_{f} \cdot [h_{t - 1}^{k}, e_{t}^{k}] + b_{f}), \tag{8a}$$

$$i_{t} = \sigma(W_{i} \cdot [h_{t - 1}^{k}, e_{t}^{k}] + b_{i}), \tag{8b}$$

$$o_{t} = \sigma(W_{o} \cdot [h_{t - 1}^{k}, e_{t}^{k}] + b_{o}), \tag{8c}$$

$$c_{t}^{k} = f_{t} \odot c_{t - 1}^{k} + i_{t} \odot \tanh(W_{c} \cdot [h_{t - 1}^{k}, e_{t}^{k}] + b_{c}), \tag{8d}$$

$$h_{t}^{k} = o_{t} \odot \tanh(c_{t}^{k}), \tag{8e}$$

where $\sigma$ denotes the sigmoid activation function, and $\odot$ represents the element-wise product. Here, the forget gate $f_{t}$ determines which historical load information is discarded, while the input gate $i_{t}$ controls the integration of current monitoring data. The output gate $o_{t}$ regulates the information flow to the next layer. Through this mechanism, the final hidden state $h_{t}^{k}$ encodes both micro-fluctuations, such as sudden traffic bursts, and macro-periodic patterns of the heterogeneous resources. This spatiotemporal embedding $h_{t}^{k}$ is then fed into the RL agent to mitigate the partial observability of the network environment.
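Equations (8a)–(8e) can be traced step by step with a dependency-free sketch. The code below is an illustration only: the parameter dictionary layout, the hidden size, the window length, and the random weights are assumptions, and a practical implementation would use a deep learning framework with trained parameters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vec):
    """Compute W · vec + b for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_step(params, h_prev, c_prev, e_t):
    """One LSTM update following Eqs. (8a)-(8e)."""
    z = h_prev + e_t  # list concatenation: [h_{t-1}, e_t]
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]    # (8a)
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]    # (8b)
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]    # (8c)
    g = [math.tanh(a) for a in affine(params["Wc"], params["bc"], z)]  # candidate
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)] # (8d)
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]                   # (8e)
    return h, c


def encode(params, window, hidden):
    """Run the LSTM over a history window and return the final hidden state."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for e_t in window:
        h, c = lstm_step(params, h, c, e_t)
    return h


# Hypothetical sizes: 6 input features as in Eq. (7), 4 hidden units, 5 slots.
rng = random.Random(0)
hidden, in_dim = 4, 6
params = {}
for gate in "fioc":
    params["W" + gate] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + in_dim)]
                          for _ in range(hidden)]
    params["b" + gate] = [0.0] * hidden
window = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(5)]
h_t = encode(params, window, hidden)
print(h_t)
```

Since $h_{t}^{k}$ is a sigmoid-gated $\tanh$ of the cell state, each component stays strictly inside $(-1, 1)$, which keeps the embedding on a scale comparable to the standardized inputs.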

4.2. Dec-POMDP Formulation

To address the distributed and dynamic nature of the scheduling problem, we reformulate the optimization objective as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), formally characterized by the tuple $\langle N, S_{global}, A, P, R, \Omega, O, \gamma \rangle$. Specifically, $N$ denotes the set of agents corresponding to the computation tasks, $P$ captures the stochastic spatiotemporal evolution of the DTVG, $\Omega$ is the observation function mapping the global state $s_{t} \in S_{global}$ to local observations in $O$, and $\gamma \in (0, 1)$ is the discount factor. To enhance scalability and ensure feasibility, we incorporate parameter sharing across all homogeneous task agents and apply constraint-aware action masking based on constraints (6b)–(6d) to dynamically prune the action space $A$. The key elements are defined as follows.
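Constraint-aware action masking can be sketched as follows. The text does not specify the masking rule, so this example assumes one plausible form: a node is admissible for a task if it is reachable and still has enough residual memory and bandwidth, and inadmissible nodes have their Q-values replaced by negative infinity before the greedy choice. Function names and values are hypothetical.

```python
NEG_INF = float("-inf")


def action_mask(data_size, bw_demand, reachable, mem_free, bw_free):
    """True where node k can host the task without violating (6c) or (6d)."""
    return [
        reachable[k] and data_size <= mem_free[k] and bw_demand[k] <= bw_free[k]
        for k in range(len(mem_free))
    ]


def masked_greedy(q_values, mask):
    """Pick the best admissible node; choosing exactly one node enforces (6b)."""
    masked = [q if ok else NEG_INF for q, ok in zip(q_values, mask)]
    best = max(range(len(masked)), key=masked.__getitem__)
    return None if masked[best] == NEG_INF else best


# Hypothetical agent with three candidate nodes.
q_values = [2.0, 1.5, 0.7]
mask = action_mask(
    data_size=5,
    bw_demand=[3, 3, 3],
    reachable=[True, True, False],   # node 2 is out of communication range
    mem_free=[4, 8, 16],             # node 0 lacks memory for this task
    bw_free=[10, 10, 10],
)
chosen = masked_greedy(q_values, mask)
print(mask, chosen)
```

Here the node with the highest raw Q-value is pruned for lack of memory, so the agent falls back to the second node; returning `None` when every node is masked leaves the handling of unschedulable tasks to the caller.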

4.2.1. Global State ($S_{global}$)

The global state $s_{t} \in S_{global}$ aggregates the holistic information of the CPN environment at time slot $t$. It encompasses the global topological structure of the DTVG $G_{t}$ and the real-time resource attribute matrix of all computing nodes. Within the Centralized Training with Decentralized Execution (CTDE) paradigm, $s_{t}$ is exclusively utilized by the mixing network during the centralized training phase to capture global interactions.

4.2.2. Local Observation ($O$)

Instead of relying on a monolithic centralized scheduler, we employ a decentralized Multi-Agent Reinforcement Learning paradigm. Since a computation task is inherently a passive workload generated by a user device, it cannot independently make scheduling decisions. To bridge this gap, we instantiate a dedicated Task-specific Scheduling Agent $i \in N$ for each persistent service flow $m_{i}$, which acts as an active decision-making entity on its behalf.

This agent operates as an active software process residing in the Edge control plane. By adopting a one-to-one mapping between a logical software agent and a specific physical task flow, our system effectively decomposes the globally complex resource allocation problem into distributed local decisions. This decentralized design inherently avoids the action-space explosion that afflicts centralized schedulers when managing large numbers of concurrent tasks in dynamic 6G environments. Functionally, the agent is responsible for collecting environmental state observations regarding its assigned task, choosing an offloading action according to its local policy, and calculating the reward based on the task’s execution performance. Considering the partial observability constraints, the local observation $o_{t}^{i} \in O$ for agent $i$ integrates the heterogeneity-aware context defined in Section 3.1:

(1): Task Profile: The intrinsic requirement vector $\mathrm{Req}_{i}$, specifically highlighting the workload type $Type_{i}$ and the latency deadline $T_{i}^{\max}$.
(2): Node Context: The heterogeneous attribute vector $\mathrm{Attr}_{k}$ of candidate nodes, augmented by their spatiotemporal load embeddings $h_{t}^{k}$ extracted via the LSTM module.
(3): Link Conditions: The dynamic weights of reachable offloading edges ($W_{i, k}^{edge}$) and potential migration edges ($W_{k, k'}^{mig}$) within the current DTVG snapshot.

4.2.3. Reward Function Design ( $R$ )

To ensure the learned policy converges towards the optimal system utility $U$ defined in Section 3.4, we construct a composite reward function $r_{t}$. This function explicitly correlates with the optimization objective terms:

$$r_{t} = \lambda_{1} \cdot r_{match} + r_{qos} - \lambda_{2} \cdot r_{mig}. \qquad (9)$$

The components are mathematically defined to enforce specific scheduling behaviors:

(1): $r_{match}$ (Architectural Affinity): This term corresponds to the processing efficiency factor $\alpha_{i, k}$. If task $m_{i}$ is assigned to node $s_{k}$, we set $r_{match} = \alpha_{i, k}$. This mechanism explicitly incentivizes the alignment of workload types with hardware architectures, exemplified by mapping AI inference tasks to GPU-accelerated nodes (where $\alpha \approx 1$), as opposed to mismatched resources such as CPU-based nodes (where $\alpha \ll 1$).
(2): $r_{qos}$ (SLA Adherence): This term corresponds to the $R_{i, t}^{SLA}$ component in the system objective (Equation (6a)) and quantifies the satisfaction of QoS requirements, formulated as $r_{qos} = T_{i}^{\max} / (T_{i, k}^{trans} + T_{i, k}^{comp})$. A higher value signifies a more substantial safety margin for meeting the completion deadline.
(3): $r_{mig}$ (Migration Penalty): This term reflects the migration overhead $W_{k, k'}^{mig}$, designed to penalize excessive service handovers, thereby promoting service continuity.

To prevent gradient explosion and ensure numerical stability during training, all reward components are first standardized using Z-Score normalization based on historical statistics collected from the DTVG, and then clipped to the range $[-1, 1]$.
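As a worked illustration of Equation (9) together with the normalization step, the sketch below standardizes each component, clips it to $[-1, 1]$, and then combines the three terms. It is a sketch under assumptions rather than the paper's code: the function names, the `stats` dictionary of historical (mean, standard deviation) pairs, the small `eps` guard, and the default weights `lam1 = lam2 = 1.0` are hypothetical, since the source does not report the values of $\lambda_{1}$ and $\lambda_{2}$.

```python
def z_score(value, mean, std, eps=1e-8):
    """Standardize a raw reward component with historical statistics."""
    return (value - mean) / (std + eps)  # eps (assumed) avoids division by zero


def clip(value, low=-1.0, high=1.0):
    """Clip a normalized component to the range [-1, 1]."""
    return max(low, min(high, value))


def composite_reward(alpha, t_max, t_trans, t_comp, w_mig, stats, lam1=1.0, lam2=1.0):
    """Evaluate r_t = lam1 * r_match + r_qos - lam2 * r_mig on normalized components."""
    raw = {
        "match": alpha,                    # r_match = alpha_{i,k}
        "qos": t_max / (t_trans + t_comp), # r_qos = T_max / (T_trans + T_comp)
        "mig": w_mig,                      # r_mig reflects W_{k,k'}^{mig}
    }
    # stats maps each component name to its (mean, std) pair (hypothetical layout).
    norm = {name: clip(z_score(value, *stats[name])) for name, value in raw.items()}
    return lam1 * norm["match"] + norm["qos"] - lam2 * norm["mig"]
```

Because every component is clipped before weighting, the magnitude of $r_{t}$ is bounded by $\lambda_{1} + 1 + \lambda_{2}$, which keeps the TD targets in a predictable range.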

4.3. QMIX-Based Collaborative Scheduling Algorithm

To achieve global optimal scheduling in a decentralized setting, we define the learning process based on the QMIX algorithm. The discrete action space for each agent $i$, denoted as $A_{i}$, corresponds to the set of available computing nodes $S$. Selecting an action $u_{t}^{i} = s_{k} \in S$ is equivalent to setting the decision variable $x_{i, k}^{t} = 1$ in the optimization problem (6a). Following the Centralized Training with Decentralized Execution (CTDE) paradigm, QMIX effectively addresses the multi-agent credit assignment problem in large-scale heterogeneous networks.

As illustrated in Figure 3, the overall architecture comprises two tightly coupled components: the Agent Network and the Mixing Network. To address the partial observability inherent in distributed scheduling, each agent network adopts a Deep Recurrent Q-Network (DRQN) architecture. Specifically, the network for agent $i$ comprises an input Multi-Layer Perceptron (MLP), a Gated Recurrent Unit (GRU), and an output MLP. The GRU component leverages hidden states $h_{t-1}^{i}$ to memorize the history of observations. Consequently, the agent computes its local utility value $Q_{i}(\tau_{t}^{i}, u_{t}^{i})$ based on the current local observation $o_{t}^{i}$ and the last action $u_{t-1}^{i}$, where $\tau$ denotes the action-observation history. During the execution phase, agents select actions greedily based solely on their local $Q_{i}$ values, eliminating the need for real-time global communication.

To ensure the feasibility of scheduling decisions, we integrate a constraint-aware action selection mechanism. Before the agent selects an action $u_{t}^{i}$, a validity check is performed based on constraints (6b)–(6d), which enforce the single-node assignment rule as well as the memory and bandwidth capacities. Nodes that violate these hard constraints are removed from the candidate set for the current time slot, which prunes the action space and keeps every executed decision feasible. During the execution phase, agents therefore select actions greedily from the valid candidate set based on their local $Q_{i}$ values, without real-time global communication.

Regarding communication overhead, the proposed framework adopts the Centralized Training with Decentralized Execution (CTDE) paradigm to achieve high scalability. Global state information is required only during the offline training phase, where the Mixing Network aggregates agent utilities for joint optimization. During online execution, each agent operates independently based on its local observation history, with no inter-agent state exchange or explicit negotiation required. Since cooperative scheduling strategies are encoded directly into the trained network weights, the communication overhead during online deployment is effectively eliminated. In addition to execution efficiency, action masking plays a vital role in enforcing the hard physical constraints defined in Equations (6c) and (6d). By filtering out illegal actions, such as the assignment of tasks to nodes with exhausted memory or bandwidth capacity, the masking mechanism prevents agents from receiving sparse or invalid reward signals. This approach significantly reduces training variance and ensures that all learned scheduling policies are strictly compliant with the operational requirements of dynamic 6G networks.

To bridge local utilities with the global objective, QMIX introduces a Mixing Network to approximate the global joint action-value function $Q_{tot}(s_{t}, u_{t})$. The mixing network is a feedforward neural network whose weights are dynamically generated by Hypernetworks taking the global state $s_{t}$ as input. To ensure consistency between the global argmax and individual argmax operations, QMIX enforces a monotonicity constraint:

$$\frac{\partial Q_{tot}}{\partial Q_{i}} \geq 0, \quad \forall i \in N, \qquad (10)$$

where $N$ denotes the set of task agents; $Q_{tot}$ is the global joint action-value function; $Q_{i}$ is the individual utility value for agent $i$; and $\frac{\partial Q_{tot}}{\partial Q_{i}}$ characterizes the sensitivity of the global value function to each agent’s local utility.
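To make the monotonicity constraint in Equation (10) concrete, the following pure-Python sketch builds a tiny mixing network whose weights come from state-conditioned linear hypernetworks passed through an absolute value. It is a simplified illustration, not the trained model: the class `TinyMixer`, the random initialization, the purely linear hypernetworks, and the hidden width of 4 are assumptions chosen to keep the example short.

```python
import math
import random


def elu(x):
    """ELU activation; it is monotonically increasing, which preserves the constraint."""
    return x if x > 0 else math.exp(x) - 1.0


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyMixer:
    """Two-layer mixing network with non-negative, state-dependent weights."""

    def __init__(self, n_agents, state_dim, hidden=4, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

        self.n_agents = n_agents
        self.hidden = hidden
        # Linear hypernetworks (assumed form): global state -> mixing weights and biases.
        self.hyper_w1 = rand_matrix(n_agents * hidden, state_dim)
        self.hyper_b1 = rand_matrix(hidden, state_dim)
        self.hyper_w2 = rand_matrix(hidden, state_dim)
        self.hyper_b2 = rand_matrix(1, state_dim)

    def q_tot(self, agent_qs, state):
        # abs() makes every mixing weight non-negative; biases stay unconstrained.
        w1 = [abs(w) for w in matvec(self.hyper_w1, state)]
        b1 = matvec(self.hyper_b1, state)
        w2 = [abs(w) for w in matvec(self.hyper_w2, state)]
        b2 = matvec(self.hyper_b2, state)[0]
        hidden = []
        for j in range(self.hidden):
            pre = sum(agent_qs[i] * w1[i * self.hidden + j] for i in range(self.n_agents))
            hidden.append(elu(pre + b1[j]))
        return sum(h * w for h, w in zip(hidden, w2)) + b2
```

Raising any single agent's $Q_{i}$ while holding the state fixed can never lower the output of `q_tot`, because each path from $Q_{i}$ to $Q_{tot}$ passes only through non-negative weights and an increasing activation. This is why each agent's local greedy action is consistent with the joint greedy action.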

This constraint is satisfied by applying an absolute-value activation to the outputs of the hypernetworks, ensuring non-negative mixing weights. The framework is trained end-to-end by minimizing the following Temporal Difference (TD) loss:

$$L(\theta) = \sum_{b=1}^{B} \sum_{t=1}^{T} \left( y_{tot} - Q_{tot}(s_{t}, u_{t}; \theta) \right)^{2}, \qquad (11)$$

where the sums run over the $B$ sampled episodes in a batch and the $T$ time steps of each episode; $\theta$ represents the parameters of the network; $y_{tot} = r_{t} + \gamma \max_{u'} Q_{tot}(s_{t+1}, u'; \theta^{-})$ denotes the target value; $r_{t}$ is the composite reward at time $t$; $0 < \gamma < 1$ is the discount factor; $u'$ denotes the joint action in the next state over which the maximum is taken; and $\theta^{-}$ denotes the parameters of the target network. By optimizing $Q_{tot}$, the learned policy implicitly pursues the long-term system utility $U$ defined in Equation (6a), thereby promoting load balancing and QoS satisfaction under heterogeneous resource constraints.
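The TD target and the loss in Equation (11) reduce to a few lines of arithmetic once the mixed values are available. The sketch below assumes that $Q_{tot}(s_{t}, u_{t}; \theta)$ and the maximized target value have already been computed by the online and target networks; the dictionary keys, the function names, and the optional `done` flag for terminal steps are hypothetical and not part of Equation (11).

```python
def td_target(reward, q_tot_next_max, gamma=0.99, terminated=False):
    """y_tot = r_t + gamma * max_u' Q_tot(s_{t+1}, u'; theta^-)."""
    # The terminal-step handling is an added assumption; Eq. (11) does not mention it.
    return reward + (0.0 if terminated else gamma * q_tot_next_max)


def td_loss(batch, gamma=0.99):
    """Sum of squared TD errors over all episodes b and time steps t."""
    total = 0.0
    for episode in batch:          # b = 1..B
        for step in episode:       # t = 1..T
            y = td_target(step["r"], step["q_tot_next_max"], gamma, step.get("done", False))
            total += (y - step["q_tot"]) ** 2
    return total
```

Thanks to the monotonicity constraint, the maximization over the joint action $u'$ does not require enumerating joint actions: each agent's individual greedy action can be fed into the target mixing network to obtain `q_tot_next_max`.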

5. Simulation and Analysis

5.1. Experimental Setup

To comprehensively evaluate the proposed heterogeneous computing scheduling method, the simulation environment is parameterized to reflect the 6G-enabled V2X application scenario described in Section 3, incorporating dynamic vehicle mobility and the tiered RSU-Cloud architecture. We developed a co-simulation platform by integrating SUMO (Simulation of Urban MObility) [41] with a custom Python-based Edge computing framework (Python 3.11). SUMO is employed to generate realistic vehicle trajectories and traffic flow constraints, simulating the spatiotemporal movement of mobile users within the predefined V2X network topology. On top of this mobility layer, the Edge computing environment and the MARL training process are implemented using PyTorch (version 2.6.0+cu124) [42]. This setup allows us to capture the complex coupling between vehicle mobility, wireless communication coverage, and computational workload distribution consistent with our theoretical system model.

Reflecting the hierarchical architecture defined in our system model, the simulation configures a heterogeneous network comprising distinct types of computing nodes to emulate the diversity of Edge and Cloud resources. Specifically, we define Type C1/C2 nodes as IO-intensive Edge nodes, modeled after roadside units (RSUs) described in the research report. These nodes are equipped with low-power CPUs and limited storage, characterized by low access latency, making them ideal for handling delay-sensitive status reporting tasks. In contrast, Type C3 nodes represent compute-intensive regional Cloud centers equipped with high-performance GPUs and large memory capacities. These nodes are designed to handle heavy workloads such as AI perception and inference tasks. Correspondingly, the task stream generated in the simulation is a mix of lightweight status reporting tasks and heavy AI inference tasks, enabling us to verify the algorithm’s ability to perform type-aware scheduling. To ensure statistical reliability and reproducibility, all simulation results reported in this paper are derived from 5 independent experimental runs using different random seeds. The reported metrics represent the mean performance, and the observed low variability across these runs confirms the stability and robustness of our proposed framework.

The training of the proposed QMIX-based model follows the Centralized Training with Decentralized Execution (CTDE) [14] paradigm. The GRU hidden dimension is set to 64 to capture historical context, while the mixing network uses a hidden dimension of 32. The exploration rate $\epsilon$ anneals linearly from 1.0 to 0.05 to balance exploration and exploitation. Detailed parameter settings, including the learning rate of $5 \times 10^{-4}$ and a replay buffer size of 5000 episodes, are summarized in Table 1. These hyperparameters, such as the hidden layer dimensions and the learning rate, were selected through empirical tuning and sensitivity analysis. They are consistent with established multi-agent reinforcement learning benchmarks [14,43] and balance exploration efficiency, convergence stability, and training speed within the dynamic 6G network environment.
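The linear annealing of the exploration rate can be written as a small schedule function. The endpoints 1.0 and 0.05 and the other values in `HYPERPARAMS` come from the text and Table 1; the annealing horizon `anneal_steps` is not reported in the source, so it is a free parameter here, and the dictionary key names are hypothetical.

```python
# Values taken from Table 1; key names are illustrative only.
HYPERPARAMS = {
    "gru_hidden_dim": 64,
    "mixing_hidden_dim": 32,
    "epsilon_start": 1.0,
    "epsilon_end": 0.05,
    "gamma": 0.99,
    "replay_buffer_episodes": 5000,
    "batch_size": 32,
    "target_update_episodes": 200,
    "learning_rate": 5e-4,
}


def epsilon_at(step, anneal_steps, eps_start=1.0, eps_end=0.05):
    """Linearly interpolate epsilon from eps_start to eps_end, then hold it constant."""
    # anneal_steps is an assumed horizon; the source does not state its value.
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

Halfway through the horizon the agent would explore with probability 0.525, and after the horizon the rate stays at 0.05.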

5.2. Performance Evaluation

We first analyze the convergence behavior of the proposed MARL algorithm to validate its stability in dynamic environments. Figure 4 illustrates the evolution of the average travel time over training episodes. In the initial phase (approximately 0–20 episodes), the curve exhibits significant fluctuations as the agents explore the state-action space and the mixing network adjusts its weights. As the training progresses, the multi-agent system gradually learns the implicit collaboration policy. It can be observed that the travel time decreases significantly and stabilizes around 200 s after approximately 80 episodes. The convergence to a low travel time indicates that the algorithm successfully finds a stable scheduling policy that balances local objectives with global system efficiency.

A critical objective of this work is to achieve efficient integration of heterogeneous computing power. Effective resource matching relies heavily on accurate perception of node states. Figure 5 depicts the comparison between predicted and actual CPU utilization. Specifically, Figure 5a illustrates the performance on the training set, where the LSTM module successfully captures the long-term periodic fluctuations and noise characteristics of the workload. Figure 5b demonstrates the generalization capability on the test set, showing high tracking accuracy with low Mean Squared Error (MSE) even on unseen data. This precise workload prediction enables the scheduler to distinguish between idle and overloaded nodes effectively. Consequently, the proposed algorithm demonstrates intelligent type-matching behavior in our experiments, dispatching approximately 85% of compute-intensive AI inference tasks to GPU-enabled C3 nodes while routing 90% of status reporting tasks to CPU-based C1 nodes. This differentiation maximizes the utility of heterogeneous hardware and avoids resource mismatch problems common in random or static scheduling strategies.

Finally, to quantify the superiority of our approach, we compare it against two widely adopted heuristic baselines: Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO), as well as a state-of-the-art Multi-Agent Deep Reinforcement Learning baseline: Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [43]. The quantitative comparison results under varying task loads are presented in Table 2.

The proposed QMIX-based method outperforms the baselines across all key metrics. Specifically, it reduces the average latency by approximately 22% compared to ACO. This improvement is attributed to the Time-Varying Graph (TVG) modeling, which allows the system to anticipate topology changes and migrate tasks proactively before link failures occur. Furthermore, our method significantly lowers the migration count compared to PSO. By incorporating global migration costs into the reward function, the agents learn to suppress excessive and oscillatory task transfers, ensuring that migrations occur only when the expected performance gain outweighs the overhead, thereby enhancing service stability. The proposed method also achieves the highest resource utilization rate of 84.3%, confirming that the collaborative mechanism of MARL effectively balances the load across heterogeneous C1 and C3 nodes and prevents localized congestion.

Furthermore, compared to the modern DRL baseline (MADDPG), our QMIX-based approach still achieves approximately 8.2% lower average latency and higher resource utilization. We attribute this distinct advantage to the algorithmic architecture. In the specific “bottleneck” scenarios of Edge computing, tasks frequently compete for highly constrained heterogeneous resources. QMIX addresses this by employing a mixing network that enforces a monotonicity constraint, effectively factorizing the global joint value function to ensure optimal implicit coordination among cooperative agents. In contrast, MADDPG’s independent actor-critic framework often struggles with sub-optimal local convergence and high variance when navigating the highly coupled, discontinuous action spaces inherent in DTVG-based resource matching.
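The percentage improvements quoted above follow directly from the mean values in Table 2. The short calculation below reproduces them; the dictionary layout and the helper name `relative_reduction` are illustrative, while the numbers are the reported means.

```python
# Mean values copied from Table 2 (standard deviations omitted).
TABLE2 = {
    "PSO": {"latency_ms": 45.2, "migrations": 124, "util_pct": 68.5},
    "ACO": {"latency_ms": 41.8, "migrations": 98, "util_pct": 72.1},
    "MADDPG": {"latency_ms": 35.4, "migrations": 83, "util_pct": 79.8},
    "QMIX": {"latency_ms": 32.5, "migrations": 76, "util_pct": 84.3},
}


def relative_reduction(baseline, proposed):
    """Percentage by which the proposed value is lower than the baseline value."""
    return 100.0 * (baseline - proposed) / baseline


vs_aco = relative_reduction(TABLE2["ACO"]["latency_ms"], TABLE2["QMIX"]["latency_ms"])
vs_maddpg = relative_reduction(TABLE2["MADDPG"]["latency_ms"], TABLE2["QMIX"]["latency_ms"])
vs_pso_mig = relative_reduction(TABLE2["PSO"]["migrations"], TABLE2["QMIX"]["migrations"])

print(f"Latency vs ACO:    {vs_aco:.1f}% lower")      # 22.2% lower
print(f"Latency vs MADDPG: {vs_maddpg:.1f}% lower")   # 8.2% lower
print(f"Migrations vs PSO: {vs_pso_mig:.1f}% lower")  # 38.7% lower
```

The same helper applied to the migration counts shows that the gap to PSO is larger in relative terms than the latency gap to either baseline.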

It is worth noting that the SUMO-based simulation environment stochastically generates vehicle trajectories and task arrival patterns, naturally encompassing a wide spectrum of traffic conditions ranging from low-density off-peak periods to high-density peak-hour scenarios. Consequently, the Mean ± Std. Dev. values reported in Table 2, aggregated across 5 independent runs, reflect the performance distribution across these inherently varying load conditions. The consistently low standard deviations of our proposed method across all three metrics serve as strong evidence that the learned scheduling policy maintains stable superiority under both light and heavy load regimes, confirming the robustness and generalizability of our framework in dynamic 6G-CPN environments.

6. Conclusions

In this paper, we proposed an integrated scheduling framework for heterogeneous computing power based on Discrete Time-Varying Graphs and QMIX-based Multi-Agent Reinforcement Learning. The DTVG model effectively captures the spatiotemporal evolution of communication topologies, providing a robust foundation for anticipating link availability under high user mobility. Furthermore, the QMIX algorithm resolves the coordination challenges among heterogeneous agents by decomposing global utility, enabling implicit cooperation and precise resource type-matching. Simulation results demonstrate that our approach significantly outperforms traditional heuristic baselines in terms of service latency, migration stability, and resource utilization.

Author Contributions

Conceptualization, J.Y. and X.Z.; methodology, J.Y.; software, J.Y.; validation, J.Y. and X.Z.; formal analysis, J.Y.; investigation, J.Y. and X.Z.; resources, X.Z.; data curation, J.Y.; writing—original draft preparation, K.G.; writing—review and editing, K.G.; visualization, K.G.; supervision, X.Z.; project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Major Science and Technology Project of Qinghai Province (Grant No. 2024-GX-A3).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Authors Jinshan Yuan and Xuncai Zhang were employed by China United Network Communications Group Co., Ltd. Qinghai Branch. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Tang, Q.; Xie, N.; Kunz, T.; Luan, T.H.; Liu, Z. Computing Power Network: The Architecture of Convergence of Computing and Networking. IEEE Netw. 2021, 35, 166–173.
2. David, K.; Berndt, H. 6G Vision and Requirements: Is There Any Need for Beyond 5G? IEEE Veh. Technol. Mag. 2018, 13, 72–80.
3. Gyawali, S.; Xu, S.; Qian, Y.; Hu, R.Q. Challenges and Solutions for Cellular Based V2X Communications. IEEE Commun. Surv. Tutorials 2021, 23, 222–255.
4. Sisinni, E.; Saifullah, A.; Han, S.; Jennehag, U.; Gidlund, M. Industrial Internet of Things: Challenges, Opportunities, and Directions. IEEE Trans. Ind. Informatics 2018, 14, 4724–4734.
5. Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A Survey on Mobile Edge Computing: The Communication Perspective. IEEE Commun. Surv. Tutorials 2017, 19, 2322–2358.
6. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. IEEE Internet Things J. 2016, 3, 637–646.
7. Fang, F.; Feng, Y.; Ding, Z.; Zhang, H.; Chen, X.; Cheung, G. Joint Task Offloading and Resource Optimization in NOMA-Based Heterogeneous Mobile Edge Computing. IEEE Trans. Wirel. Commun. 2021, 20, 6926–6941.
8. Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet Things J. 2020, 7, 7457–7469.
9. Casteigts, A.; Flocchini, P.; Quattrociocchi, W.; Santoro, N. Time-Varying Graphs and Dynamic Networks. Int. J. Parallel Emergent Distrib. Syst. 2012, 27, 387–408.
10. Arshad, R.; ElSawy, H.; Sorour, S.; Al-Naffouri, T.Y.; Alouini, M.-S. Handover Management in 5G and Beyond: A Topology Aware Skipping Approach. IEEE Access 2016, 4, 9073–9081.
11. Hussein, M.K.; Mousa, M.H. Efficient Task Offloading for IoT-Based Applications in Fog Computing Using Ant Colony Optimization. IEEE Access 2020, 8, 37191–37201.
12. Luo, Q.; Li, C.; Luan, T.H.; Shi, W. Minimizing the Delay and Cost of Computation Offloading for Vehicular Edge Computing. IEEE Trans. Serv. Comput. 2022, 15, 2897–2909.
13. Al-Turjman, F. Hybrid Genetic Algorithm for IOMT-Cloud Task Scheduling. Wirel. Commun. Mob. Comput. 2022, 2022, 6604286.
14. Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Dy, J., Krause, A., Eds.; PMLR: Stockholm, Sweden, 2018; Volume 80, pp. 4295–4304.
15. Tang, S.; Yu, Y.; Wang, H.; Wang, G.; Chen, W.; Xu, Z.; Guo, S.; Gao, W. A Survey on Scheduling Techniques in Computing and Network Convergence. IEEE Commun. Surv. Tutorials 2024, 26, 160–195.
16. Mach, P.; Becvar, Z. Mobile Edge Computing: A Survey on Architecture and Computation Offloading. IEEE Commun. Surv. Tutorials 2017, 19, 1628–1656.
17. Bi, S.; Zhang, Y.J. Computation Rate Maximization for Wireless Powered Mobile-Edge Computing with Binary Computation Offloading. IEEE Trans. Wirel. Commun. 2018, 17, 4177–4190.
18. Chen, X.; Jiao, L.; Li, W.; Fu, X. Efficient Multi-User Computation Offloading for Mobile-Edge Cloud Computing. IEEE/ACM Trans. Netw. 2016, 24, 2795–2808.
19. Wang, Y.; Sheng, M.; Wang, X.; Wang, L.; Li, J. Mobile-Edge Computing: Partial Computation Offloading Using Dynamic Voltage Scaling. IEEE Trans. Commun. 2016, 64, 4268–4282.
20. Cui, Y.; Zhang, D.; Li, H.; Qiang, H.; Zhao, H. Cooperative Task Offloading Strategy for Vehicular Edge Computing Based on Multi-Agent Deep Reinforcement Learning. Future Gener. Comput. Syst. 2026, 174, 107950.
21. Zhou, Y.; Yu, F.R.; Chen, J.; He, B. Joint Resource Allocation for Ultra-Reliable and Low-Latency Radio Access Networks with Edge Computing. IEEE Trans. Wirel. Commun. 2022, 21, 444–460.
22. Song, Z.; Liu, Y.; Sun, X. Joint Task Offloading and Resource Allocation for NOMA-Enabled Multi-Access Mobile Edge Computing. IEEE Trans. Commun. 2021, 69, 1548–1564.
23. Gao, Q.; Liu, C.; Wang, L.; Liu, Y.; Xu, Y. Blockchain-Based Heterogeneous Resource Configuration Scheme in Computing Power Network. Sci. Rep. 2025, 15, 21247.
24. Zhong, A.; Wu, D.; Yang, B.; Wang, R. Heterogeneous Resource Allocation with Latency Guarantee for Computing Power Network. Digit. Commun. Netw. 2025, 12, 25–37.
25. Liu, Y.; Mao, Y.; Liu, Z.; Ye, F.; Yang, Y. Joint Task Offloading and Resource Allocation in Heterogeneous Edge Environments. IEEE Trans. Mob. Comput. 2024, 23, 7318–7334.
26. Gounder, V.V.; Prakash, R.; Abu-Amara, H. Routing in LEO-Based Satellite Networks. In Proceedings of the 1999 IEEE Emerging Technologies Symposium on Wireless Communications and Systems, Richardson, TX, USA, 12–13 April 1999; pp. 22.1–22.6.
27. Shi, K.; Zhang, X.; Zhang, S.; Li, H. Time-Expanded Graph Based Energy-Efficient Delay-Bounded Multicast Over Satellite Networks. IEEE Trans. Veh. Technol. 2020, 69, 10380–10384.
28. Han, Z.; Xu, C.; Zhao, G.; Wang, S.; Cheng, K.; Yu, S. Time-Varying Topology Model for Dynamic Routing in LEO Satellite Constellation Networks. IEEE Trans. Veh. Technol. 2023, 72, 3440–3454.
29. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640.
30. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
31. Fu, R.; Zhang, Z.; Li, L. Using LSTM and GRU Neural Network Methods for Traffic Flow Prediction. In Proceedings of the 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), Wuhan, China, 11–13 November 2016; pp. 324–328.
32. Mao, H.; Alizadeh, M.; Menache, I.; Kandula, S. Resource Management with Deep Reinforcement Learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks (HotNets); ACM: New York, NY, USA, 2016; pp. 50–56.
33. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G. Human-level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
34. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y. Continuous Control with Deep Reinforcement Learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
35. Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the 10th International Conference on Machine Learning (ICML), Amherst, MA, USA, 27–29 June 1993; pp. 330–337.
36. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M. Value-Decomposition Networks for Cooperative Multi-Agent Learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087.
37. Hausknecht, M.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the AAAI Fall Symposium Series, Arlington, VA, USA, 12–14 November 2015.
38. Wei, H.; Zheng, G.; Gayah, V.; Li, Z. CoLight: Learning Network-level Cooperation for Traffic Signal Control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM), Beijing, China, 3–7 November 2019; pp. 1913–1922.
39. Nguyen, T.T.; Nguyen, N.D.; Nahavandi, S. Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions, and Applications. IEEE Trans. Cybern. 2020, 50, 3826–3839.
40. Xue, J.; Wang, L.; Yu, Q.; Mao, P. Multi-Agent Deep Reinforcement Learning-Based Partial Offloading and Resource Allocation in Vehicular Edge Computing Networks. Comput. Commun. 2025, 234, 108081.
41. Krajzewicz, D.; Erdmann, J.; Behrisch, M.; Bieker, L. Recent Development and Applications of SUMO - Simulation of Urban MObility. Int. J. Adv. Syst. Meas. 2012, 5, 128–138.
42. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035.
43. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6379–6390.

Figure 1. System scenario of a heterogeneous computing power network under dynamic V2X environments. The figure illustrates the spatiotemporal dynamics as vehicles move from time slot $t$ to $t + 1$. Key challenges are highlighted: resource mismatch (denoted by the red cross symbol ‘×’), where AI tasks necessitating GPU acceleration are inefficiently matched with CPU-based RSUs following a handover, and topology evolution driven by high mobility, which together necessitate an intelligent integrated scheduling approach.


Figure 2. Construction of the Heterogeneous Discrete Time-Varying Graph. The network state is sliced into discrete time slots. At each slot, the task node $T_{i}$ establishes feasible connections (Offloading Edges) with available heterogeneous computing nodes, where Edge weights are determined by the processing efficiency factor $\alpha_{i, k}$. The Migration Edge captures the task state transition across time slots, enabling the joint optimization of resource matching and migration costs.


Figure 3. The detailed architecture of the proposed QMIX-based collaborative scheduling algorithm. (Bottom) Each agent $n$ adopts a DRQN structure, taking the local observation $o_{t}^{n}$, last action $u_{t-1}^{n}$, and the spatiotemporal embedding $h_{t}^{(n)}$ (extracted by LSTM) as inputs. (Top) The Mixing Network aggregates local utilities into a global $Q_{tot}$. Its weights are dynamically generated by Hypernetworks (depicted as orange blocks) taking the global heterogeneous state $s_{t}$ as input, with absolute activation operations ($|\cdot|$) applied to ensure the non-negativity constraint.


Figure 4. Convergence curve of the training process, showing the stability of the proposed algorithm.

Figure 5. The CPU utilization prediction performance of the LSTM module. (a) The model effectively learns the periodic load fluctuations from the training data. (b) The model maintains high prediction accuracy on the test set, ensuring accurate state perception for heterogeneous scheduling. The vertical axis represents the CPU utilization ratio, ranging from 0 to 1.

Table 1. Hyperparameter Settings for QMIX Training.

Parameter	Value
GRU Hidden Dimension	64
Mixing Network Hidden Dimension	32
Exploration Rate ( $ϵ$ )	Annealed 1.0 → 0.05
Discount Factor ( $γ$ )	0.99
Replay Buffer Size	5000 Episodes
Batch Size	32
Target Update Frequency	200 Episodes
Learning Rate	$5 \times 10^{- 4}$

Table 2. Performance Comparison under Different Task Loads (Mean ± Std. Dev.).

Method	Avg. Latency (ms)	Migration Count	Resource Util. (%)
PSO	45.2 ± 2.4	124 ± 6	68.5 ± 2.1
ACO	41.8 ± 1.9	98 ± 4	72.1 ± 1.7
MADDPG	35.4 ± 1.1	83 ± 3	79.8 ± 1.2
Proposed (QMIX)	32.5 ± 0.7	76 ± 2	84.3 ± 0.6


Share and Cite

MDPI and ACS Style

Yuan, J.; Zhang, X.; Gong, K. Heterogeneous Computing Resources Scheduling Based on Time-Varying Graphs and Multi-Agent Reinforcement Learning. Future Internet 2026, 18, 168. https://doi.org/10.3390/fi18030168

