Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO

Gao, Xu; Dong, Hang; Zhang, Lianji; Wang, Yibo; Yang, Xianliang; Li, Zhenyu

doi:10.3390/app15168928

Open AccessArticle

Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO

by

Xu Gao

^1,2

,

Hang Dong

^1,2,*,

Lianji Zhang

^1,2,

Yibo Wang

^1,2,

Xianliang Yang

^1,2 and

Zhenyu Li

^1,2

¹

School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

²

National Supercomputing Center in Zhengzhou, Zhengzhou 450001, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(16), 8928; https://doi.org/10.3390/app15168928

Submission received: 9 July 2025 / Revised: 7 August 2025 / Accepted: 8 August 2025 / Published: 13 August 2025

Download

Browse Figures

Review Reports Versions Notes

Abstract

In HPC systems, job scheduling plays a critical role in determining resource allocation and task execution order. With the continuous expansion of computing scale and increasing system complexity, modern HPC scheduling faces two major challenges: a massive decision space consisting of tens of thousands of computing nodes and a huge job queue, as well as complex temporal dependencies between jobs and dynamically changing resource states.Traditional heuristic algorithms and basic reinforcement learning methods often struggle to effectively address these challenges in dynamic HPC environments. This study proposes a novel scheduling framework that combines GTrXL with PPO, achieving significant performance improvements through multiple technical innovations. The framework leverages the sequence modeling capabilities of the Transformer architecture and selectively filters relevant historical scheduling information through a dual-gate mechanism, improving long sequence modeling efficiency compared to standard Transformers. The proposed SECT module further enhances resource awareness through dynamic feature recalibration, achieving improved system utilization compared to similar attention mechanisms. Experimental results on multiple datasets (ANL-Intrepid, Alibaba, SDSC-SP2) demonstrate that the proposed components achieve significant performance improvements over baseline PPO implementations. Comprehensive evaluations on synthetic workloads and real HPC trace data show improvements in resource utilization and waiting time, particularly under high-load conditions, while maintaining good robustness across various cluster configurations.

Keywords:

HPC; job scheduling; Transformer-XL; SE; PPO; scheduling efficiency

1. Introduction

With the acceleration of digital transformation and the widespread adoption of computationally intensive applications in fields such as scientific research, industrial engineering, and business analytics [1], scheduling problems in HPC environments have become increasingly complex and face higher performance requirements [2]. In modern HPC systems, coordinating thousands of nodes and managing large-scale task submissions across diverse objectives and resource constraints presents substantial challenges [3,4]. However, traditional scheduling methods typically rely on static heuristic algorithms and fixed assumptions, making them ill suited to the inherent dynamism and uncertainty of real-world HPC workloads [5,6].

Increasing workload heterogeneity and time complexity exacerbate job scheduling difficulty, particularly due to temporal relationships between jobs and resources. [7,8]. These relationships include: (i) execution sequence constraints, where the output of one job must be used by another job, such as variant detection in genomics requiring prior sequence alignment results; (ii) runtime trigger conditions, where a task can only be scheduled after other tasks are completed or specific conditions are met, such as in multi-stage engineering simulations [9]; and (iii) resource contention, where concurrent requests for scarce resources (e.g., GPUs or bandwidth) may lead to latency spikes or idle time. Research indicates that unmanaged competition can reduce system throughput by up to 40% [10].

This challenge is further exacerbated under diverse workloads. Computing-intensive tasks require CPU/GPU throughput, while data-intensive tasks require high I/O bandwidth [11,12]. Modern architectures employ hierarchical storage and dedicated interconnect technologies, further complicating the scheduling process [13].

Traditional scheduling algorithms have obvious shortcomings: FCFS often leads to low resource utilization [14], while SJF may cause long job starvation [15]. Additionally, rule-based and priority-based schemes typically exhibit limited adaptability and high system maintenance complexity [16].

To address these challenges, we propose GTrXL-SPPO, a unified framework that combines Gated GTrXL with PPO. GTrXL captures long-term sequence patterns via multi-head attention mechanisms [17], while GRU-inspired gating mechanisms enhance the retention of essential scheduling signals [18,19]. Additionally, the SE module [20] further enhances the model’s ability to focus on critical scheduling information by adaptively reweighting input features.

In our model, we employ PPO for policy learning [21], leveraging its truncated objective function to ensure stable training in volatile HPC environments. PPO demonstrates superior sampling efficiency and convergence in high-dimensional control tasks [22], and we further incorporate adaptive learning rates to enhance responsiveness to workload fluctuations.

Evaluations on synthetic and real-world datasets (including ANL-Intrepid, Alibaba, and SDSC-SP2) show that our model consistently outperforms traditional methods and standard PPO in terms of throughput, latency, and resource utilization. Additionally, GTrXL-SPPO demonstrates strong generalization capabilities across different system architectures and workload types.

The key contributions of this paper include:

We design a dual-channel policy network based on GTrXL to capture long-term job sequences.
We propose a lightweight SE layer for dynamic resource reweighting in scheduling-sensitive environments.
We developed a reinforcement learning framework based on PPO, incorporating exponential clipping and memory replay to achieve better convergence.
We validated our model on large-scale HPC task sequences, demonstrating performance superior to existing baselines.

The remainder of this paper is organized as follows: Section 2 describes the background and related work in the field of HPC job scheduling. Section 3 introduces our proposed GTrXL-SPPO framework, including architectural design and training methods. Section 4 details the experimental setup and evaluation metrics. Section 5 discusses the experimental results and comparative analysis. Finally, Section 7 summarizes the paper and outlines potential future work.

2. Related Work

2.1. Traditional High-Performance Computing Task Scheduling

Traditional high-performance computing (HPC) job scheduling methods, such as rule-based policies and heuristic algorithms, have long been used to coordinate resources and balance workloads.These methods typically rely on fixed strategies manually designed for specific environments. However, the increasing heterogeneity of modern HPC cluster architectures and the growing diversity of workloads pose substantial challenges to these static strategies. In particular, they often struggle to manage execution time dependencies and balance critical performance objectives such as energy efficiency, latency, and throughput [23,24,25]. Table 1 summarizes the inherent limitations of representative baseline methods, highlighting the necessity of adaptive and learning-based scheduling methods.

In addition to rule-based techniques, previous studies have explored various methods, including feedback control theory, queuing models, and evolutionary optimization algorithms (e.g., genetic algorithm-based schedulers). While these methods are effective in constrained or static environments, they typically struggle to demonstrate the required flexibility and scalability under the real-time constraints of dynamic, large-scale HPC systems.

Case Study Analysis

In earlier versions, Google’s Borg system primarily used a first-come, first-served (FCFS) policy. However, under fluctuating workloads, this policy proved insufficient to ensure the reliable restart of service tasks and batch tasks.

Similarly, traditional batch scheduling frameworks, such as Moab/Maui, LoadLeveler, LSF, PBS, and Sun Grid Engine (SGE), typically rely on space-sharing mechanisms and a two-level queue architecture. These systems sort job queues based on submission time and allocate tasks to idle nodes using static resource matching strategies. However, such static heuristic algorithms often result in poor system performance, exacerbated resource fragmentation, and reduced predictability of performance metrics [34].

For example, PBS Pro often experiences task processing delays when faced with a surge of high-priority tasks, demonstrating the rigid nature of traditional heuristic algorithms in real-time decision-making. Among these traditional schedulers, SLURM [27] is one of the most widely deployed systems in modern HPC environments, including deployments at LLNL, NERSC, and TACC. It adopts a modular, plugin-driven design, containing components such as priority, scheduler, and task submission, and primarily allocates computing nodes through a queue-based space sharing mechanism [35].

Although SLURM offers scalability and integration flexibility, it inherently relies on static heuristic logic. Under dynamic workloads, especially when rigid filling strategies conflict with actual runtime task patterns [3], this often leads to low resource utilization, task starvation, and task startup delays. These challenges highlight the urgent need for adaptive scheduling strategies.

To address this issue, deep learning-based policy learning methods (DRL) such as Approximate PPO and the GTrXL-SPPO proposed in this paper have emerged as promising alternatives. These methods achieve policy adaptation through continuous interaction with the environment and demonstrate superior performance in managing the long-term scheduling complexity of diverse and volatile workloads [36].

2.2. Deep Reinforcement Learning

Unlike continuous control tasks (such as robot control), high-performance computing (HPC) job scheduling inherently involves discrete decisions: assigning specific jobs to specific computing nodes or deferring scheduling decisions. Therefore, our GTrXL-SPPO framework adopts a discrete action space, where each action corresponds to a specific job-to-node assignment or waiting decision. The size of the action space varies with the number of available nodes plus one waiting action, as shown in our implementation on an 800-node cluster (Section 4.4).

As shown in Table 2, PPO has become a widely adopted baseline due to its robust performance and stable training characteristics. By constraining policy updates with a clipped surrogate objective, PPO mitigates the instability issues common in other policy gradient methods.

DRL-based schedulers can eliminate the need for manually designed heuristic methods by continuously interacting with the environment to learn optimal decision strategies [3,41]. Algorithms such as PPO, A2C, and DQL have demonstrated significant effectiveness in reducing task delays and improving resource utilization [42,43] and have been extended to multi-objective optimization scenarios. A recent review paper [44] points out that modeling long-term temporal relationships and rewards can further improve DRL scheduling performance.

Scheduling problems are typically modeled as multi-state multi-objective reinforcement learning (MDP) problems, where agents observe system states, select scheduling actions, receive feedback (rewards), and update their policies accordingly [45,46]. The environment state includes task attributes and system resource metrics, while the reward function is designed to incentivize desirable behaviors, such as maximizing throughput, minimizing delay, and balancing load.

2.3. Application of Transformer Mechanisms in Scheduling

The self-attention mechanism of the Transformer architecture is widely used in scheduling tasks because it can effectively model long-term sequential dependencies [47]. Unlike RNN-based models, Transformers support parallel processing of sequence elements, significantly accelerating training speed [48]. Additionally, their global attention structure can model complex cross-task temporal relationships and provide better interpretability through attention heatmaps.

In the field of HPC scheduling, Transformer-based models offer promising solutions for adapting to dynamic and heterogeneous workloads. Their flexibility allows for encoding diverse task features and system states, while their scalability supports efficient deployment in large-scale environments. Therefore, combining Transformer mechanisms with DRL strategies is emerging as a promising direction for the next generation of intelligent schedulers.

3. Methodology

3.1. Problem Statement

In HPC task scheduling problems, the scheduler manages a dynamic task queue, where each task has specific resource requirements and execution time. The primary objectives of resource scheduling are to optimize resource allocation, minimize task waiting time, improve resource utilization, and maximize system throughput. In this study, GTrXL-SPPO models the scheduler as an intelligent agent that determines the optimal allocation of idle computing nodes to tasks in the waiting queue, as shown in Figure 1. The intelligent agent retrieves job characteristics and resource node states from the environment and concatenates this information into a unified state vector. Within the Markov Decision Process (MDP) framework, the state space and action space are systematically modeled to support intelligent scheduling decisions. Through continuous interaction with the environment, the intelligent agent learns the optimal scheduling strategy.

Environment Module (upper part): Represents a high-performance computing (HPC) multi-resource scheduling system, consisting of four core components: (1) Job Resource Information Management Module, responsible for maintaining dynamic job queues; (2) Window-based Resource Allocation Mechanism, used for resource scheduling; (3) GTrXL History Buffer, storing past scheduling decisions and system states; (4) Multi-step Feedback Mechanism, providing real-time system state updates. Agent Module (Lower Part): Implements an actor–critic architecture, where the actor network processes state information through the GTrXL layer (handling long-term dependencies) and the SE layer (enhancing features) to output scheduling strategies. The critic network evaluates state values to optimize strategies. Reward feedback evaluates scheduling effectiveness (resource utilization, waiting time), state feedback updates system conditions, and historical feedback enables experience-based learning through buffers. Arrows indicate the direction of data flow between components.

3.1.1. MDP Formulation

We model the HPC scheduling problem as an MDP: At each step, the agent observes the system state

s_{t}

and selects an action

a_{t}

to assign tasks to idle nodes.

To handle resource constraints and priority requirements in HPC environments, we extend the basic MDP to a constrained Markov Decision Process (CMDP), which can be formally defined as a tuple

〈 S, A, P, R, C, γ 〉

, where:

$S$ is the state space, representing all possible system configurations
$A$ is the action space, representing all possible job-to-node assignments
$P : S \times A \times S \to [0, 1]$ is the transition probability function
$R : S \times A \times S \to R$ is the reward function
$C : S \times A \times S \to R^{m}$ is a vector of m constraint functions
$γ \in [0, 1)$ is the discount factor

The optimization objective is to find a policy

π^{*}

that maximizes the expected cumulative discounted reward while satisfying the constraints:

\begin{matrix} π^{*} = arg max_{π} E_{τ \sim π} [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t}, s_{t + 1})] \end{matrix}

(1)

subject to:

\begin{matrix} E_{τ \sim π} [\sum_{t = 0}^{\infty} γ^{t} C_{i} (s_{t}, a_{t}, s_{t + 1})] \leq b_{i}, \forall i \in {1, 2, \dots, m} \end{matrix}

(2)

where

τ = (s_{0}, a_{0}, s_{1}, a_{1}, \dots)

is a trajectory, and

b_{i}

are threshold constants for each constraint.

3.1.2. State and Action Space

The system state

s_{t}

contains comprehensive job and system information with a hierarchical structure. At the high level, we organize state information into functional modules:

s_{t} = [J_{t}, W_{t}, R_{t}, h_{t}]

where

J_{t}

represents the job feature vector including the number of nodes requested by job i (reqProc), the runtime requested by job i (reqTime), and the waiting time

w_{i}

(currentTime − submitTime).

W_{t}

denotes the waiting time vector across all queued jobs,

R_{t}

represents the current resource state vector indicating node availability and utilization, and

h_{t}

contains the historical information buffer that tracks past scheduling decisions and system performance.

At decision step t, the implementation-level state representation

ϕ_{t}

reflects the real-time status of system resources and pending tasks. It contains key information such as CPU, GPU, and memory usage, as well as task-specific attributes including resource requests, estimated execution time, and waiting time. Formally, the state space can be modeled as a combination of task windows and node pools. For each waiting task i,

{req}_{i}

denotes the number of requested resource cores,

w_{i}

denotes the waiting time, defined as

w_{i} = t_{current} - t_{arrival}^{(i)}

,

p_{i}

denotes the assigned priority, and

e_{i}

denotes the estimated execution time. For each compute node k,

n_{k}

denotes the node’s available state, and

r_{k}

denotes the remaining execution time of the current task on node k. The detailed state representation at time step t can be represented as:

ϕ_{t} = ⋃_{i = 1}^{n} [{req}_{i}, w_{i}, p_{i}, e_{i}] ⨁ ⋃_{k = 1}^{N} [n_{k}, r_{k}]

(3)

where n denotes the number of tasks to be processed, and N denotes the total number of resource nodes.

The action space is implemented as a discrete space whose size is equal to the window size, formally defined as

A = {0, 1, 2, \dots, window_size - 1}

. Each action

a_{t} \in A

represents an index in the current waiting queue. The agent selects this index to determine the task to be scheduled with the highest priority. The action mechanism is implemented through queue reordering: when action

a_{t} = i

is selected, the task with index i in the waiting queue is moved to the front of the queue, thereby obtaining the highest scheduling priority. In practical applications, when the waiting queue size is smaller than the window size, the effective action space is dynamically restricted to

A_{eff} = {0, 1, 2, \dots, min (window_size, | wait_queue |) - 1}

. This discrete action modeling solves the task selection problem while maintaining computational feasibility, with the size of the action space varying with the observation window size.

This hierarchical design enables the agent to comprehensively observe the characteristics of the task queue and the real-time status of the resource pool, providing sufficient information for making scheduling decisions within the CMDP framework.

3.1.3. Decision Variables and Constraints

The scheduling decision at each time step can be represented by the following decision variables:

$x_{j, n, t} \in {0, 1}$ : Binary variable indicating whether job j is assigned to node n at time step t
$y_{j, t} \in {0, 1}$ : Binary variable indicating whether job j starts execution at time step t

These decision variables are subject to the following constraints:

Resource Capacity Constraint: Each node can be assigned to at most one job at any time

$\begin{matrix} \sum_{j \in J} x_{j, n, t} \leq 1, \forall n \in N, \forall t \end{matrix}$

(4)
Job Completeness Constraint: A job must receive all its requested resources or none at all

$\begin{matrix} \sum_{n \in N} x_{j, n, t} = {req}_{j} \cdot y_{j, t}, \forall j \in J, \forall t \end{matrix}$

(5)
Non-preemption Constraint: Once a job starts execution, it must run to completion

$\begin{matrix} y_{j, t} = 1 \Rightarrow \forall t^{'} \in [t, t + d_{j}), \sum_{n \in N} x_{j, n, t^{'}} = {req}_{j} \end{matrix}$

(6)
Priority Constraint: Higher priority jobs with long waiting times should be scheduled before lower priority jobs

$\begin{matrix} (p_{j} > p_{j^{'}}) \land (w_{j} > w_{threshold}) \Rightarrow schedule_priority (j) > schedule_priority (j^{'}) \end{matrix}$

(7)

In practice, the resource capacity and job completeness constraints are enforced through action space design and action masking, while the non-preemption and priority constraints are encouraged through reward shaping.

3.2. Neural Architecture Design

At the upper decision layer, the agent evaluates jobs in the waiting queue and decides whether to execute or defer their scheduling, ensuring fair resource allocation through priority adjustment and reservation strategies. At the lower decision layer, the agent maximizes resource utilization by filling idle nodes with smaller jobs. This hierarchical decision structure enhances the flexibility and robustness of the scheduler, ensuring fairness for large jobs while improving overall resource efficiency under dynamic load conditions. The overall neural network architecture supporting this scheduling strategy is illustrated in Figure 2, detailing input feature processing, SE-based feature enhancement, GTrXL sequence modeling, and policy/value output branches. The system-level scheduling process (environment, job queue, reservation mechanism, agent decisions) was previously depicted in Figure 1. Here, Figure 2 focuses specifically on the internal design of the neural network responsible for decision-making.

3.3. Multi-Scale Feature Enhancement

To address the challenge of dynamically allocating weights for heterogeneous tasks and system features in HPC environments, especially in scenarios with burst workloads and resource competition, we enhance traditional reinforcement learning by introducing an adaptive feature recalibration mechanism. Specifically, the SE module [20] combines average pooling with a bottleneck multi-layer perceptron (channel → channel/4 → channel) and a Sigmoid activation function to capture global temporal relationships between feature channels, effectively suppressing irrelevant features while amplifying key features. The computational overhead is low; for example, adding SE to ResNet-50 increases parameters by only 10% but significantly improves accuracy [20].

In our architecture, the SE block is placed before the GTrXL layer in both the actor network and critic network, enabling temporal modeling to operate on the enhanced features.

To extend adaptability beyond static representations, we propose a novel SECT layer that fuses global statistical information and scheduling-aware dynamics by combining average and max aggregation Algorithm 1.

\begin{matrix} z_{fusion} & = \frac{1}{2} (mean (x) + \max (x)) \end{matrix}

(8)

\begin{matrix} w_{base} & = σ (W_{2} \cdot Gelu (W_{1} \cdot z_{fusion})) \end{matrix}

(9)

\begin{matrix} w_{sched} & = σ (W_{4} \cdot Dropout (Gelu (W_{3} \cdot mean (x)))) \end{matrix}

(10)

\begin{matrix} w_{final} & = \frac{w_{base} + w_{sched}}{2}, x_{out} = x \otimes w_{final} \end{matrix}

(11)

This design adapts channel weights to workload volatility, enhancing robustness without significant overhead [20].

Algorithm 1 SECT Layer

Require:: Input tensor $X \in R^{B \times L \times C}$ , Channel dimension C, Reduction ratio $r = 4$
Ensure:: Output tensor $Y \in R^{B \times L \times C}$
1:: function SECT( $X$ )
2:: $B, L, C \leftarrow shape (X)$
3:: $s \leftarrow \frac{1}{L} \sum_{i = 1}^{L} X_{:, i, :}$ ▹ Global Average Pooling
4:: $z_{1} \leftarrow Linear (s, C / r)$ ▹ First fully connected layer
5:: $z_{2} \leftarrow ReLU (z_{1})$
6:: $z_{3} \leftarrow Linear (z_{2}, C)$ ▹ Restore channel dimension
7:: $a \leftarrow Sigmoid (z_{3})$
8:: $Y \leftarrow X ⊙ a . unsqueeze (1)$
9:: return $Y$
10:: end function

Mathematical Variable Definitions:

B: Batch size representing the number of parallel scheduling instances
L: Sequence length corresponding to the number of jobs/nodes in the current scheduling window
C: Channel dimension representing the feature dimension of each job/node embedding
r: Reduction ratio controlling the bottleneck dimension ( $C / r$ ) in the excitation pathway
$s \in R^{B \times C}$ : Global statistics vector obtained through temporal average pooling across the sequence dimension
$z_{1} \in R^{B \times C / r}$ : Compressed feature representation after dimensionality reduction
$z_{2} \in R^{B \times C / r}$ : Activated features after ReLU non-linearity
$z_{3} \in R^{B \times C}$ : Restored feature dimension after the second linear transformation
$a \in R^{B \times C}$ : Channel-wise attention weights normalized to $[0, 1]$ range via sigmoid activation
$a . unsqueeze (1)$ : Expands attention weights to shape $R^{B \times 1 \times C}$ for broadcasting across sequence dimension
⊙: Element-wise multiplication (Hadamard product) applying channel attention to input features

The SECT mechanism learns to emphasize important scheduling-relevant channels while suppressing less informative features through this squeeze-and-excitation operation.

3.4. Dynamic Memory Mechanism

In high-performance computing (HPC) scheduling, task queues with heterogeneous arrival times and different resource requirements can lead to complex timing patterns. This O(n²) complexity stems from the self-attention mechanism that computes pairwise interactions between all task positions [17], and the fixed context window constraint leads to performance degradation when exceeding a certain threshold [49].

3.4.1. GTrXL: Enhanced Memory Architecture

To address these empirically validated limitations, we propose GTrXL (Gated Transformer-XL), an enhanced Transformer architecture specifically designed for long-term temporal modeling in HPC scheduling scenarios. GTrXL builds upon the segment-level recursive mechanism [49] to model dependencies beyond fixed-length context without compromising temporal consistency. GTrXL supports a relaxed Markov approximation where

π (a_{t} | s_{t})

is determined by the most recent sequence of states

[s_{t - k}, \dots, s_{t}]

. Unlike standard Markov models that only consider immediate states, our method preserves contextual information over extended time spans [50,51]. To maintain temporal consistency between scheduled segments, GTrXL uses relative position encoding instead of absolute position encoding. This enables generalization to sequences longer than training sequences—critical for handling variable-length task queues in HPC workloads. A GRU-based gating system dynamically regulates information flow within each Transformer block, selectively preserving relevant sequential information while filtering outdated inputs as Algorithm 2.

Algorithm 2 Gated Transformer-XL Core Module in GTrXL-SPPO

Require:: State input $X$ , Temporal memory $M$ , Attention mask mask
Ensure:: Output feature $Y$ , Updated memory $M_{new}$
1:: function GTrXLBlock( $X, M, mask$ )
2:: $X^{'} \leftarrow LayerNorm (X)$ ▹ Input
3:: $M^{'} \leftarrow LayerNorm (M)$ ▹ Memory
4:: $A \leftarrow MultiHeadAttention (X^{'}, M^{'}, M^{'}, mask)$
5:: $r \leftarrow σ (W_{r} A + U_{r} X)$ ▹ Reset gate computation
6:: $z \leftarrow σ (W_{z} A + U_{z} X - b_{g})$ ▹ Update gate
7:: $\tilde{h} \leftarrow tanh (W_{g} A + U_{g} (r ⊙ X))$
8:: $H_{1} \leftarrow (1 - z) ⊙ X + z ⊙ \tilde{h}$ ▹ 1st GRU output
9:: $F \leftarrow FeedForward (LayerNorm (H_{1}))$
10:: $r_{2} \leftarrow σ (W_{r} F + U_{r} H_{1})$ ▹ Reset gate (2nd GRU)
11:: $z_{2} \leftarrow σ (W_{z} F + U_{z} H_{1} - b_{g})$ ▹ Update gate (2nd GRU)
12:: ${\tilde{h}}_{2} \leftarrow tanh (W_{g} F + U_{g} (r_{2} ⊙ H_{1}))$
13:: $Y \leftarrow (1 - z_{2}) ⊙ H_{1} + z_{2} ⊙ {\tilde{h}}_{2}$
14:: return $Y$
15:: end function

3.4.2. Temporal Dependency Modeling

GTrXL models the temporal execution order by combining multi-head self-attention with relative position encoding. Each attention head computes queries, keys, and values:

\begin{matrix} \{\begin{matrix} Q = X W_{Q} \in R^{L \times d} \\ K = X W_{K} \in R^{L \times d} \\ V = X W_{V} \in R^{L \times d} \end{matrix} \end{matrix}

(12)

Relative position encodings capture execution order relationships:

\begin{matrix} R_{i, j} = \frac{1}{\sqrt{d}} cos ((i - j) \cdot ω) \end{matrix}

(13)

\begin{matrix} {AttnScore}_{i, j} = \frac{Q_{i} K_{j}^{T} - Q_{i} R_{i - j}^{T}}{\sqrt{d_{h}}} \end{matrix}

(14)

This design allows the agent to recognize execution-order correlations critical for effective scheduling decisions.

The GTrXL architecture employs a two-layer gating strategy within each Transformer block to adaptively control temporal context updates. The high-level operations are:

\begin{matrix} A = MHA (Q, K, V), h = {GRU}_{1} (Q, A) \end{matrix}

(15)

\begin{matrix} F = FFN (h), O = {GRU}_{2} (h, F) \end{matrix}

(16)

This gated structure enables selective temporal state updates, allowing rapid integration of critical scheduling signals (e.g., high-priority job arrivals) while maintaining stability under dynamic workloads. The complete memory flow and compression pipeline are illustrated in Figure 3.

Attention Weight Matrices:

$X \in R^{L \times d}$ : Input sequence tensor where L is sequence length and d is model dimension
$W_{Q} \in R^{d \times d}$ : Query projection matrix mapping input to query space
$W_{K} \in R^{d \times d}$ : Key projection matrix mapping input to key space
$W_{V} \in R^{d \times d}$ : Value projection matrix mapping input to value space
$Q, K, V \in R^{L \times d}$ : Query, key, and value matrices for attention computation

These projection matrices are learned parameters that transform the input scheduling context into query, key, and value representations for multi-head attention.

Where

ω \in R

is the frequency parameter that controls the rate of positional variation (typically initialized as

ω = 1 / 10,000

), and

d_{h} = d / H

represents the dimension per attention head, with H being the total number of attention heads and d being the model dimension. The relative position encoding

R_{i, j}

captures the temporal distance between positions i and j in the scheduling sequence, enabling the model to understand execution order dependencies.

3.4.3. Implementation Details and Training Integration

The GRU-based gating mechanism within each GTrXL block follows standard gated recurrent unit formulations. For each gating layer, the operations are defined as:

\begin{matrix} r_{t} & = σ (W_{r} K_{t} + U_{r} Q_{t}) \end{matrix}

(17)

\begin{matrix} z_{t} & = σ (W_{z} K_{t} + U_{z} Q_{t} - b_{g}) \end{matrix}

(18)

\begin{matrix} {\tilde{h}}_{t} & = tanh (W_{g} K_{t} + U_{g} (r_{t} ⊙ Q_{t})) \end{matrix}

(19)

\begin{matrix} h_{t} & = (1 - z_{t}) ⊙ Q_{t} + z_{t} ⊙ {\tilde{h}}_{t} \end{matrix}

(20)

GRU Gating Parameters:

$r_{t} \in R^{d}$ : Reset gate vector controlling how much past information to forget
$z_{t} \in R^{d}$ : Update gate vector determining the balance between past and new information
${\tilde{h}}_{t} \in R^{d}$ : Candidate hidden state representing new information
$h_{t} \in R^{d}$ : Final output combining past and candidate states
$Q_{t}, K_{t} \in R^{d}$ : Input vectors encoding scheduling context (runtime estimates, resource status)
$W_{r}, W_{z}, W_{g} \in R^{d \times d}$ : Input-to-hidden weight matrices for reset, update, and candidate gates
$U_{r}, U_{z}, U_{g} \in R^{d \times d}$ : Hidden-to-hidden recurrent weight matrices
$b_{g} \in R^{d}$ : Gate bias vector (typically initialized to 0.1 for stable training)
$σ (\cdot)$ : Sigmoid activation function mapping values to $(0, 1)$ range
tanh $(\cdot)$ : Hyperbolic tangent activation function mapping values to $(- 1, 1)$ range
⊙: Element-wise multiplication (Hadamard product)

The weight matrices are randomly initialized using Xavier/Glorot initialization and learned during training through backpropagation.

When critical job characteristics (e.g., high-priority arrivals) emerge, the gates adaptively adjust, facilitating fast integration of new signals (Algorithm 3) and improving responsiveness under dynamic workloads.

Algorithm 3 GTrXL-SPPO collaborative processing framework

Require:: Input state $X \in R^{B \times L \times 1 \times 2}$ , Memory $M$ , Mask mask
Ensure:: Actor output $π (a | s)$ , Critic output $V (s)$
1:: function CollaborativeProcessing( $X, M, mask$ )
2:: if $X . \dim () = = 4$ then
3:: $X \leftarrow X . squeeze (2)$
4:: end if
5:: $X_{flat} \leftarrow X . view (- 1, 2)$
6:: $X_{fused} \leftarrow ReLU ({Linear}_{2 \to 1} (X_{flat}))$
7:: $X_{seq} \leftarrow X_{fused} . view (1, - 1)$
8:: $H_{1} \leftarrow ReLU (Linear (X_{seq}))$
9:: $Y \leftarrow SEC T (H_{1})$ ▹ Apply Algorithm 1
10:: $H_{2} \leftarrow ReLU (Linear (Y))$
11:: if $H_{2} . \dim () = = 1$ then
12:: $H_{input} \leftarrow H_{2} . unsqueeze (0)$
13:: else
14:: $H_{input} \leftarrow H_{2}$
15:: end if
16:: $H_{final} \leftarrow GTrXLBlock (H_{input}, M, mask)$ ▹ Apply Algorithm 2
17:: $H_{output} \leftarrow H_{final} . squeeze (0)$
18:: ${logits}_{actor} \leftarrow {Linear}_{action} (H_{output})$
19:: $π (a | s) \leftarrow Softmax ({logits}_{actor})$
20:: $V (s) \leftarrow {Linear}_{value} (H_{output})$
21:: return $π (a | s), V (s)$
22:: end function

GTrXL-SPPO Collaborative Integration Analysis:

The collaborative framework synergistically combines SECT (Algorithm 1) and GTrXL (Algorithm 2) to maximize their complementary advantages in HPC scheduling:

Processing Pipeline:

Input Standardization: Consolidates dual-channel HPC state information through 2→1 linear transformation
Feature Enhancement: Applies SECT channel attention to amplify scheduling-relevant features
Temporal Modeling: Leverages GTrXL gated Transformer blocks for long-range dependency capture
Shared Dual Output: Both actor and critic networks utilize identical enhanced representations

GTrXL-SPPO collaboration enhances feature quality through SECT preprocessing, improves temporal modeling through GTrXL processing of refined features, achieves computational efficiency through a shared pipeline architecture, and enables memory-aware processing that preserves critical scheduling history across sessions.

3.5. Training Strategy

3.5.1. Algorithm Design

GTrXL-SPPO employs the PPO algorithm to update scheduling policies via gradient ascent, combining policy- and value-based reinforcement learning for stable convergence [21]. The enhanced GTrXL serves as the temporal backbone to capture sequential scheduling dynamics in HPC environments.

The clipped surrogate objective is:

\begin{matrix} L^{clip} (θ) & = E_{t} [min (r_{t} (θ) {\hat{A}}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})] \end{matrix}

(21)

where

r_{t} (θ) = \frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{old}} (a_{t} | s_{t})}

,

ϵ = 0.2

, and

{\hat{A}}_{t}

is computed via Generalized Advantage Estimation (GAE):

\begin{matrix} {\hat{A}}_{t} & = \sum_{l = 0}^{T - t} {(γ \cdot λ)}^{l} \cdot δ_{t + l} \end{matrix}

(22)

\begin{matrix} δ_{t} = r_{t} + γ \cdot V_{ϕ} (s_{t + 1}) - V_{ϕ} (s_{t}) \end{matrix}

(23)

To address the long-tail reward distribution induced by dynamic job arrivals and evolving system states, we adopt an exponentially sensitive clipping strategy and introduce a time-decaying discount factor

γ_{t} = 1 - \frac{t}{T}

, reducing the weight of late-stage decisions and stabilizing learning.

3.5.2. Constraint Handling in Training

To incorporate constraints into the reinforcement learning framework, we adopt a dual approach:

Hard constraints: Resource capacity and task completion constraints are implemented through action space design and action masking:

$\begin{matrix} A_{t} = {a \in A ∣ a satisfies constraints 1 and 2} \end{matrix}$

(24)
Soft constraints: Non-preemption and priority constraints are encouraged through reward shaping:

$\begin{matrix} R_{t}^{constraint} = - μ_{1} \cdot preemption_violations - μ_{2} \cdot priority_violations \end{matrix}$

(25)

This hybrid approach ensures that all scheduling decisions are feasible while guiding the strategy toward developing solutions for more complex constraints. The constrained optimization problem is thus transformed into an unconstrained problem with a modified reward function:

\begin{matrix} R_{t}^{modified} = R_{t} + R_{t}^{constraint} \end{matrix}

(26)

Implementation Scope Clarification: While our CMDP framework supports constraint handling as described above, the experimental validation in this work focuses on core scheduling efficiency to establish baseline performance. Full constraint integration (deadlines, energy, thermal limits) represents ongoing development work and will be addressed in future extensions of this framework.

3.5.3. Reward Function Design

We formulate the reward function as a multi-objective optimization problem. The total reward

R_{total}

in GTrXL-SPPO integrates four scheduling objectives and a queue pressure regularization term:

\begin{matrix} R_{total} & = \sum_{k = 1}^{4} ω_{k} \cdot r_{k} + α_{reg} \cdot r_{pressure} \end{matrix}

(27)

Each component reward is mathematically defined as follows:

System Utilization Reward:

$\begin{matrix} r_{1} = σ (10 \cdot (u_{t} - 0.5)) \end{matrix}$

(28)

where $u_{t} = \frac{{occupied_nodes}_{t}}{total_nodes}$ is the system utilization at time t, and $σ$ is the sigmoid function.
Waiting Time Penalty:

$\begin{matrix} r_{2} = - Huber (w_{t} - w_{target}) \end{matrix}$

(29)

where $w_{t}$ is the average waiting time of jobs in the queue at time t, $w_{target}$ is the target waiting time, and Huber is the Huber loss function that reduces sensitivity to outliers.
Throughput Reward:

$\begin{matrix} r_{3} = α \cdot {jobs_completed}_{t} \end{matrix}$

(30)

where ${jobs_completed}_{t}$ is the number of jobs completed at time t, and $α$ is a scaling factor.
Priority Reward:

$\begin{matrix} r_{4} = β \cdot \sum_{j \in {scheduled}_{t}} p_{j} \end{matrix}$

(31)

where ${scheduled}_{t}$ is the set of jobs scheduled at time t, $p_{j}$ is the priority of job j, and $β$ is a scaling factor.

The queue pressure regularization term penalizes the accumulation of jobs in the waiting queue:

\begin{matrix} r_{pressure} = - log (1 + {queue_length}_{t}) \end{matrix}

(32)

where

α_{reg} = 0.5

is the regularization coefficient. The adaptive temperature parameter T controls the exploration–exploitation balance in the reward weighting mechanism:

\begin{matrix} ω_{k}^{(T)} = \frac{exp (ω_{k} / T)}{\sum_{j = 1}^{4} exp (ω_{j} / T)} \end{matrix}

(33)

This adaptive design enables the agent to switch its focus between efficiency and fairness under different load conditions. As shown in our parameter sensitivity analysis (Section 5.5), this dynamic temperature adjustment is crucial for balancing multiple scheduling objectives.

When

T = 1.0

(default), the model achieves optimal performance across all metrics; during periods of low utilization, lower values (

T = 0.5

) prioritize resource efficiency; and during periods of high queue pressure, higher values (

T = 2.0

) promote fairness through more balanced target weight allocation. In long-tail scheduling environments [52,53], our reward scheme accelerates convergence and improves the resource-delay Pareto frontier compared to standard PPO training.

3.6. Training Acceleration Strategies

To accelerate training convergence and improve efficiency, the GTrXL-SPPO framework adopts a comprehensive optimization strategy that combines distributed parallel training, memory replay technology, and hybrid optimization methods.

(1) Distributed Training: Parallel processing significantly accelerates the training process compared to a single GPU setup [54]. We use PyTorch to implement DDP training on multiple GPUs. The training dataset is divided into multiple shards, each assigned to a GPU device for independent gradient computation:

\begin{matrix} D_{k} = {(s_{t}^{i}, a_{t}^{i}, r_{t}^{i}) ∣ i mod N = k}, k = 0, 1, \dots, N - 1 \end{matrix}

(34)

where N denotes the number of available GPUs, and

D_{k}

denotes the data subset processed by GPU k. After local gradient updates, the AllReduce operation synchronizes gradients across all devices.

(2) Memory replay mechanism: To better capture long-term temporal patterns in scheduling dynamics, we introduce a memory replay mechanism [19,49]. Specifically, historical hidden states and intermediate features across time steps are stored in a recursive memory matrix. Combined with the GRU gating structure, this design dynamically updates memory content during training, enhancing the agent’s ability to handle time-extended task sequences.

(3) Hybrid Optimization Strategy: In the policy optimization phase, we integrate the Adam optimizer with learning rate warm-up and cosine annealing strategies to ensure a smoother training process and more stable model convergence. Additionally, to reduce communication overhead in multi-GPU environments, we apply 1-bit Adam compression, where gradients are compressed before synchronization:

\begin{matrix} {\tilde{g}}_{t} = sign (g_{t}) ⊙ \sqrt{| g_{t} |}, g_{t} = \nabla_{θ} L (θ) \end{matrix}

(35)

This strategy significantly reduces synchronization communication while maintaining model convergence performance [55].

3.7. Scalability and Deployment Considerations

Facing practical deployment issues, we analyze the ability of GTrXL-SPPO to address three key challenges: node scaling, heterogeneity, and architecture migration.

3.7.1. Node Scaling Analysis

Our sliding window mechanism (Figure 4) ensures that the memory complexity is

O (W)

rather than

O (N)

, where W is the fixed window size (200) and N is the total number of nodes. Combined with the attention aggregation mechanism in GTrXL, this design supports scalability without triggering linear memory growth.

3.7.2. Heterogeneous Adaptation

The SECT module (Section 3.3) supports heterogeneous environments through dynamic feature recalibration. Our experiments on ANL-Intrepid (grid architecture), Alibaba (general-purpose cluster), SDSC-SP2 (homogeneous), and PIK-IPLEX (heterogeneous) show that this module can achieve effective adaptation under different cluster configurations without architecture-specific modifications (Section 5).

3.7.3. Architecture Migration

The modular design supports migration through the following mechanisms:

Configuration-specific model loading: The framework adapts to clusters of different scales by loading pre-trained models for different system configurations.
Distributed training support: Multi-GPU training capabilities are implemented through PyTorch’s distributed training framework, supporting scalable training on different hardware architectures.
General scheduling pattern learning: The GTrXL backbone network learns a general scheduling pattern that can be transferred between different computing architectures without architecture-specific retraining.

Our multi-dataset validation demonstrates cross-architecture generalization capabilities, indicating that the framework can successfully adapt to different computing scales and resource configurations without fundamental algorithmic changes.

3.8. Demonstrative Iteration of the Proposed Algorithm

To further clarify the practical application of the proposed GTrXL-SPPO framework, we present a step-by-step example of a single scheduling iteration, highlighting the use of the SECT and GTrXL modules.

Example Setup: Suppose the job queue and resource state are as follows (in Table 3 and Table 4):

Step 1: State Vector Construction

The state vector

s_{t}

is constructed as:

s_{t} = {[1, 10, 2, 30], [2, 5, 1, 20], [1, 15, 3, 40]} \oplus {[1, 0], [0, 10]}

Step 2: Feature Enhancement via SECT Layer

The input tensor

X

is processed by the SECT layer (Algorithm 1). This adaptively recalibrates channel-wise features.

Step 3: Temporal Modeling via GTrXL Block

The enhanced features are passed through the GTrXL block (Algorithm 2), which models temporal dependencies and gating.

Step 4: Action Selection

The policy network outputs action probabilities, e.g.,

π (a | s_{t}) = [0.2, 0.5, 0.3]

The agent selects job B (index 2) for scheduling.

Step 5: Reward Calculation

After scheduling, the reward is computed as:

R_{total} = \sum_{k = 1}^{4} ω_{k} \cdot r_{k} + α_{reg} \cdot r_{pressure}

where

r_{1}

is system utilization,

r_{2}

is the waiting penalty, etc.

Step 6: Policy Update

The PPO loss is calculated and the policy is updated accordingly.

This example demonstrates how the proposed GTrXL-SPPO framework processes input features, models temporal dependencies, selects actions, and updates the policy in a single scheduling iteration.

Our approach significantly improves scheduling efficiency by optimizing both resource utilization and job waiting times through the novel GTrXL-SPPO framework.

4. Experiments Setup

4.1. Experimental Configuration and Reproducibility

To ensure full reproducibility of the experiments, we provide key experimental settings and evaluation methods.

4.1.1. Dataset Configuration and Partitioning

For each dataset (ANL-Intrepid, Alibaba, SDSC-SP2, PIK-IPLEX), we adopt a time division: training period (first 80% of task submissions) and testing period (last 20% of task submissions), with no time overlap to ensure the authenticity of the evaluation. Data preprocessing includes z-score normalization and normalization to the range [0,1].

4.1.2. Model and Training Configuration

GTrXL-SPPO employs a 12-layer Transformer architecture (12 attention heads, 768-dimensional hidden layer) combined with the SECT attention module (reduction ratio 16) and the PPO reinforcement learning algorithm. Key training parameters include: Actor learning rate

1 \times 10^{- 4}

, Critic learning rate

1 \times 10^{- 3}

, batch size 128, window size 200, PPO clipping ratio 0.2, GAE

λ

0.95, and total training rounds 5000. All experiments are implemented on an NVIDIA V100 GPU cluster (4 × 32 GB) based on PyTorch 1.12.0, with 30 independent random seeds (42–71) to ensure statistical reliability. Complete parameter details are provided in Appendix A.

4.1.3. Evaluation Protocol

Performance evaluation was conducted every 100 training rounds, with results recorded after a 500-round warm-up period. All results are reported as the mean, standard deviation, and 95% confidence interval of 30 independent runs, with statistical significance verified via paired t-tests (p < 0.001).

4.2. Baseline Algorithms

We evaluate GTrXL-SPPO against two categories of baselines:

Rule-based algorithms: FCFS, SJF, and EASY, which follow fixed heuristics but lack adaptability to dynamic workloads.
Reinforcement learning methods: Among RL approaches, PPO is selected due to its proven performance in HPC scheduling [3].

4.3. Workload Trajectories

We evaluate GTrXL-SPPO using four representative real-world workload trajectories: Alibaba [56], ANL-Intrepid [57], SDSC-SP2 [58], and PIK-IPLEX [59]. These datasets span multiple application domains, including cloud services, academic research, and scientific computing, and exhibit significant differences in job characteristics and resource requirements (see Table 5). The job arrival distributions and resource request distributions for each dataset are shown in Figure 5 and Figure 6. This diversity provides a foundation for evaluating the generalization capability of our method. Detailed experimental results are presented in Section 5.

4.4. GTrXL-SPPO Training

We implement GTrXL-SPPO in CQSim, a widely used HPC scheduling simulator. Real workload traces (Section 4.2) are fed sequentially, and the agent schedules jobs using a sliding window mechanism (see Figure 4). All code is implemented in PyTorch. Table 6 summarizes the parameter configurations under different workloads. The model receives state inputs constructed by sliding windows (e.g., [2,200] for the Alibaba workload) and outputs discrete probabilities corresponding to node allocation decisions (e.g., 801 candidate actions representing 800 node assignments plus one wait decision). The Transformer backbone (12 layers, 576 dimensions, 12 heads) accounts for nearly half of the total parameters, with the model scaling from 136M to 160M parameters across workloads.

To further illustrate training effectiveness, Figure 7 presents the progression of total reward across all datasets. GTrXL-SPPO consistently achieves higher final reward values and faster early-stage gains compared to baselines such as FIFO, SJF, and vanilla RL. Notably, the reward curve converges more smoothly under diverse workload dynamics, confirming the model’s robustness and training stability across heterogeneous environments.

Training objectives include utilization, wait time, and throughput. We use a linear learning rate decay (initially

5 \times 10^{- 5}

), curriculum-based load escalation, and sample preservation to enhance generalization. As shown in Figure 7, GTrXL-SPPO consistently surpasses baselines under all workload scenarios, demonstrating faster convergence and achieving significantly higher total rewards, while static algorithms (FIFO and SJF) exhibit fixed reward levels (represented by horizontal lines). Evaluation covers both real and synthetic traces (see Section 5.1).

4.5. Evaluation Metrics

To comprehensively assess the effectiveness of the proposed scheduling strategy, we employ a set of user-level and system-level evaluation metrics, as summarized below:

System Utilization: The proportion of time during which system resources are actively utilized.
Throughput: The number of jobs completed per unit time.
Average Wait Time: The mean time a job spends waiting from submission to execution.
Job Completion Time: The total time required from job submission to completion.

Among these, System Utilization and Throughput are system-level metrics, while Average Wait Time and Job Completion Time are user-level metrics. Together, they provide a multi-dimensional evaluation framework, enabling a comprehensive analysis of scheduling strategies in terms of system efficiency, job responsiveness, and resource optimization. This facilitates the identification and selection of strategies best suited to specific HPC workload scenarios.

5. Case Study

Recent studies have adopted a sampling-real-synthetic data paradigm to construct training workloads for reinforcement learning-based HPC schedulers [3,60]. Building on this foundation, we conduct our experiments using CQGym, a widely adopted simulation platform for job scheduling research [3,61].

5.1. System Performance Evaluation

5.1.1. System Utilization and Waiting Time

As shown in Figure 8a, system utilization frequently approaches 100%, reflecting the model’s effectiveness in maximizing resource usage under varying load conditions. Additionally, Figure 8b demonstrates that our approach significantly reduces job scheduling waiting times during both training and testing phases, further highlighting the efficiency and responsiveness of the proposed method.

5.1.2. Overall Performance Evaluation

Figure 7 shows the total reward achieved by GTrXL-SPPO and baseline methods across different workloads, demonstrating the superior learning capability and convergence speed of our approach.

5.1.3. Comparative Analysis with PPO

As shown in Figure 9 and Figure 10, GTrXL-SPPO demonstrates superior performance across all evaluation metrics. The results on simulated workloads validate the algorithm’s effectiveness, while real-world workloads confirm its excellent generalization capability: GTrXL-SPPO achieves higher resource utilization, larger cumulative rewards, and lower task waiting times, demonstrating robust scheduling performance under diverse workload patterns.

5.2. Ablation Study and Component Analysis

To systematically evaluate the contributions of each component in our proposed GTrXL-SPPO framework, we conducted comprehensive ablation experiments on multiple datasets (ANL-Intrepid, SDSC-SP2, and Alibaba).

5.2.1. Experimental Design

We designed five model variants to evaluate different components:

GTrXL-SPPO (Full): The complete model with all components.
GTrXL-PPO (No SE): Removes the squeeze-and-excitation modules to assess their contribution to feature enhancement.
SPPO (No GTrXL): Replaces the Gated Transformer-XL with a standard feedforward network to assess the impact of temporal modeling.
TrXL-SPPO (No Gating): Removes the gating mechanism from the Transformer-XL to assess the importance of selective memory updates.
GTrXL-SPPO (Fixed Reward): Uses fixed weights in the reward function instead of dynamically temperature-controlled softmax weights.

Each variant is trained under identical conditions with the same hyperparameters, with the only difference being the specific component under evaluation. We assess performance using multiple metrics: system utilization, average task waiting time, throughput (tasks/hour), and resource fragmentation.

5.2.2. Results and Analysis

As shown in Table 7, each component contributes significantly to the overall performance of GTrXL-SPPO:

Impact of GTrXL: Removing the GTrXL component results in the most significant performance degradation, with system utilization decreasing by 14.4%, 15.4%, and 17.4% on the three datasets, respectively. This highlights the critical importance of temporal modeling in capturing long-term job dependencies.

Contribution of SE Module: The SE module improves system utilization by 4.4%, 4.4%, and 5.4% across datasets. By dynamically recalibrating feature importance, it enables better adaptation to changing resource conditions.

Role of Gating Mechanism: The gating mechanism improves system utilization by 3.5%, 3.9%, and 5.0%, respectively, enabling selective memory updates and noise filtering.

Dynamic Reward Weighting: Temperature-controlled dynamic weights improve system utilization by 3.2%, 3.6%, and 4.5% compared to fixed weights, allowing adaptive optimization priority adjustment.

5.3. Statistical Stability Analysis

To address the inherent randomness of reinforcement learning and ensure result reliability, we conducted GTrXL-SPPO experiments using 10 different random seeds (42–51) under identical conditions. Each run processed 3000 tasks using the same Alibaba Cluster workload with a 400-node configuration (Figure 11 and Table 8).

The statistical analysis demonstrates exceptional stability with low coefficients of variation: system utilization (CV = 2.5%), average waiting time (CV = 5.1%), and throughput (CV = 3.6%). Paired t-tests show statistically significant improvements (p < 0.001) with effect sizes exceeding Cohen’s d = 0.8, confirming substantial practical significance.

5.4. Runtime Performance Evaluation

To validate practical deployability in HPC environments, we conducted comprehensive runtime performance evaluation focusing on decision latency and computational overhead using PyTorch implementation on NVIDIA GPU hardware.

GTrXL-SPPO exhibits 9.0 ms mean decision latency with low variance (

σ

= 0.2 ms), indicating stable performance. Through optimization strategies including 16-bit precision, gradient checkpointing, and attention pruning, we achieved 24% latency reduction and 25% memory decrease while maintaining algorithmic benefits (Table 9).

5.5. Parameter Sensitivity Analysis

We conducted comprehensive sensitivity analysis on two critical hyperparameters: pressure regularization coefficient

α_{reg}

and temperature parameter T in dynamic reward weighting.

5.5.1. Pressure Regularization Coefficient ( $α_{reg}$ )

The optimal value

α_{reg} = 0.5

provides the best balance between queue management and scheduling efficiency (Table 10), achieving peak system utilization (87.5%) and minimum waiting time (2458 s).

5.5.2. Temperature Parameter (T)

T = 1.0 provides optimal balance between objective differentiation and integration, achieving the best overall performance across all metrics (Table 11).

5.6. Comparative Analysis of Attention Mechanisms

We evaluated our SECT module against two mainstream alternatives: CBAM [62] and ECA-Net [63].

SECT consistently outperforms alternatives with 2.3–3.6% higher utilization and superior computational efficiency (Table 12), validating its specialized design for HPC scheduling tasks.

5.7. Comparison with Optimization-Based Methods

We evaluated GTrXL-SPPO against Mixed-Integer Programming (MIP) and Constraint Programming (CP) solvers on two scales (Table 13):

GTrXL-SPPO maintains competitive performance at small scales (within 3% of optimal) and significantly outperforms optimization methods at large scales where traditional solvers encounter timeout issues.

5.8. Computational Complexity Analysis

We conducted comprehensive complexity analysis combining theoretical bounds with empirical measurements across multiple operational scales.

Despite quadratic complexity, GTrXL-SPPO demonstrates superior per-task efficiency at large scales (7.4 ms vs. PPO’s 15.6 ms), validating its practical scalability for production HPC environments (Table 14).

5.9. Practical Impact Analysis

To quantify the economic significance of performance improvements, we analyzed the practical impact across different HPC scales based on our measured 25-s wait time reduction and 5.4% utilization improvement (Table 15).

The analysis reveals significant scale amplification effects, where modest per-task improvements accumulate to substantial annual benefits: 57,031 h of time savings equivalent to 35.7 full-time research teams, with total economic impact of USD 11.6 million annually across different HPC scales.

6. Results and Discussion

In this work, we have successfully achieved our objective of developing a more efficient and robust scheduling framework for HPC environments. The proposed GTrXL-SPPO framework effectively addresses the challenges of high-dimensional decision spaces and complex temporal dependencies in dynamic scheduling environments, as demonstrated by our comprehensive experiments across multiple datasets.

The experimental results confirm that the two core components of GTrXL-SPPO—the GTrXL backbone and the SE module—provide complementary benefits across heterogeneous HPC workloads. These advantages are evident in both scientific (ANL) and industrial (Alibaba) scenarios, demonstrating the framework’s applicability to diverse real-world environments. The GTrXL component enhances long-horizon scheduling by effectively capturing temporal execution signals, while the SE module adaptively reweights features to improve responsiveness under resource variability. Together, these modules improve throughput, reduce turnaround time, and enhance resource utilization, as validated across datasets in Section 5.2. Overall, the results demonstrate the robustness and generalizability of our design.

7. Conclusions and Future Work

7.1. Results and Discussion

The proposed GTrXL-SPPO framework successfully addresses the challenges posed by high-dimensional decision spaces and complex temporal dependencies in dynamic high-performance computing scheduling environments. Our comprehensive experiments on multiple datasets demonstrate that the two core components—the GTrXL backbone network and the SECT module—offer certain advantages under heterogeneous workloads.

The GTrXL component enhances long-time-domain scheduling performance by capturing temporal execution signals, while the SECT module improves responsiveness under resource fluctuations. These two innovations collectively achieve quantifiable improvements in throughput, turnaround time, and resource utilization, as validated in Section 5.2.

The framework demonstrates robust performance across diverse workload characteristics, ranging from scientific simulations to cloud computing tasks.

7.2. Limitations and Future Work

Although GTrXL-SPPO achieves significant improvements, we acknowledge the following limitations, which provide directions for future research:

Current Limitations

Manual configuration requirements: Our reward mechanism relies on manually tuned weights and parameters, which poses portability challenges across different HPC platforms.

Limited system-specific constraints: The framework does not directly support production environment requirements, such as task deadlines, energy budgets, or thermal constraints.

Fairness considerations: During sustained high loads, non-preemptive designs may prioritize shorter tasks, causing long-running or resource-intensive tasks to starve.

7.3. Future Research Directions

Automatic configuration: Develop meta-learning methods to automatically adjust reward weights and quickly adapt to new environments through transfer learning.

Constraint-aware framework: Integrate multi-objective optimization to handle concurrent constraints and integrate real-time monitoring systems.

Fairness mechanisms: Refer to the research by Duque et al. [64] to explore fairness reward design and adaptive intervention strategies that consider aging factors to achieve resource reallocation.

These research directions aim to enhance the framework’s practical deployment capabilities while retaining its attention-based optimization advantages.

7.4. Broader Impact

GTrXL-SPPO represents a significant advancement in applying machine learning to infrastructure optimization. The framework’s ability to handle complex temporal dependencies goes beyond high-performance computing scheduling and can be scaled to other resource allocation domains. With the proposed improvements, we expect to provide more robust and fair scheduling solutions for modern computing infrastructure.

Author Contributions

Conceptualization, X.G. and H.D.; Methodology, X.G. and H.D.; Software, H.D.; Validation, Y.W. and X.Y.; Formal analysis, X.G.; Investigation, H.D. and L.Z.; Resources, X.G.; Data curation, H.D.; Writing—original draft, H.D.; Writing—review and editing, X.G., L.Z. and Z.L.; Visualization, Y.W.; Supervision, X.G.; Project administration, X.G.; Funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China (Grant No. 2023YFB3002300, Project: Cross-domain Application Collaborative Operation Technology), administered by the Ministry of Science and Technology of China. The project is led by the Computer Network Information Center, Chinese Academy of Sciences, in collaboration with Zhengzhou University and the National Supercomputing Center in Zhengzhou.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the National Supercomputing Center in Zhengzhou for their support and collaboration.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HPC	High-Performance Computing
PPO	Proximal Policy Optimization
FCFS	First-Come First-Served
SJF	Shortest Job First
SE	Squeeze-and-Excitation
SLURM	Simple Linux Utility for Resource Management
GTrXL	Gated Transformer-XL
DRL	Deep Reinforcement Learning
CBAM	Convolutional Block Attention Module
ECA	Efficient Channel Attention
MIP	Mixed-Integer Programming
CP	Constraint Programming
MDP	Markov Decision Process
MDPs	Markov Decision Processes
DDP	Distributed Data Parallel

Appendix A. Detailed Experimental Configuration Parameters

To ensure complete reproducibility and consistency with the main text, this appendix provides comprehensive experimental configuration parameters for the GTrXL-SPPO framework. All parameter settings are based on the experimental setup described in the main paper.

Appendix A.1. Model Architecture Configuration

Table A1. Complete GTrXL-SPPO architecture parameter configuration.

Component	Configuration Parameters
GTrXL Transformer Core Architecture
Number of Layers	12
Number of Attention Heads	12
Hidden Dimension	768
Feed-Forward Network Dimension	3072 (4 × hidden_dim)
Relative Position Encoding	$R_{i, j} = \frac{1}{\sqrt{d}} cos ((i - j) \cdot ω)$ , $ω$ = 1/10,000
Layer Normalization	Pre-normalization, $ϵ_{norm} = 1 \times 10^{- 12}$
Dropout Rate	0.1
Memory Length	512
SECT Attention Module
Reduction Ratio	16
Squeeze Activation Function	ReLU
Excitation Activation Function	Sigmoid
Global Pooling Method	Adaptive Average Pooling
Dual-path Fusion	$z_{fusion} = \frac{1}{2} (mean (x) + \max (x))$
Gating Mechanism (GRU-style)
Gate Type	Dual-layer GRU gating (reset + update gates)
Reset Gate Initialization	Xavier Uniform Initialization
Update Gate Initialization	Xavier Uniform Initialization
Bias Initialization	Zero Initialization
Gate Bias $b_{g}$	0.1
Actor-Critic Network Structure
Actor Network Layers	[768, 512, 256, action_dimension]
Critic Network Layers	[768, 512, 256, 1]
Hidden Layer Activation	ReLU
Output Layer Activation	Tanh (Actor), Linear (Critic)

Appendix A.2. Training Hyperparameter Configuration

Table A2. Complete training hyperparameter settings.

Parameter Category	Specific Configuration
Training Process Parameters
Total Training Episodes	5000 episodes (per dataset)
Batch Size	128 sequences
Sequence Length	Dynamically adjusted based on sliding window size
Experience Buffer Size	50,000 transitions
Training Frequency	Update every 4 episodes
Evaluation Frequency	Evaluate every 100 training episodes
Warm-up Period	500 training episodes
Random Seed Range	42–71 (30 independent runs)
PPO Reinforcement Learning Parameters
Learning Rate (Actor)	$1 \times 10^{- 4}$
Learning Rate (Critic)	$1 \times 10^{- 3}$
Initial Learning Rate Decay	$5 \times 10^{- 5}$ (linear decay)
Learning Rate Scheduling	Cosine annealing with warm restarts
Clipping Ratio $ϵ$	0.2
Value Function Coefficient	0.5
Entropy Coefficient	0.01
GAE $λ$	0.95
Discount Factor $γ$	0.99
Time-decaying Discount	$γ_{t} = 1 - \frac{t}{T}$
PPO Epochs per Update	4
Gradient Clipping Max Norm	0.5
Optimizer Configuration
Optimizer Type	Adam
Beta1	0.9
Beta2	0.999
Weight Decay	$1 \times 10^{- 5}$
Epsilon	$1 \times 10^{- 8}$
1-bit Adam Compression	${\tilde{g}}_{t} = sign (g_{t}) ⊙ \sqrt{\| g_{t} \|}$
Reward Function Parameters
Pressure Regularization Coefficient $α_{reg}$	0.5
Temperature Parameter T	1.0
Adaptive Weight Calculation	$ω_{k}^{(T)} = \frac{exp (ω_{k} / T)}{\sum_{j = 1}^{4} exp (ω_{j} / T)}$
Huber Loss Threshold	Standard setting

Appendix A.3. Computational Environment Configuration

Table A3. Computational environment and software version specifications.

Component Category	Specific Specifications
Hardware Configuration
GPU Cluster	NVIDIA V100 (4 × 32 GB VRAM)
CPU Processor	Intel Xeon Series
System Memory	128 GB DDR4
Storage System	High-speed parallel storage
Network Interconnect	High-speed InfiniBand
Software Environment
Operating System	Ubuntu 20.04.6 LTS
Python Version	3.9.7
PyTorch Version	1.12.0
CUDA Version	11.6
cuDNN Version	8.4.1
NumPy Version	1.21.5
OpenAI Gym Version	0.21.0
Simulation Platform	CQSim (CQGym)
Distributed Training Configuration
Multi-GPU Training Framework	PyTorch Distributed Data Parallel (DDP)
Number of GPUs	4 NVIDIA V100
Data Sharding Strategy	$D_{k} = {(s_{t}^{i}, a_{t}^{i}, r_{t}^{i}) ∣ i mod N = k}$
Gradient Synchronization	AllReduce operation

Appendix A.4. Reproducibility Checklist

To ensure complete reproducibility, we provide the following checklist:

Code Implementation: Implemented in PyTorch 1.12.0 using CQGym simulation platform
Dataset Access: ANL-Intrepid, Alibaba, SDSC-SP2, PIK-IPLEX public datasets
Randomness Control: 30 independent seeds (42–71), deterministic CUDA operations
Hardware Environment: NVIDIA V100 GPU cluster specifications clearly defined
Hyperparameter Fixed: All training hyperparameters explicitly specified
Evaluation Protocol: Statistical testing methods and confidence interval calculation standardized
Performance Benchmarks: Runtime performance and memory usage clearly recorded

Note: All parameter configurations in this appendix are fully consistent with the main paper text, ensuring reproducibility of experimental results and consistency of parameter settings. Any specific implementation details or parameter adjustments are based on the experimental setup explicitly described in the main paper text.

References

Ben-Nun, T.; Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv. 2019, 52, 65. [Google Scholar] [CrossRef]
Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
Fan, Y.; Li, B.; Favorite, D.; Singh, N.; Childers, T.; Rich, P.; Allcock, W.; Papka, M.E.; Lan, Z. DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 4903–4917. [Google Scholar] [CrossRef]
Hu, H.; Li, Z.; Hu, H.; Chen, J.; Ge, J.; Li, C.; Chang, V. Multi-objective scheduling for scientific workflow in multicloud environment. J. Netw. Comput. Appl. 2018, 114, 108–122. [Google Scholar] [CrossRef]
Marcus, R.; Negi, P.; Mao, H.; Zhang, C.; Alizadeh, M.; Kraska, T.; Papaemmanouil, O.; Tatbul, N. Neo: A learned query optimizer. Proc. VLDB Endow. 2019, 12, 1705–1718. [Google Scholar] [CrossRef]
Narayanan, D.; Santhanam, K.; Kazhamiaka, F.; Phanishayee, A.; Zaharia, M. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), New York, NY, USA, 4–6 November 2020; pp. 481–498. [Google Scholar]
Cheng, M.; Li, J.; Nazarian, S. DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers. In Proceedings of the 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, Republic of Korea, 22–25 January 2018; pp. 129–134. [Google Scholar] [CrossRef]
Barve, M.; Sinha, S.; Hardikar, R.P.; Gunturu, A.; Mallik, W. Workload Characterization in HPC Environment for Auto-scaling of Resources—Preliminary Study. In Proceedings of the 2022 IEEE 19th India Council International Conference (INDICON), Kochi, India, 24–26 November 2022; pp. 1–6. [Google Scholar] [CrossRef]
Jiang, Z.; Luo, C.; Gao, W.; Wang, L.; Zhan, J. HPC AI500 V3.0: A scalable HPC AI benchmarking framework. Benchcouncil Trans. Benchmarks Stand. Eval. 2022, 2, 100083. [Google Scholar] [CrossRef]
Aceituno, J.M.; Guasque, A.; Balbastre, P.; Simó, J.; Crespo, A. Hardware resources contention-aware scheduling of hard real-time multiprocessor systems. J. Syst. Archit. 2021, 118, 102223. [Google Scholar] [CrossRef]
Amaral, M.; Polo, J.; Carrera, D.; Seelam, S.; Steinder, M. Topology-aware GPU scheduling for learning workloads in cloud environments. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, USA, 12–17 November 2017. SC ’17. [Google Scholar] [CrossRef]
Huang, J.; Xiao, C.; Wu, W. RLSK: A Job Scheduler for Federated Kubernetes Clusters based on Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Cloud Engineering (IC2E), Sydney, NSW, Australia, 21–24 April 2020; pp. 116–123. [Google Scholar] [CrossRef]
Gómez, C.; Martınez, F.; Armejach, A.; Moretó, M.; Mantovani, F.; Casas, M. Design Space Exploration of Next-Generation HPC Machines. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 54–65. [Google Scholar] [CrossRef]
Singh, R.M.; Awasthi, L.K.; Sikka, G. Towards Metaheuristic Scheduling Techniques in Cloud and Fog: An Extensive Taxonomic Review. ACM Comput. Surv. 2022, 55, 1–43. [Google Scholar] [CrossRef]
Hwang, E.; Kim, J.S.; Choi, Y.R. Achieving Fairness-Aware Two-Level Scheduling for Heterogeneous Distributed Systems. IEEE Trans. Serv. Comput. 2021, 14, 639–653. [Google Scholar] [CrossRef]
Tirmazi, M.; Barker, A.; Deng, N.; Haque, M.E.; Wilkes, J. Borg: The next generation. In Proceedings of the EuroSys ’20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, 27–30 April 2020. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NIPS’17. pp. 6000–6010. [Google Scholar] [CrossRef]
Ermis, B.; Zappella, G.; Wistuba, M.; Rawal, A.; Archambeau, C. Continual Learning with Transformers for Image Classification. arXiv 2022, arXiv:2206.14085. [Google Scholar] [CrossRef]
Parisotto, E.; Song, H.F.; Rae, J.W.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.M.; Jaderberg, M.; Kaufman, R.L.; Clark, A.; Noury, S.; et al. Stabilizing Transformers for Reinforcement Learning. arXiv 2019, arXiv:1910.06764. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
Wang, Y.; He, H.; Wen, C.; Tan, X. Truly Proximal Policy Optimization. arXiv 2020, arXiv:1903.07940. [Google Scholar] [CrossRef]
Fan, Y.; Lan, Z.; Rich, P.; Allcock, W.; Papka, M.E. Hybrid Workload Scheduling on HPC Systems. In Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, 30 May–3 June 2022; pp. 470–480. [Google Scholar] [CrossRef]
Frachtenberg, E.; Petrini, F.; Fernandez, J.; Pakin, S. STORM: Scalable Resource Management for Large-Scale Parallel Computers. IEEE Trans. Comput. 2006, 55, 1572–1587. [Google Scholar] [CrossRef]
Lin, Z.; Li, C.; Tian, L.; Zhang, B. A scheduling algorithm based on reinforcement learning for heterogeneous environments. Appl. Soft Comput. 2022, 130, 109707. [Google Scholar] [CrossRef]
Lifka, D. The ANL/IBM SP Scheduling System; Argonne National Lab. (ANL): Argonne, IL, USA, 1995. [CrossRef]
Jette, M.; Yoo, A.; Grondona, M. SLURM: Simple Linux Utility for Resource Management; Lawrence Livermore National Lab. (LLNL): Livermore, CA, USA, 2003.
Schwiegelshohn, U.; Yahyapour, R. Fairness in parallel job scheduling. J. Sched. 2000, 3, 297–320. [Google Scholar] [CrossRef]
Tang, W.; Lan, Z.; Desai, N.; Buettner, D. Fault-aware, utility-based job scheduling on Blue, Gene/P systems. In Proceedings of the 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, LA, USA, 31 August–4 September 2009; pp. 1–10. [Google Scholar]
Jackson, D.; Snell, Q.; Clement, M. Core Algorithms of the Maui Scheduler. In Job Scheduling Strategies for Parallel Processing; Feitelson, D.G., Rudolph, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 87–102. [Google Scholar]
Mu’alem, A.; Feitelson, D. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 2001, 12, 529–543. [Google Scholar] [CrossRef]
Mao, H.; Alizadeh, M.; Menache, I.; Kandula, S. Resource Management with Deep Reinforcement Learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, Atlanta, GA, USA, 9–10 November 2016; pp. 50–56. [Google Scholar] [CrossRef]
Mao, H.; Schwarzkopf, M.; Venkatakrishnan, S.B.; Meng, Z.; Alizadeh, M. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication, Beijing, China, 19–23 August 2019; pp. 270–288. [Google Scholar] [CrossRef]
Wong, A.K.; Goscinski, A.M. Evaluating the EASY-backfill job scheduling of static workloads on clusters. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, Austin, TX, USA, 17–20 September 2007; pp. 64–73. [Google Scholar] [CrossRef]
Chadha, M.; John, J.; Gerndt, M. Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling. arXiv 2021, arXiv:2009.08289. [Google Scholar] [CrossRef]
Fan, Y.; Lan, Z. DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling. Softw. Impacts 2021, 8, 100077. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; ICLR: San Juan, Puerto Rico, 2016. [Google Scholar]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
Espeholt, L.; Marinier, R.; Stanczyk, P.; Wang, K.; Michalski, M. SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. arXiv 2020, arXiv:1910.06591. [Google Scholar] [CrossRef]
Lei, Y.; Deng, Q.; Liao, M.; Gao, S. Deep reinforcement learning for dynamic distributed job shop scheduling problem with transfers. Expert Syst. Appl. 2024, 251, 123970. [Google Scholar] [CrossRef]
Li, M.; Wang, Z.; Li, K.; Liao, X.; Hone, K.; Liu, X. Task Allocation on Layered Multiagent Systems: When Evolutionary Many-Objective Optimization Meets Deep Q-Learning. IEEE Trans. Evol. Comput. 2021, 25, 842–855. [Google Scholar] [CrossRef]
Nedic, A.; Ozdaglar, A. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Trans. Autom. Control 2009, 54, 48–61. [Google Scholar] [CrossRef]
Gu, Y.; Liu, Z.; Dai, S.; Liu, C.; Wang, Y.; Wang, S. Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review. arXiv 2025, arXiv:2501.01007. [Google Scholar] [CrossRef]
Liu, H.; Zhou, X.; Gao, K.; Ju, Y. An integrated optimization method to task scheduling and VM placement for green datacenters. Simul. Model. Pract. Theory 2024, 135, 102962. [Google Scholar] [CrossRef]
He, H.; Gu, Y.; Liu, Q.; Wu, H.; Cheng, L. Job Scheduling in Hybrid Clouds with Privacy Constraints: A Deep Reinforcement Learning Approach. Concurr. Comput. Pract. Exp. 2025, 37, 1–17. [Google Scholar] [CrossRef]
Han, K.; Xiao, A.; Wu, E.; Guo, J.; XU, C.; Wang, Y. Transformer in Transformer. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 15908–15919. [Google Scholar]
Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache Spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar] [CrossRef]
Deng, K.; He, Z.; Zhang, H.; Lin, H.; Wang, D. TSNet-SAC: Leveraging Transformers for Efficient Task Scheduling. arXiv 2023, arXiv:2307.07445. [Google Scholar] [CrossRef]
Lee, J.; Kee, S.; Janakiram, M.; Runger, G. Attention-based Reinforcement Learning for Combinatorial Optimization: Application to Job Shop Scheduling Problem. arXiv 2024, arXiv:2401.16580. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
El-Mahdy, M.; Sakr, N.; Carrasco, R. Improving Mixed-Criticality Scheduling with Reinforcement Learning. arXiv 2025, arXiv:2504.03994. [Google Scholar] [CrossRef]
Li, S.; Zhao, Y.; Varma, R.; Salpekar, O.; Noordhuis, P.; Li, T.; Paszke, A.; Smith, J.; Vaughan, B.; Damania, P.; et al. PyTorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow. 2020, 13, 3005–3018. [Google Scholar] [CrossRef]
Tang, H.; Gan, S.; Awan, A.A.; Rajbhandari, S.; Li, C.; Lian, X.; Liu, J.; Zhang, C.; He, Y. 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
Peng, Y.; Bao, Y.; Chen, Y.; Wu, C.; Guo, C. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017. [Google Scholar]
Argonne National Laboratory. ANL-Intrepid Workload Archive. 2010. Available online: https://www.cs.huji.ac.il/labs/parallel/workload/l_anl_int/ (accessed on 1 August 2024).
Feitelson, D.G.; Tsafrir, D.; Krakov, D. Experience with using the Parallel Workloads Archive. J. Parallel Distrib. Comput. 2014, 74, 2967–2982. [Google Scholar] [CrossRef]
Maurer, T.; Avanzi, F.; Oroza, C.A.; Glaser, S.D.; Conklin, M.; Bales, R.C. Optimizing spatial distribution of watershed-scale hydrologic models using Gaussian Mixture Models. Environ. Model. Softw. 2021, 142, 105076. [Google Scholar] [CrossRef]
Li, B.; Fan, Y.; Dearing, M.; Lan, Z.; Rich, P.; Allcock, W. MRSch: Multi-Resource Scheduling for HPC. In Proceedings of the 2022 IEEE International Conference on Cluster Computing (CLUSTER), Heidelberg, Germany, 5–8 September 2022; pp. 47–57. [Google Scholar] [CrossRef]
Li, J.; Liu, J.; Fan, Y.; Lan, Z. CQGym: A Reinforcement Learning Environment for Cluster Scheduling. arXiv 2021, arXiv:2111.00000. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Duque, R.; Arbelaez, A.; Díaz, J.F. Online over time processing of combinatorial problems. Constraints 2018, 23, 310–334. [Google Scholar] [CrossRef]

Figure 1. GTrXL-SPPO architecture and information flow.

Figure 2. GTrXL-SPPO model architecture. The network processes state representations through a channel fusion module and SE layers, with GTrXL blocks for temporal feature updates. The actor and critic branches share this backbone, diverging at the output layers for action logits and value estimations.

Figure 3. Architecture of the dynamic memory mechanism module. This system manages the flow of scheduling-related temporal information and adjusts update operations to reflect changes in environment dynamics.

Figure 4. Sliding window mechanism architecture.

Figure 5. Job resource demand distributions across workloads.

Figure 6. Job arrival and processing time distributions across workloads.

Figure 7. Total reward achieved on different workloads.

Figure 8. Performance metrics of GTrXL-SPPO. (a) System utilization trends showing resource usage efficiency over time on train data and test data. (b) Job waiting time distribution comparing training and testing phases across different scheduling approaches.

Figure 9. Performance comparison of GTrXL-SPPO and PPO at different stages on the simulated dataset.

Figure 10. Testing phase performance comparison of GTrXL-SPPO vs. PPO across real-world workload datasets. Note: This evaluation focuses specifically on testing phase performance to assess model generalization on unseen data.

Figure 11. Statistical stability analysis of GTrXL-SPPO across 10 independent runs with different random seeds. The red dashed line indicates the mean, while the shaded region represents the 95% confidence interval.

Table 1. Comparative analysis of cluster scheduling methods across key features.

Characteristics	Rule [26,27]	Priority [28,29]	Backfilling [30,31]	RL [32,33]	GTrXL-SPPO
Long-term Effects	×	×	×	✔	✔
Auto Policy Tuning	×	×	×	✔	✔
Dynamic Resource Priority	×	×	✔	×	✔
Temporal Relation Handling	×	×	×	×	✔

Table 2. Comparison of mainstream DRL algorithms for HPC scheduling.

Features	DQL [37]	PG [38]	PPO [21]	SAC [39]	A3C [40]
Sample Efficiency	×	✔	✔	✔	×
Discrete Action Support	✔	✔	✔	×	✔
Training Stability	✔	×	✔	✔	×
Tuning Ease	✔	×	×	✔	×
Parallel Support	×	×	✔	×	✔
Compute Overhead	Low	Medium	Medium	High	Medium

Table 3. Example job queue.

Job	Req. Nodes	Req. Time	Priority	Waiting Time	Estimated Exec Time
A	1	30	2	10	30
B	2	20	1	5	20
C	1	40	3	15	40

Table 4. Resource state.

Node	Availability	Remaining Time
1	Free	0
2	Busy	10

Table 5. Comparison of experimental dataset features.

Dataset	Job Size (Tasks)	Max Nodes	Max Procs	Sys Arch	Load Type
Alibaba	80.5 k	300	300	Gen Clust	Online Svc Intense
ANL-Intrepid	68.9 k	40,960	163,840	Mesh	Sci Comp Intense
SDSC-SP2	73.4 k	128	128	Homo Clust	Batch Proc Intense
PIK-IPLEX	74.3 k	320	2560	Hetero Clust	Mixed Load

1. Abbreviations: Sys Arch: System Architecture; Load Type: Load Characteristics; Gen Clust: General Cluster; Online Svc Intense: Online Service Intensive; Sci Comp Intens; Scientific Computing Intensive; Homo Clust: Homogeneous Cluster; Batch Proc Intense: Batch Processing Intensive; Hetero Clust: Heterogeneous Cluster; 2. Computational scale: Data are derived from the configuration files of each dataset, formatted as “Number of Nodes/Number of Processors.”; 3. ANL-Intrepid uses a mesh architecture with 4 cores per node, suitable for large-scale scientific computing; 4. PIK-IPLEX supports heterogeneous resource scheduling with an average of 8 cores per node; 5. The Alibaba dataset includes configurations with both 300 and 200 nodes in different versions.

Table 6. Parameter comparison of GTrXL-SPPO under multiple workloads.

Dataset	Input	Output	Actor Output Layer Parameters	Transformer Parameters	Other Parameters	Total Trainable Parameters
Alibabav2017	[2,200]	801	461,376	67,723,776	68,341,480	136,526,632
ANL-Intrepid	[2,150]	40,961	23,593,536	67,723,776	68,381,640	159,698,952
SDSC-SP2	[2,50]	129	74,304	67,723,776	68,340,808	136,138,888
PIK-IPLEX	[2,75]	321	184,896	67,723,776	68,341,000	136,249,672

Dynamic output dimension adaptation to node scale (e.g., from 100 nodes to 101).

Table 7. Ablation study results across three datasets.

Model Variant	ANL-Intrepid	SDSC-SP2	Alibaba	Average Improvement
	Util. (%)	Util. (%)	Util. (%)	over Baseline (%)
GTrXL-SPPO (Full)	87.5	85.6	83.2	-
GTrXL-PPO (No SE)	83.1	81.2	77.8	−4.7
SPPO (No GTrXL)	73.1	70.2	65.8	−15.7
TrXL-SPPO (No Gating)	84.0	81.7	78.2	−4.2
GTrXL-SPPO (Fixed Reward)	84.3	82.0	78.7	−3.8

Table 8. Statistical performance metrics of GTrXL-SPPO Across 10 independent runs.

Metric	Mean	Standard Deviation	95% Confidence Interval
System Utilization (%)	84.6	2.1	[82.9, 86.3]
Average Wait Time (s)	2850	145	[2741, 2959]
Throughput (tasks/hour)	135.2	4.8	[132.1, 138.3]
Cumulative Reward	0.648	0.028	[0.632, 0.664]

Table 9. Runtime performance comparison for HPC scheduling algorithms.

Algorithm	Decision Latency (ms)	Memory Usage (MB)	Throughput (Decisions/s)	Scale Limit (Nodes)
PPO	6.2 ± 0.8	578	161	8000
GTrXL-SPPO	9.0 ± 0.2	1049	111	5000
GTrXL-SPPO (Optimized)	6.8 ± 0.3	785	147	8000

Table 10. Effect of

α_{reg}

on system performance (ANL-Intrepid dataset).

Table 10. Effect of

α_{reg}

on system performance (ANL-Intrepid dataset).

$α_{reg}$ Value	System Utilization (%)	Avg. Wait Time (s)	Throughput (Jobs/h)	Queue Length (Avg.)
0.0	85.1	2532	139.7	42.3
0.1	86.2	2495	141.1	35.8
0.5	87.5	2458	142.5	32.1
1.0	86.6	2483	141.8	30.5
2.0	85.0	2545	139.7	29.7

Table 11. Effect of temperature parameter T on system performance (ANL-Intrepid dataset).

T Value	System Utilization (%)	Avg. Wait Time (s)	Throughput (Jobs/h)	Reward Balance Index
0.1	83.1	2705	135.4	0.28
0.5	86.2	2495	141.1	0.62
1.0	87.5	2458	142.5	0.81
2.0	85.8	2520	140.4	0.93
5.0	82.3	2728	134.0	0.97

Table 12. Performance comparison of different attention mechanisms.

Attention Mechanism	ANL Util. (%)	Alibaba Util. (%)	Avg. Improvement	Training Time (h)	Efficiency Rank
SECT (Ours)	87.5	83.2	-	5.2/7.3	1
CBAM	84.9	79.6	-3.2%	6.8/9.1	3
ECA-Net	85.2	80.3	-2.8%	5.5/7.8	2

Table 13. Performance comparison with optimization methods.

Method	Utilization (%)	Wait Time (s)	Makespan (h)	Optimality Gap (%)	Status
Small-scale (100 jobs, 128 nodes)
MIP (Gurobi)	89.8	2398	18.1	0.0	Optimal
CP (OR-Tools)	89.7	2464	18.5	0.8	Near-optimal
GTrXL-SPPO	87.5	2458	18.7	2.0	Complete
Large-scale (3000 jobs, 600 nodes)
MIP (Gurobi)	74.9	3596	28.2	19.5	Timeout
CP (OR-Tools)	83.7	3262	26.3	10.1	Timeout
GTrXL-SPPO	85.8	2508	19.2	4.0	Complete

Table 14. Multi-scale computational complexity comparison.

Method	Time Complexity	Small Scale (100)	Medium Scale (1000)	Large Scale (5000)
		Time per Task (ms)	Time per Task (ms)	Time per Task (ms)
FCFS	$O (n)$	24.5	3.6	1.5
PPO	$O (n \times d)$	26.7	17.7	15.6
GTrXL-SPPO	$O (n^{2} \times d)$	71.3	18.8	7.4

Table 15. Annual impact of performance improvements on different HPC scales.

HPC Environment	Time Saved (Hours/Year)	Cost Savings (USD)	ROI Multiple
Small University (100 nodes)	1267	$72,857	0.73×
Medium Research (400 nodes)	5069	$485,714	1.21×
Large Enterprise (1000 nodes)	12,674	$1,457,141	1.46×
National Supercomputer (5000 nodes)	38,021	$9,612,883	1.92×
Total Impact	57,031	$11,628,595	1.33×

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gao, X.; Dong, H.; Zhang, L.; Wang, Y.; Yang, X.; Li, Z. Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO. Appl. Sci. 2025, 15, 8928. https://doi.org/10.3390/app15168928

AMA Style

Gao X, Dong H, Zhang L, Wang Y, Yang X, Li Z. Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO. Applied Sciences. 2025; 15(16):8928. https://doi.org/10.3390/app15168928

Chicago/Turabian Style

Gao, Xu, Hang Dong, Lianji Zhang, Yibo Wang, Xianliang Yang, and Zhenyu Li. 2025. "Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO" Applied Sciences 15, no. 16: 8928. https://doi.org/10.3390/app15168928

APA Style

Gao, X., Dong, H., Zhang, L., Wang, Y., Yang, X., & Li, Z. (2025). Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO. Applied Sciences, 15(16), 8928. https://doi.org/10.3390/app15168928

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO

Abstract

1. Introduction

2. Related Work

2.1. Traditional High-Performance Computing Task Scheduling

Case Study Analysis

2.2. Deep Reinforcement Learning

2.3. Application of Transformer Mechanisms in Scheduling

3. Methodology

3.1. Problem Statement

3.1.1. MDP Formulation

3.1.2. State and Action Space

3.1.3. Decision Variables and Constraints

3.2. Neural Architecture Design

3.3. Multi-Scale Feature Enhancement

3.4. Dynamic Memory Mechanism

3.4.1. GTrXL: Enhanced Memory Architecture

3.4.2. Temporal Dependency Modeling

3.4.3. Implementation Details and Training Integration

3.5. Training Strategy

3.5.1. Algorithm Design

3.5.2. Constraint Handling in Training

3.5.3. Reward Function Design

3.6. Training Acceleration Strategies

3.7. Scalability and Deployment Considerations

3.7.1. Node Scaling Analysis

3.7.2. Heterogeneous Adaptation

3.7.3. Architecture Migration

3.8. Demonstrative Iteration of the Proposed Algorithm

4. Experiments Setup

4.1. Experimental Configuration and Reproducibility

4.1.1. Dataset Configuration and Partitioning

4.1.2. Model and Training Configuration

4.1.3. Evaluation Protocol

4.2. Baseline Algorithms

4.3. Workload Trajectories

4.4. GTrXL-SPPO Training

4.5. Evaluation Metrics

5. Case Study

5.1. System Performance Evaluation

5.1.1. System Utilization and Waiting Time

5.1.2. Overall Performance Evaluation

5.1.3. Comparative Analysis with PPO

5.2. Ablation Study and Component Analysis

5.2.1. Experimental Design

5.2.2. Results and Analysis

5.3. Statistical Stability Analysis

5.4. Runtime Performance Evaluation

5.5. Parameter Sensitivity Analysis

5.5.1. Pressure Regularization Coefficient ( α reg )

5.5.2. Temperature Parameter (T)

5.6. Comparative Analysis of Attention Mechanisms

5.7. Comparison with Optimization-Based Methods

5.8. Computational Complexity Analysis

5.9. Practical Impact Analysis

6. Results and Discussion

7. Conclusions and Future Work

7.1. Results and Discussion

7.2. Limitations and Future Work

Current Limitations

7.3. Future Research Directions

7.4. Broader Impact

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A. Detailed Experimental Configuration Parameters

Appendix A.1. Model Architecture Configuration

Appendix A.2. Training Hyperparameter Configuration

Appendix A.3. Computational Environment Configuration

Appendix A.4. Reproducibility Checklist

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

5.5.1. Pressure Regularization Coefficient ( $α_{reg}$ )