Article

Self-Attention Mechanisms in HPC Job Scheduling: A Novel Framework Combining Gated Transformers and Enhanced PPO

1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2 National Supercomputing Center in Zhengzhou, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 8928; https://doi.org/10.3390/app15168928
Submission received: 9 July 2025 / Revised: 7 August 2025 / Accepted: 8 August 2025 / Published: 13 August 2025

Abstract

In HPC systems, job scheduling plays a critical role in determining resource allocation and task execution order. With the continuous expansion of computing scale and increasing system complexity, modern HPC scheduling faces two major challenges: a massive decision space spanning tens of thousands of computing nodes and large job queues, and complex temporal dependencies between jobs and dynamically changing resource states. Traditional heuristic algorithms and basic reinforcement learning methods often struggle to address these challenges effectively in dynamic HPC environments. This study proposes a novel scheduling framework that combines GTrXL with PPO, achieving significant performance improvements through multiple technical innovations. The framework leverages the sequence modeling capabilities of the Transformer architecture and selectively filters relevant historical scheduling information through a dual-gate mechanism, improving long-sequence modeling efficiency compared to standard Transformers. The proposed SECT module further enhances resource awareness through dynamic feature recalibration, achieving improved system utilization compared to similar attention mechanisms. Experimental results on multiple datasets (ANL-Intrepid, Alibaba, SDSC-SP2) demonstrate that the proposed components achieve significant performance improvements over baseline PPO implementations. Comprehensive evaluations on synthetic workloads and real HPC trace data show improvements in resource utilization and waiting time, particularly under high-load conditions, while maintaining good robustness across various cluster configurations.

1. Introduction

With the acceleration of digital transformation and the widespread adoption of computationally intensive applications in fields such as scientific research, industrial engineering, and business analytics [1], scheduling problems in HPC environments have become increasingly complex and face higher performance requirements [2]. In modern HPC systems, coordinating thousands of nodes and managing large-scale task submissions across diverse objectives and resource constraints presents substantial challenges [3,4]. However, traditional scheduling methods typically rely on static heuristic algorithms and fixed assumptions, making them ill suited to the inherent dynamism and uncertainty of real-world HPC workloads [5,6].
Increasing workload heterogeneity and temporal complexity exacerbate job scheduling difficulty, particularly due to temporal relationships between jobs and resources [7,8]. These relationships include: (i) execution sequence constraints, where the output of one job must be used by another job, such as variant detection in genomics requiring prior sequence alignment results; (ii) runtime trigger conditions, where a task can only be scheduled after other tasks are completed or specific conditions are met, such as in multi-stage engineering simulations [9]; and (iii) resource contention, where concurrent requests for scarce resources (e.g., GPUs or bandwidth) may lead to latency spikes or idle time. Research indicates that unmanaged contention can reduce system throughput by up to 40% [10].
This challenge is further exacerbated under diverse workloads. Computing-intensive tasks require CPU/GPU throughput, while data-intensive tasks require high I/O bandwidth [11,12]. Modern architectures employ hierarchical storage and dedicated interconnect technologies, further complicating the scheduling process [13].
Traditional scheduling algorithms have well-known shortcomings: FCFS often leads to low resource utilization [14], while SJF may cause starvation of long jobs [15]. Additionally, rule-based and priority-based schemes typically exhibit limited adaptability and high maintenance complexity [16].
To address these challenges, we propose GTrXL-SPPO, a unified framework that combines the Gated Transformer-XL (GTrXL) with Proximal Policy Optimization (PPO). GTrXL captures long-term sequence patterns via multi-head attention mechanisms [17], while GRU-inspired gating mechanisms enhance the retention of essential scheduling signals [18,19]. Additionally, the SE module [20] further enhances the model’s ability to focus on critical scheduling information by adaptively reweighting input features.
In our model, we employ PPO for policy learning [21], leveraging its truncated objective function to ensure stable training in volatile HPC environments. PPO demonstrates superior sampling efficiency and convergence in high-dimensional control tasks [22], and we further incorporate adaptive learning rates to enhance responsiveness to workload fluctuations.
Evaluations on synthetic and real-world datasets (including ANL-Intrepid, Alibaba, and SDSC-SP2) show that our model consistently outperforms traditional methods and standard PPO in terms of throughput, latency, and resource utilization. Additionally, GTrXL-SPPO demonstrates strong generalization capabilities across different system architectures and workload types.
The key contributions of this paper include:
  • We design a dual-channel policy network based on GTrXL to capture long-term job sequences.
  • We propose a lightweight SE layer for dynamic resource reweighting in scheduling-sensitive environments.
  • We develop a reinforcement learning framework based on PPO, incorporating exponential clipping and memory replay to achieve better convergence.
  • We validate our model on large-scale HPC task sequences, demonstrating performance superior to existing baselines.
The remainder of this paper is organized as follows: Section 2 describes the background and related work in HPC job scheduling. Section 3 introduces the proposed GTrXL-SPPO framework, including its architectural design and training methods. Section 4 details the experimental setup and evaluation metrics. Section 5 presents the experimental results and comparative analysis, and Section 6 discusses the findings. Finally, Section 7 summarizes the paper and outlines potential future work.

2. Related Work

2.1. Traditional High-Performance Computing Task Scheduling

Traditional high-performance computing (HPC) job scheduling methods, such as rule-based policies and heuristic algorithms, have long been used to coordinate resources and balance workloads. These methods typically rely on fixed strategies manually designed for specific environments. However, the increasing heterogeneity of modern HPC cluster architectures and the growing diversity of workloads pose substantial challenges to these static strategies. In particular, they often struggle to manage execution-time dependencies and balance critical performance objectives such as energy efficiency, latency, and throughput [23,24,25]. Table 1 summarizes the inherent limitations of representative baseline methods, highlighting the necessity of adaptive, learning-based scheduling methods.
In addition to rule-based techniques, previous studies have explored various methods, including feedback control theory, queuing models, and evolutionary optimization algorithms (e.g., genetic algorithm-based schedulers). While these methods are effective in constrained or static environments, they typically struggle to demonstrate the required flexibility and scalability under the real-time constraints of dynamic, large-scale HPC systems.

Case Study Analysis

In earlier versions, Google’s Borg system primarily used a first-come, first-served (FCFS) policy. However, under fluctuating workloads, this policy proved insufficient to ensure the reliable restart of service tasks and batch tasks.
Similarly, traditional batch scheduling frameworks, such as Moab/Maui, LoadLeveler, LSF, PBS, and Sun Grid Engine (SGE), typically rely on space-sharing mechanisms and a two-level queue architecture. These systems sort job queues based on submission time and allocate tasks to idle nodes using static resource matching strategies. However, such static heuristic algorithms often result in poor system performance, exacerbated resource fragmentation, and reduced predictability of performance metrics [34].
For example, PBS Pro often experiences task processing delays when faced with a surge of high-priority tasks, demonstrating the rigidity of traditional heuristic algorithms in real-time decision-making. Among these traditional schedulers, SLURM [27] is one of the most widely deployed systems in modern HPC environments, including deployments at LLNL, NERSC, and TACC. It adopts a modular, plugin-driven design, with components for priority calculation, scheduling, and job submission, and primarily allocates computing nodes through a queue-based space-sharing mechanism [35].
Although SLURM offers scalability and integration flexibility, it inherently relies on static heuristic logic. Under dynamic workloads, especially when rigid filling strategies conflict with actual runtime task patterns [3], this often leads to low resource utilization, task starvation, and task startup delays. These challenges highlight the urgent need for adaptive scheduling strategies.
To address this issue, deep reinforcement learning (DRL) methods such as Proximal Policy Optimization (PPO) and the GTrXL-SPPO framework proposed in this paper have emerged as promising alternatives. These methods achieve policy adaptation through continuous interaction with the environment and demonstrate superior performance in managing the long-term scheduling complexity of diverse and volatile workloads [36].

2.2. Deep Reinforcement Learning

Unlike continuous control tasks (such as robot control), high-performance computing (HPC) job scheduling inherently involves discrete decisions: assigning specific jobs to specific computing nodes or deferring scheduling decisions. Therefore, our GTrXL-SPPO framework adopts a discrete action space, where each action corresponds to a specific job-to-node assignment or waiting decision. The size of the action space varies with the number of available nodes plus one waiting action, as shown in our implementation on an 800-node cluster (Section 4.4).
As shown in Table 2, PPO has become a widely adopted baseline due to its robust performance and stable training characteristics. By constraining policy updates with a clipped surrogate objective, PPO mitigates the instability issues common in other policy gradient methods.
DRL-based schedulers can eliminate the need for manually designed heuristic methods by continuously interacting with the environment to learn optimal decision strategies [3,41]. Algorithms such as PPO, A2C, and DQL have demonstrated significant effectiveness in reducing task delays and improving resource utilization [42,43] and have been extended to multi-objective optimization scenarios. A recent review paper [44] points out that modeling long-term temporal relationships and rewards can further improve DRL scheduling performance.
Scheduling problems are typically modeled as Markov decision process (MDP) problems, often with multiple objectives, where agents observe system states, select scheduling actions, receive feedback (rewards), and update their policies accordingly [45,46]. The environment state includes task attributes and system resource metrics, while the reward function is designed to incentivize desirable behaviors, such as maximizing throughput, minimizing delay, and balancing load.

2.3. Application of Transformer Mechanisms in Scheduling

The self-attention mechanism of the Transformer architecture is widely used in scheduling tasks because it can effectively model long-term sequential dependencies [47]. Unlike RNN-based models, Transformers support parallel processing of sequence elements, significantly accelerating training speed [48]. Additionally, their global attention structure can model complex cross-task temporal relationships and provide better interpretability through attention heatmaps.
In the field of HPC scheduling, Transformer-based models offer promising solutions for adapting to dynamic and heterogeneous workloads. Their flexibility allows for encoding diverse task features and system states, while their scalability supports efficient deployment in large-scale environments. Therefore, combining Transformer mechanisms with DRL strategies is emerging as a promising direction for the next generation of intelligent schedulers.

3. Methodology

3.1. Problem Statement

In HPC task scheduling problems, the scheduler manages a dynamic task queue, where each task has specific resource requirements and execution time. The primary objectives of resource scheduling are to optimize resource allocation, minimize task waiting time, improve resource utilization, and maximize system throughput. In this study, GTrXL-SPPO models the scheduler as an intelligent agent that determines the optimal allocation of idle computing nodes to tasks in the waiting queue, as shown in Figure 1. The intelligent agent retrieves job characteristics and resource node states from the environment and concatenates this information into a unified state vector. Within the Markov Decision Process (MDP) framework, the state space and action space are systematically modeled to support intelligent scheduling decisions. Through continuous interaction with the environment, the intelligent agent learns the optimal scheduling strategy.
Figure 1 comprises two modules. Environment Module (upper part): represents a high-performance computing (HPC) multi-resource scheduling system, consisting of four core components: (1) a Job Resource Information Management Module, responsible for maintaining dynamic job queues; (2) a Window-based Resource Allocation Mechanism, used for resource scheduling; (3) a GTrXL History Buffer, storing past scheduling decisions and system states; and (4) a Multi-step Feedback Mechanism, providing real-time system state updates. Agent Module (lower part): implements an actor–critic architecture, where the actor network processes state information through the GTrXL layer (handling long-term dependencies) and the SE layer (enhancing features) to output scheduling strategies, and the critic network evaluates state values to optimize the policy. Reward feedback evaluates scheduling effectiveness (resource utilization, waiting time), state feedback updates system conditions, and historical feedback enables experience-based learning through buffers. Arrows indicate the direction of data flow between components.

3.1.1. MDP Formulation

We model the HPC scheduling problem as an MDP: at each step, the agent observes the system state $s_t$ and selects an action $a_t$ to assign tasks to idle nodes.
To handle resource constraints and priority requirements in HPC environments, we extend the basic MDP to a constrained Markov decision process (CMDP), formally defined as a tuple $(S, A, P, R, C, \gamma)$, where:
  • $S$ is the state space, representing all possible system configurations
  • $A$ is the action space, representing all possible job-to-node assignments
  • $P : S \times A \times S \to [0, 1]$ is the transition probability function
  • $R : S \times A \times S \to \mathbb{R}$ is the reward function
  • $C : S \times A \times S \to \mathbb{R}^m$ is a vector of $m$ constraint functions
  • $\gamma \in [0, 1)$ is the discount factor
The optimization objective is to find a policy π that maximizes the expected cumulative discounted reward while satisfying the constraints:
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$$
subject to:
$$\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})\right] \le b_i, \quad \forall i \in \{1, 2, \ldots, m\}$$
where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory, and $b_i$ are threshold constants for each constraint.

3.1.2. State and Action Space

The system state $s_t$ contains comprehensive job and system information with a hierarchical structure. At the high level, we organize state information into functional modules: $s_t = [J_t, W_t, R_t, h_t]$, where $J_t$ represents the job feature vector, including the number of nodes requested by job $i$ (reqProc), the runtime requested by job $i$ (reqTime), and the waiting time $w_i$ (currentTime − submitTime); $W_t$ denotes the waiting-time vector across all queued jobs; $R_t$ represents the current resource state vector indicating node availability and utilization; and $h_t$ contains the historical information buffer that tracks past scheduling decisions and system performance.
At decision step $t$, the implementation-level state representation $\phi_t$ reflects the real-time status of system resources and pending tasks. It contains key information such as CPU, GPU, and memory usage, as well as task-specific attributes including resource requests, estimated execution time, and waiting time. Formally, the state space can be modeled as a combination of task windows and node pools. For each waiting task $i$, $\text{req}_i$ denotes the number of requested resource cores, $w_i$ denotes the waiting time, defined as $w_i = t_{\text{current}} - t_{\text{arrival}}(i)$, $p_i$ denotes the assigned priority, and $e_i$ denotes the estimated execution time. For each compute node $k$, $n_k$ denotes the node's availability state, and $r_k$ denotes the remaining execution time of the current task on node $k$. The detailed state representation at time step $t$ can be written as:
$$\phi_t = \bigcup_{i=1}^{n} [\text{req}_i, w_i, p_i, e_i] \;\cup\; \bigcup_{k=1}^{N} [n_k, r_k]$$
where $n$ denotes the number of tasks to be processed, and $N$ denotes the total number of resource nodes.
The action space is implemented as a discrete space whose size is equal to the window size, formally defined as $A = \{0, 1, 2, \ldots, \text{window\_size} - 1\}$. Each action $a_t \in A$ represents an index in the current waiting queue. The agent selects this index to determine the task to be scheduled with the highest priority. The action mechanism is implemented through queue reordering: when action $a_t = i$ is selected, the task with index $i$ in the waiting queue is moved to the front of the queue, thereby obtaining the highest scheduling priority. In practical applications, when the waiting queue size is smaller than the window size, the effective action space is dynamically restricted to $A_{\text{eff}} = \{0, 1, 2, \ldots, \min(\text{window\_size}, |\text{wait\_queue}|) - 1\}$. This discrete action modeling solves the task selection problem while maintaining computational feasibility, with the size of the action space varying with the observation window size.
This hierarchical design enables the agent to comprehensively observe the characteristics of the task queue and the real-time status of the resource pool, providing sufficient information for making scheduling decisions within the CMDP framework.
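For illustration, the following Python sketch assembles the flat state representation $\phi_t$ from per-job and per-node attributes under this hierarchical design. The dictionary field names and the zero-padding of partially filled windows are assumptions made for exposition rather than details of the actual implementation.

import numpy as np

def build_state(jobs, nodes, window_size=200):
    # jobs: list of dicts with keys "req", "wait_time", "priority", "est_runtime"
    # nodes: list of dicts with keys "available", "remaining_time"
    job_feats = []
    for job in jobs[:window_size]:
        job_feats.extend([job["req"], job["wait_time"], job["priority"], job["est_runtime"]])
    # Zero-pad the job block when fewer jobs than window_size are queued
    job_feats.extend([0.0] * (4 * (window_size - min(len(jobs), window_size))))
    node_feats = []
    for node in nodes:
        node_feats.extend([float(node["available"]), node["remaining_time"]])
    return np.asarray(job_feats + node_feats, dtype=np.float32)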

3.1.3. Decision Variables and Constraints

The scheduling decision at each time step can be represented by the following decision variables:
  • $x_{j,n,t} \in \{0, 1\}$: binary variable indicating whether job $j$ is assigned to node $n$ at time step $t$
  • $y_{j,t} \in \{0, 1\}$: binary variable indicating whether job $j$ starts execution at time step $t$
These decision variables are subject to the following constraints:
  • Resource Capacity Constraint: Each node can be assigned to at most one job at any time
    $$\sum_{j \in J} x_{j,n,t} \le 1, \quad \forall n \in N, \; \forall t$$
  • Job Completeness Constraint: A job must receive all its requested resources or none at all
    $$\sum_{n \in N} x_{j,n,t} = \text{req}_j \cdot y_{j,t}, \quad \forall j \in J, \; \forall t$$
  • Non-preemption Constraint: Once a job starts execution, it must run to completion
    $$y_{j,t} = 1 \;\Rightarrow\; \forall t' \in [t, t + d_j), \; \sum_{n \in N} x_{j,n,t'} = \text{req}_j$$
  • Priority Constraint: Higher priority jobs with long waiting times should be scheduled before lower priority jobs
    $$(p_j > p_{j'}) \wedge (w_j > w_{\text{threshold}}) \;\Rightarrow\; \text{schedule\_priority}(j) > \text{schedule\_priority}(j')$$
In practice, the resource capacity and job completeness constraints are enforced through action space design and action masking, while the non-preemption and priority constraints are encouraged through reward shaping.
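A minimal sketch of how such action masking can be realized in PyTorch is shown below; the function name and tensor shapes are illustrative, assuming one logit per window slot as described in Section 3.1.2.

import torch

def masked_policy(logits, queue_length, window_size):
    # Disallow indices that point past the current waiting queue so that only
    # feasible job selections receive probability mass.
    mask = torch.zeros(window_size, dtype=torch.bool)
    mask[: min(queue_length, window_size)] = True
    masked_logits = logits.masked_fill(~mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

# Example: a 200-slot window with only 3 queued jobs
dist = masked_policy(torch.randn(200), queue_length=3, window_size=200)
action = dist.sample()  # always falls in {0, 1, 2}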

3.2. Neural Architecture Design

At the upper decision layer, the agent evaluates jobs in the waiting queue and decides whether to execute or defer their scheduling, ensuring fair resource allocation through priority adjustment and reservation strategies. At the lower decision layer, the agent maximizes resource utilization by filling idle nodes with smaller jobs. This hierarchical decision structure enhances the flexibility and robustness of the scheduler, ensuring fairness for large jobs while improving overall resource efficiency under dynamic load conditions. The overall neural network architecture supporting this scheduling strategy is illustrated in Figure 2, detailing input feature processing, SE-based feature enhancement, GTrXL sequence modeling, and policy/value output branches. The system-level scheduling process (environment, job queue, reservation mechanism, agent decisions) was previously depicted in Figure 1. Here, Figure 2 focuses specifically on the internal design of the neural network responsible for decision-making.

3.3. Multi-Scale Feature Enhancement

To address the challenge of dynamically allocating weights for heterogeneous tasks and system features in HPC environments, especially in scenarios with burst workloads and resource competition, we enhance traditional reinforcement learning by introducing an adaptive feature recalibration mechanism. Specifically, the SE module [20] combines average pooling with a bottleneck multi-layer perceptron (channel → channel/4 → channel) and a Sigmoid activation function to capture global temporal relationships between feature channels, effectively suppressing irrelevant features while amplifying key features. The computational overhead is low; for example, adding SE to ResNet-50 increases parameters by only 10% but significantly improves accuracy [20].
In our architecture, the SE block is placed before the GTrXL layer in both the actor network and critic network, enabling temporal modeling to operate on the enhanced features.
To extend adaptability beyond static representations, we propose a novel SECT layer that fuses global statistical information and scheduling-aware dynamics by combining average and max aggregation (Algorithm 1):
$$z_{\text{fusion}} = \tfrac{1}{2}\left(\text{mean}(x) + \max(x)\right)$$
$$w_{\text{base}} = \sigma\left(W_2 \cdot \text{GELU}(W_1 \cdot z_{\text{fusion}})\right)$$
$$w_{\text{sched}} = \sigma\left(W_4 \cdot \text{Dropout}(\text{GELU}(W_3 \cdot \text{mean}(x)))\right)$$
$$w_{\text{final}} = \frac{w_{\text{base}} + w_{\text{sched}}}{2}, \quad x_{\text{out}} = x \odot w_{\text{final}}$$
This design adapts channel weights to workload volatility, enhancing robustness without significant overhead [20].
Algorithm 1 SECT Layer
Require: Input tensor X ∈ R^(B×L×C), channel dimension C, reduction ratio r = 4
Ensure: Output tensor Y ∈ R^(B×L×C)
1: function SECT(X)
2:   B, L, C ← shape(X)
3:   s ← (1/L) Σ_{i=1}^{L} X[:, i, :]        ▹ Global average pooling
4:   z1 ← Linear(s, C/r)                     ▹ First fully connected layer
5:   z2 ← ReLU(z1)
6:   z3 ← Linear(z2, C)                      ▹ Restore channel dimension
7:   a ← Sigmoid(z3)
8:   Y ← X ⊙ a.unsqueeze(1)
9:   return Y
10: end function
Mathematical Variable Definitions:
  • B: Batch size representing the number of parallel scheduling instances
  • L: Sequence length corresponding to the number of jobs/nodes in the current scheduling window
  • C: Channel dimension representing the feature dimension of each job/node embedding
  • r: Reduction ratio controlling the bottleneck dimension ( C / r ) in the excitation pathway
  • $s \in \mathbb{R}^{B \times C}$: Global statistics vector obtained through temporal average pooling across the sequence dimension
  • $z_1 \in \mathbb{R}^{B \times C/r}$: Compressed feature representation after dimensionality reduction
  • $z_2 \in \mathbb{R}^{B \times C/r}$: Activated features after ReLU non-linearity
  • $z_3 \in \mathbb{R}^{B \times C}$: Restored feature dimension after the second linear transformation
  • $a \in \mathbb{R}^{B \times C}$: Channel-wise attention weights normalized to the $[0, 1]$ range via sigmoid activation
  • $a.\text{unsqueeze}(1)$: Expands attention weights to shape $\mathbb{R}^{B \times 1 \times C}$ for broadcasting across the sequence dimension
  • ⊙: Element-wise multiplication (Hadamard product) applying channel attention to input features
The SECT mechanism learns to emphasize important scheduling-relevant channels while suppressing less informative features through this squeeze-and-excitation operation.
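For concreteness, a minimal PyTorch sketch of the SECT recalibration is given below. It follows the dual-branch formulation above ($z_{\text{fusion}}$, $w_{\text{base}}$, $w_{\text{sched}}$, $w_{\text{final}}$) with reduction ratio $r = 4$; the class name, dropout rate, and exact layer arrangement are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class SECTLayer(nn.Module):
    # Channel recalibration over (B, L, C) inputs: fused mean/max pooling feeds
    # the base branch, mean pooling feeds the scheduling-aware branch.
    def __init__(self, channels: int, reduction: int = 4, p_drop: float = 0.1):
        super().__init__()
        hidden = channels // reduction
        self.base = nn.Sequential(nn.Linear(channels, hidden), nn.GELU(),
                                  nn.Linear(hidden, channels), nn.Sigmoid())
        self.sched = nn.Sequential(nn.Linear(channels, hidden), nn.GELU(),
                                   nn.Dropout(p_drop),
                                   nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1)                      # pool over the job/node sequence
        mx, _ = x.max(dim=1)
        z_fusion = 0.5 * (avg + mx)
        w_final = 0.5 * (self.base(z_fusion) + self.sched(avg))   # (B, C) weights
        return x * w_final.unsqueeze(1)                           # broadcast over L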

3.4. Dynamic Memory Mechanism

In high-performance computing (HPC) scheduling, task queues with heterogeneous arrival times and differing resource requirements give rise to complex temporal patterns. Standard Transformers, however, incur $O(n^2)$ attention complexity over such sequences; this quadratic cost stems from the self-attention mechanism computing pairwise interactions between all task positions [17], and the fixed context window constraint leads to performance degradation once sequences exceed a certain length [49].

3.4.1. GTrXL: Enhanced Memory Architecture

To address these limitations, we propose GTrXL (Gated Transformer-XL), an enhanced Transformer architecture specifically designed for long-term temporal modeling in HPC scheduling scenarios. GTrXL builds upon the segment-level recurrence mechanism [49] to model dependencies beyond a fixed-length context without compromising temporal consistency. GTrXL supports a relaxed Markov approximation in which $\pi(a_t \mid s_t)$ is determined by the most recent sequence of states $[s_{t-k}, \ldots, s_t]$. Unlike standard Markov models that only consider the immediate state, our method preserves contextual information over extended time spans [50,51]. To maintain temporal consistency between scheduled segments, GTrXL uses relative position encoding instead of absolute position encoding. This enables generalization to sequences longer than those seen during training, which is critical for handling variable-length task queues in HPC workloads. A GRU-based gating system dynamically regulates information flow within each Transformer block, selectively preserving relevant sequential information while filtering outdated inputs (Algorithm 2).
Algorithm 2 Gated Transformer-XL Core Module in GTrXL-SPPO
Require: State input X, temporal memory M, attention mask mask
Ensure: Output feature Y, updated memory M_new
1: function GTrXLBlock(X, M, mask)
2:   X ← LayerNorm(X)                         ▹ Normalize input
3:   M ← LayerNorm(M)                         ▹ Normalize memory
4:   A ← MultiHeadAttention(X, M, M, mask)
5:   r ← σ(W_r A + U_r X)                     ▹ Reset gate
6:   z ← σ(W_z A + U_z X − b_g)               ▹ Update gate
7:   h̃ ← tanh(W_g A + U_g (r ⊙ X))
8:   H1 ← (1 − z) ⊙ X + z ⊙ h̃                ▹ Output of first GRU-style gate
9:   F ← FeedForward(LayerNorm(H1))
10:  r2 ← σ(W_r F + U_r H1)                   ▹ Reset gate (second gate)
11:  z2 ← σ(W_z F + U_z H1 − b_g)             ▹ Update gate (second gate)
12:  h̃2 ← tanh(W_g F + U_g (r2 ⊙ H1))
13:  Y ← (1 − z2) ⊙ H1 + z2 ⊙ h̃2
14:  return Y
15: end function

3.4.2. Temporal Dependency Modeling

GTrXL models the temporal execution order by combining multi-head self-attention with relative position encoding. Each attention head computes queries, keys, and values:
$$Q = XW_Q \in \mathbb{R}^{L \times d}, \quad K = XW_K \in \mathbb{R}^{L \times d}, \quad V = XW_V \in \mathbb{R}^{L \times d}$$
Relative position encodings capture execution order relationships:
$$R_{i,j} = \frac{1}{\sqrt{d}} \cos\big((i - j) \cdot \omega\big)$$
$$\text{AttnScore}_{i,j} = \frac{Q_i K_j^{T} + Q_i R_{i-j}^{T}}{\sqrt{d_h}}$$
This design allows the agent to recognize execution-order correlations critical for effective scheduling decisions.
The GTrXL architecture employs a two-layer gating strategy within each Transformer block to adaptively control temporal context updates. The high-level operations are:
$$A = \text{MHA}(Q, K, V), \quad h = \text{GRU}_1(Q, A)$$
$$F = \text{FFN}(h), \quad O = \text{GRU}_2(h, F)$$
This gated structure enables selective temporal state updates, allowing rapid integration of critical scheduling signals (e.g., high-priority job arrivals) while maintaining stability under dynamic workloads. The complete memory flow and compression pipeline are illustrated in Figure 3.
Attention Weight Matrices:
  • $X \in \mathbb{R}^{L \times d}$: Input sequence tensor, where $L$ is the sequence length and $d$ is the model dimension
  • $W_Q \in \mathbb{R}^{d \times d}$: Query projection matrix mapping the input to the query space
  • $W_K \in \mathbb{R}^{d \times d}$: Key projection matrix mapping the input to the key space
  • $W_V \in \mathbb{R}^{d \times d}$: Value projection matrix mapping the input to the value space
  • $Q, K, V \in \mathbb{R}^{L \times d}$: Query, key, and value matrices for attention computation
These projection matrices are learned parameters that transform the input scheduling context into query, key, and value representations for multi-head attention.
Here, $\omega \in \mathbb{R}$ is the frequency parameter that controls the rate of positional variation (typically initialized as $\omega = 1/10{,}000$), and $d_h = d / H$ is the dimension per attention head, with $H$ the total number of attention heads and $d$ the model dimension. The relative position encoding $R_{i,j}$ captures the temporal distance between positions $i$ and $j$ in the scheduling sequence, enabling the model to understand execution-order dependencies.
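To make the role of the relative-position term concrete, the short sketch below computes the cosine encoding and the combined attention scores for a single head. For simplicity it treats $R_{i,j}$ as an additive scalar bias on the content score and scales the encoding by $\sqrt{d_h}$; both simplifications are assumptions for illustration rather than the exact formulation above.

import math
import torch

def relative_attention_scores(Q, K, omega=1.0 / 10000.0):
    # Q, K: (L, d_h) single-head projections of the scheduling sequence.
    L, d_h = Q.shape
    idx = torch.arange(L, dtype=torch.float32)
    rel = idx.unsqueeze(1) - idx.unsqueeze(0)            # entry (i, j) holds i - j
    R = torch.cos(rel * omega) / math.sqrt(d_h)          # cosine relative-position encoding
    scores = (Q @ K.T + R) / math.sqrt(d_h)              # content term plus positional bias
    return torch.softmax(scores, dim=-1)                 # attention weights per query position

# Example with a 5-step scheduling window and 64-dimensional head
weights = relative_attention_scores(torch.randn(5, 64), torch.randn(5, 64))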

3.4.3. Implementation Details and Training Integration

The GRU-based gating mechanism within each GTrXL block follows standard gated recurrent unit formulations. For each gating layer, the operations are defined as:
$$r_t = \sigma(W_r K_t + U_r Q_t)$$
$$z_t = \sigma(W_z K_t + U_z Q_t - b_g)$$
$$\tilde{h}_t = \tanh\big(W_g K_t + U_g (r_t \odot Q_t)\big)$$
$$h_t = (1 - z_t) \odot Q_t + z_t \odot \tilde{h}_t$$
GRU Gating Parameters:
  • $r_t \in \mathbb{R}^d$: Reset gate vector controlling how much past information to forget
  • $z_t \in \mathbb{R}^d$: Update gate vector determining the balance between past and new information
  • $\tilde{h}_t \in \mathbb{R}^d$: Candidate hidden state representing new information
  • $h_t \in \mathbb{R}^d$: Final output combining past and candidate states
  • $Q_t, K_t \in \mathbb{R}^d$: Input vectors encoding scheduling context (runtime estimates, resource status)
  • $W_r, W_z, W_g \in \mathbb{R}^{d \times d}$: Input-to-hidden weight matrices for the reset, update, and candidate gates
  • $U_r, U_z, U_g \in \mathbb{R}^{d \times d}$: Hidden-to-hidden recurrent weight matrices
  • $b_g \in \mathbb{R}^d$: Gate bias vector (typically initialized to 0.1 for stable training)
  • $\sigma(\cdot)$: Sigmoid activation function mapping values to the $(0, 1)$ range
  • $\tanh(\cdot)$: Hyperbolic tangent activation function mapping values to the $(-1, 1)$ range
  • ⊙: Element-wise multiplication (Hadamard product)
The weight matrices are randomly initialized using Xavier/Glorot initialization and learned during training through backpropagation.
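The PyTorch sketch below mirrors these gating equations, with x playing the role of the skip input $Q_t$ and y the transformed branch $K_t$ (attention or feed-forward output). The module name is illustrative, and initialization beyond the 0.1 gate bias is left at PyTorch defaults rather than the Xavier scheme mentioned above.

import torch
import torch.nn as nn

class GRUGate(nn.Module):
    # GRU-style gate used in place of the residual connection of a Transformer block.
    def __init__(self, dim: int, gate_bias: float = 0.1):
        super().__init__()
        self.W_r, self.U_r = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.W_z, self.U_z = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.W_g, self.U_g = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.b_g = nn.Parameter(torch.full((dim,), gate_bias))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.W_r(y) + self.U_r(x))             # reset gate
        z = torch.sigmoid(self.W_z(y) + self.U_z(x) - self.b_g)  # update gate, biased toward identity
        h_tilde = torch.tanh(self.W_g(y) + self.U_g(r * x))      # candidate state
        return (1.0 - z) * x + z * h_tilde                       # gated output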
When critical job characteristics (e.g., high-priority arrivals) emerge, the gates adaptively adjust, facilitating fast integration of new signals (Algorithm 3) and improving responsiveness under dynamic workloads.
Algorithm 3 GTrXL-SPPO collaborative processing framework
Require: Input state X ∈ R^(B×L×1×2), memory M, mask mask
Ensure: Actor output π(a|s), critic output V(s)
1: function CollaborativeProcessing(X, M, mask)
2:   if X.dim() == 4 then
3:     X ← X.squeeze(2)
4:   end if
5:   X_flat ← X.view(−1, 2)
6:   X_fused ← ReLU(Linear_{2→1}(X_flat))
7:   X_seq ← X_fused.view(1, −1)
8:   H1 ← ReLU(Linear(X_seq))
9:   Y ← SECT(H1)                             ▹ Apply Algorithm 1
10:  H2 ← ReLU(Linear(Y))
11:  if H2.dim() == 1 then
12:    H_input ← H2.unsqueeze(0)
13:  else
14:    H_input ← H2
15:  end if
16:  H_final ← GTrXLBlock(H_input, M, mask)   ▹ Apply Algorithm 2
17:  H_output ← H_final.squeeze(0)
18:  logits_actor ← Linear_action(H_output)
19:  π(a|s) ← Softmax(logits_actor)
20:  V(s) ← Linear_value(H_output)
21:  return π(a|s), V(s)
22: end function
GTrXL-SPPO Collaborative Integration Analysis:
The collaborative framework synergistically combines SECT (Algorithm 1) and GTrXL (Algorithm 2) to maximize their complementary advantages in HPC scheduling:
Processing Pipeline:
  • Input Standardization: Consolidates dual-channel HPC state information through 2→1 linear transformation
  • Feature Enhancement: Applies SECT channel attention to amplify scheduling-relevant features
  • Temporal Modeling: Leverages GTrXL gated Transformer blocks for long-range dependency capture
  • Shared Dual Output: Both actor and critic networks utilize identical enhanced representations
GTrXL-SPPO collaboration enhances feature quality through SECT preprocessing, improves temporal modeling through GTrXL processing of refined features, achieves computational efficiency through a shared pipeline architecture, and enables memory-aware processing that preserves critical scheduling history across sessions.

3.5. Training Strategy

3.5.1. Algorithm Design

GTrXL-SPPO employs the PPO algorithm to update scheduling policies via gradient ascent, combining policy- and value-based reinforcement learning for stable convergence [21]. The enhanced GTrXL serves as the temporal backbone to capture sequential scheduling dynamics in HPC environments.
The clipped surrogate objective is:
$$L^{\text{clip}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\Big]$$
where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, $\epsilon = 0.2$, and $\hat{A}_t$ is computed via Generalized Advantage Estimation (GAE):
$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma \lambda)^l \, \delta_{t+l}$$
$$\delta_t = r_t + \gamma \, V_\phi(s_{t+1}) - V_\phi(s_t)$$
To address the long-tail reward distribution induced by dynamic job arrivals and evolving system states, we adopt an exponentially sensitive clipping strategy and introduce a time-decaying discount factor $\gamma_t = 1 - \frac{t}{T}$, reducing the weight of late-stage decisions and stabilizing learning.
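A compact sketch of the clipped surrogate loss and GAE computation is shown below; it implements the standard formulas above with $\epsilon = 0.2$ and $\lambda = 0.95$, and omits the exponential clipping and time-decaying discount variants specific to our framework.

import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # rewards: length-T sequence; values: length-(T+1) sequence of critic estimates.
    T = len(rewards)
    adv = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        gae = delta + gamma * lam * gae                          # discounted accumulation
        adv[t] = gae
    return adv

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    # Clipped surrogate objective, returned as a loss to minimize.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()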

3.5.2. Constraint Handling in Training

To incorporate constraints into the reinforcement learning framework, we adopt a dual approach:
  • Hard constraints: Resource capacity and task completion constraints are implemented through action space design and action masking:
    $$A_t = \{a \in A \mid a \text{ satisfies constraints 1 and 2}\}$$
  • Soft constraints: Non-preemption and priority constraints are encouraged through reward shaping:
    $$R_t^{\text{constraint}} = -\mu_1 \cdot \text{preemption\_violations} - \mu_2 \cdot \text{priority\_violations}$$
This hybrid approach ensures that all scheduling decisions are feasible while guiding the policy toward satisfying the softer constraints. The constrained optimization problem is thus transformed into an unconstrained problem with a modified reward function:
$$R_t^{\text{modified}} = R_t + R_t^{\text{constraint}}$$
Implementation Scope Clarification: While our CMDP framework supports constraint handling as described above, the experimental validation in this work focuses on core scheduling efficiency to establish baseline performance. Full constraint integration (deadlines, energy, thermal limits) represents ongoing development work and will be addressed in future extensions of this framework.

3.5.3. Reward Function Design

We formulate the reward function as a multi-objective optimization problem. The total reward R total in GTrXL-SPPO integrates four scheduling objectives and a queue pressure regularization term:
$$R_{\text{total}} = \sum_{k=1}^{4} \omega_k \cdot r_k + \alpha_{\text{reg}} \cdot r_{\text{pressure}}$$
Each component reward is mathematically defined as follows:
  • System Utilization Reward:
    $$r_1 = \sigma\big(10 \cdot (u_t - 0.5)\big)$$
    where $u_t = \frac{\text{occupied\_nodes}_t}{\text{total\_nodes}}$ is the system utilization at time $t$, and $\sigma$ is the sigmoid function.
  • Waiting Time Penalty:
    $$r_2 = -\text{Huber}\big(w_t - w_{\text{target}}\big)$$
    where w t is the average waiting time of jobs in the queue at time t, w target is the target waiting time, and Huber is the Huber loss function that reduces sensitivity to outliers.
  • Throughput Reward:
    $$r_3 = \alpha \cdot \text{jobs\_completed}_t$$
    where jobs_completed t is the number of jobs completed at time t, and α is a scaling factor.
  • Priority Reward:
    $$r_4 = \beta \cdot \sum_{j \in \text{scheduled}_t} p_j$$
    where scheduled t is the set of jobs scheduled at time t, p j is the priority of job j, and β is a scaling factor.
The queue pressure regularization term penalizes the accumulation of jobs in the waiting queue:
$$r_{\text{pressure}} = -\log(1 + \text{queue\_length}_t)$$
where α reg = 0.5 is the regularization coefficient. The adaptive temperature parameter T controls the exploration–exploitation balance in the reward weighting mechanism:
$$\omega_k(T) = \frac{\exp(\omega_k / T)}{\sum_{j=1}^{4} \exp(\omega_j / T)}$$
This adaptive design enables the agent to switch its focus between efficiency and fairness under different load conditions. As shown in our parameter sensitivity analysis (Section 5.5), this dynamic temperature adjustment is crucial for balancing multiple scheduling objectives.
When T = 1.0 (default), the model achieves optimal performance across all metrics; during periods of low utilization, lower values ( T = 0.5 ) prioritize resource efficiency; and during periods of high queue pressure, higher values ( T = 2.0 ) promote fairness through more balanced target weight allocation. In long-tail scheduling environments [52,53], our reward scheme accelerates convergence and improves the resource-delay Pareto frontier compared to standard PPO training.
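The following sketch illustrates how the temperature-weighted reward combination can be computed; the base weights and reward values are illustrative numbers, not tuned parameters from our experiments.

import math

def combine_rewards(r, base_weights, queue_length, temperature=1.0, alpha_reg=0.5):
    # Temperature-softmaxed weights over the four objective rewards, plus the
    # queue-pressure regularizer defined above.
    exps = [math.exp(w / temperature) for w in base_weights]
    weights = [e / sum(exps) for e in exps]                       # omega_k(T)
    r_pressure = -math.log(1.0 + queue_length)                    # queue pressure penalty
    return sum(w_k * r_k for w_k, r_k in zip(weights, r)) + alpha_reg * r_pressure

# Example: equal base weights; a higher T flattens the weight distribution
print(combine_rewards([0.8, -0.2, 0.5, 0.3], [1.0, 1.0, 1.0, 1.0], queue_length=12, temperature=2.0))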

3.6. Training Acceleration Strategies

To accelerate training convergence and improve efficiency, the GTrXL-SPPO framework adopts a comprehensive optimization strategy that combines distributed parallel training, memory replay technology, and hybrid optimization methods.
(1) Distributed Training: Parallel processing significantly accelerates the training process compared to a single GPU setup [54]. We use PyTorch to implement DDP training on multiple GPUs. The training dataset is divided into multiple shards, each assigned to a GPU device for independent gradient computation:
$$D_k = \{(s_t^i, a_t^i, r_t^i) \mid i \bmod N = k\}, \quad k = 0, 1, \ldots, N-1$$
where N denotes the number of available GPUs, and D k denotes the data subset processed by GPU k. After local gradient updates, the AllReduce operation synchronizes gradients across all devices.
(2) Memory replay mechanism: To better capture long-term temporal patterns in scheduling dynamics, we introduce a memory replay mechanism [19,49]. Specifically, historical hidden states and intermediate features across time steps are stored in a recursive memory matrix. Combined with the GRU gating structure, this design dynamically updates memory content during training, enhancing the agent’s ability to handle time-extended task sequences.
(3) Hybrid Optimization Strategy: In the policy optimization phase, we integrate the Adam optimizer with learning rate warm-up and cosine annealing strategies to ensure a smoother training process and more stable model convergence. Additionally, to reduce communication overhead in multi-GPU environments, we apply 1-bit Adam compression, where gradients are compressed before synchronization:
$$\tilde{g}_t = \text{sign}(g_t) \odot |g_t|, \quad g_t = \nabla_\theta L(\theta)$$
This strategy significantly reduces synchronization communication while maintaining model convergence performance [55].
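A minimal sketch of the distributed setup is shown below, using only standard PyTorch primitives (one process per GPU, gradients synchronized by DDP's AllReduce, Adam with cosine annealing). The 1-bit gradient compression step is omitted here, and the function assumes the usual MASTER_ADDR/MASTER_PORT environment variables are set by the launcher.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp_training(model, rank, world_size, lr=1e-4, total_steps=5000):
    # One process per GPU; DDP averages gradients across ranks via AllReduce.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    ddp_model = DDP(model.cuda(rank), device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return ddp_model, optimizer, scheduler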

3.7. Scalability and Deployment Considerations

Facing practical deployment issues, we analyze the ability of GTrXL-SPPO to address three key challenges: node scaling, heterogeneity, and architecture migration.

3.7.1. Node Scaling Analysis

Our sliding window mechanism (Figure 4) ensures that the memory complexity is O ( W ) rather than O ( N ) , where W is the fixed window size (200) and N is the total number of nodes. Combined with the attention aggregation mechanism in GTrXL, this design supports scalability without triggering linear memory growth.

3.7.2. Heterogeneous Adaptation

The SECT module (Section 3.3) supports heterogeneous environments through dynamic feature recalibration. Our experiments on ANL-Intrepid (grid architecture), Alibaba (general-purpose cluster), SDSC-SP2 (homogeneous), and PIK-IPLEX (heterogeneous) show that this module can achieve effective adaptation under different cluster configurations without architecture-specific modifications (Section 5).

3.7.3. Architecture Migration

The modular design supports migration through the following mechanisms:
  • Configuration-specific model loading: The framework adapts to clusters of different scales by loading pre-trained models for different system configurations.
  • Distributed training support: Multi-GPU training capabilities are implemented through PyTorch’s distributed training framework, supporting scalable training on different hardware architectures.
  • General scheduling pattern learning: The GTrXL backbone network learns a general scheduling pattern that can be transferred between different computing architectures without architecture-specific retraining.
Our multi-dataset validation demonstrates cross-architecture generalization capabilities, indicating that the framework can successfully adapt to different computing scales and resource configurations without fundamental algorithmic changes.

3.8. Demonstrative Iteration of the Proposed Algorithm

To further clarify the practical application of the proposed GTrXL-SPPO framework, we present a step-by-step example of a single scheduling iteration, highlighting the use of the SECT and GTrXL modules.
Example Setup: Suppose the job queue and resource state are as follows (in Table 3 and Table 4):
Step 1: State Vector Construction
The state vector s t is constructed as:
$$s_t = \big\{[1, 10, 2, 30], [2, 5, 1, 20], [1, 15, 3, 40]\big\} \cup \big\{[1, 0], [0, 10]\big\}$$
Step 2: Feature Enhancement via SECT Layer
The input tensor X is processed by the SECT layer (Algorithm 1). This adaptively recalibrates channel-wise features.
Step 3: Temporal Modeling via GTrXL Block
The enhanced features are passed through the GTrXL block (Algorithm 2), which models temporal dependencies and gating.
Step 4: Action Selection
The policy network outputs action probabilities, e.g.,
$$\pi(a \mid s_t) = [0.2, 0.5, 0.3]$$
The agent selects job B (index 2) for scheduling.
Step 5: Reward Calculation
After scheduling, the reward is computed as:
$$R_{\text{total}} = \sum_{k=1}^{4} \omega_k \cdot r_k + \alpha_{\text{reg}} \cdot r_{\text{pressure}}$$
where r 1 is system utilization, r 2 is the waiting penalty, etc.
Step 6: Policy Update
The PPO loss is calculated and the policy is updated accordingly.
This example demonstrates how the proposed GTrXL-SPPO framework processes input features, models temporal dependencies, selects actions, and updates the policy in a single scheduling iteration.
Our approach significantly improves scheduling efficiency by optimizing both resource utilization and job waiting times through the novel GTrXL-SPPO framework.

4. Experimental Setup

4.1. Experimental Configuration and Reproducibility

To ensure full reproducibility of the experiments, we provide key experimental settings and evaluation methods.

4.1.1. Dataset Configuration and Partitioning

For each dataset (ANL-Intrepid, Alibaba, SDSC-SP2, PIK-IPLEX), we adopt a temporal split: a training period (first 80% of task submissions) and a testing period (last 20% of task submissions), with no temporal overlap to ensure the authenticity of the evaluation. Data preprocessing includes z-score standardization and min-max scaling to the range [0, 1].

4.1.2. Model and Training Configuration

GTrXL-SPPO employs a 12-layer Transformer architecture (12 attention heads, 768-dimensional hidden layer) combined with the SECT attention module (reduction ratio 16) and the PPO reinforcement learning algorithm. Key training parameters include: actor learning rate $1 \times 10^{-4}$, critic learning rate $1 \times 10^{-3}$, batch size 128, window size 200, PPO clipping ratio 0.2, GAE $\lambda$ 0.95, and 5000 total training rounds. All experiments are implemented in PyTorch 1.12.0 on an NVIDIA V100 GPU cluster (4 × 32 GB), with 30 independent random seeds (42–71) to ensure statistical reliability. Complete parameter details are provided in Appendix A.
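For reference, the key hyperparameters listed above can be collected into a single configuration object, as in the illustrative snippet below (the key names are our own, not those of the actual codebase).

# Key hyperparameters as reported in this section.
GTRXL_SPPO_CONFIG = {
    "transformer_layers": 12,
    "attention_heads": 12,
    "hidden_dim": 768,
    "sect_reduction_ratio": 16,
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "batch_size": 128,
    "window_size": 200,
    "ppo_clip_ratio": 0.2,
    "gae_lambda": 0.95,
    "total_training_rounds": 5000,
}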

4.1.3. Evaluation Protocol

Performance evaluation was conducted every 100 training rounds, with results recorded after a 500-round warm-up period. All results are reported as the mean, standard deviation, and 95% confidence interval of 30 independent runs, with statistical significance verified via paired t-tests (p < 0.001).

4.2. Baseline Algorithms

We evaluate GTrXL-SPPO against two categories of baselines:
  • Rule-based algorithms: FCFS, SJF, and EASY, which follow fixed heuristics but lack adaptability to dynamic workloads.
  • Reinforcement learning methods: Among RL approaches, PPO is selected due to its proven performance in HPC scheduling [3].

4.3. Workload Trajectories

We evaluate GTrXL-SPPO using four representative real-world workload trajectories: Alibaba [56], ANL-Intrepid [57], SDSC-SP2 [58], and PIK-IPLEX [59]. These datasets span multiple application domains, including cloud services, academic research, and scientific computing, and exhibit significant differences in job characteristics and resource requirements (see Table 5). The job arrival distributions and resource request distributions for each dataset are shown in Figure 5 and Figure 6. This diversity provides a foundation for evaluating the generalization capability of our method. Detailed experimental results are presented in Section 5.

4.4. GTrXL-SPPO Training

We implement GTrXL-SPPO in CQSim, a widely used HPC scheduling simulator. Real workload traces (Section 4.3) are fed sequentially, and the agent schedules jobs using a sliding window mechanism (see Figure 4). All code is implemented in PyTorch. Table 6 summarizes the parameter configurations under different workloads. The model receives state inputs constructed by sliding windows (e.g., [2, 200] for the Alibaba workload) and outputs discrete probabilities corresponding to node allocation decisions (e.g., 801 candidate actions representing 800 node assignments plus one wait decision). The Transformer backbone (12 layers, 576 dimensions, 12 heads) accounts for nearly half of the total parameters, with the model scaling from 136M to 160M parameters across workloads.
To further illustrate training effectiveness, Figure 7 presents the progression of total reward across all datasets. GTrXL-SPPO consistently achieves higher final reward values and faster early-stage gains compared to baselines such as FIFO, SJF, and vanilla RL. Notably, the reward curve converges more smoothly under diverse workload dynamics, confirming the model’s robustness and training stability across heterogeneous environments.
Training objectives include utilization, wait time, and throughput. We use a linear learning rate decay (initially $5 \times 10^{-5}$), curriculum-based load escalation, and sample preservation to enhance generalization. As shown in Figure 7, GTrXL-SPPO consistently surpasses the baselines under all workload scenarios, demonstrating faster convergence and achieving significantly higher total rewards, while the static algorithms (FIFO and SJF) exhibit fixed reward levels (represented by horizontal lines). Evaluation covers both real and synthetic traces (see Section 5.1).

4.5. Evaluation Metrics

To comprehensively assess the effectiveness of the proposed scheduling strategy, we employ a set of user-level and system-level evaluation metrics, as summarized below:
  • System Utilization: The proportion of time during which system resources are actively utilized.
  • Throughput: The number of jobs completed per unit time.
  • Average Wait Time: The mean time a job spends waiting from submission to execution.
  • Job Completion Time: The total time required from job submission to completion.
Among these, System Utilization and Throughput are system-level metrics, while Average Wait Time and Job Completion Time are user-level metrics. Together, they provide a multi-dimensional evaluation framework, enabling a comprehensive analysis of scheduling strategies in terms of system efficiency, job responsiveness, and resource optimization. This facilitates the identification and selection of strategies best suited to specific HPC workload scenarios.
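As a reference for how these metrics are computed, the sketch below derives all four from per-job records; the field names and the definition of makespan as the total schedule span in seconds are assumptions for illustration.

def scheduling_metrics(jobs, makespan, total_nodes):
    # jobs: list of dicts with fields "submit", "start", "end", "nodes" (times in seconds).
    wait_times = [j["start"] - j["submit"] for j in jobs]
    completion_times = [j["end"] - j["submit"] for j in jobs]
    busy_node_seconds = sum((j["end"] - j["start"]) * j["nodes"] for j in jobs)
    return {
        "system_utilization": busy_node_seconds / (total_nodes * makespan),
        "throughput_jobs_per_hour": len(jobs) / (makespan / 3600.0),
        "avg_wait_time": sum(wait_times) / len(jobs),
        "avg_completion_time": sum(completion_times) / len(jobs),
    }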

5. Case Study

Recent studies have adopted a paradigm of sampling real traces and augmenting them with synthetic data to construct training workloads for reinforcement learning-based HPC schedulers [3,60]. Building on this foundation, we conduct our experiments using CQGym, a widely adopted simulation platform for job scheduling research [3,61].

5.1. System Performance Evaluation

5.1.1. System Utilization and Waiting Time

As shown in Figure 8a, system utilization frequently approaches 100%, reflecting the model’s effectiveness in maximizing resource usage under varying load conditions. Additionally, Figure 8b demonstrates that our approach significantly reduces job scheduling waiting times during both training and testing phases, further highlighting the efficiency and responsiveness of the proposed method.

5.1.2. Overall Performance Evaluation

Figure 7 shows the total reward achieved by GTrXL-SPPO and baseline methods across different workloads, demonstrating the superior learning capability and convergence speed of our approach.

5.1.3. Comparative Analysis with PPO

As shown in Figure 9 and Figure 10, GTrXL-SPPO demonstrates superior performance across all evaluation metrics. The results on simulated workloads validate the algorithm’s effectiveness, while real-world workloads confirm its excellent generalization capability: GTrXL-SPPO achieves higher resource utilization, larger cumulative rewards, and lower task waiting times, demonstrating robust scheduling performance under diverse workload patterns.

5.2. Ablation Study and Component Analysis

To systematically evaluate the contributions of each component in our proposed GTrXL-SPPO framework, we conducted comprehensive ablation experiments on multiple datasets (ANL-Intrepid, SDSC-SP2, and Alibaba).

5.2.1. Experimental Design

We designed five model variants to evaluate different components:
  • GTrXL-SPPO (Full): The complete model with all components.
  • GTrXL-PPO (No SE): Removes the squeeze-and-excitation modules to assess their contribution to feature enhancement.
  • SPPO (No GTrXL): Replaces the Gated Transformer-XL with a standard feedforward network to assess the impact of temporal modeling.
  • TrXL-SPPO (No Gating): Removes the gating mechanism from the Transformer-XL to assess the importance of selective memory updates.
  • GTrXL-SPPO (Fixed Reward): Uses fixed weights in the reward function instead of dynamically temperature-controlled softmax weights.
Each variant is trained under identical conditions with the same hyperparameters, with the only difference being the specific component under evaluation. We assess performance using multiple metrics: system utilization, average task waiting time, throughput (tasks/hour), and resource fragmentation.

5.2.2. Results and Analysis

As shown in Table 7, each component contributes significantly to the overall performance of GTrXL-SPPO:
Impact of GTrXL: Removing the GTrXL component results in the most significant performance degradation, with system utilization decreasing by 14.4%, 15.4%, and 17.4% on the three datasets, respectively. This highlights the critical importance of temporal modeling in capturing long-term job dependencies.
Contribution of SE Module: The SE module improves system utilization by 4.4%, 4.4%, and 5.4% across datasets. By dynamically recalibrating feature importance, it enables better adaptation to changing resource conditions.
Role of Gating Mechanism: The gating mechanism improves system utilization by 3.5%, 3.9%, and 5.0%, respectively, enabling selective memory updates and noise filtering.
Dynamic Reward Weighting: Temperature-controlled dynamic weights improve system utilization by 3.2%, 3.6%, and 4.5% compared to fixed weights, allowing adaptive optimization priority adjustment.

5.3. Statistical Stability Analysis

To address the inherent randomness of reinforcement learning and ensure result reliability, we conducted GTrXL-SPPO experiments using 10 different random seeds (42–51) under identical conditions. Each run processed 3000 tasks using the same Alibaba Cluster workload with a 400-node configuration (Figure 11 and Table 8).
The statistical analysis demonstrates exceptional stability with low coefficients of variation: system utilization (CV = 2.5%), average waiting time (CV = 5.1%), and throughput (CV = 3.6%). Paired t-tests show statistically significant improvements (p < 0.001) with effect sizes exceeding Cohen’s d = 0.8, confirming substantial practical significance.

5.4. Runtime Performance Evaluation

To validate practical deployability in HPC environments, we conducted comprehensive runtime performance evaluation focusing on decision latency and computational overhead using PyTorch implementation on NVIDIA GPU hardware.
GTrXL-SPPO exhibits 9.0 ms mean decision latency with low variance ( σ = 0.2 ms), indicating stable performance. Through optimization strategies including 16-bit precision, gradient checkpointing, and attention pruning, we achieved 24% latency reduction and 25% memory decrease while maintaining algorithmic benefits (Table 9).

5.5. Parameter Sensitivity Analysis

We conducted comprehensive sensitivity analysis on two critical hyperparameters: pressure regularization coefficient α reg and temperature parameter T in dynamic reward weighting.

5.5.1. Pressure Regularization Coefficient ( α reg )

The optimal value α reg = 0.5 provides the best balance between queue management and scheduling efficiency (Table 10), achieving peak system utilization (87.5%) and minimum waiting time (2458 s).

5.5.2. Temperature Parameter (T)

T = 1.0 provides optimal balance between objective differentiation and integration, achieving the best overall performance across all metrics (Table 11).

5.6. Comparative Analysis of Attention Mechanisms

We evaluated our SECT module against two mainstream alternatives: CBAM [62] and ECA-Net [63].
SECT consistently outperforms alternatives with 2.3–3.6% higher utilization and superior computational efficiency (Table 12), validating its specialized design for HPC scheduling tasks.

5.7. Comparison with Optimization-Based Methods

We evaluated GTrXL-SPPO against Mixed-Integer Programming (MIP) and Constraint Programming (CP) solvers on two scales (Table 13):
GTrXL-SPPO maintains competitive performance at small scales (within 3% of optimal) and significantly outperforms optimization methods at large scales where traditional solvers encounter timeout issues.

5.8. Computational Complexity Analysis

We conducted comprehensive complexity analysis combining theoretical bounds with empirical measurements across multiple operational scales.
Despite quadratic complexity, GTrXL-SPPO demonstrates superior per-task efficiency at large scales (7.4 ms vs. PPO’s 15.6 ms), validating its practical scalability for production HPC environments (Table 14).

5.9. Practical Impact Analysis

To quantify the economic significance of performance improvements, we analyzed the practical impact across different HPC scales based on our measured 25-s wait time reduction and 5.4% utilization improvement (Table 15).
The analysis reveals significant scale amplification effects, where modest per-task improvements accumulate to substantial annual benefits: 57,031 h of time savings equivalent to 35.7 full-time research teams, with total economic impact of USD 11.6 million annually across different HPC scales.

6. Results and Discussion

In this work, we have successfully achieved our objective of developing a more efficient and robust scheduling framework for HPC environments. The proposed GTrXL-SPPO framework effectively addresses the challenges of high-dimensional decision spaces and complex temporal dependencies in dynamic scheduling environments, as demonstrated by our comprehensive experiments across multiple datasets.
The experimental results confirm that the two core components of GTrXL-SPPO—the GTrXL backbone and the SE module—provide complementary benefits across heterogeneous HPC workloads. These advantages are evident in both scientific (ANL) and industrial (Alibaba) scenarios, demonstrating the framework’s applicability to diverse real-world environments. The GTrXL component enhances long-horizon scheduling by effectively capturing temporal execution signals, while the SE module adaptively reweights features to improve responsiveness under resource variability. Together, these modules improve throughput, reduce turnaround time, and enhance resource utilization, as validated across datasets in Section 5.2. Overall, the results demonstrate the robustness and generalizability of our design.

7. Conclusions and Future Work

7.1. Summary of Findings

The proposed GTrXL-SPPO framework addresses the challenges posed by high-dimensional decision spaces and complex temporal dependencies in dynamic high-performance computing scheduling environments. Comprehensive experiments on multiple datasets demonstrate that the two core components, the GTrXL backbone network and the SECT module, provide complementary benefits under heterogeneous workloads.
The GTrXL component enhances long-horizon scheduling performance by capturing temporal execution signals, while the SECT module improves responsiveness under resource fluctuations. Together, these two innovations achieve quantifiable improvements in throughput, turnaround time, and resource utilization, as validated in Section 5.2.
The framework maintains robust performance across diverse workload characteristics, ranging from scientific simulations to cloud computing tasks.

7.2. Limitations

Although GTrXL-SPPO achieves significant improvements, we acknowledge the following limitations, which also point to directions for future research:

Manual configuration requirements: The reward mechanism relies on manually tuned weights and parameters, which poses portability challenges across different HPC platforms.
Limited system-specific constraints: The framework does not directly support production requirements such as task deadlines, energy budgets, or thermal constraints.
Fairness considerations: Under sustained high load, the non-preemptive design may favor shorter jobs, causing long-running or resource-intensive jobs to starve.

7.3. Future Research Directions

Automatic configuration: Develop meta-learning methods to automatically tune reward weights and adapt quickly to new environments via transfer learning.
Constraint-aware framework: Incorporate multi-objective optimization to handle concurrent constraints and couple the scheduler with real-time monitoring systems.
Fairness mechanisms: Building on Duque et al. [64], explore fairness-aware reward designs and adaptive intervention strategies that incorporate job aging to drive resource reallocation.
These directions aim to enhance the framework’s practical deployability while retaining its attention-based optimization advantages.

7.4. Broader Impact

GTrXL-SPPO represents a significant advance in applying machine learning to infrastructure optimization. The framework’s ability to handle complex temporal dependencies extends beyond high-performance computing scheduling and can be transferred to other resource allocation domains. With the proposed improvements, we expect it to provide more robust and fair scheduling solutions for modern computing infrastructure.

Author Contributions

Conceptualization, X.G. and H.D.; Methodology, X.G. and H.D.; Software, H.D.; Validation, Y.W. and X.Y.; Formal analysis, X.G.; Investigation, H.D. and L.Z.; Resources, X.G.; Data curation, H.D.; Writing—original draft, H.D.; Writing—review and editing, X.G., L.Z. and Z.L.; Visualization, Y.W.; Supervision, X.G.; Project administration, X.G.; Funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China (Grant No. 2023YFB3002300, Project: Cross-domain Application Collaborative Operation Technology), administered by the Ministry of Science and Technology of China. The project is led by the Computer Network Information Center, Chinese Academy of Sciences, in collaboration with Zhengzhou University and the National Supercomputing Center in Zhengzhou.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the National Supercomputing Center in Zhengzhou for their support and collaboration.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HPC: High-Performance Computing
PPO: Proximal Policy Optimization
FCFS: First-Come First-Served
SJF: Shortest Job First
SE: Squeeze-and-Excitation
SLURM: Simple Linux Utility for Resource Management
GTrXL: Gated Transformer-XL
DRL: Deep Reinforcement Learning
CBAM: Convolutional Block Attention Module
ECA: Efficient Channel Attention
MIP: Mixed-Integer Programming
CP: Constraint Programming
MDP: Markov Decision Process
DDP: Distributed Data Parallel

Appendix A. Detailed Experimental Configuration Parameters

To ensure complete reproducibility and consistency with the main text, this appendix provides comprehensive experimental configuration parameters for the GTrXL-SPPO framework. All parameter settings are based on the experimental setup described in the main paper.

Appendix A.1. Model Architecture Configuration

Table A1. Complete GTrXL-SPPO architecture parameter configuration.

GTrXL Transformer core architecture
Number of layers: 12
Number of attention heads: 12
Hidden dimension: 768
Feed-forward network dimension: 3072 (4 × hidden_dim)
Relative position encoding: R_{i,j} = (1/√d) · cos((i − j) · ω), with ω = 1/10,000
Layer normalization: pre-normalization, ε_norm = 1 × 10^−12
Dropout rate: 0.1
Memory length: 512

SECT attention module
Reduction ratio: 16
Squeeze activation function: ReLU
Excitation activation function: Sigmoid
Global pooling method: adaptive average pooling
Dual-path fusion: z_fusion = (mean(x) + max(x)) / 2

Gating mechanism (GRU-style)
Gate type: dual-layer GRU gating (reset + update gates)
Reset gate initialization: Xavier uniform
Update gate initialization: Xavier uniform
Bias initialization: zero
Gate bias b_g: 0.1

Actor–Critic network structure
Actor network layers: [768, 512, 256, action_dimension]
Critic network layers: [768, 512, 256, 1]
Hidden layer activation: ReLU
Output layer activation: Tanh (actor), linear (critic)
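
For readers reimplementing the architecture, the sketch below shows a PyTorch SE-style attention block with the dual-path (mean + max) fusion and reduction ratio 16 from Table A1, together with a GRU-style residual gate initialized with the listed gate bias. Module and variable names are illustrative; this is a sketch consistent with the table, not the authors' exact SECT code.

```python
import torch
import torch.nn as nn

class SEChannelAttention(nn.Module):
    """SE-style squeeze-and-excitation with dual-path (mean + max) fusion,
    mirroring Table A1 (reduction ratio 16, ReLU squeeze, Sigmoid excitation)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels)
        z = 0.5 * (x.mean(dim=1) + x.max(dim=1).values)  # dual-path fusion -> (batch, channels)
        w = self.fc(z).unsqueeze(1)                      # channel weights -> (batch, 1, channels)
        return x * w                                     # recalibrated features

class GRUStyleGate(nn.Module):
    """GTrXL-style gate that merges a sublayer output y into the residual stream x."""
    def __init__(self, dim: int, gate_bias: float = 0.1):
        super().__init__()
        self.wr = nn.Linear(2 * dim, dim)   # reset gate
        self.wz = nn.Linear(2 * dim, dim)   # update gate
        self.wh = nn.Linear(2 * dim, dim)   # candidate state
        for lin in (self.wr, self.wz, self.wh):
            nn.init.xavier_uniform_(lin.weight)
            nn.init.zeros_(lin.bias)
        nn.init.constant_(self.wz.bias, gate_bias)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.wr(torch.cat([x, y], dim=-1)))
        z = torch.sigmoid(self.wz(torch.cat([x, y], dim=-1)))
        h = torch.tanh(self.wh(torch.cat([r * x, y], dim=-1)))
        return (1.0 - z) * x + z * h
```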

Appendix A.2. Training Hyperparameter Configuration

Table A2. Complete training hyperparameter settings.

Training process parameters
Total training episodes: 5000 (per dataset)
Batch size: 128 sequences
Sequence length: dynamically adjusted based on sliding-window size
Experience buffer size: 50,000 transitions
Training frequency: update every 4 episodes
Evaluation frequency: evaluate every 100 training episodes
Warm-up period: 500 training episodes
Random seed range: 42–71 (30 independent runs)

PPO reinforcement learning parameters
Learning rate (actor): 1 × 10^−4
Learning rate (critic): 1 × 10^−3
Initial learning rate decay: 5 × 10^−5 (linear decay)
Learning rate scheduling: cosine annealing with warm restarts
Clipping ratio ε: 0.2
Value function coefficient: 0.5
Entropy coefficient: 0.01
GAE λ: 0.95
Discount factor γ: 0.99
Time-decaying discount: γ_t = 1 − t/T
PPO epochs per update: 4
Gradient clipping max norm: 0.5

Optimizer configuration
Optimizer type: Adam
β₁: 0.9
β₂: 0.999
Weight decay: 1 × 10^−5
ε: 1 × 10^−8
1-bit Adam compression: g̃_t = sign(g_t) · |g_t|

Reward function parameters
Pressure regularization coefficient α_reg: 0.5
Temperature parameter T: 1.0
Adaptive weight calculation: ω_k(T) = exp(ω_k/T) / Σ_{j=1..4} exp(ω_j/T)
Huber loss threshold: standard setting
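
The sketch below assembles the clipped PPO objective with the coefficients from Table A2 (clip ratio 0.2, value coefficient 0.5, entropy coefficient 0.01); it is a generic PPO loss with illustrative tensor names, not the authors' exact training loop.

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    """Clipped PPO surrogate with the coefficients listed in Table A2.
    All arguments are 1-D tensors over a batch of sampled transitions."""
    ratio = torch.exp(new_logp - old_logp)                         # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()            # pessimistic bound
    value_loss = F.mse_loss(values, returns)                       # critic regression
    entropy_bonus = entropy.mean()                                 # exploration term
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```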

Appendix A.3. Computational Environment Configuration

Table A3. Computational environment and software version specifications.

Hardware configuration
GPU cluster: NVIDIA V100 (4 × 32 GB VRAM)
CPU processor: Intel Xeon series
System memory: 128 GB DDR4
Storage system: high-speed parallel storage
Network interconnect: high-speed InfiniBand

Software environment
Operating system: Ubuntu 20.04.6 LTS
Python version: 3.9.7
PyTorch version: 1.12.0
CUDA version: 11.6
cuDNN version: 8.4.1
NumPy version: 1.21.5
OpenAI Gym version: 0.21.0
Simulation platform: CQSim (CQGym)

Distributed training configuration
Multi-GPU training framework: PyTorch Distributed Data Parallel (DDP)
Number of GPUs: 4 × NVIDIA V100
Data sharding strategy: D_k = {(s_t^i, a_t^i, r_t^i) | i mod N = k}
Gradient synchronization: AllReduce operation
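
As an illustration of the modulo sharding rule D_k in Table A3, the following sketch partitions a buffer of transitions across DDP ranks; the transition tuple layout is a simplified assumption.

```python
from typing import List, Tuple

Transition = Tuple[str, int, float]   # simplified (state, action, reward) record

def shard_transitions(buffer: List[Transition], rank: int, world_size: int) -> List[Transition]:
    """Assign transition i to rank k when i mod world_size == k (rule D_k in Table A3)."""
    return [t for i, t in enumerate(buffer) if i % world_size == rank]

# Example with 4 GPUs: rank 1 receives transitions 1, 5, 9, ...
buffer = [(f"s{i}", i % 3, 0.0) for i in range(12)]
print(shard_transitions(buffer, rank=1, world_size=4))
```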

Appendix A.4. Reproducibility Checklist

To ensure complete reproducibility, we provide the following checklist:
  • Code implementation: implemented in PyTorch 1.12.0 on the CQGym simulation platform
  • Dataset access: ANL-Intrepid, Alibaba, SDSC-SP2, and PIK-IPLEX public datasets
  • Randomness control: 30 independent seeds (42–71), deterministic CUDA operations
  • Hardware environment: NVIDIA V100 GPU cluster specifications clearly defined
  • Fixed hyperparameters: all training hyperparameters explicitly specified
  • Evaluation protocol: standardized statistical testing methods and confidence-interval calculation
  • Performance benchmarks: runtime performance and memory usage clearly recorded
Note: All parameter configurations in this appendix are consistent with the main text, ensuring reproducibility of the experimental results; any implementation details or parameter adjustments follow the experimental setup described in the main paper.

References

  1. Ben-Nun, T.; Hoefler, T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv. 2019, 52, 65. [Google Scholar] [CrossRef]
  2. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  3. Fan, Y.; Li, B.; Favorite, D.; Singh, N.; Childers, T.; Rich, P.; Allcock, W.; Papka, M.E.; Lan, Z. DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 4903–4917. [Google Scholar] [CrossRef]
  4. Hu, H.; Li, Z.; Hu, H.; Chen, J.; Ge, J.; Li, C.; Chang, V. Multi-objective scheduling for scientific workflow in multicloud environment. J. Netw. Comput. Appl. 2018, 114, 108–122. [Google Scholar] [CrossRef]
  5. Marcus, R.; Negi, P.; Mao, H.; Zhang, C.; Alizadeh, M.; Kraska, T.; Papaemmanouil, O.; Tatbul, N. Neo: A learned query optimizer. Proc. VLDB Endow. 2019, 12, 1705–1718. [Google Scholar] [CrossRef]
  6. Narayanan, D.; Santhanam, K.; Kazhamiaka, F.; Phanishayee, A.; Zaharia, M. Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), New York, NY, USA, 4–6 November 2020; pp. 481–498. [Google Scholar]
  7. Cheng, M.; Li, J.; Nazarian, S. DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers. In Proceedings of the 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, Republic of Korea, 22–25 January 2018; pp. 129–134. [Google Scholar] [CrossRef]
  8. Barve, M.; Sinha, S.; Hardikar, R.P.; Gunturu, A.; Mallik, W. Workload Characterization in HPC Environment for Auto-scaling of Resources—Preliminary Study. In Proceedings of the 2022 IEEE 19th India Council International Conference (INDICON), Kochi, India, 24–26 November 2022; pp. 1–6. [Google Scholar] [CrossRef]
  9. Jiang, Z.; Luo, C.; Gao, W.; Wang, L.; Zhan, J. HPC AI500 V3.0: A scalable HPC AI benchmarking framework. Benchcouncil Trans. Benchmarks Stand. Eval. 2022, 2, 100083. [Google Scholar] [CrossRef]
  10. Aceituno, J.M.; Guasque, A.; Balbastre, P.; Simó, J.; Crespo, A. Hardware resources contention-aware scheduling of hard real-time multiprocessor systems. J. Syst. Archit. 2021, 118, 102223. [Google Scholar] [CrossRef]
  11. Amaral, M.; Polo, J.; Carrera, D.; Seelam, S.; Steinder, M. Topology-aware GPU scheduling for learning workloads in cloud environments. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York, NY, USA, 12–17 November 2017. SC ’17. [Google Scholar] [CrossRef]
  12. Huang, J.; Xiao, C.; Wu, W. RLSK: A Job Scheduler for Federated Kubernetes Clusters based on Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Cloud Engineering (IC2E), Sydney, NSW, Australia, 21–24 April 2020; pp. 116–123. [Google Scholar] [CrossRef]
  13. Gómez, C.; Martínez, F.; Armejach, A.; Moretó, M.; Mantovani, F.; Casas, M. Design Space Exploration of Next-Generation HPC Machines. In Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 54–65. [Google Scholar] [CrossRef]
  14. Singh, R.M.; Awasthi, L.K.; Sikka, G. Towards Metaheuristic Scheduling Techniques in Cloud and Fog: An Extensive Taxonomic Review. ACM Comput. Surv. 2022, 55, 1–43. [Google Scholar] [CrossRef]
  15. Hwang, E.; Kim, J.S.; Choi, Y.R. Achieving Fairness-Aware Two-Level Scheduling for Heterogeneous Distributed Systems. IEEE Trans. Serv. Comput. 2021, 14, 639–653. [Google Scholar] [CrossRef]
  16. Tirmazi, M.; Barker, A.; Deng, N.; Haque, M.E.; Wilkes, J. Borg: The next generation. In Proceedings of the EuroSys ’20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, 27–30 April 2020. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NIPS’17. pp. 6000–6010. [Google Scholar] [CrossRef]
  18. Ermis, B.; Zappella, G.; Wistuba, M.; Rawal, A.; Archambeau, C. Continual Learning with Transformers for Image Classification. arXiv 2022, arXiv:2206.14085. [Google Scholar] [CrossRef]
  19. Parisotto, E.; Song, H.F.; Rae, J.W.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.M.; Jaderberg, M.; Kaufman, R.L.; Clark, A.; Noury, S.; et al. Stabilizing Transformers for Reinforcement Learning. arXiv 2019, arXiv:1910.06764. [Google Scholar] [CrossRef]
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  21. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  22. Wang, Y.; He, H.; Wen, C.; Tan, X. Truly Proximal Policy Optimization. arXiv 2020, arXiv:1903.07940. [Google Scholar] [CrossRef]
  23. Fan, Y.; Lan, Z.; Rich, P.; Allcock, W.; Papka, M.E. Hybrid Workload Scheduling on HPC Systems. In Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, 30 May–3 June 2022; pp. 470–480. [Google Scholar] [CrossRef]
  24. Frachtenberg, E.; Petrini, F.; Fernandez, J.; Pakin, S. STORM: Scalable Resource Management for Large-Scale Parallel Computers. IEEE Trans. Comput. 2006, 55, 1572–1587. [Google Scholar] [CrossRef]
  25. Lin, Z.; Li, C.; Tian, L.; Zhang, B. A scheduling algorithm based on reinforcement learning for heterogeneous environments. Appl. Soft Comput. 2022, 130, 109707. [Google Scholar] [CrossRef]
  26. Lifka, D. The ANL/IBM SP Scheduling System; Argonne National Lab. (ANL): Argonne, IL, USA, 1995. [CrossRef]
  27. Jette, M.; Yoo, A.; Grondona, M. SLURM: Simple Linux Utility for Resource Management; Lawrence Livermore National Lab. (LLNL): Livermore, CA, USA, 2003.
  28. Schwiegelshohn, U.; Yahyapour, R. Fairness in parallel job scheduling. J. Sched. 2000, 3, 297–320. [Google Scholar] [CrossRef]
  29. Tang, W.; Lan, Z.; Desai, N.; Buettner, D. Fault-aware, utility-based job scheduling on Blue Gene/P systems. In Proceedings of the 2009 IEEE International Conference on Cluster Computing and Workshops, New Orleans, LA, USA, 31 August–4 September 2009; pp. 1–10. [Google Scholar]
  30. Jackson, D.; Snell, Q.; Clement, M. Core Algorithms of the Maui Scheduler. In Job Scheduling Strategies for Parallel Processing; Feitelson, D.G., Rudolph, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 87–102. [Google Scholar]
  31. Mu’alem, A.; Feitelson, D. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans. Parallel Distrib. Syst. 2001, 12, 529–543. [Google Scholar] [CrossRef]
  32. Mao, H.; Alizadeh, M.; Menache, I.; Kandula, S. Resource Management with Deep Reinforcement Learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, Atlanta, GA, USA, 9–10 November 2016; pp. 50–56. [Google Scholar] [CrossRef]
  33. Mao, H.; Schwarzkopf, M.; Venkatakrishnan, S.B.; Meng, Z.; Alizadeh, M. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication, Beijing, China, 19–23 August 2019; pp. 270–288. [Google Scholar] [CrossRef]
  34. Wong, A.K.; Goscinski, A.M. Evaluating the EASY-backfill job scheduling of static workloads on clusters. In Proceedings of the 2007 IEEE International Conference on Cluster Computing, Austin, TX, USA, 17–20 September 2007; pp. 64–73. [Google Scholar] [CrossRef]
  35. Chadha, M.; John, J.; Gerndt, M. Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling. arXiv 2021, arXiv:2009.08289. [Google Scholar] [CrossRef]
  36. Fan, Y.; Lan, Z. DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling. Softw. Impacts 2021, 8, 100077. [Google Scholar] [CrossRef]
  37. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  38. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; ICLR: San Juan, Puerto Rico, 2016. [Google Scholar]
  39. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  40. Espeholt, L.; Marinier, R.; Stanczyk, P.; Wang, K.; Michalski, M. SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference. arXiv 2020, arXiv:1910.06591. [Google Scholar] [CrossRef]
  41. Lei, Y.; Deng, Q.; Liao, M.; Gao, S. Deep reinforcement learning for dynamic distributed job shop scheduling problem with transfers. Expert Syst. Appl. 2024, 251, 123970. [Google Scholar] [CrossRef]
  42. Li, M.; Wang, Z.; Li, K.; Liao, X.; Hone, K.; Liu, X. Task Allocation on Layered Multiagent Systems: When Evolutionary Many-Objective Optimization Meets Deep Q-Learning. IEEE Trans. Evol. Comput. 2021, 25, 842–855. [Google Scholar] [CrossRef]
  43. Nedic, A.; Ozdaglar, A. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Trans. Autom. Control 2009, 54, 48–61. [Google Scholar] [CrossRef]
  44. Gu, Y.; Liu, Z.; Dai, S.; Liu, C.; Wang, Y.; Wang, S. Deep Reinforcement Learning for Job Scheduling and Resource Management in Cloud Computing: An Algorithm-Level Review. arXiv 2025, arXiv:2501.01007. [Google Scholar] [CrossRef]
  45. Liu, H.; Zhou, X.; Gao, K.; Ju, Y. An integrated optimization method to task scheduling and VM placement for green datacenters. Simul. Model. Pract. Theory 2024, 135, 102962. [Google Scholar] [CrossRef]
  46. He, H.; Gu, Y.; Liu, Q.; Wu, H.; Cheng, L. Job Scheduling in Hybrid Clouds with Privacy Constraints: A Deep Reinforcement Learning Approach. Concurr. Comput. Pract. Exp. 2025, 37, 1–17. [Google Scholar] [CrossRef]
  47. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in Transformer. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 15908–15919. [Google Scholar]
  48. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache Spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
  49. Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2978–2988. [Google Scholar] [CrossRef]
  50. Deng, K.; He, Z.; Zhang, H.; Lin, H.; Wang, D. TSNet-SAC: Leveraging Transformers for Efficient Task Scheduling. arXiv 2023, arXiv:2307.07445. [Google Scholar] [CrossRef]
  51. Lee, J.; Kee, S.; Janakiram, M.; Runger, G. Attention-based Reinforcement Learning for Combinatorial Optimization: Application to Job Shop Scheduling Problem. arXiv 2024, arXiv:2401.16580. [Google Scholar] [CrossRef]
  52. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
  53. El-Mahdy, M.; Sakr, N.; Carrasco, R. Improving Mixed-Criticality Scheduling with Reinforcement Learning. arXiv 2025, arXiv:2504.03994. [Google Scholar] [CrossRef]
  54. Li, S.; Zhao, Y.; Varma, R.; Salpekar, O.; Noordhuis, P.; Li, T.; Paszke, A.; Smith, J.; Vaughan, B.; Damania, P.; et al. PyTorch distributed: Experiences on accelerating data parallel training. Proc. VLDB Endow. 2020, 13, 3005–3018. [Google Scholar] [CrossRef]
  55. Tang, H.; Gan, S.; Awan, A.A.; Rajbhandari, S.; Li, C.; Lian, X.; Liu, J.; Zhang, C.; He, Y. 1-bit Adam: Communication Efficient Large-Scale Training with Adam’s Convergence Speed. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
  56. Peng, Y.; Bao, Y.; Chen, Y.; Wu, C.; Guo, C. A Generic Communication Scheduler for Distributed DNN Training Acceleration. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017. [Google Scholar]
  57. Argonne National Laboratory. ANL-Intrepid Workload Archive. 2010. Available online: https://www.cs.huji.ac.il/labs/parallel/workload/l_anl_int/ (accessed on 1 August 2024).
  58. Feitelson, D.G.; Tsafrir, D.; Krakov, D. Experience with using the Parallel Workloads Archive. J. Parallel Distrib. Comput. 2014, 74, 2967–2982. [Google Scholar] [CrossRef]
  59. Maurer, T.; Avanzi, F.; Oroza, C.A.; Glaser, S.D.; Conklin, M.; Bales, R.C. Optimizing spatial distribution of watershed-scale hydrologic models using Gaussian Mixture Models. Environ. Model. Softw. 2021, 142, 105076. [Google Scholar] [CrossRef]
  60. Li, B.; Fan, Y.; Dearing, M.; Lan, Z.; Rich, P.; Allcock, W. MRSch: Multi-Resource Scheduling for HPC. In Proceedings of the 2022 IEEE International Conference on Cluster Computing (CLUSTER), Heidelberg, Germany, 5–8 September 2022; pp. 47–57. [Google Scholar] [CrossRef]
  61. Li, J.; Liu, J.; Fan, Y.; Lan, Z. CQGym: A Reinforcement Learning Environment for Cluster Scheduling. arXiv 2021, arXiv:2111.00000. [Google Scholar]
  62. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  63. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  64. Duque, R.; Arbelaez, A.; Díaz, J.F. Online over time processing of combinatorial problems. Constraints 2018, 23, 310–334. [Google Scholar] [CrossRef]
Figure 1. GTrXL-SPPO architecture and information flow.
Figure 2. GTrXL-SPPO model architecture. The network processes state representations through a channel fusion module and SE layers, with GTrXL blocks for temporal feature updates. The actor and critic branches share this backbone, diverging at the output layers for action logits and value estimations.
Figure 3. Architecture of the dynamic memory mechanism module. This system manages the flow of scheduling-related temporal information and adjusts update operations to reflect changes in environment dynamics.
Figure 4. Sliding window mechanism architecture.
Figure 5. Job resource demand distributions across workloads.
Figure 6. Job arrival and processing time distributions across workloads.
Figure 7. Total reward achieved on different workloads.
Figure 8. Performance metrics of GTrXL-SPPO. (a) System utilization trends showing resource usage efficiency over time on train data and test data. (b) Job waiting time distribution comparing training and testing phases across different scheduling approaches.
Figure 9. Performance comparison of GTrXL-SPPO and PPO at different stages on the simulated dataset.
Figure 10. Testing phase performance comparison of GTrXL-SPPO vs. PPO across real-world workload datasets. Note: This evaluation focuses specifically on testing phase performance to assess model generalization on unseen data.
Figure 11. Statistical stability analysis of GTrXL-SPPO across 10 independent runs with different random seeds. The red dashed line indicates the mean, while the shaded region represents the 95% confidence interval.
Table 1. Comparative analysis of cluster scheduling methods across key features. Methods compared: rule-based scheduling [26,27], priority-based scheduling [28,29], backfilling [30,31], RL-based scheduling [32,33], and GTrXL-SPPO; features compared: handling of long-term effects, automatic policy tuning, dynamic resource priority, and temporal relation handling.
Table 2. Comparison of mainstream DRL algorithms for HPC scheduling. Algorithms compared: DQL [37], PG [38], PPO [21], SAC [39], and A3C [40]; comparison criteria: sample efficiency, discrete-action support, training stability, ease of tuning, parallelization support, and compute overhead (low for DQL; medium for PG, PPO, and A3C; high for SAC).
Table 3. Example job queue.
Job | Req. Nodes | Req. Time | Priority | Waiting Time | Estimated Exec. Time
A | 1 | 30 | 2 | 10 | 30
B | 2 | 20 | 1 | 5 | 20
C | 1 | 40 | 3 | 15 | 40
Table 4. Resource state.
Node | Availability | Remaining Time
1 | Free | 0
2 | Busy | 10
Table 5. Comparison of experimental dataset features.
Dataset | Job Size (Tasks) | Max Nodes | Max Procs | Sys Arch | Load Type
Alibaba | 80.5 k | 300 | 300 | Gen Clust | Online Svc Intense
ANL-Intrepid | 68.9 k | 40,960 | 163,840 | Mesh | Sci Comp Intense
SDSC-SP2 | 73.4 k | 128 | 128 | Homo Clust | Batch Proc Intense
PIK-IPLEX | 74.3 k | 320 | 2560 | Hetero Clust | Mixed Load
Notes: 1. Abbreviations: Sys Arch: system architecture; Load Type: load characteristics; Gen Clust: general cluster; Online Svc Intense: online-service intensive; Sci Comp Intense: scientific-computing intensive; Homo Clust: homogeneous cluster; Batch Proc Intense: batch-processing intensive; Hetero Clust: heterogeneous cluster. 2. Computational scale: data are derived from the configuration files of each dataset, formatted as “number of nodes/number of processors”. 3. ANL-Intrepid uses a mesh architecture with 4 cores per node, suitable for large-scale scientific computing. 4. PIK-IPLEX supports heterogeneous resource scheduling with an average of 8 cores per node. 5. The Alibaba dataset includes configurations with both 300 and 200 nodes in different versions.
Table 6. Parameter comparison of GTrXL-SPPO under multiple workloads.
Dataset | Input | Output | Actor Output Layer Parameters | Transformer Parameters | Other Parameters | Total Trainable Parameters
Alibaba v2017 | [2, 200] | 801 | 461,376 | 67,723,776 | 68,341,480 | 136,526,632
ANL-Intrepid | [2, 150] | 40,961 | 23,593,536 | 67,723,776 | 68,381,640 | 159,698,952
SDSC-SP2 | [2, 50] | 129 | 74,304 | 67,723,776 | 68,340,808 | 136,138,888
PIK-IPLEX | [2, 75] | 321 | 184,896 | 67,723,776 | 68,341,000 | 136,249,672
Note: the output dimension adapts dynamically to node scale (e.g., from 100 nodes to 101).
Table 7. Ablation study results across three datasets.
Model Variant | ANL-Intrepid Util. (%) | SDSC-SP2 Util. (%) | Alibaba Util. (%) | Average Improvement over Baseline (%)
GTrXL-SPPO (Full) | 87.5 | 85.6 | 83.2 | –
GTrXL-PPO (No SE) | 83.1 | 81.2 | 77.8 | −4.7
SPPO (No GTrXL) | 73.1 | 70.2 | 65.8 | −15.7
TrXL-SPPO (No Gating) | 84.0 | 81.7 | 78.2 | −4.2
GTrXL-SPPO (Fixed Reward) | 84.3 | 82.0 | 78.7 | −3.8
Table 8. Statistical performance metrics of GTrXL-SPPO across 10 independent runs.
Metric | Mean | Standard Deviation | 95% Confidence Interval
System Utilization (%) | 84.6 | 2.1 | [82.9, 86.3]
Average Wait Time (s) | 2850 | 145 | [2741, 2959]
Throughput (tasks/hour) | 135.2 | 4.8 | [132.1, 138.3]
Cumulative Reward | 0.648 | 0.028 | [0.632, 0.664]
Table 9. Runtime performance comparison for HPC scheduling algorithms.
Algorithm | Decision Latency (ms) | Memory Usage (MB) | Throughput (Decisions/s) | Scale Limit (Nodes)
PPO | 6.2 ± 0.8 | 578 | 161 | 8000
GTrXL-SPPO | 9.0 ± 0.2 | 1049 | 111 | 5000
GTrXL-SPPO (Optimized) | 6.8 ± 0.3 | 785 | 147 | 8000
Table 10. Effect of α_reg on system performance (ANL-Intrepid dataset).
α_reg Value | System Utilization (%) | Avg. Wait Time (s) | Throughput (Jobs/h) | Queue Length (Avg.)
0.0 | 85.1 | 2532 | 139.7 | 42.3
0.1 | 86.2 | 2495 | 141.1 | 35.8
0.5 | 87.5 | 2458 | 142.5 | 32.1
1.0 | 86.6 | 2483 | 141.8 | 30.5
2.0 | 85.0 | 2545 | 139.7 | 29.7
Table 11. Effect of temperature parameter T on system performance (ANL-Intrepid dataset).
T Value | System Utilization (%) | Avg. Wait Time (s) | Throughput (Jobs/h) | Reward Balance Index
0.1 | 83.1 | 2705 | 135.4 | 0.28
0.5 | 86.2 | 2495 | 141.1 | 0.62
1.0 | 87.5 | 2458 | 142.5 | 0.81
2.0 | 85.8 | 2520 | 140.4 | 0.93
5.0 | 82.3 | 2728 | 134.0 | 0.97
Table 12. Performance comparison of different attention mechanisms.
Attention Mechanism | ANL Util. (%) | Alibaba Util. (%) | Avg. Improvement | Training Time (h) | Efficiency Rank
SECT (Ours) | 87.5 | 83.2 | – | 5.2/7.3 | 1
CBAM | 84.9 | 79.6 | −3.2% | 6.8/9.1 | 3
ECA-Net | 85.2 | 80.3 | −2.8% | 5.5/7.8 | 2
Table 13. Performance comparison with optimization methods.
Method | Utilization (%) | Wait Time (s) | Makespan (h) | Optimality Gap (%) | Status
Small-scale (100 jobs, 128 nodes)
MIP (Gurobi) | 89.8 | 2398 | 18.1 | 0.0 | Optimal
CP (OR-Tools) | 89.7 | 2464 | 18.5 | 0.8 | Near-optimal
GTrXL-SPPO | 87.5 | 2458 | 18.7 | 2.0 | Complete
Large-scale (3000 jobs, 600 nodes)
MIP (Gurobi) | 74.9 | 3596 | 28.2 | 19.5 | Timeout
CP (OR-Tools) | 83.7 | 3262 | 26.3 | 10.1 | Timeout
GTrXL-SPPO | 85.8 | 2508 | 19.2 | 4.0 | Complete
Table 14. Multi-scale computational complexity comparison.
Method | Time Complexity | Time per Task, Small Scale (100) (ms) | Time per Task, Medium Scale (1000) (ms) | Time per Task, Large Scale (5000) (ms)
FCFS | O(n) | 24.5 | 3.6 | 1.5
PPO | O(n × d) | 26.7 | 17.7 | 15.6
GTrXL-SPPO | O(n² × d) | 71.3 | 18.8 | 7.4
Table 15. Annual impact of performance improvements on different HPC scales.
HPC Environment | Time Saved (Hours/Year) | Cost Savings (USD) | ROI Multiple
Small University (100 nodes) | 1267 | 72,857 | 0.73×
Medium Research (400 nodes) | 5069 | 485,714 | 1.21×
Large Enterprise (1000 nodes) | 12,674 | 1,457,141 | 1.46×
National Supercomputer (5000 nodes) | 38,021 | 9,612,883 | 1.92×
Total Impact | 57,031 | 11,628,595 | 1.33×