1. Introduction
A High-Performance Computing Cluster (HPCC) consists of numerous computing nodes, networking components, and storage systems, forming a distributed computing environment with large-scale data processing capabilities. With the continued expansion of HPC applications in fields such as climate modeling, genome-wide association studies, and large language model training [1], the efficient scheduling of cluster resources has become a critical bottleneck that limits computational performance.
Early research on HPC task prediction primarily relied on statistical models and shallow machine learning methods, aiming to capture the temporal patterns of resource demands from historical data. For instance, AutoRegressive Integrated Moving Average (ARIMA) models have been employed for time-series fitting, where differencing is used to remove non-stationarity and lag operators are applied to extract linear dependencies. However, these models suffer from high mean squared error when predicting nonlinear workload fluctuations, such as sudden spikes in resource demand caused by bursty computing tasks, rendering them insufficient for practical scheduling needs. In recent years, breakthroughs in deep-learning-based time-series forecasting have offered new perspectives for HPC task prediction [2]. Models based on Gated Recurrent Units (GRU) and Temporal Convolutional Networks (TCN) have gained prominence, yet they primarily model intra-sequence temporal dependencies and lack the capability to explicitly capture inter-task relationships, such as data dependencies and resource contention across tasks.
Despite these advancements, several key challenges remain in the current HPC task prediction landscape:
Modeling of multi-scale temporal features: HPC workloads typically exhibit both long-term steady-state patterns and short-term abrupt changes [3]. Single-model architectures often struggle to simultaneously capture long-term trends and short-term variations, resulting in inefficient disentanglement of temporal features across different time scales.
Insufficient modeling of multidimensional resource correlations: Most existing methods focus on univariate predictions, whereas practical scheduling requires the joint optimization of multiple resource metrics such as CPU, memory, and storage [4]. The interdependencies among these dimensions are often underexplored.
Limited capacity for dynamic feature selection: Traditional methods lack adaptive sensitivity to critical time periods, such as task submission peaks or node failure alarms, which significantly limits system robustness [5].
To address the aforementioned challenges, this paper proposes a predictive model that integrates multimodal temporal features with a hierarchical attention mechanism. By jointly incorporating Informer, Long Short-Term Memory (LSTM), and Graph Neural Networks (GNN), the model enables multidimensional and dynamic modeling of task loads in High-Performance Computing (HPC) clusters.
Informer, LSTM, and GNN have each demonstrated notable success across various temporal and structural modeling tasks. Informer has been applied in long-sequence forecasting tasks such as traffic flow and energy demand due to its ability to handle extended temporal dependencies efficiently. LSTM has been widely adopted in real-time system modeling and anomaly detection because of its sensitivity to short-term variations. GNNs, by contrast, have been utilized in domains involving topological structures—such as network traffic analysis and industrial IoT systems—where relational dependencies significantly impact system behavior. These successful precedents support the transferability of the proposed hybrid architecture to HPC environments, which exhibit both temporal complexity and structural interdependence.
The main contributions are as follows:
- 1. Three-mode parallel encoder. To resolve the coupling of long-period regularity and short-term dynamics, a three-branch parallel encoder is constructed. Informer backbone network: ProbSparse Attention filters key time points by entropy value to model long-period resource fluctuations efficiently (e.g., periodic experimental tasks), and its Attention Distilling module compresses the sequence to 1/4 of its original length to reduce computational complexity. LSTM dynamic modeling branch: hour/minute-level bursty task features (e.g., GPU-intensive AI training tasks) are captured through the input-gate and forget-gate mechanisms, and a sliding time window (Window Size = 24) is designed to extract local temporal gradient features. GNN Topology Awareness Module: a cluster node graph is constructed in which inter-node communication latency (<5 ms for a strong connection) and task dependencies (DAG task flow) define the edges; a Graph Attention Network (GAT) aggregates the resource competition states of neighboring nodes and explicitly models the temporal and spatial propagation effects of task-node mapping.
- 2. Hierarchical Attention Enhancement Mechanisms.
To enhance the synergy of multidimensional resource prediction, a dual attention interaction layer is constructed. Cross-scale multi-head self-attention: eight parallel attention heads are deployed at the Informer and LSTM outputs, each focusing on features at a different temporal resolution (e.g., heads 1–4 handle hourly CPU/memory fluctuations, and heads 5–8 handle minute-level I/O throughput), with cross-scale feature fusion realized through a learnable weight matrix. Goal-directed cross-attention: a multitask attention adapter is designed in the decoding phase to reconstruct Query-Key-Value along the resource dimensions:
Query: multidimensional prediction target embedding vector (CPU_usage, Mem_alloc, Storage, Time_cost).
Key-Value: joint time-space characterization of historical coding states.
The gradient conflict problem in multitask prediction is mitigated by calculating the cosine similarity between the target vector and the historical state and dynamically assigning feature weights.
Multitask learning (MTL) has emerged as a powerful paradigm for jointly modeling related predictive tasks, improving generalization by leveraging shared information across outputs. In the context of HPC resource forecasting, however, MTL approaches remain underutilized. The majority of prior work focuses on the prediction of a single resource dimension—such as CPU usage or job duration—without accounting for the interdependence among different types of resources (e.g., memory, I/O, and storage). Such a limitation hinders the ability to comprehensively anticipate system workload and performance bottlenecks. The proposed model addresses this gap by employing a multitask learning framework that jointly predicts four key resource indicators.
2. Methods
2.1. Informer
Transformer architectures have demonstrated significant success in a wide range of sequential modeling tasks, particularly due to their self-attention mechanisms and ability to model long-range dependencies. Despite their effectiveness in domains such as natural language processing and time-series forecasting, the use of Transformer-based models for HPC workload prediction has not been extensively investigated. Existing approaches in this area predominantly rely on traditional recurrent neural networks (e.g., LSTM, GRU) or classical statistical methods, which often struggle with long-term dependencies and multivariate interactions. This study introduces the Informer architecture as a long-sequence encoder within the context of HPC resource prediction. By leveraging its efficient attention mechanism and scalability, the proposed model enhances the ability to capture long-term temporal patterns across multiple resource metrics.
The Informer model is a Transformer variant designed for Long-Sequence Time-Series Forecasting (LSTF). Although the traditional Transformer architecture is strong at processing sequence data, when applied directly to long-sequence time-series prediction, its self-attention mechanism must compute the correlation between all positions, which leads to high computational complexity, a drastic increase in resource consumption, and low efficiency in training and inference; at the same time, its step-by-step decoding is prone to error accumulation, which harms the consistency of long-range prediction [6]. To address these problems, Informer introduces the ProbSparse self-attention mechanism: by evaluating the attention-score distribution of each key vector with respect to the query vectors, it adaptively selects the key vectors most relevant to the current query to participate in the computation, which significantly reduces the computational cost and improves the efficiency of long-sequence processing [7]. During forward propagation, the ProbSparse self-attention mechanism also supports parallel computation of the outputs of multiple time steps, further improving inference efficiency and enabling fast prediction on long-sequence data. In addition, Informer introduces a self-attention distillation technique in the encoding process, gradually compressing the sequence length while retaining key features and reducing memory occupation.
The Informer architecture is shown in Figure 1 and is divided into two parts: encoder and decoder. The encoder encodes the input long-time-series data, gradually extracts the long-term dependency features in the input sequence, and compresses the sequence length through a convolutional downsampling operation to reduce computational complexity and memory consumption. The decoder directly generates the complete predicted sequence from the start tokens and the features extracted by the encoder, avoiding error accumulation and significantly improving the efficiency of long-sequence prediction.
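To make the ProbSparse idea concrete, the following minimal PyTorch sketch (single-head and simplified, not the official Informer implementation) scores each query by the gap between its maximum and mean attention score and computes full attention only for the top-u queries, while the remaining queries fall back to the mean of the values; all tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def probsparse_attention(Q, K, V, u):
    """Simplified ProbSparse self-attention (single head).

    Q, K, V: (batch, seq_len, d) tensors; u: number of "active" queries kept.
    Queries are ranked by a sparsity measurement M(q) = max(score) - mean(score);
    only the top-u queries attend to all keys, the rest fall back to the mean of V.
    """
    B, L, D = Q.shape
    scores = torch.matmul(Q, K.transpose(-2, -1)) / D ** 0.5      # (B, L, L)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)    # M(q) per query
    top_idx = sparsity.topk(u, dim=-1).indices                    # (B, u)

    # Default output: mean of values (cheap approximation for "lazy" queries).
    out = V.mean(dim=1, keepdim=True).expand(B, L, D).clone()

    # Full attention only for the selected active queries.
    q_active = torch.gather(Q, 1, top_idx.unsqueeze(-1).expand(-1, -1, D))  # (B, u, D)
    attn = F.softmax(torch.matmul(q_active, K.transpose(-2, -1)) / D ** 0.5, dim=-1)
    out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, D), torch.matmul(attn, V))
    return out

# Toy usage: 96 historical steps, 32-dim features, keep 24 active queries.
x = torch.randn(8, 96, 32)
y = probsparse_attention(x, x, x, u=24)
print(y.shape)  # torch.Size([8, 96, 32])
```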
2.2. LSTM
A Recurrent Neural Network (RNN) can effectively retain historical information through its recurrent connections, thereby directly capturing temporal dependencies in sequence data, and is especially suited to memorizing short-term dependencies. However, due to their inherent structure, RNNs are prone to vanishing or exploding gradients when processing long sequences, making it difficult to capture long-term dependencies. To address this challenge, an improved architecture, the Long Short-Term Memory network (LSTM), was introduced. By using a gating mechanism, the LSTM effectively regulates the information flow, which significantly improves the ability to model complex sequences, especially long-term dependencies [8]. As an improved recurrent architecture, the LSTM mitigates the gradient vanishing problem of traditional RNNs through its gating system. Its core mechanism consists of a cell state and a triple-gate structure: the forget gate uses a Sigmoid function to quantify the retention ratio of historical information, enabling dynamic screening of redundant features [9]; the input gate precisely regulates the writing intensity of new features through the joint computation of the gating signal and the candidate memory; and the output gate determines the dimensional weights of the output based on the current state and gating coefficients. The cell state transmits memory across time steps through a linear recurrent pathway, which fundamentally circumvents the gradient decay caused by nonlinear activation, while the triple gating constructs a dynamic balance between information storage and forgetting through a parameterized regulatory mechanism [10]. This synergy between the cell state and the gating system enables the LSTM to stably capture long-range dependencies over hundreds of time steps, significantly improving the robustness of temporal modeling.
The architecture of the LSTM is shown in Figure 2. The computational flow at time step $t$ can be decomposed into the following steps:
Forget gate
Combining the input $x_t$ at the current time step with the previous hidden state $h_{t-1}$, the forgetting coefficient is generated by a Sigmoid function:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$
where $W_f$ is the weight matrix, $b_f$ is the bias term, and $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state with the current input [11].
Input gate and candidate memory state
The input gate updates its activation value with independent weights:
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$
The candidate memory cell state $\tilde{C}_t$ indicates the potential information that can be added to the memory cell under the current input and is calculated as:
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$
where $\tanh$ is the hyperbolic tangent activation function [12].
Memory cell state update
The outputs of the forget gate and the input gate are combined to dynamically update the memory cell state $C_t$:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Output gate and hidden state generation
Based on the current input and the memory cell state, the output gate determines the hidden state $h_t$ at the current time step. The output gate value is
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$
and the output state is calculated as:
$$h_t = o_t \odot \tanh\left(C_t\right)$$
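As a concrete illustration of the gate equations above, the following minimal PyTorch sketch performs one LSTM step exactly as written, using the concatenated $[h_{t-1}, x_t]$ input, Sigmoid gates, and a tanh candidate memory; the dimensions and random parameters are purely illustrative.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations above.

    x_t: (batch, input_dim), h_prev/c_prev: (batch, hidden_dim).
    Each W_* has shape (hidden_dim, hidden_dim + input_dim).
    """
    z = torch.cat([h_prev, x_t], dim=1)                 # [h_{t-1}, x_t]
    f_t = torch.sigmoid(z @ W_f.T + b_f)                # forget gate
    i_t = torch.sigmoid(z @ W_i.T + b_i)                # input gate
    c_tilde = torch.tanh(z @ W_C.T + b_C)               # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                  # memory cell update
    o_t = torch.sigmoid(z @ W_o.T + b_o)                # output gate
    h_t = o_t * torch.tanh(c_t)                         # hidden state
    return h_t, c_t

# Toy usage with illustrative sizes (input_dim=8, hidden_dim=16).
B, D_in, D_h = 4, 8, 16
params = [torch.randn(D_h, D_h + D_in) * 0.1 if k % 2 == 0 else torch.zeros(D_h)
          for k in range(8)]
h, c = torch.zeros(B, D_h), torch.zeros(B, D_h)
h, c = lstm_step(torch.randn(B, D_in), h, c, *params)
print(h.shape, c.shape)  # torch.Size([4, 16]) torch.Size([4, 16])
```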
2.3. GNN
The Graph Neural Network (GNN), as a deep-learning framework for processing graph-structured data, generates low-dimensional representation vectors by fusing node features and topological relations through an iterative mechanism of message passing, aggregation, and updating. Its core process consists of three stages: nodes generate message vectors based on neighborhood features and edge attributes, local structure representations are built through aggregation functions (mean, weighted sum, or attention), and node states are finally updated by combining their own features [13]. Among GNN variants, the Graph Attention Network (GAT) breaks the static-weight limitation of traditional graph convolution by dynamically quantifying the influence of neighboring nodes on the target node through a learnable attention function. The method maps the target node and neighbor features to a competition intensity scalar, aggregates the neighborhood resource states weighted by Softmax-normalized coefficients, and generates dynamic representations through a nonlinear transformation. The attention mechanism can adaptively focus on key nodes with intense resource contention, effectively suppress noise interference, and provide a data-driven solution for modeling the implicit dependencies of heterogeneous graphs.
Graph Neural Networks (GNNs) have been widely adopted in various domains such as traffic forecasting, molecular interaction modeling, and recommender systems due to their capacity to capture complex structural dependencies. However, their application within High-Performance Computing (HPC) environments remains relatively limited. This is largely attributed to the fact that HPC workloads are typically represented as tabular or temporal data, making it nontrivial to construct meaningful graph structures. Nonetheless, HPC tasks often exhibit implicit interdependencies—such as shared resource usage, task co-location, or communication patterns—which can be naturally represented using graph-based formulations. Recent efforts have begun to explore the utility of GNNs in modeling such relationships, particularly in job scheduling and resource allocation. In this work, GNN components are employed to capture structural relationships among tasks and computational resources, complementing temporal modeling techniques and enabling a more holistic understanding of HPC workload dynamics.
In this study, cluster nodes are modeled as vertices in a graph whose feature vectors are composed of real-time resource metrics (e.g., CPU/memory utilization, task queue lengths); edges are constructed on the basis of physical connectivity or resource dependencies, and edge attributes incorporate competition parameters such as communication latency and bandwidth occupancy. GAT computes unnormalized competition coefficients through a shared attention function and generates dynamic weights via Softmax to realize the weighted aggregation of neighbor resource states. The updated node representations support multiple applications: task migration decisions based on competition hotspot identification, temporal detection of abnormal load fluctuations, and resource reallocation strategies based on attention weights. The method captures competition patterns adaptively through end-to-end learning, and the interpretability of its weights provides a theoretical basis for the visual analysis of cluster resource competition relationships, significantly outperforming traditional rule-driven static scheduling models.
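The following minimal sketch implements a single-head graph-attention aggregation in plain PyTorch along the lines described above (shared attention function, Softmax-normalized neighbor weights over a dense adjacency mask); the node features, ring topology, and dimensions are illustrative assumptions rather than the paper's actual cluster graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency mask."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared linear transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # shared attention function

    def forward(self, h, adj):
        # h: (N, in_dim) node features, adj: (N, N) 0/1 adjacency (with self-loops).
        z = self.W(h)                                      # (N, out_dim)
        N = z.size(0)
        zi = z.unsqueeze(1).expand(N, N, -1)               # target node features
        zj = z.unsqueeze(0).expand(N, N, -1)               # neighbor node features
        e = F.leaky_relu(self.a(torch.cat([zi, zj], dim=-1)).squeeze(-1))  # (N, N)
        e = e.masked_fill(adj == 0, float("-inf"))         # keep only real edges
        alpha = torch.softmax(e, dim=-1)                   # competition weights
        return F.elu(alpha @ z)                            # aggregated node states

# Toy usage: 5 cluster nodes, 6 resource features each, ring topology + self-loops.
h = torch.randn(5, 6)
adj = torch.eye(5) + torch.roll(torch.eye(5), 1, dims=1) + torch.roll(torch.eye(5), -1, dims=1)
out = SimpleGATLayer(6, 8)(h, (adj > 0).float())
print(out.shape)  # torch.Size([5, 8])
```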
2.4. Attention
2.4.1. Multi-Head Self-Attention
The multi-head self-attention mechanism overcomes the limitations of the traditional single-head self-attention mechanism by extracting features from different subspaces and modeling global dependencies through multiple attention heads in parallel [14]. In order to effectively capture features at different time scales and enhance the prediction performance of the model, this study deploys eight parallel self-attention heads at the outputs of Informer and LSTM for enhanced global context modeling of sequence features. The specific steps are as follows:
Query-Key-Value mapping: features of each time step are mapped into eight different subspaces, each corresponding to an attention head. Long-period resource fluctuation features captured by Informer are mapped into the first four attention heads, which are used to deal with situations such as hourly CPU and memory fluctuations, and short-period dynamic features extracted by LSTM are mapped into the last four attention heads (heads 5–8) for handling situations such as I/O throughput at the minute level.
Attention weight generation: in each subspace, the dot product similarity between Query and Key is calculated and then normalized by the Softmax function to generate the attention weight distribution;
Context-weighted fusion: In each subspace, the value is weighted and summed according to the attention weights to form the feature representation of each subspace, and the feature representations of these subspaces are spliced and fused to generate the final contextual features.
The specific parameters are listed in Table 1.
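A minimal sketch of the eight-head self-attention fusion is shown below, assuming the Informer and LSTM outputs are simply concatenated along the feature axis before torch.nn.MultiheadAttention is applied; the dimensions and the concatenation scheme are illustrative assumptions, not the exact configuration of Table 1.

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len, batch = 64, 8, 96, 4

# Hypothetical encoder outputs: long-period features from Informer and
# short-period features from the LSTM branch, each (batch, seq_len, d_model//2).
informer_out = torch.randn(batch, seq_len, d_model // 2)
lstm_out = torch.randn(batch, seq_len, d_model // 2)

# Concatenate along the feature axis so different heads can attend to
# different sub-features (long-period vs. short-period), then apply
# 8-head self-attention with a learnable output fusion.
fused_in = torch.cat([informer_out, lstm_out], dim=-1)        # (B, T, d_model)
mhsa = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
context, attn_weights = mhsa(fused_in, fused_in, fused_in)

print(context.shape)       # torch.Size([4, 96, 64])
print(attn_weights.shape)  # torch.Size([4, 96, 96]), averaged over heads
```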
2.4.2. Cross-Attention
Cross-attention is a dynamic feature alignment mechanism that realizes goal-oriented feature selection and fusion by interacting features from different modalities or sequences and computing cross-modality/cross-sequence attention weights. The core idea is to take the target features as the Query and the context features as the Key-Value, filtering the key information related to the target through similarity calculation [15]. In this study, the cross-attention mechanism is deployed between the LSTM encoding layer and the fully connected prediction layer to realize goal-directed feature selection for the multidimensional resource prediction task. The specific implementation is as follows:
Query-Key-Value mapping: each prediction target (CPU_usage, Mem_alloc, Storage, Time_cost) is mapped as an independent embedding vector, which is obtained by learning through the embedding layer, characterizing the requirements of different resource prediction tasks. For example, the Query vector for CPU_usage may carry features related to computationally intensive tasks, while the Query vector for Mem_alloc may focus on memory allocation patterns.
Attention weight generation: For each Query vector of the prediction target, the cosine similarity between it and the Key vector of each time step in the history encoding state is calculated. The higher the similarity, the stronger the relevance of the historical feature to the current prediction target. The cosine similarity scores are normalized by a Softmax function to generate an attention weight distribution that allows the model to focus on the importance of different time steps and features.
Context-weighted fusion: Based on the generated attentional weights, the values in the historical encoding state are weighted and summed to generate a contextual feature representation of this prediction target. The contextual features of different prediction targets are fused and linearly transformed by the learnable weight matrix to generate the final contextual features.
The specific parameters are listed in Table 2.
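The sketch below illustrates goal-directed cross-attention with one learnable embedding per prediction target used as the Query and the historical encoding as Key-Value; for simplicity it uses standard scaled dot-product attention via torch.nn.MultiheadAttention instead of the cosine-similarity weighting described above, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model, batch, seq_len, n_targets = 64, 4, 96, 4  # CPU, Mem, Storage, Time

# Learnable embedding per prediction target, used as the Query.
target_emb = nn.Embedding(n_targets, d_model)
queries = target_emb(torch.arange(n_targets)).unsqueeze(0).expand(batch, -1, -1)

# Historical encoded states (Key/Value), e.g., the LSTM encoder output.
history = torch.randn(batch, seq_len, d_model)

cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
context, weights = cross_attn(query=queries, key=history, value=history)

# context[:, i, :] is the goal-directed summary for target i, which is fed to
# the fully connected prediction head for that resource dimension.
head = nn.Linear(d_model, 1)
preds = head(context).squeeze(-1)   # (batch, n_targets)
print(preds.shape)                  # torch.Size([4, 4])
```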
3. Methodology
3.1. Model Architecture
In this study, we propose a hierarchical temporal prediction model that integrates Informer, LSTM, and GNN and combines multi-head self-attention and cross-attention mechanisms to meet the multidimensional task prediction needs of high-performance computing clusters, realizing collaborative modeling of long-period regularity and short-term dynamics. Specifically, a hybrid Informer–LSTM architecture addresses the coupling of long-range dependency and complex temporal behavior: Informer captures the long-range regularity of task load (e.g., weekly-granularity resource fluctuations) through the probabilistic sparse self-attention mechanism, with its distillation operation effectively reducing the complexity of long-sequence computation, while the LSTM branch models short-term dynamics (e.g., sudden task submissions). The GNN module explicitly models topological dependencies among cluster nodes (e.g., task-node mapping relationships) and captures the spatiotemporal propagation characteristics of resource competition by aggregating neighboring node states through graph convolution. For multidimensional resource correlation modeling, a multi-head self-attention mechanism is introduced after the Informer and LSTM layers, achieving cross-scale feature fusion by letting parallel attention heads focus on features at different time scales (e.g., hourly CPU fluctuations versus minute-level memory variations); a cross-attention mechanism is added between the LSTM output and the fully connected layer, taking the multidimensional resource targets (CPU/memory/storage/time) as query vectors and dynamically computing their attention weights with respect to historical encoded features to realize goal-oriented feature selection.
The overall structure of the model is shown in Figure 3, and the core design is divided into the following modules:
- (1)
Informer backbone network: modeling of long-period resource fluctuations
For cyclical task patterns at the day/week scale (e.g., periodic simulation experiments, batch data-processing tasks), the Informer backbone network is used. Specifically:
Sparse attention distillation: a self-attention distillation module is inserted between the stacked encoder layers, and the sequence length is gradually compressed by one-dimensional convolution with max pooling at a compression ratio of 1:4, shrinking the 7-day history sequence to one quarter of its original number of time steps.
Periodic pattern capture: three Informer encoder layers are stacked, each with a fixed hidden dimension, and their output forms the long-period feature tensor.
- (2)
LSTM dynamic modeling branch: bursty task feature extraction
A bidirectional LSTM dynamic encoding branch is constructed for hour/minute-level bursty tasks (e.g., GPU-intensive AI training task preemption, ad hoc data analysis requests). Gated gradient sensing: local temporal gradient features are reinforced by the input gate, and low-frequency noise interference is suppressed by the forget gate. Sliding-window temporal modeling: a sliding time window (Window Size = 24) is applied, and first-order difference features within the window are computed to capture instantaneous changes in the task queue.
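A minimal sketch of the sliding-window construction with first-order difference features (Window Size = 24) is given below; the sequence length and number of monitoring metrics are illustrative.

```python
import torch

def sliding_window_diff(series, window_size=24):
    """Build sliding windows and first-order difference features.

    series: (T, F) monitoring sequence; returns
    windows       (T - window_size + 1, window_size, F) and
    diff_features (T - window_size + 1, window_size - 1, F).
    """
    windows = series.unfold(0, window_size, 1)          # (N, F, window_size)
    windows = windows.permute(0, 2, 1)                  # (N, window_size, F)
    diff_features = windows[:, 1:, :] - windows[:, :-1, :]
    return windows, diff_features

# Toy usage: 288 samples (one day at 5-min sampling), 3 metrics.
series = torch.randn(288, 3)
w, d = sliding_window_diff(series, window_size=24)
print(w.shape, d.shape)  # torch.Size([265, 24, 3]) torch.Size([265, 23, 3])
```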
- (3)
GNN Topology Awareness Module: Modeling Cluster Resource Competition
The cluster node relationship graph $G = (V, E)$ is constructed; three example node relationship graphs are shown in Figure 4.
The specific structure is as follows:
Vertices: each vertex $v_i \in V$ represents a physical computing node in the cluster. Its static attributes include the number of CPU cores, memory capacity, and local storage space, while its dynamic attributes include real-time CPU utilization, memory occupancy, and similar metrics. These attributes are normalized and concatenated into a feature vector that serves as the initial node embedding of the GNN. Edges: edges fall into two categories. The first is based on the physical network connection: if the communication latency between two nodes $v_i$ and $v_j$ is below 5 ms, an undirected edge $(v_i, v_j)$ is established with a weight that quantifies communication efficiency. The second is based on the logical dependence of tasks: if the output of a task running on node $v_i$ is the input of a task running on node $v_j$ (following the directed-acyclic-graph task flow), a directed edge $(v_i, v_j)$ is established with a weight that reflects the strength of the dependency.
Dynamic graph structure update mechanism: the physical connection graph is refreshed every 30 min using network latency measurements to keep the topology consistent with the actual cluster state, while the task-dependency graph is constructed dynamically from the real-time task scheduler's DAG description file, with edge connections updated as soon as a task is submitted; a construction sketch is given after this module.
Graph Attention Network Aggregation: Using a multi-head attention mechanism, each node generates topology-aware features by aggregating the resource states (e.g., CPU contention, memory bandwidth pressure) of its neighboring nodes. Specifically, the physical connectivity edge focuses on hardware resource competition among nodes, while the task-dependent edge models task cascading effects.
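The sketch below shows one plausible way to build the two edge sets described above, thresholding a measured latency matrix at 5 ms for physical edges and reading task-dependency pairs from the scheduler DAG; the helper name, COO edge format, and toy latency values are assumptions for illustration.

```python
import torch

def build_cluster_edges(latency_ms, dag_pairs, threshold_ms=5.0):
    """Construct the two edge sets of the cluster graph.

    latency_ms : (N, N) measured inter-node communication latency in ms.
    dag_pairs  : list of (src_node, dst_node) pairs derived from the task DAG,
                 i.e., the output of a task on src_node feeds a task on dst_node.
    Returns (physical_edge_index, task_edge_index) in COO format (2, E).
    """
    N = latency_ms.size(0)
    # Undirected physical edges: latency below the strong-connection threshold.
    mask = (latency_ms < threshold_ms) & ~torch.eye(N, dtype=torch.bool)
    physical_edge_index = mask.nonzero().t()                         # (2, E_phys)

    # Directed task-dependency edges taken from the scheduler's DAG description.
    task_edge_index = torch.tensor(dag_pairs, dtype=torch.long).t()  # (2, E_task)
    return physical_edge_index, task_edge_index

# Toy usage: 4 nodes, symmetric latency matrix, two task dependencies.
lat = torch.tensor([[0., 3., 8., 2.],
                    [3., 0., 4., 9.],
                    [8., 4., 0., 6.],
                    [2., 9., 6., 0.]])
phys, task = build_cluster_edges(lat, dag_pairs=[(0, 2), (1, 3)])
print(phys.shape, task.shape)  # torch.Size([2, 6]) torch.Size([2, 2])
```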
- (4)
Multimodal feature alignment:
Long-period features, dynamic time-series features, and topological features are cross-modally aligned: the topological features are replicated and expanded along the time axis, and max pooling over the node dimension is applied so that they can be fused with the long-period and dynamic time-series features.
After processing by Informer, LSTM, GNN, and the two attention mechanisms, the model obtains a rich feature representation. At the end of the model, a fully connected layer maps these high-dimensional features to the outputs: the most predictive features are extracted from the model's complex feature space, and the final predicted values, such as task execution time or the task's resource requirements, are produced.
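The following sketch shows one plausible realization of the alignment-and-fusion head described above: the topological features are max-pooled over the node dimension, replicated along the time axis, concatenated with the two temporal branches, and passed to a fully connected head that outputs the four resource targets; all shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

batch, seq_len, n_nodes = 4, 96, 16
d_long, d_short, d_topo, n_targets = 64, 32, 32, 4

# Hypothetical branch outputs (shapes are illustrative assumptions).
long_feat  = torch.randn(batch, seq_len, d_long)             # Informer branch
short_feat = torch.randn(batch, seq_len, d_short)            # LSTM branch
topo_feat  = torch.randn(batch, n_nodes, d_topo)              # GNN branch

# Align modalities: max-pool the node dimension of the topological features,
# then replicate the pooled vector along the time axis before concatenation.
topo_pooled = topo_feat.max(dim=1).values                     # (B, d_topo)
topo_time = topo_pooled.unsqueeze(1).expand(-1, seq_len, -1)  # (B, T, d_topo)
fused = torch.cat([long_feat, short_feat, topo_time], dim=-1)

# Fully connected prediction head: summarize over time, then map the fused
# representation to the four resource targets (time, CPU, memory, storage).
head = nn.Sequential(
    nn.Linear(d_long + d_short + d_topo, 64),
    nn.ReLU(),
    nn.Linear(64, n_targets),
)
preds = head(fused.mean(dim=1))   # (B, n_targets)
print(preds.shape)                # torch.Size([4, 4])
```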
3.2. Model Training and Evaluation
3.2.1. Data Pre-Processing
The dataset used in this study was collected from a self-built HPC cluster over a continuous 12-month period from January to December 2023, with a sampling interval of 5 min. It contains rich multivariate time-series data reflecting the dynamic behavior of real-world workloads. The collected features include CPU utilization, memory usage, storage I/O throughput, network traffic, task execution time, task queue length, node temperature, and system-level failure or restart events. The workload is diverse, comprising compute-intensive applications (approximately 40%), data-intensive analytics (30%), machine learning training jobs (20%), and mixed-stage pipeline tasks (10%). This dataset captures complex temporal patterns and variability in resource consumption, making it well suited for evaluating multitask, multi-metric prediction models under realistic HPC operational conditions. The monitoring data from this self-built heterogeneous computing cluster are prepared through systematic data cleansing, time-aware partitioning, and dynamic training monitoring to ensure the effectiveness and generalization performance of model training. In the data cleaning stage, duplicate records across all fields (e.g., redundant logs caused by collection interruptions) are first removed and the data format is standardized. Outliers are detected using the box-plot method, which computes the lower quartile Q1 (the 25th percentile of the sorted data), the upper quartile Q3 (the 75th percentile of the sorted data), and the interquartile range IQR [16]. The IQR, defined as the difference between Q3 and Q1, covers the middle 50% of the data, and a value x is considered an outlier when it lies outside the whisker range, i.e.,
$$x < Q_1 - 1.5\,\mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5\,\mathrm{IQR}$$
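A minimal pandas sketch of the box-plot (IQR) outlier rule described above is given below, using the conventional 1.5 × IQR whiskers; the column names are hypothetical.

```python
import pandas as pd

def flag_iqr_outliers(df: pd.DataFrame, factor: float = 1.5) -> pd.DataFrame:
    """Flag outliers per column using the box-plot (IQR) rule.

    A value is flagged when it falls below Q1 - factor*IQR or above
    Q3 + factor*IQR, where IQR = Q3 - Q1.
    """
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return (df < lower) | (df > upper)    # boolean mask, same shape as df

# Toy usage with hypothetical monitoring columns.
df = pd.DataFrame({"cpu_util": [20, 22, 21, 95, 23],
                   "mem_used": [4.1, 4.0, 4.2, 4.1, 30.0]})
print(flag_iqr_outliers(df))
```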
For the outliers in the data, this study adopts a source-tracing and resampling interpolation strategy, which ensures data quality by locating the outlier data sources and replacing them with re-collected valid observations. In the data partitioning stage, the dataset is divided into training, validation, and test sets at a ratio of 7:2:1 while always preserving temporal continuity. The model uses an Informer–LSTM hybrid architecture as the temporal encoder. During forward propagation, the input sequence first passes through the Informer module, in which the probabilistic sparse self-attention layer captures long-period dependency features through the entropy filtering mechanism, while the feature distillation layer reduces computational complexity through hierarchical feature filtering. The LSTM branch running in parallel then extracts short-term dynamic features through gated memory units. The multi-head self-attention fusion layer adaptively interacts with the cross-scale temporal features output from the two branches to realize collaborative modeling of hourly CPU trends and minute-level memory fluctuations.
The fused feature representations are fed to the cross-attention decision layer, where the historical features are dynamically weighted using the target resource vector as the query vector, and the predicted values are finally output through the fully connected layer. The model uses the Adam optimizer for parameter updates, the normalized sliding-window sequences are used as training input, and backpropagation is performed by minimizing the Mean Absolute Error (MAE) loss. The average loss value is recorded every 100 training cycles to monitor convergence, and the early stopping mechanism is triggered when the relative decrease in validation loss remains below 1% for 10 consecutive cycles, automatically rolling back to the optimal weights. After training, the optimal weights are loaded to evaluate the predictions on the test set, the MAE metric is computed, and the error distribution is analyzed with a predicted-versus-true scatter plot. Comparative experiments show that the model exhibits significant robustness in heterogeneous computing environments, capturing weekly-granularity load periodicity while remaining sensitive to transient resource contention.
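The following sketch outlines a training loop consistent with the procedure described above (Adam, MAE loss, early stopping on the validation set with rollback to the best weights); the data loaders, learning rate, and epoch budget are placeholders rather than the paper's exact settings.

```python
import copy
import torch
import torch.nn as nn

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=500, patience=10, min_improve=0.01):
    """Adam + MAE training loop with early stopping: stop when the relative
    validation-loss improvement stays below 1% for `patience` consecutive
    epochs, then roll back to the best weights."""
    criterion = nn.L1Loss()                       # MAE loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_loss, best_state, stall = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item()
                           for x, y in val_loader) / len(val_loader)

        # Count epochs whose relative improvement is below the 1% threshold.
        improved = best_loss == float("inf") or (best_loss - val_loss) / best_loss > min_improve
        if improved:
            best_loss, best_state, stall = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            stall += 1
            if stall >= patience:
                break

    model.load_state_dict(best_state)             # roll back to best weights
    return model
```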
3.2.2. Loss Function Design
This study adopts a multitask joint learning framework that simultaneously optimizes four objectives: task execution time, CPU utilization, memory occupancy, and storage demand. A hierarchical weighted loss function is designed to account for the differences in numerical scale and error sensitivity across tasks.
The loss function measures the difference between the predicted and actual values of the model. For resource prediction in high-performance computing clusters, we use the Mean Absolute Error (MAE) as the base loss function [17], calculated as follows:
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left| y_i - \hat{y}_i \right|$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value of the model, and $m$ denotes the sample size. Note that MAE is chosen instead of the Mean Squared Error (MSE) because MAE is less sensitive to outliers and is therefore better suited to noisy monitoring data.
To prevent the optimization of one task from degrading the performance of the others during multitask learning, an uncertainty weighting strategy is used:
$$\mathcal{L}_{\mathrm{multi}} = \sum_{i=1}^{4}\left(\frac{1}{2\sigma_i^{2}}\mathcal{L}_i + \log\sigma_i\right)$$
A trainable noise parameter $\sigma_i$ is introduced for each task, and the weights are automatically adjusted by gradient descent to avoid task imbalance; $\mathcal{L}_i$ is the loss of the $i$-th task. Dynamic weight adjustment enables the model to attend to the performance of every task during optimization and improves the overall generalization ability.
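A compact sketch of one common parameterization of this uncertainty weighting is shown below, with a trainable log-variance per task applied to per-task MAE losses; the parameterization details are an assumption, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Multitask MAE loss with learnable uncertainty weighting.

    Each task i gets a trainable log-variance parameter s_i = log(sigma_i^2);
    the total loss is sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i, so tasks with
    higher estimated noise are automatically down-weighted during training.
    """

    def __init__(self, n_tasks=4):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))
        self.mae = nn.L1Loss()

    def forward(self, preds, targets):
        # preds, targets: (batch, n_tasks)
        total = 0.0
        for i in range(preds.size(1)):
            task_loss = self.mae(preds[:, i], targets[:, i])
            total = total + 0.5 * torch.exp(-self.log_vars[i]) * task_loss \
                          + 0.5 * self.log_vars[i]
        return total

# Toy usage for the four targets (time, CPU, memory, storage).
loss_fn = UncertaintyWeightedLoss(n_tasks=4)
loss = loss_fn(torch.randn(8, 4), torch.randn(8, 4))
loss.backward()
print(loss.item())
```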
In our proposed hybrid model combining Informer, LSTM, and Graph Neural Network (GNN) modules, each sub-module captures complementary characteristics from different perspectives: Informer for long-term dependencies, LSTM for short-term dynamics, and GNN for inter-task structural relationships. However, without an effective semantic alignment mechanism, information fusion may suffer from redundancy, conflicts, or degradation. To address this, we propose a novel Multi-Objective Multimodal Attention Consistency Loss (MOMA-Loss), which enforces consistency across attention representations from different modalities, thereby improving semantic coherence and fusion quality.
Let the attention outputs from the three modules be:
Informer attention matrix: $A_{\mathrm{Inf}} \in \mathbb{R}^{T \times T}$,
LSTM attention matrix: $A_{\mathrm{LSTM}} \in \mathbb{R}^{T \times T}$,
GNN-based graph attention matrix: $A_{\mathrm{GNN}} \in \mathbb{R}^{N \times N}$, where $T$ is the number of encoded time steps and $N$ is the number of cluster nodes.
MOMA-Loss consists of the following three components:
- (1)
Temporal Alignment Loss
This term enforces alignment between the long- and short-term temporal attention representations:
$$\mathcal{L}_{\mathrm{temp}} = \left\| A_{\mathrm{Inf}} - A_{\mathrm{LSTM}} \right\|_F^{2}$$
- (2)
Cross-Modal Alignment Loss
The GNN attention $A_{\mathrm{GNN}}$ is projected into the temporal space using a learnable projection $P$, enabling alignment with the sequence-based modalities:
$$\mathcal{L}_{\mathrm{cross}} = \left\| A_{\mathrm{LSTM}} - P\!\left(A_{\mathrm{GNN}}\right) \right\|_F^{2}$$
- (3)
Entropy Regularization
To prevent attention collapse and encourage diverse attention distributions, we introduce an entropy regularizer:
$$\mathcal{L}_{\mathrm{ent}} = \sum_{A \in \{A_{\mathrm{Inf}},\, A_{\mathrm{LSTM}},\, A_{\mathrm{GNN}}\}} \sum_{i,j} A_{ij}\log A_{ij}$$
- (4)
MOMA-Loss Summary
The final multimodal attention consistency loss is:
$$\mathcal{L}_{\mathrm{MOMA}} = \lambda_1 \mathcal{L}_{\mathrm{temp}} + \lambda_2 \mathcal{L}_{\mathrm{cross}} + \lambda_3 \mathcal{L}_{\mathrm{ent}}$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients whose values are set empirically in our experiments.
- (5)
Final Objective Function
Integrating MOMA-Loss with the multitask prediction losses, we define the overall objective as:
$$\mathcal{L}_{\mathrm{total}} = \sum_{i=1}^{4}\left(\frac{1}{2\sigma_i^{2}}\mathcal{L}_{\mathrm{MAE}}^{(i)} + \log\sigma_i\right) + \mathcal{L}_{\mathrm{MOMA}}$$
where $\sigma_i$ denotes the learnable uncertainty parameter for task $i$, dynamically adjusting its contribution to the total loss, and $\mathcal{L}_{\mathrm{MAE}}^{(i)}$ is the Mean Absolute Error loss for task $i$, selected for its robustness to noise and outliers in HPC monitoring data.
This unified loss not only ensures accurate and balanced multitask predictions, but also promotes deep semantic collaboration among heterogeneous modalities, improving the overall robustness and generalization of the prediction model.
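The sketch below assembles the three MOMA-Loss terms as described, using mean-squared (Frobenius-style) alignment terms, a learnable linear projection of the graph attention into the temporal space, and a negative-entropy regularizer; the exact functional forms and the lambda values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MOMALoss(nn.Module):
    """Multi-Objective Multimodal Attention Consistency Loss (sketch).

    a_inf, a_lstm: (B, T, T) temporal attention maps from Informer and LSTM.
    a_gnn:         (N, N)    graph attention map from the GAT module.
    """

    def __init__(self, n_nodes, seq_len, lambdas=(1.0, 1.0, 0.1)):
        super().__init__()
        self.proj = nn.Linear(n_nodes, seq_len, bias=False)  # learnable projection P
        self.lambdas = lambdas

    @staticmethod
    def entropy_term(attn, eps=1e-8):
        # Negative entropy of each attention row; minimizing it encourages
        # diverse (high-entropy) attention distributions.
        return (attn * (attn + eps).log()).sum(dim=-1).mean()

    def forward(self, a_inf, a_lstm, a_gnn):
        l_temp = ((a_inf - a_lstm) ** 2).mean()                       # temporal alignment
        a_gnn_t = self.proj(self.proj(a_gnn).t())                     # project N x N -> T x T
        l_cross = ((a_lstm - a_gnn_t.unsqueeze(0)) ** 2).mean()       # cross-modal alignment
        l_ent = (self.entropy_term(a_inf) + self.entropy_term(a_lstm)
                 + self.entropy_term(a_gnn)) / 3.0                    # entropy regularizer
        l1, l2, l3 = self.lambdas
        return l1 * l_temp + l2 * l_cross + l3 * l_ent

# Toy usage: 96 time steps, 16 cluster nodes.
loss_fn = MOMALoss(n_nodes=16, seq_len=96)
loss = loss_fn(torch.softmax(torch.randn(2, 96, 96), -1),
               torch.softmax(torch.randn(2, 96, 96), -1),
               torch.softmax(torch.randn(16, 16), -1))
print(loss.item())
```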
3.2.3. Optimizer
In order to minimize the loss function, we employ the Adam (Adaptive Moment Estimation) optimizer. The core idea of the Adam algorithm is to introduce first-order moments and second-order moments on top of the gradient to adaptively estimate and adjust the learning rate of each parameter. The calculation method is as follows:
For each parameter $\theta$ of the model, with gradient $g_t$ at iteration $t$, the first-order moment $m_t$ is updated as
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$$
where $\beta_1$ is the decay rate of the first-order moment; larger values indicate that the algorithm relies more on historical gradients.
The second-order moment $v_t$ is updated as
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^{2}$$
where $\beta_2$ is the decay rate of the second-order moment, which helps to balance the range of variation of the gradient and prevents parameter updates from being too large or too small.
Since the first-order and second-order moments are biased towards zero in the initial stage, they must be bias-corrected to bring them closer to their true values:
$$\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}$$
where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the first-order and second-order moments, and $\beta_1^{t}$ and $\beta_2^{t}$ represent the exponential decay effects of $\beta_1$ and $\beta_2$ at the $t$-th iteration.
Finally, based on the corrected moment estimates, the parameter update is as follows:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\,\hat{m}_t$$
where $\eta$ is the learning rate and $\epsilon$ is a very small constant used to prevent the denominator from being zero. In this study, fixed values are assigned to the initial learning rate, the momentum parameters $\beta_1$ and $\beta_2$, and the numerical stability constant $\epsilon$.
Since Adam's adaptive learning rate partially mitigates the learning-rate sensitivity problem, we use a combination of linear warmup and step decay as the learning-rate scheduling strategy. In the first 10 epochs, the learning rate is gradually increased to alleviate the instability of the initial gradient direction:
$$\eta_t = \eta_0 \cdot \frac{t}{T_{\mathrm{warm}}}, \qquad t \le T_{\mathrm{warm}}$$
where the warmup period $T_{\mathrm{warm}}$ is 10 epochs and $\eta_0$ is the initial learning rate. Afterwards, the learning rate is updated every 40 epochs according to
$$\eta_t = \eta_0 \cdot \gamma^{\left\lfloor (t - T_{\mathrm{warm}})/T_{\mathrm{step}} \right\rfloor}$$
where the step cycle $T_{\mathrm{step}}$ is 40 epochs and $\gamma$ is the decay factor.
Introducing the Adam optimizer into the above proposed model helps accelerate the training process and effectively avoids problems such as gradient vanishing and explosion, especially when training deep recurrent neural networks, which enhances the stability and efficiency of training, and helps the hybrid model proposed in this study to better accomplish the task of task prediction in HPC clusters.
3.2.4. Ablation Experiments
- (1)
Module Contribution Ablation Study
To validate the synergistic effect of the components in the Informer–LSTM–GNN architecture, this study designs hierarchical ablation experiments to quantify the independent and joint contributions of the temporal module (Informer–LSTM) and the spatial module (GNN). The experiments are based on the HPC cluster monitoring dataset, and the evaluation metrics include MAE, the number of epochs to convergence, and the node-level prediction consistency coefficient (NCC).
The baseline model is a complete Informer–LSTM–GNN architecture incorporating a multilayer attention mechanism with the following ablation group configuration:
Abl-1: Remove the GNN module and keep only the Informer–LSTM timing encoder
Abl-2: Disable Informer and use LSTM-GNN combination
Abl-3: Disable LSTM and use the Informer-GNN combination
As an example, the results of the ablation experiment for predicting task execution time are listed in Table 3:
The ablation experiment data show that in the Abl-1 configuration, removing the GNN module leads to a 14.3% decrease in the node-level prediction consistency coefficient (NCC), which verifies that the GNN effectively captures resource competition among nodes through neighborhood message passing. The Abl-2 configuration yields a 19.4% increase in the weekly-granularity load prediction error due to the removal of the Informer module, confirming that the entropy filtering mechanism of Informer's Probabilistic Sparse Attention achieves a sparse characterization of long-period regularity by focusing on key time points. For the short-burst traffic prediction task, the minute-level MAE of Abl-3 increases by 11.0%, highlighting the irreplaceable role of the synergy between the LSTM forget gate and input gate in transient dynamic modeling.
- (2)
Loss Function Composition Ablation Study
To validate the effectiveness of the proposed Multi-Objective Multimodal Attention Consistency Loss (MOMA-Loss) in improving multitask prediction performance, we conducted an ablation study under controlled settings.
The ablation study was performed using the same dataset and model backbone, with variations only in the loss function configuration. Three comparative setups were evaluated:
Abl-1: This version uses Mean Absolute Error (MAE) as the loss for all tasks, with task-level uncertainty-based dynamic weighting to mitigate imbalance.
Abl-2: In addition to Abl-1, this version includes an entropy regularization term on the attention distribution to encourage informative and diverse attention patterns, thus avoiding attention collapse.
Baseline: Builds upon Abl-2 by incorporating both temporal alignment and modality alignment terms from the proposed MOMA-Loss, enforcing consistency across attention patterns of different modalities.
This research evaluates the model performance on four prediction targets: task runtime, CPU utilization, memory usage, and storage demand. The evaluation metrics include:
MAE (Mean Absolute Error): measures the average absolute difference between predicted and true values.
R² (Coefficient of Determination): indicates the proportion of variance explained by the model.
The prediction performance under the various loss function configurations is listed in Table 4:
These results confirm that aligning attention distributions across temporal steps and modalities reduces semantic drift and leads to more consistent predictions.
3.2.5. Model Robustness Validation
To evaluate the robustness of our proposed model under practical HPC scenarios involving sensor failures and missing monitoring data, we designed the following testing procedure. For each test sample, 10%, 20%, and 30% of the input feature dimensions were randomly masked to simulate varying levels of sensor failure. The masked values were replaced with either zeros or the mean values of the corresponding features from the training set to represent different missing-data strategies. The model was evaluated on these modified test sets without any parameter updates. To mitigate the influence of randomness, the experiments were repeated five times for each missing rate, and the average results were reported. We employed metrics such as Mean Absolute Error (MAE) to quantitatively assess the impact of missing data on prediction accuracy. The results are listed in Table 5:
The results show a gradual increase in prediction error with higher missing data ratios, but the overall degradation remained limited, demonstrating strong fault tolerance.
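The sketch below reproduces the masking procedure in code: a fraction of feature dimensions is randomly masked per sample and replaced by zeros or training-set means before evaluating MAE; the toy model and tensor shapes are placeholders.

```python
import torch

def mask_features(x, missing_rate, strategy="zero", feature_means=None):
    """Randomly mask a fraction of input feature dimensions per sample to
    simulate sensor failures; masked entries are replaced by zeros or by
    training-set feature means.

    x: (batch, seq_len, n_features), missing_rate: e.g. 0.1 / 0.2 / 0.3.
    """
    b, t, f = x.shape
    n_mask = max(1, int(round(missing_rate * f)))
    x_masked = x.clone()
    for i in range(b):
        idx = torch.randperm(f)[:n_mask]                   # failed sensors for sample i
        if strategy == "zero":
            x_masked[i, :, idx] = 0.0
        else:                                              # mean imputation
            x_masked[i, :, idx] = feature_means[idx]
    return x_masked

# Toy usage: evaluate MAE under 10/20/30% missing features (repeat 5x in practice).
x, y = torch.randn(32, 96, 8), torch.randn(32, 4)
means = x.mean(dim=(0, 1))
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(96 * 8, 4))
for rate in (0.1, 0.2, 0.3):
    with torch.no_grad():
        mae = (model(mask_features(x, rate, "mean", means)) - y).abs().mean()
    print(f"missing {int(rate*100)}%: MAE={mae:.3f}")
```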
4. Results
To verify the effectiveness of the Hybrid Forecasting Model (HFM) proposed in this paper, Model-based RL, Meta-Continual Learning, Transformer-XL, and BERT (Bidirectional Encoder Representations from Transformers) are chosen as the benchmark comparison in this study.
Meta-Continual Learning is introduced as a comparison method to evaluate its performance in predicting task-level parameters such as runtime and CPU demand under non-stationary workload distributions in HPC clusters. By combining meta-learning and continual adaptation, this method offers potential advantages in handling long-term shifts in task patterns and heterogeneous job characteristics.
By incorporating an environment dynamics model into the reinforcement learning loop, Model-based RL enhances the prediction accuracy of future task states while significantly reducing training overhead [18]. In the context of HPC task scheduling, it enables the agent to simulate resource-task interactions ahead of time, leading to more informed and cost-efficient scheduling decisions. This approach is used to validate the gains of incorporating model priors in improving task prediction reliability and scheduling robustness.
By introducing Segment-Level Recurrence and Relative Position Encoding, Transformer-XL significantly improves the ability of traditional Transformers to model long sequences, especially in capturing long-term dependencies in time series [19]. Its performance provides a direct reference for evaluating the advantages of the proposed model in time-series feature extraction and long-term dependency modeling.
BERT, as a pre-trained model based on a Transformer encoder, achieves deep semantic association mining through a two-way self-attention mechanism and Masked Language Modeling (MLM). BERT is chosen to validate the performance improvement of the Cross-Attention mechanism proposed in this paper in cross-dimensional feature association modeling, especially in multivariate collaborative prediction tasks.
The experimental environment is a local computing platform based on PyCharm 2022.3.4, the programming language is Python 3.10, and the deep-learning framework is PyTorch 1.13.1. To quantify model performance, the following evaluation strategy is adopted: prediction accuracy is taken as the core metric, and prediction errors are compared across the four key metrics of task execution time, CPU utilization, memory consumption, and storage demand. GPU memory usage during the training phase and single-inference latency during the inference phase are recorded to measure the lightweight nature and deployment feasibility of each model.
In this study, the performance comparison of the five algorithms in terms of Mean Absolute Error (MAE) and Prediction Accuracy Compliance Rate (PACR) is visualized with a grouped bar chart. The MAE is calculated using Equation (8), and the Prediction Accuracy Compliance Rate (PACR) is calculated as follows:
$$\mathrm{PACR} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\left( \left| \hat{y}_i - y_i \right| \le \delta \right)$$
where $\hat{y}_i$ and $y_i$ denote the predicted and true values of the $i$-th sample, respectively, $\delta$ is the allowable error threshold, and $\mathbb{I}(\cdot)$ is the indicator function.
Based on the analysis of historical data, this study sets the prediction error tolerance thresholds shown in Table 6:
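For reference, a minimal sketch of the MAE and PACR computations is given below; the threshold value and sample numbers are hypothetical, standing in for the per-metric thresholds of Table 6.

```python
import torch

def pacr(y_pred, y_true, delta):
    """Prediction Accuracy Compliance Rate: fraction of samples whose absolute
    error is within the allowable threshold delta (see Table 6)."""
    return ((y_pred - y_true).abs() <= delta).float().mean().item()

def mae(y_pred, y_true):
    return (y_pred - y_true).abs().mean().item()

# Toy usage with a hypothetical threshold for one metric.
y_true = torch.tensor([120.0, 95.0, 300.0, 48.0])   # e.g., task execution time (s)
y_pred = torch.tensor([115.0, 101.0, 310.0, 60.0])
print(mae(y_pred, y_true), pacr(y_pred, y_true, delta=10.0))  # 8.25 0.75
```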
The mean absolute prediction error and the prediction accuracy compliance rate for the four metrics of task execution time, CPU utilization, memory requirement, and storage requirement under the five algorithms are shown in Figure 5:
The performance advantage of the hybrid Informer–LSTM–GNN model proposed in this paper in task prediction stems from the adaptation of its multimodal collaborative architecture to complex HPC scenarios. The Informer module utilizes an efficient attention mechanism to capture long-term cyclical patterns of task execution (e.g., daily/weekly job peaks), while the LSTM branch focuses on short-term fluctuations triggered by resource contention or sudden load changes. The two branches complement each other, avoiding the trade-off between short-term and long-term modeling found in traditional single models, so that the prediction reflects the global trend while adapting to local perturbations. The GNN module explicitly expresses the topological dependencies of the task execution environment by modeling resource competition between nodes.
The GPU memory usage of the five models in the training phase and the single-inference time in the inference phase are recorded in Table 7:
The hybrid model proposed in this study, which integrates Informer, LSTM, Graph Neural Networks (GNN), and hierarchical attention mechanisms, is designed to comprehensively capture the multidimensional temporal features and resource dependencies inherent in high-performance computing cluster tasks. By combining diverse deep-learning modules, the model achieves precise predictions of critical parameters such as job runtime, CPU demand, memory, and storage usage. While it demonstrates relatively high GPU memory consumption during training and longer inference latency compared to other methods, these are natural consequences of its architectural complexity and enhanced representational capacity. Compared with alternative models, this hybrid approach more effectively captures structural relationships among tasks and multi-scale temporal dependencies, leading to significant improvements in prediction accuracy. In practical applications, the moderate computational overhead is considered a reasonable trade-off for enhanced predictive performance, particularly in scenarios with stringent accuracy requirements.
5. Conclusions
With the increasing scale of High-Performance Computing (HPC) clusters and the growing complexity of task types, resource scheduling and management face unprecedented challenges. In multidimensional resource prediction in particular, traditional methods suffer from insufficient prediction accuracy, limited temporal-dependency modeling capability, and weak cross-dimensional feature association. To address these challenges, this paper proposes a combined prediction model integrating Informer, LSTM, and a Graph Neural Network (GNN), and designs a hierarchical attention framework based on multi-head self-attention and cross-attention mechanisms, which effectively improves the overall performance of multidimensional resource prediction.
The model structurally combines the advantages of different models: Informer efficiently captures long-range dependency information, LSTM excels at modeling short-term temporal features, and the GNN expresses the graph-structural relationship between tasks and resources; the multi-head self-attention mechanism enhances the global perception of temporal features, and the cross-attention mechanism realizes deep semantic association between different resource dimensions. The experimental results show that the prediction accuracy of the proposed combined model reaches 89.9%, 87.9%, 86.3%, and 84.3% on the four metrics of task execution time, CPU utilization, memory occupancy, and storage demand, respectively, which is significantly better than the compared methods, including Model-based RL, Meta-Continual Learning, Transformer-XL, and BERT.
In summary, the research in this paper not only provides a new solution idea for the multidimensional resource prediction problem in HPC clusters but also verifies the feasibility and effectiveness of integrating multimodal temporal modeling with attention mechanisms in deep-learning architectures. Future work will further explore the lightweight model structure to adapt to the edge computing environment and combine it with intelligent decision-making mechanisms, such as reinforcement learning, to realize a more efficient and intelligent HPC resource scheduling and management system.