Article

Progressive Spatiotemporal Graph Modeling for Spacecraft Anomaly Detection

1 School of Advanced Manufacturing and Robotics, Peking University, Beijing 100871, China
2 Institute of Remote Sensing Satellite, China Academy of Space Technology, Beijing 100094, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2026, 28(4), 426; https://doi.org/10.3390/e28040426
Submission received: 11 February 2026 / Revised: 20 March 2026 / Accepted: 7 April 2026 / Published: 10 April 2026

Abstract

The growing number of on-orbit spacecraft and the increasing volume of telemetry data have made intelligent anomaly detection in multi-channel telemetry essential for mission operations. Current spacecraft anomaly detection methods primarily rely on statistical models or time-series deep learning approaches, which often fail to explicitly model spatiotemporal dependencies across multiple telemetry channels. This shortcoming limits their ability to capture the dynamically evolving and intricately coupled relationships between variables. To overcome this limitation, a Progressive Spatiotemporal Graph (PSTG) model is proposed for anomaly detection in multi-channel spacecraft telemetry. PSTG employs a multi-scale patch embedding module to extract hierarchical semantic features from multi-channel time series, effectively reducing the dimensionality of the spatiotemporal graph. It constructs a sparse adjacency matrix using a multi-head attention mechanism that integrates intra-channel temporal dynamics, inter-channel spatial correlations, and cross-channel spatiotemporal interactions. An improved multi-head graph attention network then captures pairwise dependencies among nodes within the adjacency matrix. As a result, PSTG encodes rich spatiotemporal representations derived from intricate variable interactions, enabling accurate, real-time prediction of multi-channel telemetry. Furthermore, a dynamic thresholding mechanism is incorporated into PSTG to perform online anomaly detection based on prediction residuals. Extensive experiments on real-world spacecraft telemetry data collected over 84 months show that PSTG outperforms eleven state-of-the-art benchmark methods in almost all cases across multiple evaluation metrics. Finally, visualizations of the learned adjacency and attention matrices are presented to interpret the spatiotemporal modeling process, providing operators with actionable insights into the detected anomalies and facilitating root cause analysis.

1. Introduction

Satellites are sophisticated systems composed of multiple components, each serving distinct functions. Due to the extreme operational environments, such as rapid thermal cycling and intense electromagnetic radiation, it is difficult to prevent operational anomalies and failures, which pose significant risks to in-orbit satellite reliability and safety [1]. To mitigate these risks, satellite operators typically monitor key time-series telemetry data continuously, aiming to detect anomalies early and prevent critical system failures that could disrupt mission operations [2]. However, modern satellites generate vast amounts of telemetry data, including parameters such as temperature, voltage, and current [3]. Manually inspecting such high-volume, multivariate data for anomalies is highly labor-intensive, requiring operators to track hundreds of interrelated parameters across various subsystems. Although feature dimensionality reduction techniques can partially alleviate this burden, prior research indicates their limited effectiveness when applied to large-scale, high-dimensional sequential data [4]. In response, this study directly focuses on modeling multidimensional telemetry time series to preserve cross-parameter correlations and enhance anomaly detection accuracy. Given the complex interdependencies among telemetry variables, addressing anomaly detection within a multivariate time series framework is essential for reliably identifying significant deviations [4].
Telemetry data exhibits high dimensionality and dynamic interaction patterns over time. These interactions manifest as temporal, spatial, and spatiotemporal correlations across multiple channels. Temporal correlation refers to the dependence of current values on historical observations within a single channel, driven by periodic behaviors and underlying system dynamics. Spatial correlation arises from physical and functional dependencies among different subsystems, where the state of one telemetry channel influences others at the same moment. Furthermore, due to causal relationships (e.g., an increase in motor speed leading to delayed rises in current and temperature), telemetry data is also affected by past information from other channels, reflecting broader spatiotemporal dependencies. These non-exclusive correlation types may coexist and evolve dynamically throughout a mission. Beyond easily identifiable point anomalies, contextual anomalies, i.e., values that appear normal in isolation but deviate under specific temporal or operational conditions [1,5,6], require a deep understanding of the intricate spatiotemporal structure of multichannel telemetry. Consequently, spacecraft anomaly detection remains heavily reliant on expert analysts, and accurate, intelligent, and automated detection continues to pose a major challenge.
Early efforts in spacecraft anomaly detection were pioneered by NASA and its affiliated research centers. Systems such as the Inductive Monitoring System (IMS) [7], the BEAM/DIAD framework [8], and related approaches [9,10] laid the foundation for rule-based and data-driven health monitoring of shuttle telemetry, employing clustering-based nominal modeling and statistical invariants to identify deviations from normal behavior. With the rapid development of deep learning, substantial progress has been made in detecting anomalies in multivariate time series. Recurrent Neural Networks (RNNs) [11] and Long Short-Term Memory (LSTM) networks [12] excel at capturing temporal dependencies, while Convolutional Neural Networks (CNNs) [13], Variational Autoencoders (VAEs) [14], and Graph Neural Networks (GNNs) [15] are used to model inter-variable relationships. However, most of these models implicitly encode spatiotemporal interactions within global hidden states, failing to explicitly represent the underlying dependency structures. Spatiotemporal Graph Neural Networks (STGNNs) [16] offer a more structured approach by using GNN modules [17] to model spatial correlations and CNN [18], LSTM [19], or Transformer [20] components to capture temporal dynamics. Despite these advances, existing methods often fail to jointly model evolving spatiotemporal dependencies across channels and time steps. Temporal modeling is typically confined to individual channels, and spatial relationships are encoded without temporal context, resulting in poor representations of cross-channel phenomena such as delayed responses or coupled oscillations. This limitation undermines the detection of correlated or system-level anomalies.
To address these challenges, a prediction-driven anomaly detection framework is adopted, which first forecasts future telemetry values and then identifies anomalies based on the prediction residuals. The primary contribution of this work lies in the tailored design of a novel and robust forecasting model referred to as Progressive Spatiotemporal Graph (PSTG).
The main contributions of this paper are summarized as follows:
(a) A novel multi-scale adaptive fusion method is proposed to address the challenge of simultaneously capturing global patterns and local variations in spacecraft telemetry across diverse mission profiles. By modeling both long-term dependencies and short-term fluctuations, the method enables a comprehensive temporal feature representation that surpasses the capabilities of conventional single-scale approaches.
(b) A unified spatiotemporal graph representation, enhanced with an adaptive attention mechanism, is introduced to overcome the limitations of static dependency modeling. This approach dynamically identifies the most relevant node interactions at each time step, enabling the simultaneous learning of heterogeneous spatiotemporal dependencies through a single coherent graph structure, thereby significantly improving the modeling accuracy of complex spacecraft systems.
(c) The effectiveness of the complete PSTG framework is demonstrated through extensive experiments on a real-world spacecraft telemetry dataset spanning 84 months. While the model outperforms eleven state-of-the-art methods in almost all cases across multiple metrics, more importantly, its learned graph structure supports interpretable analysis. This capability is critical for assisting operators in diagnosing the root causes of detected anomalies.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the architecture and technical details of the PSTG framework. Section 4 evaluates the proposed method on a real-world spacecraft telemetry dataset. Section 5 concludes the paper.

2. Related Work

2.1. Foundational Spacecraft Anomaly Detection Methodologies

Early spacecraft anomaly detection techniques primarily relied on threshold-based rules and expert systems. Threshold methods involve setting fixed upper and lower bounds for each telemetry channel, making them ineffective for detecting contextual or collective anomalies [21]. Expert system-based approaches [22] require manually defined rules and are typically limited to monitoring only a few critical subsystems. These methods struggle to detect previously unseen anomalies or novel fault signatures, resulting in high false-negative rates, particularly in advanced spacecraft used for deep space missions.
With advancements in digital modeling, model-based anomaly detection has gained traction. These methods rely on constructing an accurate digital model of spacecraft systems, comparing simulated outputs with actual telemetry, and identifying discrepancies as potential anomalies. Kolcio et al. [23] developed a nominal model of a spacecraft’s attitude control system using Simulink and SysML, applying the constraint suspension method for fault diagnosis. However, as spacecraft become more autonomous and complex, creating precise digital models becomes increasingly challenging. Difficulties in parameter back-solving and real-time model updating severely limit the practicality and scalability of such approaches.
The growing volume of telemetry data and advances in artificial intelligence have spurred interest in data-driven anomaly detection. These methods leverage raw time series data and machine learning algorithms to distinguish between normal and anomalous states without requiring explicit fault rules. In recent years, classification-, clustering-, and prediction-based machine learning techniques have emerged as prominent tools for spacecraft anomaly detection.
Classification-based methods typically employ supervised learning, which can be either statistical or distance-based. Bernal-Mencia et al. [24] used Kernel Principal Component Analysis (KPCA) for feature extraction, followed by a Multi-Layer Perceptron (MLP) for binary classification of normal and abnormal states. However, supervised methods demand large volumes of labeled training data, which are expensive and time-consuming to obtain, especially given the scarcity of documented anomalies in real missions [25]. As a result, unsupervised approaches such as clustering have gained increasing attention [26]. For instance, Li et al. [27] used normal-state telemetry as a reference baseline, computed distance scores for incoming sub-sequences, and performed anomaly detection based on similarity measures. Nevertheless, classification and clustering methods often lack sensitivity to fine-grained temporal changes, making them prone to missing short-lived anomalies and limiting their real-time applicability.
Prediction-based anomaly detection methods use algorithms, particularly deep learning models, to learn the normal behavior of telemetry data during regular operation. These models reconstruct historical data or predict future values under normal conditions [28], and anomalies are identified by analyzing deviations between predicted and observed values. Commonly used models include Trajectory Optimization [29], Linear Models [30], Relevance Vector Machines (RVMs) [31], Extreme Learning Machines (ELMs) [32], Autoencoders (AEs) [33], Bayesian Neural Networks (BNNs) [34], Temporal Convolutional Networks (TCNs) [35], RNNs [36], LSTMs [6,37], Transformers [14], and Generative Adversarial Networks (GANs) [4]. The LSTM-based Telemanom framework, developed by NASA engineers [6], has become a benchmark in satellite telemetry anomaly detection. Lakey et al. [38] compared various deep learning architectures and found CNNs effective for diagnosing multiple fault types, while LSTMs and RNNs excelled in capturing time-series anomalies. Transformers demonstrated strong performance in detecting subtle, prolonged anomalies but required substantial computational resources.
Critically, while deep learning models like LSTMs and early Transformers have significantly improved detection accuracy, the technical landscape has been further enriched by advanced architectures such as iTransformer [39], PatchTST [40], and efficient mixing structures like TSMixer [41] and WPMixer [42]. Despite these breakthroughs—including cross-dimensional dependency modeling [43], frequency-domain MLPs [44], and selective representation spaces [45]—a fundamental limitation persists: these models predominantly operate as “black boxes”. Although they can flag anomalies with high precision, they rarely provide interpretable explanations regarding why an anomaly occurred or which specific sensor interactions contributed to the detection. Furthermore, as Zeng et al. [46] highlighted, the inherent complexity of Transformer-based architectures does not always translate to superior performance in time-series tasks, potentially obscuring the decision-making process. This lack of transparency poses a significant barrier in safety-critical aerospace applications, where trust, accountability, and rigorous root-cause analysis are paramount. Consequently, the research frontier is shifting from pure accuracy enhancement toward the development of explainable anomaly detection frameworks that provide actionable insights for mission operations.
Beyond spacecraft-specific anomaly studies, recent fault-diagnosis research has also shown that unsupervised and partial domain adaptation strategies can improve robustness under distribution shift and class mismatch [47,48]. This trend is relevant to telemetry anomaly detection, where cross-mission transfer and rare-event imbalance remain practical challenges.

2.2. Spacecraft Anomaly Detection with Explainability

In spacecraft operations, it is not sufficient to merely detect anomalies with high accuracy; ground operators must also understand the reasons behind each detection. Clear explanations enable faster decision-making, reduce mission risk, and support corrective actions. Therefore, Explainable Artificial Intelligence (XAI) [49,50] has become essential in this field, providing transparent justifications for anomaly detection decisions. Hundman et al. [6] employed an LSTM model to predict single-channel telemetry and introduced a dynamic threshold mechanism for anomaly detection. This approach offers partial explainability and has been successfully deployed for the Soil Moisture Active Passive (SMAP) satellite and the Mars Science Laboratory (MSL) rover. However, it treats multichannel telemetry as independent sequences, neglecting inter-sensor dependencies. As a result, it cannot distinguish whether a prediction error stems from genuine system anomalies or modeling inaccuracies.
To address these limitations, Xu et al. [51] integrated attention mechanisms with LSTM to jointly learn inter-parameter correlations and long-term temporal dependencies, using dynamic thresholds for final detection. Yu et al. [52] utilized GNNs to model intrinsic properties of telemetry variables and applied attention mechanisms to capture short-term cross-dimensional interactions, followed by LSTM-based temporal feature extraction. Liu et al. [18] employed dynamic graph attention networks to model complex spatial correlations in multivariate time series and applied optimal thresholds for anomaly detection. Zeng et al. [12] constructed a feature attention-based LSTM and causal network to infer causal relationships among telemetry parameters, updating dynamic thresholds via the k-sigma method.
While these studies incorporate aspects of spatial or temporal modeling, they typically treat these dimensions separately, failing to fully integrate spatiotemporal dynamics. Some works attempt joint modeling: Tian et al. [19] used a graph attention network to capture spatial dependencies and an LSTM for temporal modeling, leveraging spatiotemporal information for anomaly detection. Yu et al. [13] combined the CNN and LSTM to extract spatiotemporal features and used GAN-generated anomaly scores to detect multichannel anomalies. However, since these approaches consider the temporal and spatial features separately, they fail to further analyze the explainable patterns, such as the temporal, spatial, and spatiotemporal correlations [53], in the time series data. Given the complex spatiotemporal dynamics of multi-channel spacecraft telemetry data, anomaly detection methods must account for these spatiotemporal characteristics in a unified manner.

3. Methodology

Engineering-level intuition of PSTG: Before presenting the formal mathematical derivations, this section first summarizes the practical operation logic of PSTG for spacecraft health monitoring. The framework continuously processes incoming telemetry in three tightly coupled steps: it first uses multi-scale patches to capture both short-term high-frequency transients and long-term low-frequency mission drifts; it then performs dynamic graph construction and structure-guided weighting to discover data-adaptive sensor couplings rather than relying on a fixed engineering schematic; finally, it compares predicted normal behavior with real-time observations and converts residual deviations into actionable alarms through a dynamically calibrated statistical threshold.

3.1. Overall Framework

The proposed PSTG framework has a prediction-driven architecture that turns short-horizon forecasts and their deviations into anomaly evidence. An overview of the framework is illustrated in Figure 1, where $T$ denotes the total sequence length, $L$ is the context length, $F$ represents the forecast horizon, and $n_L$ is the progressive composition depth.
PSTG is designed as a channel-agnostic and data-adaptive framework rather than a mission template-specific model. The multi-scale patch operator captures temporal behaviors at multiple horizons, the dynamic graph module learns couplings directly from observed telemetry interactions, and the statistical thresholding module calibrates alarm criteria from recent residual distributions. Because these components do not rely on explicit sensor identities or hand-crafted subsystem schematics, the same architecture is expected to transfer to other missions and subsystem groups with similar multivariate telemetry characteristics. This expectation is currently design-based and qualitative, and dedicated cross-mission validation will be addressed in future work.
The framework's overall detection process acts on the full multivariate time series $X$ and their consolidated predictions $\hat{X}$ to produce an anomaly score matrix $S$:
$$S = \Phi\left(X, \hat{X}\right).$$
The predictions $\hat{X}$ are generated by the core model, which maps a context window $X_{t-L+1:t}$ to an $F$-step forecast $\hat{X}_{t+1:t+F}$. This mapping is a progressive composition of three operators:
$$\hat{X}_{t+1:t+F} = \mathcal{T}_{\Theta_3} \circ \mathcal{G}^{(n_L)}_{\Theta_2^{(n_L)}} \circ \mathcal{G}^{(n_L-1)}_{\Theta_2^{(n_L-1)}} \circ \cdots \circ \mathcal{G}^{(1)}_{\Theta_2^{(1)}} \circ \mathcal{P}_{\Theta_1}\left(X_{t-L+1:t}\right),$$
where $\Theta = \{\Theta_1, \{\Theta_2^{(l)}\}_{l=1}^{n_L}, \Theta_3\}$ denotes the set of learnable parameters associated with the multi-scale patching $\mathcal{P}$, the stacked graph reasoning $\mathcal{G}$, and the final forecast projection $\mathcal{T}$, respectively.
The core operators of this framework, i.e., $\mathcal{P}$, $\mathcal{G}$, and $\Phi$, are defined as follows:
(1) $\mathcal{P}$: Multi-scale temporal patching. $\mathcal{P}$ partitions raw telemetry data into a hierarchy of temporal patches and aggregates across scales, preserving fine-grained fluctuations and mission-level trends to yield a stable multi-resolution representation.
(2) $\mathcal{G}$: Progressive Spatiotemporal Graph reasoning. $\mathcal{G}$ constructs a data-adaptive spatiotemporal dependency structure over channels and time, refining the latent representation via structure-guided attention. Executed in a stacked manner with depth $n_L$, it captures cross-channel couplings and long-range effects without assuming a fixed correlation pattern.
(3) $\Phi$: Statistical anomaly decision. $\Phi$ converts forecast–signal discrepancies into anomaly evidence across channels and time and issues decisions under a data-calibrated criterion. Concrete choices (i.e., robust deviation, temporal stabilization, dynamic thresholding) are specified later.
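As a concrete illustration of the decision operator $\Phi$, the sketch below implements a rolling $k$-sigma residual threshold in the spirit of the dynamic thresholding described above. The function name, window length, and $k$ value are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def anomaly_scores(x, x_hat, window=50, k=3.0):
    """Minimal residual-based dynamic-thresholding sketch (assumed form).

    x, x_hat: arrays of shape (C, T) with observed and predicted telemetry.
    Returns a boolean matrix flagging points whose absolute residual
    exceeds a rolling per-channel mean + k * std threshold.
    """
    resid = np.abs(x - x_hat)                       # deviation per channel
    flags = np.zeros_like(resid, dtype=bool)
    for t in range(window, resid.shape[1]):
        hist = resid[:, t - window:t]               # recent residual history
        mu, sigma = hist.mean(axis=1), hist.std(axis=1)
        flags[:, t] = resid[:, t] > mu + k * sigma  # dynamic threshold
    return flags
```

A single injected spike is flagged because it dominates its local residual history, while the surrounding nominal samples stay below the calibrated bound.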

3.2. Progressive Spatiotemporal Inference for Multi-Channel Telemetry

This section formalizes the core inference engine of PSTG, which progressively refines a latent representation of the input telemetry data through a deep, stacked architecture. The objective is to transform an initial multi-resolution embedding into a forecast-ready state by iteratively applying the spatiotemporal reasoning operator $\mathcal{G}$.
Given the multi-channel telemetry sequence $X \in \mathbb{R}^{C \times L}$, an initial latent representation, $Z_{\text{fused}}$, is first generated by applying the multi-scale patch design operator:
$$Z_{\text{fused}} = \mathcal{P}_{\Theta_1}(X).$$
$Z_{\text{fused}} \in \mathbb{R}^{C \times N \times D}$ encapsulates hierarchical temporal features and serves as the initial hidden representation for the main reasoning stack, denoted as $H^{[0]} = Z_{\text{fused}}$.
The progressive inference is defined as a sequence of transformations, in which each layer not only refines the representations produced by the preceding layer but also performs intra-layer spatiotemporal reasoning through the joint evolution of graph structure and attention weights via the operator $\mathcal{G}$. Each layer possesses its own unique set of learnable parameters. The recursive update rule for the hidden states is formulated as:
$$H^{[l]} = \mathcal{G}_{\Theta_2^{(l)}}\left(H^{[l-1]}\right), \quad \text{for } l = 1, \dots, n_L.$$
The final output of the stack, $H^{[n_L]}$, is the fully distilled, forecast-ready latent representation. This representation is then consumed by the prediction head: $\hat{X}_{\text{future}} = \mathcal{T}_{\Theta_3}\left(H^{[n_L]}\right)$.
While the formulation above defines the transformation for a single context window, the complete predicted sequence $\hat{X}$ is generated by deploying the entire inference pipeline in a sliding-window fashion across the full telemetry sequence: each window produces a short-horizon forecast $\hat{X}_{\text{future}}$, of which only the first $\tau$ time steps are retained. The retained segments are then concatenated in temporal order to construct the final, continuous prediction $\hat{X}$, which is compared against the ground truth to identify anomalies, as detailed in the anomaly decision section.
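The sliding-window deployment can be sketched as follows; `predict_fn`, the window stride, and the gap-free concatenation are illustrative assumptions standing in for the trained PSTG model.

```python
import numpy as np

def assemble_predictions(x, predict_fn, L, F, tau):
    """Sketch of the sliding-window deployment (names hypothetical).

    x: (C, T) telemetry; predict_fn maps a (C, L) context window to a
    (C, F) forecast. Each window advances by tau steps and only the
    first tau forecast steps are kept, so the concatenation is gap-free.
    """
    C, T = x.shape
    chunks = []
    for t in range(L, T - tau + 1, tau):
        forecast = predict_fn(x[:, t - L:t])   # (C, F) short-horizon forecast
        chunks.append(forecast[:, :tau])       # retain only first tau steps
    return np.concatenate(chunks, axis=1)      # continuous prediction x_hat
```

With a dummy persistence forecaster (repeat the last observed value), a 20-step series with $L=5$, $F=3$, $\tau=2$ yields a 14-step continuous prediction.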

3.2.1. Multi-Scale Patch Design

Inspired by its success in natural language processing, patch embedding has recently emerged as a powerful paradigm for capturing local semantic information. Considering the large volume of telemetry data, we use a patch-based approach, dividing the data into smaller segments for analysis. This reduces the sequence length to be processed and lowers the overall computational load. Several existing methods rely on a uniform patch length; this single-scale approach, however, is inherently limited because a fixed patch size is ill-suited to capturing both short-term fluctuations and long-term trends simultaneously.
This section provides the formal specification of the multi-scale temporal patching operator, $\mathcal{P}_{\Theta_1}$. Conceptually, this operator is defined as the composition of three foundational transformations: a multi-scale partitioning function $\Pi_{\mathcal{P}}(\cdot)$, a position-aware embedding function $\text{Embed}_{\Theta_{\text{emb}}}(\cdot)$, and a gated attention fusion function $\text{Fuse}_{\Theta_{\text{gate}}}(\cdot)$. The entire operator constitutes a mapping from the raw temporal domain to a structured, multi-resolution latent space, expressed as:
$$\mathcal{P}_{\Theta_1} = \text{Fuse}_{\Theta_{\text{gate}}} \circ \text{Embed}_{\Theta_{\text{emb}}} \circ \Pi_{\mathcal{P}} : \mathbb{R}^{C \times L} \to \mathbb{R}^{C \times N \times D}.$$
The learnable parameters of the operator are thus $\Theta_1 = \Theta_{\text{emb}} \cup \Theta_{\text{gate}}$. Next, the concrete instantiation of each constituent transformation is specified.
The initial transformation, $\Pi_{\mathcal{P}}(\cdot)$, discretizes the continuous input $X_{t-L+1:t} \in \mathbb{R}^{C \times L}$ by partitioning it based on a set of $K$ patch lengths $P = \{p_1, p_2, \dots, p_K\}$:
$$\Pi_{\mathcal{P}}\left(X_{t-L+1:t}; P\right) = \left\{X_p^{(k)}\right\}_{k=1}^{K}.$$
To ensure comparability, this function standardizes the output to a common sequence of $N$ patches for every scale, so that temporal patterns with different characteristic durations can be captured, by applying a sliding window with a calculated stride $h_k = \left\lfloor \frac{L - p_k}{N - 1} \right\rfloor$. This yields a set of patches $\left\{X_p^{(k)}\right\}_{k=1}^{K}$.
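A minimal sketch of the partitioning step, assuming the stride formula $h_k = \lfloor (L - p_k)/(N - 1) \rfloor$ so that every scale yields exactly $N$ aligned patches:

```python
import numpy as np

def multi_scale_patches(x, patch_lengths, N):
    """Sketch of the multi-scale partitioning (stride formula assumed).

    x: (C, L) context window. For each patch length p_k, the stride
    h_k = (L - p_k) // (N - 1) places N patches per scale, so all
    scales share a common patch count N.
    """
    C, L = x.shape
    scales = []
    for p in patch_lengths:
        h = (L - p) // (N - 1)                 # stride so that N patches fit
        starts = [i * h for i in range(N)]
        patches = np.stack([x[:, s:s + p] for s in starts], axis=1)  # (C, N, p)
        scales.append(patches)
    return scales
```

For $L = 16$ and $N = 5$, patch lengths 4 and 8 give strides 3 and 2 respectively, and both scales end exactly at the last sample.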
The subsequent transformation, $\text{Embed}_{\Theta_{\text{emb}}}(\cdot)$, governed by the learnable embedding parameters $\Theta_{\text{emb}} = \{(W_k, b_k)\}_{k=1}^{K}$, endows the partitioned data with semantic structure and temporal order:
$$\text{Embed}_{\Theta_{\text{emb}}}\left(\left\{X_p^{(k)}\right\}_{k=1}^{K}; \{W_k, b_k\}\right) = \left\{Z_k\right\}_{k=1}^{K}.$$
$\text{Embed}_{\Theta_{\text{emb}}}(\cdot)$ is itself a combination of two functions. First, a scale-dedicated linear projection maps each patch to a $D$-dimensional vector. Second, to counteract the information loss from partitioning, a fixed sinusoidal positional prior is infused via summation. The complete embedding is defined as:
$$z_{i,c,k} = W_k x_{i,c}^{(k)} + b_k + p_i,$$
where $W_k$ is the weight matrix, $b_k$ denotes the bias vector, and $p_i$ represents the fixed positional encoding vector for position $i$. The components of $p_i$ are defined by:
$$(p_i)_j = \begin{cases} \sin\left(i/\theta^{2j/D}\right) & \text{if } j \text{ is even}, \\ \cos\left(i/\theta^{(j-1)/D}\right) & \text{if } j \text{ is odd}. \end{cases}$$
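The positional prior can be generated directly from the definition above; $\theta = 10000$ is an assumed default, following common Transformer practice rather than a value stated in the paper.

```python
import numpy as np

def positional_encoding(num_pos, D, theta=10000.0):
    """Fixed sinusoidal positional prior (theta=10000 assumed).

    Even dimensions use sin(i / theta^(2j/D)); odd dimensions use
    cos(i / theta^((j-1)/D)), matching the piecewise definition.
    """
    pe = np.zeros((num_pos, D))
    for i in range(num_pos):
        for j in range(D):
            if j % 2 == 0:
                pe[i, j] = np.sin(i / theta ** (2 * j / D))
            else:
                pe[i, j] = np.cos(i / theta ** ((j - 1) / D))
    return pe
```

Since the prior is fixed, it is computed once per patch grid and added to every scale's linear projection.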
Finally, the operator culminates in $\text{Fuse}_{\Theta_{\text{gate}}}$, governed by the learnable parameters $\Theta_{\text{gate}} = \{W_{\text{gate}}, b_{\text{gate}}\}$, which adaptively aggregates the parallel multi-scale representations into a single, unified representation. Its mapping is defined as:
$$\text{Fuse}_{\Theta_{\text{gate}}}\left(\left\{Z_k\right\}_{k=1}^{K}\right) = Z_{\text{fused}}.$$
After obtaining the set of embeddings $\{z_{i,c,1}, z_{i,c,2}, \dots, z_{i,c,K}\}$, $\text{Fuse}_{\Theta_{\text{gate}}}$ is used to fuse them into a single informative representation $z_{i,c}^{\text{fused}} \in \mathbb{R}^{D}$. Specifically, a gated attention mechanism is employed to compute the fused representation as:
$$z_{i,c}^{\text{fused}} = \sum_{j=1}^{K} \alpha_j \cdot z_{i,c,j}, \qquad \alpha = \text{Softmax}\left(\text{Linear}\left(\left[z_{i,c,1} \,\|\, \cdots \,\|\, z_{i,c,K}\right]\right)\right),$$
where the attention weights $\alpha_j$ are derived from a softmax function over a linear projection of the concatenated embeddings, allowing the model to learn the relative importance of each $z_{i,c,j}$ in the fusion process.
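A minimal sketch of the gated fusion for a single spatiotemporal position; the gate weight shapes (`W_gate`, `b_gate`) are illustrative assumptions about the linear layer's dimensions.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - v.max())
    return e / e.sum()

def gated_fusion(z_list, W_gate, b_gate):
    """Sketch of Fuse: attention weights from a linear map of the
    concatenated per-scale embeddings, then a weighted sum.

    z_list: K vectors of dimension D; W_gate: (K, K*D); b_gate: (K,).
    """
    concat = np.concatenate(z_list)            # (K*D,) concatenated scales
    alpha = softmax(W_gate @ concat + b_gate)  # (K,) scale importance weights
    return sum(a * z for a, z in zip(alpha, z_list))
```

With an untrained (zero) gate, the weights default to uniform, so the fusion reduces to an average of the per-scale embeddings.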
After fusing the temporal dependencies, we obtain the feature embeddings $Z_{\text{fused}} \in \mathbb{R}^{C \times N \times D}$, which are then reshaped into a node feature matrix $Z_{\text{fused}} \in \mathbb{R}^{n \times D}$. Here, each of the $n = C \times N$ rows represents a unique spatiotemporal node that will be processed by the subsequent graph reasoning module.

3.2.2. Progressive Spatiotemporal Graph Modeling

The core of the proposed framework lies in the PSTG modeling module, which transforms the initial multi-resolution embedding, $Z_{\text{fused}}$, into a forecast-ready state. This is achieved by iteratively applying a spatiotemporal reasoning operator, $\mathcal{G}$, in a stacked architecture. The recursive update rule for the hidden states is given by:
$$H^{[l]} = \mathcal{G}_{\Theta_2^{(l)}}\left(H^{[l-1]}\right), \quad \text{for } l = 1, \dots, n_L, \quad \text{with } H^{[0]} = Z_{\text{fused}}.$$
This section provides the formal specification for the generic operator $\mathcal{G}_{\Theta_2^{(l)}}$ at any given layer $l$. Conceptually, the operator is decomposed into two primary transformations: a dynamic graph construction operator $\mathcal{G}_{\text{graph}}$, followed by a structure-guided graph attention operator $\mathcal{G}_{\text{attn}}$. The complete operator for a layer is thus expressed as:
$$\mathcal{G}_{\Theta_2^{(l)}} = \mathcal{G}_{\text{attn}} \circ \mathcal{G}_{\text{graph}}.$$
The learnable parameters are partitioned accordingly: $\Theta_2^{(l)} = \Theta_{\text{graph}}^{(l)} \cup \Theta_{\text{attn}}^{(l)}$. The formal specifications for each constituent operator are provided in the following subsections.
Spatial–Temporal Graph Construction
The spatial–temporal graph construction approach formulates multivariate time series data as a dynamic graph, where the nodes represent spatial entities and the learned edges capture evolving spatial correlations [53]. To uncover the underlying relational structure among input variables without relying on a pre-defined static graph, a dynamic graph learning mechanism is applied to learn a sparse, weighted adjacency matrix directly from node features in an end-to-end manner.
Relying on a single graph structure can be insufficient to capture the complex, diverse nature of inter-variable dependencies. Therefore, a multi-head graph learner is adopted, allowing the model to learn $H$ distinct adjacency matrices in parallel from $H$ different representation subspaces. Each head is specialized to capture a distinct pattern of the graph structure, such as dependencies at different time scales or of various types.
The operator's mapping is defined as:
$$\mathcal{G}_{\text{graph}}\left(H^{[l-1]}; \Theta_{\text{graph}}^{(l)}\right) = \left\{A_{\text{final}}^{(h)}\right\}_{h=1}^{H}.$$
The operator is parameterized by $\Theta_{\text{graph}}^{(l)} = \{W_1^{(l)}, W_2^{(l)}\}$, where $W_1^{(l)}, W_2^{(l)} \in \mathbb{R}^{\frac{D}{H} \times \frac{D}{H}}$. To maintain notational consistency with the preceding section, we continue to denote this input matrix as $Z := H^{[l-1]} \in \mathbb{R}^{n \times D}$, where $Z = Z_{\text{fused}}$ for the initial layer ($l = 1$).
To capture the spatial–temporal features of this node feature matrix, we first divide $Z$ into different heads $Z^{(h)} \in \mathbb{R}^{n \times \frac{D}{H}}$, where $h \in \{1, 2, \dots, H\}$. To maintain parameter efficiency, two linear transformations with weights $W_1, W_2 \in \mathbb{R}^{\frac{D}{H} \times \frac{D}{H}}$ are learned and shared across all heads. These transformations project the node features of each head into a relational space:
$$E_1^{(h)} = Z^{(h)} W_1, \qquad E_2^{(h)} = Z^{(h)} W_2.$$
The weighted adjacency matrix for each head is then computed via a dot product, followed by a Rectified Linear Unit (ReLU) activation:
$$A_{\text{dense}}^{(h)} = \text{ReLU}\left(E_1^{(h)} \left(E_2^{(h)}\right)^{\top}\right).$$
To enforce sparsity and retain only the most important connections, a top-$k$ masking strategy based on a hyperparameter $\gamma$ is adopted to obtain the sparse adjacency matrix $A_{\text{mask}}^{(h)}$ for each head. To convert the raw edge weights into a normalized probability distribution, a softmax function is applied:
$$\left(A_{\text{norm}}^{(h)}\right)_{i,j} = \frac{\exp\left(\left(A_{\text{mask}}^{(h)}\right)_{i,j}\right)}{\sum_{k=1}^{n} \exp\left(\left(A_{\text{mask}}^{(h)}\right)_{i,k}\right)}.$$
This step transforms the adjacency matrix into a row-stochastic matrix, where each row sums to one. To prevent overfitting to the learned graph structure, dropout with rate $p_{\text{dropout}}$ is applied directly to this normalized adjacency matrix during the training phase. The final processed adjacency matrix $A_{\text{final}}^{(h)}$ is then passed to the subsequent graph attention layer for information propagation.
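The full per-head construction pipeline (projection, ReLU dot product, top-$k$ masking, row softmax) can be sketched as follows; the transpose in the dot product and setting masked entries to $-\infty$ before the softmax are common implementation choices assumed here, and dropout is omitted for brevity.

```python
import numpy as np

def build_adjacency(Z_h, W1, W2, k):
    """Sketch of one head of dynamic graph construction (assumed details).

    Z_h: (n, d) head features. Keeps the top-k edges per row of the
    ReLU-activated score matrix, then row-normalizes with a softmax.
    """
    E1, E2 = Z_h @ W1, Z_h @ W2                # project into relational space
    A = np.maximum(E1 @ E2.T, 0.0)             # dense weighted adjacency (ReLU)
    n = A.shape[0]
    masked = np.full_like(A, -np.inf)          # -inf kills non-top-k edges
    for i in range(n):
        top = np.argsort(A[i])[-k:]            # indices of k largest weights
        masked[i, top] = A[i, top]
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)    # row-stochastic adjacency
```

Each row of the result sums to one and has exactly $k$ nonzero entries, matching the sparse, row-stochastic structure described above.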
Graph Attention Learning
Once constructed, the dynamic adjacency matrix $A_{\text{final}}^{(h)}$ serves as the foundation for learning node representations. Our attention module uses a dynamic mechanism that computes attention weights based on pairwise interactions between a node and its neighbors, thereby enhancing the model's expressive capacity. A key modification is introduced to the standard GATv2 architecture for more effective integration of the learned graph structure. Instead of applying a linear transformation to the concatenated query and key vectors, we directly use the learned adjacency matrix to modulate the attention scores. This design not only enhances computational efficiency by eliminating a linear layer but, more importantly, directly injects the learned relational structure as a strong inductive bias into the attention mechanism.
The structure-guided graph attention operator $\mathcal{G}_{\text{attn}}$ is formally defined as:
$$\mathcal{G}_{\text{attn}}\!\left(H^{[l-1]}, \{A_{\text{final}}^{(h)}\}_{h=1}^{H}; \Theta_{\text{attn}}^{(l)}\right) = H^{[l]},$$
where the learnable parameters of the layer are $\Theta_{\text{attn}}^{(l)} = \{W_Q^{(l)}, W_K^{(l)}, W_V^{(l)}, W_O^{(l)}, w_A^{(l)}\}$.
Given the feature set $Z$, we first project it into combined query $Q$, key $K$, and value $V$ representations using three distinct linear layers. The resulting tensor is reshaped and permuted to disentangle the representations of the $H$ attention heads, yielding $H$ independent sets $Q^{(h)}, K^{(h)}, V^{(h)} \in \mathbb{R}^{n \times D/H}$ for $h = 1, \ldots, H$. To incorporate the explicit graph structure as a strong inductive bias, the learned attention scores are modulated by $A_{\text{final}}^{(h)}$. The attention score from source node $j$ to target node $i$ is computed as:
$$e_{i,j}^{(h)} = \big(A_{\text{final}}^{(h)}\big)_{ij} \cdot \mathrm{LeakyReLU}\!\left(w_A^{\top} \big[Q_i^{(h)} \,\|\, K_j^{(h)}\big]\right),$$
where $w_A \in \mathbb{R}^{2D/H \times 1}$ is a shared linear projection applied across all heads.
These modulated scores are then normalized across all source nodes in the neighborhood of node $i$, using the softmax function to obtain the final attention coefficients $\alpha_{i,j}^{(h)}$:
$$\alpha_{i,j}^{(h)} = \mathrm{softmax}_j\!\left(e_{i,j}^{(h)}\right) = \frac{\exp\!\big(e_{i,j}^{(h)}\big)}{\sum_{k \in \mathcal{N}_i} \exp\!\big(e_{i,k}^{(h)}\big)}.$$
To enhance regularization, dropout is applied to the attention coefficients. Subsequently, the message vector for node $i$ within head $h$, denoted as $M_i^{(h)}$, is computed by aggregating the node features of its neighbors, weighted by the final attention coefficients:
$$M_i^{(h)} = \sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{(h)} V_j^{(h)}.$$
The outputs from all $H$ heads are then concatenated and passed through a final linear projection layer, $W_O \in \mathbb{R}^{D \times D}$. This step yields the aggregated message representation $M \in \mathbb{R}^{n \times D}$:
$$M = \big[\, M^{(1)} \,\|\, M^{(2)} \,\|\, \cdots \,\|\, M^{(H)} \,\big]\, W_O.$$
Finally, following the standard Transformer architecture, a residual connection is added to the input features, followed by layer normalization, to produce the layer’s final output, $Z_{\text{out}} \in \mathbb{R}^{n \times D}$:
$$Z_{\text{out}} = \mathrm{LayerNorm}\!\left(H^{[l-1]} + M\right).$$
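A single-head NumPy sketch of this structure-guided attention is given below; shapes and helper names are assumptions for illustration, and multi-head concatenation, $W_O$, dropout, and LayerNorm are omitted:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def structure_guided_attention(Z, A, WQ, WK, WV, wA):
    """One head: raw scores LeakyReLU(wA^T [Q_i || K_j]) are modulated by
    the learned adjacency A, softmax-normalized over each node's
    neighborhood, then used to aggregate the value vectors."""
    Q, K, V = Z @ WQ, Z @ WK, Z @ WV
    n = Z.shape[0]
    # Pairwise concatenation [Q_i || K_j] for every (i, j): shape (n, n, 2*Dh).
    pair = np.concatenate([np.repeat(Q[:, None, :], n, axis=1),
                           np.repeat(K[None, :, :], n, axis=0)], axis=-1)
    e = A * leaky_relu(pair @ wA)             # structure-modulated scores
    e = np.where(A > 0, e, -np.inf)           # non-edges get zero attention
    w = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = w / w.sum(axis=1, keepdims=True)  # attention coefficients
    return alpha @ V                          # aggregated messages M

rng = np.random.default_rng(1)
n, Dh = 5, 4
Z = rng.normal(size=(n, Dh))
A = np.abs(rng.normal(size=(n, n)))
A /= A.sum(axis=1, keepdims=True)             # row-stochastic adjacency
M = structure_guided_attention(Z, A,
                               rng.normal(size=(Dh, Dh)),
                               rng.normal(size=(Dh, Dh)),
                               rng.normal(size=(Dh, Dh)),
                               rng.normal(size=2 * Dh))
```

Note how the adjacency enters twice: multiplicatively on the raw scores and as the neighborhood mask for the softmax, which is the "strong inductive bias" described above.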

3.2.3. Loss Function and Optimization

To minimize the discrepancy between the multi-channel telemetry prediction and actual data, the learning criterion is formulated as a composite loss function L . This function is designed to capture the signal’s point-wise accuracy as well as its structural and dynamic properties, and is defined as:
$$\mathcal{L}(X_{\text{future}}, \hat{X}_{\text{future}}) = \mathcal{L}_{\text{MSE}} + \lambda_1 \mathcal{L}_{\text{freq}} + \lambda_2 \mathcal{L}_{\text{shape}} = \big\|X_{\text{future}} - \hat{X}_{\text{future}}\big\|_F^2 + \lambda_1 \big\|\mathcal{F}(X_{\text{future}}) - \mathcal{F}(\hat{X}_{\text{future}})\big\|_F^2 + \lambda_2 \big\|\nabla_t X_{\text{future}} - \nabla_t \hat{X}_{\text{future}}\big\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm, $\mathcal{F}(\cdot)$ represents the Discrete Fourier Transform (DFT) along the temporal axis, and $\nabla_t$ denotes the temporal gradient operator. The weight parameters $\lambda_1$ and $\lambda_2$ balance point-wise reconstruction accuracy against spectral and structural fidelity. In this study, these hyperparameters were determined through a grid search on the validation set, ensuring that the model captures both high-frequency fluctuations and long-term trends.
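The three loss terms can be sketched in NumPy as follows; the default weights are placeholders, not the grid-searched values from the paper:

```python
import numpy as np

def composite_loss(X, X_hat, lam1=0.1, lam2=0.1):
    """Point-wise MSE plus frequency-domain (DFT along time) and
    temporal-gradient (shape) discrepancies, each a squared Frobenius norm.
    X, X_hat: (channels, time) arrays; lam1/lam2 stand in for lambda_1/2."""
    mse = np.sum((X - X_hat) ** 2)
    freq = np.sum(np.abs(np.fft.rfft(X, axis=-1)
                         - np.fft.rfft(X_hat, axis=-1)) ** 2)
    shape = np.sum((np.diff(X, axis=-1) - np.diff(X_hat, axis=-1)) ** 2)
    return mse + lam1 * freq + lam2 * shape

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 32))   # 3 channels, 32 time steps
```

A perfect prediction drives all three terms to zero simultaneously, while the frequency and gradient terms penalize predictions that match point-wise statistics but miss oscillatory or trend structure.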
To optimize the model parameters, the aforementioned loss function is minimized using Stochastic Gradient Descent (SGD)-based methods. Specifically, the Adam optimizer [54] is adopted, which adaptively adjusts learning rates for each parameter and accelerates convergence. To further enhance convergence stability and generalization, the Cosine Annealing (CA) learning rate scheduler [55] is integrated, which gradually reduces the learning rate following a cosine decay schedule over the course of training. This scheduler is applied on top of the Adam optimizer to facilitate smoother model convergence by allowing for large initial learning rates and progressively finer updates as training proceeds.

3.3. Multi-Channel Telemetric Anomaly Detection

Following the prediction stage, which yields the output $\hat{X}_{\text{future}} \in \mathbb{R}^{C \times F}$, the final forecast sequence $\hat{X}$ is constructed using a recursive strategy in which the number of retained points equals the window step size $\tau$: only the first $\tau$ points of each prediction window are kept. Once the complete forecast sequence is assembled, the anomaly detection phase begins. This process identifies and scores anomalous deviations by comparing the model’s predictions against the ground-truth telemetry data. The methodology adapts the robust, unsupervised techniques proposed by Kotowski et al. [57]. The overall procedure of the PSTG framework is summarized in Algorithm 1.
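The recursive assembly step, keeping only the first $\tau$ points of each window produced at stride $\tau$, can be sketched as follows (a minimal illustration; variable names are assumptions):

```python
import numpy as np

def assemble_forecast(windows, tau):
    """Stitch overlapping (C, F) prediction windows, generated every tau
    steps, into one continuous forecast by retaining only the first tau
    points of each window."""
    return np.concatenate([w[:, :tau] for w in windows], axis=1)

# Five toy windows with C = 2 channels and F = 10 future steps each.
windows = [np.full((2, 10), float(i)) for i in range(5)]
forecast = assemble_forecast(windows, tau=1)
```

With `tau=1` each window contributes exactly one time step, so the assembled forecast has one column per prediction window.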
The application of the operator $\Phi$ is centered on the principle of non-parametric dynamic thresholding. Its core mechanism involves the determination of an optimal threshold, $\epsilon^{*}$, which is found by solving the following optimization problem over the raw residual sequence $r = |X - \hat{X}|$:
$$\epsilon^{*} = \underset{\epsilon}{\arg\max}\; \frac{\Delta\mu(r_s)/\mu(r_s) + \Delta\sigma(r_s)/\sigma(r_s)}{|r_a| + |R_{\text{seq}}|^{2}},$$
where $\Delta\mu(r)$ and $\Delta\sigma(r)$ respectively denote the decreases in the mean and standard deviation of the raw residuals after excluding values above the threshold $\epsilon$; $r_a$ is the set of anomalous residuals exceeding $\epsilon$; and $R_{\text{seq}}$ represents the set of continuous sequences formed by those residuals.
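A simplified sketch of this threshold search follows; scanning candidate thresholds $\epsilon = \mu + z\sigma$ over a grid of $z$ values is an assumption borrowed from the Telemanom-style procedure the text adapts, and smoothing is omitted:

```python
import numpy as np

def optimal_threshold(r, z_range=np.arange(1.0, 6.0, 0.5)):
    """Score candidate thresholds eps = mu + z*sigma by the normalized drop
    in mean and std after removing supra-threshold residuals, penalized by
    the number of anomalous points plus the squared number of contiguous
    anomalous sequences, as in the equation above."""
    mu, sigma = r.mean(), r.std()
    best_eps, best_score = mu + z_range[0] * sigma, -np.inf
    for z in z_range:
        eps = mu + z * sigma
        below = r[r <= eps]
        above = np.flatnonzero(r > eps)
        if len(above) == 0 or len(below) == 0:
            continue
        d_mu, d_sigma = mu - below.mean(), sigma - below.std()
        n_seq = 1 + np.count_nonzero(np.diff(above) > 1)  # contiguous runs
        score = (d_mu / mu + d_sigma / sigma) / (len(above) + n_seq ** 2)
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps

# Residuals: mostly small noise plus three large consecutive spikes.
r = np.concatenate([np.abs(np.random.default_rng(7).normal(size=1000)),
                    np.array([15.0, 16.0, 17.0])])
eps_star = optimal_threshold(r)
```

The penalty term discourages thresholds that flag many points or many scattered sequences, so the chosen threshold isolates the few large spikes.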
Given the optimal threshold $\epsilon^{*}$ from the optimization step, the operator then assigns a severity score, $s$, to each detected anomalous sequence. This score quantifies the normalized magnitude of the deviation relative to the data-driven threshold:
$$s^{(i)} = \frac{\max\!\big(r_{\text{seq}}^{(i)}\big) - \epsilon^{*}}{\mu(r) + \sigma(r)}.$$
To enhance robustness, the operator’s application is extended with two refinements. First, to capture “silent failures” (i.e., anomalies manifesting as abrupt signal drops or inverted deviations), the entire optimization and scoring procedure is independently applied to a reflected residual sequence $r_{\text{ref}} = 2\mu(r) - r$. Second, to mitigate false alarms, a false-positive pruning strategy evaluates the percent decrease, $d^{(i)}$, between the peaks of consecutively ranked anomalous sequences, $r_{\max}^{(i-1)}$ and $r_{\max}^{(i)}$:
$$d^{(i)} = \frac{r_{\max}^{(i-1)} - r_{\max}^{(i)}}{r_{\max}^{(i-1)}}.$$
If $d^{(i)}$ is less than a predefined threshold $p_{\delta}$, the corresponding sequence and all subsequent, lower-ranked sequences are reclassified as normal.
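A minimal sketch of this pruning rule, operating on the per-sequence peak values (the list-based bookkeeping is illustrative):

```python
import numpy as np

def prune_false_positives(peaks, p_delta):
    """Rank anomalous sequences by peak residual in descending order; once
    the percent decrease between consecutive peaks falls below p_delta,
    that sequence and all lower-ranked ones are reclassified as normal.
    Returns the indices of the sequences kept as anomalous."""
    order = np.argsort(peaks)[::-1]          # descending by peak value
    keep = []
    for rank, idx in enumerate(order):
        if rank > 0:
            prev, cur = peaks[order[rank - 1]], peaks[idx]
            if (prev - cur) / prev < p_delta:
                break                        # drop this and all weaker ones
        keep.append(int(idx))
    return sorted(keep)
```

For example, with peaks `[10.0, 5.0, 4.9]` and `p_delta = 0.1`, the drop from 10.0 to 5.0 is large enough to keep both, but the 2% drop from 5.0 to 4.9 prunes the last sequence.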
Algorithm 1 The complete PSTG algorithm
  1: Input: Training dataset $\mathcal{D}_{\text{train}} = \{(X_{\text{context}}^{(i)}, X_{\text{future}}^{(i)})\}$; telemetry sequence for detection $S_{\text{test}}$; model hyperparameters for training and prediction.
  2: Output: Anomaly scores $S$ for the sequence $S_{\text{test}}$.
  •  Part 1: Parameter Learning via End-to-End Training
  3: Initialize parameters $\Theta = \{\Theta_1, \{\Theta_2^{(l)}\}_{l=1}^{n_L}, \Theta_3\}$.
  4: Define the predictor $\hat{X}_{t+1:t+F} = T_{\Theta_3} \circ \mathcal{G}^{(n_L)}_{\Theta_2^{(n_L)}} \circ \mathcal{G}^{(n_L-1)}_{\Theta_2^{(n_L-1)}} \circ \cdots \circ \mathcal{G}^{(1)}_{\Theta_2^{(1)}} \circ P_{\Theta_1}(X_{t-L+1:t})$.
  5: for epoch $= 1$ to $E$ do
  6:   for each batch $\{(X_{\text{context}}, X_{\text{future}})\} \subset \mathcal{D}_{\text{train}}$ do
  7:     $\hat{X}_{\text{future}} \leftarrow F_{\Theta}(X_{\text{context}})$.
  8:     $\mathcal{L} \leftarrow \mathcal{L}_{\text{pred}}(\hat{X}_{\text{future}}, X_{\text{future}})$.
  9:     Update $\Theta$.
10:   end for
11: end for
12: $\Theta^{*} \leftarrow \Theta$.
  •  Part 2: Full Sequence Generation
13: $\hat{X} \leftarrow [\,]$.
14: for $t = L$ to length($S_{\text{test}}$) $- F$ step $\tau$ do
15:   $X_{\text{context}} \leftarrow S_{\text{test}}[t-L+1 : t]$.
16:   $\hat{X}_{\text{forecast}} \leftarrow F_{\Theta^{*}}(X_{\text{context}})$.
17:   $\hat{X} \leftarrow \hat{X} \,\|\, \hat{X}_{\text{forecast}}[1:\tau]$.
18: end for
  •  Part 3: Anomaly Decision
19: $T_{\text{pred}} \leftarrow \text{length}(\hat{X})$.
20: $X_{\text{target}} \leftarrow S_{\text{test}}[L+1 : L+T_{\text{pred}}]$.
21: $S \leftarrow \Phi(X_{\text{target}}, \hat{X})$.
22: return $S$.

4. Experimental Results and Analysis

4.1. Description of Datasets

Experiments were conducted using multi-channel telemetry data from real spacecraft, and the proposed algorithm was compared with eleven advanced methodologies to validate its effectiveness. Notably, the adopted European Space Agency (ESA) Anomalies Dataset (ESA-AD) conceals the telemetry channel names. This prevents the algorithm from leveraging domain-specific knowledge, emphasizing a general data-driven approach rather than reliance on task-specific expertise. The experiments were based on the lightweight subset from Mission 1 of the ESA-AD dataset, focusing on telemetry channels 41–46 of subsystem 5. This subset presents significant challenges due to the high number and complexity of anomalies. Given the advanced autonomous capabilities of modern spacecraft, individual telecommands exert less influence on telemetry behavior than in earlier, non-autonomous systems; consequently, the dataset used in this work excludes telecommand data. A representative segment of the Mission 1 lightweight subset is illustrated in Figure 2. The Y-axis is omitted, since the channels are normalized and vertically offset for visual clarity.
Both the training and test datasets contain millions of telemetry samples, with the final three months of the training set designated as the validation set. The validation and test sets contain only samples occurring after those in the training set, preventing leakage from future time points. Anomalies are present in all three subsets (training, validation, and testing). This data-partitioning strategy maximizes the use of available data, reflecting a mature phase of the task in which sufficient historical data enabled robust model training. Anomaly statistics across the three datasets are summarized in Table 1.

4.2. Experimental Details

The TimeEval [56] framework is employed, as modified by Kotowski et al. [57], to implement the PSTG algorithm. The algorithm was first trained on the training set, during which contamination levels were calculated, thresholds were established, and standardization parameters were determined. It was then applied to the test set for online anomaly detection, operating without access to future samples from the test sequence.
In typical spacecraft mission operations, high-dimensional telemetry data are downlinked to ground stations for intensive health monitoring. Therefore, the PSTG model is positioned as a ground-based diagnostic tool, where the emphasis is placed on detection accuracy and interpretability rather than the extreme resource frugality required for on-orbit processing. Our model was implemented in PyTorch 2.6.0 under Python 3.9.0 and trained on a single NVIDIA GeForce RTX 4090 GPU. As outlined in Section 3.2.3, the Adam optimizer was used in conjunction with a CA learning rate scheduler. The hyperparameters adopted in the primary experiment are summarized in Table 2.
For our method and all baseline approaches, the sliding window length L was set to 250 and the prediction window length F was set to 10, following the standard configurations established in Telemanom. From each prediction window, only the first time step ( τ = 1 ) was retained for further processing.
For anomaly detection, the size of the smoothing window is calculated as $W_s = p_s \cdot n_s \cdot B_s$, where $B_s$ is the test batch size, $n_s$ denotes a configurable base factor, and $p_s$ represents the tuning percentage.
The hyperparameters for the proposed model were carefully tuned and are detailed in Table 3.

4.3. Evaluation Metrics

Modified performance metrics from ESA-AD were adopted to better align conventional anomaly detection measures with practical spacecraft operational requirements. These metrics included the corrected event-wise F0.5-score and modified affiliation-based F0.5-score. Among these, the modified affiliation-based F0.5-score has a comparatively intricate formulation, so a dedicated description is provided in the following subsection. The computation of these metrics excluded the use of Point Adjustment (PA), a preprocessing protocol intended to refine anomaly predictions before evaluation. PA operates under the assumption that if any single point within an anomalous segment is correctly detected, then all points in that segment are treated as correctly identified. This protocol relaxes the detection burden on algorithms, though often leading to inflated performance scores [58]. The application of PA might allow randomly generated anomaly scores to surpass the performance of several recently proposed time series anomaly detection methods [59].
Operationally, Event-wise F 0.5 reflects whether anomaly events can be detected in time while controlling false alarms at the event level, which is critical for alarm triage in spacecraft operations. Affiliation-based F 0.5 evaluates temporal localization quality by measuring how well predicted anomaly intervals align with the corresponding ground-truth intervals in terms of boundary proximity and coverage completeness. Reporting both metrics therefore provides complementary evidence on alarm reliability and diagnosis usability.
The affiliation-based $F_{\beta}$-score is a time-domain metric that evaluates how closely and completely predicted anomaly intervals match each ground-truth interval by computing the average temporal distance within each exclusive affiliation zone:
$$F_{\text{aff}, 0.5} = \frac{1.25 \cdot P_{\text{aff}} \cdot R_{\text{aff}}}{0.25 \cdot P_{\text{aff}} + R_{\text{aff}}}.$$
Here,
$$P_{\text{aff}} = \frac{1}{N}\left( \sum_{j \in J} \frac{1}{|\text{pred} \cap I_j|} \int_{x \in \text{pred} \cap I_j} \bar{F}_{\text{precision}}\!\left(\min_{y \in \text{gt}_j} |x - y|\right) dx + (N - |J|) \cdot 0.5 \right),$$
$$R_{\text{aff}} = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{|\text{gt}_j|} \int_{y \in \text{gt}_j} \bar{F}_{\text{recall}}\!\left(\min_{x \in \text{pred} \cap I_j} |x - y|\right) dy,$$
where $J = \{\, j : \text{pred} \cap I_j \neq \emptyset \,\}$, $N$ is the total number of ground-truth events, $I_j$ denotes the affiliation zone of $\text{gt}_j$, and $\bar{F}$ represents the survival function derived from uniform random sampling.
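For concreteness, the $F_{\beta}$ combination used above reduces, for $\beta = 0.5$, to a short function:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; beta = 0.5 weights precision four times as heavily as
    recall (beta^2 = 0.25), matching the formula above."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `f_beta(0.9, 0.6)` is about 0.818, while swapping the arguments to `f_beta(0.6, 0.9)` drops the score to about 0.643, illustrating how the $F_{0.5}$ formulation penalizes precision deficits more heavily, in line with the false-alarm-averse operational requirement.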

4.4. Baseline Methods

Baseline methods under comparison encompass a diverse set of approaches, ranging from classical techniques developed by NASA engineers [6] to competitive multivariate time-series modeling baselines. To broaden the comparison across modern modeling paradigms, high-performance time-series forecasting backbones commonly adopted in forecasting-based anomaly detection pipelines were incorporated, including the recently proposed deep spatiotemporal graph neural network TimeFilter [53]. Several Transformer-based models from the past two years were also included, as they are widely recognized for achieving State-Of-The-Art (SOTA) performance in time-series forecasting: iTransformer [39], PatchTST [40], and Crossformer [43]. Furthermore, lightweight MLP-Mixer-based alternatives such as DLinear [46] and TSMixer [41] were included to evaluate efficiency–accuracy trade-offs. Finally, the analysis considers methods that enhance deep neural networks through frequency-domain or time-frequency decomposition, such as FreTS [44] and WPMixer [42], to assess robustness under diverse inductive biases. The modern forecasting backbones included in the benchmark are summarized in Table 4.

4.5. Anomaly Detection Results

The anomaly detection results of the baseline methods and PSTG are shown in Table 5. Evidently, no baseline method performed well across all five categories of metrics.
The accurate identification of anomalous events is paramount in spacecraft anomaly detection, with a primary focus on the event-wise F0.5 score, which places greater emphasis on precision to reduce false alarms. As shown in Table 5, PSTG achieves an event-wise F0.5 score of 0.917, significantly outperforming all competing baseline methods. In comparison, forecasting-based backbones such as PatchTST and iTransformer exhibit relatively strong recall but suffer from reduced precision, indicating a tendency to over-detect anomalies when applied to complex multivariate telemetry data. Simpler linear models (DLinear) and certain Transformer variants (Crossformer) fail to capture intricate cross-channel dependencies, resulting in substantially degraded event-wise performance.
Beyond event-level detection, anomaly detection systems must accurately localize the temporal extent of anomalous behaviors across multiple telemetry channels. The affiliation-based metrics explicitly evaluate the temporal alignment between detected anomalies and ground-truth intervals. PSTG attains an affiliation-based F0.5 score of 0.892 and consistently surpasses all baseline methods across affiliation-based precision, recall, and F0.5 metrics. While TimeFilter, as a graph-based baseline, demonstrates competitive temporal localization capability, its reliance on patch-level or static filtration limits its ability to progressively refine spatiotemporal dependencies.
Overall, the superior performance of PSTG across both event-wise and affiliation-based evaluations highlights the effectiveness of its progressive spatiotemporal graph reasoning mechanism for reliable spacecraft anomaly detection.

4.6. Discussion on Imbalanced Anomaly Patterns

The ESA-AD subset used in this work is naturally imbalanced, where normal samples dominate and anomaly events are relatively sparse. Under this setting, the F 0.5 -oriented evaluation is intentionally adopted to emphasize precision and reduce operational false alarms. The dynamic thresholding strategy further mitigates spurious alarms by adapting to recent residual statistics; however, under extremely rare or weak anomaly patterns, recall degradation may still occur. These observations are reported as practical behavior under the current data regime rather than a complete robustness claim.

4.7. Explainability Analysis

As shown in Figure 3, relational patterns evolve progressively across network layers, illustrating how spatial, temporal, and spatiotemporal dependencies are incrementally refined. In the first layer, the adjacency matrix captures coarse spatial dependencies among nodes (Figure 3a), while the attention map displays diffuse and unstructured temporal interactions across time steps (Figure 3b). Following message passing in the second layer, the adjacency matrix becomes sparser and more structured (Figure 3c), indicating that the model selectively retains salient spatial connections while suppressing weaker or noisy ones. Finally, as shown in Figure 3d, the attention map exhibits concentrated regions of high intensity that correspond to correlated node activations across space and time, reflecting the emergence of stable spatiotemporal coupling patterns. This hierarchical refinement demonstrates that the proposed network adaptively enhances meaningful dependencies across spatial, temporal, and spatiotemporal dimensions, thereby improving both interpretability and representation stability.
To provide a rigorous mathematical foundation for the qualitative observations in Figure 3, we further quantify the evolutionary dynamics across the two stacked reasoning layers using Shannon Entropy (H). This metric evaluates the uncertainty and information distribution of the learned interaction components, defined as
$$H(i) = -\sum_{j=1}^{N} p_{ij} \log(p_{ij}),$$
where $N = 60$ represents the total number of spatiotemporal nodes ($N = 6$ channels $\times$ 10 patches) and $p_{ij}$ denotes the normalized interaction intensity between nodes. A higher $H$ value indicates more diffuse information integration, while a decreasing trend signifies the emergence of “reasoning determinism”. The quantitative results reveal distinct information-theoretic behaviors for the two interaction components. The entropy of the adjacency matrix increases steadily from Layer 1 ($H \approx 2.94$) to Layer 2 ($H \approx 3.43$), suggesting that as reasoning depth increases, the model expands its receptive field, transitioning from initial local physical constraints to a more comprehensive, global representation of the spacecraft’s structural dependencies. Conversely, the attention mechanism shows a notable entropy reduction, dropping from $H \approx 3.95$ in Layer 1 to $H \approx 3.54$ in Layer 2, indicating that the model actively “distills” information by filtering out redundant telemetry noise and concentrating its representational capacity on a sparse subset of critical spatiotemporal interactions. This dual-process evolution, combining the expansion of structural context with the concentration of dynamic attention, mathematically supports the PSTG model’s ability to balance holistic system monitoring with precise anomaly localization, ensuring the stability and interpretability of the learned coupling patterns for complex satellite health monitoring tasks.
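The entropy computation itself is straightforward; a NumPy sketch over a row-stochastic interaction matrix follows (uniform rows attain the maximum $\log N$, one-hot rows give zero):

```python
import numpy as np

def interaction_entropy(P, eps=1e-12):
    """Mean Shannon entropy H = -sum_j p_ij log p_ij, averaged over the rows
    of a row-stochastic interaction (adjacency or attention) matrix."""
    P = np.clip(P, eps, None)                 # guard against log(0)
    return float(np.mean(-np.sum(P * np.log(P), axis=1)))
```

For the $N = 60$ nodes used above, a fully diffuse matrix yields $H = \ln 60 \approx 4.09$, an upper bound consistent with the reported layer-wise values (2.94–3.95).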
Across repeated runs, this entropy split (broader structural context with more focused attention) is consistent with the stronger and more stable Event-wise and Affiliation-based F 0.5 distributions reported in Figure 4 and Figure 5. This entropy–performance linkage is presented as an empirical consistency analysis for interpretability support, rather than a causal proof.
Figure 4 and Figure 5 summarize the statistical distributions of the F0.5 scores obtained from 15 independent runs across different baseline methods on the ESA dataset. Specifically, Figure 4 reports the Event-wise F0.5 results, while Figure 5 presents the affiliation-based F0.5 results. For each method, performance distributions are visualized using boxplots with per-run samples overlaid and sorted by the mean value, highlighting both the central tendency and variability of detection performance. This statistical comparison facilitates a direct assessment of robustness and stability of different approaches under repeated trials.

4.8. Ablation Study

To comprehensively evaluate component contributions, we report two complementary ablation settings.
Backbone-level ablation (A/B/C+GCN). We retain the original progressive degradation design: Experiment A replaces multi-scale patches with a single-scale variant, Experiment B further weakens graph construction by using single-head attention and removing top-k sparsification, and Experiment C additionally replaces attention-based aggregation with a GCN. As shown in Figure 6, performance decreases as these components are removed, confirming that multi-scale representation, sparse graph construction, and attention-based aggregation jointly support robust anomaly detection.
G-module-focused ablation (I–IV). We further isolate the proposed attention architecture using four settings: Experiment I (full PSTG with M+A+G), Experiment II (w/o M), Experiment III (w/o A), and Experiment IV (w/o G, replacing the structure-guided attention with standard GATv2). As shown in Figure 7, Experiment I achieves the best overall scores. Removing M causes the largest drop in event-wise F 0.5 (0.921 → 0.723), and weakening A also reduces event-wise F 0.5 (0.921 → 0.830). The comparison between I and IV highlights the value of structure guidance: replacing the proposed module with GATv2 lowers event-wise F 0.5 from 0.921 to 0.849, with recall slightly increasing (0.877 → 0.892) but precision dropping markedly (0.933 → 0.839), indicating over-detection without explicit structural constraints.

5. Conclusions

In this work, the Progressive Spatiotemporal Graph (PSTG) framework was proposed to overcome the difficulty of modeling intricate, evolving dependencies in long-horizon spacecraft telemetry. By combining a multi-scale patch embedding strategy with a structure-guided graph attention mechanism, PSTG explicitly captures both global mission trends and local interaction dynamics across telemetry channels. Extensive experiments on an ESA real-world dataset show that PSTG outperforms eleven state-of-the-art baselines in almost all cases and, more importantly, bridges the gap between algorithmic prediction and operational trust by visualizing learned adjacency and attention matrices for actionable, interpretable alarms. Future work will extend this interpretability towards causal reasoning graphs and investigate lightweight model distillation for potential on-board deployment.

Author Contributions

Conceptualization, Z.C., Z.L. and H.C.; methodology, Z.C. and Z.L.; software, Z.C. and Z.L.; validation, Z.C., Y.C. and Y.W.; formal analysis, Z.C. and Z.L.; investigation, Z.C. and Z.L.; resources, Y.W. and H.C.; data curation, Z.C. and Z.L.; writing—original draft preparation, Z.C.; writing—review and editing, Z.L. and H.C.; visualization, Z.C.; supervision, H.C. and Y.W.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset (ESA-AD) supporting the findings of this study is openly available in Zenodo at https://zenodo.org/records/12528696 accessed on 6 April 2026; see the corresponding ESA-AD benchmark description in [57].

Acknowledgments

The authors would like to thank the European Space Agency (ESA) for providing the “Anomalies in Satellite Telemetry” (ESA-AD) benchmark dataset. Special thanks are extended to the research team led by Krzysztof Kotowski for their contribution in curating and maintaining the dataset on Zenodo.

Conflicts of Interest

Author Yue Wang was employed by the China Academy of Space Technology. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cuéllar, S.; Santos, M.; Alonso, F.; Fabregas, E.; Farias, G. Explainable anomaly detection in spacecraft telemetry. Eng. Appl. Artif. Intell. 2024, 133, 108083. [Google Scholar] [CrossRef]
  2. Chen, J.; Pi, D.; Wu, Z.; Zhao, X.; Pan, Y.; Zhang, Q. Imbalanced satellite telemetry data anomaly detection model based on Bayesian LSTM. Acta Astronaut. 2021, 180, 232–242. [Google Scholar] [CrossRef]
  3. Manavalan, R. NISAR Real Time Data Processing—A Simple and Futuristic View. In Big Data, Machine Learning, and Applications; Patgiri, R., Bandyopadhyay, S., Borah, M., Thounaojam, D., Eds.; Springer International Publishing: Silchar, India, 2019; pp. 95–101. [Google Scholar] [CrossRef]
  4. Xu, Z.; Cheng, Z.; Tang, Q.; Guo, B. An encoder-decoder generative adversarial network-based anomaly detection approach for satellite telemetry data. Acta Astronaut. 2023, 213, 547–558. [Google Scholar] [CrossRef]
  5. Boniol, P.; Paparizzos, J.; Palpanas, T. New Trends in Time Series Anomaly Detection. In Proceedings of the 26th International Conference on Extending Database Technology (EDBT 2023), Ioannina, Greece, 28–31 March 2023; pp. 847–850. [Google Scholar] [CrossRef]
  6. Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Soderstrom, T. Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 387–395. [Google Scholar] [CrossRef]
  7. Iverson, D.L. Inductive system health monitoring. In Proceedings of the International Conference on Artificial Intelligence, Las Vegas, NV, USA, 21–24 June 2004. [Google Scholar]
  8. Park, H.; Mackey, R.; James, M.; Zak, M.; Kynard, M.; Sebghati, J. Analysis of Space Shuttle main engine data using beacon-based exception analysis for multi-missions. In Proceedings of the IEEE Aerospace Conference; IEEE: Piscataway, NJ, USA, 2002; Volume 6, p. 6. [Google Scholar]
  9. Schwabacher, M.; Oza, N.; Matthews, B. Unsupervised anomaly detection for liquid-fueled rocket propulsion health monitoring. J. Aerosp. Comput. Inf. Commun. 2009, 6, 464–482. [Google Scholar] [CrossRef]
  10. Martin, R.A. Evaluation of Anomaly Detection Capability for Ground-Based Pre-Launch Shuttle Operations. In Aerospace Technologies Advancements; IntechOpen: London, UK, 2010. [Google Scholar]
  11. Napoli, C.; De Magistris, G.; Ciancarelli, C.; Corallo, F.; Russo, F.; Nardi, D. Exploiting Wavelet Recurrent Neural Networks for satellite telemetry data modeling, prediction and control. Expert Syst. Appl. 2022, 206, 117831. [Google Scholar] [CrossRef]
  12. Zeng, Z.; Jin, G.; Xu, C.; Chen, S.; Zeng, Z.; Zhang, L. Satellite Telemetry Data Anomaly Detection Using Causal Network and Feature-Attention-Based LSTM. IEEE Trans. Instrum. Meas. 2022, 71, 3507221. [Google Scholar] [CrossRef]
  13. Yu, J.; Song, Y.; Tang, D.; Han, D.; Dai, J. Telemetry Data-Based Spacecraft Anomaly Detection With Spatial–Temporal Generative Adversarial Networks. IEEE Trans. Instrum. Meas. 2021, 70, 3515209. [Google Scholar] [CrossRef]
  14. Wang, X.; Pi, D.; Zhang, X.; Liu, H.; Guo, C. Variational transformer-based anomaly detection approach for multivariate time series. Measurement 2022, 191, 110791. [Google Scholar] [CrossRef]
  15. Ding, C.; Sun, S.; Zhao, J. MST-GAT: A multimodal spatial–temporal graph attention network for time series anomaly detection. Inf. Fusion 2023, 89, 527–536. [Google Scholar] [CrossRef]
  16. Corradini, F.; Gori, M.; Lucheroni, C.; Piangerelli, M.; Zannotti, M. A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification. arXiv 2024. [Google Scholar] [CrossRef]
  17. Xiong, W.; Wang, P.; Sun, X.; Wang, J. SiET: Spatial information enhanced transformer for multivariate time series anomaly detection. Knowl.-Based Syst. 2024, 296, 111928. [Google Scholar] [CrossRef]
  18. Liu, L.; Tian, L.; Kang, Z.; Wan, T. Spacecraft anomaly detection with attention temporal convolution networks. Neural Comput. Appl. 2023, 35, 9753–9761. [Google Scholar] [CrossRef]
  19. Tian, Z.; Zhuo, M.; Liu, L.; Chen, J.; Zhou, S. Anomaly detection using spatial and temporal information in multivariate time series. Sci. Rep. 2023, 13, 4400. [Google Scholar] [CrossRef]
  20. Wang, C.; Liu, G. From anomaly detection to classification with graph attention and transformer for multivariate time series. Adv. Eng. Inform. 2024, 60, 102357. [Google Scholar] [CrossRef]
  21. Fernández, M.; Yue, Y.; Weber, R. Telemetry Anomaly Detection System Using Machine Learning to Streamline Mission Operations. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 70–75. [Google Scholar] [CrossRef]
  22. Rolincik, M.; Lauriente, M.; Koons, H.; Gorney, D. An expert system for diagnosing environmentally induced spacecraft anomalies. In Proceedings of the AIAA-91-1377, Huntsville, AL, USA, 24–27 September 1991; pp. 36–44. [Google Scholar]
  23. Kolcio, K.; Fesq, L. Model-based off-nominal state isolation and detection system for autonomous fault management. In Proceedings of the 2016 IEEE Aerospace Conference, Big Sky, MT, USA, 5–12 March 2016; pp. 1–13. [Google Scholar] [CrossRef]
  24. Bernal-Mencia, P.; Doerksen, K.; Yap, C. Machine Learning for Early Satellite Anomaly Detection. In Proceedings of the Small Satellite Conference, Logan, UT, USA, 7–12 August 2021. [Google Scholar]
  25. Baireddy, S.; Desai, S.; Foster, R.; Chan, M.; Comer, M.; Delp, E. Spacecraft Time-Series Online Anomaly Detection Using Deep Learning. In Proceedings of the 2023 IEEE Aerospace Conference, Big Sky, MT, USA, 4–11 March 2023; pp. 1–9. [Google Scholar] [CrossRef]
  26. Nalepa, J.; Myller, M.; Andrzejewski, J.; Benecki, P.; Piechaczek, S.; Kostrzewa, D. Evaluating algorithms for anomaly detection in satellite telemetry data. Acta Astronaut. 2022, 198, 689–701. [Google Scholar] [CrossRef]
  27. Li, T.; Comer, M.; Delp, E.; Desai, S.; Foster, R.; Chan, M. A Matching-Based Method for Anomaly Verification in Spacecraft Telemetry. In Proceedings of the 2022 IEEE Aerospace Conference, Big Sky, MT, USA, 5–12 March 2022; pp. 1–8. [Google Scholar] [CrossRef]
  28. Maleki Sadr, M.A.; Zhu, Y.; Hu, P. An Anomaly Detection Method for Satellites Using Monte Carlo Dropout. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 2044–2052. [Google Scholar] [CrossRef]
  29. Caruso, A.; Quarta, A.A.; Mengali, G.; Bassetto, M. Optimal On-Orbit Inspection of Satellite Formation. Remote Sens. 2022, 14, 5192. [Google Scholar] [CrossRef]
  30. Gao, Y.; Chen, G.; Fu, W.; Chen, X.; Ma, L.; Luo, T.; Xue, D. A Real-Time Linear Prediction Algorithm for Detecting Abnormal BDS-2/BDS-3 Satellite Clock Offsets. Remote Sens. 2023, 15, 1831. [Google Scholar] [CrossRef]
  31. Pang, J.; Liu, D.; Peng, Y.; Peng, X. Collective Anomalies Detection for Sensing Series of Spacecraft Telemetry with the Fusion of Probability Prediction and Markov Chain Model. Sensors 2019, 19, 722. [Google Scholar] [CrossRef] [PubMed]
  32. Abdelghafar, S.; Darwish, A.; Hassanien, A.; Yahia, M.; Zaghrout, A. Anomaly detection of satellite telemetry based on optimized extreme learning machine. J. Space Saf. Eng. 2019, 6, 291–298. [Google Scholar] [CrossRef]
  33. Sakurada, M.; Yairi, T. Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, Gold Coast, QLD, Australia, 2 December 2014; pp. 4–11. [Google Scholar] [CrossRef]
  34. Codetta-Raiteri, D.; Portinale, L. Dynamic Bayesian Networks for Fault Detection, Identification, and Recovery in Autonomous Spacecraft. IEEE Trans. Syst. Man Cybern. Syst. 2015, 45, 13–24. [Google Scholar] [CrossRef]
  35. Wang, Y.; Wu, Y.; Yang, Q.; Zhang, J. Anomaly Detection of Spacecraft Telemetry Data Using Temporal Convolution Network. In Proceedings of the 2021 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Glasgow, UK, 17–20 May 2021; pp. 1–5. [Google Scholar] [CrossRef]
  36. Su, Y.; Zhao, Y.; Niu, C.; Liu, R.; Sun, W.; Pei, D. Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2828–2837. [Google Scholar] [CrossRef]
  37. Tariq, S.; Lee, S.; Shin, Y.; Lee, M.; Jung, O.; Chung, D.; Woo, S. Detecting Anomalies in Space using Multivariate Convolutional LSTM with Mixtures of Probabilistic PCA. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2123–2133. [Google Scholar] [CrossRef]
  38. Lakey, D.; Schlippe, T. A Comparison of Deep Learning Architectures for Spacecraft Anomaly Detection. In Proceedings of the 2024 IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2024; pp. 1–11. [Google Scholar] [CrossRef]
  39. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
  40. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  41. Chen, S.A.; Li, C.L.; Yoder, N.; Arik, S.O.; Pfister, T. TSMixer: An All-MLP Architecture for Time Series Forecasting. Trans. Mach. Learn. Res. 2023. [Google Scholar] [CrossRef]
  42. Murad, M.M.N.; Aktukmak, M.; Yilmaz, Y. WPMixer: Efficient Multi-Resolution Mixing for Long-Term Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 19581–19588. [Google Scholar]
  43. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  44. Yi, K.; Zhang, Q.; Fan, W.; Wang, S.; Wang, P.; He, H.; Lian, D.; An, N.; Cao, L.; Niu, Z. Frequency-domain MLPs are More Effective Learners in Time Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  45. Wu, X.; Qiu, X.; Cheng, H.; Li, Z.; Hu, J.; Guo, C.; Yang, B. Enhancing Time Series Forecasting through Selective Representation Spaces: A Patch Perspective. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 9–15 December 2025. [Google Scholar]
  46. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128. [Google Scholar] [CrossRef]
  47. Jiang, F.; Lin, W.; Wu, Z.; Zhang, S.; Chen, Z.; Li, W. Fault diagnosis of gearbox driven by vibration response mechanism and enhanced unsupervised domain adaptation. Adv. Eng. Inform. 2024, 61, 102460. [Google Scholar] [CrossRef]
  48. Liu, S.; Li, X.; He, J.; Chen, Z.; Dai, L. Partial domain adaptation fault diagnosis method based on deep residual shrinkage network. J. Comput. Des. Eng. 2025, 12, 76–86. [Google Scholar] [CrossRef]
  49. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  50. Das, A.; Rad, P. Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. arXiv 2020, arXiv:2006.11371. [Google Scholar] [CrossRef]
  51. Xu, Z.; Cheng, Z.; Guo, B. A Multivariate Anomaly Detector for Satellite Telemetry Data Using Temporal Attention-Based LSTM Autoencoder. IEEE Trans. Instrum. Meas. 2023, 72, 3523913. [Google Scholar] [CrossRef]
  52. Yu, B.; Yu, Y.; Xu, J.; Xiang, G.; Yang, Z. MAG: A Novel Approach for Effective Anomaly Detection in Spacecraft Telemetry Data. IEEE Trans. Ind. Inform. 2024, 20, 3891–3899. [Google Scholar] [CrossRef]
  53. Hu, Y.; Zhang, G.; Liu, P.; Lan, D.; Li, N.; Cheng, D.; Dai, T.; Xia, S.T.; Pan, S. TimeFilter: Patch-Specific Spatial-Temporal Graph Filtration for Time Series Forecasting. arXiv 2025. [Google Scholar] [CrossRef]
  54. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  55. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  56. Wenig, P.; Schmidl, S.; Papenbrock, T. TimeEval: A benchmarking toolkit for time series anomaly detection algorithms. Proc. VLDB Endow. 2022, 15, 3678–3681. [Google Scholar] [CrossRef]
  57. Kotowski, K.; Haskamp, C.; Andrzejewski, J.; Ruszczak, B.; Nalepa, J.; Lakey, D.; Collins, P.; Kolmas, A.; Bartesaghi, M.; Martinez-Heras, J.; et al. European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry. arXiv 2024. [Google Scholar] [CrossRef]
  58. Mejri, N.; Lopez-Fuentes, L.; Roy, K.; Chernakov, P.; Ghorbel, E.; Aouada, D. Unsupervised anomaly detection in time-series: An extensive evaluation and analysis of state-of-the-art methods. Expert Syst. Appl. 2024, 256, 124922. [Google Scholar] [CrossRef]
  59. Xu, H.; Chen, W.; Zhao, N.; Li, Z.; Bu, J.; Li, Z.; Liu, Y.; Zhao, Y.; Pei, D.; Feng, Y.; et al. Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications. In Proceedings of the 2018 World Wide Web Conference; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2018; pp. 187–196. [Google Scholar] [CrossRef]
Figure 1. Overall framework of PSTG. The data flow proceeds through three key modules: (I) Multi-scale Patch Design: raw inputs are segmented into multi-scale patches and mapped into high-dimensional embeddings via gated attention fusion. (II) Progressive Spatiotemporal Graph Reasoning: a sequence of graph attention blocks repeatedly extracts and reasons over spatiotemporal dependencies, achieving progressive representation learning. (III) Statistical Anomaly Decision: the reconstructed outputs are compared with the ground truth to compute anomaly scores using a dynamic pruning strategy.
Figure 2. Partial illustration of the dataset.
Figure 3. Visualization of adjacency and graph attention matrices in the two-layer (n_L = 2) spatiotemporal graph neural network with multi-head attention (H = 4). (a) Adjacency matrix after first layer. (b) Aggregated graph attention matrix after first layer. (c) Adjacency matrix after second layer. (d) Final graph attention matrix highlighting temporal, spatial, and spatiotemporal correlations.
Figure 4. Statistical comparison of Event-wise F0.5 scores.
Figure 5. Statistical comparison of Affiliation-based F0.5 scores.
Figure 6. Performance comparison of PSTG vs. experiments A, B, and C.
Figure 7. G-module-focused ablation results for four settings (I–IV). Top: event-wise precision, recall, and F0.5. Bottom: affiliation-based precision, recall, and F0.5.
Table 1. The lightweight subset dataset overview.

| Mission 1 – the Lightweight Subset | Train | Validation | Test |
|---|---|---|---|
| Data points | 39,774,080 | 1,479,370 | 40,925,288 |
| Duration (anonymized) | 81 months | 3 months | 84 months |
| Annotated points [%] | 1.74 | 1.23 | 1.81 |
| Annotated events | 52 | 3 | 65 |
| Anomalies | 22 | 2 | 29 |
| Rare nominal events | 26 | 1 | 36 |
| Univariate/Multivariate | 0/48 | 0/3 | 1/64 |
| Global/Local | 39/9 | 3/0 | 40/25 |
| Point/Subsequence | 1/47 | 2/1 | 9/56 |
| Distinct event classes | 17 | 2 | 13 |
Table 2. Setup of hyperparameters for model training.

| Parameters | Values |
|---|---|
| learning rate | 5 × 10⁻⁴ |
| weight decay | 4 × 10⁻⁴ |
| T_max | 70 |
| eta_min | 0 |
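As a reading aid, the Table 2 values correspond to the AdamW optimizer [54] with a cosine-annealing learning-rate schedule [55]: T_max and eta_min are the standard cosine-annealing parameters. A minimal sketch of the resulting schedule in plain Python (the function name and epoch-wise stepping are illustrative assumptions, not taken from the paper):

```python
import math

def cosine_annealed_lr(epoch: int, base_lr: float = 5e-4,
                       eta_min: float = 0.0, t_max: int = 70) -> float:
    """Cosine-annealed learning rate [55] with Table 2's settings:
    decays smoothly from base_lr at epoch 0 to eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# The schedule starts at the base learning rate of 5e-4 and reaches eta_min = 0
# at T_max = 70, decreasing monotonically in between.
schedule = [cosine_annealed_lr(e) for e in range(71)]
```

The weight decay of 4 × 10⁻⁴ would be applied decoupled from the gradient-based update, as in AdamW [54].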
Table 3. Setup of hyperparameters of the prediction model.

| Parameters | Values |
|---|---|
| P | {25, 50, 125} |
| p_main | 25 |
| D | 512 |
| H | 4 |
| n_L | 2 |
| γ | 0.1 |
| p_dropout | 0.1 |
| p_δ | 0.21 |
| p_s | 0.05 |
| n_s | 30 |
| B_s | 70 |
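Table 3's P = {25, 50, 125} gives the patch lengths used by the multi-scale patch embedding module. A minimal sketch of the segmentation step, assuming non-overlapping patches (the stride choice and the function name are illustrative assumptions, not stated in this excerpt):

```python
def multiscale_patches(series, patch_sizes=(25, 50, 125)):
    """Segment a 1-D telemetry channel into non-overlapping patches at each
    scale in P = {25, 50, 125} (Table 3). Trailing samples that do not fill
    a complete patch are dropped in this sketch."""
    return {p: [series[i * p:(i + 1) * p] for i in range(len(series) // p)]
            for p in patch_sizes}

# A window of 250 samples yields 10, 5, and 2 patches at the three scales;
# each patch would then be linearly mapped into a D = 512 embedding.
patches = multiscale_patches(list(range(250)))
```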
Table 4. Modern time-series backbone baselines for spacecraft telemetry anomaly detection.

| Category | Method | Key Insight/Architecture |
|---|---|---|
| Modern TS Backbones | DLinear [46] | A lightweight linear decomposition-based forecasting model that separates trend and seasonal components, serving as a strong linear baseline. |
| | iTransformer [39] | An inverted Transformer that captures multivariate correlations by embedding the entire time series. |
| | PatchTST [40] | A patch-based Transformer that preserves local semantic information and long-term dependencies. |
| | TSMixer/WPMixer [41,42] | High-performance MLP-based architectures that alternate between time and feature mixing. |
| | FreTS [44] | A frequency-domain MLP structure designed to capture periodic patterns in telemetry. |
| | Crossformer [43] | A Transformer variant designed to explicitly model cross-dimension and hierarchical dependencies. |
| | TimeFilter [53] | A spatiotemporal filtration approach that serves as a direct graph-based baseline. |
Table 5. Performance comparison of PSTG and SOTA baseline methods on the ESA dataset.

| | Metric | TimeFilter | iTransformer | PatchTST | Crossformer | DLinear | TSMixer | FreTS | WPMixer | PSTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Event-wise | Precision | 0.832 | 0.824 | 0.902 | 0.372 | 0.347 | 0.805 | 0.749 | 0.796 | 0.932 |
| | Recall | 0.846 | 0.877 | 0.862 | 0.815 | 0.846 | 0.769 | 0.831 | 0.846 | 0.862 |
| | F0.5 | 0.835 | 0.834 | 0.894 | 0.418 | 0.394 | 0.798 | 0.764 | 0.806 | 0.917 |
| Affiliation-based | Precision | 0.884 | 0.900 | 0.898 | 0.822 | 0.784 | 0.866 | 0.863 | 0.876 | 0.905 |
| | Recall | 0.814 | 0.858 | 0.837 | 0.768 | 0.705 | 0.775 | 0.797 | 0.782 | 0.844 |
| | F0.5 | 0.869 | 0.891 | 0.885 | 0.811 | 0.767 | 0.846 | 0.849 | 0.856 | 0.892 |
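The F0.5 scores in Table 5 follow the standard F-beta definition with β = 0.5, which weights precision more heavily than recall — a reasonable choice when false alarms are costly for spacecraft operators. A minimal sketch, checked against the PSTG rows of the table:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 emphasizes precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# PSTG event-wise: precision 0.932, recall 0.862 -> F0.5 ≈ 0.917 (Table 5)
pstg_event_f05 = f_beta(0.932, 0.862)
# PSTG affiliation-based: precision 0.905, recall 0.844 -> F0.5 ≈ 0.892 (Table 5)
pstg_affil_f05 = f_beta(0.905, 0.844)
```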
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, Z.; Li, Z.; Cao, Y.; Wang, Y.; Chang, H. Progressive Spatiotemporal Graph Modeling for Spacecraft Anomaly Detection. Entropy 2026, 28, 426. https://doi.org/10.3390/e28040426
