1. Introduction
Satellites are sophisticated systems composed of multiple components, each serving distinct functions. Owing to extreme operational environments, such as rapid thermal cycling and intense electromagnetic radiation, operational anomalies and failures are difficult to prevent and pose significant risks to in-orbit satellite reliability and safety [1]. To mitigate these risks, satellite operators typically monitor key time-series telemetry continuously, aiming to detect anomalies early and prevent critical system failures that could disrupt mission operations [2]. However, modern satellites generate vast amounts of telemetry data, including parameters such as temperature, voltage, and current [3]. Manually inspecting such high-volume, multivariate data for anomalies is highly labor-intensive, requiring operators to track hundreds of interrelated parameters across various subsystems. Although feature dimensionality reduction techniques can partially alleviate this burden, prior research indicates their limited effectiveness on large-scale, high-dimensional sequential data [4]. In response, this study directly models multidimensional telemetry time series to preserve cross-parameter correlations and enhance anomaly detection accuracy. Given the complex interdependencies among telemetry variables, addressing anomaly detection within a multivariate time series framework is essential for reliably identifying significant deviations [4].
Telemetry data exhibits high dimensionality and dynamic interaction patterns over time. These interactions manifest as temporal, spatial, and spatiotemporal correlations across multiple channels. Temporal correlation refers to the dependence of current values on historical observations within a single channel, driven by periodic behaviors and underlying system dynamics. Spatial correlation arises from physical and functional dependencies among different subsystems, where the state of one telemetry channel influences others at the same moment. Furthermore, due to causal relationships (e.g., an increase in motor speed leading to delayed rises in current and temperature), telemetry data is also affected by past information from other channels, reflecting broader spatiotemporal dependencies. These non-exclusive correlation types may coexist and evolve dynamically throughout a mission. Beyond easily identifiable point anomalies, contextual anomalies, i.e., values that appear normal in isolation but deviate under specific temporal or operational conditions [1,5,6], require a deep understanding of the intricate spatiotemporal structure of multichannel telemetry. Consequently, spacecraft anomaly detection remains heavily reliant on expert analysts, and accurate, intelligent, and automated detection continues to pose a major challenge.
Early efforts in spacecraft anomaly detection were pioneered by NASA and its affiliated research centers. Systems such as the Inductive Monitoring System (IMS) [7], the BEAM/DIAD framework [8], and related approaches [9,10] laid the foundation for rule-based and data-driven health monitoring of shuttle telemetry, employing clustering-based nominal modeling and statistical invariants to identify deviations from normal behavior. With the rapid development of deep learning, substantial progress has been made in detecting anomalies in multivariate time series. Recurrent Neural Networks (RNNs) [11] and Long Short-Term Memory (LSTM) networks [12] excel at capturing temporal dependencies, while Convolutional Neural Networks (CNNs) [13], Variational Autoencoders (VAEs) [14], and Graph Neural Networks (GNNs) [15] are used to model inter-variable relationships. However, most of these models implicitly encode spatiotemporal interactions within global hidden states, failing to explicitly represent the underlying dependency structures. Spatiotemporal Graph Neural Networks (STGNNs) [16] offer a more structured approach, using GNN modules [17] to model spatial correlations and CNN [18], LSTM [19], or Transformer [20] components to capture temporal dynamics. Despite these advances, existing methods often fail to jointly model evolving spatiotemporal dependencies across channels and time steps. Temporal modeling is typically confined to individual channels, and spatial relationships are encoded without temporal context, resulting in poor representations of cross-channel phenomena such as delayed responses or coupled oscillations. This limitation undermines the detection of correlated or system-level anomalies.
To address these challenges, a prediction-driven anomaly detection framework is adopted, which first forecasts future telemetry values and then identifies anomalies based on the prediction residuals. The primary contribution of this work lies in the tailored design of a novel and robust forecasting model referred to as Progressive Spatiotemporal Graph (PSTG).
The main contributions of this paper are summarized as follows:
(a) A novel multi-scale adaptive fusion method is proposed to address the challenge of simultaneously capturing global patterns and local variations in spacecraft telemetry across diverse mission profiles. By modeling both long-term dependencies and short-term fluctuations, the method enables a comprehensive temporal feature representation that surpasses the capabilities of conventional single-scale approaches.
(b) A unified spatiotemporal graph representation, enhanced with an adaptive attention mechanism, is introduced to overcome the limitations of static dependency modeling. This approach dynamically identifies the most relevant node interactions at each time step, enabling the simultaneous learning of heterogeneous spatiotemporal dependencies through a single coherent graph structure, thereby significantly improving the modeling accuracy of complex spacecraft systems.
(c) The effectiveness of the complete PSTG framework is demonstrated through extensive experiments on a real-world spacecraft telemetry dataset spanning 84 months. While the model outperforms eleven state-of-the-art methods in almost all cases across multiple metrics, more importantly, its learned graph structure supports interpretable analysis. This capability is critical for assisting operators in diagnosing the root causes of detected anomalies.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the architecture and technical details of the PSTG framework. Section 4 evaluates the proposed method on a real-world spacecraft telemetry dataset. Section 5 concludes the paper.
3. Methodology
Engineering-level intuition of PSTG: Before presenting the formal mathematical derivations, this section first summarizes the practical operation logic of PSTG for spacecraft health monitoring. The framework continuously processes incoming telemetry in three tightly coupled steps: it first uses multi-scale patches to capture both short-term high-frequency transients and long-term low-frequency mission drifts; it then performs dynamic graph construction and structure-guided weighting to discover data-adaptive sensor couplings rather than relying on a fixed engineering schematic; finally, it compares predicted normal behavior with real-time observations and converts residual deviations into actionable alarms through a dynamically calibrated statistical threshold.
3.1. Overall Framework
The proposed PSTG framework has a prediction-driven architecture that turns short-horizon forecasts and their deviations into anomaly evidence. An overview of the framework is illustrated in Figure 1, where T denotes the total sequence length, L the context length, and F the forecast horizon; the figure also annotates the progressive composition depth.
PSTG is designed as a channel-agnostic and data-adaptive framework rather than a mission template-specific model. The multi-scale patch operator captures temporal behaviors at multiple horizons, the dynamic graph module learns couplings directly from observed telemetry interactions, and the statistical thresholding module calibrates alarm criteria from recent residual distributions. Because these components do not rely on explicit sensor identities or hand-crafted subsystem schematics, the same architecture is expected to transfer to other missions and subsystem groups with similar multivariate telemetry characteristics. This expectation is currently design-based and qualitative, and dedicated cross-mission validation will be addressed in future work.
The framework’s overall detection process acts on the full multivariate time series and its consolidated predictions to produce an anomaly score matrix.
The predictions are generated by the core model, which maps a context window of length L to an F-step forecast. This mapping is a progressive composition of three operators: multi-scale patching, stacked graph reasoning, and a final forecast projection, each associated with its own set of learnable parameters.
The core operators of this framework are defined as follows:
- (1) Multi-scale temporal patching: partitions raw telemetry data into a hierarchy of temporal patches and aggregates across scales, preserving fine-grained fluctuations and mission-level trends to yield a stable multi-resolution representation.
- (2) Progressive spatiotemporal graph reasoning: constructs a data-adaptive spatiotemporal dependency structure over channels and time, refining the latent representation via structure-guided attention. Executed in a stacked manner, it captures cross-channel couplings and long-range effects without assuming a fixed correlation pattern.
- (3) Statistical anomaly decision: converts forecast–signal discrepancies into anomaly evidence across channels and time and issues decisions under a data-calibrated criterion. Concrete choices (i.e., robust deviation, temporal stabilization, dynamic thresholding) are specified later.
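The three-operator composition described above can be sketched schematically; the callables below are illustrative placeholders for the learned operators, not the paper's actual implementations:

```python
def pstg_forecast(window, patch_op, graph_layers, head):
    """Progressive composition: multi-scale patching, a depth-N stack of
    graph reasoning layers, then the forecast projection. All callables
    are stand-ins for the learned operators."""
    h = patch_op(window)
    for layer in graph_layers:   # each layer refines the previous state
        h = layer(h)
    return head(h)

# toy stand-ins acting on a plain list of floats
patch = lambda w: list(w)
layers = [lambda h: [v + 0.0 for v in h]] * 3   # depth 3
head = lambda h: h[-2:]                          # pretend 2-step forecast
print(pstg_forecast([1.0, 2.0, 3.0, 4.0], patch, layers, head))
```

The value of this structure is that each stage can be swapped or deepened independently, which is exactly how the stacked reasoning depth is treated in the framework.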
3.2. Progressive Spatiotemporal Inference for Multi-Channel Telemetry
This section formalizes the core inference engine of PSTG, which progressively refines a latent representation of the input telemetry through a deep, stacked architecture. The objective is to transform an initial multi-resolution embedding into a forecast-ready state by iteratively applying the spatiotemporal reasoning operator.
Given the multi-channel telemetry sequence, an initial latent representation is first generated by applying the multi-scale patch design operator. This representation encapsulates hierarchical temporal features and serves as the initial hidden state of the main reasoning stack.
The progressive inference is defined as a sequence of transformations, in which each layer refines the representations produced by the preceding layer while performing intra-layer spatiotemporal reasoning through the joint evolution of graph structure and attention weights. Each layer possesses its own unique set of learnable parameters, and the hidden states are updated recursively from layer to layer.
The final output of the stack is the fully distilled, forecast-ready latent representation, which is then consumed by the prediction head.
While the formulation above defines the transformation for a single context window, the complete predicted sequence is generated by deploying this inference pipeline in a sliding-window fashion across the full telemetry sequence: each window produces a short-horizon forecast, and only its first time steps, equal in number to the window stride, are retained. These retained segments are then concatenated in temporal order to construct the final, continuous prediction. This global forecast is then compared against the ground truth to identify anomalies, as detailed in the anomaly decision section.
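The sliding-window assembly can be sketched in a few lines; the window contents and stride used here are illustrative stand-ins, not the paper's actual configuration:

```python
def stitch_forecasts(window_preds, step):
    """Keep only the first `step` points of each F-step window forecast
    and concatenate them in temporal order into one continuous series."""
    out = []
    for pred in window_preds:
        out.extend(pred[:step])
    return out

# three overlapping windows of F = 4 predictions, stride = 2
preds = [[10, 11, 12, 13], [12, 13, 14, 15], [14, 15, 16, 17]]
print(stitch_forecasts(preds, 2))
```

With stride 1, as used in the experiments, only the very first forecast point of each window survives, giving one prediction per time step.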
3.2.1. Multi-Scale Patch Design
Inspired by its success in natural language processing, patch embedding has recently emerged as a powerful paradigm for capturing local semantic information. Considering the large volume of telemetry data, we use a patch-based approach, dividing the data into smaller segments for analysis. This reduces the sequence length to be processed and lowers the overall computational load. Several existing methods rely on a uniform patch length. This single-scale approach, however, is inherently limited because a fixed patch size is ill suited to capturing both short-term fluctuations and long-term trends simultaneously.
This section provides the formal specification of the multi-scale temporal patching operator. Conceptually, the operator is defined as the composition of three foundational transformations: a multi-scale partitioning function, a position-aware embedding function, and a gated attention fusion function. The entire operator constitutes a mapping from the raw temporal domain to a structured, multi-resolution latent space; its learnable parameters are those of the embedding and fusion functions. Next, the concrete instantiation of each constituent transformation is specified.
The initial transformation discretizes the continuous input by partitioning it according to a set of K patch lengths. To ensure comparability, this function standardizes the output to the same number of patches at every scale, so that temporal patterns with different characteristic durations can be captured, by applying a sliding window with a stride computed per scale. This yields one set of patches per patch length.
The subsequent transformation, governed by learnable embedding parameters, endows the partitioned data with semantic structure and temporal order. It combines two functions. First, a scale-dedicated linear projection, parameterized by a weight matrix and bias vector, maps each patch to a D-dimensional vector. Second, to counteract the information loss from partitioning, a fixed sinusoidal positional prior is infused via summation, following the standard formulation PE(i, 2k) = sin(i / 10000^{2k/D}) and PE(i, 2k+1) = cos(i / 10000^{2k/D}) for position i.
Finally, the operator culminates in a gated attention fusion, governed by its own learnable parameters, which adaptively aggregates the parallel, multi-scale representations into a single unified representation. Specifically, after the per-scale embeddings are obtained, a gated attention mechanism computes the fused representation as a weighted sum, where the attention weights are derived from a softmax over a linear projection of the concatenated embeddings, allowing the model to learn the relative importance of each scale in the fusion process.
After fusing the temporal dependencies, the resulting feature embeddings are reshaped into a node feature matrix in which each row represents a unique spatiotemporal node to be processed by the subsequent graph reasoning module.
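As an illustration of the multi-scale patching and gated fusion pipeline described above, the following NumPy sketch uses random stand-in weights in place of learned parameters; the patch lengths (8, 16, 32), embedding width D = 16, and patch count P = 24 are assumed values, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(x, patch_len, num_patches):
    """Slide a patch_len window over x with a stride chosen so every
    scale yields the same number of patches."""
    stride = max(1, (len(x) - patch_len) // (num_patches - 1))
    starts = [min(i * stride, len(x) - patch_len) for i in range(num_patches)]
    return np.stack([x[s:s + patch_len] for s in starts])       # (P, patch_len)

def sinusoidal_pe(num_pos, dim):
    """Standard fixed sine/cosine positional encoding."""
    pos = np.arange(num_pos)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))   # (P, D)

def embed_scale(x, patch_len, num_patches, dim):
    """Scale-dedicated linear projection plus the positional prior."""
    patches = patchify(x, patch_len, num_patches)
    W = rng.standard_normal((patch_len, dim)) / np.sqrt(patch_len)  # stand-in weights
    return patches @ W + sinusoidal_pe(num_patches, dim)

def gated_fusion(embeddings):
    """Softmax gate over the K scale embeddings (a stand-in for the
    learned linear scoring of the concatenated embeddings)."""
    E = np.stack(embeddings)                   # (K, P, D)
    logits = rng.standard_normal(E.shape[0])   # stand-in gate logits
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()
    return np.tensordot(alpha, E, axes=1)      # (P, D)

L, D, P = 250, 16, 24
x = np.sin(np.linspace(0, 20, L)) + 0.1 * rng.standard_normal(L)
embs = [embed_scale(x, p, P, D) for p in (8, 16, 32)]   # three scales
fused = gated_fusion(embs)
print(fused.shape)
```

The key property is that all scales emit tensors of identical shape, so the gate can mix them without alignment logic.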
3.2.2. Progressive Spatiotemporal Graph Modeling
The core of the proposed framework lies in the PSTG modeling module, which transforms the initial multi-resolution embedding into a forecast-ready state by iteratively applying a spatiotemporal reasoning operator in a stacked architecture, with hidden states updated recursively from layer to layer. This section specifies the generic operator at any given layer l. Conceptually, it is decomposed into two primary transformations: a dynamic graph construction operator followed by a structure-guided graph attention operator, with the learnable parameters partitioned accordingly. The formal specifications for each constituent operator are provided in the following subsections.
Spatial–Temporal Graph Construction
The spatial–temporal graph construction approach formulates multivariate time series data as a dynamic graph, where the nodes represent spatial entities and the learned edges capture evolving spatial correlations [53]. To uncover the underlying relational structure among input variables without relying on a pre-defined static graph, a dynamic graph learning mechanism is applied to learn a sparse, weighted adjacency matrix directly from node features in an end-to-end manner.
Relying on a single graph structure can be insufficient to capture the complex, diverse nature of inter-variable dependencies. Therefore, a multi-head graph learner is adopted, allowing the model to learn H distinct adjacency matrices in parallel from H different representation subspaces. Each head specializes in a distinct pattern of the graph structure, such as dependencies at different time scales or of different types.
The operator maps the node feature matrix of the current layer to a set of head-specific adjacency matrices. To maintain notational consistency with the preceding section, the input matrix of the initial layer is the node feature matrix produced by the patching stage.
To capture the spatial–temporal features of this node feature matrix, it is first divided into H heads. To maintain parameter efficiency, two linear transformations are learned and shared across all heads; these transformations project the node features of each head into a relational space.
The weighted adjacency matrix for each head is then computed via a dot product, followed by a Rectified Linear Unit (ReLU) activation. To enforce sparsity and retain only the most important connections, a top-k masking strategy based on a hyperparameter is adopted to obtain the final sparse adjacency matrix for each head.
To convert the raw edge weights into a normalized probability distribution, a softmax function is applied, transforming the adjacency matrix into a row-stochastic matrix in which each row sums to one. Finally, to prevent overfitting to the learned graph structure, dropout is applied directly to the normalized adjacency matrix during training. The final processed adjacency matrix is then passed to the subsequent graph attention layer for information propagation.
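A minimal NumPy sketch of the multi-head dynamic graph construction follows; the random projections stand in for learned weights, and the head count and top-k setting are assumed values:

```python
import numpy as np

rng = np.random.default_rng(1)

def dynamic_adjacency(X, num_heads=2, top_k=3):
    """Multi-head dynamic graph learning: shared projections, ReLU-scored
    dot products, per-row top-k masking, then row-softmax. Dropout on the
    normalized matrix (training only) is omitted in this sketch."""
    N, D = X.shape
    d = D // num_heads
    W1 = rng.standard_normal((d, d)) / np.sqrt(d)   # shared across heads
    W2 = rng.standard_normal((d, d)) / np.sqrt(d)
    adjs = []
    for h in range(num_heads):
        Xh = X[:, h * d:(h + 1) * d]
        scores = np.maximum((Xh @ W1) @ (Xh @ W2).T, 0.0)  # ReLU scores
        keep = np.argsort(-scores, axis=1)[:, :top_k]      # top-k per row
        mask = np.zeros_like(scores, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=1)
        masked = np.where(mask, scores, -np.inf)
        A = np.exp(masked - masked.max(axis=1, keepdims=True))
        adjs.append(A / A.sum(axis=1, keepdims=True))      # row-stochastic
    return np.stack(adjs)   # (H, N, N)

X = rng.standard_normal((6, 8))   # 6 spatiotemporal nodes, 8-dim features
A = dynamic_adjacency(X)
print(A.shape)
```

Masking before the softmax (rather than after) keeps each row a proper probability distribution over the surviving top-k neighbors.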
Graph Attention Learning
Once constructed, the dynamic adjacency matrix serves as the foundation for learning node representations. The attention module uses a dynamic mechanism that computes attention weights from pairwise interactions between a node and its neighbors, enhancing the model's expressive capacity. A key modification is introduced to the standard GATv2 architecture for more effective integration of the learned graph structure: instead of applying a linear transformation to the concatenated query and key vectors, the learned adjacency matrix is used directly to modulate the attention scores. This design not only improves computational efficiency by eliminating a linear layer but, more importantly, injects the learned relational structure as a strong inductive bias into the attention mechanism.
The structure-guided graph attention operator maps the node features and the learned adjacency matrix to updated node representations, with its own set of learnable parameters for the layer.
Given the feature set, it is first projected into combined query, key, and value representations using three distinct linear layers. The resulting tensor is reshaped and permuted to disentangle the representations of the H attention heads, yielding H independent sets of queries, keys, and values. To incorporate the explicit graph structure as a strong inductive bias, the learned attention scores are modulated by the corresponding adjacency weights: the attention score from source node j to target node i is computed from the query of node i and the key of node j through a shared linear projection applied across all heads, scaled by the learned adjacency entry.
These modulated scores are then normalized across all source nodes in the neighborhood of node i using the softmax function, yielding the final attention coefficients.
To enhance regularization, dropout is applied to the attention coefficients. Subsequently, the message vector for node i within head h is computed by aggregating the node features of its neighbors, weighted by the final attention coefficients.
The outputs from all H heads are then concatenated and passed through a final linear projection layer, yielding the aggregated message representation.
Finally, following the standard Transformer architecture, a residual connection is added to the input features, followed by layer normalization, to produce the layer's final output.
3.2.3. Loss Function and Optimization
To minimize the discrepancy between the multi-channel telemetry prediction and the actual data, the learning criterion is formulated as a composite loss function designed to capture the signal's point-wise accuracy as well as its structural and dynamic properties. It combines a point-wise term measured by the Frobenius norm, a spectral term based on the Discrete Fourier Transform (DFT) along the temporal axis, and a structural term based on the temporal gradient operator. The weight parameters balancing the spectral and structural terms against point-wise reconstruction accuracy are critical; in this study, these hyperparameters were determined through a grid search on the validation set, ensuring that the model effectively captures both high-frequency fluctuations and long-term trends.
To optimize the model parameters, the loss function is minimized using Stochastic Gradient Descent (SGD)-based methods. Specifically, the Adam optimizer [54] is adopted, which adaptively adjusts the learning rate of each parameter and accelerates convergence. To further enhance convergence stability and generalization, a Cosine Annealing (CA) learning rate scheduler [55] is integrated, which gradually reduces the learning rate along a cosine decay over the course of training. Applied on top of Adam, the scheduler permits large initial learning rates and progressively finer updates as training proceeds.
3.3. Multi-Channel Telemetric Anomaly Detection
Following the prediction stage, the final forecast sequence is constructed using a sliding strategy in which the number of retained points equals the window step size; only the first points of each prediction window are utilized. Once the complete forecast sequence is assembled, the anomaly detection phase begins. This process identifies and scores anomalous deviations by comparing the model's predictions against the ground-truth telemetry data, adapting the robust, unsupervised techniques proposed by Kotowski et al. The overall procedure of the PSTG framework is summarized in Algorithm 1.
The application of the decision operator is centered on the principle of non-parametric dynamic thresholding. Its core mechanism is the determination of an optimal threshold over the raw residual sequence, chosen to maximize the decreases in the mean and standard deviation of the residuals after excluding values above the threshold, normalized by the number of anomalous residuals exceeding the threshold and the number of continuous sequences they form.
Given the optimal threshold from the optimization step, the operator then assigns a severity score to each detected anomalous sequence, quantifying the normalized magnitude of the deviation relative to the data-driven threshold.
To enhance robustness, the operator's application is extended with two refinements. First, to capture "silent failures" (i.e., anomalies manifesting as abrupt signal drops or inverted deviations), the entire optimization and scoring procedure is independently applied to a reflected residual sequence. Second, to mitigate false alarms, a false-positive pruning strategy is employed: the percent decrease between the peaks of consecutively ranked anomalous sequences is assessed, and if this decrease falls below a predefined threshold, the corresponding sequence and all subsequent, lower-ranked sequences are reclassified as normal.
| Algorithm 1 The complete PSTG algorithm |
1: Input: Training dataset; telemetry sequence for detection; model hyperparameters for training and prediction.
2: Output: Anomaly scores for the sequence.
3: Initialize the model parameters.
4: // Training phase
5: for epoch = 1 to E do
6:   for each batch do
7:     Forecast the batch with the current model.
8:     Compute the composite loss.
9:     Update the parameters.
10:   end for
11: end for
12: Save the trained forecasting model.
13: Initialize the continuous prediction. // Prediction phase
14: for each context window, advanced by the stride, until the end of the detection sequence minus F do
15:   Forecast the next F steps from the current context window.
16:   Retain the first points of the forecast (one stride).
17:   Append them to the continuous prediction.
18: end for
19: Compute residuals between the prediction and the observed telemetry. // Detection phase
20: Determine the dynamic threshold over the residuals.
21: Score anomalous sequences; apply reflected-residual analysis and false-positive pruning.
22: return the anomaly scores.
4. Experimental Results and Analysis
4.1. Description of Datasets
Experiments were conducted using multi-channel telemetry data from real spacecraft, and the proposed algorithm was compared with eleven advanced methodologies to validate its effectiveness. Notably, the adopted European Space Agency (ESA) Anomalies Dataset (ESA-AD) conceals the telemetry channel names. This prevents the algorithm from leveraging domain-specific knowledge, emphasizing a general data-driven approach rather than the complexity of specific tasks. The experiments were based on the lightweight subset from Mission 1 of the ESA-AD dataset, focusing on telemetry channels 41–46 of subsystem 5. This subset presented significant challenges due to the high number and complexity of anomalies. Given the advanced autonomous capabilities of modern spacecraft, individual telecommands exert less influence on telemetry behavior than in earlier, non-autonomous systems; consequently, the dataset used in this work excludes telecommand data. A representative segment of the Mission 1 lightweight subset is illustrated in Figure 2; the Y-axis is omitted because the channels have been normalized and vertically offset for visual clarity.
Both the training and test datasets consist of millions of telemetry samples, with the final three months of the training set designated as the validation set. The validation and test sets contain only samples that occur after those in the training set, preventing data leakage from future time points. Anomalies are present in all three subsets: training, validation, and testing. This partitioning strategy maximizes the use of available data, reflecting a mature phase of the mission in which sufficient historical data enables robust model training. Anomaly statistics across the three datasets are summarized in Table 1.
4.2. Experimental Details
The TimeEval [56] framework, as modified by Kotowski et al. [57], is employed to implement the PSTG algorithm. The algorithm was first trained on the training set, during which contamination levels were calculated, thresholds were established, and standardization parameters were determined. It was then applied to the test set for online anomaly detection, operating without access to future samples from the test sequence.
In typical spacecraft mission operations, high-dimensional telemetry data are downlinked to ground stations for intensive health monitoring. The PSTG model is therefore positioned as a ground-based diagnostic tool, with the emphasis placed on detection accuracy and interpretability rather than the extreme resource frugality required for on-orbit processing. The model was implemented in PyTorch 2.6.0 under Python 3.9.0 and trained on a single NVIDIA GeForce RTX 4090 GPU. As outlined in Section 3.2.3, the AdamW optimizer was used in conjunction with a CA learning rate scheduler. The hyperparameters adopted in the primary experiment are summarized in Table 2.
For our method and all baseline approaches, the sliding window length L was set to 250 and the prediction window length F to 10, following the standard configurations established in Telemanom. From each prediction window, only the first time step was retained for further processing.
For anomaly detection, the size of the smoothing window is computed from the test batch size, a configurable base factor, and a tuning percentage.
The hyperparameters for the proposed model were carefully tuned and are detailed in Table 3.
4.3. Evaluation Metrics
Modified performance metrics from ESA-AD were adopted to better align conventional anomaly detection measures with practical spacecraft operational requirements. These metrics include the corrected event-wise F0.5-score and the modified affiliation-based F0.5-score. Because the affiliation-based F0.5-score has a comparatively intricate formulation, a dedicated description is provided in the following subsection. The computation of these metrics excludes Point Adjustment (PA), a preprocessing protocol intended to refine anomaly predictions before evaluation. PA assumes that if any single point within an anomalous segment is correctly detected, all points in that segment are treated as correctly identified. This relaxes the detection burden on algorithms and often inflates performance scores [58]; indeed, applying PA can allow randomly generated anomaly scores to surpass several recently proposed time-series anomaly detection methods [59].
Operationally, the event-wise score reflects whether anomaly events are detected in time while controlling false alarms at the event level, which is critical for alarm triage in spacecraft operations. The affiliation-based score evaluates temporal localization quality by measuring how well predicted anomaly intervals align with the corresponding ground-truth intervals in terms of boundary proximity and coverage completeness. Reporting both metrics therefore provides complementary evidence on alarm reliability and diagnostic usability.
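The precision-weighted nature of the F0.5-score used throughout the evaluation can be illustrated with the generic F-beta formula (the precision/recall values below are illustrative, not results from the paper):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta = 0.5 weights precision four times as heavily
    as recall, matching the operational preference for few false alarms."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# a precision-heavy detector scores higher than a recall-heavy one
print(f_beta(0.9, 0.6), f_beta(0.6, 0.9))
```

Swapping precision and recall changes the score, which is exactly why F0.5 rewards detectors that raise few false alarms.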
The affiliation-based F0.5-score is a time-domain metric that evaluates how closely and completely predicted anomaly intervals match each ground-truth interval by computing the average temporal distance within each exclusive affiliation zone. Precision and recall are averaged over the N ground-truth events, each evaluated within its own affiliation zone, with distances converted to probabilities via a survival function derived from uniform random sampling.
4.4. Baseline Methods
The baseline methods under comparison encompass a diverse set of approaches, ranging from classical techniques developed by NASA engineers [6] to competitive multivariate time-series modeling baselines. To broaden the comparison across modern modeling paradigms, high-performance time-series forecasting backbones commonly adopted in forecasting-based anomaly detection pipelines were incorporated, including the recently proposed deep spatiotemporal graph neural network TimeFilter [53]. Several Transformer-based models from the past two years were also included, as they are widely recognized for achieving State-Of-The-Art (SOTA) performance in time-series forecasting, specifically iTransformer [39], PatchTST [40], and Crossformer [43]. In addition, lightweight MLP-Mixer-based alternatives such as DLinear [46] and TSMixer [41] are included to evaluate efficiency-accuracy trade-offs. Finally, the analysis considers methods that enhance deep neural networks through frequency-domain or time-frequency decomposition, such as FreTS [44] and WPMixer [42], to assess robustness under diverse inductive biases. The modern forecasting backbones included in the benchmark are summarized in Table 4.
4.5. Anomaly Detection Results
The anomaly detection results of the baseline methods and PSTG are shown in Table 5. Evidently, no baseline method performed well across all five categories of metrics.
The accurate identification of anomalous events is paramount in spacecraft anomaly detection, with a primary focus on the event-wise F0.5 score, which places greater emphasis on precision to reduce false alarms. As shown in
Table 5, PSTG achieves an event-wise F0.5 score of 0.917, significantly outperforming all competing baseline methods. In comparison, forecasting-based backbones such as PatchTST and iTransformer exhibit relatively strong recall but suffer from reduced precision, indicating a tendency to over-detect anomalies when applied to complex multivariate telemetry data. Simpler linear models (DLinear) and certain Transformer variants (Crossformer) fail to capture intricate cross-channel dependencies, resulting in substantially degraded event-wise performance.
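For reference, one common event-wise convention can be sketched as follows: a ground-truth event counts as detected if any predicted segment overlaps it, and a predicted segment overlapping no event counts as a false alarm. The benchmark's exact matching rule may differ; this sketch only illustrates why F0.5 penalizes over-detection:

```python
def event_wise_scores(pred_segments, gt_events, beta=0.5):
    """Event-wise precision/recall/F-beta over (start, end) segments.
    TP: events overlapped by at least one prediction.
    FP: predicted segments overlapping no event."""
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]
    tp = sum(any(overlaps(p, e) for p in pred_segments) for e in gt_events)
    fn = len(gt_events) - tp
    fp = sum(not any(overlaps(p, e) for e in gt_events) for p in pred_segments)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f
```

Because beta = 0.5 squares to 0.25 in the denominator, each spurious segment (which lowers precision) costs more F0.5 than a missed event of equal proportion, matching the operational preference for few false alarms.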
Beyond event-level detection, anomaly detection systems must accurately localize the temporal extent of anomalous behaviors across multiple telemetry channels. The affiliation-based metrics explicitly evaluate the temporal alignment between detected anomalies and ground-truth intervals. PSTG attains an affiliation-based F0.5 score of 0.892 and consistently surpasses all baseline methods across affiliation-based precision, recall, and F0.5 metrics. While TimeFilter, as a graph-based baseline, demonstrates competitive temporal localization capability, its reliance on patch-level or static filtration limits its ability to progressively refine spatiotemporal dependencies.
Overall, the superior performance of PSTG across both event-wise and affiliation-based evaluations highlights the effectiveness of its progressive spatiotemporal graph reasoning mechanism for reliable spacecraft anomaly detection.
4.6. Discussion on Imbalanced Anomaly Patterns
The ESA-AD subset used in this work is naturally imbalanced, where normal samples dominate and anomaly events are relatively sparse. Under this setting, the F0.5-oriented evaluation is intentionally adopted to emphasize precision and reduce operational false alarms. The dynamic thresholding strategy further mitigates spurious alarms by adapting to recent residual statistics; however, under extremely rare or weak anomaly patterns, recall degradation may still occur. These observations are reported as practical behavior under the current data regime rather than a complete robustness claim.
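As a rough illustration of such a strategy, a sliding mean-plus-k-sigma rule over recent forecasting residuals can be sketched as follows. The window size, multiplier `k`, and warm-up rule are illustrative assumptions, not the exact thresholding used by PSTG:

```python
import numpy as np

def dynamic_threshold(residuals, window=100, k=3.0, warmup=10):
    """Flag points whose residual exceeds mu + k*sigma, where mu and
    sigma are computed over the most recent `window` residuals."""
    residuals = np.asarray(residuals, dtype=float)
    flags = np.zeros(len(residuals), dtype=bool)
    for t in range(len(residuals)):
        hist = residuals[max(0, t - window):t]
        if len(hist) < warmup:      # not enough history yet
            continue
        mu, sigma = hist.mean(), hist.std()
        flags[t] = residuals[t] > mu + k * sigma
    return flags
```

Because the statistics track the recent residual level, a slow drift in prediction error raises the threshold along with it, so only abrupt departures are flagged; conversely, an anomaly barely above the local noise floor can slip under the adaptive threshold, which is the recall limitation noted above.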
4.7. Explainability Analysis
As shown in
Figure 3, relational patterns evolve progressively across network layers, illustrating how spatial, temporal, and spatiotemporal dependencies are incrementally refined. In the first layer, the adjacency matrix captures coarse spatial dependencies among nodes (
Figure 3a), while the attention map displays diffuse and unstructured temporal interactions across time steps (
Figure 3b). Following message passing in the second layer, the adjacency matrix becomes sparser and more structured (
Figure 3c), indicating that the model selectively retains salient spatial connections while suppressing weaker or noisy ones. Finally, as shown in
Figure 3d, the attention map exhibits concentrated regions of high intensity that correspond to correlated node activations across space and time, reflecting the emergence of stable spatiotemporal coupling patterns. This hierarchical refinement demonstrates that the proposed network adaptively enhances meaningful dependencies across spatial, temporal, and spatiotemporal dimensions, thereby improving both interpretability and representation stability.
To provide a rigorous mathematical foundation for the qualitative observations in
Figure 3, we further quantify the evolutionary dynamics across the two stacked reasoning layers using Shannon Entropy ($H$). This metric evaluates the uncertainty and information distribution of the learned interaction components, defined as
$$H = -\sum_{i=1}^{M}\sum_{j=1}^{M} p_{ij}\log p_{ij}, \qquad p_{ij} = \frac{w_{ij}}{\sum_{i',j'} w_{i'j'}},$$
where $M$ represents the total number of spatiotemporal nodes (channels × 10 patches) and $w_{ij}$ denotes the interaction intensity between nodes $i$ and $j$. A higher
$H$ value indicates a more diffuse information integration, while a decreasing trend signifies the emergence of “reasoning determinism”. The quantitative results reveal distinct information-theoretic behaviors for the two interaction components. The entropy of the adjacency matrix increases steadily from Layer 1 to Layer 2, suggesting that as the reasoning depth increases, the model expands its receptive field, transitioning from initial local physical constraints to a more comprehensive, global representation of the spacecraft's structural dependencies. Conversely, the attention mechanism shows a notable entropy reduction from Layer 1 to Layer 2, indicating that the model is actively “distilling” information by filtering out redundant telemetry noise and concentrating its representational capacity on a sparse subset of critical spatiotemporal interactions. This dual-process evolution, combining the expansion of structural context with the concentration of dynamic attention, mathematically confirms the PSTG model's ability to balance holistic system monitoring with precise anomaly localization, ensuring the stability and interpretability of the learned coupling patterns for complex satellite health monitoring tasks.
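Under the definition above, the entropy of a learned interaction matrix reduces to a few lines of NumPy. This is a sketch: `interaction_entropy` is an illustrative name, and the normalization assumes non-negative interaction weights:

```python
import numpy as np

def interaction_entropy(W):
    """Shannon entropy of an interaction matrix after normalising it
    into a probability distribution p_ij = w_ij / sum(W)."""
    W = np.asarray(W, dtype=float)
    p = W / W.sum()
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())
```

A uniform matrix attains the maximum entropy log(M^2), while a matrix concentrated on a single interaction has entropy 0, which is what makes rising adjacency entropy read as "broader structural context" and falling attention entropy as "concentrated focus".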
Across repeated runs, this entropy split (broader structural context with more focused attention) is consistent with the stronger and more stable Event-wise and Affiliation-based F0.5 distributions reported in
Figure 4 and
Figure 5. This entropy–performance linkage is presented as an empirical consistency analysis for interpretability support, rather than a causal proof.
Figure 4 and
Figure 5 summarize the statistical distributions of the F0.5 scores obtained from 15 independent runs across different baseline methods on the ESA dataset. Specifically,
Figure 4 reports the Event-wise F0.5 results, while
Figure 5 presents the affiliation-based F0.5 results. For each method, performance distributions are visualized using boxplots with per-run samples overlaid and sorted by the mean value, highlighting both the central tendency and variability of detection performance. This statistical comparison facilitates a direct assessment of robustness and stability of different approaches under repeated trials.
4.8. Ablation Study
To comprehensively evaluate component contributions, we report two complementary ablation settings.
Backbone-level ablation (A/B/C+GCN). We retain the original progressive degradation design: Experiment A replaces multi-scale patches with a single-scale variant, Experiment B further weakens graph construction by using single-head attention and removing top-$k$ sparsification, and Experiment C additionally replaces attention-based aggregation with a GCN. As shown in
Figure 6, performance decreases as these components are removed, confirming that multi-scale representation, sparse graph construction, and attention-based aggregation jointly support robust anomaly detection.
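For concreteness, the single-head attention scoring and top-k sparsification that Experiment B removes can be sketched as follows. Random matrices stand in for the learned projections, and the function and parameter names are illustrative rather than taken from the PSTG implementation:

```python
import numpy as np

def topk_attention_adjacency(X, k=3, seed=0):
    """Sketch of sparse graph construction: score node pairs with a
    single-head dot-product attention over node features X of shape
    (n, d), keep only the top-k logits per row, and renormalise the
    survivors with a softmax to obtain a row-stochastic adjacency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)   # (n, n) attention logits
    # mask everything below each row's k-th largest logit
    kth = np.sort(scores, axis=1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    expm = np.exp(masked - masked.max(axis=1, keepdims=True))
    return expm / expm.sum(axis=1, keepdims=True)
```

Dropping the top-k mask (Experiment B) leaves every row dense, so weak or noisy pairwise scores survive the softmax and dilute message passing, which is one plausible reading of the degradation observed in the backbone-level ablation.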
G-module-focused ablation (I–IV). We further isolate the proposed attention architecture using four settings: Experiment I (full PSTG with M+A+G), Experiment II (w/o M), Experiment III (w/o A), and Experiment IV (w/o G, replacing the structure-guided attention with standard GATv2). As shown in
Figure 7, Experiment I achieves the best overall scores. Removing M causes the largest drop in event-wise F0.5 (0.921 → 0.723), and weakening A also reduces event-wise F0.5 (0.921 → 0.830). The comparison between I and IV highlights the value of structure guidance: replacing the proposed module with GATv2 lowers event-wise F0.5 from 0.921 to 0.849, with recall slightly increasing (0.877 → 0.892) but precision dropping markedly (0.933 → 0.839), indicating over-detection without explicit structural constraints.