1. Introduction
Weather forecasting plays a vital role in agricultural production, disaster prevention, transportation, and many other societal activities. The accuracy of such forecasts is largely determined by the precision of atmospheric state estimation, which provides the initial conditions for numerical prediction systems. However, the increasing heterogeneity and volume of meteorological data collected from surface stations, upper-air soundings, and satellites pose considerable challenges to traditional assimilation-based estimation methods. These systems often struggle to integrate multi-source information with diverse spatial–temporal characteristics under a unified analytical framework [
1,
2].
To overcome these limitations, this study develops an atmospheric state estimation model based on a self-supervised graph neural network (GNN). By representing heterogeneous meteorological data as graph structures, the proposed model captures the spatial correlations between observation nodes and atmospheric state variables more effectively than conventional methods. Through self-supervised pretraining and task-specific fine-tuning, the model learns intrinsic relationships among different data sources, thereby improving estimation accuracy while reducing dependency on physical assumptions.
While improving accuracy is crucial, the interpretability of such models is equally significant in atmospheric science, where understanding the influence of different observations is essential for optimizing observation networks and guiding resource allocation [
3,
4]. To enhance transparency, this study further introduces an improved gradient-based interpretability framework, referred to as SA-Grad-CAM++, which integrates the sensitivity analysis and Grad-CAM++ methods to quantify the contribution of each observation node to the overall estimation. This approach not only strengthens the explainability of the model but also provides physically meaningful insights into the spatial structure of meteorological interactions.
Overall, the present work focuses on constructing an interpretable framework for atmospheric state estimation that integrates multi-source meteorological data, advances the modeling capacity of GNNs through self-supervised learning, and enables transparent analysis of observation importance. The experimental results demonstrate that the proposed approach achieves higher estimation accuracy and more reliable interpretability than representative graph-based baseline models, indicating its potential for enhancing both predictive performance and scientific understanding in atmospheric modeling.
2. Related Works
2.1. Graph Neural Networks in Meteorology
In recent years, deep learning techniques have developed rapidly, and the emergence of GNNs has opened a new path for processing tasks related to graph-structured data. The concept of the GNN can be traced back to the research of Scarselli et al. on graph-structured data processing [
5]. Subsequently, Kipf et al. proposed Graph Convolutional Network (GCN) [
6], which extended the idea of convolution to graph-structured data in non-Euclidean spaces. Veličković et al. proposed the Graph Attention Network (GAT) [
7]. They combined the graph neural network with the attention mechanism to assign learnable weights to the neighbor nodes and constructed an adaptive weighted feature aggregation mechanism for the neighbor nodes, thus improving the flexibility of the model.
Since meteorological observation data come from scattered monitoring stations whose locations are irregularly distributed, the graph is an ideal structure for representing meteorological data. Graph neural networks have been widely used in meteorological data analysis, such as sea surface temperature analysis and solar radiation prediction, owing to their advantages in handling irregular graph-structured data [
8,
9]. For example, HiSTGNN [
10] represents the relationship between meteorological variables at different weather stations or regions by constructing a hierarchical graph. This model helps to reveal the spatial interactions of meteorological variables for prediction and analysis of atmospheric state variables. GraphCast [
11] converts the grids of NWP systems into hierarchical graphs, which allows the GNN to exchange information over long ranges. This overcomes the limited receptive field of traditional convolutional neural networks and improves computational efficiency. In addition, some meteorological data may be missing due to the absence or malfunctioning of measurement equipment. To address this problem, Bhandari et al. [
12] used graph neural networks to predict missing atmospheric variables based on other environmental features, enhancing the accuracy of meteorological analysis in data-scarce areas. However, most of the existing models focus on studying the information interactions of nodes in spatial or temporal dimensions and lack in-depth studies on the combination of multiple sources of meteorological data.
2.2. Interpretability Analysis Methods for Graph Neural Networks
Interpretability analysis of neural network models is an important research direction. GNNs construct complex heterogeneous models by integrating the feature information of graph structures and show great potential in areas such as social network analysis and molecular structure research. However, the complexity of such models makes it difficult to understand how GNNs work. To overcome this challenge, scholars have proposed a variety of interpretability methods to reveal the internal mechanisms of GNNs, such as gradient-weighted class activation mapping (Grad-CAM) [
13] and layer-wise relevance propagation (LRP) [
14]. Based on their principles and application scenarios, researchers have categorized interpretability methods into several classes, such as gradient-based, perturbation-based, surrogate-based, and decomposition-based methods [
15]. These interpretable analysis methods not only enhance the credibility of the model but also improve the transparency and security of the model in decision-making applications.
Gradient-based interpretable analysis methods use gradients to represent the importance of inputs. Sensitivity analysis (SA) [
16] is a classical method of using gradients to interpret neural networks. SA directly uses the square of the gradient as the importance score, which is simple and efficient to compute; however, there are limitations in the method, and the sensitivity may not always accurately reflect the importance in some cases. Guided backpropagation (Guided-BP) [
16] improves on SA by setting the negative gradient to zero during the backpropagation process. Guided-BP avoids the interference of the negative gradient but suffers from the same limitations as SA. Grad-CAM [
13] is an interpretability method based on class activation mapping (CAM) [
17]. Grad-CAM uses the gradient of the output with respect to the feature maps as weights and computes the weighted sum of the node-embedded feature maps as the importance score. The limitation of this method is that it is only applicable to graph classification models and cannot be used in node classification tasks.
Gradient-based interpretability analysis methods are simple and efficient. However, there are some limitations of this type of method: (1) the gradient may not always accurately reflect the importance, and (2) in the saturated region of the model, the model output may not change significantly with the input.
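To make these two limitations concrete, here is a toy sketch (not the paper's code) of how SA scores inputs as squared gradients, and how a saturated region yields zero scores; the model and its weights are invented for illustration, and gradients are approximated by central finite differences:

```python
# Illustrative sketch: sensitivity analysis (SA) scores each input by the
# squared gradient of the output with respect to that input.

def toy_model(x):
    # hypothetical scalar model: weighted sum followed by a ReLU,
    # which creates a saturated region for negative pre-activations
    w = [0.5, -2.0, 0.1]
    s = sum(wi * xi for wi, xi in zip(w, x))
    return max(s, 0.0)

def sa_scores(f, x, eps=1e-5):
    scores = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g = (f(xp) - f(xm)) / (2 * eps)  # ∂f/∂x_i (finite difference)
        scores.append(g * g)             # SA importance: squared gradient
    return scores

# active region: scores follow the squared weights
scores = sa_scores(toy_model, [1.0, 0.2, 3.0])
# saturated region: the output does not change with the input,
# so every SA score collapses to zero (limitation (2) above)
saturated = sa_scores(toy_model, [0.0, 0.2, 0.0])
```

In the active region the ranking reflects the (squared) weights, but in the saturated region all scores vanish even though the inputs are not unimportant, illustrating why gradients may not always reflect importance.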
2.3. Interpretability Analysis Methods in Meteorology
It is important to analyze and study the interpretability of meteorological models. The traditional Forecast Sensitivity to Observations (FSO) method is based on the data assimilation system and evaluates the importance of observations by calculating the gradient of the forecast error with respect to the observations [
18,
19], but this method is limited by the structure of the system and is usually only applicable to specific systems. In recent years, neural network modeling technology has developed rapidly, and to explain the operation mechanism of the model, scholars have proposed a variety of interpretable methods for neural network models. In the field of meteorology, several studies using interpretable analysis methods to assess the impact of observations have also appeared [
20,
21,
22]. Jeon et al. [
23] extended the application of interpretable methods to the field of atmospheric sciences. CloudNine [
24] used the LRP method to aggregate the impacts according to the type of observation, space, and time, and to perform an analysis of the impacts of the observations at multiple spatial and temporal resolutions. This approach allows for the analysis of the impact of observations on specific regions and time periods and is not dependent on specific weather forecasting and numerical assimilation systems. These interpretable analysis methods can help us understand the decision-making process of the model and assess the contribution of different observations to the model output, but the node-level importance analysis of meteorological data is still in its infancy.
3. Materials and Methods
3.1. Basic Processes
In this paper, we first integrate surface, upper-air, satellite, and atmospheric-state datasets, process and filter the experimental data, and then construct a meteorological graph that represents the observation data and atmospheric state data as a graph structure. Subsequently, a network model based on graph neural networks is constructed; the model is pretrained using a self-supervised approach and then fine-tuned for specific tasks. Finally, the importance of the observation nodes is assessed using gradient-based interpretability analysis.
Figure 1 illustrates the basic workflow of this study.
3.2. Overall Architecture
The architecture of the model proposed in this paper mainly consists of a projection layer, a feature fusion layer, a graph neural network layer, and a residual convolution layer. The overall structure of the model is shown in
Figure 2:
The model first transforms the initial features to a uniform dimension through different projection layers and then fuses the node feature vectors with the geographic coordinate embeddings using a gating function. Subsequently, three graph neural network layers analyze, respectively, the spatial correlation between observation nodes, the spatial correlation between atmospheric state nodes, and the correlation between observation nodes and atmospheric state nodes. After each set of graph neural network layers, a residual convolution layer preserves the initial node features. After the graph neural network and residual convolution layers are repeated N times, the model output is obtained through a fully connected layer. The fully connected layer takes two forms: in the pre-training phase, it maps the features back to the input dimension to obtain the reconstructed features; in the fine-tuning phase, it transforms the features to the output dimension.
3.3. The Projection Layer
The projection layer transforms the initial node features and maps the different types of observation data, atmospheric state data, and position coordinate data into a unified feature space. Since data from different sources contain different variable types, this paper uses a separate projection layer for each source. The structure is shown in
Figure 3.
3.4. Feature Fusion Layer
The feature fusion layer combines geographic coordinate encoding with node feature vectors.
Figure 4 illustrates the architecture of the feature fusion module used in the proposed model. Features from two data sources are first projected into a common embedding space. The concatenated representations are then fused by a fully connected (FC) layer followed by ReLU activation and layer normalization. A residual skip connection between the input and output ensures stable gradient propagation and retains original feature information. The fused representation serves as the unified input for subsequent graph-based learning.
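The fusion step described above can be sketched as follows; the dimensions, weights, and exact layer ordering are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the feature fusion layer: concatenate the projected node
# features with the coordinate embedding, apply FC -> ReLU -> LayerNorm,
# then add a residual skip from the node-feature branch.
import math

def fc(x, W, b):  # fully connected layer: y = Wx + b
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(x):
    return [max(v, 0.0) for v in x]

def layer_norm(x, eps=1e-5):
    m = sum(x) / len(x)
    var = sum((v - m) ** 2 for v in x) / len(x)
    return [(v - m) / math.sqrt(var + eps) for v in x]

def fuse(node_feat, geo_emb, W, b):
    z = fc(node_feat + geo_emb, W, b)             # concatenation -> FC
    z = layer_norm(relu(z))                       # ReLU -> LayerNorm
    return [a + r for a, r in zip(node_feat, z)]  # residual skip

# toy 2-dim node feature + 2-dim coordinate embedding -> 2-dim output
W = [[0.1, 0.2, 0.0, 0.3],
     [0.4, 0.0, 0.1, 0.2]]
b = [0.0, 0.1]
out = fuse([1.0, 2.0], [0.5, -0.5], W, b)
```

The residual skip keeps the original node feature visible in the fused output, which is the stated purpose of the skip connection in Figure 4.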
3.5. GNN Layer
The graph neural network layer analyzes the complex relationships between nodes through a hierarchical graph neural network. Since the graph attention network can capture the importance of neighboring nodes, GAT [
7] is selected as the basic structure of the graph neural network layer in this paper. GAT has a unique advantage in handling node importance, which fits the research needs of this paper and provides support for the subsequent interpretability analysis. The structure of the graph neural network layer is shown in
Figure 5, where triangles indicate observation nodes and circles indicate atmospheric state nodes.
The graph neural network layer is divided into three parts: information interaction between observation nodes, information interaction between state nodes, and information interaction between observation nodes and state nodes. The interaction between observation nodes captures the relationships among observation nodes; similarly, the interaction between state nodes captures the relationships among state nodes. Finally, the interaction between observation nodes and state nodes fuses the features of the two node types. The hierarchical graph neural network distinguishes different types of edges in the graph and applies different weights to each edge type, which allows it to better handle the relationships between data nodes.
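As a toy illustration of the attention-weighted aggregation a GAT layer performs for a single node (single head, scalar features; the scoring function below is a stand-in for GAT's learnable attention logits, which are trained parameters in the real model):

```python
# Simplified single-node GAT-style aggregation: attention logits are
# softmax-normalized into weights, then neighbors are averaged with
# those weights.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

def gat_aggregate(h_i, neighbors, score):
    # score(h_i, h_j) plays the role of the learnable attention logit
    logits = [score(h_i, h_j) for h_j in neighbors]
    alphas = softmax(logits)  # adaptive weights over the neighborhood
    return sum(a * h_j for a, h_j in zip(alphas, neighbors))

# with a similarity-style score, the neighbor closer in feature space
# dominates the aggregation
h = gat_aggregate(1.0, [0.9, 5.0], score=lambda a, b: -abs(a - b))
```

Because the weights are computed per edge, the same mechanism can use different parameters for observation–observation, state–state, and observation–state edges, as the hierarchical design above requires.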
3.6. Residual Convolution Layer
To retain the initial node features, alleviate the gradient vanishing problem, speed up the training of the model, and improve the robustness and generalization ability of the model, this paper adds a residual convolution layer after each graph neural network layer. The residual convolution layer concatenates the output features of the graph neural network layer with the original features in the channel dimension and then uses a linear layer to obtain the fused features. The structure of the residual convolution layer is shown in
Figure 6.
In
Figure 6, the light blue blocks represent the graph-encoded features obtained through message passing in the graph convolutional layer, capturing spatial correlations among nodes. The blue-green blocks denote the fused representations generated by the residual connection that combines the graph features with the original projected features, ensuring both global smoothness and local detail preservation.
3.7. Network Model Training Process
The training process of the model consists of two phases: pre-training and fine-tuning. In pre-training, a generative self-supervised strategy is employed for node feature reconstruction by randomly masking input nodes. This approach is inspired by recent advances in graph self-supervised learning frameworks such as GraphMAE [
25], which enable models to learn intrinsic node correlations without labeled supervision. In the fine-tuning stage, the model is further trained according to the actual task by minimizing the loss value to improve the prediction performance of the model.
3.7.1. Loss Function
In this paper, the mean square error (MSE) is used as the loss function. The MSE is a common indicator of the deviation of predicted values from actual values and is calculated as Equation (1):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2 \quad (1)$$

where $\hat{y}_i$ and $y_i$ denote the predicted and actual values, respectively, and $n$ is the total number of samples.
The smaller the MSE, the closer the model's predictions are to the true values and the higher its accuracy; an MSE of 0 indicates perfectly accurate predictions, while a larger MSE indicates lower prediction accuracy.
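A minimal numeric check of the MSE definition above (the values are illustrative):

```python
# MSE: average of squared differences between predictions and targets.

def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

loss = mse([2.0, 3.0, 4.0], [2.0, 2.0, 6.0])  # errors: 0, 1, -2
perfect = mse([1.0, 2.0], [1.0, 2.0])          # exact predictions -> 0
```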
3.7.2. Self-Supervised Pre-Training Stage
In the pre-training phase, this paper uses a generative self-supervised learning strategy to train the model to perform node feature reconstruction after randomly masking some of the nodes so that the model learns the correlation between the nodes. The specific flow of pre-training is shown in
Figure 7.
First, a certain proportion of nodes are randomly selected from the model input data for masking, i.e., the feature values of these nodes are set to 0. Then the model is trained to reconstruct the node features, and the MSE loss function is used to measure the gap between the reconstructed features and the actual features, so that the model can adequately learn the complex relationship between the node features by minimizing the loss value. Finally, the model obtained from pre-training is retained.
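The masking step can be sketched as follows; the mask ratio, seed, and toy features are illustrative (the actual proportion used in the experiments is given in Section 4.4):

```python
# Randomly select a proportion of nodes and zero out their feature
# vectors; the model is then trained to reconstruct the masked features.
import random

def mask_nodes(features, ratio, seed=0):
    rng = random.Random(seed)
    n = len(features)
    masked_idx = set(rng.sample(range(n), int(n * ratio)))
    masked = [[0.0] * len(f) if i in masked_idx else list(f)
              for i, f in enumerate(features)]
    return masked, masked_idx

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked, idx = mask_nodes(feats, ratio=0.5)
```

The reconstruction loss (MSE between reconstructed and original features) is then computed only against the original, unmasked values.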
3.7.3. Fine-Tuning Stage
The fine-tuning phase builds on the model obtained from the pre-training to further train the model for specific tasks such as atmospheric temperature estimation. The flow of the fine-tuning phase is shown in
Figure 8.
The input data is first passed through the atmospheric state estimation model to obtain state estimates. In this stage, the fully connected layer at the end of the atmospheric state estimation model converts the features to the model output dimension to obtain an estimate of the atmospheric state. The mean square error between the estimated and actual values of the atmospheric state is then calculated, and by minimizing the mean square error, the model is able to accurately estimate the atmospheric state.
3.8. Interpretable Method
The SA-Grad-CAM++ method proposed in this paper combines the features of both SA [
16] and Grad-CAM++ [
26]. Grad-CAM++ considers only the importance of each node in the last feature map; it extracts deep information from that map, but the last feature map may deviate from the original nodes, so it yields only coarse importance information. SA, in contrast, derives node importance by computing the gradient of the output with respect to each input node and therefore provides fine-grained, per-node importance. Combining the two yields a more accurate representation of node importance. The structure of the method is shown in
Figure 9.
The traditional Grad-CAM++ method is mainly used in classification tasks. It first computes the partial derivatives of the class score with respect to the feature maps to obtain the weights and then calculates a weighted average of the feature maps with these weights. Atmospheric state estimation in this paper, however, is a node-level prediction task with no notion of a class score, so this paper replaces the class score in the traditional Grad-CAM++ method with the negative of the mean square error between the predicted and true values; the modified weight is calculated as Equation (2):

$$w_k^{c} = \sum_{i} \alpha_{ki}^{c}\,\mathrm{ReLU}\!\left(\frac{\partial\left(-\mathcal{L}_{\mathrm{MSE}}\right)}{\partial A_{ki}}\right) \quad (2)$$

where $w_k^{c}$ represents the contribution of the $k$-th feature map to the prediction of variable $c$; $\alpha_{ki}^{c}$ is the local gradient weighting coefficient; $A_{ki}$ denotes the activation of feature map $k$ at position $i$; and $\mathrm{ReLU}$ filters positive gradient effects. The negative sign ensures that the gradient is taken with respect to the minimization of the mean square error loss.
In addition, in the traditional Grad-CAM++, a ReLU function is applied after the weighted summation of the feature maps, setting the importance of points that negatively affect the result to 0. To preserve the ordering of these negatively contributing points, the ReLU is removed in this paper, and the modified Grad-CAM++ map is calculated as Equation (3):

$$L^{c} = \sum_{k} w_k^{c}\,A^{k} \quad (3)$$

where $L^{c}$ denotes the final node-level importance map obtained by the weighted summation of the feature maps $A^{k}$ with the corresponding weights $w_k^{c}$.
For the sensitivity analysis method, the final squaring operation is removed. Finally, the Grad-CAM++ result is normalized and fused with the result obtained from the sensitivity analysis, calculated as Equation (4):

$$M = S \odot \tilde{L}^{c} + S \quad (4)$$

where $M$ represents the final fused interpretation map combining sensitivity analysis and Grad-CAM++; $S$ denotes the saliency map obtained through SA, which quantifies the influence of input perturbations on the model output; $\tilde{L}^{c}$ is the normalized Grad-CAM++ map; the operator $\odot$ indicates the element-wise (Hadamard) product between the two saliency maps, emphasizing regions that are simultaneously sensitive and highly activated; and the additive term $S$ preserves the sensitivity information to avoid complete dependence on the Grad-CAM++ activations.
This method combines the advantages of sensitivity analysis and Grad-CAM++ to obtain a more accurate node importance.
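A minimal sketch of the fusion in Equation (4); the min–max normalization of the Grad-CAM++ map is an assumption here, since the exact normalization is not specified above:

```python
# Fuse an SA saliency map with a (normalized) Grad-CAM++ map:
# elementwise product plus the SA map itself, so sensitivity
# information survives even where activations vanish.

def minmax(xs):
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(v - lo) / (hi - lo) for v in xs]

def fuse_importance(sa_map, cam_map):
    cam_n = minmax(cam_map)                        # normalized Grad-CAM++ map
    return [s * c + s for s, c in zip(sa_map, cam_n)]  # S ⊙ L̃ + S

m = fuse_importance([0.2, 0.8, 0.1], [1.0, 3.0, 2.0])
```

Note that where the normalized Grad-CAM++ value is 0, the fused score falls back to the SA score alone, matching the role of the additive term in Equation (4).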
4. Results and Analysis
4.1. Datasets
In this study, meteorological data covering central China (100–120° E, 22–42° N) from 1 January 2020 to 31 December 2023 were collected to construct a heterogeneous dataset that integrates real observations and atmospheric state data.
The observational data include three components [
27]: (1) Surface observations (SURF) from the Global Surface Integrated Dataset compiled by the National Meteorological Information Center (NMIC), which provide daily records of temperature, precipitation, pressure, and wind speed; (2) Upper-air observations (UPAR) from the Global Upper-air Composite Dataset by NMIC, containing temperature, humidity, and wind data near the 500 hPa level to represent stable mid-tropospheric conditions; (3) Satellite observations (MOD11C1) from the Terra MODIS Land Surface Temperature/Emissivity Daily Product, offering 0.05° resolution measurements of land surface temperature (LST) for both day and night.
The atmospheric state data are obtained from ERA5 [
28], the Fifth-Generation ECMWF Reanalysis Dataset, which provides gridded variables including temperature, wind speed, and precipitation at a 0.25° spatial resolution.
All datasets were spatially and temporally aligned and normalized to form a unified graph-based representation of the atmospheric system. The details of each dataset and their variables are summarized in
Table 1.
The ERA5 dataset provides reanalysis variables that serve as atmospheric state inputs and prediction targets. The SURF dataset represents surface meteorological observations, offering ground-based temperature, precipitation, and wind data. The UPAR dataset, collected near the 500 hPa level, provides vertical structure information on temperature, humidity, and wind. The MOD11C1 product offers satellite-derived land surface temperature measurements, complementing the sparse surface station coverage in complex terrains. All variables were regridded and standardized before model training.
Figure 10 illustrates the distribution of nodes in each dataset (using data from 1 March 2020 as an example). The figure shows that ground and satellite observation nodes are relatively dense, while upper-air observation nodes are relatively sparse. The geographical distribution of ground and upper-air observation nodes is uneven, whereas satellite observation nodes are arranged on a regular grid; however, some satellite nodes may have missing data, likely because thick cloud cover makes it difficult for the on-board sensors to measure the surface.
4.2. Data Preprocessing
All meteorological datasets described in
Section 4.1 were spatially and temporally aligned to construct a unified heterogeneous graph for model training. The raw datasets are downloaded from platforms such as CDS as files in NetCDF, TXT, or HDF4 format. The preprocessing workflow consisted of six main steps, as illustrated in
Figure 11.
(1) Temporal synchronization: Each dataset was converted to a daily time scale covering the period from 1 January 2020 to 31 December 2023. For datasets with higher temporal resolution (e.g., ERA5 hourly and UPAR twice-daily observations), daily averages were calculated. Missing dates were filled by linear interpolation along the time dimension to ensure temporal continuity.
(2) Spatial harmonization: Observation coordinates were confined to the study area (100–120° E, 22–42° N). ERA5 gridded data were resampled to 0.25° × 0.25° resolution using bilinear interpolation. Point-based observations (SURF and UPAR) were matched to their nearest ERA5 grid cell, while MODIS pixels were averaged within each grid cell. Elevation information was extracted from the ERA5 topography field and assigned to every node to account for orographic effects.
(3) Variable standardization: All variables were converted to consistent units: temperature (°C), pressure (hPa), wind (m/s), and humidity (%). Before model input, each feature dimension was standardized to achieve zero mean and unit variance.
(4) Node and edge construction: After cleaning and merging, every record was represented as a node containing the feature vector composed of relevant meteorological variables and spatial coordinates (latitude, longitude, elevation). Three types of edges were created according to spatial distance: state-to-state (SS), observation-to-observation (OO), and observation-to-state (OS) links. Edges were established when the haversine distance between nodes was within 300 km, enabling local information exchange among neighboring stations or grid points.
(5) Dataset division: To ensure reproducibility and avoid temporal leakage, the data were split chronologically: training set from 2020 to 2022 (36 months); test set from 2023 (12 months). The split was not randomized, to preserve temporal dependency. Within the training period, 10% of samples were held out for validation and early stopping.
(6) Feature fusion and normalization check: After preprocessing, heterogeneous node features from the four data sources were projected into a unified embedding space of 32 dimensions. Feature distributions were examined to confirm numerical stability before pre-training.
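Step (4) above, distance-thresholded edge construction, can be sketched as follows; the coordinates and the brute-force pair loop are illustrative, and edge typing by node kind (SS/OO/OS) is omitted for brevity:

```python
# Connect node pairs whose great-circle (haversine) distance is
# within 300 km.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def build_edges(nodes, max_km=300.0):
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if haversine_km(*nodes[i], *nodes[j]) <= max_km:
                edges.append((i, j))
    return edges

# two nearby nodes (~70 km apart) and one distant node (>1000 km)
nodes = [(30.0, 110.0), (30.5, 110.5), (40.0, 118.0)]  # (lat, lon)
edges = build_edges(nodes)
```

In practice a spatial index (e.g. a k-d tree on projected coordinates) would replace the O(n²) loop for large graphs.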
4.3. Evaluation Metrics
To adequately assess the performance of the network, three common evaluation metrics are used in this paper: root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²).
(1) Root mean square error (RMSE) is the square root of the mean square error (MSE). It is calculated by Equation (5):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \quad (5)$$

where $\hat{y}_i$ denotes the predicted value, $y_i$ denotes the actual value, and $n$ is the number of samples.
Since the RMSE has the same units as the original data, it is more intuitive in size. Its disadvantage is that due to the presence of the squaring operation, the RMSE is more sensitive to large errors and may amplify the results due to a few large errors.
(2) Mean absolute error (MAE) measures the average absolute deviation between the predicted and true values: the absolute differences between predictions and true values are averaged, as shown in Equation (6):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \quad (6)$$
The smaller the MAE, the smaller the difference between the model’s predictions and the actual values, and the more accurate the model. Compared to RMSE, MAE is less sensitive to outliers and will not be overly affected by a small number of large errors.
(3) The coefficient of determination (R²) is a key indicator for assessing how well the regression model fits the data. It is calculated by Equation (7):

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \quad (7)$$

where $\bar{y}$ is the average of the true values. R² usually ranges from 0 to 1: the closer R² is to 1, the better the model fits the data, while a smaller R² indicates a worse fit.
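The three metrics can be implemented directly from Equations (5)–(7); the values below are illustrative:

```python
# RMSE, MAE, and R² computed on toy predictions and targets.
import math

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def r2(pred, true):
    mean_t = sum(true) / len(true)
    ss_res = sum((p - t) ** 2 for p, t in zip(pred, true))  # residual sum
    ss_tot = sum((t - mean_t) ** 2 for t in true)           # total sum
    return 1.0 - ss_res / ss_tot

p, t = [2.0, 4.0, 6.0], [1.0, 5.0, 6.0]
```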
4.4. Experimental Settings
The network model in this paper is built on the PyTorch 2.6 deep learning framework. The experimental environment is a laptop running 64-bit Windows 11 with an AMD R7-8745H CPU and a GeForce RTX 4060 GPU.
The training process of the model is divided into two phases: pre-training and fine-tuning. In the pre-training phase, the model adopts a generative self-supervised learning strategy to reconstruct the initial node features and thus learn the complex relationship between the observation data and the atmospheric state data: the features of 70% of the input nodes are randomly masked, and the model is trained to reconstruct them. In the fine-tuning stage, the model parameters are further optimized by minimizing the prediction error to improve prediction performance. The optimizer is Adam, which adaptively adjusts the learning rate; to tune the learning rate more effectively, a decay is additionally applied every 50 epochs, multiplying the learning rate by 0.9. The hyperparameters used in this experiment are shown in
Table 2.
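The decay schedule described above can be written as a plain function (the base learning rate here is a generic placeholder, not necessarily the value in Table 2):

```python
# Step decay: multiply the learning rate by 0.9 every 50 epochs,
# equivalent to a StepLR-style schedule.

def lr_at(epoch, base_lr=1e-3, step=50, gamma=0.9):
    return base_lr * gamma ** (epoch // step)

lrs = [lr_at(e) for e in (0, 49, 50, 100)]
```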
4.5. Evaluation of Atmospheric State Estimation Network Model
4.5.1. Comparison Experiment
In this paper, the proposed model is compared with baseline models such as GCN [
6], GAT [
7] and GraphSAGE [
29]. The experimental data and preprocessing procedures are the same as in
Section 4.1 and
Section 4.2. All reported results are obtained using the held-out test dataset (year 2023) to ensure fair comparison and avoid data leakage between training and testing. The experimental results are shown in
Table 3.
The experimental results show that the proposed model outperforms the existing GNN baselines on all evaluation metrics, indicating that the proposed atmospheric state estimation model has better fitting ability.
4.5.2. Ablation Experiment
To evaluate the effectiveness of each component of the proposed atmospheric state estimation model, this experiment compares the full model against variants without pre-training, without the projection and feature fusion layers, and without the residual convolution layer.
(1) Removing the pre-training phase: keep the model unchanged, skip the pre-training phase, and proceed directly to the fine-tuning phase.
(2) Removing the projection and feature fusion layers: the projection and feature fusion layers are replaced with simple feature concatenation.
(3) Removing the residual convolution layer: the residual convolution layer was removed and the rest of the model was kept unchanged.
Ten experiments were conducted for each case separately and the average value was taken as the experimental result. The experimental results are shown in
Table 4.
The experimental results show that the accuracy of the model without the projection and feature fusion layers, without the residual convolution layer, or without pre-training is lower than that of the full model, suggesting that each component of the model is effective.
4.6. Performance Evaluation of Interpretability Methods
4.6.1. Results
This experiment quantifies the importance of initial data nodes for atmospheric state estimation using gradient-based interpretable analysis methods, including SA [
16], Guided-BP [
16], Grad-CAM++ [
26], and the SA-Grad-CAM++ method proposed in this paper.
Figure 12 demonstrates the computational results of each method, in which the pink points indicate the atmospheric state nodes, the green points indicate the observation nodes, and the node size indicates the importance of the nodes, with larger nodes indicating that the nodes are more important.
As can be seen from
Figure 12, the results obtained by SA and Guided-BP are more fine-grained, with clearer differences in the importance of neighboring nodes, while the results obtained by Grad-CAM++ are more homogeneous. This is because SA and Guided-BP compute the gradient at each input node, whereas Grad-CAM++ considers only the last feature map, whose correspondence to the model inputs is looser, and thus yields blurrier results.
From an atmospheric perspective, the interpretability analysis provides insight into how the model recognizes spatial–temporal relationships among meteorological variables. The high-importance regions identified by the SA-Grad-CAM++ analysis correspond to areas with strong temperature gradients and wind shear, which are physically consistent with regions of enhanced energy exchange and convective activity. This confirms that the model does not rely on random correlations but effectively captures physically meaningful atmospheric patterns. Therefore, the interpretability results bridge the gap between the machine learning model and the underlying meteorological processes, providing confidence in the physical reliability of the predictions.
4.6.2. Interpretable Comparison Experiment
To evaluate the performance of the interpretability methods, a set of comparative experiments is conducted in which several representative gradient-based methods are compared with the method proposed in this paper. First, each method is applied to the same model and input data. Then, nodes in the meteorological graph are removed sequentially in order of computed importance, once from high to low and once from low to high. Model performance is re-evaluated after each additional 5% of nodes is removed, and the resulting RMSE curves are plotted in Figure 13.
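The node-removal protocol above can be sketched as follows. The names here are hypothetical: `rmse_fn` stands in for an evaluation hook that returns the model's RMSE given a boolean mask of retained nodes, which is not specified in the paper.

```python
import numpy as np

def removal_curve(rmse_fn, importance, step=0.05):
    """Evaluate RMSE as nodes are removed in order of importance.
    Returns the removal fractions and one RMSE curve per direction
    (most-important-first and least-important-first)."""
    n = importance.size
    order_desc = np.argsort(-importance)   # most important node first
    order_asc = order_desc[::-1]           # least important node first
    fracs = np.arange(0.0, 1.0 + 1e-9, step)
    curves = {}
    for name, order in (("high_to_low", order_desc), ("low_to_high", order_asc)):
        rmses = []
        for f in fracs:
            k = int(round(f * n))
            mask = np.ones(n, dtype=bool)
            mask[order[:k]] = False        # drop the first k nodes in this order
            rmses.append(rmse_fn(mask))
        curves[name] = np.array(rmses)
    return fracs, curves

# Toy check: let "RMSE" be the total importance of the removed nodes,
# so removing important nodes first should degrade performance fastest.
imp = np.array([0.4, 0.1, 0.3, 0.2])
fracs, curves = removal_curve(lambda mask: imp[~mask].sum(), imp, step=0.25)
```

A good importance ranking makes the high-to-low curve rise quickly and the low-to-high curve rise slowly, which is exactly the gap the ADC metric below quantifies.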
Figure 13 compares the robustness of different interpretability-enhanced models under increasing proportions of removed nodes. The baseline model (L1) shows a rapid increase in RMSE as node information becomes sparser, while models incorporating interpretability mechanisms (L2) demonstrate improved resilience.
Among them, the proposed SA-Grad-CAM++ model (Figure 13d) exhibits the slowest RMSE growth, indicating the highest robustness. This improvement results from combining sensitivity-based global awareness (SA) with activation-based local attribution (Grad-CAM++), enabling the model to maintain more stable feature importance representations even under node loss.
To quantify the performance of each interpretability method, the difference between the areas under the two removal curves (ADC) was calculated, as shown in Table 5. The comparison shows that the proposed method outperforms the baseline methods, indicating that it is effective and better ranks the importance of the observation nodes.
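Assuming ADC is defined as the difference between the areas under the high-to-low and low-to-high removal curves (the paper does not give the formula explicitly), it can be computed with the trapezoidal rule:

```python
import numpy as np

def adc(fracs, rmse_high_to_low, rmse_low_to_high):
    """Area Difference between Curves: area under the high-to-low removal
    curve minus area under the low-to-high curve (trapezoidal rule).
    A larger ADC means that removing nodes ranked as important hurts the
    model far more than removing nodes ranked as unimportant, i.e. the
    importance ranking is more faithful."""
    x = np.asarray(fracs, dtype=float)

    def area(y):
        y = np.asarray(y, dtype=float)
        return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

    return area(rmse_high_to_low) - area(rmse_low_to_high)

# Toy example: RMSE jumps immediately in one direction, only at the end
# in the other, giving a clearly positive ADC.
score = adc([0.0, 0.5, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
```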
4.6.3. Interpretable Ablation Experiment
The interpretability method proposed in this paper modifies the original methods by changing the differentiation target: instead of taking partial derivatives of the model output with respect to the feature maps, it takes partial derivatives of the negative mean squared error (−MSE) between the predicted and true values. To verify the validity of this improvement, a set of controlled experiments compares the performance of each baseline interpretability method when using either the model output or −MSE as the differentiation target. The results are shown in Table 6.
It can be seen that SA and Guided-BP perform better when the model output is used as the differentiation target, whereas Grad-CAM++ performs better when −MSE is used. Therefore, this paper combines SA with the model output as its target and Grad-CAM++ with −MSE as its target.
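The effect of swapping the differentiation target can be illustrated with a toy scalar model (the names and the finite-difference gradients below are hypothetical stand-ins for backpropagation through the real network). By the chain rule, the gradient of −MSE with respect to the inputs is the gradient of the model output rescaled by −2(y − y_true), so the two targets yield proportionally related, but not identical, attributions.

```python
import numpy as np

def grad_wrt_input(target_fn, x, eps=1e-5):
    """Finite-difference gradient of a scalar target with respect to x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (target_fn(x + e) - target_fn(x - e)) / (2 * eps)
    return g

w = np.array([1.0, -0.5, 2.0])
model = lambda x: float(w @ x)     # toy scalar model
x0 = np.array([0.2, 0.4, 0.1])
y_true = 1.0

g_out = grad_wrt_input(model, x0)                                # target: model output
g_mse = grad_wrt_input(lambda x: -(model(x) - y_true) ** 2, x0)  # target: -MSE

# d(-MSE)/dx = -2 * (y - y_true) * dy/dx, so g_mse is g_out scaled by
# a single scalar factor that depends on the current prediction error.
```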
To verify that these conclusions still hold after SA and Grad-CAM++ are combined in the proposed method, a further set of experiments evaluates the combined method under different differentiation targets; the results are shown in Table 7.
The experimental results show that the combination of differentiation targets adopted in this paper achieves the best performance. In addition, compared with the results of the SA or Grad-CAM++ methods used alone in Table 5, Table 6 and Table 7, the proposed method yields better interpretability analysis, indicating that the proposed combination of the two methods is effective and better quantifies the importance of the observation nodes.
5. Discussion
The proposed framework achieves higher estimation accuracy and interpretability compared with conventional graph-based models, but these benefits are accompanied by additional computational demands. The use of self-supervised pretraining and hierarchical attention layers increases the number of model operations and training iterations relative to a standard GNN. Moreover, the integration of residual convolution blocks and the gradient-based interpretability analysis introduces extra forward and backward passes during training and evaluation. Although this results in longer training and inference times, the additional cost remains moderate and acceptable considering the improvement in predictive performance and transparency. In the context of meteorological modeling, where accuracy and reliability are of paramount importance, such a trade-off between computational cost and model interpretability is justified.
Beyond performance gains, the incorporation of interpretability analysis enhances the physical insight derived from data-driven modeling. The improved SA-Grad-CAM++ approach reveals how different observation nodes contribute to atmospheric state estimation, highlighting regions that correspond to meaningful meteorological processes such as strong temperature gradients or dynamic atmospheric interactions. This capability strengthens scientific understanding by aligning model reasoning with known physical mechanisms, thereby improving trust in the learned representations and supporting data-driven observation network optimization.
Nevertheless, some limitations remain. The current model assumes relatively moderate topographic variation and focuses primarily on regional-scale data. Its performance in highly complex or data-sparse regions may be influenced by spatial aliasing or limited feature diversity. Future studies could incorporate additional geophysical descriptors to enhance spatial representation. Further exploration of temporal graph mechanisms or hybrid physics–data frameworks could also improve generalization. Optimizing the interpretability module to reduce redundant computations, for example through sparse gradient approximation or modular analysis, may further balance interpretability and efficiency.
In summary, the proposed self-supervised and interpretable GNN framework establishes a balance between computational demand, predictive accuracy, and model transparency. While the approach requires slightly higher computational effort due to its integrated learning and interpretability mechanisms, it provides benefits in terms of both estimation reliability and physical interpretability.
6. Conclusions
This study developed an interpretable framework for atmospheric state estimation that integrates multi-source meteorological observations through a self-supervised GNN architecture. By representing heterogeneous data from surface, upper-air, satellite, and reanalysis sources as graph structures, the proposed model effectively captured spatial correlations among observation nodes and atmospheric variables. The self-supervised pretraining strategy enabled the network to learn intrinsic relationships across data sources before task-specific fine-tuning, thereby improving the robustness and accuracy of atmospheric state estimation compared with several representative baseline models.
Beyond predictive performance, the study emphasized the interpretability of the model outputs, which is essential for understanding the contribution of different observations and for optimizing observation networks in meteorological analysis. To achieve this, an enhanced gradient-based interpretability method, SA-Grad-CAM++, was introduced by combining Sensitivity Analysis with Grad-CAM++ to obtain fine-grained node-level importance maps. The interpretability experiments demonstrated that this hybrid approach provides more precise and physically meaningful attributions than existing gradient-based methods, reinforcing confidence in the scientific reliability of the model’s inferences.
Overall, the results show that the integration of self-supervised learning and interpretable GNN modeling offers a promising pathway for accurate and transparent atmospheric state estimation. This unified framework not only advances data-driven modeling of heterogeneous meteorological systems but also strengthens the interpretability of machine learning approaches in weather and climate studies. Future research may extend this work by incorporating additional terrain and land-use attributes, exploring temporal graph mechanisms, and applying the framework to broader regional or global forecasting tasks.
Author Contributions
Conceptualization, F.B. and W.Z.; methodology, S.L. and C.W.; software, Y.L. and C.W.; validation, G.X. and C.W.; formal analysis, G.X. and S.L.; investigation, G.X. and S.L.; resources, W.Z. and F.B.; data curation, Y.L. and C.W.; writing—original draft preparation, G.X., F.B. and W.Z.; writing—review and editing, G.X., F.B. and W.Z.; visualization, C.W. and G.X.; supervision, S.L.; project administration, W.Z.; funding acquisition, F.B. and W.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Sichuan Science and Technology Program [2023YFSY0026, 2023YFH0004].
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Waqas, M.; Humphries, U.W.; Chueasa, B.; Wangwongchai, A. Artificial intelligence and numerical weather prediction models: A technical survey. Nat. Hazards Res. 2025, 5, 306–320. [Google Scholar] [CrossRef]
- Cheng, S.; Quilodrán-Casas, C.; Ouala, S.; Farchi, A.; Liu, C.; Tandeo, P.; Fablet, R.; Lucor, D.; Iooss, B.; Brajard, J.; et al. Machine Learning With Data Assimilation and Uncertainty Quantification for Dynamical Systems: A Review. IEEE/CAA J. Autom. Sin. 2023, 10, 1361–1387. [Google Scholar] [CrossRef]
- Yang, R.; Hu, J.; Li, Z.; Mu, J.; Yu, T.; Xia, J.; Li, X.; Dasgupta, A.; Xiong, H. Interpretable machine learning for weather and climate prediction: A review. Atmos. Environ. 2024, 338, 120797. [Google Scholar] [CrossRef]
- Zhang, H.; Liu, Y.; Zhang, C.; Li, N. Machine Learning Methods for Weather Forecasting: A Survey. Atmosphere 2025, 16, 82. [Google Scholar] [CrossRef]
- Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef]
- Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
- Jeon, H.-J.; Choi, M.-W.; Lee, O.-J. Day-ahead hourly solar irradiance forecasting based on multi-attributed spatio-temporal graph convolutional network. Sensors 2022, 22, 7179. [Google Scholar] [CrossRef]
- Yang, Y.; Dong, J.; Sun, X.; Lima, E.; Mu, Q.; Wang, X. A CFCC-LSTM model for sea surface temperature prediction. IEEE Geosci. Remote Sens. Lett. 2017, 15, 207–211. [Google Scholar] [CrossRef]
- Ma, M.; Xie, P.; Teng, F.; Wang, B.; Ji, S.; Zhang, J.; Li, T. HiSTGNN: Hierarchical spatio-temporal graph neural network for weather forecasting. Inf. Sci. 2023, 648, 119580. [Google Scholar] [CrossRef]
- Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Alet, F.; Ravuri, S.; Ewalds, T.; Eaton-Rosen, Z.; Hu, W. Learning skillful medium-range global weather forecasting. Science 2023, 382, 1416–1421. [Google Scholar] [CrossRef]
- Bhandari, H.C.; Pandeya, Y.R.; Jha, K.; Jha, S. Recent advances in electrical engineering: Exploring graph neural networks for weather prediction in data-scarce environments. Environ. Res. Commun. 2024, 6, 105010. [Google Scholar] [CrossRef]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 618–626. [Google Scholar]
- Schwarzenberg, R.; Hübner, M.; Harbecke, D.; Alt, C.; Hennig, L. Layerwise relevance visualization in convolutional text graph classifiers. arXiv 2019, arXiv:1909.10911. [Google Scholar] [CrossRef]
- Yuan, H.; Yu, H.; Gui, S.; Ji, S. Explainability in graph neural networks: A taxonomic survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5782–5799. [Google Scholar] [CrossRef] [PubMed]
- Baldassarre, F.; Azizpour, H. Explainability techniques for graph convolutional networks. arXiv 2019, arXiv:1905.13686. [Google Scholar] [CrossRef]
- Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2921–2929. [Google Scholar]
- Kalnay, E.; Ota, Y.; Miyoshi, T.; Liu, J. A simpler formulation of forecast sensitivity to observations: Application to ensemble Kalman filters. Tellus A Dyn. Meteorol. Oceanogr. 2012, 64, 18462. [Google Scholar] [CrossRef]
- Buehner, M.; Du, P.; Bédard, J. A new approach for estimating the observation impact in ensemble–variational data assimilation. Mon. Weather. Rev. 2018, 146, 447–465. [Google Scholar] [CrossRef]
- Pope, P.E.; Kolouri, S.; Rostami, M.; Martin, C.E.; Hoffmann, H. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 10772–10781. [Google Scholar]
- Ying, Z.; Bourgeois, D.; You, J.; Zitnik, M.; Leskovec, J. Gnnexplainer: Generating explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
- Vu, M.; Thai, M.T. Pgm-explainer: Probabilistic graphical model explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 2020, 33, 12225–12235. [Google Scholar]
- Jeon, H.-J.; Kang, J.-H.; Kwon, I.-H.; Lee, O. Explainable graph neural networks for observation impact analysis in atmospheric state estimation. arXiv 2024, arXiv:2403.17384. [Google Scholar] [CrossRef]
- Jeon, H.-J.; Kang, J.-H.; Kwon, I.-H.; Lee, O. Cloudnine: Analyzing meteorological observation impact on weather prediction using explainable graph neural networks. arXiv 2024, arXiv:2402.14861. [Google Scholar] [CrossRef]
- Hou, Z.; Liu, X.; Cen, Y.; Dong, Y.; Yang, H.; Wang, C.; Tang, J. Graphmae: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 594–604. [Google Scholar]
- Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2018; pp. 839–847. [Google Scholar]
- Wan, Z.; Hook, S.; Hulley, G. MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 0.05 Deg CMG V061 [Data Set]; NASA EOSDIS Land Processes Distributed Active Archive Center (DAAC): Sioux Falls, SD, USA, 2021. [Google Scholar]
- Hersbach, H.; Comyn-Platt, E.; Bell, B.; Berrisford, P.; Biavati, G.; Horányi, A.; Sabater, J.M.; Nicolas, J.; Peubey, C.; Radu, R. ERA5 Post-Processed Daily-Statistics on Pressure Levels from 1940 to Present; Copernicus Climate Change Service (C3S) Climate Data Store (CDS): Reading, UK, 2023. [Google Scholar]
- Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1025–1035. [Google Scholar]
Figure 1.
Basic Processes.
Figure 2.
Network structure diagram of the atmospheric state estimation model.
Figure 3.
Structure of the projection layer.
Figure 4.
Structure of the feature fusion layer; the operator symbols denote element-wise multiplication and residual addition, respectively.
Figure 5.
Structure of the graph neural network layer.
Figure 6.
Structure of the residual convolution layer.
Figure 7.
Pre-training flowchart.
Figure 8.
Fine-tuning flowchart.
Figure 9.
Structure of the SA-Grad-CAM++.
Figure 10.
Spatial distributions of heterogeneous meteorological data sources on 1 March 2020. (a) ERA5 atmospheric state nodes; (b) SURF surface stations; (c) UPAR upper-air soundings, and (d) MOD11 satellite pixels.
Figure 11.
Data processing flowchart.
Figure 12.
Node importance map. (a) SA; (b) Guided-BP; (c) Grad-CAM++; (d) proposed SA-Grad-CAM++.
Figure 13.
Performance comparison of interpretability analysis methods. L1 (baseline GNN model); L2 (interpretability-enhanced model). (a) SA; (b) Guided-BP; (c) Grad-CAM++; (d) proposed SA-Grad-CAM++.
Table 1.
Description of meteorological datasets and variables used in this study.
| Dataset | Temporal Resolution | Spatial Resolution | Variable | Unit |
|---|---|---|---|---|
| ERA5 | Hourly → Daily | 0.25° × 0.25° | 2 m Air Temperature (AT) | °C |
| | | | Relative Humidity (HUR) | % |
| | | | 10 m Zonal Wind Component (UA) | m/s |
| | | | 10 m Meridional Wind Component (VA) | m/s |
| | | | Elevation | m |
| SURF | Daily | Irregular stations | Surface Air Temperature (TAVG) | °C |
| | | | Precipitation (PRCP) | mm |
| | | | Surface Pressure (PS) | hPa |
| | | | Wind Speed (WA) | m/s |
| | | | Elevation | m |
| UPAR | Twice daily (00, 12 UTC) | Radiosonde stations (~500 hPa) | Upper-air Temperature (TAVG) | °C |
| | | | Relative Humidity (HUR) | % |
| | | | Zonal Wind (UA) | m/s |
| | | | Meridional Wind (VA) | m/s |
| | | | Elevation | m |
| MOD11C1 | Daily | 0.05° × 0.05° | Daytime Land Surface Temperature (TD) | °C |
| | | | Nighttime Land Surface Temperature (TN) | °C |
| | | | Daily Mean Land Surface Temperature (TAVG) | °C |
| | | | Elevation | m |
Table 2.
Hyperparameter settings.
| Hyperparameters | Value |
|---|---|
| Batch size | 460 |
| Optimizer | Adam |
| Learning rate (pre-training) | 0.01 |
| Epoch (pre-training) | 500 |
| Learning rate (fine-tuning) | 0.005 |
| Epoch (fine-tuning) | 1000 |
Table 3.
Results of the comparison experiment obtained on the test dataset (year 2023).
| Method | RMSE | MAE | R2 |
|---|---|---|---|
| GCN | 0.2389 | 0.1783 | 0.9413 |
| GAT | 0.1974 | 0.1496 | 0.9599 |
| GraphSAGE | 0.1273 | 0.0954 | 0.9832 |
| proposed | 0.1016 | 0.0759 | 0.9894 |
Table 4.
Results of the ablation experiment.
| Method | RMSE | MAE | R2 |
|---|---|---|---|
| Removing the pre-training phase | 0.1040 | 0.0896 | 0.9853 |
| Remove projection and feature fusion layers | 0.1086 | 0.0812 | 0.9878 |
| Removing the residual convolution layer | 0.1940 | 0.1459 | 0.9611 |
| Full model | 0.1016 | 0.0759 | 0.9894 |
Table 5.
Comparison of Interpretability Methods.
| Method | ADC |
|---|---|
| SA | 0.0577 |
| Guided-BP | 0.0488 |
| Grad-CAM++ | 0.0188 |
| proposed | 0.0594 |
Table 6.
Effect of different gradient targets on interpretability performance (ADC).
| Method | Gradient Target | ADC |
|---|---|---|
| SA | Output of the model | 0.0577 |
| SA | −MSE | 0.0507 |
| Guided-BP | Output of the model | 0.0488 |
| Guided-BP | −MSE | 0.0449 |
| Grad-CAM++ | Output of the model | −0.0016 |
| Grad-CAM++ | −MSE | 0.0188 |
Table 7.
Comparison of the combined performance of methods for selecting different objects for partial derivative calculation.
| Gradient Targets of the Combined Methods | ADC |
|---|---|
| SA: output of the model; Grad-CAM++: output of the model | 0.05857 |
| SA: −MSE; Grad-CAM++: −MSE | 0.03434 |
| SA: −MSE; Grad-CAM++: output of the model | 0.05858 |
| SA: output of the model; Grad-CAM++: −MSE (used in this paper) | 0.05941 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.