All experiments in this study were conducted under the Windows 10 operating system and were implemented in Python 3.8 with PyTorch 1.9.0 as the deep learning framework; both model training and testing were accelerated on the GPU. The dataset used was the customized SST dataset, split into 70% for training, 10% for validation, and 20% for testing. All data partitions were performed automatically within the main program to ensure scientific rigor and reproducibility. For the model, a self-attention-based sequence prediction framework was employed. The main hyperparameter settings were as follows: input sequence length of 30, label length of 30, prediction step of 1, 3 encoder layers, 1 decoder layer, model dimension of 256, feed-forward network dimension of 1024, 8 attention heads, temporal feature encoding mode set to timeF, activation function set to gelu, dropout ratio of 0.1, batch size of 16, initial learning rate of 0.0003, and a maximum of 80 training epochs. To guarantee reproducibility, a fixed random seed was used across all experiments.
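For concreteness, the settings above can be collected into a single configuration object together with a seeding routine, as in the minimal Python sketch below. The variable names follow common Informer-style codebase conventions and are our assumptions rather than the authors' actual identifiers, and the seed value of 42 is purely illustrative.

```python
import random
from types import SimpleNamespace

import numpy as np
import torch

# Experimental configuration as reported in the text (names are assumptions).
config = SimpleNamespace(
    seq_len=30, label_len=30, pred_len=1,   # input length, label length, prediction step
    e_layers=3, d_layers=1,                 # encoder / decoder layers
    d_model=256, d_ff=1024, n_heads=8,      # model dim, FFN dim, attention heads
    embed="timeF", activation="gelu",       # temporal encoding mode, activation
    dropout=0.1, batch_size=16,
    learning_rate=3e-4, train_epochs=80,
)

def set_seed(seed: int = 42) -> None:       # 42 is an illustrative value
    """Fix all random sources, matching the paper's fixed-seed requirement."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```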
4.4.1. Ablation Study
To verify the contribution of each component in the proposed model to the overall prediction performance, four ablation experiments were designed. Specifically, we evaluated the baseline Transformer model, the Transformer with the Coordinate Attention module (CA-Transformer), the Transformer with the Adaptive Dynamic module (AD-Transformer), and the full integrated model (CAAD-Transformer). The specific experimental results are presented in
Table 1. From
Table 1, it can be observed that the baseline Transformer model performs worse on all three metrics (RMSE, MSE, and MAE) than its counterparts with integrated modules, yielding RMSE = 0.303, MSE = 0.091, and MAE = 0.234. After incorporating the CA module, the model shows slight improvements, with RMSE reduced to 0.298 and MSE decreased to 0.089. This indicates that the CA module enhances the model's ability to discriminate information across channels, effectively highlighting critical spatial or temporal features within the high-dimensional embedding space. In the AD-Transformer, the introduction of the AD modeling mechanism further strengthens the model's capability to capture temporal variation patterns. Although its RMSE of 0.300 represents only a marginal improvement over the baseline Transformer (0.303), it moderately enhances the accuracy of temporal fitting and demonstrates favorable generalization performance. The most significant improvement is observed in the complete CAAD-Transformer model, which integrates both CA and AD modules. In this case, RMSE drops markedly to 0.225, MSE decreases to 0.050, MAE declines to 0.173, and the R value is the highest at 0.97, achieving the best overall performance. This suggests that the joint integration of the two mechanisms not only improves the representation of key spatial and temporal features but also effectively suppresses redundant and noisy information, enabling the model to achieve more stable and accurate predictions under the complex nonlinear dynamics of SST variations.
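The error metrics and the R value used throughout these experiments follow their standard definitions; a self-contained NumPy sketch of how they can be computed is given below, with function and argument names that are illustrative rather than taken from the authors' code.

```python
import numpy as np

def evaluate(pred: np.ndarray, true: np.ndarray) -> dict:
    """Compute RMSE, MSE, MAE, and the Pearson correlation coefficient R."""
    err = pred - true
    mse = np.mean(err ** 2)                             # mean squared error
    rmse = np.sqrt(mse)                                 # root mean squared error
    mae = np.mean(np.abs(err))                          # mean absolute error
    r = np.corrcoef(pred.ravel(), true.ravel())[0, 1]   # Pearson R
    return {"RMSE": rmse, "MSE": mse, "MAE": mae, "R": r}
```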
In addition, to ensure that the selected hyperparameters are optimal, we explored multiple hyperparameter ranges, including learning rates of [0.0001, 0.0003, 0.001, 0.005] and batch sizes of [16, 32, 64]. We also studied the impact of varying the number of encoder layers on the model's performance. The experimental results obtained with a learning rate of 0.0003 and a batch size of 16 are shown in
Table 2. It can be seen that encoder depth affects both predictive performance and computational cost. With only one layer, the model exhibits relatively large errors, particularly in RMSE and MAE, indicating weak fitting capacity and an insufficient ability to capture complex data patterns. Increasing the encoder depth to two or three layers progressively improves performance, with noticeable reductions in RMSE and MAE. In particular, the three-layer configuration yields more stable improvements than the two-layer setup, demonstrating that the model can effectively capture the temporal and spatial dependencies in the data. Although a four-layer encoder further reduces RMSE, MSE, and MAE, the performance gains are marginal, while computational cost and training time increase significantly, making the four-layer configuration impractical for real applications. Considering the trade-off between performance and computational efficiency, the three-layer encoder is selected as the optimal configuration.
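The depth-versus-cost trade-off can be illustrated with PyTorch's built-in Transformer encoder as a stand-in for the model's encoder stack. The sketch below only shows how the parameter count grows with depth under the paper's dimensions (d_model = 256, 8 heads, FFN dimension 1024); it is not the authors' actual architecture.

```python
import torch
import torch.nn as nn

# Compare encoder stacks of depth 1 through 4 at the paper's dimensions.
for num_layers in (1, 2, 3, 4):
    layer = nn.TransformerEncoderLayer(
        d_model=256, nhead=8, dim_feedforward=1024,
        dropout=0.1, activation="gelu", batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    n_params = sum(p.numel() for p in encoder.parameters())
    out = encoder(torch.randn(16, 30, 256))   # (batch, seq_len, d_model)
    print(f"{num_layers} layer(s): {n_params / 1e6:.2f}M parameters, output {tuple(out.shape)}")
```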
4.4.2. Comparative Experiments
In this experiment, to comprehensively evaluate the performance advantages of the proposed model in SST prediction, we conducted comparative experiments against three mainstream time-series forecasting models: LSTM, ConvLSTM, and RNN. The experiments were carried out for four forecasting horizons of 1 day, 7 days, 15 days, and 30 days, using RMSE, MSE, and MAE as evaluation metrics. The experimental results are illustrated in
Figure 3, which shows the error variations of the different models under different forecasting horizons. The detailed analysis is as follows. First, from the perspective of prediction accuracy, the proposed model consistently outperforms the three baseline models across all forecasting horizons. For example, its RMSE for one-day-ahead prediction is only 0.225, significantly lower than that of LSTM (0.268), RNN (0.328), and ConvLSTM (0.405). As the forecasting horizon extends to 7, 15, and 30 days, the RMSE of the proposed model rises to 0.404, 0.549, and 0.788, respectively, still maintaining a leading position and demonstrating strong long-term predictive capability. This advantage is also reflected in the MSE and MAE metrics, indicating that the proposed model achieves superior overall error control.
Second, regarding the error growth trends of the different models: although the errors of all models increase as the forecasting horizon extends, the proposed model shows the most stable growth, exhibiting strong temporal robustness. This stability arises from the integration of multi-layer temporal attention mechanisms in the encoder, which effectively capture long-term dependencies in SST sequences. Furthermore, the hybrid modeling strategy combining stationary and non-stationary components gives the model stronger generalization capability for long-sequence forecasting tasks.
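One common realization of such a stationary/non-stationary split is a moving-average decomposition that separates a slowly varying trend from a quasi-stationary residual. The sketch below illustrates that general idea under this assumption; the paper's exact decomposition may differ.

```python
import torch
import torch.nn.functional as F

def decompose(x: torch.Tensor, kernel: int = 25):
    """Split a series into a non-stationary trend and a quasi-stationary residual.

    x: (batch, seq_len, channels); the kernel size of 25 is illustrative.
    """
    pad = (kernel - 1) // 2
    xt = x.transpose(1, 2)                    # (batch, channels, seq_len)
    # Replicate-pad the ends, then average over time to extract the trend.
    trend = F.avg_pool1d(
        F.pad(xt, (pad, kernel - 1 - pad), mode="replicate"),
        kernel_size=kernel, stride=1).transpose(1, 2)
    residual = x - trend                      # quasi-stationary remainder
    return residual, trend
```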
In addition, considering the structural design and performance of the baseline models: RNN, as the most basic sequential model, has the simplest network structure, without gating or attention mechanisms, resulting in a very limited ability to model long-term dependencies. Experimental results show that the RMSE of RNN reaches 0.328 for one-day-ahead forecasting, much higher than that of the proposed model, revealing its shortcomings in capturing short-term dynamics. Moreover, when processing high-dimensional spatial data, RNN suffers from structural bottlenecks in feature coupling and information compression, leading to degraded performance. LSTM, as a classical recurrent neural network, alleviates the long-term dependency issue of traditional RNNs through its gating mechanism; however, it lacks explicit spatial modeling capability and struggles to handle the spatial heterogeneity in SST data. This limitation becomes more evident in large-scale grid predictions, where its ability to capture long-term trends remains insufficient. ConvLSTM, by incorporating convolutional operations into LSTM, enhances the joint modeling of temporal and spatial dependencies, thereby mitigating the weakness of LSTM in spatial representation. ConvLSTM demonstrates relatively stable performance for short- and mid-term forecasts, with RMSE values of 0.537 and 0.668 for the 7-day and 15-day horizons, slightly outperforming RNN and LSTM. Nonetheless, due to the limited receptive field of convolution operations, ConvLSTM fails to effectively capture long-term dependencies, resulting in insufficient performance at extended horizons. Furthermore, ConvLSTM typically requires fine-tuning of kernel sizes, strides, and temporal unfolding strategies, which increases training complexity, slows convergence, and introduces a risk of overfitting due to its large number of parameters.
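The idea of embedding convolutions into the LSTM gates, as described above, can be made concrete with a minimal ConvLSTM cell. The sketch below is a generic textbook-style implementation with illustrative hyperparameters, not the specific baseline configuration used in these experiments.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gate transformations are convolutions, so the
    hidden and cell states keep a spatial (H, W) layout."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # (B, hid_ch, H, W) each
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)      # input, forget, output, candidate
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                              # update cell state
        h = o * torch.tanh(c)                          # update hidden state
        return h, (h, c)
```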
All the baseline models suffer from performance bottlenecks to varying degrees, primarily due to their lack of flexible feature selection and global modeling capacity. In contrast, the proposed model integrates the CA module and the AD module into the Transformer architecture, significantly enhancing the ability to capture multi-scale temporal dependencies while strengthening spatial feature extraction. Moreover, through hyperparameter optimization and a lightweight network design, the proposed model achieves improved stability and faster convergence in long-sequence forecasting. Therefore, the proposed model demonstrates more comprehensive and robust performance across forecasting horizons, validating its practicality and advancement in complex spatiotemporal prediction scenarios.
Within this design, the Transformer extracts temporal features, the CA module extracts spatial features, and the AD module distributes weights differently across features at different prediction horizons (a schematic sketch of this weighting idea is given below). In short-term predictions, such as 1-day or 3-day forecasts, temporal features receive higher weights, because the dynamics of the recent time series have a more direct impact on short-term outcomes. In medium-term predictions, such as 7-day forecasts, the weights of temporal and spatial features tend to balance. In long-term predictions, spatial features, which capture regional structures and climate backgrounds, contribute more to stability. Additionally, a feed-forward layer is required to compensate for specific hierarchical features.
In the 30-day long-term prediction, all models performed poorly, highlighting that long-term SST prediction remains a challenging task. This is mainly because SST evolution is influenced by various external factors, so deep learning methods that rely solely on historical data are poorly constrained for long-term prediction and struggle to handle sudden changes. This remains an area requiring continued effort.
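If the AD module is abstracted, as the description above suggests, as a learned weighting between a temporal and a spatial feature branch, a minimal gate could look like the following sketch. This is our reading of the mechanism, not the authors' implementation, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Hypothetical gate: softmax weights over temporal and spatial branches."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 2)   # one score per branch

    def forward(self, temporal_feat, spatial_feat):
        # Both inputs: (batch, seq_len, d_model).
        scores = self.gate(torch.cat([temporal_feat, spatial_feat], dim=-1))
        w = torch.softmax(scores, dim=-1)       # (batch, seq_len, 2), sums to 1
        # w[..., :1] weights the temporal branch, w[..., 1:] the spatial branch.
        return w[..., :1] * temporal_feat + w[..., 1:] * spatial_feat
```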
Subplots a, b, c, and d in
Figure 4 represent the predicted and true values for the SST prediction task in one training run of the proposed model, LSTM, ConvLSTM, and RNN, respectively. The red boxes indicate areas with significant differences. The proposed model successfully captures the fluctuation trends of the true values, especially in the region highlighted by the red box: its predicted curve closely matches the true-value curve, demonstrating a strong ability to fit temporal changes and temperature fluctuations. Compared with the other models, the proposed model is more accurate in short-term predictions and in regions with rapid changes, as indicated by the red box, and effectively handles rapid temperature fluctuations. The LSTM model performs well at most time steps but shows larger errors in the red-box region; although LSTM is suited to long sequences, its performance declines when facing more complex temperature variations, especially high-frequency fluctuations. Similarly, ConvLSTM combines the advantages of convolution and LSTM, showing some strength in capturing spatial features and temporal dependencies, but it still struggles to predict sudden fluctuations. The RNN model, by contrast, has poor overall fitting performance across all time steps, especially in areas with significant temperature fluctuations: its predicted curve deviates significantly from the true values and cannot effectively track rapid temperature changes, clearly exposing its limitations in capturing long-term dependencies and local changes.
Figure 5 shows the visualization of predicted values, true values, and error maps for the different prediction models; columns a, b, c, and d correspond to the proposed model, LSTM, ConvLSTM, and RNN, respectively. From the prediction maps, it can be seen that the proposed model produces the most accurate spatial distribution of predictions, though there is still room for improvement. In the LSTM model's prediction map, the distribution of predicted values is relatively uniform, with some fluctuations but a relatively smooth appearance. In the ConvLSTM predictions, clearer temperature trends can be observed, especially in the tropical region, where the predicted values show more detailed patterns. The RNN model's predicted values exhibit larger fluctuations than those of LSTM and ConvLSTM, especially in local areas, indicating RNN's weaker ability to capture spatiotemporal dynamics. In the error maps, the proposed model shows an even and small error distribution overall. The LSTM model's error distribution is relatively uniform, but some local areas still have large errors, particularly in regions with complex temperature changes. The ConvLSTM error map is relatively smooth, indicating more stable performance in capturing temperature trends, especially in high-temperature regions. In contrast, the RNN model's errors fluctuate more strongly, revealing its limitations in handling the complex spatiotemporal relationships of sea surface temperature.
Taken together, the visualized predictions and errors in Figure 5 demonstrate the model's ability to extract and predict the spatial features of SST.
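For reference, error maps of the kind shown in Figure 5 can be rendered directly from gridded predictions. The short matplotlib sketch below uses synthetic placeholder grids and an illustrative 64 × 64 grid size.

```python
import numpy as np
import matplotlib.pyplot as plt

pred = np.random.rand(64, 64)    # placeholder predicted SST grid
true = np.random.rand(64, 64)    # placeholder observed SST grid

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
panels = [(pred, "Predicted", "coolwarm"),
          (true, "True", "coolwarm"),
          (np.abs(pred - true), "Absolute error", "viridis")]
for ax, (data, title, cmap) in zip(axes, panels):
    im = ax.pcolormesh(data, cmap=cmap)   # one panel per field, as in Figure 5
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```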
Figure 6 presents the temporal attention weight map, showing the attention distribution between query time steps and key time steps. It is evident that the model assigns higher attention to recent time steps (shown in lighter yellow) when predicting future time steps, particularly for the long-term forecasts in the red region on the right. This indicates that the model relies primarily on historical temporal information, especially when predicting long sequences, where historical patterns and trends have a significant impact on future outcomes. The ENSO phenomenon often produces significant SST anomalies, especially in the tropical Pacific, and these anomalies can affect ocean currents and atmospheric circulation, thereby influencing global climate patterns. Therefore, reliable long-term prediction can help raise early awareness of the occurrence and evolution of such abnormal climate phenomena.
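A map like Figure 6 can be produced from one layer's attention weights by averaging over heads and plotting query steps against key steps. In the sketch below the weight tensor is synthetic, since the real weights would be extracted from the trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder attention weights: (heads, query steps, key steps).
attn = np.random.rand(8, 30, 30)
attn /= attn.sum(axis=-1, keepdims=True)    # row-normalize, like a softmax output

plt.imshow(attn.mean(axis=0), cmap="viridis", origin="lower")  # head-averaged map
plt.xlabel("Key time step")
plt.ylabel("Query time step")
plt.colorbar(label="Attention weight")
plt.title("Head-averaged temporal attention")
plt.show()
```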
4.4.4. Experiments on Different Sea Regions
To comprehensively evaluate the generalization capability and robustness of the proposed model under different geographical regions, climate patterns, and dynamic environments, five representative offshore regions of China were selected for experiments: the Bohai Sea, the Yellow Sea, the East China Sea, the South China Sea, and the Taiwan Strait. These regions are distributed across temperate, subtropical, and tropical seas, featuring significant climatic differences and diverse oceanographic dynamics. Therefore, the experiments are both representative and broadly applicable. The experimental results are shown in
Figure 8. From the results, the South China Sea and the Taiwan Strait exhibit the smallest prediction errors, with RMSE values of 0.226 and 0.273, respectively. Their MSE and MAE metrics are also significantly better than those of the other regions, demonstrating outstanding predictive performance. These two regions are located between tropical and subtropical zones, where SST exhibits relatively stable overall trends with fewer extreme fluctuations and less abrupt seasonal transitions. Moreover, the SST data in these regions are continuous with few missing values, providing the model with high-quality training samples that effectively support time-series modeling. In addition, the two regions are geographically adjacent and share similar climatic control factors, such as the South China Sea summer monsoon and the Philippine warm current, leading to consistent predictive advantages of the model across both regions.
The East China Sea and the Yellow Sea yield RMSE values of 0.415 and 0.429, respectively, representing intermediate levels. Both seas are located at the intersection of subtropical and temperate zones, characterized by complex ocean structures and strongly influenced by monsoons, shelf topography, and multiple ocean currents, such as the Kuroshio Current and the Yellow Sea Cold Water Mass. These complex dynamics introduce greater nonlinearity and variability, posing more significant challenges for modeling. In such environments, the model must capture multiscale spatiotemporal features to maintain prediction accuracy. Nevertheless, the proposed model still delivers relatively accurate and stable results, indicating strong adaptability and robustness when dealing with multiscale disturbances and strongly coupled dynamic backgrounds.
The prediction error in the Bohai Sea is the highest, with an RMSE of 0.642, and both its MSE and MAE are significantly higher than those of the other regions. Analysis suggests that the Bohai Sea may be affected by heavier pollution, weather disturbances, and other factors that introduce noise into data collection; such noisy data can hinder the model's learning process and result in larger errors. Additionally, the Bohai Sea, located in northern China, experiences strong seasonal variation and extreme climate phenomena, which cause significant fluctuations in SST. The industrial and fishery activities and coastal cities around the Bohai Sea may also exert substantial human influence on seawater temperature changes. Such anthropogenic factors are usually difficult for the model to capture accurately, which in turn increases the errors.
Figure 9 shows the visualization of the proposed model's error values across different geographic regions. From the RMSE results, it can be observed that the prediction error is higher in coastal regions, especially in the Bohai Sea, where the error is relatively large. This indicates that in marginal sea areas, factors such as water depth variations and the complexity of ocean currents lower the prediction accuracy of SST. In contrast, the error distribution in the South China Sea and the Taiwan Strait is relatively flat, with lower error values, showing better prediction performance. The MSE results are similar to the RMSE results, but with overall smaller error values. Specifically, in the Yellow Sea region, the MSE spatial distribution is more concentrated, likely because the relatively stable water body in this region allows the model to fit temperature changes better. In contrast, the MSE errors are higher in the East China Sea and the Bohai Sea, suggesting that temperature predictions in these regions are influenced by more complex spatiotemporal changes. Regarding the MAE performance in each region, the Bohai Sea shows the most prominent prediction errors, especially within certain latitude ranges, indicating that predictions in this region may be constrained by the nonlinear nature of the data or external environmental factors.