All experiments in this study were conducted under the Windows 10 operating system and were implemented in Python 3.8 with PyTorch 1.9.0 as the deep learning framework; both model training and testing were accelerated on the GPU. The dataset used was the customized SST dataset, split into 70% for training, 10% for validation, and 20% for testing. All data partitions were performed automatically within the main program to ensure scientific rigor and reproducibility. For the model, a self-attention-based sequence prediction framework was employed. The main hyperparameter settings were as follows: input sequence length of 30, label length of 30, prediction step of 1, 3 encoder layers, 1 decoder layer, model dimension of 256, feed-forward network dimension of 1024, 8 attention heads, temporal feature encoding mode set to timeF, activation function set to gelu, dropout ratio of 0.1, batch size of 16, initial learning rate of 0.0003, and a maximum of 80 training epochs. To guarantee reproducibility, a fixed random seed was used across all experiments.
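For concreteness, the settings above can be collected into a single configuration object together with a seeding routine, as in the minimal Python sketch below. The variable names follow common Informer-style codebase conventions and are our assumptions rather than the authors' actual identifiers, and the seed value of 42 is purely illustrative.

```python
import random
from types import SimpleNamespace

import numpy as np
import torch

# Experimental configuration as reported in the text (names are assumptions).
config = SimpleNamespace(
    seq_len=30, label_len=30, pred_len=1,   # input length, label length, prediction step
    e_layers=3, d_layers=1,                 # encoder / decoder layers
    d_model=256, d_ff=1024, n_heads=8,      # model dim, FFN dim, attention heads
    embed="timeF", activation="gelu",       # temporal encoding mode, activation
    dropout=0.1, batch_size=16,
    learning_rate=3e-4, train_epochs=80,
)

def set_seed(seed: int = 42) -> None:       # 42 is an illustrative value
    """Fix all random sources, matching the paper's fixed-seed requirement."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```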
4.4.1. Ablation Study
To verify the contribution of each component in the proposed model to the overall prediction performance, four ablation experiments were designed. Specifically, we evaluated the baseline Transformer model, the Transformer with the Coordinate Attention module (CA-Transformer), the Transformer with the Adaptive Dynamic module (AD-Transformer), and the full integrated model (CAAD-Transformer). The specific experimental results are presented in
Table 1. From
Table 1, it can be observed that the baseline Transformer model performs worse on all three metrics (RMSE, MSE, and MAE) than its counterparts with integrated modules, yielding RMSE = 0.303, MSE = 0.091, and MAE = 0.234. After incorporating the CA module, the model shows slight improvements, with RMSE reduced to 0.298 and MSE decreased to 0.089. This indicates that the CA module enhances the model's ability to discriminate information across channels, effectively highlighting critical spatial or temporal features within the high-dimensional embedding space. In the AD-Transformer, the introduction of the AD modeling mechanism further strengthens the model's capability to capture temporal variation patterns. Although its RMSE of 0.300 represents only a marginal improvement over the baseline Transformer (0.303), it moderately enhances the accuracy of temporal fitting and demonstrates favorable generalization performance. The most significant improvement is observed in the complete CAAD-Transformer model, which integrates both CA and AD modules. In this case, RMSE drops markedly to 0.225, MSE decreases to 0.050, MAE declines to 0.173, and the R value is the highest at 0.97, achieving the best overall performance. This suggests that the joint integration of the two mechanisms not only improves the representation of key spatial and temporal features but also effectively suppresses redundant and noisy information, enabling the model to achieve more stable and accurate predictions under the complex nonlinear dynamics of SST variations.
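The error metrics and the R value used throughout these experiments follow their standard definitions; a self-contained NumPy sketch of how they can be computed is given below, with function and argument names that are illustrative rather than taken from the authors' code.

```python
import numpy as np

def evaluate(pred: np.ndarray, true: np.ndarray) -> dict:
    """Compute RMSE, MSE, MAE, and the Pearson correlation coefficient R."""
    err = pred - true
    mse = np.mean(err ** 2)                             # mean squared error
    rmse = np.sqrt(mse)                                 # root mean squared error
    mae = np.mean(np.abs(err))                          # mean absolute error
    r = np.corrcoef(pred.ravel(), true.ravel())[0, 1]   # Pearson R
    return {"RMSE": rmse, "MSE": mse, "MAE": mae, "R": r}
```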
In addition, to ensure that the selected hyperparameters are optimal, we explored multiple hyperparameter ranges, including learning rates of [0.0001, 0.0003, 0.001, 0.005] and batch sizes of [16, 32, 64]. We also studied the impact of varying the number of encoder layers on the model's performance. The experimental results obtained with a learning rate of 0.0003 and a batch size of 16 are shown in
Table 2. It can be seen that encoder depth affects both predictive performance and computational cost. With only one layer, the model exhibits relatively large errors, particularly in RMSE and MAE, indicating weak fitting capacity and an insufficient ability to capture complex data patterns. Increasing the encoder depth to two or three layers progressively improves performance, with noticeable reductions in RMSE and MAE. In particular, the three-layer configuration yields more stable improvements than the two-layer setup, demonstrating that the model can effectively capture the temporal and spatial dependencies in the data. Although a four-layer encoder further reduces RMSE, MSE, and MAE, the performance gains are marginal, while computational cost and training time increase significantly, making the four-layer configuration impractical for real applications. Considering the trade-off between performance and computational efficiency, the three-layer encoder is selected as the optimal configuration.
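The depth-versus-cost trade-off can be illustrated with PyTorch's built-in Transformer encoder as a stand-in for the model's encoder stack. The sketch below only shows how the parameter count grows with depth under the paper's dimensions (d_model = 256, 8 heads, FFN dimension 1024); it is not the authors' actual architecture.

```python
import torch
import torch.nn as nn

# Compare encoder stacks of depth 1 through 4 at the paper's dimensions.
for num_layers in (1, 2, 3, 4):
    layer = nn.TransformerEncoderLayer(
        d_model=256, nhead=8, dim_feedforward=1024,
        dropout=0.1, activation="gelu", batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
    n_params = sum(p.numel() for p in encoder.parameters())
    out = encoder(torch.randn(16, 30, 256))   # (batch, seq_len, d_model)
    print(f"{num_layers} layer(s): {n_params / 1e6:.2f}M parameters, output {tuple(out.shape)}")
```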
4.4.2. Comparative Experiments
In this experiment, to comprehensively evaluate the performance advantages of the proposed model in SST prediction, we conducted comparative experiments against three mainstream time-series forecasting models: LSTM, ConvLSTM, and RNN. The experiments were carried out for four forecasting horizons of 1 day, 7 days, 15 days, and 30 days, using RMSE, MSE, and MAE as evaluation metrics. The experimental results are illustrated in
Figure 3, which shows the error variations of the different models under different forecasting horizons. The detailed analysis is as follows. First, from the perspective of prediction accuracy, the proposed model consistently outperforms the three baseline models across all forecasting horizons. For example, its RMSE for one-day-ahead prediction is only 0.225, significantly lower than that of LSTM (0.268), RNN (0.328), and ConvLSTM (0.405). As the forecasting horizon extends to 7, 15, and 30 days, the RMSE of the proposed model rises to 0.404, 0.549, and 0.788, respectively, still maintaining a leading position and demonstrating strong long-term predictive capability. This advantage is also reflected in the MSE and MAE metrics, indicating that the proposed model achieves superior overall error control.
Second, regarding the error growth trends of the different models: although the errors of all models increase as the forecasting horizon extends, the proposed model shows the most stable growth, exhibiting strong temporal robustness. This stability arises from the integration of multi-layer temporal attention mechanisms in the encoder, which effectively capture long-term dependencies in SST sequences. Furthermore, the hybrid modeling strategy combining stationary and non-stationary components gives the model stronger generalization capability for long-sequence forecasting tasks.
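One common realization of such a stationary/non-stationary split is a moving-average decomposition that separates a slowly varying trend from a quasi-stationary residual. The sketch below illustrates that general idea under this assumption; the paper's exact decomposition may differ.

```python
import torch
import torch.nn.functional as F

def decompose(x: torch.Tensor, kernel: int = 25):
    """Split a series into a non-stationary trend and a quasi-stationary residual.

    x: (batch, seq_len, channels); the kernel size of 25 is illustrative.
    """
    pad = (kernel - 1) // 2
    xt = x.transpose(1, 2)                    # (batch, channels, seq_len)
    # Replicate-pad the ends, then average over time to extract the trend.
    trend = F.avg_pool1d(
        F.pad(xt, (pad, kernel - 1 - pad), mode="replicate"),
        kernel_size=kernel, stride=1).transpose(1, 2)
    residual = x - trend                      # quasi-stationary remainder
    return residual, trend
```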
In addition, considering the structural design and performance of the baseline models: RNN, as the most basic sequential model, has the simplest network structure, without gating or attention mechanisms, resulting in a very limited ability to model long-term dependencies. Experimental results show that the RMSE of RNN reaches 0.328 for one-day-ahead forecasting, much higher than that of the proposed model, revealing its shortcomings in capturing short-term dynamics. Moreover, when processing high-dimensional spatial data, RNN suffers from structural bottlenecks in feature coupling and information compression, leading to degraded performance. LSTM, as a classical recurrent neural network, alleviates the long-term dependency issue of traditional RNNs through its gating mechanism; however, it lacks explicit spatial modeling capability and struggles to handle the spatial heterogeneity in SST data. This limitation becomes more evident in large-scale grid predictions, where its ability to capture long-term trends remains insufficient. ConvLSTM, by incorporating convolutional operations into LSTM, enhances the joint modeling of temporal and spatial dependencies, thereby mitigating the weakness of LSTM in spatial representation. ConvLSTM demonstrates relatively stable performance for short- and mid-term forecasts, with RMSE values of 0.537 and 0.668 for the 7-day and 15-day horizons, slightly outperforming RNN and LSTM. Nonetheless, due to the limited receptive field of convolution operations, ConvLSTM fails to effectively capture long-term dependencies, resulting in insufficient performance at extended horizons. Furthermore, ConvLSTM typically requires fine-tuning of kernel sizes, strides, and temporal unfolding strategies, which increases training complexity, slows convergence, and introduces a risk of overfitting due to its large number of parameters.
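The idea of embedding convolutions into the LSTM gates, as described above, can be made concrete with a minimal ConvLSTM cell. The sketch below is a generic textbook-style implementation with illustrative hyperparameters, not the specific baseline configuration used in these experiments.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gate transformations are convolutions, so the
    hidden and cell states keep a spatial (H, W) layout."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # (B, hid_ch, H, W) each
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)      # input, forget, output, candidate
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                              # update cell state
        h = o * torch.tanh(c)                          # update hidden state
        return h, (h, c)
```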
All the baseline models suffer from performance bottlenecks to varying degrees, primarily due to their lack of flexible feature selection and global modeling capacity. In contrast, the proposed model integrates the CA module and the AD module into the Transformer architecture, significantly enhancing the ability to capture multi-scale temporal dependencies while strengthening spatial feature extraction. Moreover, through hyperparameter optimization and a lightweight network design, the proposed model achieves improved stability and faster convergence in long-sequence forecasting. Therefore, the proposed model demonstrates more comprehensive and robust performance across forecasting horizons, validating its practicality and advancement in complex spatiotemporal prediction scenarios.
Within this design, the Transformer extracts temporal features, the CA module extracts spatial features, and the AD module distributes weights differently across features at different prediction horizons (a schematic sketch of this weighting idea is given below). In short-term predictions, such as 1-day or 3-day forecasts, temporal features receive higher weights, because the dynamics of the recent time series have a more direct impact on short-term outcomes. In medium-term predictions, such as 7-day forecasts, the weights of temporal and spatial features tend to balance. In long-term predictions, spatial features, which capture regional structures and climate backgrounds, contribute more to stability. Additionally, a feed-forward layer is required to compensate for specific hierarchical features.
In the 30-day long-term prediction, all models performed poorly, highlighting that long-term SST prediction remains a challenging task. This is mainly because SST evolution is influenced by various external factors, so deep learning methods that rely solely on historical data are poorly constrained for long-term prediction and struggle to handle sudden changes. This remains an area requiring continued effort.
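If the AD module is abstracted, as the description above suggests, as a learned weighting between a temporal and a spatial feature branch, a minimal gate could look like the following sketch. This is our reading of the mechanism, not the authors' implementation, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Hypothetical gate: softmax weights over temporal and spatial branches."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, 2)   # one score per branch

    def forward(self, temporal_feat, spatial_feat):
        # Both inputs: (batch, seq_len, d_model).
        scores = self.gate(torch.cat([temporal_feat, spatial_feat], dim=-1))
        w = torch.softmax(scores, dim=-1)       # (batch, seq_len, 2), sums to 1
        # w[..., :1] weights the temporal branch, w[..., 1:] the spatial branch.
        return w[..., :1] * temporal_feat + w[..., 1:] * spatial_feat
```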
Subplots a, b, c, and d in
Figure 4 represent the predicted and true values for the SST prediction task in one training run of the proposed model, LSTM, ConvLSTM, and RNN, respectively. The red boxes indicate areas with significant differences. The proposed model successfully captures the fluctuation trends of the true values, especially in the region highlighted by the red box: its predicted curve closely matches the true-value curve, demonstrating a strong ability to fit temporal changes and temperature fluctuations. Compared with the other models, the proposed model is more accurate in short-term predictions and in regions with rapid changes, as indicated by the red box, and effectively handles rapid temperature fluctuations. The LSTM model performs well at most time steps but shows larger errors in the red-box region; although LSTM is suited to long sequences, its performance declines when facing more complex temperature variations, especially high-frequency fluctuations. Similarly, ConvLSTM combines the advantages of convolution and LSTM, showing some strength in capturing spatial features and temporal dependencies, but it still struggles to predict sudden fluctuations. The RNN model, by contrast, has poor overall fitting performance across all time steps, especially in areas with significant temperature fluctuations: its predicted curve deviates significantly from the true values and cannot effectively track rapid temperature changes, clearly exposing its limitations in capturing long-term dependencies and local changes.
Figure 5 shows the visualization of predicted values, true values, and error maps for the different prediction models; columns a, b, c, and d correspond to the proposed model, LSTM, ConvLSTM, and RNN, respectively. From the prediction maps, it can be seen that the proposed model produces the most accurate spatial distribution of predictions, though there is still room for improvement. In the LSTM model's prediction map, the distribution of predicted values is relatively uniform, with some fluctuations but a relatively smooth appearance. In the ConvLSTM predictions, clearer temperature trends can be observed, especially in the tropical region, where the predicted values show more detailed patterns. The RNN model's predicted values exhibit larger fluctuations than those of LSTM and ConvLSTM, especially in local areas, indicating RNN's weaker ability to capture spatiotemporal dynamics. In the error maps, the proposed model shows an even and small error distribution overall. The LSTM model's error distribution is relatively uniform, but some local areas still have large errors, particularly in regions with complex temperature changes. The ConvLSTM error map is relatively smooth, indicating more stable performance in capturing temperature trends, especially in high-temperature regions. In contrast, the RNN model's errors fluctuate more strongly, revealing its limitations in handling the complex spatiotemporal relationships of sea surface temperature.
Taken together, the visualized predictions and errors in Figure 5 demonstrate the model's ability to extract and predict the spatial features of SST.
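For reference, error maps of the kind shown in Figure 5 can be rendered directly from gridded predictions. The short matplotlib sketch below uses synthetic placeholder grids and an illustrative 64 × 64 grid size.

```python
import numpy as np
import matplotlib.pyplot as plt

pred = np.random.rand(64, 64)    # placeholder predicted SST grid
true = np.random.rand(64, 64)    # placeholder observed SST grid

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
panels = [(pred, "Predicted", "coolwarm"),
          (true, "True", "coolwarm"),
          (np.abs(pred - true), "Absolute error", "viridis")]
for ax, (data, title, cmap) in zip(axes, panels):
    im = ax.pcolormesh(data, cmap=cmap)   # one panel per field, as in Figure 5
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```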
Figure 6 presents the temporal attention weight map, showing the attention distribution between query time steps and key time steps. It is evident that the model assigns higher attention to recent time steps (shown in lighter yellow) when predicting future time steps, particularly for the long-term forecasts in the red region on the right. This indicates that the model relies primarily on historical temporal information, especially when predicting long sequences, where historical patterns and trends have a significant impact on future outcomes. The ENSO phenomenon often produces significant SST anomalies, especially in the tropical Pacific, and these anomalies can affect ocean currents and atmospheric circulation, thereby influencing global climate patterns. Therefore, reliable long-term prediction can help raise early awareness of the occurrence and evolution of such abnormal climate phenomena.
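A map like Figure 6 can be produced from one layer's attention weights by averaging over heads and plotting query steps against key steps. In the sketch below the weight tensor is synthetic, since the real weights would be extracted from the trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder attention weights: (heads, query steps, key steps).
attn = np.random.rand(8, 30, 30)
attn /= attn.sum(axis=-1, keepdims=True)    # row-normalize, like a softmax output

plt.imshow(attn.mean(axis=0), cmap="viridis", origin="lower")  # head-averaged map
plt.xlabel("Key time step")
plt.ylabel("Query time step")
plt.colorbar(label="Attention weight")
plt.title("Head-averaged temporal attention")
plt.show()
```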
4.4.4. Experiments on Different Sea Regions
To comprehensively evaluate the generalization capability and robustness of the proposed model under different geographical regions, climate patterns, and dynamic environments, five representative offshore regions of China were selected for experiments: the Bohai Sea, the Yellow Sea, the East China Sea, the South China Sea, and the Taiwan Strait. These regions are distributed across temperate, subtropical, and tropical seas, featuring significant climatic differences and diverse oceanographic dynamics. Therefore, the experiments are both representative and broadly applicable. The experimental results are shown in
Figure 8. From the results, the South China Sea and the Taiwan Strait exhibit the smallest prediction errors, with RMSE values of 0.226 and 0.273, respectively. Their MSE and MAE metrics are also significantly better than those of the other regions, demonstrating outstanding predictive performance. These two regions are located between tropical and subtropical zones, where SST exhibits relatively stable overall trends with fewer extreme fluctuations and less abrupt seasonal transitions. Moreover, the SST data in these regions are continuous with few missing values, providing the model with high-quality training samples that effectively support time-series modeling. In addition, the two regions are geographically adjacent and share similar climatic control factors, such as the South China Sea summer monsoon and the Philippine warm current, leading to consistent predictive advantages of the model across both regions.
The East China Sea and the Yellow Sea yield RMSE values of 0.415 and 0.429, respectively, representing intermediate levels. Both seas are located at the intersection of subtropical and temperate zones, characterized by complex ocean structures and strongly influenced by monsoons, shelf topography, and multiple ocean currents, such as the Kuroshio Current and the Yellow Sea Cold Water Mass. These complex dynamics introduce greater nonlinearity and variability, posing more significant challenges for modeling. In such environments, the model must capture multiscale spatiotemporal features to maintain prediction accuracy. Nevertheless, the proposed model still delivers relatively accurate and stable results, indicating strong adaptability and robustness when dealing with multiscale disturbances and strongly coupled dynamic backgrounds.
The prediction error in the Bohai Sea is the highest, with an RMSE of 0.642, and both its MSE and MAE are significantly higher than those of the other regions. Analysis suggests that the Bohai Sea may be affected by heavier pollution, weather disturbances, and other factors that introduce noise into data collection; such noisy data can hinder the model's learning process and result in larger errors. Additionally, the Bohai Sea, located in northern China, experiences strong seasonal variation and extreme climate phenomena, which cause significant fluctuations in SST. The industrial and fishery activities and coastal cities around the Bohai Sea may also exert substantial human influence on seawater temperature changes. Such anthropogenic factors are usually difficult for the model to capture accurately, which in turn increases the errors.
Figure 9 shows the visualization of the proposed model's error values across different geographic regions. From the RMSE results, it can be observed that the prediction error is higher in coastal regions, especially in the Bohai Sea, where the error is relatively large. This indicates that in marginal sea areas, factors such as water depth variations and the complexity of ocean currents lower the prediction accuracy of SST. In contrast, the error distribution in the South China Sea and the Taiwan Strait is relatively flat, with lower error values, showing better prediction performance. The MSE results are similar to the RMSE results, but with overall smaller error values. Specifically, in the Yellow Sea region, the MSE spatial distribution is more concentrated, likely because the relatively stable water body in this region allows the model to fit temperature changes better. In contrast, the MSE errors are higher in the East China Sea and the Bohai Sea, suggesting that temperature predictions in these regions are influenced by more complex spatiotemporal changes. Regarding the MAE performance in each region, the Bohai Sea shows the most prominent prediction errors, especially within certain latitude ranges, indicating that predictions in this region may be constrained by the nonlinear nature of the data or external environmental factors.