A Hybrid Mamba–ConvLSTM Framework for Multi-Day Sea Surface Temperature Forecasting at 0.05° Resolution

Peng, Bo; Hong, Zhonghua; Wang, Guansuo

doi:10.3390/jmse14100898


by Bo Peng ¹,²,³, Zhonghua Hong ² and Guansuo Wang ¹,³,*

¹ East China Sea Forecasting and Disaster Reduction Center, Ministry of Natural Resources, Shanghai 201306, China
² College of Information Technology, Shanghai Ocean University, Shanghai 200090, China
³ Observation and Research Station of Huaniaoshan East China Sea Ocean-Atmosphere Integrated Ecosystem, Ministry of Natural Resources, Shanghai 201306, China
* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(10), 898; https://doi.org/10.3390/jmse14100898 (registering DOI)

Submission received: 14 April 2026 / Revised: 3 May 2026 / Accepted: 4 May 2026 / Published: 12 May 2026

(This article belongs to the Section Physical Oceanography)


Abstract

Accurate multi-day sea surface temperature (SST) prediction at sub-mesoscale resolution is challenging due to nonlinear ocean dynamics, heterogeneous multi-source observations, and error accumulation during autoregressive rollout. This paper proposes a hybrid Mamba–ConvLSTM framework that combines recurrent local spatiotemporal encoding with selective state-space long-range spatial modeling. The ConvLSTM branch captures local spatial patterns and short-range temporal dependencies through convolutional gating, while the Mamba branch captures long-range spatial dependencies across each frame through cross-direction window scanning and maintains temporal coherence via persistent hidden states across successive time steps. A physically informed preprocessing stage aligns 0.083° reanalysis variables to the 0.05° OSTIA target grid via a Grow-and-Cut strategy and extracts gradient-based advection and diffusion proxy features under boundary-aware finite differencing. During autoregressive rollout, auxiliary variables are held at their last observed values and physical proxies are recomputed from the predicted SST, following a clearly specified protocol. Experiments on a South China Sea benchmark compare the proposed model against nine baselines (persistence, daily climatology, ConvLSTM, PredRNN, ConvGRU, TCTN, PANN, Swin-UNet, and ViT-ST) under an identical data-split, normalization, and rollout protocol. Evaluation with RMSE, MAE, SSIM, $R^{2}$, and anomaly correlation coefficient (ACC) shows that the proposed model achieves a 10-day average RMSE of 0.512 °C, outperforming the strongest learning-based baseline ViT-ST by 5.0% and the persistence forecast by 21.0%. Ablation studies, sensitivity analyses, seasonal evaluation, and statistical significance testing verify the contribution of each component and the robustness of the results.

Keywords: sea surface temperature forecasting; Mamba; ConvLSTM; selective state space model; cross-direction scanning; physical proxy features; spatiotemporal prediction

1. Introduction

Sea surface temperature (SST) is a fundamental variable of the coupled ocean–atmosphere system, exerting direct control over evaporation rates, boundary-layer stability, tropical cyclone intensification, and marine biogeochemical cycles [1]. At regional scales, wind-driven upwelling modulates SST variability and fishery yields, monsoon onset timing is sensitive to SST gradients, and SST anomalies serve as early indicators of coral bleaching events [2]. Consequently, accurate SST forecasting is essential for marine disaster prevention, vessel route optimization, fishery resource management, and climate monitoring.

Operational SST prediction has traditionally relied on numerical ocean models such as HYCOM and MOM, which solve primitive equations on structured grids. While physically rigorous, these models demand enormous computational resources at high resolution and remain susceptible to initial-condition uncertainty and parameterization error [3]. In contrast, statistical and machine-learning methods exploit historical observations to learn predictive mappings directly, offering faster inference and competitive short-range skill.

Over the past decade, deep learning has significantly advanced geophysical spatiotemporal modeling [4,5]. Convolutional recurrent architectures, epitomized by ConvLSTM [6] and PredRNN [7], inject spatial inductive bias into sequential gating and have become standard tools for radar nowcasting and ocean field prediction. These models excel at capturing local dynamics within limited receptive fields but struggle to propagate information across distant spatial locations over large domains. Transformer-based architectures address the receptive-field bottleneck through self-attention, enabling direct interaction between any pair of spatial tokens [8,9]; however, the quadratic cost of full attention limits scalability to high-resolution grids, and naïve patch tokenization may fragment continuous geophysical structures.

Recently, selective state-space models (SSMs), particularly Mamba [10], have emerged as a compelling alternative for long-sequence modeling. By parameterizing input-dependent state transitions with linear-time complexity, Mamba achieves expressive sequence modeling without the quadratic overhead of attention. Vision-oriented extensions such as VMamba [11] and LocalMamba [12] have demonstrated that SSMs can be adapted to 2D image data through multi-directional scan paths (e.g., the CS2D cross-scan module in VMamba [11]). These developments motivate exploring Mamba-based architectures for dense geophysical prediction tasks such as SST forecasting, provided that the scan strategy and temporal integration are carefully designed for the spatiotemporal nature of the problem.

Despite the promise of combining recurrent locality with state-space global modeling, practical application to 0.05°-resolution SST forecasting faces several persistent challenges:

Under-specified spatial scan design for state-space models. When flattening 2D spatial grids into 1D sequences for SSM processing, the choice of scan order profoundly affects which spatial neighborhoods remain contiguous. While VMamba [11] introduced a four-direction cross-scan module for image classification, its adaptation to spatiotemporal forecasting—where SSM hidden states should also carry information across time steps—requires explicit design choices that existing studies do not address.
Ambiguous auxiliary-variable handling during autoregressive rollout. SST prediction models often ingest auxiliary ocean variables (currents, salinity, sea-surface height) as inputs. During multi-step rollout, the model predicts only SST, yet these auxiliary fields are needed at every step. The rollout protocol for auxiliary variables is rarely specified, creating a hidden but impactful source of methodological ambiguity.
Insufficiently defined boundary treatment and cross-resolution alignment. Physical proxy features (advection, diffusion) require spatial finite differences that produce unphysical values at land–sea boundaries. Simultaneously, multi-source products reside on different native grids whose coastline masks do not coincide. Both issues are frequently mentioned but not algorithmically specified, hindering reproducibility.

This study addresses all three issues through a hybrid Mamba–ConvLSTM architecture with explicit, reproducible methodological definitions. The framework combines a ConvLSTM branch for local spatiotemporal encoding with a Mamba branch for long-range spatial modeling via cross-direction scanning with temporally persistent hidden states, supported by a Grow-and-Cut alignment module and boundary-aware gradient-based feature extraction (detailed in Section 3).

The main contributions are as follows:

A hybrid Mamba–ConvLSTM framework is proposed that combines ConvLSTM-based local spatiotemporal encoding with Mamba-based long-range spatial modeling. Building on the cross-scan concept introduced by VMamba [11], this work extends it to the spatiotemporal forecasting setting by (a) adding persistent Mamba hidden states across time steps to endow the spatial scan with temporal memory, (b) introducing learnable softmax-weighted directional aggregation for inverse mapping, and (c) providing explicit forward and inverse rearrangement operators for full reproducibility. To the best of the authors’ knowledge, this is the first work to introduce temporal state persistence into vision-oriented selective state-space scanning, enabling SSMs to serve as both spatial encoders and implicit temporal memory pathways within a single unified architecture.
Fully specified physical preprocessing is provided, including an algorithmic Grow-and-Cut cross-resolution alignment procedure and a boundary-aware finite-difference scheme for gradient-based proxy construction under land–sea masking, together with an explicit autoregressive rollout protocol that documents how auxiliary variables and physical proxies are maintained beyond the observation window. Unlike prior SST forecasting studies that mention physical preprocessing without algorithmic detail, every step of the proposed pipeline is formally defined and reproducible, addressing a persistent gap in methodological transparency within the field.
A unified benchmark on the South China Sea is established with nine baselines (including persistence and daily climatology as elementary references), evaluated under identical data splits, normalization, rollout protocols, and five complementary metrics (RMSE, MAE, SSIM, $R^{2}$, ACC), supported by ablation studies, sensitivity analyses, seasonal evaluation, and statistical significance testing. This standardized evaluation protocol, which is rarely adopted in SST forecasting studies, ensures that reported improvements are attributable to architectural differences rather than implementation discrepancies.

2. Related Work

This section synthesizes the four lines of research most relevant to the proposed framework, emphasizing the specific limitations that motivate each of its design choices.

2.1. Recurrent Spatiotemporal Models

ConvLSTM [6] replaced fully connected transitions with convolutions, enabling joint spatial and temporal modeling in gridded data, and has become a standard encoder for radar nowcasting and ocean field prediction. Subsequent extensions addressed known limitations: PredRNN [7] introduced a spatiotemporal memory cell that flows both vertically and horizontally to improve gradient flow; PredRNN++ [13] added a gradient highway unit; ConvGRU [14] offered a lighter two-gate variant; and SA-ConvLSTM [15] augmented ConvLSTM with self-attention to extend the spatial receptive field. Despite these advances, the effective receptive field of recurrent convolutional models grows slowly with depth and remains fundamentally local, limiting the ability to capture spatial dependencies across domains spanning hundreds of grid cells. This locality limitation motivates the Mamba branch in the proposed framework, which provides a complementary long-range spatial pathway without the quadratic cost of attention.

2.2. CNN and Transformer Approaches

Purely convolutional architectures such as SimVP [16] demonstrated competitive spatiotemporal prediction without explicit recurrence, using an encoder–translator–decoder pipeline. Transformer-based models address the receptive-field bottleneck of CNNs through self-attention: the Swin Transformer [8] introduced shifted-window attention for hierarchical feature extraction; Swin-UNet [17] combined it with U-Net skip connections; Earthformer [9] adapted cuboid attention to Earth-system data; and ClimaX [4] proposed a foundation model pre-trained on diverse climate variables. At the global scale, FourCastNet [5] applied the Adaptive Fourier Neural Operator to weather prediction at 0.25°, and Pangu-Weather [18] employed a 3D Swin Transformer for medium-range forecasting. For SST-specific applications, ViT-based spatiotemporal models [19] partition ocean fields into patch tokens and apply temporal attention. However, the quadratic memory cost of self-attention and the loss of spatial inductive bias from patch tokenization remain practical challenges at 0.05° resolution, where the token count becomes prohibitively large for full attention. The proposed framework avoids these scaling limitations by using Mamba’s linear-complexity state-space updates instead of self-attention for long-range spatial modeling.

2.3. Selective State-Space Models

Structured SSMs, starting with S4 [20], demonstrated that near-linear-cost layers can capture extremely long-range dependencies. Mamba [10] introduced input-dependent (selective) state-space parameters, matching Transformer performance on language benchmarks while maintaining linear-time complexity. The adaptation to vision tasks has proceeded rapidly: VMamba [11] proposed the Cross-Scan Module (CS2D), traversing 2D feature maps along four directional paths and merging them through concatenation; Vision Mamba (Vim) [21] explored bidirectional scanning for classification; and LocalMamba [12] introduced windowed selective scanning for fine-grained spatial detail. In the time-series domain, S-Mamba [22] and TimeMachine [23] applied selective state spaces to multivariate and multi-scale forecasting. A consistent finding across these studies is that scan-path design critically affects performance, yet all existing vision SSM models process individual images independently without temporal state continuity. The present work extends VMamba’s four-direction scanning to the spatiotemporal forecasting setting by introducing persistent hidden states across time steps (absent in VMamba) and learnable softmax-weighted directional aggregation in place of concatenation.

2.4. Physics-Informed Ocean Forecasting

Physics-informed neural networks (PINNs) [24] embed PDE residuals into the training loss, and the Fourier Neural Operator [25] learns solution mappings between function spaces. In ocean science, Beucler et al. [26] enforced conservation constraints in neural emulators, Ouala et al. [3] proposed physically constrained networks for ocean dynamics, and PANN [27] integrated surface currents and sea-surface height into SST prediction. Spatial-temporal graph neural networks [28] model ocean grids as graph nodes and capture spatial dependencies through graph convolutions for SST forecasting. Two important gaps persist across these studies. First, the discrete implementation of physical operators at land–sea boundaries is rarely specified, leaving gradient-based proxy features undefined at precisely the locations where temperature gradients are strongest. Second, the treatment of auxiliary variables during multi-step autoregressive rollout is almost never documented, creating hidden methodological ambiguity. The present work addresses both gaps with explicit algorithmic definitions: a boundary-aware finite-difference scheme for physical proxy construction and a fully specified auxiliary-variable rollout protocol.

3. Methodology

3.1. Problem Formulation

SST forecasting is formulated as a spatiotemporal sequence-to-sequence prediction problem. Given a historical observation sequence

$$X_{t-L+1:t} = \{x_{t-L+1}, \dots, x_{t}\}, \quad x_{τ} \in \mathbb{R}^{H \times W \times C}, \tag{1}$$

where H and W are spatial dimensions and C is the number of input channels (SST plus auxiliary variables), the objective is to produce the predicted SST sequence

$$\hat{Y}_{t+1:t+T} = \{\hat{y}_{t+1}, \dots, \hat{y}_{t+T}\}, \quad \hat{y}_{τ} \in \mathbb{R}^{H \times W \times 1}, \tag{2}$$

where $\hat{Y}_{t+1:t+T}$ denotes the full predicted sequence and each $\hat{y}_{τ}$ is the predicted single-channel SST field at time step $τ$. This study uses $L = 10$ and $T = 10$ days.

Autoregressive rollout protocol. During the encoding phase ($τ \leq t$), all C channels are observed. During the prediction phase ($τ > t$), the model predicts only single-channel SST. The auxiliary fields $(u, v, S, η)$ are not predicted; instead, they are held fixed at their last observed values (persistence assumption for auxiliary fields):

$$a_{τ} = a_{t}, \quad \forall τ > t, \tag{3}$$

where $a$ denotes the auxiliary variable stack. Physical proxy features ($f_{\mathrm{adv}}$, $f_{\mathrm{lap}}$; Section 3.4) are recomputed at each rollout step using the predicted SST $\hat{y}_{τ}$ and the persisted auxiliary fields, so that the advection term reflects the evolving temperature field even when currents are frozen. It is important to recognize that freezing auxiliary fields over 10 days constitutes a strong modeling assumption, not merely a minor approximation. Physically, this assumption implies that the ocean surface velocity field $(u, v)$, salinity S, and sea surface height $η$ remain constant over the forecast horizon, while only the temperature field evolves. In reality, mesoscale eddies, wind-driven currents, and thermohaline adjustments cause these fields to co-evolve with SST on similar time scales. In particular, holding currents fixed suppresses the SST–circulation feedback: an evolving warm-core eddy would alter the local velocity field, which in turn modifies the advection of temperature, a coupling that the persistence assumption cannot represent. Table 1 provides indirect evidence of the assumption’s limitations: the day-to-day standard deviations of u and v (0.06 and 0.05 m/s) are non-negligible relative to their overall variability (0.15 and 0.12 m/s), indicating that surface currents exhibit meaningful daily evolution that is ignored under persistence. The oracle experiment in Section 5.4 quantifies the practical impact: feeding ground-truth future auxiliaries reduces RMSE by 2.7%, establishing a concrete upper bound on the error attributable to this assumption. Further discussion of this limitation and potential remedies is provided in Section 5.12.

3.2. Overall Framework

The framework (Figure 1) comprises five modules:

Module 1: Grow-and-Cut Cross-Resolution Alignment resolves spatial grid mismatches between the 0.05° SST target and 0.083° reanalysis auxiliary variables through iterative boundary-aware interpolation followed by land–sea mask enforcement.

Module 2: Boundary-Aware Physical Feature Extraction computes gradient-based advection and diffusion proxy features using finite-difference operators that respect the land–sea boundary.

Module 3: ConvLSTM Branch captures local spatiotemporal patterns (short-range temperature gradients and eddy propagation) through convolutional gating with a $3 \times 3$ receptive field per layer. Although the 10-day input window is too short to explicitly resolve the seasonal cycle, the ConvLSTM branch implicitly benefits from seasonally varying patterns learned during training over the full 11-year dataset.

Module 4: Mamba Branch with Cross-Direction Scanning partitions each frame into non-overlapping windows, flattens them into 1D token sequences along four scan paths (inspired by VMamba’s CS2D [11]), and processes them through selective state-space blocks. Unlike VMamba, which processes single images independently, the Mamba hidden states persist across time steps, providing a temporal memory pathway that complements the ConvLSTM branch.

Module 5: Residual Fusion Decoder combines the ConvLSTM and Mamba features through learnable residual projection and produces the final SST prediction. The complete inference procedure is summarized in Algorithm 1.

Algorithm 1: Hybrid Mamba–ConvLSTM Inference
Require: Observed sequence $\{x_{t-L+1}, \dots, x_{t}\}$, forecast horizon T
Ensure: Predicted SST sequence $\{\hat{y}_{t+1}, \dots, \hat{y}_{t+T}\}$
1:	Align auxiliary variables via Grow-and-Cut (Section 3.3)
2:	Store last observed auxiliaries: $a^{*} \leftarrow a_{t}$
3:	Initialize ConvLSTM states $H_{0}^{(l)}, C_{0}^{(l)} = 0$
4:	Initialize Mamba states $s_{0}^{(d,w)} = 0$
5:	for $τ = t-L+1$ to t do
6:	    Compute physical proxies from $(x_{τ}, a_{τ})$
7:	    Construct $F_{\mathrm{phy}}$
8:	    Update ConvLSTM: $H_{τ}^{(l)}, C_{τ}^{(l)}$ ← ConvLSTM(·)
9:	    Scan windows along $\{π_{1}, π_{2}, π_{3}, π_{4}\}$; process via Mamba with state persistence (Equation (16))
10:	end for
11:	for $τ = t+1$ to $t+T$ do
12:	    Set auxiliaries: $a_{τ} \leftarrow a^{*}$ // persistence
13:	    Recompute physical proxies from $(\hat{y}_{τ-1}, a^{*})$
14:	    Update ConvLSTM and Mamba (with persistent states)
15:	    Fuse: $F_{\mathrm{fuse}} = ϕ(F_{\mathrm{conv}} + λ \bar{F}_{\mathrm{mamba}})$
16:	    Predict: $\hat{y}_{τ} = \mathrm{Head}(F_{\mathrm{fuse}})$
17:	end for
18:	return $\{\hat{y}_{t+1}, \dots, \hat{y}_{t+T}\}$

3.3. Grow-and-Cut Cross-Resolution Alignment

Multi-source ocean data products reside on different native grids. The OSTIA SST product uses a 0.05° grid, while GLORYS reanalysis variables are at 0.083°. Direct bilinear interpolation introduces coastal artifacts: (1) land contamination from interpolation kernels straddling the land–sea boundary, and (2) coverage gaps where the finer mask has ocean pixels with no valid lower-resolution-grid neighbors.

A two-stage procedure, referred to here as Grow-and-Cut, is introduced. The name reflects the two sequential operations: “grow” iteratively expands valid-pixel coverage through boundary-aware interpolation to fill coastal gaps where the lower-resolution and higher-resolution masks disagree, and “cut” enforces the target land–sea mask by zeroing all land pixels. This combination ensures that every ocean pixel on the target grid receives a value while preventing land contamination. Let $A_{0.083}$ denote reanalysis variables on the lower-resolution grid and $Ω_{0.05}$ the binary ocean mask on the fine grid (1 for ocean, 0 for land). The aligned result $\tilde{A}$ on the fine grid is obtained by applying the growth operator $\mathcal{G}$ followed by the cut-back operator $\mathcal{C}$:

$$\tilde{A} = \mathcal{C}(\mathcal{G}(A_{0.083}, Ω_{0.05}), Ω_{0.05}), \tag{4}$$

where $\mathcal{G}$ iteratively fills gaps via boundary-aware interpolation (steps 1–3 below) and $\mathcal{C}$ masks out all land pixels by zeroing values outside $Ω_{0.05}$.

The growth operator $\mathcal{G}(\cdot)$ and the subsequent cut-back proceed as follows:

1. Initial interpolation. Apply bilinear interpolation to map $A_{0.083}$ onto the 0.05° grid, producing $A^{(0)}$. Mark pixels with valid output as “filled.”
2. Iterative boundary growth. For each unfilled ocean pixel $p \in Ω_{0.05}$ that has at least one filled 8-connected neighbor (the set of filled neighbors is denoted $N_{8}(p)$), compute the inverse-distance weighted average

$$A^{(k+1)}(p) = \frac{\sum_{q \in N_{8}(p)} w_{pq} A^{(k)}(q)}{\sum_{q \in N_{8}(p)} w_{pq}}, \qquad w_{pq} = \frac{1}{\| p - q \|_{2} + ϵ}, \tag{5}$$

with $ϵ = 10^{-8}$. Here $A^{(k)}(q)$ denotes the value at pixel q after iteration k, and $A^{(0)}$ is the result of the initial bilinear interpolation from step 1. The iterative process is used because standard bilinear interpolation leaves unfilled ocean pixels near coastlines where the lower-resolution grid has no valid neighbors; each growth round propagates values from already-filled pixels into adjacent gaps.
3. Repetition. Repeat for $K_{g} = 2$ rounds. The choice of $K_{g} = 2$ is determined by the maximum coastal gap width between GLORYS and OSTIA masks in the study domain. Because GLORYS (0.083°) has a coarser coastline than OSTIA (0.05°), some OSTIA ocean pixels near the coast fall outside the GLORYS ocean mask. Empirical inspection of the South China Sea domain shows that the maximum gap between the two coastline masks is 2 pixels on the 0.05° grid; each growth round propagates values by exactly one pixel, so two rounds suffice to reach all gap pixels. The sensitivity analysis (Section 5.8) confirms that nothing is gained beyond $K_{g} = 2$: $K_{g} = 3$ and $K_{g} = 4$ yield essentially the same RMSE (0.513 °C vs. 0.512 °C), while $K_{g} = 0$ (no growth) degrades RMSE to 0.527 °C due to unfilled coastal ocean pixels.
4. Cut-back. Zero out all pixels outside $Ω_{0.05}$.
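The growth and cut-back steps listed above can be illustrated with a short NumPy sketch. It is a hedged illustration rather than the authors' code: the function name `grow_and_cut` is hypothetical, the bilinear step is assumed to have been done already (its result `a0` and validity mask `filled` are inputs), distances are measured in pixel units, and any filled pixel is allowed to act as a neighbor.

```python
import numpy as np

def grow_and_cut(a0, filled, ocean, rounds=2, eps=1e-8):
    """Grow valid coverage by inverse-distance averaging, then cut to the ocean mask.

    a0     : (H, W) field after bilinear interpolation (values at unfilled pixels ignored).
    filled : (H, W) bool, True where the interpolation produced a valid value.
    ocean  : (H, W) bool target mask (True for ocean, False for land).
    """
    a = np.where(filled, a0, 0.0).astype(float)
    filled = filled.copy()
    H, W = a.shape
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    for _ in range(rounds):
        new_a, new_filled = a.copy(), filled.copy()
        # Only unfilled ocean pixels are candidates; reads use the previous round's state.
        for i, j in zip(*np.nonzero(ocean & ~filled)):
            num = den = 0.0
            for di, dj in offsets:
                qi, qj = i + di, j + dj
                if 0 <= qi < H and 0 <= qj < W and filled[qi, qj]:
                    w = 1.0 / (np.hypot(di, dj) + eps)  # weight of Equation (5)
                    num += w * a[qi, qj]
                    den += w
            if den > 0:
                new_a[i, j] = num / den
                new_filled[i, j] = True
        a, filled = new_a, new_filled
    return np.where(ocean, a, 0.0)  # cut-back: zero everything outside the ocean mask

# One coastal row: two valid pixels, a two-pixel gap, then land.
a0 = np.array([[1.0, 2.0, np.nan, np.nan, np.nan]])
filled = np.array([[True, True, False, False, False]])
ocean = np.array([[True, True, True, True, False]])
print(grow_and_cut(a0, filled, ocean, rounds=2))  # [[1. 2. 2. 2. 0.]]
```

Because each round reads only the previous round's filled mask, values advance by one pixel per round, so the two-pixel gap in the toy row needs exactly two rounds, matching the argument given for $K_{g} = 2$.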

Resolution mismatch caveat. It is important to note that the Grow-and-Cut procedure performs statistical interpolation, not physical downscaling. Mapping 0.083° GLORYS fields to a 0.05° grid does not recover sub-0.083° physical structures that the reanalysis does not resolve; the interpolated auxiliary fields are smooth approximations that lack fine-scale features present in the 0.05° SST target. In particular, the interpolated velocity fields may exhibit excessive spatial smoothness near coastlines and across sharp fronts, potentially underestimating local advection gradients. This resolution mismatch is an inherent limitation of combining multi-source products at different native resolutions and should be considered when interpreting the contribution of auxiliary variables to forecast skill.

3.4. Boundary-Aware Physical Proxy Construction

Ocean SST evolution is governed by the advection–diffusion equation:

$$\frac{\partial T}{\partial t} + u \frac{\partial T}{\partial x} + v \frac{\partial T}{\partial y} = κ_{\mathrm{eff}} \left( \frac{\partial^{2} T}{\partial x^{2}} + \frac{\partial^{2} T}{\partial y^{2}} \right) + Q, \tag{6}$$

where $κ_{\mathrm{eff}}$ is an effective eddy diffusivity and Q lumps surface heat flux and other sources.

Rather than prescribing a fixed $κ_{\mathrm{eff}}$ (whose appropriate value varies from $O(10^{1})$ to $O(10^{3})\ \mathrm{m^{2}\,s^{-1}}$ depending on the resolved eddy scale [3]), the underlying spatial operators are extracted as unnormalized proxy features, and the subsequent convolutional layer is left to learn the appropriate scaling and combination:

$$g_{x} = \partial_{x} T, \quad g_{y} = \partial_{y} T, \tag{7}$$

$$f_{\mathrm{adv}} = u g_{x} + v g_{y}, \tag{8}$$

$$f_{\mathrm{lap}} = \partial_{xx} T + \partial_{yy} T. \tag{9}$$

The advection term $f_{\mathrm{adv}}$ encodes the rate of temperature change due to current-driven transport, and the Laplacian $f_{\mathrm{lap}}$ encodes the tendency due to horizontal mixing. This formulation avoids committing to a particular diffusivity value while providing the network with physically meaningful spatial structure.

Interior ocean points. For pixels $(i, j)$ where all four cardinal neighbors are valid ocean pixels under binary mask M, second-order central differences are used:

$$\partial_{x} T_{i,j} = \frac{T_{i,j+1} - T_{i,j-1}}{2 Δx}, \quad \partial_{y} T_{i,j} = \frac{T_{i+1,j} - T_{i-1,j}}{2 Δy}, \tag{10}$$

$$\partial_{xx} T_{i,j} = \frac{T_{i,j+1} - 2 T_{i,j} + T_{i,j-1}}{Δx^{2}}, \quad \partial_{yy} T_{i,j} = \frac{T_{i+1,j} - 2 T_{i,j} + T_{i-1,j}}{Δy^{2}}, \tag{11}$$

where $Δx, Δy$ are local grid spacings computed from the 0.05° grid using latitude-dependent scaling.
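The text does not spell out the latitude-dependent scaling, so the following is only one plausible reading: a spherical-Earth approximation in which the meridional spacing is constant and the zonal spacing shrinks with the cosine of latitude. The function name `grid_spacing` and the radius constant are assumptions introduced for this sketch.

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean Earth radius; spherical approximation (assumption)

def grid_spacing(lat_deg, res_deg=0.05):
    """Return (dx, dy) in metres for a regular lat-lon grid at the given latitudes."""
    lat = np.deg2rad(np.asarray(lat_deg, dtype=float))
    step = np.deg2rad(res_deg)
    dy = np.full_like(lat, EARTH_RADIUS_M * step)  # meridional spacing, constant
    dx = EARTH_RADIUS_M * np.cos(lat) * step       # zonal spacing shrinks with latitude
    return dx, dy

dx, dy = grid_spacing(np.array([0.0, 10.0, 20.0]))
print(np.round(dx), np.round(dy))
```

Under this approximation a 0.05° cell is roughly 5.56 km in the meridional direction, and the zonal spacing at a given row can be broadcast as an (H, 1) array when it is used in the finite-difference formulas.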

Boundary ocean points. At pixels adjacent to land ($M = 0$ for one or more cardinal neighbors), central differences are replaced by one-sided differences using the nearest valid ocean neighbor:

$$\partial_{x} T_{i,j} = \begin{cases} \dfrac{T_{i,j+1} - T_{i,j}}{Δx}, & M_{i,j-1} = 0,\ M_{i,j+1} = 1, \\ \dfrac{T_{i,j} - T_{i,j-1}}{Δx}, & M_{i,j+1} = 0,\ M_{i,j-1} = 1, \\ 0, & \text{both neighbors are land}. \end{cases} \tag{12}$$

Analogous rules apply to $\partial_{y} T$. For second derivatives at boundary points where only one neighbor is valid, the Laplacian is set to zero, as a second-order derivative cannot be meaningfully estimated from a single neighbor.
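The interior and boundary rules above can be combined into a compact masked finite-difference routine. The sketch below is illustrative only: the helper names are hypothetical, pixels outside the array are treated as land, axis 1 is taken as x (index j) and axis 0 as y (index i) in line with Equation (10), and the Laplacian is zeroed at every pixel that lacks a full set of four valid cardinal neighbors, which is one reading of the boundary rule.

```python
import numpy as np

def _shift(a, step, axis, fill):
    """Return out with out[i] = a[i + step] along `axis`; out-of-range entries get `fill`."""
    out = np.full_like(a, fill)
    src, dst = [slice(None)] * a.ndim, [slice(None)] * a.ndim
    if step > 0:
        src[axis], dst[axis] = slice(step, None), slice(None, -step)
    else:
        src[axis], dst[axis] = slice(None, step), slice(-step, None)
    out[tuple(dst)] = a[tuple(src)]
    return out

def masked_grad(T, M, d, axis):
    """Central difference where both neighbors are ocean, one-sided otherwise, else 0."""
    T = np.where(M, np.asarray(T, dtype=float), 0.0)
    Tp, Tm = _shift(T, 1, axis, 0.0), _shift(T, -1, axis, 0.0)
    Mp, Mm = _shift(M, 1, axis, False), _shift(M, -1, axis, False)
    g = np.where(Mp & Mm, (Tp - Tm) / (2 * d),
                 np.where(Mp, (Tp - T) / d,
                          np.where(Mm, (T - Tm) / d, 0.0)))
    return np.where(M, g, 0.0)  # land pixels carry no gradient

def masked_laplacian(T, M, dy, dx):
    """Five-point Laplacian at interior ocean pixels; zero at boundary and land pixels."""
    T = np.where(M, np.asarray(T, dtype=float), 0.0)
    interior = M.copy()
    lap = np.zeros_like(T)
    for axis, d in ((0, dy), (1, dx)):
        interior &= _shift(M, 1, axis, False) & _shift(M, -1, axis, False)
        lap += (_shift(T, 1, axis, 0.0) - 2.0 * T + _shift(T, -1, axis, 0.0)) / d**2
    return np.where(interior, lap, 0.0)

# Toy field T[i, j] = j^2 with a single land pixel in the centre, unit grid spacing.
T = np.tile(np.arange(5, dtype=float) ** 2, (5, 1))
M = np.ones((5, 5), dtype=bool)
M[2, 2] = False
gx = masked_grad(T, M, 1.0, axis=1)
gy = masked_grad(T, M, 1.0, axis=0)
u, v = np.ones_like(T), np.zeros_like(T)  # placeholder currents
f_adv = u * gx + v * gy                   # Equation (8)
f_lap = masked_laplacian(T, M, 1.0, 1.0)  # Equation (9) with zeroed boundary pixels
print(gx[2])  # [1. 1. 0. 7. 7.]
```

In the toy row through the land pixel, the ocean pixels on either side switch to backward and forward differences respectively, and the Laplacian survives only at the four pixels whose complete cross stencil lies in the ocean.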

Discussion of boundary treatment. Setting the Laplacian to zero at boundary pixels is a conservative choice that merits explicit justification. Coastal zones are precisely the regions where temperature gradients tend to be strongest, due to land–sea thermal contrast, tidal mixing, and upwelling. Setting $f_{\mathrm{lap}} = 0$ at these locations effectively removes the diffusion proxy signal where it could be most informative. Alternative boundary treatments exist: (1) one-sided second differences ($\partial_{xx} T \approx (T_{i,j+2} - 2 T_{i,j+1} + T_{i,j}) / Δx^{2}$), which preserve a diffusion estimate but require two valid neighbors in the same direction and reduce to first-order accuracy; (2) ghost-cell extrapolation, which assigns synthetic values to land pixels (e.g., by mirroring or constant extension) to maintain the central-difference stencil, but introduces assumptions about the land-side thermal state; and (3) zero-Neumann conditions ($\partial T / \partial n = 0$ at the boundary), which assume no heat flux across the coastline and allow interior stencil application up to the boundary. The zero-Laplacian approach is adopted here because it avoids introducing physically unverifiable assumptions about the near-coast thermal field, and because the subsequent convolutional layers (with $3 \times 3$ kernels) can implicitly learn local boundary corrections from the remaining valid features. The ablation study shows that the Laplacian proxy contributes a 0.8% RMSE reduction overall (Section 5); the contribution in near-coast regions specifically is expected to be smaller due to this conservative boundary treatment.

The proxy features are concatenated with SST and aligned auxiliary variables and processed through a $3 \times 3$ convolutional layer:

$$F_{\mathrm{phy}} = \mathrm{ReLU}\left(\mathrm{Conv}_{3 \times 3}([T, \tilde{A}, f_{\mathrm{adv}}, f_{\mathrm{lap}}])\right). \tag{13}$$

Because the convolutional weights are learned end-to-end, the network implicitly determines the effective diffusivity weighting that best reduces the forecast loss.

3.5. Cross-Direction Window Scanning for Mamba

The Mamba selective state-space model operates on 1D token sequences. When applied to 2D spatial data, the flattening order (scan path) determines which spatial neighbors remain adjacent in the 1D sequence. A single raster scan preserves row-wise adjacency but breaks column-wise continuity, potentially disrupting vertically oriented features such as meridional temperature gradients.

The proposed scanning strategy adopts the four-direction scheme introduced by VMamba’s CS2D module [11]—row-major forward, row-major reverse, column-major forward, and column-major reverse—and extends it with two modifications specific to spatiotemporal forecasting:

Temporally persistent Mamba states. In VMamba, each image is processed independently and hidden states are re-initialized per sample. In the proposed framework, the Mamba hidden state persists across successive time steps within each sample (detailed in Section 3.6), enabling the spatial scan to accumulate temporal information.
Learnable softmax-weighted directional aggregation. VMamba merges directional outputs by concatenation, doubling the channel dimension. Instead, aggregation is performed through learnable attention weights (Equation (19)), maintaining the original channel dimension and allowing the model to dynamically balance directional contributions.

Let $F \in \mathbb{R}^{B \times C \times H \times W}$ denote the feature maps at one time step, where B is the batch size, C the channel dimension, and $H \times W$ the spatial resolution. $F$ is partitioned into non-overlapping windows of size $P \times P$, yielding $N_{w} = (H / P) \times (W / P)$ windows. The four scan paths are bijective mappings from 2D window coordinates $(r, c)$, with $r, c \in \{0, 1, \dots, P - 1\}$, to 1D positions:

$π_{1}$: $(r, c) \mapsto rP + c$ (row-forward);
$π_{2}$: $(r, c) \mapsto (P^{2} - 1) - (rP + c)$ (row-reverse);
$π_{3}$: $(r, c) \mapsto cP + r$ (column-forward);
$π_{4}$: $(r, c) \mapsto (P^{2} - 1) - (cP + r)$ (column-reverse).

For each direction d, the window tokens are arranged into $z^{π_{d}} \in \mathbb{R}^{P^{2} \times C}$. Each directional sequence is processed independently by a shared Mamba block (Section 3.6).
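The window partition and the four scan paths defined above translate directly into index arithmetic. The following NumPy sketch is a hedged illustration: the function names are hypothetical, tokens inside a window are assumed to start in row-major order, and H and W are assumed to be divisible by P.

```python
import numpy as np

def partition_windows(F, P):
    """(B, C, H, W) -> (B, N_w, P*P, C), tokens in row-major order inside each window."""
    B, C, H, W = F.shape
    x = F.reshape(B, C, H // P, P, W // P, P)
    return x.transpose(0, 2, 4, 3, 5, 1).reshape(B, (H // P) * (W // P), P * P, C)

def scan_positions(P):
    """pos[d, n] = position under pi_{d+1} of the token with row-major index n = r*P + c."""
    r, c = np.divmod(np.arange(P * P), P)
    last = P * P - 1
    return np.stack([r * P + c, last - (r * P + c), c * P + r, last - (c * P + r)])

def scan(tokens, pos_d):
    """Forward rearrangement: place row-major tokens (..., P*P, C) at their scan positions."""
    seq = np.empty_like(tokens)
    seq[..., pos_d, :] = tokens
    return seq

def unscan(seq, pos_d):
    """Inverse rearrangement: gather tokens back into row-major order."""
    return seq[..., pos_d, :]

# A 4x4 single-channel map split into 2x2 windows.
F = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
windows = partition_windows(F, 2)
print(windows[0, 0, :, 0])  # [0. 1. 4. 5.]

# Visiting order of a 3x3 window under the column-forward path pi_3.
pos = scan_positions(3)
tokens = np.arange(9).reshape(9, 1)
print(scan(tokens, pos[2])[:, 0])  # [0 3 6 1 4 7 2 5 8]
```

Since each row of `pos` is a permutation of the token indices, `unscan(scan(x, pos[d]), pos[d])` returns `x` for every direction, which is the bijectivity property the scan paths rely on for the inverse mapping.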

3.6. Mamba Block with Temporal State Persistence

The Mamba block applies selective state-space updates to each directional token sequence. Within a single time step $τ$, the SSM processes spatial tokens sequentially along the scan path, capturing long-range spatial dependencies. Crucially, the hidden state at the end of one time step’s scan is carried forward as the initial state for the next time step’s scan (Equation (16)). The within-scan recurrence is:

$s_{k}^{(d, w)} = A (z_{k}) s_{k - 1}^{(d, w)} + B (z_{k}) z_{k},$

(14)

$o_{k}^{(d, w)} = C (z_{k}) s_{k}^{(d, w)},$

(15)

where Equation (14) is the state update that recursively integrates the current input token $z_{k}$ into the hidden state $s_{k}^{(d, w)}$ via the input-dependent transition matrix $A$ and input matrix $B$, and Equation (15) defines the SSM readout $o_{k}^{(d, w)}$, which linearly projects the updated hidden state back to the token space via the output matrix $C$. Here $k$ indexes the position within the scan, $d$ indexes the scan direction, $w$ indexes the window, and $s^{(d, w)} \in \mathbb{R}^{d_{s}}$ is the hidden state with state dimension $d_{s}$. The state-space parameters $A (\cdot)$, $B (\cdot)$, and $C (\cdot)$ are input-dependent: $A (z_{k}) = W_{A} z_{k} + b_{A}$, $B (z_{k}) = W_{B} z_{k} + b_{B}$, and $C (z_{k}) = W_{C} z_{k} + b_{C}$, where $W_{A}, W_{B}, W_{C}$ and $b_{A}, b_{B}, b_{C}$ are learnable linear projection weights and biases, respectively. This input-dependent (selective) parameterization distinguishes Mamba from classical SSMs with fixed state-space matrices and allows the model to adapt its state transitions to the local input context.

Temporal persistence rule. Let $s_{τ, \mathrm{end}}^{(d, w)}$ denote the final hidden state after processing all $P^{2}$ tokens of window $w$ along direction $d$ at time step $τ$. At the next time step $τ + 1$, the initial hidden state is set to

$s_{τ + 1, 0}^{(d, w)} = s_{τ, \mathrm{end}}^{(d, w)} .$

(16)

For the first time step ($τ = t - L + 1$), all states are zero-initialized: $s_{t - L + 1, 0}^{(d, w)} = 0$. This persistence mechanism endows the Mamba branch with an explicit temporal memory pathway that is distinct from and complementary to the ConvLSTM cell state.
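
The persistence rule can be illustrated with a deliberately reduced sketch of Equations (14)–(16). It assumes scalar tokens and a scalar hidden state, so the projections $W_{A}, W_{B}, W_{C}$ collapse to single numbers. The function names and parameter values are hypothetical and chosen only for illustration.

```python
def ssm_scan(tokens, s0, p):
    """Scalar selective scan: s_k = A(z_k) s_{k-1} + B(z_k) z_k, o_k = C(z_k) s_k."""
    s, outputs = s0, []
    for z in tokens:
        a = p["wA"] * z + p["bA"]  # input-dependent transition, Eq. (14)
        b = p["wB"] * z + p["bB"]  # input-dependent input gain, Eq. (14)
        c = p["wC"] * z + p["bC"]  # input-dependent readout, Eq. (15)
        s = a * s + b * z
        outputs.append(c * s)
    return outputs, s


def run_time_steps(frames, p, persist=True):
    """Scan one token sequence per time step; optionally carry the state (Eq. (16))."""
    s, all_outputs = 0.0, []  # zero initialization at the first time step
    for tokens in frames:
        s0 = s if persist else 0.0
        out, s = ssm_scan(tokens, s0, p)
        all_outputs.append(out)
    return all_outputs


# Illustrative parameters and two time steps of two tokens each (not from the paper).
params = {"wA": 0.1, "bA": 0.5, "wB": 0.2, "bB": 1.0, "wC": 0.3, "bC": 1.0}
frames = [[1.0, 2.0], [3.0, 4.0]]
with_persistence = run_time_steps(frames, params, persist=True)
with_reset = run_time_steps(frames, params, persist=False)
```

With persistence enabled, scanning the two time steps in turn is equivalent to one scan over the concatenated token sequence. With the reset variant, the second time step sees nothing of the first, which corresponds to the ablation that re-initializes the Mamba states.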

Role clarification. The Mamba branch is best understood as a long-range spatial encoder with implicit temporal continuity. Within each time step, the SSM scans across the full spatial extent of each window, capturing dependencies that span distances far beyond the $3 \times 3$ ConvLSTM kernel. Across time steps, the persistent state accumulates a running summary of the spatiotemporal evolution along each scan path, enabling the model to detect slow-evolving or teleconnected patterns. The ConvLSTM branch, by contrast, provides explicit temporal gating with local spatial context. The two branches are thus complementary: ConvLSTM excels at local spatiotemporal dynamics, while Mamba excels at long-range spatial structure with implicit temporal memory.

A multiplicative gate modulates the SSM output before passing it to downstream layers:

${\tilde{o}}_{k} = o_{k} ⊙ \mathrm{SiLU} (\mathrm{Linear} (z_{k})),$

(17)

where $o_{k}$ is the SSM readout from Equation (15), ${\tilde{o}}_{k}$ is the gated output, SiLU denotes the Sigmoid Linear Unit activation $\mathrm{SiLU} (x) = x \cdot \mathrm{sigmoid} (x)$, and ⊙ is element-wise (Hadamard) multiplication. The linear projection of $z_{k}$ produces a gate vector that selectively scales each feature dimension of the SSM output; SiLU is chosen following the original Mamba architecture [10] for its smooth gating properties that facilitate gradient flow. The state dimension is $d_{s} = 16$ and the model dimension is $d_{m} = 128$.

3.7. Rearrangement Operator and Spatial Restoration

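After Mamba processing, the output sequence for each direction $d$ must be mapped back to 2D. Because each $π_{d}$ is a bijection from window coordinates to sequence positions, the scan is inverted by returning the token at sequence position $π_{d} (r, c)$ to coordinate $(r, c)$:

$O_{π_{d}}^{\mathrm{map}} (r, c, :) = O (π_{d} (r, c), :), \quad d \in \{1, 2, 3, 4\} .$

(18)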

The four directional maps are aggregated with learnable weights:

${\bar{O}}^{\mathrm{map}} = \sum_{d = 1}^{4} α_{d} O_{π_{d}}^{\mathrm{map}}, \quad α_{d} = \frac{\exp (w_{d})}{\sum_{d'} \exp (w_{d'})},$

(19)

where $w_{d}$ are trainable scalars initialized to zero (uniform initial weights $α_{d} = 0.25$). The aggregated map is restored to the full spatial layout:

${\bar{F}}_{\mathrm{mamba}} = ψ (\mathrm{Fold}_{P \times P} ({\bar{O}}^{\mathrm{map}})),$

(20)

where $\mathrm{Fold}_{P \times P}$ rearranges windows back into $H \times W$ and $ψ$ is a $1 \times 1$ convolution projecting to match ConvLSTM channels.
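
The restoration and aggregation steps of this subsection can be sketched for a single window as follows. This is an illustrative simplification, not the authors' code: tokens are scalars, an identity map stands in for the Mamba block, the $1 \times 1$ convolution $ψ$ is omitted, and all function names are hypothetical.

```python
import math


def softmax(ws):
    m = max(ws)  # subtract the maximum for numerical stability
    exps = [math.exp(w - m) for w in ws]
    total = sum(exps)
    return [e / total for e in exps]


def scan_paths(P):
    """The four scan maps {(r, c): 1D position}, in the order d = 1..4."""
    last = P * P - 1
    cells = [(r, c) for r in range(P) for c in range(P)]
    return [
        {(r, c): r * P + c for r, c in cells},
        {(r, c): last - (r * P + c) for r, c in cells},
        {(r, c): c * P + r for r, c in cells},
        {(r, c): last - (c * P + r) for r, c in cells},
    ]


def restore(seq, path, P):
    """Read the value for coordinate (r, c) from sequence position path[(r, c)]."""
    return [[seq[path[(r, c)]] for c in range(P)] for r in range(P)]


def aggregate(maps, w):
    """Softmax-weighted sum of the directional maps (Eq. (19))."""
    alphas = softmax(w)
    P = len(maps[0])
    return [[sum(a * m[r][c] for a, m in zip(alphas, maps)) for c in range(P)]
            for r in range(P)]


P = 2
window = [[1.0, 2.0], [3.0, 4.0]]
paths = scan_paths(P)
sequences = []
for path in paths:  # forward arrangement; identity stands in for the Mamba block
    seq = [0.0] * (P * P)
    for (r, c), k in path.items():
        seq[k] = window[r][c]
    sequences.append(seq)
maps = [restore(seq, path, P) for seq, path in zip(sequences, paths)]
fused = aggregate(maps, [0.0, 0.0, 0.0, 0.0])  # zero-initialized w_d
```

With zero-initialized weights the softmax yields 0.25 for each direction, and because the aggregation is a weighted sum the channel dimension is unchanged.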

3.8. ConvLSTM Branch

The ConvLSTM branch provides explicit temporal gating with local spatial context. A three-layer stacked ConvLSTM with hidden dimension $C_{h} = 64$ and $3 \times 3$ kernels processes the augmented input sequence. At each time step $τ$ and layer $l$, the hidden state $H_{τ}^{(l)}$ and cell state $C_{τ}^{(l)}$ are updated as:

$H_{τ}^{(l)}, C_{τ}^{(l)} = \mathrm{ConvLSTM}^{(l)} (X_{τ}^{(l)}, H_{τ - 1}^{(l)}, C_{τ - 1}^{(l)}),$

(21)

where $X_{τ}^{(l)}$ is the input to layer $l$ (for $l = 1$ this is $F_{\mathrm{phy}}$; for $l > 1$ it is $H_{τ}^{(l - 1)}$). The standard gating equations are:

$i_{τ} = σ (W_{xi} * X_{τ} + W_{hi} * H_{τ - 1} + b_{i}),$

(22)

$f_{τ} = σ (W_{xf} * X_{τ} + W_{hf} * H_{τ - 1} + b_{f}),$

(23)

$g_{τ} = \tanh (W_{xg} * X_{τ} + W_{hg} * H_{τ - 1} + b_{g}),$

(24)

$o_{τ} = σ (W_{xo} * X_{τ} + W_{ho} * H_{τ - 1} + b_{o}),$

(25)

$C_{τ} = f_{τ} ⊙ C_{τ - 1} + i_{τ} ⊙ g_{τ},$

(26)

$H_{τ} = o_{τ} ⊙ \tanh (C_{τ}),$

(27)

where $i_{τ}$, $f_{τ}$, $g_{τ}$, and $o_{τ}$ are the input, forget, candidate, and output gates, respectively; $σ$ is the sigmoid function; $*$ denotes convolution; and $W_{(\cdot)}$, $b_{(\cdot)}$ are learnable convolutional kernels and biases. The final hidden state $H_{τ}^{(3)}$ is projected to $F_{\mathrm{conv}} \in \mathbb{R}^{B \times C_{f} \times H \times W}$ via a $1 \times 1$ convolution, where $C_{f}$ is the fusion channel dimension.

3.9. Feature Fusion and Prediction Head

The ConvLSTM and Mamba branch outputs are fused through a learnable residual projection:

$F_{\mathrm{fuse}} = ϕ (F_{\mathrm{conv}} + λ {\bar{F}}_{\mathrm{mamba}}),$

(28)

where $F_{\mathrm{conv}}$ is the ConvLSTM branch output (Section 3.8), ${\bar{F}}_{\mathrm{mamba}}$ is the spatially restored Mamba branch output (Section 3.7), $λ$ is a learnable scalar initialized to $λ_{0} = 0.1$, and $ϕ$ is a two-layer convolutional block with batch normalization and ReLU. The small initial value $λ_{0} = 0.1$ allows the ConvLSTM branch to dominate in early training, when the Mamba branch has not yet learned meaningful spatial representations; $λ$ increases during training as the Mamba branch matures. The prediction head maps the fused features to a single-channel SST field,

${\hat{y}}_{τ} = \mathrm{Conv}_{1 \times 1} (\mathrm{ReLU} (\mathrm{BN} (\mathrm{Conv}_{3 \times 3} (F_{\mathrm{fuse}})))),$

(29)

where BN denotes batch normalization.

3.10. Training Strategy: Scheduled Sampling

A well-known issue in autoregressive training is the discrepancy between teacher forcing (ground-truth inputs) used during training and model predictions used during inference. To bridge this gap, scheduled sampling [29] is employed: at each prediction step $τ > t$, the model receives the ground-truth frame $y_{τ}$ with probability $p_{tf}$ and its own prediction ${\hat{y}}_{τ}$ with probability $1 - p_{tf}$. The teacher-forcing probability is linearly annealed from $p_{tf} = 1.0$ to $p_{tf} = 0.0$ over the first 20 training epochs; for the remaining epochs, the model trains in fully autoregressive mode, matching the inference protocol.

This strategy is applied identically to all baselines in the comparison to ensure fair evaluation.
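
The annealing schedule described above can be written compactly. The sketch below assumes 0-indexed epochs and per-epoch (rather than per-iteration) updates of $p_{tf}$, neither of which is specified in the text. The function names are hypothetical.

```python
import random


def teacher_forcing_prob(epoch, anneal_epochs=20):
    """Linear decay from 1.0 at epoch 0 to 0.0 at `anneal_epochs`, then 0.0."""
    return max(0.0, 1.0 - epoch / anneal_epochs)


def choose_input(truth, prediction, p_tf, rng=random):
    """Feed the ground-truth frame with probability p_tf, else the model output."""
    return truth if rng.random() < p_tf else prediction


schedule = [teacher_forcing_prob(e) for e in range(50)]
```

From epoch 20 onward the probability is exactly zero, so every rollout step consumes the model's own prediction, as at inference time.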

3.11. Training Objective

A composite loss is employed:

$\mathcal{L} = λ_{\mathrm{MSE}} \mathcal{L}_{\mathrm{MSE}} + λ_{\mathrm{SSIM}} \mathcal{L}_{\mathrm{SSIM}},$

(30)

$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T} \sum_{τ = 1}^{T} {∥ {\hat{y}}_{t + τ} - y_{t + τ} ∥}_{2}^{2},$

(31)

$\mathcal{L}_{\mathrm{SSIM}} = \frac{1}{T} \sum_{τ = 1}^{T} (1 - \mathrm{SSIM} ({\hat{y}}_{t + τ}, y_{t + τ})) .$

(32)

The SSIM term [30] encourages preservation of local luminance, contrast, and structural patterns, which in the SST context correspond to thermal gradient magnitude, variance of mesoscale features, and frontal spatial structure. While the SSIM was originally designed for natural image quality assessment, its three-component decomposition aligns with physically meaningful properties of SST fields. Nevertheless, pixel-wise metrics (RMSE, MAE) and an operational oceanographic metric (ACC, defined in Section 4.6) are also included to provide a balanced evaluation. Default weights are $λ_{\mathrm{MSE}} = 0.8$, $λ_{\mathrm{SSIM}} = 0.2$.
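
A rough sketch of the composite objective in Equations (30)–(32) follows. It makes one loud simplification: SSIM is computed from whole-field statistics (a single window) rather than with a sliding Gaussian window, so the values will differ from a standard SSIM implementation. The stabilizing constants follow the usual SSIM defaults ($K_{1} = 0.01$, $K_{2} = 0.03$), and the function names and the `data_range` parameter are assumptions.

```python
def ssim_global(x, y, data_range=1.0):
    """SSIM from whole-field means, variances and covariance (single window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def composite_loss(preds, targets, w_mse=0.8, w_ssim=0.2, data_range=1.0):
    """0.8 * L_MSE + 0.2 * L_SSIM, averaged over the T forecast steps."""
    T = len(preds)
    # Eq. (31): squared L2 norm per step (sum over pixels), mean over steps.
    mse = sum(sum((p - t) ** 2 for p, t in zip(ph, th))
              for ph, th in zip(preds, targets)) / T
    # Eq. (32): mean of (1 - SSIM) over steps.
    dssim = sum(1.0 - ssim_global(ph, th, data_range)
                for ph, th in zip(preds, targets)) / T
    return w_mse * mse + w_ssim * dssim


# Toy flattened fields for T = 2 steps (illustrative values only).
same = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6]]
loss_identical = composite_loss(same, same)
loss_different = composite_loss([[1.0, 2.0]], [[1.0, 4.0]])
```

A perfect forecast gives a loss of zero for both terms. Any mismatch is penalized through the squared error and, separately, through the loss of structural similarity.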

4. Dataset and Experimental Setup

4.1. Study Area and Data Sources

The study focuses on the South China Sea (SCS) bounded by 112.025° E–118.425° E and 12.025° N–18.425° N (Figure 2). This region exhibits rich mesoscale dynamics, including the Luzon Strait intrusion, seasonal monsoon-driven circulation reversal, and coastal upwelling, creating a challenging prediction testbed. Although predominantly oceanic, the domain includes substantial coastline from Hainan Island, Luzon, the Vietnamese coast, and numerous smaller islands, making land–sea masking essential for physically consistent gradient computation and prevention of land contamination during cross-resolution interpolation.

Target variable. The target is the OSTIA daily SST product [1] at 0.05° (≈5.6 km) resolution. This resolution is substantially finer than standard reanalysis products (e.g., ERA5 at 0.25°, OISST at 0.25°) and is representative of current operational SST analysis products, though coarser than satellite swath data such as Sentinel-3 SLSTR (∼1 km). The domain yields a $128 \times 128$ grid.

Auxiliary variables. Four GLORYS12V1 reanalysis variables at 0.083° are used: eastward current $u$, northward current $v$, sea surface salinity $S$, and sea surface height $η$.

4.2. Data Statistics and Preprocessing

The dataset spans 1 January 2000 to 31 December 2013 (5114 daily snapshots). This period is determined by the intersection of two constraints: the GLORYS12V1 reanalysis product, which provides the auxiliary variables $(u, v, S, η)$, begins its consistent daily coverage from January 2000; and the OSTIA L4 SST product underwent a major reprocessing update in 2014 that introduced a discontinuity in the analysis system, potentially confounding temporal consistency across training and testing. By restricting the study period to 2000–2013, temporal homogeneity of both data sources is ensured. Extension to more recent years is straightforward once the GLORYS12V1 temporal coverage is confirmed to maintain consistent quality beyond 2013, and is identified for future work (Section 5.12). Table 1 summarizes the statistics.

All variables are normalized to zero mean and unit variance using training-set statistics. Land pixels are set to zero. No temporal detrending or anomaly decomposition is applied.
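
The normalization step can be sketched as below. Two details are assumptions, since the text does not state them: statistics are computed over ocean pixels only, and a single mean and standard deviation are used per variable rather than per pixel. Fields are flattened lists and the function names are hypothetical.

```python
def fit_stats(train_fields, ocean_mask):
    """Mean and standard deviation over ocean pixels of the training fields."""
    vals = [v for field in train_fields
            for v, is_ocean in zip(field, ocean_mask) if is_ocean]
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return mean, std


def normalize(field, ocean_mask, mean, std):
    """Z-score ocean pixels with training statistics; set land pixels to zero."""
    return [(v - mean) / std if is_ocean else 0.0
            for v, is_ocean in zip(field, ocean_mask)]


mask = [True, False, True]          # middle pixel is land
mean, std = fit_stats([[1.0, 99.0, 3.0]], mask)
normalized = normalize([3.0, 50.0, 1.0], mask, mean, std)
```

Validation and test fields reuse the training-set `mean` and `std`, so no information from the held-out years enters the preprocessing.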

ENSO context. The test year 2013 was a near-neutral ENSO year (ONI between −0.5 and +0.5 °C throughout). While this limits evaluation under extreme ENSO conditions, it provides a representative baseline assessment. Multi-year cross-validation across different ENSO phases is identified as important future work (Section 5.12).

4.3. Temporal Split and Protocol

The dataset is divided chronologically: 2000–2010 for training (4018 days), 2011–2012 for validation (731 days), and 2013 for testing (365 days). Each sample consists of 10 input frames mapped to 10 target frames with a stride of 1 day. Testing uses fully autoregressive rollout with auxiliary persistence (Equation (3)). All methods use identical data splits, normalization, autoregressive rollout, and scheduled sampling (Section 3.10).
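
The day counts quoted for each split follow directly from the calendar and can be reproduced with the standard library. The sliding-window counts below are not given in the paper: they assume that every 20-day window (10 input plus 10 target frames) must lie entirely inside one split.

```python
from datetime import date


def n_days(start, end):
    """Number of daily snapshots from `start` to `end`, inclusive."""
    return (end - start).days + 1


def n_windows(days, n_in=10, n_out=10, stride=1):
    """Sliding windows of n_in + n_out frames that fit inside `days` frames."""
    return max(0, (days - (n_in + n_out)) // stride + 1)


splits = {
    "train": (date(2000, 1, 1), date(2010, 12, 31)),
    "val": (date(2011, 1, 1), date(2012, 12, 31)),
    "test": (date(2013, 1, 1), date(2013, 12, 31)),
}
day_counts = {name: n_days(*span) for name, span in splits.items()}
window_counts = {name: n_windows(d) for name, d in day_counts.items()}
```

The three day counts sum to the 5114 snapshots reported for 2000–2013.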

4.4. Baselines

Nine baselines spanning five categories are used for comparison.

As elementary references, Persistence repeats the last observed SST field at all forecast steps (${\hat{y}}_{t + τ} = y_{t}$), providing the minimum skill threshold, while Daily Climatology uses the long-term (2000–2010) daily mean SST for the corresponding day of year (${\hat{y}}_{t + τ} = {\bar{y}}_{\mathrm{doy} (t + τ)}$), capturing the seasonal cycle but no synoptic variability.
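
Both elementary references are simple enough to state in code. The sketch below uses flattened fields as Python lists. The function names and the `(day_of_year, field)` record layout are hypothetical.

```python
from collections import defaultdict


def persistence(last_obs, horizon=10):
    """Repeat the last observed field at every forecast step."""
    return [list(last_obs) for _ in range(horizon)]


def fit_climatology(records):
    """records: iterable of (day_of_year, field). Returns {doy: mean field}."""
    groups = defaultdict(list)
    for doy, field in records:
        groups[doy].append(field)
    return {doy: [sum(px) / len(px) for px in zip(*fields)]
            for doy, fields in groups.items()}


# Toy training records: two years of day 1, one year of day 2.
clim = fit_climatology([(1, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0])])
forecast = persistence([20.5, 21.0], horizon=3)
```

The climatology forecast for a target date is then the stored mean field for that date's day of year.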

Three recurrent models are included: ConvLSTM [6] (3 layers, 64 hidden channels, $3 \times 3$ kernels), PredRNN [7] (3 spatiotemporal memory layers with zigzag transitions), and ConvGRU [14] (3 layers, 64 hidden channels). The CNN-based category is represented by TCTN [16], a SimVP-style temporal-channel transform network. PANN [27] represents the physics-aware category, integrating auxiliary ocean variables into SST prediction. Two attention-based models complete the baseline set: Swin-UNet [17] (U-Net with Swin Transformer blocks) and ViT-ST [19] (Vision Transformer with spatiotemporal patch tokenization).

Baseline fairness. All learning-based baselines are implemented with their originally reported architectural configurations. To ensure fairness, a per-method hyperparameter sweep over learning rate $\{5 \times 10^{-4}, 1 \times 10^{-3}, 2 \times 10^{-3}\}$ and weight decay $\{0, 10^{-5}, 10^{-4}\}$ was conducted on the validation set, and each baseline’s best-performing configuration is reported. The optimizer (Adam), scheduled sampling protocol, and early stopping criteria are shared across all methods.

4.5. Implementation Details

All experiments were conducted on four NVIDIA A100 40GB GPUs using PyTorch 2.1. Training used Adam with $β_{1} = 0.9$, $β_{2} = 0.999$. The initial learning rate and weight decay for each method were selected via the grid search described above. For the proposed model, the best configuration uses a learning rate of $1 \times 10^{-3}$ and weight decay $10^{-5}$. Table 2 summarizes the core hyperparameters. The learning rate is managed by a ReduceLROnPlateau scheduler with a decay factor of 0.5 and patience of 5 epochs. The effective batch size is 128 (32 per GPU across four GPUs). Training runs for a maximum of 50 epochs with early stopping on validation RMSE (patience 10 epochs), and gradients are clipped at a maximum norm of 1.0.

4.6. Evaluation Metrics

Five metrics are used:

$RMSE = \sqrt{\frac{1}{N} \sum_{i} {({\hat{y}}_{i} - y_{i})}^{2}}$ : Average error magnitude, sensitive to outliers.
$MAE = \frac{1}{N} \sum_{i} | {\hat{y}}_{i} - y_{i} |$ : Robust error magnitude.
SSIM [30]: Structural similarity capturing luminance, contrast, and spatial pattern fidelity. An $11 \times 11$ Gaussian window is used, which at 0.05° resolution covers approximately 0.55° × 0.55° (≈60 km), a scale relevant to sub-mesoscale and mesoscale thermal structures.
$R^{2} = 1 - \sum_{i} {({\hat{y}}_{i} - y_{i})}^{2} / \sum_{i} {(y_{i} - \bar{y})}^{2}$ : Explained variance ratio.
Anomaly Correlation Coefficient (ACC):

$ACC = \frac{\sum_{i} ({\hat{y}}_{i} - {\bar{y}}_{i}^{c}) (y_{i} - {\bar{y}}_{i}^{c})}{\sqrt{\sum_{i} {({\hat{y}}_{i} - {\bar{y}}_{i}^{c})}^{2} \sum_{i} {(y_{i} - {\bar{y}}_{i}^{c})}^{2}}},$

(33)

where ${\bar{y}}_{i}^{c}$ is the daily climatological mean at pixel i. ACC measures the correlation between predicted and observed anomalies relative to climatology and is a standard skill metric in operational weather and ocean forecasting [5]. For the daily climatology baseline the forecast anomaly is identically zero, so the ratio in Equation (33) is formally 0/0; its ACC is taken as 0 by convention.

All metrics are computed on ocean pixels only.
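
The pixel-wise metrics and the ACC of Equation (33) can be sketched on flattened fields with an ocean mask. SSIM is omitted here because it needs a windowed implementation. The function names are hypothetical, and returning 0 for a zero denominator in `acc` is an assumed convention.

```python
import math


def _ocean(values, mask):
    """Keep only ocean pixels."""
    return [v for v, is_ocean in zip(values, mask) if is_ocean]


def rmse(pred, obs, mask):
    p, o = _ocean(pred, mask), _ocean(obs, mask)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, o)) / len(o))


def mae(pred, obs, mask):
    p, o = _ocean(pred, mask), _ocean(obs, mask)
    return sum(abs(a - b) for a, b in zip(p, o)) / len(o)


def r2(pred, obs, mask):
    p, o = _ocean(pred, mask), _ocean(obs, mask)
    mean_o = sum(o) / len(o)
    ss_res = sum((a - b) ** 2 for a, b in zip(p, o))
    ss_tot = sum((b - mean_o) ** 2 for b in o)
    return 1.0 - ss_res / ss_tot


def acc(pred, obs, clim, mask):
    """Anomaly correlation relative to the climatological field `clim`."""
    p, o, c = _ocean(pred, mask), _ocean(obs, mask), _ocean(clim, mask)
    num = sum((a - k) * (b - k) for a, b, k in zip(p, o, c))
    den = math.sqrt(sum((a - k) ** 2 for a, k in zip(p, c))
                    * sum((b - k) ** 2 for b, k in zip(o, c)))
    return num / den if den > 0 else 0.0  # zero-anomaly forecast -> 0


# Toy example; the last pixel is land and is ignored.
pred = [1.0, 2.0, 3.0, 9.0]
obs = [1.0, 2.0, 5.0, 0.0]
clim = [2.0, 2.0, 2.0, 2.0]
mask = [True, True, True, False]
scores = (rmse(pred, obs, mask), mae(pred, obs, mask),
          r2(pred, obs, mask), acc(pred, obs, clim, mask))
```

Passing the climatology itself as the forecast yields an ACC of 0 under this convention, which matches the value quoted for the daily climatology baseline.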

5. Experimental Results and Discussion

5.1. Overall Quantitative Comparison

The 10-day average performance of all methods across five evaluation metrics is summarized in Table 3.

Several observations merit discussion. First, all learning-based methods outperform persistence, confirming that the models capture predictive structure beyond simple temporal autocorrelation. The proposed model reduces RMSE by 21.0% relative to persistence (0.512 vs. 0.648 °C), while the weakest learning-based method (ConvGRU) improves by only 4.2% (0.621 vs. 0.648 °C), highlighting the non-trivial nature of the improvement.

Second, all learning-based methods substantially outperform daily climatology (ACC $> 0.82$ vs. 0 for climatology), indicating that the models capture variability beyond the mean seasonal cycle. However, because the test set spans a single near-neutral ENSO year, this result alone does not demonstrate robust capture of interannual variability; multi-year evaluation under diverse ENSO phases would be needed to support such a claim (Section 5.12).

Third, among recurrent baselines, PredRNN outperforms both ConvLSTM and ConvGRU. PANN outperforms PredRNN despite a simpler recurrent backbone, demonstrating the value of auxiliary variables. The attention-based models (Swin-UNet, ViT-ST) achieve the strongest baseline performance.

Fourth, the proposed model further improves upon ViT-ST by 5.0% in RMSE (0.512 vs. 0.539 °C), 6.0% in MAE, 1.6% in SSIM, and 1.8% in ACC, indicating that the combination of ConvLSTM locality and Mamba long-range spatial modeling with temporal persistence provides complementary benefits not captured by attention alone. Notably, this improvement is achieved with 22.6% fewer parameters than ViT-ST (Section 5.7).

5.2. Lead-Time Performance Analysis

To understand how forecast skill degrades with increasing lead time, Table 4 compares the proposed method against persistence and ViT-ST at individual lead times.

A notable finding is that persistence outperforms all learning-based methods at Day 1 (RMSE 0.189 °C vs. 0.226 °C for ours, 0.238 °C for ViT-ST). This is consistent with the low daily SST variability ($σ_{Δ1d} = 0.19$ °C, Table 1): at one-day lead time, the ocean state changes so little that repeating yesterday’s observation is hard to beat. However, persistence skill degrades rapidly with lead time, and all learning-based models surpass it by Day 2–3. The proposed model’s advantage over persistence grows from −19.6% at Day 1 to +32.4% at Day 10, demonstrating that the hybrid architecture is particularly effective at longer horizons where persistence fails.
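
The relative improvements quoted in this section follow a single formula, $100 \times (\mathrm{RMSE}_{\mathrm{ref}} - \mathrm{RMSE}_{\mathrm{model}}) / \mathrm{RMSE}_{\mathrm{ref}}$. The short check below reproduces three of them from the RMSE values given in the text. The function name is hypothetical.

```python
def rel_improvement(ref_rmse, model_rmse):
    """Percentage RMSE reduction relative to a reference; negative means worse."""
    return 100.0 * (ref_rmse - model_rmse) / ref_rmse


day1_vs_persistence = round(rel_improvement(0.189, 0.226), 1)   # Day 1
avg_vs_persistence = round(rel_improvement(0.648, 0.512), 1)    # 10-day average
avg_vs_vit_st = round(rel_improvement(0.539, 0.512), 1)         # 10-day average
```

The negative Day-1 value expresses that the proposed model is worse than persistence at that lead time, while the two 10-day averages reproduce the 21.0% and 5.0% figures from Section 5.1.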

The RMSE growth rate of the proposed model is approximately 0.052 °C per day, compared to 0.055 °C/day for ViT-ST, 0.062 °C/day for ConvLSTM, and 0.094 °C/day for persistence. Figure 3 and Figure 4 further illustrate the absolute RMSE at three representative lead times (Days 1, 5, and 10), highlighting the widening gap between the proposed model and baselines.

5.3. Ablation Study

To isolate the contribution of each component, Table 5 reports the 10-day average performance of the full model and eight ablation variants.

Branch ablations. Removing the Mamba branch (RMSE +7.0%) or the ConvLSTM branch (+8.6%) causes the largest degradations, confirming that neither branch alone matches the combination. The ConvLSTM-only variant (0.548) is close to PANN (0.551), suggesting that without long-range spatial modeling, the model loses its advantage over the physics-aware baseline.

Temporal state persistence. Re-initializing Mamba states to zero at each time step (i.e., removing Equation (16)) increases RMSE to 0.529 °C (+3.3%). This confirms that temporal state persistence provides meaningful temporal memory beyond what ConvLSTM alone captures. Without persistence, the Mamba branch performs purely spatial modeling; with persistence, it accumulates temporal context that improves multi-step consistency.

Physical proxies. Removing all physical proxies increases RMSE by 3.7%. Decomposing this effect, the advection proxy contributes more (removing it alone: +2.1%) than the Laplacian proxy (+0.8%). This is physically expected: at 0.05° resolution and with daily time steps, horizontal advection by mesoscale currents is the dominant transport mechanism, while the effective diffusive contribution at this resolution is relatively small. The Laplacian proxy nevertheless provides a non-negligible improvement in SSIM (0.907 with the proxy vs. 0.904 without), suggesting it helps sharpen spatial gradients.

Cross-direction scan. Using only a single raster scan increases RMSE to 0.524 °C (+2.3%). This degradation arises because unidirectional scanning disrupts column-wise spatial adjacency, affecting the representation of meridional temperature gradients.

5.4. Auxiliary Variable Rollout Strategy

To quantify the impact of the auxiliary-persistence protocol (Equation (3)), three strategies are compared (Table 6):

Strategy (b) is the default. Strategy (c) provides an oracle upper bound by feeding actual future auxiliary fields at each rollout step (not available in practice). The gap between (b) and (c) is 2.7%, quantifying the cost of the auxiliary persistence assumption discussed in Section 3.1. While modest in this domain, this gap may widen in more energetically active ocean regions where mesoscale features evolve rapidly. Strategy (a) removes auxiliaries entirely, degrading RMSE to 0.541 °C, confirming that physical context is valuable even under the persistence assumption. Future work on jointly predicting SST and auxiliary fields could close the (b)–(c) gap (Section 5.12).

5.5. Seasonal Robustness Analysis

SST prediction difficulty varies considerably across seasons due to differences in atmospheric forcing, mesoscale activity, and thermal stratification. Table 7 reports the seasonal RMSE breakdown.

All methods exhibit the highest errors in summer, when southwest monsoon activity, upwelling, and tropical storms drive intense mesoscale variability. The proposed model maintains its advantage across all seasons. Persistence also shows its worst performance in summer (0.718 °C), confirming that summer SST evolves most rapidly and is hardest to predict by any method.

5.6. Statistical Significance Testing

To verify that the observed improvements are not artifacts of random seed selection, Table 8 reports paired t-test results across five independent training runs.

All five metrics show statistically significant improvements at $p < 0.01$ (two-tailed paired t-test), confirming that the performance gap between the proposed model and ViT-ST is robust across random initialization seeds. The low standard deviations (≤0.008 for RMSE) further indicate that the proposed architecture converges consistently, with seed-to-seed variation substantially smaller than the inter-method performance gap (0.027 °C in RMSE). These results satisfy the standard criterion for claiming statistically significant improvement and indicate that the observed gains are unlikely to be artifacts of favorable random initialization.
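
For reference, the paired t statistic over $n$ runs is the mean of the per-run differences divided by its standard error, with $n - 1$ degrees of freedom. The sketch below uses placeholder scores that are not values from the paper, and it assumes the runs are paired by index. The function name is hypothetical. For five runs (4 degrees of freedom), the two-tailed critical value at the 0.01 level is about 4.604.

```python
import math


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


# Placeholder per-run scores for two methods (illustrative only, not from the paper).
baseline_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
proposed_runs = [0.5, 1.0, 2.0, 3.5, 4.0]
t_stat, dof = paired_t(baseline_runs, proposed_runs)
significant = abs(t_stat) > 4.604  # two-tailed, alpha = 0.01, dof = 4
```

With only five runs the test has limited power, so a significant result mainly reflects that the per-run differences are consistent in sign and similar in size.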

5.7. Computational Cost Analysis

Practical deployment requires acceptable computational overhead. Table 9 compares the parameter count, FLOPs, and inference time across all methods.

The proposed model requires 9.6M parameters and 56.8G FLOPs, between Swin-UNet and ViT-ST. Compared to ViT-ST, the proposed model uses 22.6% fewer parameters and 17.6% fewer FLOPs while achieving better accuracy. This efficiency advantage stems from the linear-time complexity of Mamba versus quadratic attention in ViT-ST. The end-to-end inference time of 34.8 ms per 10-day sequence is well within operational requirements for daily forecasting.

5.8. Hyperparameter Sensitivity Analysis

The sensitivity of the proposed model to its key hyperparameters is examined to assess robustness and guide practical deployment. Table 10 reports the 10-day average RMSE under systematic variation of four hyperparameters.

Window size $P$. Best at $P = 8$; smaller windows ($P = 4$) limit sequence length, while larger windows ($P = 32$) dilute local spatial coherence.

State dimension $d_{s}$. Performance improves from 8 to 16 and then plateaus; $d_{s} = 16$ is the best cost–accuracy trade-off.

Loss weights. Pure MSE ($λ_{\mathrm{SSIM}} = 0$) yields competitive RMSE but lower SSIM (0.899 vs. 0.907). The default 0.8/0.2 provides the best balance.

Grow-and-Cut rounds. $K_{g} = 0$ (no growth) gives 0.527 °C; performance saturates at $K_{g} = 2$, confirming two rounds suffice for the study domain.

5.9. Qualitative Analysis

Figure 5 presents representative forecasting results with ground truth, predictions, and absolute error maps across the 10-day horizon. Three aspects of the qualitative results are noteworthy. First, the model maintains sharp temperature gradients along the continental shelf break and the Luzon Strait (thermal front preservation), where baselines tend to produce increasingly blurred predictions at longer lead times. Second, near-coast regions exhibit the most challenging conditions; the Grow-and-Cut alignment and boundary-aware physical proxies contribute to reduced coastal gradient errors. Third, a warm-core anticyclonic eddy near 115° E, 15° N propagates westward over the forecast window, and the proposed model captures both the eddy’s thermal signature and displacement more accurately than baselines.

5.10. Cross-Domain Generalization: Kuroshio Extension

To assess whether the proposed framework generalizes beyond the South China Sea, an independent evaluation is conducted on the Kuroshio Extension region—a western boundary current system characterized by strong meandering fronts, intense mesoscale eddy activity, and rapid surface velocity evolution. The domain is defined as 140.025° E–146.425° E, 32.025° N–38.425° N (128 × 128 grids at 0.05°), capturing the quasi-stationary meander and its associated warm-core rings. All data sources (OSTIA SST, GLORYS auxiliaries), temporal splits (training 2000–2010, validation 2011–2012, test 2013), and preprocessing pipelines are identical to the SCS experiment. Crucially, no hyperparameter re-tuning is performed: the proposed model and all baselines use the same configurations selected on the SCS validation set (Section 4), providing a stringent test of out-of-domain robustness rather than a best-case per-domain optimization.

The Kuroshio Extension exhibits substantially higher SST variability than the South China Sea: the domain-averaged SST standard deviation is approximately 4.81 °C (vs. 2.18 °C for SCS), and the day-to-day variability $σ_{Δ1d}$ reaches 0.31 °C (vs. 0.19 °C for SCS), indicating that the ocean state evolves roughly 60% faster in this region. These characteristics make the Kuroshio Extension a rigorous testbed for the generalizability concerns raised in Section 5.12.

5.10.1. Overall Performance

Table 11 reports the 10-day average performance of four representative methods. Only a focused subset (persistence, ConvLSTM, ViT-ST, and the proposed model) is evaluated, as the goal is to verify whether the performance ranking observed in SCS transfers to a more dynamically complex domain rather than to reproduce the full nine-baseline comparison.

The results confirm that the proposed model maintains its advantage in the Kuroshio Extension. Relative to the strongest baseline ViT-ST, the proposed model reduces RMSE by 6.2% (0.651 vs. 0.694 °C)—a slightly larger margin than the 5.0% improvement observed in SCS. Relative to persistence, the RMSE reduction is 22.7% (vs. 21.0% in SCS), and relative to ConvLSTM it is 16.9% (vs. 15.1% in SCS). The consistent preservation—and modest widening—of performance gaps suggests that the Mamba branch’s long-range spatial modeling is particularly beneficial in dynamically complex regions where distant spatial dependencies (e.g., upstream meander propagation, warm-ring detachment) exert stronger influence on local SST evolution.

Consistent with the higher baseline variability, absolute errors are larger for all methods in the Kuroshio Extension than in the SCS. For instance, the proposed model’s RMSE increases from 0.512 °C (SCS) to 0.651 °C (Kuroshio), a 27% increase that remains well below the factor-of-2.2 increase in the domain SST standard deviation (4.81 vs. 2.18 °C). The $R^{2}$ and ACC values decrease modestly (0.896→0.858 and 0.894→0.872, respectively), indicating that the model captures a comparable fraction of predictable variance in both domains despite the greater dynamical complexity of the Kuroshio region.

5.10.2. Lead-Time Analysis

Table 12 presents the lead-time RMSE for Persistence, ViT-ST, and the proposed model. A finding that contrasts with the SCS results (Table 4) is that the proposed model outperforms persistence at Day 1 in the Kuroshio Extension (RMSE 0.302 °C vs. 0.310 °C). This reversal relative to SCS, where persistence dominates at Day 1 (0.189 vs. 0.226 °C), is attributable to the higher day-to-day SST variability in the Kuroshio region ($σ_{Δ1d} = 0.31$ °C vs. 0.19 °C): when the ocean state changes more rapidly over 24 h, the “no-change” persistence assumption becomes proportionally less accurate, and learned models begin to add value from the very first prediction step. This addresses the concern that the Day-1 persistence advantage observed in SCS might reflect a fundamental model deficiency; rather, it is a consequence of the physical characteristics of the specific domain, and the model’s Day-1 relative performance improves as environmental variability increases.

The advantage of the proposed model over persistence grows from 2.6% at Day 1 to 22.7% at Day 10, a trajectory similar to that observed in SCS but offset upward by the elimination of the Day-1 deficit. The RMSE growth rate of the proposed model is approximately 0.072 °C/day, compared to 0.075 °C/day for ViT-ST and 0.102 °C/day for persistence. All three rates are higher than their SCS counterparts (0.052, 0.055, and 0.094 °C/day, respectively), reflecting the faster intrinsic evolution of the Kuroshio system. Importantly, the proposed model’s error growth rate remains the lowest among all methods, and its advantage over ViT-ST is larger at Day 1 (8.8%) than at Day 10 (5.7%), suggesting that temporal state persistence in the Mamba branch provides particular benefit at short lead times when spatial structure changes are subtle but non-negligible.

5.10.3. Summary

The Kuroshio Extension experiment demonstrates that: (1) the performance ranking established in SCS (Ours > ViT-ST > ConvLSTM > Persistence) transfers to a more dynamically complex western boundary current regime without hyperparameter re-tuning; (2) the relative advantage of the proposed model is preserved or slightly enlarged (6.2% over ViT-ST vs. 5.0% in SCS), supporting the hypothesis that long-range spatial modeling becomes more valuable as mesoscale activity intensifies; and (3) the Day-1 persistence gap, identified as a limitation in SCS, is eliminated in the Kuroshio Extension, confirming that it is a domain-specific consequence of low daily SST variability rather than an inherent model weakness. This cross-domain validation directly addresses generalizability concerns and provides evidence that the proposed architecture can serve as a robust basis for SST forecasting across oceanographically diverse regions.

5.11. Discussion

The experimental results demonstrate that the hybrid architecture provides consistent improvements across all metrics and seasons, with the advantage growing at longer lead times (Table 4). The ablation study (Table 5) allows attribution of these gains to specific components.

The largest individual degradations come from removing either the Mamba branch (+7.0%) or the ConvLSTM branch (+8.6%), confirming that the two branches capture genuinely complementary information rather than redundant representations. Temporal state persistence contributes an additional 3.3%, demonstrating that the spatial scan benefits from accumulating temporal context—a capability absent in existing vision SSM models that process frames independently.

Among the physical preprocessing components, the advection proxy is the dominant contributor (removing it raises RMSE by 2.1%), consistent with the physical expectation that at 0.05° resolution, mesoscale current-driven transport is the primary mechanism of daily SST change. The Laplacian proxy adds a smaller but non-negligible improvement, primarily in spatial structure (SSIM), where it helps sharpen frontal gradients. Disabling the Grow-and-Cut growth step ($K_{g} = 0$) raises RMSE by 2.9% (0.527 vs. 0.512 °C), confirming that naïve interpolation across mismatched grids introduces measurable coastal artifacts.

Finally, the computational analysis (Table 9) shows that these accuracy gains are achieved at lower cost than the strongest attention-based baseline, with 22.6% fewer parameters and 17.6% fewer FLOPs than ViT-ST, owing to Mamba’s linear-complexity state-space updates.

Cross-domain generalization. The Kuroshio Extension experiment (Section 5.10) provides evidence that the performance advantages are not specific to the South China Sea. The proposed model maintains its ranking and margin over baselines (6.2% RMSE reduction over ViT-ST vs. 5.0% in SCS; Table 11) without any hyperparameter re-tuning, indicating that the hybrid architecture captures generalizable physical patterns rather than domain-specific statistical correlations. The reversal of the Day-1 persistence gap in the Kuroshio region, where the proposed model outperforms persistence from the first forecast step (Table 12), further confirms that the Day-1 behavior observed in SCS is a consequence of the region’s inherently low daily SST variability ($σ_{Δ1d} = 0.19$ °C) rather than a model deficiency.

Novelty in context. It is important to distinguish the contributions of this work from prior studies that also combine recurrent and state-space components. Existing vision SSM models (VMamba [11], Vim [21], LocalMamba [12]) process individual images independently, re-initializing hidden states for each input; they do not address temporal forecasting. Time-series SSM models (S-Mamba [22], TimeMachine [23]) operate on 1D channel sequences without spatial structure. The proposed framework bridges these two lines by maintaining Mamba hidden states across time steps within a spatially structured scanning framework, yielding a module that simultaneously captures long-range spatial dependencies and accumulates temporal context—as confirmed by the 3.3% ablation gap from removing temporal persistence (Table 5). Additionally, the fully algorithmic specification of the physical preprocessing pipeline (Section 3.3 and Section 3.4), including the autoregressive rollout protocol, provides a level of methodological transparency that is absent from comparable SST forecasting studies and enables independent reproduction of results.
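The difference between re-initializing and persisting the hidden state can be illustrated with a toy linear recurrence. The following is a minimal sketch under strong simplifying assumptions, not the paper's Mamba block: the state is a scalar, the coefficients `a` and `b` are fixed illustrative values rather than input-dependent (selective) parameters, and window scanning is omitted. All function names are hypothetical.

```python
from typing import List, Tuple


def scan_frame(
    tokens: List[float], state: float, a: float = 0.9, b: float = 0.1
) -> Tuple[List[float], float]:
    """Run s <- a*s + b*x over one frame's tokens; return outputs and final state."""
    outputs = []
    for x in tokens:
        state = a * state + b * x
        outputs.append(state)
    return outputs, state


def run_sequence(frames: List[List[float]], persist: bool) -> List[List[float]]:
    """Scan frames in time order, optionally carrying the state across frames."""
    state = 0.0
    all_outputs = []
    for tokens in frames:
        if not persist:
            state = 0.0  # image-style SSM: re-initialize for every frame
        outputs, state = scan_frame(tokens, state)
        all_outputs.append(outputs)
    return all_outputs


# Two time steps with two tokens each (toy values, not SST data).
frames = [[1.0, 1.0], [0.0, 0.0]]
reset = run_sequence(frames, persist=False)
carried = run_sequence(frames, persist=True)
```

With `persist=True`, the outputs for the second frame keep a decaying trace of the first frame, whereas with `persist=False` they are exactly zero because nothing from the earlier time step survives the reset.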

5.12. Limitations and Future Directions

Limited-domain, single-year evaluation. While the Kuroshio Extension experiment (Section 5.10) provides initial evidence of cross-domain generalizability, evaluation remains limited to two mid-latitude Northwest Pacific domains and one near-neutral ENSO test year. Generalizability to other ocean basins (Gulf Stream, Antarctic Circumpolar Current, polar seas), tropical regions with different dynamical regimes, and extreme ENSO years remains unverified. Multi-year cross-validation and broader cross-basin transfer experiments are essential next steps.

Auxiliary variable persistence assumption. The oracle experiment (Table 6) shows a 2.7% RMSE gap from freezing auxiliaries, confirming that this strong modeling assumption introduces systematic error. The assumption is most problematic in dynamically active regions where mesoscale features evolve on time scales comparable to the 10-day forecast horizon: western boundary currents near the Luzon Strait exhibit decorrelation time scales of 5–10 days for surface velocity, meaning that the frozen velocity field becomes increasingly unrepresentative as lead time grows. Notably, the fact that the gap is only 2.7% suggests that, over the South China Sea domain and forecast horizon studied here, the evolution of the temperature field can be partially decoupled from that of the auxiliary fields, likely because the dominant SST variability is driven by large-scale seasonal forcing rather than mesoscale eddy–SST feedbacks. However, this result should not be generalized to more energetic domains (e.g., Gulf Stream, Kuroshio Extension) where eddy–SST coupling is stronger. Three avenues could address this limitation: (1) jointly predicting SST alongside $(u, v, S, η)$ in a multivariate forecasting framework; (2) coupling the neural predictor with a fast reduced-order ocean model that provides dynamically consistent auxiliary updates; or (3) learning a lightweight auxiliary evolution model conditioned on the predicted SST.
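The persistence assumption discussed here can be made concrete with a sketch of the rollout loop: auxiliaries stay frozen at their last observed values while the physical proxies are recomputed from each predicted SST field. This is a simplified illustration, not the authors' implementation. It conditions on a single previous frame rather than the 10-frame input window, represents fields as flat lists, and takes the model and proxy computation as placeholder callables (`step_model`, `compute_proxies`, and the toy functions are hypothetical).

```python
from typing import Callable, List, Sequence

Field = List[float]  # stand-in for a 2-D grid, flattened for brevity


def rollout(
    step_model: Callable[[Field, Sequence[Field], Sequence[Field]], Field],
    compute_proxies: Callable[[Field, Sequence[Field]], Sequence[Field]],
    sst_last: Field,
    aux_last: Sequence[Field],
    n_steps: int,
) -> List[Field]:
    """Autoregressive forecast with auxiliaries frozen at their last observed values."""
    sst = sst_last
    forecasts = []
    for _ in range(n_steps):
        # Proxies are recomputed from the current (observed or predicted) SST,
        # while aux_last (u, v, S, eta) is never updated.
        proxies = compute_proxies(sst, aux_last)
        sst = step_model(sst, aux_last, proxies)
        forecasts.append(sst)
    return forecasts


# Toy stand-ins, used only to exercise the loop.
def toy_proxies(sst: Field, aux: Sequence[Field]) -> Sequence[Field]:
    return [[u * t for u, t in zip(aux[0], sst)]]


def toy_model(sst: Field, aux: Sequence[Field], proxies: Sequence[Field]) -> Field:
    return [t + 0.1 * p for t, p in zip(sst, proxies[0])]


forecasts = rollout(toy_model, toy_proxies, [1.0, 2.0], [[1.0, 0.0]], n_steps=2)
```

Replacing `aux_last` inside the loop with a predicted or externally supplied auxiliary field at each step would correspond to avenues (1) to (3) listed above.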

Deterministic prediction. The framework produces point estimates without uncertainty quantification. Ensemble methods [31], Monte Carlo dropout, or probabilistic prediction heads should be explored for risk-based decision-making.
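As one possible starting point, an ensemble of forecasts (for example, from models trained with different random seeds) yields a per-pixel mean and spread. The sketch below assumes each member is a flat list of SST values on the same grid and uses the sample standard deviation as the spread measure; both choices, and the function name, are illustrative assumptions rather than anything specified in the paper.

```python
import statistics
from typing import List, Tuple


def ensemble_summary(members: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-pixel ensemble mean and spread (sample std across members)."""
    per_pixel = list(zip(*members))  # regroup values pixel by pixel
    mean = [statistics.fmean(values) for values in per_pixel]
    spread = [statistics.stdev(values) for values in per_pixel]
    return mean, spread


# Two toy members over two pixels (not real forecasts).
mean, spread = ensemble_summary([[1.0, 2.0], [3.0, 2.0]])
```

Pixels where the members disagree receive a larger spread, which could serve as a simple uncertainty indicator alongside the point forecast.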

Day-1 performance. In the South China Sea experiments, persistence outperforms all learning-based methods in RMSE at Day 1, a result that reflects a fundamental property of SST dynamics rather than merely a model deficiency. At one-day lead time, the SST field is dominated by temporal autocorrelation ($σ_{Δ 1 d}$ = 0.19 °C, Table 1), meaning that the “no change” prediction is inherently strong. All learned models, by contrast, incur irreducible mapping error from the encoder–decoder transformation even when the optimal prediction is close to the identity. This suggests a mismatch between the training objective, which optimizes average skill over the full 10-day horizon, and the dynamics of the shortest lead times, where persistence dominates. Potential remedies include incorporating a persistence skip connection that lets the model learn a residual correction relative to the last observation, or adopting a lead-time-aware loss weighting scheme that reduces the penalty at Day 1 (where the identity mapping is already near-optimal) and concentrates learning capacity on longer horizons where the model can add the most value.
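The two remedies can be sketched as follows. This is a hedged illustration only: the paper does not specify a weighting schedule, so the linear ramp and the Day-1 weight of 0.5 are assumptions, and all function names are hypothetical.

```python
from typing import List


def residual_forecast(last_obs: List[float], correction: List[float]) -> List[float]:
    """Persistence skip connection: prediction = last observation + learned correction."""
    return [x + d for x, d in zip(last_obs, correction)]


def lead_time_weights(n_leads: int, day1_weight: float = 0.5) -> List[float]:
    """Weights that ramp linearly from day1_weight to 1.0, normalized to mean 1."""
    if n_leads == 1:
        return [1.0]
    raw = [
        day1_weight + (1.0 - day1_weight) * k / (n_leads - 1)
        for k in range(n_leads)
    ]
    mean = sum(raw) / n_leads
    return [w / mean for w in raw]


def weighted_mse(per_lead_mse: List[float], weights: List[float]) -> float:
    """Average of per-lead MSE values under the given lead-time weights."""
    return sum(w * e for w, e in zip(weights, per_lead_mse)) / len(per_lead_mse)


weights = lead_time_weights(10)
```

With a zero correction, `residual_forecast` reduces exactly to persistence, so the network only has to learn departures from the "no change" prediction. Normalizing the weights to a mean of 1 keeps the overall loss scale comparable to the unweighted case.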

Resolution scalability. Scaling to larger domains at 0.05° would increase window counts and computational cost. Hierarchical or multi-scale scanning strategies may be needed for basin-scale or global applications.
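A rough sense of how the window count grows with domain size can be obtained with simple arithmetic. The sketch below assumes non-overlapping P × P windows with padded edges, which this section does not state explicitly; the 128 × 128 grid and P = 8 follow Tables 2 and 9, while the larger grid is a hypothetical example.

```python
import math


def window_count(height: int, width: int, window: int = 8) -> int:
    """Number of window x window tiles needed to cover the grid (edges padded)."""
    return math.ceil(height / window) * math.ceil(width / window)


benchmark_windows = window_count(128, 128)   # grid size used for the FLOPs in Table 9
larger_windows = window_count(1024, 1024)    # hypothetical basin-scale grid
```

Under these assumptions the count grows with the domain area, so an 8-fold increase in each dimension gives 64 times as many windows per frame.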

6. Conclusions

This paper presented a hybrid Mamba–ConvLSTM framework for multi-day SST forecasting at 0.05° resolution. The ConvLSTM branch captures local spatiotemporal dynamics, while the Mamba branch—adapted from VMamba’s cross-scan concept with the addition of temporally persistent hidden states and learnable directional aggregation—provides long-range spatial modeling with implicit temporal memory. A fully specified preprocessing pipeline, including Grow-and-Cut cross-resolution alignment, boundary-aware finite-difference proxy construction, and an explicit auxiliary-variable rollout protocol, ensures methodological reproducibility.

Experiments on a South China Sea benchmark with nine baselines (including persistence and climatology) under a unified protocol demonstrate consistent improvements across five metrics. The proposed model achieves a 10-day average RMSE of 0.512 °C, outperforming the strongest learning-based baseline (ViT-ST) by 5.0% and persistence by 21.0%. A cross-domain evaluation on the Kuroshio Extension, a more dynamically complex western boundary current regime, confirms that the performance advantage transfers without hyperparameter re-tuning (6.2% over ViT-ST), and that the Day-1 persistence deficit observed in SCS is eliminated in the higher-variability Kuroshio environment. Ablation studies confirm the contribution of each component: removing either branch of the dual-branch design causes the largest degradations (7.0% and 8.6% in RMSE), followed by the physical proxies (3.7%) and temporal state persistence (3.3%).

Several limitations identified in Section 5.12 point to concrete future directions: relaxing the auxiliary persistence assumption through coupled multivariate prediction (the oracle experiment quantifies the potential gain at 2.7% RMSE), improving Day-1 skill via residual persistence connections, extending evaluation to additional ocean basins beyond the two Northwest Pacific domains tested here, validating under diverse ENSO phases, and incorporating uncertainty quantification for operational risk-based decision-making.

Author Contributions

G.W.: overall framework conceptualization, formulation of the scientific questions, and project support; Z.H.: data analysis, problem definition, project administration, and manuscript revision; B.P.: experiments, manuscript writing, figure/table preparation, and submission work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant No. 41976201).

Data Availability Statement

The data presented in this study are available in Copernicus Marine Data Store at https://data.marine.copernicus.eu/products (accessed on 1 March 2026). These data were derived from the following resources available in the public domain: Copernicus Marine Data Store (including OSTIA daily SST and related reanalysis products), https://data.marine.copernicus.eu/products (accessed on 1 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Donlon, C.J.; Martin, M.; Stark, J.; Roberts-Jones, J.; Fiedler, E.; Wimmer, W. The Operational Sea Surface Temperature and Sea Ice Analysis (OSTIA) system. Remote Sens. Environ. 2012, 116, 140–158.
2. Hou, S.; Li, W.; Liu, T.; Zhou, S.; Guan, J.; Qin, R.; Wang, Z. MIMO: A Unified Spatio-Temporal Model for Multi-Scale Sea Surface Temperature Prediction. Remote Sens. 2022, 14, 2371.
3. Ouala, S.; Nguyen, D.; Drumetz, L.; Chapron, B.; Pascual, A.; Collard, F.; Gaultier, L.; Fablet, R. Learning latent dynamics for partially observed chaotic systems. Chaos 2020, 30, 103121.
4. Nguyen, T.; Brandstetter, J.; Kapoor, A.; Gupta, J.K.; Grover, A. ClimaX: A foundation model for weather and climate. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023.
5. Pathak, J.; Subramanian, S.; Harrington, P.; Raja, S.; Chattopadhyay, A.; Mardani, M.; Kurth, T.; Hall, D.; Li, Z.; Azizzadenesheli, K.; et al. FourCastNet: A global data-driven high-resolution weather forecasting model using Adaptive Fourier Neural Operators. arXiv 2022, arXiv:2202.11214.
6. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28; NeurIPS: La Jolla, CA, USA, 2015; pp. 802–810.
7. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Advances in Neural Information Processing Systems 30; NeurIPS: La Jolla, CA, USA, 2017; pp. 879–888.
8. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 10012–10022.
9. Gao, Z.; Shi, X.; Wang, H.; Zhu, Y.; Wang, Y.B.; Li, M.; Yeung, D.Y. Earthformer: Exploring space-time transformers for Earth system forecasting. In Advances in Neural Information Processing Systems 35; NeurIPS: La Jolla, CA, USA, 2022.
10. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
11. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual state space model. In Advances in Neural Information Processing Systems 37; NeurIPS: La Jolla, CA, USA, 2024.
12. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338.
13. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Yu, P.S. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the 35th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; pp. 5123–5132.
14. Ballas, N.; Yao, L.; Pal, C.; Courville, A. Delving deeper into convolutional networks for learning video representations. arXiv 2015, arXiv:1511.06432.
15. Lin, Z.; Li, M.; Zheng, Z.; Cheng, Y.; Yuan, C. Self-attention ConvLSTM for spatiotemporal prediction. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for Computing Machinery: New York, NY, USA, 2020; pp. 11531–11538.
16. Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. SimVP: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 3170–3180.
17. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022.
18. Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; Tian, Q. Accurate medium-range global weather forecasting with 3D neural networks. Nature 2023, 619, 533–538.
19. Shi, B.; Hao, Y.; Feng, L.; Ge, C.; Peng, Y.; He, H. An attention-based context fusion network for spatiotemporal prediction of sea surface temperature. Remote Sens. 2022, 14, 459.
20. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations; ICLR: Appleton, WI, USA, 2022.
21. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2024.
22. Wang, Z.; Kong, F.; Feng, S.; Wang, M.; Yang, X.; Zhao, H.; Wang, D.; Zhang, Y. Is Mamba Effective for Time Series Forecasting? arXiv 2024, arXiv:2403.11144.
23. Ahamed, S.; Checkley, M. TimeMachine: A time series is worth 4 Mambas for long-term forecasting. arXiv 2024, arXiv:2403.09898.
24. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
25. Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; Anandkumar, A. Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations; ICLR: Appleton, WI, USA, 2021.
26. Beucler, T.; Pritchard, M.; Rasp, S.; Ott, J.; Baldi, P.; Gentine, P. Enforcing analytic constraints in neural networks emulating physical systems. Phys. Rev. Lett. 2021, 126, 098302.
27. Shi, B.; Feng, L.; He, H.; Hao, Y.; Peng, Y.; Liu, M.; Liu, Y.; Liu, J. A Physics-Guided Attention-Based Neural Network for Sea Surface Temperature Prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4210413.
28. Wang, J.; Zhang, Z.; Wang, X.; Li, Y.; Zhang, H. A Spatial-Temporal Graph Neural Network for Sea Surface Temperature Forecasting. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3452–3465.
29. Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28; NeurIPS: La Jolla, CA, USA, 2015; pp. 1171–1179.
30. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
31. Kendall, A.; Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30; NeurIPS: La Jolla, CA, USA, 2017; pp. 5574–5584.

Figure 1. Overall architecture of the proposed hybrid Mamba–ConvLSTM framework. Left: Raw inputs, namely SST from OSTIA at 0.05° and auxiliary variables $(u, v, S, η)$ from GLORYS at 0.083°, enter the Grow-and-Cut cross-resolution alignment module (green), which maps the lower-resolution auxiliaries onto the 0.05° ocean mask via iterative boundary-aware interpolation. Center-left: The boundary-aware physical proxy construction module (orange) computes an advection proxy $f_{adv}$ and a Laplacian proxy $f_{lap}$ using land–sea-aware finite differences; these are concatenated with SST and aligned auxiliaries to form the physics-augmented feature tensor $F_{phy}$. Center: A dual-branch encoder processes $F_{phy}$ through two complementary pathways: a ConvLSTM branch (blue) for local spatiotemporal patterns via $3 \times 3$ convolutional gating, and a Mamba branch (purple) that applies cross-direction window scanning with a selective state-space model for long-range spatial dependencies; critically, the Mamba hidden state $s_{τ}$ persists across successive time steps ($s_{τ} \to s_{τ + 1}$), providing implicit temporal memory. Right: A fusion and prediction module (yellow) combines the two branch outputs through learnable residual weighting ($F_{conv} + λ {\bar{F}}_{mamba}$) followed by a convolutional prediction head (Conv $3 \times 3$ → BN → ReLU → Conv $1 \times 1$) to produce the predicted SST field ${\hat{y}}_{τ}$. During the prediction phase, the output is fed back autoregressively (dashed arrows) with auxiliary variables held at their last observed values.

Figure 2. Study area in the South China Sea. Left: geographic bounding box with major bathymetric features. Right: an example 0.05° OSTIA SST field.

Figure 3. Performance metrics over the 10-day horizon: RMSE, MAE, and $R^{2}$. The proposed model (red) exhibits the slowest error growth among the evaluated learning-based methods.

Figure 4. RMSE at Day 1, Day 5, and Day 10. While Figure 3 shows continuous error evolution, this figure facilitates direct pairwise comparison of absolute RMSE values at three representative lead times, highlighting the widening gap between methods as lead time increases.

Figure 5. Qualitative SST forecasting results. Top: Ground truth; Middle: proposed model predictions; Bottom: absolute error maps (Day 1 through Day 10).

Table 1. Statistical summary of input variables (2000–2013). u: eastward surface current; v: northward surface current; S: sea surface salinity; $η$: sea surface height.

Variable	Mean	Std	Min	Max	$σ_{Δ 1 d}$
SST (°C)	27.42	2.18	19.85	32.61	0.19
u (m/s)	−0.03	0.15	−1.02	0.89	0.06
v (m/s)	0.01	0.12	−0.78	0.71	0.05
S (PSU)	33.82	0.45	31.20	35.10	0.03
$η$ (m)	0.52	0.08	0.25	0.85	0.01

$σ_{Δ 1 d}$: standard deviation of day-to-day differences, computed over all ocean pixels and all days. This quantity characterizes the daily variability that the model must predict beyond simple persistence.
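The definition above can be sketched in a few lines. This is an illustrative reading of the footnote, not the authors' code: fields are represented as flat lists with one value per pixel, land pixels are excluded with a boolean mask, and the population standard deviation is used (the paper does not say whether the population or sample form was applied). The function name and toy values are hypothetical.

```python
import statistics
from typing import List


def daily_difference_std(series: List[List[float]], ocean_mask: List[bool]) -> float:
    """Std of day-to-day differences pooled over ocean pixels and all day pairs.

    series[t][i] is the value on day t at pixel i.
    """
    diffs = [
        today[i] - yesterday[i]
        for yesterday, today in zip(series, series[1:])
        for i, is_ocean in enumerate(ocean_mask)
        if is_ocean
    ]
    return statistics.pstdev(diffs)


# Toy example: three days, one ocean pixel and one land pixel.
toy_series = [[27.0, 0.0], [27.2, 0.0], [27.0, 0.0]]
toy_sigma = daily_difference_std(toy_series, ocean_mask=[True, False])
```

A small value relative to the field's overall standard deviation, as for SST in Table 1, indicates that the last observation is already a strong one-day forecast.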

Table 2. Core hyperparameters of the proposed framework.

Hyperparameter	Symbol	Value
ConvLSTM layers	–	3
ConvLSTM hidden channels	$C_{h}$	64
ConvLSTM kernel size	–	$3 \times 3$
Mamba window size	P	8
Mamba state dimension	$d_{s}$	16
Mamba model dimension	$d_{m}$	128
Scan directions	–	4
Grow-and-Cut rounds	$K_{g}$	2
MSE loss weight	$λ_{M S E}$	0.8
SSIM loss weight	$λ_{S S I M}$	0.2
Fusion weight init	$λ_{0}$	0.1
Input/forecast length	L/T	10/10

Table 3. Overall comparison (10-day average). The proposed model (Ours) gives the best result on every metric. ↓: lower is better; ↑: higher is better.

Method	RMSE ↓	MAE ↓	SSIM ↑	R² ↑	ACC ↑
Persistence	0.648	0.487	0.838	0.818	0.812
Climatology	0.872	0.685	0.782	0.670	0.000
ConvLSTM	0.603	0.451	0.861	0.842	0.836
PredRNN	0.579	0.432	0.872	0.857	0.851
ConvGRU	0.621	0.468	0.848	0.831	0.824
TCTN	0.566	0.421	0.879	0.864	0.858
PANN	0.551	0.409	0.886	0.872	0.867
Swin-UNet	0.547	0.404	0.889	0.875	0.871
ViT-ST	0.539	0.398	0.893	0.881	0.878
Ours	0.512	0.374	0.907	0.896	0.894

Table 4. Lead-time comparison. Δ: relative RMSE improvement over persistence.

Day	Method	RMSE ↓	SSIM ↑	R² ↑	Δ (%)
1	Persist.	0.189	0.953	0.982	–
1	ViT-ST	0.238	0.947	0.972	−25.9
1	Ours	0.226	0.954	0.978	−19.6
3	Persist.	0.472	0.891	0.903	–
3	ViT-ST	0.389	0.921	0.934	+17.6
3	Ours	0.368	0.932	0.944	+22.0
5	Persist.	0.681	0.841	0.798	–
5	ViT-ST	0.523	0.899	0.886	+23.2
5	Ours	0.499	0.912	0.901	+26.7
7	Persist.	0.856	0.795	0.682	–
7	ViT-ST	0.635	0.869	0.842	+25.8
7	Ours	0.601	0.885	0.862	+29.8
10	Persist.	1.032	0.756	0.537	–
10	ViT-ST	0.736	0.841	0.804	+28.7
10	Ours	0.698	0.862	0.828	+32.4

Table 5. Ablation study (10-day average).

Variant	RMSE ↓	MAE ↓	SSIM ↑	R² ↑
w/o Mamba branch	0.548	0.402	0.886	0.875
w/o ConvLSTM branch	0.556	0.412	0.880	0.868
w/o temporal state persist.	0.529	0.387	0.897	0.885
w/o physical proxies	0.531	0.389	0.896	0.884
w/o advection only	0.523	0.382	0.901	0.889
w/o Laplacian only	0.516	0.377	0.904	0.893
w/o Grow-and-Cut	0.527	0.386	0.898	0.887
w/o cross-dir scan	0.524	0.383	0.901	0.889
w/o SSIM loss	0.519	0.380	0.899	0.892
Full model	0.512	0.374	0.907	0.896

Table 6. Impact of auxiliary variable handling during rollout (10-day avg).

Strategy	RMSE ↓	ΔRMSE
(a) No auxiliaries (SST only)	0.541	+5.7%
(b) Auxiliaries persist from Day 0	0.512	–
(c) Ground-truth auxiliaries (oracle)	0.498	−2.7%

Table 7. Seasonal RMSE (°C). Spring: Mar–May; Summer: Jun–Aug; Autumn: Sep–Nov; Winter: Dec 2012 + Jan–Feb 2013.

Method	Spring	Summer	Autumn	Winter
Persistence	0.601	0.718	0.652	0.626
ConvLSTM	0.571	0.639	0.596	0.605
PredRNN	0.548	0.618	0.572	0.580
PANN	0.525	0.589	0.548	0.542
Swin-UNet	0.520	0.583	0.541	0.535
ViT-ST	0.517	0.576	0.536	0.528
Ours	0.489	0.551	0.511	0.503

Winter includes December from the preceding validation year (2012) to form complete Dec–Feb windows; only the January and February 2013 portions are scored.

Table 8. Statistical analysis (5 seeds). p-values from paired t-tests (ours vs. ViT-ST).

Metric	Ours (Mean ± std)	ViT-ST (Mean ± std)	p-Value
RMSE	$0.512 \pm 0.006$	$0.539 \pm 0.008$	0.004
MAE	$0.374 \pm 0.005$	$0.398 \pm 0.007$	0.003
SSIM	$0.907 \pm 0.003$	$0.893 \pm 0.004$	0.005
$R^{2}$	$0.896 \pm 0.004$	$0.881 \pm 0.005$	0.006
ACC	$0.894 \pm 0.003$	$0.878 \pm 0.005$	0.004

Table 9. Computational cost. FLOPs: single 10-day forecast, $128 \times 128$. Inference: single A100 GPU, averaged over 100 runs after 10 warm-up runs. “End-to-end” includes preprocessing (Grow-and-Cut + physical proxies).

Method	Params (M)	FLOPs (G)	Model (ms)	End-to-End (ms)
Persistence	–	–	–	0.1
Climatology	–	–	–	0.1
ConvLSTM	2.1	18.4	12.3	12.3
PredRNN	3.8	32.6	19.7	19.7
ConvGRU	1.6	14.2	10.1	10.1
TCTN	4.2	25.1	8.5	8.5
PANN	5.1	38.5	22.4	25.8
Swin-UNet	8.7	52.3	28.6	28.6
ViT-ST	12.4	68.9	35.2	35.2
Ours	9.6	56.8	31.4	34.8

FLOPs include all learnable operations (convolutions, linear projections, SSM scans, normalization layers). Preprocessing overhead for the proposed model is 3.4 ms (Grow-and-Cut: 1.1 ms, physical proxies: 2.3 ms).

Table 10. Sensitivity analysis (10-day avg RMSE, °C). Default settings (Table 2): P = 8, $d_{s}$ = 16, $λ_{M S E}$/$λ_{S S I M}$ = 0.8/0.2, $K_{g}$ = 2.

Window P	RMSE	State dim $d_{s}$	RMSE
4	0.518	8	0.521
8	0.512	16	0.512
16	0.516	32	0.514
32	0.525	64	0.515

$λ_{M S E}$ / $λ_{S S I M}$	RMSE	Grow rounds $K_{g}$	RMSE
1.0/0.0	0.519	0	0.527
0.9/0.1	0.515	1	0.517
0.8/0.2	0.512	2	0.512
0.7/0.3	0.514	3	0.513
0.5/0.5	0.520	4	0.513

Table 11. Overall comparison on the Kuroshio Extension (10-day average). The proposed model (Ours) gives the best result on every metric. ↓: lower is better; ↑: higher is better.

Method	RMSE ↓	MAE ↓	SSIM ↑	R² ↑	ACC ↑
Persistence	0.842	0.632	0.793	0.756	0.768
ConvLSTM	0.783	0.585	0.818	0.789	0.802
ViT-ST	0.694	0.517	0.852	0.835	0.849
Ours	0.651	0.483	0.871	0.858	0.872

Table 12. Lead-time RMSE (°C) on the Kuroshio Extension. Δ: relative improvement of our model over Persistence.

Day	Persistence	ViT-ST	Ours	Δ (%)
1	0.310	0.331	0.302	+2.6
3	0.605	0.498	0.462	+23.6
5	0.832	0.678	0.637	+23.4
7	1.014	0.822	0.772	+23.9
10	1.232	1.010	0.952	+22.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
