HiT_DS: A Modular and Physics-Informed Hierarchical Transformer Framework for Spatial Downscaling of Sea Surface Temperature and Height

Wang, Min; Liu, Weixuan; Chu, Rong; Wang, Xidong; Zhu, Shouxian; Liao, Guanghong

doi:10.3390/rs18020292


by Min Wang ¹,*, Weixuan Liu ¹, Rong Chu ¹, Xidong Wang ², Shouxian Zhu ² and Guanghong Liao ²

¹ College of Computer Science and Software Engineering, Hohai University, Nanjing 210008, China
² College of Oceanography, Hohai University, Nanjing 210008, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(2), 292; https://doi.org/10.3390/rs18020292

Submission received: 29 November 2025 / Revised: 28 December 2025 / Accepted: 11 January 2026 / Published: 15 January 2026


Highlights

What are the main findings?

HiT_DS achieves high-resolution reconstruction of SST and SSH fields while preserving fine-scale structures and high-gradient ocean features.
Selective combination of E-DFE, GA, and physics-informed losses enhances reconstruction accuracy across regions with distinct ocean dynamics.

What are the implications of the main findings?

HiT_DS provides a flexible and modular framework for oceanographic data downscaling that can be tailored to different dynamical regimes.
The approach bridges the gap between generic super-resolution methods and physically consistent geophysical data reconstruction, supporting improved ocean monitoring and research.

Abstract

Recent advances in satellite observations have expanded the use of Sea Surface Temperature (SST) and Sea Surface Height (SSH) data in climate and oceanography, yet their low spatial resolution limits fine-scale analyses. We propose HiT_DS, a modular hierarchical Transformer framework for high-resolution downscaling of SST and SSH fields. To address challenges in multiscale feature representation and physical consistency, HiT_DS integrates three key modules: (1) Enhanced Dual Feature Extraction (E-DFE), which employs depth-wise separable convolutions to improve local feature modeling efficiently; (2) Gradient-Aware Attention (GA), which emphasizes dynamically important high-gradient structures such as oceanic fronts; and (3) Physics-Informed Loss Functions, which promote physical realism and dynamical consistency in the reconstructed fields. Experiments across two dynamically distinct oceanic regions demonstrate that HiT_DS achieves improved reconstruction accuracy and enhanced physical fidelity, with selective module combinations tailored to regional dynamical conditions. This framework provides an effective and extensible approach for oceanographic data downscaling.

Keywords: spatial downscaling; hierarchical transformer; physics-informed learning; sea surface temperature; sea surface height

1. Introduction

Ocean temperature and ocean height are two key parameters in the study of oceanic hydrological conditions. Ocean temperature changes not only affect the development of marine fisheries [1], but also serve as critical indicators for studying various oceanic phenomena, such as Tropical Instability Waves [2]. Similarly, ocean height, which is closely coupled with temperature via thermodynamic processes, is commonly used to detect and track mesoscale eddies [3]. It also has practical significance for ship navigation, fisheries resource forecasting, and marine engineering [4]. Moreover, the study of ocean height helps to reveal patterns of ocean temperature and salinity changes and the underlying dynamics [5].

Recent advancements in satellite observation technology have significantly expanded the use of Sea Surface Temperature (SST) and Sea Surface Height (SSH) data in climate and oceanographic studies. However, although these data are relatively accurate, their coarse spatial resolution often limits applicability in fine-scale analyses. Consequently, improving the spatial resolution of oceanographic data has become a critical research focus, driving the development of downscaling techniques. Spatial downscaling involves generating high-resolution data from low-resolution inputs, effectively transforming large-scale datasets into finer-scale representations, and has found broad applications in meteorology, climatology, and remote sensing.

In spatial downscaling research, three main approaches are widely employed: statistical, dynamic, and deep learning-based methods.

Statistical downscaling establishes empirical relationships between large-scale climate variables and local-scale parameters to generate high-resolution data. Unlike physics-based methods, it does not require solving complex equations, thereby reducing computational costs. Zhao et al. [6] developed a weather-type-based statistical model using ERA5 data with a wave clustering technique to simulate significant wave height in China’s marginal seas. While this method accurately reconstructs seasonal variations in high-latitude regions, it relies on linear assumptions, limiting its ability to capture nonlinear ocean dynamics, and exhibits strong regional dependence.

In contrast, dynamic downscaling directly solves physical equations through high-resolution Regional Ocean Models nested within global models. For instance, Al Azad et al. [7] applied the ADCIRC + SWAN model to ERA5 data, improving storm surge simulations in the western North Atlantic. By optimizing boundary conditions, they reduced the Root Mean Square Error (RMSE) of wave height from 0.41 m to 0.38 m (7%) and wave period from 2.1 s to 1.5 s (30%). Nevertheless, these methods are computationally intensive and highly dependent on parameterization schemes, limiting their large-scale application.

Recent years have witnessed growing interest in applying deep learning techniques to downscaling research. These methods offer significant advantages over traditional approaches: they capture complex nonlinear relationships often overlooked by statistical methods, while requiring lower computational resources than dynamic downscaling methods. Multiple innovative applications demonstrate this potential: Gao et al. [8] developed a deep learning framework to estimate coastal directional wave spectra from open-ocean data, combining UNet with attention mechanisms to achieve effective spectrum downscaling without imposing predefined shape constraints. Similarly, Zhang et al. [9] proposed a Land Surface Temperature Downscaling Residual Network that outperforms conventional methods in processing Landsat 8 data, maintaining robust performance across seasonal variations and diverse terrain types. For oceanographic applications, Thiria et al. [10] introduced the RESAC system, integrating low-resolution SSH with high-resolution SST using convolutional neural networks. This fusion enables accurate reconstruction of surface flow fields and improves detection of mesoscale oceanic features.

Despite recent advances in deep learning-based downscaling methods, several critical challenges remain in the accurate reconstruction of SSH and SST fields:

Preservation of Scientific Data Fidelity: Oceanographic variables such as SSH and SST are high-precision, single-channel measurements. Direct application of image-based super-resolution architectures may introduce errors or fail to capture intrinsic physical structures. Therefore, a key challenge is designing models that respect the inherent accuracy and spatial patterns of scientific data.
Limited Capacity of Existing Models for Long-Range Spatial Dependencies: Currently, most methods for downscaling SSH and SST rely on Convolutional Neural Networks (CNNs). However, due to their fixed and local receptive fields, CNNs have inherent limitations in capturing long-range dependencies. This constraint hinders the ability of such models to represent large-scale oceanic structures and low-frequency variability. While Transformer-based models have shown remarkable success in both computer vision and Earth system modeling, their application to geophysical downscaling remains underexplored.
Lack of Dynamic Feature Awareness in Spatially Heterogeneous Regions: Oceanic fields exhibit strong spatial heterogeneity, with dynamic features such as eddies, fronts, and filaments regulating energy and mass transport. Existing deep learning methods typically apply uniform enhancement across spatial domains, potentially diluting attention in highly dynamic regions. An open challenge is guiding models to focus on spatially complex and physically important structures during reconstruction.

To bridge the gap between general-purpose super-resolution frameworks and scientific data downscaling, we adapt the Hierarchical Transformer (HiT) architecture [11] with modular enhancements: (1) an Enhanced Dual Feature Extraction (E-DFE) module for improved local feature modeling; (2) a gradient-aware attention (GA) mechanism to emphasize high-gradient regions; and (3) physics-informed loss functions to promote structural consistency and physical realism. These modules can be selectively combined depending on regional ocean dynamics, providing a flexible framework capable of effectively reconstructing SST and SSH fields across areas with distinct spatial and dynamical characteristics.

2. Materials and Methods

2.1. Study Areas

Two representative regions in the Western Pacific were selected to evaluate the model’s performance across diverse ocean dynamic regimes and latitudinal zones. These include a tropical region and a mid-latitude region, defined as Study Area 1 and Study Area 2, respectively.

Study Area 1: South China Sea and Adjacent Waters (107°E–123°E, 6°N–22°N)

This study region encompasses the South China Sea (SCS; see the left panel of Figure 1), the Luzon Strait, Eastern Coastal Vietnam, and the western boundary of the Western Pacific Warm Pool. It is significantly influenced by the East Asian monsoon system and intrusions of the Kuroshio Current, leading to strong seasonal SST variability. Complex bathymetry, circulation patterns, and mesoscale processes contribute to dynamic ocean variability, making it an ideal site for evaluating model performance under tropical ocean conditions.

Study Area 2: Kuroshio Extension Region (134°E–150°E, 28°N–44°N)

This mid-latitude region includes the Kuroshio Extension and surrounding waters (see the right panel of Figure 1), characterized by strong path variability of the Kuroshio Current and intense mesoscale activity. The interaction between the Kuroshio and neighboring currents generates sharp meridional gradients in both SST and SSH. Variations in thermocline depth and frequent storm activity further complicate the ocean dynamics, posing higher demands on the model’s generalization capacity.

2.2. Datasets

The datasets used in this study constitute the primary materials for model training and evaluation, including SST and SSH products derived from state-of-the-art reanalysis and satellite-based sources.

2.2.1. SST Data

The SST data used in this study are obtained from the Operational Sea Surface Temperature and Sea Ice Analysis—Reprocessed (OSTIA-REP, version 02.0) dataset [12], developed by the UK Met Office. OSTIA-REP provides globally reprocessed high-resolution SST and sea ice concentration products.

Considering the computational cost of model training, the original high-resolution SST fields are downsampled to 1/16° (~7 km) to serve as the high-resolution reference, while the corresponding low-resolution SST inputs are generated by further downsampling to 1/4° (~25 km). This configuration balances the preservation of fine-scale spatial features with computational efficiency during training.

OSTIA-REP v02.0 integrates multi-source satellite observations (e.g., AVHRR, AMSR-E, and VIIRS) and in situ measurements (e.g., Argo floats and ship-based observations) using an Optimal Interpolation (OI) scheme to ensure temporal consistency and spatial coherence. The reported RMSE is typically below 0.5 °C, indicating that the dataset is well suited for fine-scale SST analysis.

2.2.2. SSH Data

The SSH data used in this study are obtained from the GLORYS12V1 global ocean reanalysis product [13], produced by Mercator Ocean International. The high-resolution SSH fields have a spatial resolution of 1/12° (~9 km), while the low-resolution SSH inputs are generated by downsampling the high-resolution fields to 1/3° (~36 km). This downsampling procedure ensures spatial alignment and facilitates efficient training for capturing mesoscale ocean dynamics.

GLORYS12V1 is generated using the NEMO (Nucleus for European Modelling of the Ocean) circulation model and assimilates satellite altimetry data (e.g., AVISO) as well as in situ measurements (e.g., Argo floats and shipboard observations) through a reduced-order Kalman filter, complemented by a 3D-Var scheme that corrects slowly evolving large-scale temperature and salinity biases. This product effectively captures mesoscale ocean phenomena, including eddies, fronts, and current systems.

The synthetic low-resolution SSH fields are perfectly aligned with their high-resolution counterparts, ensuring consistency for model training and evaluation.
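The construction of aligned low-resolution and high-resolution pairs can be illustrated with a minimal, dependency-free sketch. The text does not state which operator performs the downsampling, so block averaging is an assumption here, and the function name `block_mean_downsample` and the toy 8 × 8 tile are hypothetical.

```python
def block_mean_downsample(field, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks.

    Block averaging is an assumed operator; the source only says "downsampling".
    """
    rows, cols = len(field), len(field[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by the factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [field[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Hypothetical 8 x 8 high-resolution tile; a factor of 4 mirrors 1/12° -> 1/3°.
hr_tile = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
lr_tile = block_mean_downsample(hr_tile, 4)
```

Because each coarse cell covers exactly one 4 × 4 block of fine cells, the two grids stay aligned, which is the property the paired training data rely on.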

2.3. Hierarchical Transformer

The Transformer architecture, originally introduced by Vaswani et al. [14], revolutionized neural network design by relying solely on attention mechanisms rather than recurrent or convolutional operations. Its strong capability for global dependency modeling and highly parallelized computation has driven major advances in natural language processing and computer vision. However, two fundamental limitations remain: (1) the quadratic computational cost of self-attention and (2) the limited representation of fine-grained local structures, which are particularly problematic for high-resolution data reconstruction.

In the field of image restoration, Transformer-based frameworks such as SwinIR [15] and its derivatives [16,17,18] have achieved remarkable performance. These models employ mechanisms including Shifted Window Self-Attention (SW-SA) [16], N-gram modules [17], and Permutation Self-Attention (PSA) [18] to balance efficiency and representation. Nevertheless, the quadratic complexity with respect to window size limits scalability [19], while recent studies indicate that hierarchical and multi-scale feature representations are essential for super-resolution performance [20]. To address this conflict, multi-scale attention mechanisms such as the Grouped Multi-Scale Self-Attention (GMSA) in ELAN have been proposed [21]; however, their quadratic complexity and channel partitioning may cause information redundancy and loss, thereby constraining modeling efficiency. Moreover, most existing models target natural images and lack adaptability to multiscale physical field reconstruction in scientific data. To address these limitations, the HiT [11] architecture employs a hierarchical design to capture multi-scale dependencies efficiently.

2.3.1. Block-Level Design: Hierarchical Windows

At the block level, a hierarchical window mechanism efficiently captures and integrates multi-scale feature information across Transformer Layers (TLs). Let the base window size be $h_B \times w_B$; the window size $h_i \times w_i$ for the i-th TL is defined as:

$$h_i = \alpha_i h_B, \quad w_i = \alpha_i w_B$$

(1)

where $\alpha_i > 0$ is a scaling factor controlling the window growth across layers. Shallow layers employ smaller windows to capture fine local structures, while deeper layers progressively enlarge the window size to model long-range dependencies [11,22]. Compared with fixed-size or shifted window approaches [18], this hierarchical configuration enables efficient aggregation of multi-scale contextual information at lower computational cost, following principles similar to the Dual Aggregation Transformer [23], which has demonstrated the effectiveness of multi-branch attention for multi-scale feature aggregation.
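As a small illustration of Equation (1), the sketch below computes the per-layer window sizes from a base window and a list of scaling factors. The base window of 8 × 8 and the factors 0.5, 1, 2, and 4 are hypothetical values chosen for the example, not settings reported in the paper.

```python
def hierarchical_window_sizes(base_window, alphas):
    """Return (h_i, w_i) = (alpha_i * h_B, alpha_i * w_B) for each Transformer layer."""
    h_b, w_b = base_window
    sizes = []
    for alpha in alphas:
        if alpha <= 0:
            raise ValueError("scaling factors must be positive")
        h_i, w_i = alpha * h_b, alpha * w_b
        if h_i != int(h_i) or w_i != int(w_i):
            raise ValueError("alpha must yield integer window sizes")
        sizes.append((int(h_i), int(w_i)))
    return sizes


# Hypothetical base window and scaling factors, chosen only for illustration.
window_sizes = hierarchical_window_sizes((8, 8), [0.5, 1, 2, 4])
```

With these values the windows grow from 4 × 4 in the shallowest layer to 32 × 32 in the deepest, matching the small-to-large progression described above.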

2.3.2. Layer-Level Design: Spatial-Channel Correlation

At the layer level, HiT replaces conventional window-based self-attention (W-SA) with a Spatial–Channel Correlation (SCC) mechanism (Figure 2) to enhance feature representation while maintaining computational efficiency [11]. The SCC module consists of three parts: Dual Feature Extraction (DFE), Spatial Self-Correlation (S-SC), and Channel Self-Correlation (C-SC).

The DFE extracts spatial and channel features via linear and convolutional layers, and the interaction between them is formulated as:

$$\mathrm{DFE}(X) = X_{ch} \odot X_{sp}, \quad X_{ch} = \mathrm{Linear}(X), \quad X_{sp} = \mathrm{Conv}(X)$$

(2)

where ⊙ denotes element-wise multiplication. SCC adopts a shared key–value design derived from DFE outputs:

$$[Q, V] = \mathrm{DFE}(X)$$

(3)

The S-SC component further enhances spatial dependency modeling through adaptive transformations across Transformer layers, which can be expressed as:

$$V_{\downarrow,i}^{T} = \mathrm{S\text{-}Linear}_{i}(V_{i}^{T})$$

(4)

where $V_{\downarrow,i} \in \mathbb{R}^{h_{\downarrow} w_{\downarrow} \times \frac{C}{2}}$ represents the projected value. The window configuration varies with the scaling factor $\alpha_i$:

$$[h_{\downarrow}, w_{\downarrow}] = \begin{cases} [h_B, w_B], & \text{if } \alpha_i > 1 \\ [h_i, w_i], & \text{if } \alpha_i \leq 1 \end{cases}$$

(5)

Accordingly, large windows capture high-level semantics, while small windows preserve fine-grained details. The correlation operation is then defined as:

$$\mathrm{S\text{-}SC}(Q_i, V_{\downarrow,i}) = \left( \frac{Q_i V_{\downarrow,i}^{T}}{D} + B \right) V_{\downarrow,i}$$

(6)

where B represents relative position encoding [24], and D is a normalization constant. Replacing attention maps with correlation maps removes the softmax operation [25] and achieves linear complexity with window size.

Similarly, the C-SC module models dependencies along the channel dimension:

$$\mathrm{C\text{-}SC}(Q_i, V_i) = \left( \frac{Q_i^{T} V_i}{D_i} + B \right) V_i^{T}$$

(7)

where $D_i = h_i w_i$. Compared with the transpose attention mechanism [26], SCC achieves more efficient multi-scale aggregation and improves representation quality in reconstruction tasks.
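Equations (6) and (7) can be sketched with plain nested lists to make the matrix shapes explicit. This is an illustrative sketch, not the reference implementation: the row-major layout (one row per window position, one column per channel), the value chosen for the normalization constant D, and the zero bias matrices are all assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Swap rows and columns of a matrix stored as lists of rows."""
    return [list(col) for col in zip(*a)]


def correlation(left, right, scale, bias):
    """Correlation map left @ right / scale + bias, with no softmax applied."""
    product = matmul(left, right)
    return [[p / scale + b for p, b in zip(p_row, b_row)]
            for p_row, b_row in zip(product, bias)]


def spatial_self_correlation(q, v_down, scale, bias):
    """Equation (6): (Q V_down^T / D + B) V_down."""
    return matmul(correlation(q, transpose(v_down), scale, bias), v_down)


def channel_self_correlation(q, v, bias):
    """Equation (7): (Q^T V / D_i + B) V^T, with D_i = number of window positions."""
    d_i = len(q)
    return matmul(correlation(transpose(q), v, d_i, bias), transpose(v))


# Toy inputs: 3 window positions, 2 channels, 2 projected positions (all hypothetical).
q = [[1, 0], [0, 1], [1, 1]]
v_down = [[1, 2], [3, 4]]
s_out = spatial_self_correlation(q, v_down, 2, [[0, 0], [0, 0], [0, 0]])

v = [[3, 0], [0, 3], [3, 3]]
c_out = channel_self_correlation(q, v, [[0, 0], [0, 0]])
```

Under these shape assumptions the spatial correlation map has one column per projected position, so when the projected window is held at the base size its cost grows in proportion to the number of window positions rather than to its square.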

2.4. HiT_DS: A Scientific Data-Optimized Architecture

The HiT_SR model, originally developed on the SwinIR framework [15], is selected as the backbone because of its efficient hierarchical Transformer design. This design enables the extraction of multi-scale spatial dependencies across different resolution levels, which is critical for accurately reconstructing fine-scale structures in high-resolution geophysical fields such as SST and SSH. Within this framework, the conventional Transformer Block (TB) is replaced by the Hierarchical Transformer Block (HiTB) architecture (Figure 2) [11], significantly enhancing multi-scale feature representation and establishing a robust baseline for super-resolution tasks. Moreover, this hierarchical design provides an interpretable and scalable foundation that facilitates the integration of domain-specific modular enhancements in subsequent stages.

While the original HiT_SR demonstrates strong scalability and efficiency in image restoration, its direct application to geophysical data downscaling faces non-trivial challenges arising from substantial domain discrepancies between natural images and geophysical variables, which include large dynamic ranges, strong anisotropy, complex spatial–temporal correlations, and physically constrained dependencies. Such domain gaps require the model not only to capture spatial correlations but also to respect underlying physical laws, which are often nonlinear and context-dependent.

To address these challenges, we propose HiT_DS, a hierarchical Transformer framework specifically tailored for geophysical data downscaling. HiT_DS inherits the computationally efficient and hierarchically organized architecture of HiT and incorporates a set of carefully designed, domain-oriented modular enhancements, each intended to improve local feature representation, dynamically emphasize high-gradient regions, and maintain physical consistency. These modules can be selectively integrated depending on regional ocean dynamics, providing a flexible framework capable of effectively reconstructing SST and SSH fields across areas with distinct spatial and dynamical characteristics.

2.4.1. Overall Architecture

As shown in Figure 2, HiT_DS follows an encoder–decoder paradigm while preserving the hierarchical feature extraction mechanism of HiT. Given a low-resolution input field $X_{LR} \in \mathbb{R}^{H \times W \times 1}$, the model first performs shallow feature extraction using a convolutional layer:

$$F_0 = \mathrm{Conv}(X_{LR})$$

(8)

This initial step projects the input into a higher-dimensional latent space, facilitating hierarchical propagation of spatial features. The resulting shallow features are processed through a series of stacked HiT blocks:

$$F_i = \mathrm{HiTBlock}_i(F_{i-1}), \quad i = 1, 2, \dots, N$$

(9)

Each HiT block contains the SCC mechanism, which integrates two submodules: Enhanced Dual Feature Extraction (E-DFE) and Gradient-aware Attention (GA). This design allows hierarchical learning of cross-dimensional dependencies, combining local structure extraction and dynamic gradient awareness in a unified representation space.

Finally, a lightweight upsampling head reconstructs the high-resolution output:

$$Y_{HR} = \mathrm{Upsample}(F_N)$$

(10)

where $Y_{HR} \in \mathbb{R}^{sH \times sW \times 1}$ denotes the predicted high-resolution field and s is the scale factor.


2.4.2. Enhanced Dual Feature Extraction

To improve local feature modeling while maintaining computational efficiency, we redesign the original DFE module by replacing standard 3 × 3 convolution with Depth-wise Separable Convolution (DSC) [27]. DSC decomposes the convolution into two steps:

1. Depth-wise Convolution:

$$F_{dw}(x, y, c) = \sum_{(i,j) \in \Omega} K_{dw}(i, j, c)\, F_{in}(x + i, y + j, c)$$

(11)

which performs spatial filtering within each channel independently, capturing local texture details.

2. Pointwise Convolution:

$$F_{pw}(x, y, c') = \sum_{c} K_{pw}(1, 1, c', c)\, F_{dw}(x, y, c)$$

(12)

which fuses inter-channel information. The module further incorporates a residual connection [28]:

$$F_{out} = F_{in} + \mathrm{DSC}(F_{in})$$

(13)
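Equations (11) to (13) can be traced with a minimal, dependency-free sketch that stores a feature map as nested lists indexed `[channel][row][column]`. Zero padding at the borders, the omission of bias terms, and all function names are assumptions made for illustration; the helper `parameter_counts` compares weight counts for a standard convolution and its depth-wise separable counterpart.

```python
def depthwise_conv(feature, kernels):
    """Equation (11): filter each channel with its own 3 x 3 kernel (zero padding assumed)."""
    height, width = len(feature[0]), len(feature[0][0])
    out = []
    for channel, kernel in zip(feature, kernels):
        plane = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                total = 0.0
                for i in (-1, 0, 1):
                    for j in (-1, 0, 1):
                        yy, xx = y + i, x + j
                        if 0 <= yy < height and 0 <= xx < width:
                            total += kernel[i + 1][j + 1] * channel[yy][xx]
                plane[y][x] = total
        out.append(plane)
    return out


def pointwise_conv(feature, weights):
    """Equation (12): mix channels at every pixel with a 1 x 1 convolution."""
    height, width = len(feature[0]), len(feature[0][0])
    return [[[sum(w * feature[c][y][x] for c, w in enumerate(row))
              for x in range(width)] for y in range(height)]
            for row in weights]


def enhanced_dfe_conv(feature, dw_kernels, pw_weights):
    """Equation (13): residual connection around the depth-wise separable convolution."""
    dsc = pointwise_conv(depthwise_conv(feature, dw_kernels), pw_weights)
    return [[[f + d for f, d in zip(f_row, d_row)]
             for f_row, d_row in zip(f_plane, d_plane)]
            for f_plane, d_plane in zip(feature, dsc)]


def parameter_counts(channels_in, channels_out, kernel=3):
    """Weights (biases ignored) of a standard vs. depth-wise separable convolution."""
    standard = kernel * kernel * channels_in * channels_out
    separable = kernel * kernel * channels_in + channels_in * channels_out
    return standard, separable


# Identity kernels make DSC(F) = F, so the residual output is simply 2 * F.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
feature = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = enhanced_dfe_conv(feature, [identity, identity], [[1.0, 0.0], [0.0, 1.0]])
counts = parameter_counts(64, 64)
```

For a hypothetical layer with 64 input and 64 output channels, the separable form needs 4672 weights against 36864 for a standard 3 × 3 convolution, which illustrates the efficiency argument made for E-DFE.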

2.4.3. Gradient-Aware Attention (GA)

In geophysical data such as SST and SSH, regions of strong spatial gradient correspond to physically meaningful features, including oceanic fronts and mesoscale eddies. To enhance sensitivity to these dynamically active areas, we introduce a GA mechanism.

The gradient magnitude is estimated using the Sobel operator:

$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * X, \quad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * X, \quad G = \sqrt{G_x^2 + G_y^2}$$

(14)

The gradient map G is normalized via a sigmoid function after a lightweight convolutional mapping:

$$A_g = \sigma(\mathrm{Conv}(G))$$

(15)

and the resulting attention map $A_g$ is applied element-wise to the outputs of both the spatial and channel correlation submodules to emphasize dynamically active regions:

$$F'_S = A_g \odot \mathrm{S\text{-}SC}, \quad F'_C = A_g \odot \mathrm{C\text{-}SC}$$

(16)

The attended features are subsequently fused as:

$$F'_{SCC} = \phi(F'_S, F'_C)$$

(17)

where $\phi(\cdot)$ denotes a fusion function (e.g., summation or concatenation followed by convolution).

This adaptive attention enables the model to dynamically emphasize high-gradient regions, leading to sharper, physically consistent reconstructions of temperature fronts and eddy boundaries.
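The gradient-aware gating of Equations (14) to (16) can be sketched as follows. Several choices here are assumptions rather than details from the paper: borders are handled by edge replication, the "lightweight convolutional mapping" is reduced to a hypothetical scalar `weight` and `bias` (a 1 × 1 mapping), and the kernels are applied as cross-correlation, which changes only the sign of each component and leaves the magnitude G unchanged.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(field, kernel):
    """Apply a 3 x 3 kernel by cross-correlation with edge replication."""
    height, width = len(field), len(field[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    yy = min(max(y + i, 0), height - 1)
                    xx = min(max(x + j, 0), width - 1)
                    total += kernel[i + 1][j + 1] * field[yy][xx]
            out[y][x] = total
    return out


def gradient_magnitude(field):
    """Equation (14): G = sqrt(G_x^2 + G_y^2) from the two Sobel responses."""
    g_x = filter3x3(field, SOBEL_X)
    g_y = filter3x3(field, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(g_x, g_y)]


def gradient_attention(field, weight=1.0, bias=0.0):
    """Equation (15), with Conv reduced to an assumed scalar mapping weight * G + bias."""
    return [[1.0 / (1.0 + math.exp(-(weight * g + bias))) for g in row]
            for row in gradient_magnitude(field)]


def gate(attention, features):
    """Equation (16): element-wise product of the attention map and a feature map."""
    return [[a * f for a, f in zip(a_row, f_row)]
            for a_row, f_row in zip(attention, features)]


# Toy field with a sharp front between the second and third columns.
front = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
magnitude = gradient_magnitude(front)
attention = gradient_attention(front)
gated = gate(attention, front)
```

In this toy field the attention weight is 0.5 in the flat columns and close to 1 on either side of the front, so features near the front pass through the gate almost unchanged while those in flat regions are attenuated.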

2.4.4. Redesign of the Loss Functions

To enhance the physical fidelity of downscaled ocean variables, we redesign the loss functions by incorporating physically motivated constraints [29]. Specifically, we introduce two loss formulations that reflect key geophysical properties of SST and SSH fields.

(a) L1 Loss with Laplacian-Based Physical Constraint for SST

To better preserve the physical realism of SST fields, we incorporate a Laplacian-based constraint:

$$\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$$

(18)

In our implementation, we approximate this operator on the discrete grid using a 5-point stencil formulation:

$$\Delta Y_{i,j} = Y_{i-1,j} + Y_{i+1,j} + Y_{i,j-1} + Y_{i,j+1} - 4 Y_{i,j}$$

(19)

This approximation enables efficient estimation of local curvature and spatial variability in SST data. In geophysical terms, large Laplacian magnitudes often correspond to dynamically active regions, such as thermal fronts, upwelling zones, and mesoscale eddies.

To supervise both the absolute SST values and their spatial structure, we employ a combined loss formulation. The standard L1 loss is defined as:

$$L_{L1}(\mathrm{pred}, \mathrm{gt}) = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{pred}_i - \mathrm{gt}_i \right|$$

(20)

Here, pred and gt denote the predicted and reference SST values, respectively.

To enforce similarity in curvature, we define a Laplacian-based loss between the predicted and ground-truth SST fields:

$$L_{lap}(\mathrm{pred}, \mathrm{gt}) = L_{L1}(\Delta \mathrm{pred}, \Delta \mathrm{gt})$$

(21)

The final total loss integrates both components with a balancing coefficient λ (here set to 0.5):

$$L_{SST} = \lambda L_{L1}(\mathrm{pred}, \mathrm{gt}) + (1 - \lambda) L_{lap}(\mathrm{pred}, \mathrm{gt})$$

(22)

This composite loss encourages the model to jointly minimize absolute value errors and curvature discrepancies, leading to smoother yet physically realistic SST reconstructions.
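Equations (19) to (22) can be combined into a short sketch for a single 2-D field. Evaluating the Laplacian on interior grid points only is an assumption (the text does not describe boundary handling), and the function names are illustrative.

```python
def l1_loss(pred, gt):
    """Equation (20): mean absolute difference over all grid points."""
    diffs = [abs(p - g) for p_row, g_row in zip(pred, gt)
             for p, g in zip(p_row, g_row)]
    return sum(diffs) / len(diffs)


def laplacian(field):
    """Equation (19): 5-point stencil, evaluated on interior points only (assumed)."""
    height, width = len(field), len(field[0])
    return [[field[i - 1][j] + field[i + 1][j] + field[i][j - 1]
             + field[i][j + 1] - 4 * field[i][j]
             for j in range(1, width - 1)]
            for i in range(1, height - 1)]


def sst_loss(pred, gt, lam=0.5):
    """Equation (22): lam * L1 + (1 - lam) * Laplacian L1, with lam = 0.5 as in the text."""
    return (lam * l1_loss(pred, gt)
            + (1 - lam) * l1_loss(laplacian(pred), laplacian(gt)))


gt = [[0.0] * 3 for _ in range(3)]
offset = [[1.0] * 3 for _ in range(3)]                      # uniform bias, no curvature error
spike = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # single-pixel error

loss_offset = sst_loss(offset, gt)
loss_spike = sst_loss(spike, gt)
```

A uniform offset is penalized only through the L1 term, whereas an isolated spike, which has a smaller mean absolute error, is penalized far more heavily through the curvature term. This is the behavior the composite loss is designed to produce.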

(b) L1 Loss with Geostrophic Constraint for SSH

To improve the dynamical consistency of the reconstructed SSH fields, we propose a hybrid loss function combining the standard L1 loss with a physically informed geostrophic constraint [30], derived from the physical principle of geostrophic balance.

In large-scale ocean circulation, the horizontal pressure gradient is balanced by the Coriolis force, which is expressed as:

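$$f\, \hat{\mathbf{k}} \times \mathbf{v}_g = -g \nabla h$$

(23)

where f is the Coriolis parameter (a function of latitude), $\hat{\mathbf{k}}$ is the vertical unit vector, $\mathbf{v}_g$ is the horizontal geostrophic velocity, g is gravitational acceleration, and ∇h denotes the horizontal gradient of SSH.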

To encode this constraint into the learning objective, we design a geostrophic loss that penalizes discrepancies between the horizontal gradients of the predicted and ground truth SSH fields. Let pred and gt denote the predicted and ground truth SSH tensors of shape (N,1,H,W). The horizontal gradients are approximated using forward first-order finite differences:

$$\partial_x \mathrm{pred} = \mathrm{pred}_{i,j+1} - \mathrm{pred}_{i,j}, \quad \partial_y \mathrm{pred} = \mathrm{pred}_{i+1,j} - \mathrm{pred}_{i,j}$$

(24)

The geostrophic loss is then computed as the L1 norm of the f-weighted gradient residuals:

$$L_{geo}(\mathrm{pred}, \mathrm{gt}) = L_{L1}(\nabla \mathrm{pred}, \nabla \mathrm{gt})$$

(25)

Finally, the total SSH loss is formulated as a weighted combination of L1 and geostrophic loss terms:

$$L_{SSH} = \lambda L_{L1}(\mathrm{pred}, \mathrm{gt}) + 10 (1 - \lambda) L_{geo}(\mathrm{pred}, \mathrm{gt})$$

(26)

We set the balancing coefficient λ to 0.5, and amplify the geostrophic term by a factor of 10 to highlight its physical relevance.

This composite loss encourages the model not only to minimize pointwise SSH errors, but also to adhere to the fundamental geophysical dynamics of ocean circulation, ensuring that the reconstructed SSH gradients remain dynamically balanced and consistent with large-scale oceanic processes.
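Equations (24) to (26) can be sketched for a single 2-D SSH field (the batch and channel dimensions are dropped for brevity). The sketch follows Equation (25) as written, so no Coriolis weighting is applied. Summing the L1 terms of the two gradient components is an assumption about how the gradient vector is reduced to a scalar, and the function names are illustrative.

```python
def l1_loss(pred, gt):
    """Mean absolute difference over all entries of two equally shaped grids."""
    diffs = [abs(p - g) for p_row, g_row in zip(pred, gt)
             for p, g in zip(p_row, g_row)]
    return sum(diffs) / len(diffs)


def forward_gradients(field):
    """Equation (24): forward differences along columns (x) and rows (y)."""
    d_x = [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in field]
    d_y = [[field[i + 1][j] - field[i][j] for j in range(len(field[0]))]
           for i in range(len(field) - 1)]
    return d_x, d_y


def geostrophic_loss(pred, gt):
    """Equation (25), taken as the sum of the x and y gradient L1 terms (assumed)."""
    pred_dx, pred_dy = forward_gradients(pred)
    gt_dx, gt_dy = forward_gradients(gt)
    return l1_loss(pred_dx, gt_dx) + l1_loss(pred_dy, gt_dy)


def ssh_loss(pred, gt, lam=0.5, geo_weight=10.0):
    """Equation (26): lam * L1 + geo_weight * (1 - lam) * L_geo."""
    return (lam * l1_loss(pred, gt)
            + geo_weight * (1 - lam) * geostrophic_loss(pred, gt))


gt = [[0.0, 0.0], [0.0, 0.0]]
tilted = [[0.0, 1.0], [0.0, 1.0]]   # spurious zonal slope
shifted = [[1.0, 1.0], [1.0, 1.0]]  # uniform offset, gradients unchanged

loss_tilted = ssh_loss(tilted, gt)
loss_shifted = ssh_loss(shifted, gt)
```

The uniform offset leaves the gradient term at zero, while the spurious slope, which would imply a spurious geostrophic current, dominates the total loss once the factor of 10 is applied.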

3. Results

This section presents a series of controlled ablation experiments to evaluate the effectiveness of the proposed improvements in both the baseline HiT_SR model and its scientific data–optimized extension, HiT_DS. These experiments quantitatively assess the contribution of individual modules. They also evaluate the integrated effectiveness of HiT_DS, a domain-optimized hierarchical Transformer framework specifically designed to bridge the gap between generic image super-resolution and physically consistent geophysical data reconstruction. Since the experiments are conducted across different regions and datasets, directly aggregating all modifications risks obscuring the distinct contribution of each component. Therefore, we first evaluate each component independently, including feature enhancement structures, attention mechanisms, and physics-based loss functions, and then selectively combine only those that yield measurable benefits.

In this study, we adapt and extend the HiT_SR architecture for the spatial downscaling of two essential oceanographic variables: SST and SSH. For SST, daily data from 2004 to 2019 are used for training, 2020–2022 for validation, and 2023 for testing. For SSH, the training set covers 2000–2018, validation spans 2018–2019, and testing is conducted on data from 2020. Two geographically distinct study areas are analyzed to rigorously assess spatial generalization and regional sensitivity of the model variants.

Four performance metrics are reported: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Relative Error (RE), and Temporal Correlation Coefficient (TCC). Together, they assess spatial accuracy, magnitude fidelity, and temporal coherence.
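The four metrics can be sketched as below. RMSE and MAE follow their standard definitions. This section does not define RE or TCC, so the forms used here are assumptions: RE as the summed absolute error divided by the summed absolute reference, and TCC as the Pearson correlation over time at each grid point, averaged across grid points.

```python
import math


def rmse(pred, gt):
    """Root mean squared error over paired samples."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def mae(pred, gt):
    """Mean absolute error over paired samples."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)


def relative_error(pred, gt):
    """Assumed definition: sum |pred - gt| / sum |gt|."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / sum(abs(g) for g in gt)


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def temporal_correlation(pred_series, gt_series):
    """Assumed definition: per-grid-point Pearson correlation over time, then averaged.

    Both inputs are indexed [time][grid_point].
    """
    pred_by_point = list(zip(*pred_series))
    gt_by_point = list(zip(*gt_series))
    scores = [pearson(p, g) for p, g in zip(pred_by_point, gt_by_point)]
    return sum(scores) / len(scores)


# Toy values, not taken from the paper.
pred, gt = [1.0, 2.0, 4.0], [1.0, 2.0, 2.0]
gt_series = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
pred_series = [[2.0, 11.0], [4.0, 21.0], [6.0, 31.0]]
tcc = temporal_correlation(pred_series, gt_series)
```

In the toy series the predictions differ from the reference in amplitude and offset but vary in step with it, so the assumed TCC is 1 even though RMSE and MAE are nonzero. This shows why TCC is reported alongside the error metrics.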

All experiments were conducted on an NVIDIA A100 GPU. Models were trained using the Adam optimizer with an initial learning rate of 5 × 10⁻⁴, a batch size of 16, and approximately 80,000 iterations, beyond which no significant improvements were observed. Each full training run required about 8 h. With this experimental setup in place, we next detail the ablation strategy and model variants.

3.1. Ablation Strategy and Model Variants

To systematically evaluate the effectiveness of each proposed adaptation for enhancing the HiT_SR model in SST and SSH downscaling, we design a modular ablation study. This approach first assesses the contribution of individual improvements, namely E-DFE, GA, and physics-informed loss functions. Subsequently, we integrate the most effective modules into a composite model. This procedure disentangles individual effects from potential synergistic interactions and clarifies how each contributes to reconstruction performance, yielding HiT_DS as a coherent, scientifically optimized framework.

1. Baseline Model

The original HiT_SR model, built upon SwinIR with hierarchical attention and SCC, is used as a baseline to highlight the limitations of direct application on geophysical fields.

2. E-DFE

Built upon the baseline HiT_SR, this variant integrates Depth-wise Separable Convolution (DSC) and residual connections to improve local feature modeling efficiency. It is denoted as HiT_SR + E-DFE.

3. GA

This variant incorporates the gradient-aware attention module to emphasize geophysically meaningful gradient structures. It is denoted as HiT_SR + GA.

4. Physics-informed Loss Functions

This variant evaluates models trained with physically guided loss terms: Laplacian-based loss for SST and geostrophic loss for SSH. It is denoted as HiT_SR + Loss.

5. Selective Combination of Effective Modules

Modules that demonstrate measurable improvements over the baseline are selectively integrated to construct composite variants. Through this controlled ablation design, we establish a clear understanding of how each component contributes to the overall performance and physical consistency of the HiT_DS framework. The final integrated configuration, incorporating E-DFE, GA, and physics-informed losses, constitutes the complete HiT_DS architecture evaluated in the following sections. Variants that do not yield statistically meaningful improvements are excluded to maintain architectural parsimony and avoid redundancy.

This progressive integration strategy not only isolates the individual contributions of each module but also facilitates a practical assessment of their synergistic effects in the context of complex SST and SSH downscaling tasks.

3.2. Analysis of SST Downscaling Results

This section presents the quantitative and qualitative performance of HiT_SR and its ablated variants on the SST downscaling task, evaluated across two geographically distinct regions (Study Area 1 and Study Area 2). The objective is to assess both the absolute improvement offered by each proposed module and their generalization capability across diverse oceanographic conditions.

The evaluation results for each region are summarized in Table 1 and Table 2, using standard metrics including RMSE, MAE, RE, and TCC. These metrics collectively assess spatial fidelity, magnitude accuracy, and the temporal consistency of the downscaled SST fields.
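As a minimal sketch of how these four metrics could be computed, the NumPy code below uses common textbook forms. The exact definitions of RE and TCC are assumptions here: RE is taken as the mean absolute error normalized by the reference magnitude, and TCC as the Pearson correlation along the time axis at each grid point, averaged over the grid. The synthetic arrays are illustrative only.

```python
import numpy as np


def rmse(pred, ref):
    """Root-mean-square error over all samples and grid points."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


def mae(pred, ref):
    """Mean absolute error over all samples and grid points."""
    return float(np.mean(np.abs(pred - ref)))


def relative_error(pred, ref, eps=1e-12):
    """Assumed form: absolute error divided by the reference magnitude."""
    return float(np.mean(np.abs(pred - ref) / (np.abs(ref) + eps)))


def tcc(pred, ref):
    """Assumed form: Pearson correlation along time (axis 0) at each
    grid point, averaged over the grid."""
    p = pred - pred.mean(axis=0)
    r = ref - ref.mean(axis=0)
    num = (p * r).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return float(np.mean(num / den))


# Synthetic example: a (time, lat, lon) reference and a biased prediction.
rng = np.random.default_rng(0)
ref = 20.0 + rng.normal(size=(30, 8, 8))
pred = ref + 0.05  # constant warm bias of 0.05 degrees C
scores = {
    "RMSE": rmse(pred, ref),
    "MAE": mae(pred, ref),
    "RE": relative_error(pred, ref),
    "TCC": tcc(pred, ref),
}
print(scores)
```

In this example a constant bias raises RMSE and MAE to 0.05 °C while leaving TCC at 1, which shows why the error metrics and the correlation metric need to be read together.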

3.2.1. SST Downscaling Performance in Study Area 1

As shown in Table 1, all enhanced variants achieve better performance than the HiT_SR baseline across all evaluation metrics in Study Area 1, which represents a region with relatively smooth SST patterns and moderate spatial variability. Both the E-DFE and GA modules lead to clear reductions in RMSE and MAE, indicating that improvements in local feature extraction and gradient-sensitive attention remain effective even in low-variability environments. In addition, these consistent gains suggest that the architectural refinements introduced in HiT_DS retain robustness across different spatial regimes. Comparison with the diffusion-based DIFFDS method [31] further highlights the strength of the proposed variants. DIFFDS yields a substantially higher RMSE of 0.1074 °C and a lower TCC of 0.9610, revealing limitations in its ability to control reconstruction errors and maintain temporal coherence.

Further analysis reveals that the E-DFE and GA modules provide complementary benefits. E-DFE enhances local spatial modeling and yields a notable reduction in RMSE, whereas GA improves sensitivity to geophysically meaningful gradients and contributes to lower MAE and RE values. When integrated within a unified framework, their effects become mutually reinforcing, and the combined E-DFE and GA model delivers the best overall performance. The E-DFE + GA variant achieves the lowest MAE of 0.02647 °C and the lowest RE of 0.001545, demonstrating the synergistic effect of improved local feature extraction and gradient-aware attention.

Figure 3 illustrates that the enhanced variants generally suppress large reconstruction errors more effectively than the baseline HiT_SR, as reflected in the visibly reduced upper whiskers for the E-DFE, GA, and E-DFE + GA models. The physics-informed loss, however, does not yield a comparable improvement. This is likely attributable to the limited informativeness of the Laplacian constraint in weak-gradient fields: second-order derivatives there carry little additional structure, so the Laplacian term contributes only marginally to the learning objective and produces negligible performance gains.

Figure 4 illustrates the monthly mean RMSE and MAE. Each point represents the monthly average across all samples. The E-DFE + GA variant achieves the lowest error values in most months, demonstrating stable performance across different temporal conditions. The limited impact of the Laplacian-based loss in this region can be attributed to the smooth spatial structure of the SST field, where the Laplacian operator produces near-zero responses. As a result, this second-order physical constraint offers insufficient corrective information to meaningfully steer network optimization.
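The near-zero Laplacian response in smooth fields can be illustrated numerically. The sketch below assumes a Laplacian-based loss of the form "mean absolute mismatch between five-point discrete Laplacians"; this form, the function names, and the synthetic tanh-shaped temperature fields are assumptions for illustration and not the exact loss used in HiT_DS.

```python
import numpy as np


def laplacian(field):
    """Five-point discrete Laplacian on interior points (unit grid spacing)."""
    return (field[:-2, 1:-1] + field[2:, 1:-1]
            + field[1:-1, :-2] + field[1:-1, 2:]
            - 4.0 * field[1:-1, 1:-1])


def laplacian_loss(pred, ref):
    """Assumed loss: mean absolute difference between the two Laplacians."""
    return float(np.mean(np.abs(laplacian(pred) - laplacian(ref))))


def tanh_front(width, n=64):
    """Synthetic SST-like field with a front of the given width along x."""
    x = np.linspace(-1.0, 1.0, n)
    return np.tile(20.0 + 0.5 * np.tanh(x / width), (n, 1))


# In both cases the "prediction" is an over-smoothed copy of the reference
# (front twice as wide), mimicking a reconstruction that blurs gradients.
weak_loss = laplacian_loss(tanh_front(4.0), tanh_front(2.0))     # smooth regime
sharp_loss = laplacian_loss(tanh_front(0.10), tanh_front(0.05))  # frontal regime

print(f"weak-gradient loss: {weak_loss:.2e}, sharp-front loss: {sharp_loss:.2e}")
```

With the same relative amount of over-smoothing, the loss in the weak-gradient case is orders of magnitude smaller than in the frontal case, so it supplies correspondingly little gradient signal to the optimizer.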

Overall, these findings show that the architectural components introduced in HiT_DS, particularly the E-DFE and GA modules, substantially enhance reconstruction accuracy and suppress large reconstruction errors in regions with moderate SST variability. These results establish the practicality of HiT_DS as an effective downscaling framework even in environments where dynamical signals are relatively weak.

3.2.2. SST Downscaling Performance in Study Area 2

Study Area 2 represents a more challenging environment, characterized by stronger SST gradients and more complex ocean dynamics. As shown in Table 2, the baseline HiT_SR model exhibits relatively high RMSE and MAE values (0.17185 °C and 0.09215 °C, respectively).

The E-DFE module does not yield a noticeable improvement over the baseline in terms of MAE. This is consistent with its design, and suggests that its benefits are realized predominantly in smoother regimes where local spatial coherence is dominant. In contrast, both the GA module and the physics-informed loss contribute consistent gains. The combined variant “HiT_SR + GA + Loss” achieves the lowest RMSE (0.16441 °C) and RE (0.008607) while maintaining a high TCC of 0.999999913, indicating a synergistic effect of emphasizing gradient-sensitive regions and incorporating physically motivated constraints in high-variability areas.

Figure 5 presents box plots of RMSE and MAE across months. The GA and physics-informed loss modules generally reduce large reconstruction errors, as reflected in shorter upper whiskers for the corresponding variants. The E-DFE module shows limited impact, consistent with the reasoning above.

Figure 6 shows the monthly mean RMSE and MAE trends. The “HiT_SR + GA + Loss” variant achieves the lowest errors in most months, particularly during periods of high SST variability (e.g., January–March and July–September). This observation confirms that the integration of GA and physics-informed constraints enhances both accuracy and temporal stability under challenging dynamic conditions.

Overall, these results indicate that the contributions of individual HiT_DS modules are region-dependent. In regions with high SST variability, gradient-aware attention and physically informed losses provide substantial benefits, whereas local feature extraction via E-DFE is more effective in smoother areas.

3.3. Analysis of SSH Downscaling Results

In this section, we evaluate the performance of HiT_SR and its enhanced variants on the SSH downscaling task across the same two study areas. Given that SSH is more dynamically driven and sensitive to mesoscale processes such as eddies, currents, and bathymetric features, this task poses distinct challenges compared to SST.

We adopt the same set of evaluation metrics, RMSE, MAE, RE, and TCC, as summarized in Table 3 and Table 4. These metrics jointly assess the models’ ability to restore spatial structure, suppress errors, and maintain temporal dynamics. Notably, improvements in SSH downscaling offer insights into the models’ adaptability under complex and highly variable oceanic conditions.

In calculating RE, to avoid numerical instability arising from division by near-zero SSH amplitudes, which can artificially inflate RE values, we apply amplitude thresholds tailored to each study area. Specifically, for Study Area 1, RE is only computed over regions where the SSH amplitude exceeds 0.1 m. For Study Area 2, where ocean dynamics are more intense, a higher threshold of 0.5 m is applied to focus on meaningful variations. This approach ensures the reliability and representativeness of the RE metric.
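The thresholding step can be sketched as a masked relative error. The code below assumes that "SSH amplitude" means the absolute value of the reference SSH at each grid point and that RE is the mean of the point-wise ratio; both are assumptions for illustration, and the sample values are synthetic.

```python
import numpy as np


def masked_relative_error(pred, ref, threshold):
    """Relative error restricted to points where |ref| exceeds the threshold.

    Returns NaN when no point passes the threshold.
    """
    mask = np.abs(ref) > threshold
    if not mask.any():
        return float("nan")
    return float(np.mean(np.abs(pred[mask] - ref[mask]) / np.abs(ref[mask])))


# Synthetic SSH values in metres; the first one is close to zero.
ref = np.array([0.001, 0.2, -0.4, 0.8])
pred = ref + 0.01  # uniform 1 cm error

re_unmasked = float(np.mean(np.abs(pred - ref) / np.abs(ref)))
re_area1 = masked_relative_error(pred, ref, threshold=0.1)  # Study Area 1 threshold
re_area2 = masked_relative_error(pred, ref, threshold=0.5)  # Study Area 2 threshold

print(re_unmasked, re_area1, re_area2)
```

In this toy case the single near-zero value dominates the unmasked RE (about 2.5), whereas the thresholded values stay at the few-percent level that reflects the actual 1 cm error.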

3.3.1. SSH Downscaling Performance in Study Area 1

Table 3 presents the quantitative performance metrics of HiT_SR and its enhanced variants for SSH downscaling in Study Area 1. The baseline HiT_SR exhibits an MAE of 0.00110 m and RMSE of 0.00207 m, reflecting comparatively larger reconstruction errors and reduced temporal coherence. The E-DFE module induces minor changes, with a slight increase in MAE to 0.00113 m, suggesting that its sensitivity to fine-scale features does not translate into overall performance gains in this low-gradient region.

In contrast, the physics-informed loss substantially reduces errors, achieving the lowest MAE of 0.00104 m and RMSE of 0.00202 m, demonstrating its effectiveness in improving both spatial accuracy and temporal consistency. The GA mechanism provides moderate improvements, lowering MAE to 0.00108 m and enhancing the model’s capacity to capture local spatial structures. The combined configuration with both GA and physics-informed loss delivers balanced performance across all metrics, attaining the highest TCC of 0.999999812, indicating effective preservation of spatial detail and temporal coherence.

Figure 7 illustrates box plots of the performance metrics for each model. The enhanced variants differ in the magnitude of their improvement over the baseline HiT_SR: the physics-informed loss contributes the most pronounced reduction in reconstruction errors, GA provides moderate gains, and E-DFE shows minimal effect, reflecting the limited benefit of local feature extraction in this relatively smooth SSH field.

Given the disproportionately large errors of HiT_SR, Figure 8 presents monthly mean error trends with the baseline excluded, highlighting relative improvements among enhanced variants. The physics-informed loss consistently achieves the lowest errors across most months, confirming its general effectiveness in stabilizing SSH reconstruction throughout the seasonal cycle. Collectively, Table 3, Figure 7, and Figure 8 underscore the dominant role of the physics-informed loss in enhancing accuracy and robustness for SSH downscaling in regions with weak spatial gradients, while the contribution of gradient-aware attention is limited under such conditions.
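To make the geostrophic constraint concrete, the sketch below derives geostrophic velocities from SSH using the standard balance u = -(g/f) ∂η/∂y, v = (g/f) ∂η/∂x with f = 2Ω sin(latitude), and penalizes their mismatch. The loss form (mean absolute velocity difference), the function names, and the uniform grid spacing are assumptions for illustration; the expression is not valid close to the equator, where f approaches zero.

```python
import numpy as np

G = 9.81           # gravitational acceleration, m s^-2
OMEGA = 7.2921e-5  # Earth's rotation rate, rad s^-1


def geostrophic_velocity(ssh, lat_deg, dx, dy):
    """Geostrophic (u, v) from SSH.

    ssh: (ny, nx) array in metres; lat_deg: (ny,) latitudes in degrees;
    dx, dy: grid spacing in metres (assumed uniform in this sketch).
    """
    f = 2.0 * OMEGA * np.sin(np.deg2rad(lat_deg))[:, None]  # Coriolis parameter
    deta_dy, deta_dx = np.gradient(ssh, dy, dx)             # axis 0 = y, axis 1 = x
    u = -(G / f) * deta_dy
    v = (G / f) * deta_dx
    return u, v


def geostrophic_loss(pred, ref, lat_deg, dx, dy):
    """Assumed loss: mean absolute mismatch of geostrophic velocities."""
    u_p, v_p = geostrophic_velocity(pred, lat_deg, dx, dy)
    u_r, v_r = geostrophic_velocity(ref, lat_deg, dx, dy)
    return float(np.mean(np.abs(u_p - u_r) + np.abs(v_p - v_r)))


# Synthetic example: SSH tilted eastward by 1e-6 m per metre on a 10 km grid.
ny, nx, dx, dy = 50, 60, 10_000.0, 10_000.0
lat = np.linspace(20.0, 30.0, ny)
x = np.arange(nx) * dx
ref_ssh = np.tile(1e-6 * x, (ny, 1))

u_ref, v_ref = geostrophic_velocity(ref_ssh, lat, dx, dy)
offset_loss = geostrophic_loss(ref_ssh + 0.05, ref_ssh, lat, dx, dy)
print(v_ref[-1, 0], offset_loss)
```

A loss of this kind depends only on SSH gradients, so a uniform height offset is not penalized; it therefore complements, rather than replaces, a point-wise reconstruction loss.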

3.3.2. SSH Downscaling Performance in Study Area 2

Table 4 summarizes the performance metrics of HiT_SR and its enhanced variants for SSH downscaling in Study Area 2. Overall, the enhanced models show clear improvements over the baseline HiT_SR, with notable reductions in RMSE and MAE, reflecting superior reconstruction accuracy and effective suppression of large errors. The baseline model records an RMSE of 0.00443 m and an MAE of 0.00282 m.

The introduction of the E-DFE module results in negligible improvements and even a slight increase in MAE, indicating limited adaptability to the complex dynamic processes in this region. In contrast, both the physics-informed loss and GA modules consistently reduce errors. The combined GA plus physics-informed loss variant achieves the best performance, attaining an RMSE of 0.00414 m and an MAE of 0.00260 m, while maintaining a very high temporal correlation coefficient, demonstrating the model’s stability and accuracy under highly dynamic ocean conditions.

Figure 9 presents box plots of SSH downscaling errors. The baseline exhibits the highest overall errors. The E-DFE variant does not reduce errors and slightly increases both RMSE and MAE. Models with GA or physics-informed loss achieve reductions in error magnitude, with the combined GA plus physics-informed loss model yielding the lowest errors, highlighting the synergistic benefits of gradient-aware attention and physics-based regularization. Figure 10 depicts monthly mean errors, confirming that the GA plus physics-informed loss variant consistently attains the lowest errors in most months, whereas E-DFE shows limited or negligible effectiveness.

In summary, the results for Study Area 2 confirm that gradient-aware attention and physics-informed loss are critical for enhancing SSH downscaling under highly dynamic conditions. While E-DFE shows limited adaptability and may degrade performance, GA and physics-informed loss consistently provide robust improvements, ensuring accurate and stable reconstruction across temporal scales.

4. Discussion

4.1. SST Downscaling: Module Effectiveness and Regime Dependence

Figure 11 and Figure 12 show the full study regions on days with high daily-mean gradients. Figure 13 and Figure 14 present selected subregions from these areas that exhibit particularly high gradients. Consistent with the full-domain results, the downscaled SST fields produced by different models remain visually comparable, with no pronounced differences in the overall temperature patterns. The LR fields in these subregions exhibit overly smooth temperature transitions and systematically underestimated gradient magnitudes. In contrast, all learning-based downscaling methods successfully recover sharper spatial transitions and more realistic gradient structures that are consistent with the high-resolution reference.

It is worth noting that the proposed method is developed as an incremental enhancement to a strong baseline downscaling framework. As a result, absolute differences in SST values and spatial patterns among the learning-based variants are inherently small, making visual discrimination challenging. Under such circumstances, qualitative comparisons mainly serve to illustrate the general capability of data-driven downscaling to enhance spatial detail relative to the LR input, rather than to definitively rank model performance.

From a qualitative perspective, spatial detail enhancement is primarily manifested in the suppression of excessive smoothing in the LR fields and the improved representation of local SST gradients and spatial transitions. Although visual differences among the downscaled results remain subtle due to their shared architectural foundation, all learning-based reconstructions exhibit clearer thermal boundaries and more coherent gradient structures than the original LR data. These results indicate that the proposed approach enhances spatial detail by refining the organization and intensity of high-gradient features, rather than by introducing large changes in absolute SST values.

Therefore, to objectively evaluate the relative performance of different downscaling variants and to distinguish their advantages beyond visual inspection, a comprehensive statistical analysis is conducted in the remainder of this section, focusing on error-based metrics as well as gradient- and curvature-related indicators across different dynamical regimes.

Figure 15 further provides the dynamical context for this analysis by presenting the monthly mean SST gradient fields for the two study areas. Study Area 1 is characterized by relatively weak gradients and smooth temporal variability, representing a low-variability regime, whereas Study Area 2 exhibits substantially stronger and more variable gradients, indicative of a high-variability regime dominated by complex mesoscale dynamics. These contrasting conditions enable a systematic assessment of module adaptability and performance under different oceanographic scenarios.

In Study Area 1, which has slowly varying SST patterns, the E-DFE module consistently reduces reconstruction errors across months, reflecting its effectiveness in reinforcing locally coherent spatial structures. In contrast, in the highly dynamic Study Area 2, E-DFE yields negligible improvement, suggesting that the local convolutional inductive bias is insufficient to capture the dominant nonlocal, high-gradient dynamics in this region.

The GA module demonstrates robust positive effects in both study areas, consistently lowering reconstruction errors, with the largest gains observed during months with pronounced SST gradients. This confirms the module’s ability to emphasize geophysically meaningful edges and fronts.

The physics-informed loss exhibits contrasting behavior depending on the dynamical regime. In Study Area 1, which has weak gradients, the loss provides little or no improvement and may slightly increase error, likely because over-constraining curvature in low-gradient regions introduces negligible or noisy guidance. In Study Area 2, where gradients are stronger, the physics-informed loss produces clear and consistent improvements, underscoring its utility in regions with strong dynamical signals.

Figure 16 quantifies module-specific error improvements across gradient bins. In Study Area 1, all modules except the physics-informed loss show positive gains across low, medium, and high gradient bins, with E-DFE achieving the largest improvement and GA the second largest. In Study Area 2, single-module variants may degrade performance in low- and medium-gradient bins, with the physics-informed loss producing the smallest negative impact, while GA achieves the largest improvement in the high-gradient bin.
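A gradient-binned comparison of this kind could be computed as sketched below. The binning rule (terciles of the reference gradient magnitude) and the improvement measure (baseline MAE minus variant MAE within each bin) are assumptions for illustration, since the exact bin edges are not restated here; the fields are synthetic.

```python
import numpy as np


def gradient_magnitude(field):
    """Gradient magnitude of a 2D field (unit grid spacing)."""
    gy, gx = np.gradient(field)
    return np.hypot(gx, gy)


def improvement_by_gradient_bin(ref, base_pred, new_pred, quantiles=(1 / 3, 2 / 3)):
    """Baseline-minus-variant MAE in low, medium, and high gradient bins.

    Positive values mean the variant has a lower error than the baseline.
    """
    grad = gradient_magnitude(ref)
    edges = np.quantile(grad, quantiles)
    bins = np.digitize(grad, edges)  # 0 = low, 1 = medium, 2 = high
    base_err = np.abs(base_pred - ref)
    new_err = np.abs(new_pred - ref)
    return [float(base_err[bins == b].mean() - new_err[bins == b].mean())
            for b in range(len(edges) + 1)]


# Synthetic example: errors scale with the local gradient, and the
# hypothetical variant halves the baseline error everywhere.
x = np.linspace(-1.0, 1.0, 64)
ref = np.tile(20.0 + np.tanh(x / 0.2), (64, 1))
grad = gradient_magnitude(ref)
baseline = ref + 0.50 * grad
variant = ref + 0.25 * grad

gains = improvement_by_gradient_bin(ref, baseline, variant)
print(dict(zip(["low", "medium", "high"], gains)))
```

Reporting the difference per bin, rather than a single domain-wide number, is what allows a module to be credited for gains at fronts even when it is neutral or slightly harmful in quiescent areas.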

Figure 17 illustrates monthly average curvature improvements. In Study Area 1, most modules fail to produce consistent gains, with only the physics-informed loss showing measurable improvement in several months. In Study Area 2, apart from the composite model, only the physics-informed loss consistently enhances curvature in both positive and negative regimes, reflecting its ability to regularize strongly dynamic fields.

Finally, the results from composite architectures highlight the complementarity of individual modules. In Study Area 1, the combination of E-DFE and GA yields consistently positive improvements across gradient bins, reflecting both local-pattern enhancement and edge preservation. In Study Area 2, GA combined with the physics-informed loss achieves positive gains across bins, effectively balancing gradient-aware reconstruction with physics-based regularization. Overall, these findings demonstrate that module effectiveness is regime-dependent, motivating an adaptive deployment strategy: prioritize E-DFE in smooth fields, apply GA widely (particularly in high-gradient months), and use the physics-informed loss selectively where dynamical signals render the constraint informative.

4.2. SSH Downscaling: Module Effectiveness and Regime Dependence

Figure 18 and Figure 19 present the spatial distributions of sea surface height (SSH) and the corresponding gradient magnitude on representative high-gradient days in Study Area 1 and Study Area 2, respectively. Similar to the SST results, all SSH downscaling models examined in this study are developed as incremental variants built upon a common baseline architecture. Consequently, the reconstructed SSH fields exhibit highly consistent large-scale spatial patterns, and visual differences among different model variants remain subtle at the full-domain scale.

Compared with the original LR SSH fields, all learning-based downscaling methods effectively suppress excessive spatial smoothing and recover finer-scale height variability, particularly in regions characterized by strong SSH gradients associated with mesoscale dynamics. These improvements are more clearly reflected in the gradient magnitude maps, where sharper gradients and more coherent frontal and eddy-related structures are observed relative to the LR input.

Figure 20 and Figure 21 further show enlarged views of selected high-gradient SSH subregions extracted from the highlighted areas. Consistent with the full-domain comparisons, the downscaled SSH fields produced by different model variants remain visually similar, with no pronounced differences discernible in absolute SSH values or overall spatial organization. In contrast, the LR SSH fields exhibit attenuated gradients and overly smooth spatial transitions, whereas all downscaled reconstructions restore more realistic spatial variability that is consistent with the high-resolution reference.

It is important to emphasize that the proposed SSH downscaling framework is designed as an incremental enhancement of a strong baseline model. As a result, the absolute differences among the learning-based SSH reconstructions are inherently small, making qualitative discrimination among model variants challenging. Under such conditions, visual inspection primarily serves to demonstrate the general capability of data-driven downscaling approaches to enhance spatial detail relative to the LR input, rather than to reliably distinguish the relative performance of different enhancement modules.

From a qualitative perspective, spatial detail enhancement in SSH is mainly manifested in the improved representation of mesoscale gradients, eddy boundaries, and localized height transitions, rather than in large changes in mean SSH levels. Although visual differences among the downscaled results remain subtle due to their shared architectural foundation, all learning-based reconstructions exhibit clearer gradient structures and reduced smoothing compared to the original LR fields.

Therefore, to objectively evaluate the relative performance of different enhancement modules and to identify their regime-dependent advantages, a comprehensive statistical analysis is conducted in the remainder of this section, focusing on error-based metrics as well as gradient- and curvature-related indicators under different dynamical conditions.

To further examine the robustness of the enhancement modules across contrasting oceanographic regimes, SSH downscaling performance is compared between Study Area 1 and Study Area 2. Study Area 1 is characterized by relatively quiescent mesoscale variability, whereas Study Area 2 exhibits more energetic dynamics with pronounced mesoscale fluctuations, providing a suitable testbed for assessing regime-dependent module effectiveness.

Figure 22 presents the monthly mean SSH gradients for both study areas, providing context for module-specific evaluation. Figure 23 quantifies error improvements across gradient bins. In Study Area 1, the physics-informed loss consistently achieves the largest reductions, while E-DFE shows limited impact on global errors but contributes to improved structural fidelity. In Study Area 2, GA yields the most substantial error suppression among the single-module variants, particularly in high-gradient regions, whereas E-DFE offers negligible improvement, reflecting the limitations of local convolutional biases in high-gradient, dynamically complex regions.

Figure 24 shows monthly average curvature improvements. In Study Area 1, E-DFE yields the largest curvature gains despite relatively higher reconstruction errors, indicating a trade-off between structural fidelity and global error reduction. In Study Area 2, the combined GA plus physics-informed loss model achieves the most pronounced curvature improvements, highlighting the synergistic effect of integrating multiple modules under dynamic conditions.

Monthly mean SSH reconstruction errors are presented in Figure 8 (Study Area 1) and Figure 10 (Study Area 2). The physics-informed loss achieves the lowest errors in most months in Study Area 1, whereas the combined GA plus physics-informed loss model consistently attains the lowest errors in Study Area 2, particularly during months with higher variability.

Overall, these results indicate that module effectiveness is context-dependent: the physics-informed loss stabilizes reconstruction errors, GA is most effective in high-gradient regions, and E-DFE enhances the representation of local curvature, although this does not consistently lead to reductions in global error metrics. Combining modules further improves both error suppression and curvature preservation in dynamically complex regions, confirming the benefit of module integration for robust SSH downscaling.

5. Conclusions

In this study, we propose HiT_DS, a hierarchical Transformer-based framework for scientific data downscaling, designed to enhance high-resolution reconstruction of SST and SSH fields. By integrating E-DFE, GA, and physics-informed loss functions, HiT_DS enables context-adaptive and physically consistent reconstruction across regions with distinct ocean dynamics.

Extensive experiments conducted in two representative marine regions lead to several key observations:

Module effectiveness is regime-dependent. In low-variability regions characterized by smooth SST or weak SSH gradients, E-DFE effectively reinforces local spatial structures and improves structural fidelity. In contrast, GA and physics-informed losses demonstrate their strongest impact in high-variability regions with sharp gradients, eddies, and mesoscale processes.
Complementarity of modules. Integrating E-DFE, GA, and physics-informed losses into a unified architecture yields synergistic improvements, suppressing large reconstruction errors, enhancing temporal consistency, and preserving high-gradient features. This confirms the importance of modular and adaptive deployment for complex oceanographic fields.
SST vs. SSH downscaling. For SST, improvements in local feature extraction and gradient-sensitive attention lead to more accurate and temporally stable reconstructions, especially in moderate-to-high variability regimes. For SSH, physically informed constraints play a dominant role in stabilizing errors, while GA enhances reconstruction in dynamically complex regions. E-DFE has limited effect for SSH, reflecting the need to account for large-scale and nonlocal dynamics.
Practical implications. Even modest reductions in RMSE, MAE, or relative error translate into practical value for operational ocean monitoring and forecasting, underscoring the utility of HiT_DS as a robust and flexible downscaling framework.

Overall, the proposed HiT_DS demonstrates that modular hierarchical Transformers, when combined with physics-informed enhancements, can effectively bridge the gap between generic image super-resolution and geophysically consistent oceanographic data reconstruction. Future improvements may include temporal modeling, multimodal data fusion, operational scalability, and physics-guided learning to further enhance accuracy, interpretability, and applicability in high-resolution ocean observation and forecasting.

Author Contributions

Methodology, M.W.; mathematical analysis, M.W.; writing—review, M.W. and X.W.; experiments, W.L.; writing—original draft, W.L.; data curation, R.C.; writing—review and editing, R.C.; conceptualization, X.W.; investigation, S.Z.; resources, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhao, X.; Chen, B.; Wu, M.; Gan, Q.; Wang, L. Characteristic Analysis of Spring Sea Surface Temperature Predictors for Tropical Cyclone Genesis over the Northwest Pacific in 2021. J. Agric. Disaster Res. 2021, 11, 76–80.
2. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015.
3. Chen, Y.; Wang, X.; Liu, Y. Reprocessing of Sea Surface Height Anomaly Data in China Offshore Waters. Mar. Sci. 2016, 40, 151–159.
4. Solanki, H.U.; Bhatpuria, D.; Chauhan, P. Signature Analysis of Satellite Derived SSHa, SST and Chlorophyll Concentration and Their Linkage with Marine Fishery Resources. J. Mar. Syst. 2015, 150, 12–21.
5. Carton, J.A.; Giese, B.S.; Grodsky, S.A. Sea Level Rise and the Warming of the Oceans in the Simple Ocean Data Assimilation (SODA) Ocean Reanalysis. J. Geophys. Res. Oceans 2005, 110, 1–8.
6. Zhao, G.; Li, D.; Yang, S.; Qi, J.; Yin, B.S. The Development of a Weather-Type Statistical Downscaling Model for Wave Climate Based on Wave Clustering. Ocean Eng. 2024, 304, 117863.
7. Al Azad, A.S.M.; Marsooli, R. A High-Resolution Coupled Circulation Wave Model for Regional Dynamic Downscaling of Water Levels and Wind Waves in the Western North Atlantic Ocean. Ocean Eng. 2024, 311, 118869.
8. Gao, T.; Jiang, H. Statistical Downscaling of Coastal Directional Wave Spectra Using Deep Learning. Coast. Eng. 2024, 192, 104557.
9. Zhang, Y.; Wu, P.; Duan, S.; Yang, H.; Yin, Z. Downscaling of Landsat 8 Land Surface Temperature Products Based on Deep Learning. Natl. Remote Sens. Bull. 2021, 25, 1767–1777.
10. Thiria, S.; Sorror, C.; Archambault, T.; Charantonis, A.; Bereziat, D.; Mejia, C.; Molines, J.-M.; Crépon, M. Downscaling of Ocean Fields by Fusion of Heterogeneous Observations Using Deep Learning Algorithms. Ocean Modell. 2023, 182, 102174.
11. Zhang, X.; Zhang, Y.; Yu, F. Hit-SR: Hierarchical Transformer for Efficient Image Super-Resolution. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024.
12. Good, S.; Fiedler, E.; Mao, C.Y.; Martin, M.J.; Maycock, A.; Reid, R.; Roberts-Jones, J.; Searle, T.; Waters, J.; et al. The Current Configuration of the OSTIA System for Operational Production of Foundation Sea Surface Temperature and Ice Concentration Analyses. Remote Sens. 2020, 12, 720.
13. E.U. Global Ocean Physics Reanalysis; Copernicus Marine Service Information (CMEMS); Marine Data Store (MDS); Mercator Ocean International: Toulouse, France, 2023.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017.
15. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
17. Choi, H.; Lee, J.; Yang, J. N-gram in Swin Transformers for Efficient Lightweight Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.
18. Zhou, Y.; Li, Z.; Guo, C.; Bai, S.; Cheng, M.; Hou, Q. SRFormer: Permuted Self-Attention for Single Image Super-Resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023.
19. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022.
20. Hui, Z.; Gao, X.B.; Yang, Y.C.; Wang, X. Lightweight Image Super-Resolution with Information Multi-Distillation Network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019.
21. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient Long-Range Attention Network for Image Super-Resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022.
22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
23. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual Aggregation Transformer for Image Super-Resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023.
24. Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, F.; Liu, W. Crossformer++: A Versatile Vision Transformer Hinging on Cross-Scale Attention. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3123–3136.
25. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023.
26. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.-H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022.
27. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
29. Beucler, T.; Pritchard, M.; Rasp, S.; Ott, J.; Baldi, P.; Gentine, P. Enforcing Analytic Constraints in Neural Networks Emulating Physical Systems. Phys. Rev. Lett. 2021, 126, 098302.
30. Hardy, C.M.; Livermore, P.W.; Niesen, J.; Luo, J.W.; Li, K. Determination of the Instantaneous Geostrophic Flow within the Three-Dimensional Magnetostrophic Regime. Proc. R. Soc. A Math. Phys. Eng. Sci. 2018, 474, 20180412.
31. Wang, S.; Li, X.; Zhu, X.; Li, J.; Guo, S. Spatial Downscaling of Sea Surface Temperature Using Diffusion Model. Remote Sens. 2024, 16, 3843.

Figure 1. Geographic locations of the two study regions. The left and right panels show Study Area 1 and Study Area 2, respectively. Coastlines, national boundaries, and major cities are shown for geographic reference.

Figure 2. Integrated architecture of the HiT_DS framework with modular enhancements, where dashed boxes indicate newly introduced modules and improvements over the baseline HiT, and ellipses indicate that intermediate Transformer layers are omitted for clarity.

Figure 3. Box plots of performance metrics for various models on SST downscaling in Study Area 1.

Figure 4. Line plots of performance metrics for various models on SST downscaling in Study Area 1.

Figure 5. Box plots of performance metrics for various models on SST downscaling in Study Area 2.

Figure 6. Line plots of performance metrics for various models on SST downscaling in Study Area 2.

Figure 7. Box plots of performance metrics for various models on SSH downscaling in Study Area 1.

Figure 8. Line plots of performance metrics for various models on SSH downscaling in Study Area 1.

Figure 9. Box plots of performance metrics for various models on SSH downscaling in Study Area 2.

Figure 10. Line plots of performance metrics for various models on SSH downscaling in Study Area 2.

Figure 11. Downscaled SST comparisons from different models on days with high daily gradients in Study Area 1.

Figure 12. Spatial distribution and gradient magnitude of SST on a high-gradient day in Study Area 2.

Figure 13. High-gradient subregion of SST and gradient magnitude on a high-gradient day in Study Area 1.

Figure 14. High-gradient subregion of SST and gradient magnitude on a high-gradient day in Study Area 2.

Figure 15. Monthly mean SST gradient fields in Study Areas 1 and 2.

Figure 16. Module-wise error improvement relative to HiT_SR across SST gradient bins.

Figure 17. Monthly average curvature improvements of SST relative to HiT_SR for all HiT variants.

Figure 18. Downscaled SSH comparisons from different models on days with high daily gradients in Study Area 1.

Figure 19. Downscaled SSH comparisons from different models on days with high daily gradients in Study Area 2.

Figure 20. High-gradient subregion of SSH and gradient magnitude on a high-gradient day in Study Area 1.

Figure 21. High-gradient subregion of SSH and gradient magnitude on a high-gradient day in Study Area 2.

Figure 22. Monthly mean SSH gradient fields in Study Areas 1 and 2.

Figure 23. Module-wise error improvement relative to HiT_SR across SSH gradient bins.

Figure 24. Monthly average curvature improvements of SSH relative to HiT_SR for all HiT variants.

Table 1. Performance Comparison of HiT_SR and Enhanced Variants for SST Downscaling in Study Area 1.

Model	RMSE (°C)	MAE (°C)	RE	TCC
HiT_SR	0.04636	0.02696	0.001641	0.999999756
HiT_SR + E-DFE	0.04405	0.02655	0.001559	0.999999906
HiT_SR + Loss	0.04705	0.02739	0.001665	0.999997563
HiT_SR + GA	0.04397	0.02658	0.001557	0.999999934
HiT_SR + E-DFE + GA	0.04364	0.02647	0.001545	0.999999929

Table 2. Performance Comparison of HiT_SR and Enhanced Variants for SST Downscaling in Study Area 2.

Model	RMSE (°C)	MAE (°C)	RE	TCC
HiT_SR	0.17185	0.09215	0.009753	0.999999698
HiT_SR + E-DFE	0.17224	0.09393	0.009949	0.999999662
HiT_SR + Loss	0.16904	0.09171	0.009711	0.999999766
HiT_SR + GA	0.16887	0.09225	0.008938	0.999999756
HiT_SR + GA + Loss	0.16441	0.08980	0.008607	0.999999913

Table 3. Performance Comparison of HiT_SR and Enhanced Variants for SSH Downscaling in Study Area 1.

Model	RMSE (m)	MAE (m)	RE	TCC
HiT_SR	0.00207	0.00110	0.014070	0.999999716
HiT_SR + E-DFE	0.00219	0.00113	0.014939	0.999997704
HiT_SR + Loss	0.00202	0.00104	0.013681	0.999999799
HiT_SR + GA	0.00208	0.00108	0.014121	0.999999727
HiT_SR + GA + Loss	0.00210	0.00109	0.014096	0.999999812

Table 4. Performance Comparison of HiT_SR and Enhanced Variants for SSH Downscaling in Study Area 2.

Model	RMSE (m)	MAE (m)	RE	TCC
HiT_SR	0.00443	0.00282	0.007315	0.999999358
HiT_SR + E-DFE	0.00451	0.00295	0.007860	0.999997704
HiT_SR + Loss	0.00423	0.00269	0.006506	0.999999766
HiT_SR + GA	0.00421	0.00263	0.006191	0.999999556
HiT_SR + GA + Loss	0.00414	0.00260	0.005940	0.999999666
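
The RMSE columns of Tables 1–4 can be summarized as a percentage change relative to the HiT_SR baseline, which makes the regime dependence of the modules easier to compare across variables and regions. The following is a minimal sketch, not part of the original study: the RMSE values are copied from the tables above, while the dictionary layout and the helper names `rmse_change_percent` and `best_variant` are illustrative assumptions.

```python
# RMSE values copied from Tables 1-4 (SST in degrees C, SSH in m).
# The data structure and helper names are hypothetical, chosen for illustration.
RMSE = {
    "SST, Study Area 1": {
        "HiT_SR": 0.04636,
        "HiT_SR + E-DFE": 0.04405,
        "HiT_SR + Loss": 0.04705,
        "HiT_SR + GA": 0.04397,
        "HiT_SR + E-DFE + GA": 0.04364,
    },
    "SST, Study Area 2": {
        "HiT_SR": 0.17185,
        "HiT_SR + E-DFE": 0.17224,
        "HiT_SR + Loss": 0.16904,
        "HiT_SR + GA": 0.16887,
        "HiT_SR + GA + Loss": 0.16441,
    },
    "SSH, Study Area 1": {
        "HiT_SR": 0.00207,
        "HiT_SR + E-DFE": 0.00219,
        "HiT_SR + Loss": 0.00202,
        "HiT_SR + GA": 0.00208,
        "HiT_SR + GA + Loss": 0.00210,
    },
    "SSH, Study Area 2": {
        "HiT_SR": 0.00443,
        "HiT_SR + E-DFE": 0.00451,
        "HiT_SR + Loss": 0.00423,
        "HiT_SR + GA": 0.00421,
        "HiT_SR + GA + Loss": 0.00414,
    },
}


def rmse_change_percent(table, baseline="HiT_SR"):
    """Percentage RMSE change of each variant relative to the baseline.

    Negative values mean the variant has a lower RMSE than the baseline.
    """
    base = table[baseline]
    return {
        model: 100.0 * (value - base) / base
        for model, value in table.items()
        if model != baseline
    }


def best_variant(table):
    """Name of the model with the lowest RMSE in one table."""
    return min(table, key=table.get)


for experiment, table in RMSE.items():
    changes = rmse_change_percent(table)
    best = best_variant(table)
    print(f"{experiment}: lowest RMSE = {best}")
    for model, change in changes.items():
        print(f"  {model}: {change:+.2f}% vs HiT_SR")
```

Under these numbers, the lowest-RMSE variant differs by experiment (E-DFE + GA for SST in Study Area 1, Loss alone for SSH in Study Area 1, and GA + Loss for both variables in Study Area 2), which is consistent with the selective, region-dependent module combinations described in the abstract.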


Share and Cite

MDPI and ACS Style

Wang, M.; Liu, W.; Chu, R.; Wang, X.; Zhu, S.; Liao, G. HiT_DS: A Modular and Physics-Informed Hierarchical Transformer Framework for Spatial Downscaling of Sea Surface Temperature and Height. Remote Sens. 2026, 18, 292. https://doi.org/10.3390/rs18020292

