SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution

Gao, Tianxiong; Zhang, Shuyan; Yao, Wutao; Shang, Erping; Yang, Jin; Ma, Yong; Ma, Yan

doi:10.3390/s26020733

Open AccessArticle

SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution

by

Tianxiong Gao

^1,2,

Shuyan Zhang

^1,*,

Wutao Yao

¹,

Erping Shang

¹,

Jin Yang

¹,

Yong Ma

^1,3,* and

Yan Ma

¹

Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China

²

School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China

³

State Key Laboratory of Media Convergence Production Technology and Systems, Beijing 100803, China

^*

Authors to whom correspondence should be addressed.

Sensors 2026, 26(2), 733; https://doi.org/10.3390/s26020733

Submission received: 4 December 2025 / Revised: 7 January 2026 / Accepted: 20 January 2026 / Published: 22 January 2026

(This article belongs to the Section Remote Sensors)

Download

Browse Figures

Versions Notes

Abstract

To address the scarcity of sub-meter remote sensing samples and structural inconsistencies such as edge blur and contour distortion in super-resolution reconstruction, this paper proposes SRCT, a super-resolution method tailored for sub-meter remote sensing imagery. The method consists of two parts: external structure guidance and internal structure optimization. External structure guidance is jointly realized by the structure encoder (SE) and structure guidance module (SGM): the SE extracts key structural features from high-resolution images, and the SGM injects these structural features into the super-resolution network layer by layer, achieving structural transfer from external priors to the reconstruction network. Internal structure optimization is handled by the backbone network SGCT, which introduces a dual-branch residual dense group (DBRDG): one branch uses window-based multi-head self-attention to model global geometric structures, and the other branch uses lightweight convolutions to model local texture features, enabling the network to adaptively balance structure and texture reconstruction internally. Experimental results show that SRCT clearly outperforms existing methods on structure-related metrics, with DISTS reduced by 8.7% and LPIPS reduced by 7.2%, and significantly improves reconstruction quality in structure-sensitive regions such as building contours and road continuity, providing a new technical route for sub-meter remote sensing image super-resolution reconstruction.

Keywords:

remote sensing; super-resolution reconstruction; structure preservation; deep learning

1. Introduction

The rapid development of remote sensing technology has enabled the acquisition of large amounts of remote sensing data. High-resolution remote sensing imagery is widely used in construction planning [1], disaster assessment [2], agricultural production [3], and other fields [4,5]. Its precise texture information and rich ground object information provide a reliable data foundation for fine-grained analysis and decision-making. In particular, sub-meter remote sensing imagery can be widely applied in urban refined management, precision agriculture, and environmental monitoring. However, high-resolution imagery, especially sub-meter imagery, remains in short supply. This is mainly due to sensor limitations, imaging constraints, and the high cost of deploying dense satellite constellations. Super-resolution (SR) reconstruction technology reconstructs low-resolution (LR) images into high-resolution (HR) images through algorithms, thereby obtaining more precise ground object information and improving the accuracy of remote sensing applications. It is the most cost-effective approach to address the shortage of high-resolution data.

In super-resolution reconstruction, especially for large upscaling factors such as 4×, researchers face a fundamental challenge: how to preserve the integrity of complex ground object structures while recovering rich texture details. Remote sensing imagery covers various ground objects such as urban areas, farmland, and mountains, with significantly different texture characteristics and multi-scale structural features. As shown in Figure 1, urban buildings typically exhibit regular geometric structures, and super-resolution models must accurately restore these structures while avoiding aliasing, blurring, or deformation. In the HR image in (b), the target building has a clear and regular rectangular contour, whereas the reconstruction result of current mainstream super-resolution methods in (a) shows jagged edges and shape distortion, failing to preserve the true geometric structure of the building. In contrast, natural landscapes such as farmland and mountains have irregular textures and may contain many repetitive patterns (e.g., crops) or complex terrain undulations, where the model needs to capture fine details without distorting shapes. This imposes stricter requirements on super-resolution reconstruction models: when confronted with real complex scenes, they must simultaneously recover fine texture details and maintain the structural integrity of ground objects. To achieve high-precision super-resolved images, current research efforts mainly focus on super-resolution models based on convolutional neural networks (CNNs), generative adversarial networks (GANs), and Transformers.

CNN-based super-resolution methods began with SRCNN proposed by Dong et al. [6], which first applied CNNs to learn the mapping from low-resolution to high-resolution images, outperforming traditional methods. Subsequent improvements include the following: FSRCNN [7] introduced transposed convolution to reduce computational cost; VDSR [8] adopted residual learning and gradient clipping to alleviate gradient problems; DRRN [9] and DRCN [10] deepened networks through residual connections; EDSR [11] removed BN layers and introduced residual scaling to improve training stability; RCAN [12] introduced channel attention to enhance channel discrimination; DRN [13] proposed a dual-regression strategy to constrain the solution space; LTE [14] used implicit functions to reconstruct details in a continuous manner; and SMSR [15] achieved acceleration by learning sparse masks. These methods achieved some success in texture recovery, but they are limited by the local receptive field of CNNs, making it difficult to capture long-range dependencies in remote sensing images. They perform poorly when handling large-scale structures, often resulting in structural distortions such as edge blurring and straight-line distortion. Subsequent improvements, such as the dual-path structure of HAUNet [16], the residual split attention group by Chen et al. [17], and the edge extraction module by Wang et al. [18], have made progress, but the inherent locality limitations of CNNs have not been fully overcome.

To overcome the locality limitations of CNNs, researchers have introduced GAN-based methods. GANs were proposed by Goodfellow et al. [19] in 2014, initially for image generation, and later applied to various vision tasks. SRGAN [20] first introduced GANs into super-resolution, generating richer and clearer textures and details through adversarial training, making reconstruction results more visually realistic. Subsequent improvements include the following: Park et al. [21] introduced feature-domain discriminators to reduce high-frequency noise; ESRGAN [22] added dense connections and residual connections to the generator and used conditional discriminators to improve visual realism; Zhang et al. [23] constructed loss functions based on image quality ranking; and Ma et al. [24] introduced geometric structures of high-resolution images into generative adversarial networks. For remote sensing image super-resolution, CDGAN [25] proposed dual-input discriminators and added pairwise loss functions to address the discrimination difficulty caused by remote sensing images having far more low-frequency information than natural images; Ma et al. [26] constructed generators based on deep residual networks using a Wasserstein GAN with gradient penalty; Jia et al. [27] constructed MA-GAN, introducing pyramid multi-scale convolution, residual connections, and various attention mechanisms in the generator network; Dong et al. [28] proposed high-resolution reference image-assisted methods, mining high-frequency features of reference images through gradient-assisted feature alignment operations; SPSR [24] introduced gradient branches to provide structural priors; and WGSR [29] enhanced texture sensitivity through wavelet transforms. In addition, Tu et al. [30] fused Swin Transformer with CNN hybrid architectures, Jia et al. [27] integrated multiple attention mechanisms, and Wang et al. [31] designed multi-scale sliding window attention mechanisms, all effectively improving the reconstruction quality of remote sensing images. However, GAN frameworks also introduce new problems: unstable training, the tendency to generate false textures, and limitations in preserving key structures such as building contours and road networks in remote sensing images. Although the above improved methods enhance structure preservation through gradient branches, wavelet transforms, or attention mechanisms, they still face challenges when handling complex structures in remote sensing images.

In recent years, Transformer architectures have been introduced into super-resolution due to their strong global modeling capability. Transformer-based methods such as SwinIR [32] and IPT [33] capture global dependencies through self-attention mechanisms, effectively addressing the limitations of traditional CNN methods in large-scale scene reconstruction and showing clear advantages in complex texture regions, improving structure preservation to some extent. However, these methods also have inherent drawbacks: when models tend to amplify high-frequency details, high-frequency structural boundaries are easily weakened, leading to reduced accuracy of key geometric structures in remote sensing images. Additionally, deep networks still face information bottleneck effects, causing structural information to gradually degrade during transmission. To address these problems, subsequent researchers have proposed various improvement strategies. CATANet [34] introduces a content-aware token aggregation module to enhance perception of key structures and textures while maintaining global modeling capability. ACT-SR [35] proposes an aggregated connection Transformer, effectively enhancing the feature representation of remote sensing images and reducing artifacts by aggregating multi-level information from local to global, from original images to feature maps, and from spatial to channel. Xiao et al. [36] address the problem of redundant token representations in remote sensing scenes by dynamically selecting the top-k key tokens with the highest scores for each query, capturing the most critical information and reducing interference from irrelevant tokens. To expand the perspective of feature extraction, Shi et al. [37] fuse phase and amplitude features from the frequency domain into the image domain based on discrete Fourier transform (DFT) operators, providing a new approach for remote sensing image super-resolution reconstruction.

Beyond Transformer-based models, several approaches improve structure preservation in remote sensing super-resolution from different modeling paradigms. FreMamba [38] introduces the visual state space model Mamba into remote sensing super-resolution, leveraging its linear complexity to efficiently model long-range dependencies in large-scale imagery and reduce the computational burden associated with self-attention. Another related line explores dual-branch designs that couple complementary information to enhance reconstruction. CRefDiff [39] adopts a local–global dual-branch fusion strategy for reference-guided remote sensing super-resolution, while DBFNet [40] employs complementary branches within a residual dense framework and fuses them adaptively to improve restoration quality.

Across the above super-resolution methods, a common goal is to enhance perceptual quality and recover high-frequency details. CNN-based models excel at local texture reconstruction but struggle with long-range structure modeling. GAN-based methods improve visual sharpness yet may introduce hallucinated details and reduce geometric fidelity. Transformer-based approaches better capture long-range dependencies, but information can still attenuate during deep propagation. As a result, existing methods often fail to simultaneously restore fine textures and preserve the structural integrity of complex ground objects, largely because structure is not treated as an explicit signal with a dedicated shallow-to-deep transmission mechanism.

We build upon the ability of Transformer-based super-resolution methods to model long-range dependencies, and we are motivated to alleviate the information bottleneck that may arise during deep feature propagation in such architectures. In addition, dual-branch designs have demonstrated the effectiveness of coupling complementary information sources. Inspired by this principle, SRCT explicitly extracts structural cues and injects them hierarchically into the reconstruction network, establishing a dedicated pathway for structure transmission and mitigating structural attenuation in deep models.

Based on the above insights, we propose SRCT, a structure-preserving super-resolution method for remote sensing images. The method consists of three core components working together: a structure encoder (SE), a structure guidance module (SGM), and a structure-guided residual cross-Transformer (SGCT). Specifically, the SE extracts and preserves key structural information from the original image. The SGM delivers structural features to all levels of the super-resolution network through a two-stage injection strategy. The SGCT serves as the backbone network and employs a dual-branch residual dense group (DBRDG) tailored for remote sensing characteristics, effectively balancing the reconstruction needs of regular geometric structures and irregular textures. The entire framework maintains structural consistency while recovering texture details, providing a new technical approach for remote sensing image super-resolution reconstruction. The main contributions of this study are as follows:

(1): Structure Encoder (SE): A multi-level cascaded convolutional and spatial attention module is designed to effectively extract and preserve key structural information from the original image, while capturing both macro structural features and local texture patterns.
(2): Structure Guidance Module (SGM): A two-stage injection strategy is implemented, including feature initialization guidance and deep feature fusion. Through attention-enhanced hierarchical transmission of structural features, structural information is precisely injected into all levels of the super-resolution network, effectively alleviating the information bottleneck effect in Transformer models and ensuring the integrity of complex ground object structures.
(3): Dual-Branch Residual Dense Group (DBRDG) for Remote Sensing Characteristics: A dual-branch structure is innovatively designed with window-based multi-head self-attention in the main path and lightweight convolution in the residual path. This effectively balances the reconstruction needs of regular geometric structures and irregular textures in remote sensing images, achieving collaborative optimization of global structure modeling and local texture preservation. It is particularly suitable for handling the complex and diverse ground object types in remote sensing images.

2. Methods

2.1. Overview

As shown in Figure 2, the proposed structure-preserving super-resolution method for remote sensing images consists of three core components: a structure encoder (SE), a structure guidance module (SGM), and a structure-guided residual cross-Transformer (SGCT). The entire framework operates in two stages: training and inference. In the training stage, the left part of the framework shows the training dataset composed of HR–LR image pairs. The middle part illustrates the core processing flow of the model: high-resolution images first pass through the SE to extract structural features, and the SGM then transfers these features to the SGCT model for training. This design ensures that structural information can effectively guide the super-resolution reconstruction process. In the inference stage (right part of the framework), the LR input is processed by the trained SRCT model to generate an SR image that preserves structural integrity. By integrating structural guidance information, the SRCT model can recover texture details while accurately maintaining complex ground object structures in remote sensing imagery. This architecture effectively overcomes the limitations of traditional methods in structure preservation and is particularly suitable for diverse ground object types and complex scene structures in remote sensing images. With the SE extracting structural information, SGM providing structure guidance, and SRCT performing super-resolution reconstruction, the three components work together to balance texture detail restoration and structural integrity preservation.

2.2. Structure Encoder (SE)

The structure encoder (SE) is used to extract and preserve key structural information from the original image. Different from designs where structural cues are expected to emerge implicitly from backbone features, the SE explicitly highlights edge- and contour-related responses and provides structure-aware features for hierarchical guidance, which helps mitigate structural attenuation in deep reconstruction. As shown in the blue box in Figure 2, the SE adopts a multi-level cascaded convolutional structure composed of three groups of 3 × 3 convolutional layers, each followed by a ReLU activation function. Two spatial attention modules are interleaved in the processing pipeline, as illustrated in Figure 3, to enhance the perception of edges and contours. Specifically, the module applies channel-wise average pooling and max pooling to the input feature map, concatenates the two pooled results along the channel dimension, and then fuses them with a 7 × 7 convolution. A Sigmoid activation is finally used to generate the spatial attention weights

F_{r c}

, which adaptively enhance key structural regions.

In addition, the SE introduces residual blocks, as shown in Figure 4, to alleviate gradient vanishing and strengthen detail reuse. The main path of each residual block contains two 3 × 3 convolutional layers, each followed by batch normalization, with a ReLU activation inserted between them. A skip connection adds the block input to the main-path output, and the sum is passed through a ReLU activation to obtain the final output, achieving identity mapping to promote gradient flow and feature reuse. The entire structure encoding process can be expressed as follows:

F_{arch} = g_{SA} (f_{3 \times 3} (ReLU [I_{HR}]))

(1)

2.3. Structure Guidance Module (SGM)

As shown in the green box in Figure 2, the structure guidance module (SGM) serves as the key bridge between the structure encoder and the super-resolution network. It delivers structural information to the SGCT backbone through a two-stage injection strategy. In the first stage (feature initialization guidance), the structural features are first rescaled and channel-aligned by interpolation and convolution. They are then fed into a residual sub-block composed of LayerNorm, a 3 × 3 convolution, and ReLU. This suppresses noise and forms a stable shallow structural prior. The output, denoted as

E_{1}

, is injected into the shallow feature

F_{SGCT 1}

of the super-resolution network. In the second stage (deep multi-level structure guidance), the initialized structural features are sequentially passed through a LayerNorm plus spatial attention module and a LayerNorm plus a multi-scale convolution module (MSConv). The former explicitly enhances responses at edges and contours. The latter integrates structural information at different scales. The resulting features

E_{2}

and

E_{3}

are injected into the mid- and deep-level features

F_{SGCT 3}

and

F_{SGCT 5}

, forming a hierarchical structure feeding process from shallow to deep.

The structure of MSConv is shown in Figure 5. The input feature is first processed by a convolution layer to obtain a basic representation. It is then split into three parallel branches along the channel dimension. Each branch uses a depthwise convolution (DWConv) with kernel sizes of 1 × 1, 3 × 3, and 5 × 5 to extract structural details under different receptive fields. The outputs of the three branches are concatenated along the channel dimension and added to the main residual. After that, another channel split and convolution are applied to reorganize the features. In this way, MSConv achieves multi-scale structure modeling while keeping the computational cost under control. The fusion of structural features and backbone features in the SGM can be expressed as follows:

F_{fused} = F_{SGCT} + M S C o n v (M_{sa} \otimes F_{SE}^{matched})

(2)

F_{SGCT}

denotes the backbone feature at the current layer,

F_{SE}^{matched}

is the structural feature aligned in scale and channels through interpolation and convolution,

M_{sa}

is the spatial attention weight map generated by the spatial attention module,

M S C o n v (\cdot)

denotes the multi-scale convolutional transform, and ⊗ denotes element-wise multiplication.

This multi-level feature mapping

{E_{1}, E_{2}, E_{3}} \to {F_{SGCT 1}, F_{SGCT 3}, F_{SGCT 5}}

enables precise transmission of structural information. The staged multi-level injection strategy effectively alleviates the information bottleneck of the Transformer backbone in deep structural modeling, ensuring that the skeletal information extracted by the structure encoder can still be fully exploited in the deeper layers of the network.

2.4. Super-Resolution Reconstruction Network SGCT

As the backbone of the entire super-resolution reconstruction framework, the SGCT consists of three main modules—shallow feature extraction, deep feature extraction, and image reconstruction—forming an end-to-end pipeline, as shown in the orange part of Figure 2. The shallow feature extraction module uses a 3 × 3 convolution to obtain initial features from the low-resolution input and incorporates feature initialization guidance from the structure guidance module, so that shallow structural priors from the structure encoder are directly injected into this stage.

The deep feature extraction module is built from multiple cascaded dual-branch residual dense groups (DBRDGs), which form the core of the SGCT. As shown in Figure 6, each DBRDG is organized with two cooperative branches. One branch is a dense feature aggregation branch composed of several sequential units, and each unit consists of an STL block followed by a convolution layer and a LeakyReLU activation. The intermediate features from different units are progressively aggregated through dense concatenation to promote feature reuse and strengthen local representation. The other branch is a group-level residual shortcut that directly propagates the input feature and adds it to the aggregated output, which stabilizes optimization and preserves low-frequency information. By stacking multiple DBRDGs, the backbone is able to enhance fine-grained details while maintaining stable feature propagation. After the DBRDG stack, the features are further processed by Transformer-style sub-blocks, including a LayerNorm–W-MSA block and a LayerNorm–MLP block, together with a global residual branch. The window-based multi-head self-attention uses a 16 × 16 window with an overlap ratio of 0.5 to capture long-range dependencies and global structural relationships in remote sensing scenes. Throughout the deep feature extraction stage, multi-level structural features from the structure guidance module are injected into the corresponding layers, so that structural priors can be continuously exploited from shallow to deep features during reconstruction. Different from the dual-branch designs in CRefDiff [39] and DBFNet [40], our two paths are not modality- or domain-specific streams. One path densely aggregates intermediate features to reinforce structural cues, while the other directly forwards the input features through a skip connection to stabilize feature propagation, which helps preserve structure-sensitive patterns in remote sensing imagery.

SRCT is innovative in its tailored design for the unique characteristics of remote sensing images. For the coexistence of regular geometric structures (such as building contours and road networks) and irregular textures (such as vegetation and water bodies), it uses a dual-path structure of window attention and lightweight convolutions to balance global structure modeling and local texture preservation. For the diversity of object types and large scale variations in remote sensing imagery, it designs a multi-scale structural feature injection mechanism to ensure that structures at different scales can be effectively recovered. For the coexistence of large homogeneous regions and rich high-frequency details, it combines a structure guidance module with deep feature fusion to enhance the network’s sensitivity to edges and textures. These targeted designs make SRCT particularly suitable for handling complex object structures and multi-scale features in remote sensing images, enabling it to recover rich texture details while maintaining structural integrity.

To ensure that structural information is effectively preserved throughout the reconstruction process, this study introduces a structural consistency loss:

L_{struct} = \sum_{i = 1}^{3} w_{i} | | E_{i} (I_{degraded}) - E_{i} (D (I_{LR})) {| |}_{1}

(3)

where

E_{i}

denotes the i-th layer of the structure encoder, D denotes bicubic upsampling, and the weighting coefficients

w_{i}

are set to

{0.2, 0.6, 0.2}

to emphasize the importance of mid-level features. This loss term enforces consistency between the structural encodings before and after degradation, providing effective guidance for super-resolution reconstruction.

The complete training and inference workflow of SRCT is illustrated in Algorithm 1. During training, we adopt a feature-space supervision strategy based on the structure encoder. Specifically, both the super-resolved output and the original high-resolution image are passed through a pre-trained structure encoder to extract feature representations, and the L1 loss is then computed in the feature space. Compared with traditional pixel-space losses, feature-space supervision focuses on the semantic content and visually important structures of the image rather than simple pixel-level matching. This supervision enables the model to produce super-resolved results that are more natural in appearance and richer in detail, while avoiding excessive smoothing. The total loss function

L_{total}

consists of three parts:

L t o t a l = L r e c + λ p e r \cdot L p e r + λ s t r \cdot L s t r

(4)

where,

L_{rec}

is the L1 reconstruction loss, which ensures accurate pixel-level reconstruction;

L_{per}

is the perceptual loss based on the VGG network, which enhances visual perceptual quality; and

L_{str}

is the structural feature loss, which effectively preserves structural information through feature-space supervision of the structure encoder.

Algorithm 1 Training and inference pipeline of SRCT

Input: HR image set

X_{H R}

, LR image set

X_{L R}

, structure encoder SE,
structure guidance mechanism SGM, backbone SGCT,
loss weights

λ_{per}, λ_{feat}

Output: Trained SRCT model; SR prediction for given LR input
1   Initialize SE, SGM and SGCT parameters
2   for epoch = 1 …total_epochs do
3      for batch (xHR, xLR) in (XHR, XLR) do
4         // Phase 1: Structure encoding
5         fSE = SE(xHR)
6         Farch = gSA(fSE)
7         // Phase 2: Structure guidance
8         (E1x, E2x, E3x) = SGM(Farch)
9         Inject E1x, E2x, E3x =⇒ FSRCT1, FSRCT3, FSRCT5
10         // Phase 3: Structure-guided SR reconstruction
11         F0 = Conv3 × 3(xLR)
12         Fdeep = SGRCT(F0, E1x, E2x, E3x)
13         xSR = Upsample(Fdeep)
14         // Loss computation
15         Lrec = ||xSR − xHR||₁
16         Lper = VGG(xSR, xHR)
17         Lstr = ||SE(xSR) − SE(xHR)||₁
18         Ltotal = Lrec +

λ

per · Lper +

λ

str · Lstr
19         Backpropagate Ltotal, update parameters of SE, SGM, SGCT
20      end for
21   end for
22   // Inference
23   Given unseen LR image xLR*
24   fSE* = SE(HR_ref) or skip if unavailable
25   (E1*, E2*, E3*) = SGM(fSE*) or cached guidance
26   xSR* = SGRCT(xLR*, E1*, E2*, E3*)
27   return xSR*

3. Experiment and Results

3.1. Dataset

This study uses optical remote sensing images from China’s Gaofen series satellites as experimental data, as shown in Table 1, including the High-Resolution Multi-Mode Imaging Satellite (GFDM), Gaofen-1 (GF-1), and Gaofen-6 (GF-6). The panchromatic images of GFDM have a spatial resolution of 0.5 m and the multispectral images have a resolution of 2 m, while the panchromatic images of GF-1 and GF-6 have a resolution of 2 m and their multispectral images have a resolution of 8 m. The collected scenes cover multiple regions in China, including Beijing, Henan, and Guangxi, and encompass diverse landforms and land-cover types. The imagery was selected from different seasons to capture seasonal appearance variations and typical land-cover changes.

The data preprocessing pipeline comprises four main steps. First, inter-band registration is performed. Second, data fusion is conducted: the 0.5 m panchromatic images and 2 m multispectral images from the GFDM satellite are fused to generate 0.5 m multispectral images as the high-resolution reference, while the 2 m panchromatic images and 8 m multispectral images from GF-1 and GF-6 are fused to obtain 2 m multispectral images as the low-resolution input. Third, orthorectification is applied to the fused images to eliminate geometric distortions caused by terrain relief and satellite viewing angles. Finally, radiometric correction is performed to remove radiation errors introduced by atmospheric and sensor effects. Resampling operations involved in registration, fusion, and orthorectification may introduce small interpolation-related spectral deviations, which are expected to have a minor impact on super-resolution reconstruction.

After the complete preprocessing workflow, the fused data achieve a 4:1 resolution ratio, forming a 4× super-resolution reconstruction dataset. We crop the GFDM imagery into 1600 × 1600-pixel high-resolution patches and the GF-1/GF-6 imagery into corresponding 400 × 400-pixel low-resolution patches, obtaining 8192 paired training samples. The dataset is split into training, validation, and testing sets in a 7:2:1 ratio, and representative samples are shown in Figure 7.

3.2. Experimental Details

The experimental settings are summarized in Table 2. For the SRCT model, we trained on 256 × 256 image patches for a total of 300K iterations with a batch size of 4 and an initial learning rate of

1 \times 10^{- 4}

. We used the Adam optimizer with

β_{1}

= 0.9 and

β_{2}

= 0.99. Data augmentation during training included random rotation and horizontal flipping. All experiments were conducted using PyTorch 1.13.1 on an NVIDIA RTX 3090 GPU with 24 GB of memory.

3.3. Evaluation Metrics

To comprehensively evaluate the performance of the proposed method, this study establishes a multi-dimensional evaluation system. The system includes full-reference image quality metrics such as PSNR, SSIM [41], DISTS [42], and FID [43], as well as no-reference image quality metrics such as NIQE [44], MUSIQ [45], MANIQ [46], and CLIPIQA [47]. It covers both traditional evaluation metrics and advanced metrics based on deep learning. The specific formulas and descriptions are provided in Table 3.

3.4. Comparison with Other Models

In this study, we selected several advanced super-resolution methods as baselines for comparison. For remote sensing-specific super-resolution, we adopt HAUNet [16], FreMamba [38], and ACT-SR [35]. For general-purpose super-resolution, we choose EDSR [11], CATANet [34], SPSR [24], and WGSR [29] as representative methods. These baselines are selected because they cover diverse modeling paradigms, including CNN-, GAN-, Transformer-, and Mamba-based approaches, and include representative methods that are either classic or recent within each category. Moreover, by including both remote sensing-specific and general-purpose super-resolution models, we provide a more comprehensive comparison.

All compared SOTA methods were implemented under the same software and hardware environment. To ensure fairness, all experiments used the same data augmentation strategy and the default hyperparameter settings for each method. As shown in Table 4, our method achieves the best or near-best performance on multiple key metrics. In terms of perceptual quality, it performs particularly well, with an LPIPS value of 0.2594 and a DISTS value of 0.1625, both being the best among all methods, and a FID value of 77.301, which is also the lowest. This indicates that the distribution of the generated images is closest to that of the real high-resolution images. These results directly validate the effectiveness of the proposed structure-preserving mechanism, demonstrating that our method can retain and transmit critical structural information during degradation and thus address the information loss problem in traditional approaches.

In terms of no-reference quality metrics, our method achieves the best or near-best results on NIQE (5.833), CLIPIQA (0.3776), MUSIQ (42.725), and MANIQA (0.2027), indicating that the generated images have high naturalness and visual quality. This further demonstrates the superiority of our method in real-world scenarios without reference images. Comparing different method types, CNN-based methods such as EDSR and HAUNet tend to produce smooth but detail-deficient results. GAN-based methods such as SPSR and WGSR perform well on some perceptual quality metrics but are less balanced overall than our method. Mamba-based methods such as FreMamba show good balance but do not achieve the best performance on any metric. Transformer-based methods such as ACT-SR and CATANet fall short of our method on structure preservation metrics such as DISTS.

In terms of traditional distortion metrics, our method achieves a PSNR of 24.821 dB and an SSIM of 0.6516, ranking second to some baseline methods. Although it does not achieve the best performance on these traditional metrics, this reflects the common “perception–distortion trade-off” in super-resolution. PSNR and SSIM, which are based on pixel-wise differences and low-level statistics, have inherent limitations for evaluating remote sensing super-resolution. They do not reliably reflect semantic and structural fidelity in complex scenes, as their scores are often dominated by large homogeneous areas and they tend to favor overly smooth reconstructions, making them less sensitive to the structural quality of complex ground objects. To comprehensively demonstrate the performance of each model across multiple types of evaluation metrics, this study normalizes all metrics in the direction where larger values indicate better performance.

For metrics where larger values are better (PSNR, SSIM, CLIP-IQA, MUSIQ, and MANIQA), we use

\hat{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

(5)

For metrics where smaller values are better (LPIPS, DISTS, FID, and NIQE), we use

\hat{x} = \frac{x_{\max} - x}{x_{\max} - x_{\min}}

(6)

The normalized results are shown in Figure 8. SRCT’s contour is closest to the outer ring across nine metrics, indicating superior performance in most evaluation dimensions. Overall, the radar chart shows that SRCT maintains high reconstruction accuracy while balancing structure and perceptual quality, achieving more balanced and comprehensive performance than other methods.

In summary, the experimental results demonstrate the effectiveness and superiority of the proposed structure-preserving super-resolution method for remote sensing image super-resolution, particularly in maintaining the integrity of complex ground object structures.

In addition to quantitative metric comparisons, we conduct visual comparisons to evaluate the reconstruction quality of each method in real remote sensing scenes.

As shown in Figure 9, we select typical remote sensing scenes such as roads, playgrounds, and buildings for visual comparison.

In building scenes, EDSR and HAUNet show obvious blurring in building contour recovery, with edges not sharp enough. Although SPSR and WGSR can enhance edge sharpness, they suffer from over-sharpening and false textures, resulting in unnatural texture patterns in roof areas. FreMamba maintains overall structure well but lacks detail processing, with rough roof textures. CATANET and ACT-SR can recover building contours well but exhibit color shifts in roof texture areas, deviating from true colors. In contrast, SRCT accurately reconstructs building geometric contours and roof textures, with clear and natural building edges, and the overall visual effect is closest to real high-resolution images.

In road scenes, EDSR and SPSR show blurring and aliasing artifacts in lane edge recovery, with discontinuous lane lines. Although HAUNet maintains some edge sharpness, it lacks road marking details, with inaccurate marking width and shape. WGSR and FreMamba show discontinuities in lane line recovery, with broken markings. Although CATANET and ACT-SR can sharpen road edges, they introduce random noise, affecting overall visual quality. In contrast, SRCT accurately reconstructs lane markings with continuous and natural lines and smooth road edges with rich details.

In playground scenes, EDSR and CATANET fail to recover track lines clearly, with blurred white line edges. WGSR and SPSR over-sharpen, causing track lines to appear jagged with unnatural edge effects. FreMamba maintains overall structure well but shows broken track markings with incomplete lines. Although ACT-SR can recover the overall playground structure well, detail processing shows distortion with slight geometric shape distortion of the track. In contrast, SRCT completely restores track white lines and lawn stripes, with continuous lines and accurate geometric shapes, maintaining overall structural integrity.

Overall observations show that our proposed SRCT method performs excellently across different types of scenes, maintaining continuous edges, complete structures, realistic colors, and rich details, and significantly outperforms other comparison methods in visual perceptual quality.

3.5. Ablation Study

To validate the effectiveness of each component in the proposed framework, we design a series of ablation experiments, as shown in Table 5. Starting from the baseline model, we gradually add three key components—the dual-branch residual dense group (DBRDG), the structure encoder (SE), and the structure guidance module (SGM)—to evaluate the contribution of each component to super-resolution reconstruction performance.

After introducing DBRDG, PSNR increases to 24.2235 dB, SSIM improves to 0.6312, LPIPS decreases to 0.2646, and MANIQA increases to 0.1654, demonstrating the effectiveness of the dual-branch design in balancing global structure modeling and local texture preservation. DBRDG can simultaneously handle regular geometric structures and irregular textures in remote sensing images through the collaborative work of window-based multi-head self-attention in the main path and lightweight convolution in the residual path.

After further introducing the structure encoder (SE), the model performance improves significantly: PSNR reaches 24.8915 dB, SSIM improves to 0.6423, LPIPS slightly increases to 0.2745, DISTS decreases from 0.2123 to 0.1732, and MANIQA increases to 0.1956. This indicates that the structure encoder can effectively extract and preserve key structural information from the original image, providing important structural priors for super-resolution reconstruction. At this stage, structural information is passed to the super-resolution model through feature concatenation, i.e., directly connecting the structural features extracted by SE with the shallow features of the super-resolution network, providing structural priors at the network input stage.

After adding the structure guidance module (SGM), the complete model is formed, and all metrics are further improved. PSNR slightly decreases to 24.821 dB, but SSIM improves to 0.6516, LPIPS decreases to 0.2594, DISTS further decreases to 0.1625, and MANIQA reaches the optimal value of 0.2027. This indicates that SGM successfully injects structural information into all levels of the super-resolution network through the two-stage strategy of feature initialization guidance and deep feature fusion, effectively alleviating the information bottleneck effect in Transformer models. Unlike the single-entry injection when using only the SE, the SGM achieves hierarchical and precise transmission of multi-scale structural features, ensuring that structural information can be effectively utilized at various depth levels of the network. The significant improvement in MANIQA, a metric for evaluating perceptual quality, demonstrates the important role of the SGM in enhancing visual perceptual quality.

The ablation results show that all three proposed core components contribute significantly to model performance. Their synergy enables the method to maintain the integrity of complex ground object structures while recovering rich texture details, making it particularly suitable for handling diverse ground object types and complex scene structures in remote sensing images.

4. Discussion

This study focuses on super-resolution reconstruction of sub-meter remote sensing imagery with a spatial resolution of 0.5 m, a resolution range that has been rarely explored in existing research. Although a few works such as SA-GAN [48] and DESAT [49] have attempted to use real remote sensing image pairs for training, their resolutions are still far from the sub-meter level. Sub-meter remote sensing imagery has unique characteristics and challenges. It contains extremely rich ground object details, such as fine building structures, vehicles, and small vegetation, which are often difficult to distinguish in lower-resolution imagery. At the same time, such imagery requires higher edge sharpness and geometric accuracy, making any slight structural distortion more prominent. In addition, the complex coupling between texture and structure is more evident in sub-meter imagery, making it difficult for traditional super-resolution methods to achieve a good balance between detail texture recovery and overall structure preservation.

This study constructs a real image-pair dataset of 0.5 m sub-meter remote sensing imagery and demonstrates the superiority of the proposed method through both quantitative metrics and visual comparisons. In quantitative evaluation, the SRCT model significantly outperforms existing methods on structure-related metrics, with DISTS reduced by 8.7% and LPIPS reduced by 7.2%. Visual comparisons show that SRCT achieves significant improvements in reconstruction quality in structure-sensitive regions such as building contours and road continuity. This study innovatively adopts a strategy combining “external structure guidance” and “internal structure optimization” to comprehensively improve the super-resolution reconstruction quality of sub-meter remote sensing imagery. At the external guidance level, the structure encoder (SE) and structure guidance module (SGM) build a bridge of structural information between high- and low-resolution images. The SE extracts key structural features from high-resolution images through multi-level cascaded convolutions and spatial attention mechanisms, while the SGM precisely delivers these structural features to all levels of the super-resolution network through a two-stage injection strategy. This effectively addresses the problem of structural information degradation in deep networks in traditional methods, enabling complete preservation of complex ground object structures. At the internal optimization level, the dual-branch residual dense group (DBRDG) reshapes the core architecture of the super-resolution network. Through the collaborative work of window-based multi-head self-attention in the main path and lightweight convolution in the residual path, it achieves adaptive processing of remote sensing ground objects. This internal design is particularly suited to the coexistence of regular geometric structures and irregular textures in sub-meter imagery, enabling the model to preserve the integrity of regular structures such as buildings and roads while restoring natural texture details in regions such as vegetation and water bodies. This innovative design combining external and internal mechanisms enables the SRCT model to simultaneously address the dual challenges of structure preservation and texture recovery in sub-meter remote sensing image super-resolution reconstruction, providing a practical and efficient technical solution to the shortage of sub-meter imagery.

This study still has limitations. As shown in Figure 10, SRCT does not have advantages in terms of parameters and FLOPs [50]. Both model size and computational cost per inference are at medium-to-high levels, and the larger bubble also indicates relatively higher FLOPs. Under the same hardware conditions, higher FLOPs typically lead to longer inference time. Nevertheless, SRCT’s actual inference speed and resource consumption remain within acceptable limits. For example, when performing super-resolution reconstruction on a 39,120 × 39,131 GF-6 image of Shouguang City, Shandong Province, China, on an RTX 3090 GPU, SRCT completes the computation in approximately 10 h, corresponding to roughly 23.5 s per megapixel, which can meet the engineering application requirements for sub-meter remote sensing image super-resolution reconstruction. In our operational workflow, the large scene is processed by tiling and mosaicking, including cropping the image into tiles, tile-wise super-resolution inference, adding georeferencing information to the super-resolved tiles, removing overlapping regions, and mosaicking the tiles to generate the final output.

5. Conclusions

To address the structure preservation problem in sub-meter remote sensing image super-resolution reconstruction, this paper proposes SRCT, a structure-guided super-resolution method. Through the collaborative work of three core components—the structure encoder (SE), structure guidance module (SGM), and dual-branch residual dense group (DBRDG)—the method achieves the goal of maintaining the integrity of complex ground object structures while recovering texture details. Experimental results show that SRCT significantly outperforms existing methods on structure-related metrics, with DISTS reduced by 8.7% and LPIPS reduced by 7.2%, and achieves notable improvements in reconstruction quality in structure-sensitive regions such as building contours and road continuity. Ablation studies validate the effectiveness of each core component and their synergy. This study provides a new technical approach for sub-meter remote sensing image super-resolution reconstruction and has important application value for addressing the shortage of high-resolution remote sensing data. Future research will further explore the application of structural information in more complex ground object scenarios, optimize model computational efficiency, and deeply integrate the method with specific application domains to promote the application of super-resolution technology in real-world scenarios.

Author Contributions

T.G. and S.Z. conceived the study and designed the SRCT framework and experimental protocol. T.G. implemented the algorithms, constructed the 0.5 m dataset, conducted the experiments, performed data analysis, and drafted the manuscript. W.Y. and E.S. contributed to data preprocessing, model implementation, and ablation/visualization experiments. J.Y. and Y.M. (Yan Ma) assisted in methodology refinement, result interpretation, and manuscript revision. Y.M. (Yong Ma) and S.Z. supervised the overall research, provided technical guidance, and critically reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 42201063.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Jiao, L.; Huang, Z.; Liu, X.; Yang, Y.; Ma, M.; Zhao, J.; You, C.; Hou, B.; Yang, S.; Liu, F.; et al. Brain-inspired remote sensing interpretation: A comprehensive survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2992–3033. [Google Scholar] [CrossRef]
Li, D.; Wang, M.; Jiang, J. China’s high-resolution optical remote sensing satellites and their mapping applications. Geo-Spat. Inf. Sci. 2021, 24, 85–94. [Google Scholar] [CrossRef]
Neyns, R.; Canters, F. Mapping of urban vegetation with high-resolution remote sensing: A review. Remote Sens. 2022, 14, 1031. [Google Scholar] [CrossRef]
Zhang, Z.; Xu, H.; Lin, S.; Li, D.; Gao, Y. S2Transformer: Exploring Sparsity in Remote Sensing Images for Efficient Super-Resolution. Sensors 2025, 25, 5643. [Google Scholar] [CrossRef] [PubMed]
Yu, J.; Lin, C.; Peng, L.; Zhong, C.; Li, H. MSFANet: A Multi-Scale Feature Fusion Transformer with Hybrid Attention for Remote Sensing Image Super-Resolution. Sensors 2025, 25, 6729. [Google Scholar] [CrossRef]
Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 391–407. [Google Scholar]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3147–3155. [Google Scholar]
Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
Zhang, K.; Zuo, W.; Zhang, L. Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3262–3271. [Google Scholar]
Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-loop matters: Dual regression networks for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5407–5416. [Google Scholar]
Lee, J.; Jin, K.H. Local texture estimator for implicit representation function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1929–1938. [Google Scholar]
Wang, L.; Dong, X.; Wang, Y.; Ying, X.; Lin, Z.; An, W.; Guo, Y. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4917–4926. [Google Scholar]
Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid attention-based U-shaped network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612515. [Google Scholar] [CrossRef]
Chen, X.; Wu, Y.; Lu, T.; Kong, Q.; Wang, J.; Wang, Y. Remote sensing image super-resolution with residual split attention mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1–13. [Google Scholar] [CrossRef]
Wang, Y.; Huang, Z.; Wang, X.; Zhang, S.; Liu, S.; Feng, L. Lightweight Edge-Guided Super-Resolution Network for Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5626714. [Google Scholar] [CrossRef]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
Park, S.J.; Son, H.; Cho, S.; Hong, K.S.; Lee, S. Srfeat: Single image super-resolution with feature discrimination. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 439–455. [Google Scholar]
Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
Zhang, W.; Liu, Y.; Dong, C.; Qiao, Y. Ranksrgan: Generative adversarial networks with ranker for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3096–3105. [Google Scholar]
Ma, C.; Rao, Y.; Cheng, Y.; Chen, C.; Lu, J.; Zhou, J. Structure-preserving super resolution with gradient guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7769–7778. [Google Scholar]
Lei, S.; Shi, Z.; Zou, Z. Coupled adversarial training for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3633–3643. [Google Scholar] [CrossRef]
Ma, W.; Pan, Z.; Guo, J.; Lei, B. Super-resolution of remote sensing images based on transferred generative adversarial network. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 1148–1151. [Google Scholar]
Jia, S.; Wang, Z.; Li, Q.; Jia, X.; Xu, M. Multiattention generative adversarial network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624715. [Google Scholar] [CrossRef]
Dong, R.; Zhang, L.; Fu, H. RRSGAN: Reference-based super-resolution for remote sensing image. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601117. [Google Scholar] [CrossRef]
Korkmaz, C.; Tekalp, A.M.; Dogan, Z. Training generative image super-resolution models by wavelet-domain losses enables better control of artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5926–5936. [Google Scholar]
Tu, J.; Mei, G.; Ma, Z.; Piccialli, F. SWCGAN: Generative adversarial network combining swin transformer and CNN for remote sensing image super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5662–5673. [Google Scholar] [CrossRef]
Wang, C.; Zhang, X.; Yang, W.; Wang, G.; Li, X.; Wang, J.; Lu, B. MSWAGAN: Multispectral remote sensing image super-resolution based on multiscale window attention transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5404715. [Google Scholar] [CrossRef]
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310. [Google Scholar]
Liu, X.; Liu, J.; Tang, J.; Wu, G. CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 20–25 June 2025; pp. 17902–17912. [Google Scholar]
Kang, Y.; Wang, X.; Zhang, X.; Wang, S.; Jin, G. ACT-SR: Aggregation Connection Transformer for Remote Sensing Image Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 8953–8964. [Google Scholar] [CrossRef]
Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Lin, C.W.; Zhang, L. TTST: A top-k token selective transformer for remote sensing image super-resolution. IEEE Trans. Image Process. 2024, 33, 738–752. [Google Scholar] [CrossRef]
Shi, L.; Cheng, Y.; Li, R.; Wang, H.; Zhao, J.; Qiang, Y.; Zhao, J. Dual-Domain Optimization Model Based on Discrete Fourier Transform and Frequency Domain Fusion for Remote Sensing Single-image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2025, 63, 3000215. [Google Scholar] [CrossRef]
Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.W. Frequency-assisted mamba for remote sensing image super-resolution. IEEE Trans. Multimed. 2024, 27, 1783–1796. [Google Scholar] [CrossRef]
Wang, C.; Sun, W. Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors. arXiv 2025, arXiv:2506.23801. [Google Scholar]
Sun, K.; Tian, Y. Dbfnet: A dual-branch fusion network for underwater image enhancement. Remote Sens. 2023, 15, 1195. [Google Scholar] [CrossRef]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
Ding, K.; Ma, K.; Wang, S.; Simoncelli, E.P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2567–2581. [Google Scholar] [CrossRef]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
Zhang, L.; Zhang, L.; Bovik, A.C. A feature-enriched completely blind image quality evaluator. IEEE Trans. Image Process. 2015, 24, 2579–2591. [Google Scholar] [CrossRef]
Ke, J.; Wang, Q.; Wang, Y.; Milanfar, P.; Yang, F. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5148–5157. [Google Scholar]
Yang, S.; Wu, T.; Shi, S.; Lao, S.; Gong, Y.; Cao, M.; Wang, J.; Yang, Y. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1191–1200. [Google Scholar]
Wang, J.; Chan, K.C.; Loy, C.C. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2555–2563. [Google Scholar]
Zhao, J.; Ma, Y.; Chen, F.; Shang, E.; Yao, W.; Zhang, S.; Yang, J. SA-GAN: A second order attention generator adversarial network with region aware strategy for real satellite images super resolution reconstruction. Remote Sens. 2023, 15, 1391. [Google Scholar] [CrossRef]
Mao, Y.; He, G.; Wang, G.; Yin, R.; Peng, Y.; Guan, B. DESAT: A Distance-Enhanced Strip Attention Transformer for Remote Sensing Image Super-Resolution. Remote Sens. 2024, 16, 4251. [Google Scholar] [CrossRef]
Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv 2016, arXiv:1611.06440. [Google Scholar]

Figure 1. Comparison example between super-resolution results and HR in real remote sensing scenes: (a) reconstruction results of current mainstream super-resolution methods; (b) corresponding HR. The red box marks a representative building.

Figure 2. Structure-preserving remote sensing image super-resolution framework. The method consists of three core components, a structure encoder (SE), a structure guidance module (SGM), and a structure-guided residual cross-Transformer (SGCT), and includes both training and inference stages. The two dashed lines separate the inputs and outputs from the model modules, and the dashed box indicates that the Residual Dense Groups (DBRDGs) are composed of three cascaded DBRDG blocks.

Figure 3. Structure of the spatial attention module. Channel-wise average pooling and max pooling are applied in parallel, concatenated along the channel dimension, and passed through a

7 \times 7

convolution followed by a Sigmoid function to generate the spatial attention map

F_{r c}

for emphasizing edge- and contour-dominant regions.

Figure 3. Structure of the spatial attention module. Channel-wise average pooling and max pooling are applied in parallel, concatenated along the channel dimension, and passed through a

7 \times 7

convolution followed by a Sigmoid function to generate the spatial attention map

F_{r c}

for emphasizing edge- and contour-dominant regions.

Figure 4. Structure of the residual blocks. Two 3 × 3 convolution layers with batch normalization and ReLU activation are applied in the main path, and a skip connection adds the input to the block output to facilitate stable optimization and feature reuse.

Figure 5. Structure of the MSConv (multi-scale convolution) module. Feature channels are split into parallel multi-scale depthwise branches, capturing structural cues at different receptive fields; the branch outputs are concatenated, fused with a residual connection, and reorganized to reinforce structural information.

Figure 6. Structure of the proposed DBRDG. Dense feature aggregation is realized by cascaded STL–Conv–LeakyReLU units with progressive feature concatenation, while a group-level skip connection forwards the input to stabilize deep propagation and preserve structural cues.

Figure 7. Sample HR–LR image pairs from the dataset.

Figure 8. Multi-metric normalized radar chart comparison. The red contour represents the proposed SRCT model. The axes show normalized scores rather than raw metric values. Metrics where lower values indicate better quality are inverted during normalization, so larger values consistently indicate better performance.

Figure 9. Visual quality comparison results. Visual comparison of SRCT with other models on GFDM (×4). The red box highlights the emphasized region.

Figure 10. Impact of parameter count and computational complexity on SRCT model performance on the test set. (a) Relationship between parameter count and SSIM; (b) relationship between parameter count and PSNR, where bubble size represents FLOPs.

Table 1. Spatial resolution of different Gao Fen satellites.

Satellite	PAN Spatial Resolution/m	Multispectral Spatial Resolution/m
GF Multi-Mode	0.5	2
GF-1	2	8
GF-6	2	8

Table 2. Experimental settings.

Parameter	Value
Batch size	4
Training patch size	$256 \times 256$
Iterations	300,000
Learning rate	$1 \times 10^{- 4}$
Optimizer	Adam, $β_{1} = 0.9, β_{2} = 0.99$
Data augmentation	Random rotation, horizontal flip
GPU	NVIDIA RTX 3090

Table 3. Image quality assessment metrics and descriptions.

Metrics	Formula	Description
PSNR	$PSNR (I_{GT}, I_{SR}) = 10 \cdot {log}_{10} (\frac{255^{2}}{\frac{1}{M N} \sum_{i = 1}^{M} \sum_{j = 1}^{N} {(I_{GT} (i, j) - I_{SR} (i, j))}^{2}})$	A higher PSNR value indicates that the difference in pixel values between the reconstructed image and the original image is smaller.
SSIM	$SSIM (I_{GT}, I_{SR}) = \frac{(2 μ_{GT} μ_{SR} + C_{1}) (2 σ_{GT SR} + C_{2})}{(μ_{GT}^{2} + μ_{SR}^{2} + C_{1}) (σ_{GT}^{2} + σ_{SR}^{2} + C_{2})}$	A higher SSIM value shows that the two images are more similar in terms of structural information.
LPIPS	$LPIPS (I_{GT}, I_{SR}) = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h, w} {∥ w_{l} ⊙ (F_{l} {(I_{GT})}_{h, w} - F_{l} {(I_{SR})}_{h, w}) ∥}_{2}^{2}$	A lower LPIPS value means that the two images are more similar in terms of human-perceived visual quality.
DISTS	$DISTS (I_{GT}, I_{SR}) = \sum_{i} α_{i} \cdot (1 - S_{i}) + β_{i} \cdot D_{i}$	A higher DISTS value represents a higher similarity between the images in terms of structure- and texture-related dimensions.
NIQE	$NIQE (I_{SR}) = \sqrt{{(μ_{SR} - μ_{model})}^{T} {(Σ_{SR} + Σ_{model})}^{- 1} (μ_{SR} - μ_{model})}$	A lower NIQE value indicates that the statistical properties of the image are closer to the statistical laws of typical natural images.
FID	$FID (I_{GT}, I_{SR}) = {∥ μ_{GT} - μ_{SR} ∥}^{2} + Tr (Σ_{GT} + Σ_{SR} - 2 {(Σ_{GT} Σ_{SR})}^{1 / 2})$	A lower FID means the features of the generated/processed image are more similar to the real/reference image.
MUSIQ	$MUSIQ (I_{SR}) = f_{θ} ([P_{1} (I_{SR}), P_{2} (I_{SR}), \dots, P_{n} (I_{SR})])$	A higher MUSIQ score shows that the image has fewer visual flaws like artifacts and clearer details.
MANIQA	$MANIQA (I_{SR}) = g_{ϕ} (A (E (I_{SR})))$	A higher MANIQA score means the image has a more coherent structure, richer details, and fewer distortions.
CLIPIQA	$CLIPIQA (I_{SR}) = {CLIP}_{high} (I_{SR}) - {CLIP}_{low} (I_{SR})$	A higher CLIPIQA score shows that the image has more qualities that match what humans consider a “high-quality image”.

Table 4. Performance evaluation results of different models on the test set. Red indicates the best values, blue indicates the second-best values, ↑ indicates higher is better, and ↓ indicates lower is better.

Type	Method	PSNR ↑	SSIM ↑	LPIPS ↓	DISTS ↓	FID ↓	NIQE ↓	CLIPIQA ↑	MUSIQ ↑	MANIQA ↑
CNN-based	EDSR	20.809	0.468	0.2721	0.503	96.402	7.208	0.1857	29.492	0.145
CNN-based	HAUNet	22.746	0.5534	0.2598	0.469	86.810	6.963	0.2629	36.915	0.1682
GAN-based	SPSR	21.0137	0.4926	0.267	0.4924	102.810	6.743	0.2915	35.665	0.171
GAN-based	WGSR	23.974	0.6402	0.3097	0.1959	84.097	6.432	0.322	42.891	0.1873
Mamba-based	FreMamba	22.994	0.5987	0.3074	0.198	98.523	5.9749	0.2681	40.618	0.1799
Transformer-based	ACT-SR	24.835	0.6651	0.2768	0.4094	96.471	8.454	0.2349	36.196	0.1665
	CATANET	24.734	0.6622	0.5313	0.3021	94.945	8.285	0.159	25.092	0.155
	Ours	24.821	0.6516	0.2594	0.1625	77.301	5.833	0.3776	42.725	0.2027

Table 5. Ablation study results for key modules of SRCT. Red indicates the best values; ↑ indicates higher is better, ↓ indicates lower is better, 🗸 indicates the module is included, and × indicates the module is removed.

Base	DBRDG	SE	SGM	PSNR ↑	SSIM ↑	LPIPS ↓	DISTS ↓	MANIQA ↑
🗸	×	×	×	23.651	0.6143	0.2748	0.206	0.1524
🗸	🗸	×	×	24.2235	0.6312	0.2646	0.2123	0.1654
🗸	🗸	🗸	×	24.8915	0.6423	0.2745	0.1732	0.1956
🗸	🗸	🗸	🗸	24.821	0.6516	0.2594	0.1625	0.2027

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gao, T.; Zhang, S.; Yao, W.; Shang, E.; Yang, J.; Ma, Y.; Ma, Y. SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution. Sensors 2026, 26, 733. https://doi.org/10.3390/s26020733

AMA Style

Gao T, Zhang S, Yao W, Shang E, Yang J, Ma Y, Ma Y. SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution. Sensors. 2026; 26(2):733. https://doi.org/10.3390/s26020733

Chicago/Turabian Style

Gao, Tianxiong, Shuyan Zhang, Wutao Yao, Erping Shang, Jin Yang, Yong Ma, and Yan Ma. 2026. "SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution" Sensors 26, no. 2: 733. https://doi.org/10.3390/s26020733

APA Style

Gao, T., Zhang, S., Yao, W., Shang, E., Yang, J., Ma, Y., & Ma, Y. (2026). SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution. Sensors, 26(2), 733. https://doi.org/10.3390/s26020733

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SRCT: Structure-Preserving Method for Sub-Meter Remote Sensing Image Super-Resolution

Abstract

1. Introduction

2. Methods

2.1. Overview

2.2. Structure Encoder (SE)

2.3. Structure Guidance Module (SGM)

2.4. Super-Resolution Reconstruction Network SGCT

3. Experiment and Results

3.1. Dataset

3.2. Experimental Details

3.3. Evaluation Metrics

3.4. Comparison with Other Models

3.5. Ablation Study

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI