A Multi-Scale Windowed Spatial and Channel Attention Network for High-Fidelity Remote Sensing Image Super-Resolution

Xiao, Xiao; Xiang, Xufeng; Wang, Jianqiang; Wang, Liwen; Gao, Xingzhi; Chen, Yang; Liu, Jun; He, Peng; Han, Junhui; Li, Zhiqiang

doi:10.3390/rs17213653

Open AccessArticle

A Multi-Scale Windowed Spatial and Channel Attention Network for High-Fidelity Remote Sensing Image Super-Resolution

by

Xiao Xiao

^1,2,3,

Xufeng Xiang

^1,*

,

Jianqiang Wang

^2,3,

Liwen Wang

⁴

,

Xingzhi Gao

^2,3,

Yang Chen

⁵,

Jun Liu

⁶,

Peng He

⁶,

Junhui Han

⁶ and

Zhiqiang Li

⁷

¹

School of Tele-Communications Engineering, Xidian University, Xi’an 710071, China

²

CCCC Civil Engineering Science & Technology Co., Ltd., Xi’an 710075, China

³

CCCC First Highway Consultants Co., Ltd., Xi’an 710075, China

⁴

Guangzhou Institute of Technology, Xidian University, Xi’an 710071, China

⁵

Public Development Department, Shanghai Huawei Technologies Co., Ltd., Shanghai 201206, China

⁶

The 41st Research Institute of China Electronics Technology Group Corporation, Qingdao 266555, China

⁷

Bejing Electro-Mechanical Engineering Institute, Beijing 100074, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(21), 3653; https://doi.org/10.3390/rs17213653 (registering DOI)

Submission received: 23 September 2025 / Revised: 1 November 2025 / Accepted: 5 November 2025 / Published: 6 November 2025

(This article belongs to the Special Issue Artificial Intelligence for Optical Remote Sensing Image Processing)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

We propose MSWSCAN, a multi-scale windowed spatial–channel attention network that better preserves high-frequency textures and structures.
It delivers strong results across ×2/×3/×4 scales on standard benchmarks (WHU-RS19, RSSCN7, UCMerced_LandUse), with reduced artefacts.

What are the implication of the main findings?

Clearer, sharper reconstructions can improve downstream tasks (e.g., building extraction, urban planning, environmental monitoring).
The attention and multi-scale design are general and reusable for other RS super-resolution models.

Abstract

Remote sensing image super-resolution (SR) plays a crucial role in enhancing the quality and resolution of satellite and aerial imagery, which is essential for various applications, including environmental monitoring and urban planning. While recent image super-resolution networks have achieved strong results, remote-sensing images present domain-specific challenges—complex spatial distribution, large cross-scale variations, and dynamic topographic effects—that can destabilize multi-scale fusion and limit the direct applicability of generic SR models. These features make it difficult for single-scale feature extraction methods to fully capture the complex structure, leading to the presence of artifacts and structural distortion in the reconstructed remote sensing images. Therefore, new methods are needed to overcome these challenges and improve the accuracy and detail fidelity of remote sensing image super-resolution reconstruction. This paper proposes a novel Multi-scale Windowed Spatial and Channel Attention Network (MSWSCAN) for high-fidelity remote sensing image super-resolution. The proposed method combines multi-scale feature extraction, window-based spatial attention, and channel attention mechanisms to effectively capture both global and local image features while addressing the challenges of fine details and structural distortion. The network is evaluated on several benchmark datasets, including WHU-RS19, UCMerced and RSSCN7, where it demonstrates superior performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) compared to state-of-the-art methods. The results show that the MSWSCAN not only enhances texture details and edge sharpness but also reduces reconstruction artifacts. To address cross-scale variations and dynamic topographic effects that cause texture drift in multi-scale SR, we combine windowed spatial attention to preserve local geometry with a channel-aware fusion layer (FFL) that reweights multi-scale channels. This stabilizes cross-scale aggregation at a runtime comparable to DAT and yields sharper details on heterogeneous land covers. Averaged over WHU–RS19, RSSCN7, and UCMerced_LandUse at ×2/×3/×4, MSWSCAN improves PSNR (peak signal-to-noise ratio, dB)/SSIM (structural similarity index measure, 0–1) by +0.10 dB/+0.0038 over SwinIR and by +0.05 dB/+0.0017 over DAT. In conclusion, the proposed MSWSCAN achieves state-of-the-art performance in remote sensing image SR, offering a promising solution for high-quality image enhancement in remote sensing applications.

Keywords:

remote sensing; image super-resolution; feature extraction; high-fidelity reconstruction

1. Introduction

Image super-resolution (SR) reconstruction improves low-resolution images, offering higher-quality data for tasks such as building extraction and environmental monitoring. This technique also reduces hardware costs by generating high-resolution images from low-cost equipment. SR technology thus provides significant support for various fields and has considerable research value and practical potential.

Traditional SR methods, including interpolation-based and reconstruction-based methods, each have their strengths and limitations. Interpolation-based methods, such as nearest-neighbor, bilinear, and bicubic interpolation [1,2,3], improve image resolution by adding pixel information. However, these methods often struggle to accurately recover complex textures, leading to artifacts and blurriness. Reconstruction-based methods, like those proposed by Tsai [4] and others, use prior constraints to restore details in both frequency and spatial domains, but they are computationally complex.

In recent years, deep learning-based super-resolution (SR) techniques have made significant progress, effectively addressing issues such as blurriness and missing texture details in traditional methods. Fixed-scale algorithms are one of the important and effective approaches. Fixed-scale algorithms improve the resolution and quality of images by training separate models for each upsampling factor, allowing for better capture of image details and structures. This method trains multiple models for different upsampling factors, and performs exceptionally well when handling super-resolution tasks for specific scales.

Remote-sensing SR faces strong cross-scale variations and dynamic topographic effects (viewpoint/slope), which misalign multi-scale features and cause texture drift at land-cover boundaries. Prior multi-scale SR (pyramid/dilated/concat) and Transformer SR (e.g., SwinIR, DAT) improve detail and context, yet can blur repetitive façades or break shoreline continuity due to unstable cross-scale fusion or higher cost.

We constrain interactions with windowed spatial attention to preserve local geometry, and add a lightweight channel-aware fusion layer (FFL) that reweights multi-scale channels before aggregation. This stabilizes cross-scale fusion with a runtime comparable to DAT (please refer to Section 4 and Table in Section 4), yielding sharper details on heterogeneous land-covers.

Our study tackles image super-resolution of raw remote-sensing imagery, i.e., reconstructing continuous radiometric content and evaluating with PSNR/SSIM. In contrast, prior works on resolution enhancement of interpreted outputs (label/map SR) refine discrete semantic products (e.g., land-cover maps), often under weak or inexact supervision and assessed by semantic metrics such as mIoU/F1; representative examples include Label Super-Resolution Networks [5], the low-to-high land-cover mapping approach [6], and Learning without Exact Guidance [7]. Our work belongs to the former: MSWSCAN stabilizes cross-scale fusion in raw imagery to preserve textures and boundaries, while these two directions are complementary.

This paper addresses the challenges of artifacts and structural distortions in remote sensing image super-resolution (SR) reconstruction by proposing a Multi-scale Windowed Spatial and Channel Attention Network. The main contributions of this work are as follows:

Multi-scale Windowed Spatial and Channel Attention Network: We design a multi-scale module that uses convolutions with different kernel sizes to extract features with diverse receptive fields, and aggregate the multi-scale features to enhance the ability of model to capture high-frequency information.
Windowed Spatial and Channel Attention Modules: By alternating the windowed spatial attention module and the channel attention module, the model dynamically re-weights both spatial and channel features within the module, improving its ability to learn fine-grained details and high-frequency features, thus reducing artifacts and structural distortions.
Feature Fusion Layer: A feature fusion layer is introduced before the upsampling reconstruction module, which integrates features from different depths to further enhance the expression ability of deep features, improving the quality of the reconstructed images.
Experimental Validation: Extensive experiments on the WHU-RS19, RSSCN7, and UCMerced_LandUse datasets demonstrate that the proposed Multi-scale Windowed Spatial and Channel Attention Network effectively enhances texture details in remote sensing images, alleviates artifacts and structural distortions, and outperforms existing methods in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM).

2. Related Work

2.1. Traditional Methods

Traditional image super-resolution (SR) methods mainly include interpolation-based and reconstruction-based techniques. Interpolation-based SR methods restore high-resolution images by adding pixel information to low-resolution ones, relying on linear fitting using surrounding low-resolution pixels. Common methods include nearest-neighbor interpolation [1], bilinear interpolation [8], and bicubic interpolation [3], each differing in precision and efficiency. Nearest-neighbor interpolation is the simplest and fastest but results in poor image details and jagged edges. Bilinear interpolation improves smoothing but still struggles with high-frequency detail recovery, leading to blurriness. Bicubic interpolation increases accuracy but has higher computational complexity, particularly when processing large images. Despite enhancing resolution, these methods fail to recover complex textures and details, often producing artifacts, jaggies, and blurriness, especially in high-frequency regions.

Reconstruction-based methods aim to restore more details using prior constraints and visual features, typically divided into frequency-domain and spatial-domain approaches. Tsai [4] first proposed frequency-domain methods in 1984, applying Fourier transforms to low-resolution and high-resolution images to establish a linear mapping for reconstruction. However, this method often results in blurriness and noise. Kim [9], Rhee [10], and Nguyen [11] improved this approach by eliminating noise and spatial blur. Spatial-domain methods, such as Stark’s [12] convex set projection, model optical and motion blur, offering better practical value but with higher computational complexity. Irani [8] introduced iterative back-projection, which iteratively reduces errors between high-resolution and SR images to improve reconstruction quality. Schultz [13] proposed a maximum posterior probability estimation method, and Elad [14] combined this with convex set projection. Haris [15] introduced deep back-projection networks (DBPN), which solve the dependency between low-resolution and high-resolution images by connecting upsampling and downsampling layers and improving final image quality through error feedback. Although reconstruction-based methods enhance precision by incorporating more data and effective prior information, they are computationally complex compared to interpolation-based methods, requiring significantly more resources.

2.2. Deep Learning-Based Methods

In recent years, deep learning-based super-resolution (SR) methods have made significant advancements, effectively addressing issues such as image blurriness and loss of texture details found in traditional algorithms. Dong et al. [16] introduced SRCNN, the first deep learning-based SR method, which performs feature extraction, nonlinear mapping, and upsampling using convolutions. However, it suffers from long reconstruction times and limited context utilization. FSRCNN [17] improves SRCNN by using smaller convolution kernels and deeper network layers, significantly reducing reconstruction time. Shi [18] proposed ESPCN, which replaces deconvolution with subpixel convolution layers, scaling the channel dimension by

r^{2}

(upsampling factor r) through three convolution layers and rearranging the feature map to

r H \times r W \times C

, accelerating the upsampling process by an order of magnitude compared to deconvolution.

He [19] introduced ResNet, which utilizes residual connections to accelerate convergence and focus on high-frequency information. Kim [20] proposed VDSR, which integrates ResNet into SR to improve high-frequency learning, avoid gradient explosion, and reduce parameters. Zhang [21] introduced RDN, using residual dense blocks to improve inter-layer information exchange. Qiu et al. [22] proposed EBRN, using shallow layers for low-frequency recovery and deep layers for high-frequency recovery. Liu [23] introduced RFAN, connecting shallow residual outputs to deeper layers for better information utilization. Despite strong fidelity on homogeneous textures, classical CNN-based SR (e.g., EDSR [24], RCAN [25], RDN [21], MSRN [26]) often struggles with cross-scale alignment under heterogeneous land covers. Around building–vegetation or road–grass boundaries, the mismatch manifests as texture drift and edge softening, indicating that channel re-weighting and pyramid-style fusion alone are insufficient for robust multi-scale aggregation.

To better capture features across different scales, another line of research focused on multi-scale (MS) architectures. For instance, Lai [27] proposed LapSRN, which uses a progressive multi-scale reconstruction based on Laplacian pyramids. Li [26] introduced MSRN, which explicitly extracts features at different scales using parallel convolution kernels of varying sizes within its residual blocks. However, these methods often employ simple feature concatenation or summation, which may not optimally fuse information from different scales and can struggle with the vast scale variations present in remote sensing imagery.

Simultaneously, attention mechanisms were introduced to help CNNs focus on more informative features. Zhang [25] proposed RCAN, which used channel attention (CA) to adaptively re-weight channel features, achieving significant performance gains. This concept was extended by SAN [28], which utilized second-order channel attention, and HAN [29], which added inter-layer attention. Others, like CRAN [30], explored context reasoning attention to balance performance and complexity. However, most of these methods focus heavily on channel-wise attention, often overlooking the importance of fine-grained spatial feature modulation. This can limit their ability to reconstruct sharp edges and complex textures, especially in heterogeneous remote sensing scenes where local spatial geometry is crucial.

More recently, Transformer-based models have been adapted for SR due to their powerful long-range dependency modeling. Liang et al. [31] proposed SwinIR, which applied window-based self-attention to image restoration. Following this, DAT [32] introduced dual aggregation transformers, using alternating spatial and channel self-attention blocks. Transformer-based SR (e.g., SwinIR [31], DAT [32]) benefits from global or windowed self-attention to capture long-range context. However, global modeling and block-wise fusion raise computational cost and may yield unstable aggregation on heterogeneous scenes, sometimes blurring repetitive façades or breaking shoreline continuity. In particular, alternating spatial–channel attention can propagate misaligned cues across scales when local geometry is weakly expressed.

Other architectural explorations include U-Net structures like DRN [33] for bidirectional supervision, encoder-decoder networks like RED-Net [34], recursive layers in DRCN [20] to reduce parameters, and persistent memory blocks in MemNet [35].

These observations motivate our design as follows: We retain the efficiency of windowed spatial attention to preserve local geometry and introduce a channel-aware fusion layer (FFL) that explicitly re-weights multi-scale channels, aiming to reduce cross-scale drift at moderate cost. We provide head-to-head comparisons against SwinIR and DAT on RSSCN7, WHU-RS19 and UCMerced, showing consistent gains and explaining why the improvements occur.

3. Materials and Methods

3.1. Network Architecture Design

The Multi-scale Windowed Spatial and Channel Attention Network (MSWSCAN) proposed in this section follows a similar overall architecture design to other networks, primarily consisting of three modules: shallow feature extraction, deep feature extraction, and upsampling reconstruction. The following sections will introduce these three modules in detail.

3.1.1. Shallow Feature Extraction Module

This module works similarly to other image super-resolution models. Specifically, it takes the low-resolution (LR) image as input to extract shallow features. The feature extraction process uses a kernel size of

3 \times 3

with a stride of 1, and the input image is passed through a convolutional layer that transforms the 3D feature map into a 64-dimensional 2D feature map. The shallow feature is denoted as

F_{0}

, and the mathematical expression of shallow feature extraction is represented as

F_{0} = H_{S F} (I_{L})

(1)

3.1.2. Deep Feature Extraction Module

As shown in Figure 1, the deep feature extraction module consists of a Multi-scale Windowed Spatial and Channel Attention Group (MSWSCAG) and a feature fusion layer (FFL). The MSWSCAG module aims to capture the most essential high-frequency information in the input image, which is then passed through multiple MSWSCAG modules for further refinement. The deep feature extraction module can be expressed as

F = H_{M S W S C A G} (F_{0})

(2)

where

H_{M S W S C A G} (\cdot)

represents the operation of the first MSWSCAG module. The complete mathematical expression for the entire deep feature extraction process is

F_{i} = H_{M S W S C A G, i} (F_{i - 1}) = H_{M S W S C A G, i} (H_{M S W S C A G, i - 1} (\dots (H_{M S W S C A G, 1} (F_{0})) \dots))

(3)

F = H_{F F L} (F_{0}, F_{1}, \dots, F_{i})

(4)

Here,

F_{i}

represents the features after the i-th MSWSCAG module. As shown in Equation (3), F aggregates the features from different layers, where

F_{0}, F_{1}, \dots, F_{i}

are features from each layer. The

H_{F F L} (\cdot)

function performs feature fusion across these layers.

3.1.3. Feature Fusion Layer (FFL)

In MSWSCAN, the feature fusion layer aggregates shallow detail features and multi-depth semantic features into a single reconstruction input. Let

F_{0} \in R^{C \times H \times W}

denote the shallow feature and

{F_{i} \in R^{C \times H \times W}}_{i = 1}^{L}

the outputs of the L MSWSCAG stages. We first concatenate them along the channel dimension

C = Concat (F_{0}, F_{1}, \dots, F_{L}) \in R^{(L + 1) C \times H \times W}

(5)

and then apply a 1 × 1 convolution to produce a channel-aware projection

\hat{F} = {Conv}_{1 \times 1} (C; W_{b}, b_{b}) \in R^{C \times H \times W}

(6)

where

(W_{b}, b_{b})

are learnable parameters. This 1 × 1 mapping acts as a learnable channel mixer that selects and reweights the concatenated multi-scale channels, suppressing misaligned responses while enhancing consistent cues, so that shallow high-frequency details are preserved instead of being over-smoothed by deeper semantics. Placed at the fusion tail of the backbone, FFL ensures that both shallow and deep features directly feed the reconstruction branch,

F_{rec} = {Conv}_{3 \times 3} (\hat{F}) \overset{PixelShuffle}{\to} SR output

(7)

which improves boundary continuity and repetitive textures in heterogeneous land-cover regions. FFL complements the windowed spatial attention used in the backbone: WSA maintains local geometric consistency in the spatial domain, whereas FFL addresses channel-level alignment during multi-scale fusion.

3.1.4. Upsampling and Reconstruction

Given the fused feature

\hat{F} \in R^{C \times H \times W}

from the FFL, the reconstruction head generates an HR image at scale

r \in {2, 3, 4}

. We first refine

\hat{F}

with a 3 × 3 convolution and then produce

r^{2} C_{out}

channels that a sub-pixel operator (PixelShuffle) rearranges onto the HR lattice,

F_{u} = {Conv}_{3 \times 3} (\hat{F}) .

(8)

Y_{pre} = {PS}_{r} ({Conv}_{3 \times 3}^{r^{2} C_{out}} (F_{u})) \in R^{C_{out} \times (r H) \times (r W)} .

(9)

A final 3 × 3 convolution map

Y_{pre}

to the output space mitigates minor ringing,

\hat{Y} = {Conv}_{3 \times 3} (Y_{pre}),

(10)

where

C_{out}

denotes the number of output image channels. The same head is shared across scales by setting the PixelShuffle factor r accordingly; no progressive cascades are used.

3.2. Multi-Scale Windowed Spatial and Channel Attention Group (MSWSCAG) Module

The MSWSCAG module is the core component of the entire network, and its structure is shown in Figure 2.

The MSWSCAG module mainly consists of the Multi-scale (MS) module, the LayerNorm (LN) layer, the Windowed Spatial and Channel Attention Group (WSCAG) module, and two 3 × 3 convolution layers. The MS module extracts and fuses multi-scale feature information. The first 3 × 3 convolution after the MS module preserves the feature dimensionality and prepares the input for WSACG; after the WSACG stack, another 3 × 3 convolution mixes features and closes the residual connection. The LayerNorm (LN) layer is mainly used to normalize the input features before the next MSWSCAG module operation, avoiding problems like exploding and vanishing gradients.

The MSWSCAG module is essential for capturing both the spatial and channel-wise attention, ensuring that the model focuses more on the crucial feature information. The second convolutional layer after the MSWSCAG module’s output is used for further processing, and the final feature is passed through another MSWSCAG module. The MSWSCAG module consists of the following components:

One MS module.
One LayerNorm (LN) layer.
One WSCAG module.
Two convolutional layers.

The MSWSCAG module is defined by the following equations:

F_{i, 1} = H_{M S} (F_{0}) + F_{0}

(11)

F_{i, 2} = H_{C o m} (F_{i, 1})

(12)

F_{i, 3} = H_{L N} (F_{i, 2})

(13)

F_{i, 4} = H_{W S C A G} (F_{i, 3})

(14)

F_{i} = H_{C o m} (F_{i, 4}) + F_{i, 1}

(15)

In the equations above,

F_{i, 1}

represents the feature after passing through the MS module in the MSWSCAG module,

H_{M S} (\cdot)

represents the operation of the Multi-scale module,

F_{i, 2}

represents the feature after passing through the first convolutional layer in the MSWSCAG module,

H_{C o m} (\cdot)

represents the convolution operation,

F_{i, 3}

represents the feature after passing through the LayerNorm (LN) layer in the MSWSCAG module,

H_{L N} (\cdot)

represents the LayerNorm operation,

F_{i, 4}

represents the feature after passing through the WSCAG module in the MSWSCAG module,

H_{W S C A G} (\cdot)

represents the operation of the Windowed Spatial and Channel Attention Group module,

F_{i}

represents the output feature of the MSWSCAG module.

In order to capture rich image feature information, we designed a multi-scale module, as shown in Figure 3.

Considering the complex and fine-grained features of the sensed image, which have varying sizes, we use different kernel sizes for feature extraction. Specifically, in the multi-scale module, we use convolution kernels of sizes

3 \times 3

,

5 \times 5

, and

7 \times 7

to extract features. These extracted features are then processed by three different convolution operations. Next, we use the

concat

operation to merge the features extracted at different scales into one combined feature. Afterward, we use the

concat

operation to merge the feature maps along the depth dimension to improve the multi-scale feature fusion. The resulting feature is then passed through a final

3 \times 3

convolution to complete the multi-scale feature fusion. The operation in the multi-scale module is illustrated in Figure 3.

Given

X \in R^{C \times H \times W}

, we extract features with three parallel convolutions of kernel sizes 3 × 3, 5 × 5 and 7 × 7 to cover different receptive fields. Let

B_{3} = {Conv}_{3 \times 3}^{C} (X)

,

B_{5} = {Conv}_{5 \times 5}^{C} (X)

,

B_{7} = {Conv}_{7 \times 7}^{C} (X)

. We concatenate the three responses along the channel dimension and fuse them with a final 3 × 3 convolution

C_{ms} = Concat (B_{3}, B_{5}, B_{7}), F_{ms} = {Conv}_{3 \times 3}^{C} (C_{ms}) .

The MS output is then fed to the subsequent WSACG stack within MSWSCAG.

3.3. Windowed Spatial and Channel Attention Group (WSCAG) Module

In this section, we propose the Windowed Spatial and Channel Attention Group (WSCAG) module, which is primarily designed to enhance the model’s focus on important features while weakening the focus on less important ones. We use a self-attention mechanism to extract global features and convolutional operations to extract local features. Simultaneously, the aggregation of both global and local spatial channel features is strengthened from both intra-block and inter-block perspectives. This improves the ability of the model to learn high-frequency information and reconstruct remote sensing images with rich texture details.

The WSCAG module is composed of three cascaded WSCA modules. The WSCA module is illustrated in Figure 4. The Windowed Spatial Attention (WSA) module utilizes a self-attention mechanism to capture global spatial features. To introduce locality into the self-attention mechanism, we add a parallel convolution branch. The input features are processed through two parallel branches to extract features. The upper branch captures global spatial features via the WSA module, while the lower branch uses depthwise separable convolutions to extract local spatial features. However, simply adding a convolution branch does not effectively aggregate the global and local spatial features. Therefore, we use a convolutional layer and a Sigmoid activation function in the upper branch to extract spatial weights, while the lower branch uses global average pooling (GAP), convolution, and Sigmoid activation to extract channel weights. The two branches then interact and adaptively re-weight the spatial and channel features in both the spatial and channel dimensions, achieving the aggregation of global and local spatial channel features within the block.

Once the aggregated spatial features are obtained, channel features are extracted through two parallel branches. The upper branch captures global channel features using the Channel Attention (CA) module, while the lower branch extracts local features through depthwise separable convolutions. To better fuse the global and local channel features, the upper branch applies global average pooling, convolution, and Sigmoid to the features from the CA module to obtain re-weighted channel weights. The lower branch applies convolution and Sigmoid operations to the features from the depthwise separable convolution to obtain re-weighted spatial weights. The two branches then interact in the spatial and channel dimensions, adaptively re-weighting the features from both branches, thus achieving the aggregation of global and local spatial channel features within the block.

Additionally, the two modules described above are cascaded. By alternately cascading these two modules, inter-block spatial channel feature aggregation is realized, while each module also performs spatial channel feature aggregation within the block. Therefore, the proposed WSCA module achieves spatial channel feature aggregation both within and between blocks, enhancing the ability of the model to distinguish spatial channel features and strengthening the ability of the model to learn high-frequency information, which helps the model recover high-frequency details. Below, the computational processes for the WSCAG module, WSCA module, WSA module, and CA module are explained in detail.

3.3.1. WSCAG Module

The WSCAG module is composed of 3 cascaded WSCA modules. The mathematical expression for this is as follows:

f_{W S C A G} = H_{W S C A G} (f) = H_{W S C A, 3} (H_{W S C A, 2} (H_{W S C A, 1} (f)))

(16)

where f is the input feature,

f_{W S C A G}

is the feature after passing through the WSCAG module,

H_{W S C A G} (\cdot)

represents the operation of the WSCAG module.

3.3.2. WSCA Module

From Figure 4, the WSCA module consists of two cascaded blocks in the previous layers and uses parallel branches. The mathematical expression for this is as follows:

f_{w s a}^{1} = H_{W S A} (f)

(17)

f_{d w c}^{1} = H_{D W C} (f) = τ (H_{B N} (H_{d w c} (f)))

(18)

W_{s p a t i a l}^{1} = H_{C o n v} (δ (f_{w s a}^{1}))

(19)

W_{c h a n n e l}^{1} = H_{G A P} (H_{C o n v} (δ (f_{d w c}^{1})))

(20)

f_{1} = f_{w s a}^{1} ⊙ W_{c h a n n e l}^{1}

(21)

f_{2} = f_{d w c}^{1} ⊙ W_{s p a t i a l}^{1}

(22)

f_{W S A} = f_{1} \oplus f_{2}

(23)

f_{w c a}^{2} = H_{C A} (f_{W S A})

(24)

f_{d w c}^{2} = H_{D W C} (f) = τ (H_{B N} (H_{d w c} (f_{W S A})))

(25)

W_{s p a t i a l}^{2} = H_{C o n v} (δ (f_{d w c}^{2}))

(26)

W_{c h a n n e l}^{2} = H_{G A P} (H_{C o n v} (δ (f_{w c a}^{2})))

(27)

f_{3} = f_{w c a}^{2} ⊙ W_{s p a t i a l}^{2}

(28)

f_{4} = f_{d w c}^{2} ⊙ W_{c h a n n e l}^{2}

(29)

f_{W C A} = f_{3} \oplus f_{4}

(30)

In the equations,

H_{W S A} (\cdot)

represents the WSA module operation;

H_{C A} (\cdot)

represents the CA module operation;

H_{C o n v} (\cdot)

represents the convolution operation;

H_{d w c} (\cdot)

represents the depthwise separable convolution operation;

H_{B N} (\cdot)

represents the BatchNorm operation;

τ (\cdot)

represents the nonlinear activation function Gelu;

σ (\cdot)

represents the nonlinear activation function Sigmoid;

W_{s p a t i a l}^{i}

represents the spatial feature weighting for the i-th convolutional kernel;

W_{c h a n n e l}^{i}

represents the channel feature weighting for the i-th convolutional kernel; ⊕ represents element-wise addition.

3.3.3. WSA Module

From Figure 5, the WSA module first inputs the feature through a linear projection to generate the query, key, and value matrices. These matrices are then reshaped to form the query, key, and value for the self-attention mechanism.

The operation is as follows:

q u e r y = X W_{Q}^{S}

(31)

k e y = X W_{K}^{S}

(32)

v a l u e = X W_{V}^{S}

(33)

where X is the input feature,

W_{Q}^{S}

,

W_{K}^{S}

, and

W_{V}^{S}

represent the linear projections for query, key, and value, respectively.

Simultaneously, we employ the multi-head attention mechanism, splitting the feature into multiple heads, i.e., the query is split as

[Q_{s}^{1}, Q_{s}^{2}, \dots, Q_{s}^{h}]

, the key is split as

[K_{s}^{1}, K_{s}^{2}, \dots, K_{s}^{h}]

, and the value is split as

[V_{s}^{1}, V_{s}^{2}, \dots, V_{s}^{h}]

. These are then processed for each head’s respective attention mechanism, as shown below:

Y_{s}^{i} = softmax (\frac{(Q_{s}^{i}) {(K_{s}^{i})}^{T}}{\sqrt{d}} + D) V_{s}^{i}

(34)

In the equations,

Q_{s}^{i}

represents the query for the i-th head,

K_{s}^{i}

represents the key for the i-th head,

V_{s}^{i}

represents the value for the i-th head, d represents the dimension of the key vectors, D represents the position encoding, and

y_{s}^{i}

represents the output after the self-attention calculation for the i-th head.

Finally, we concatenate the self-attention outputs for all heads and pass them through a final linear projection as follows:

Y_{s} = Concat (Y_{s}^{1}, Y_{s}^{2}, \dots, Y_{s}^{h})

(35)

f_{w s a} = Y_{s} W_{p}^{S}

(36)

where

Concat (\cdot)

represents concatenation along the feature dimension, combining the self-attention outputs from all heads, and

W_{p}^{s}

is the learnable weight matrix of a final linear projection layer. This layer maps the concatenated output

Y_{s}

back to the original feature dimension to produce the final output

f_{w s a}

.

Through the above operations, spatial attention has been implemented within the window, which is equivalent to performing spatial feature differentiation locally. To further achieve global feature differentiation, we follow the sliding window attention mechanism proposed by the Swin Transformer. This mechanism involves sliding windows and interacting between different windows to capture global context information, thus realizing global spatial attention calculation. For example, each MSWSCAG module consists of three WSA modules. The second WSA module in the first MSWSCAG module, the first WSA module in the second MSWSCAG module, and the third WSA module, etc., all need to perform sliding window attention calculations. This allows information to interact across different windows, capturing global context information and realizing global spatial feature differentiation.

3.3.4. CA Module

From Figure 6, the input feature is first reshaped to

H W \times C

and then reshaped to

H \times W \times C

. This is followed by the linear projection to generate the query, key, and value matrices, which are denoted as

Q_{c}

,

K_{c}

, and

V_{c}

, respectively.

q u e r y = X W_{Q}^{C}

(37)

k e y = X W_{K}^{C}

(38)

v a l u e = X W_{V}^{C}

(39)

where X is the input feature after the reshape operation,

W_{Q}^{C}

,

W_{K}^{C}

, and

W_{V}^{C}

represent the linear projections for query, key, and value, respectively.

Similarly to the WSA module, we use the multi-head attention mechanism, splitting the feature into multiple heads, i.e., the query is split as

[Q_{c}^{1}, Q_{c}^{2}, \dots, Q_{c}^{h}]

, the key is split as

[K_{c}^{1}, K_{c}^{2}, \dots, K_{c}^{h}]

, and the value is split as

[V_{c}^{1}, V_{c}^{2}, \dots, V_{c}^{h}]

. We then compute the attention for each head as follows:

Y_{c}^{i} = softmax (\frac{(Q_{s}^{i}) {(K_{s}^{i})}^{T}}{α}) V_{c}^{i}

(40)

where

Q_{c}^{i}

represents the query for the i-th head,

K_{c}^{i}

represents the key for the i-th head,

V_{c}^{i}

represents the value for the i-th head,

α

is a learnable scaling factor for the softmax operation.

Finally, we concatenate the self-attention outputs from all heads and perform another self-attention operation to integrate the multi-head attention outputs:

Y_{c} = Concat (Y_{c}^{1}, Y_{c}^{2}, \dots, Y_{c}^{h})

(41)

f_{w c a} = Y_{c} W_{p}^{C}

(42)

In the above equations,

Concat (\cdot)

represents concatenation along the feature dimension, merging the self-attention outputs from all heads.

W_{p}^{C}

represents the learned weight matrix for the final output of the CA module.

By performing global self-attention calculation as described above, the model can capture global context information, enabling it to achieve global spatial feature differentiation. This helps the model obtain a more comprehensive understanding of the input data and effectively learn important global patterns across the entire input.

4. Results

In order to verify the performance of our multi-scale window spatial channel attention network in remote sensing image reconstruction, We compared it with the existing image SR network in WHU-RS19, RSSCN7 and UCMerced_LandUse Remote sensing data sets.

4.1. Experimental Setup

4.1.1. Data Setup

We performed experiments on three remote sensing datasets: WHU-RS19, RSSCN7, and UCMerced_LandUse. For these experiments, we generated the LR images by applying bicubic downsampling to the high-resolution (HR) images for three scaling factors:

\times 2

,

\times 3

, and

\times 4

. The datasets used for training are WHU-RS19, RSSCN7, and UCMerced_LandUse, while the datasets for validation are AID. Data augmentation was performed by randomly rotating the images by 90°, 180°, and 270°. The training data was expanded with such rotations, and the validation dataset was kept intact. For evaluation, we performed measurements using the PSNR and SSIM as evaluation metrics on the Y channel in the YCbCr color space.

4.1.2. Model Configuration

The deep feature extraction module is designed with a hierarchical architecture consisting of six MSWSCAG modules. Within each MSWSCAG module, the WSCAG unit is further decomposed into three WSCA modules, thereby enabling multi-level feature representation. In both the WSA and CA modules, the number of attention heads is fixed at six, while the channel dimension is set to 180. Moreover, a window size of 6 × 24 is adopted to enhance the capacity of local context modeling.

4.1.3. Training Configuration

The model was trained on images of size

48 \times 48

with a batch size of 32. Adam optimizer was used with parameters

β_{1} = 0.9, β_{2} = 0.999, ϵ = 10^{- 8}

. The initial learning rate was set to

2 \times 10^{- 4}

, and the learning rate was halved every 100 epochs. The model was trained for a total of 300 epochs, and the L1 loss function was used during training. We performed the training on a system with Windows 10 and Python version 3.9, using Pytorch version 2.1.1 and CUDA version 12.1. The training was accelerated using an NVIDIA GeForce RTX 5090 GPU.

4.2. Analysis of Experimental Results

This section demonstrates the effectiveness of the MSWSCAN in remote sensing image reconstruction. We conducted experiments on the SR datasets WHU-RS19, RSSCN7, and UCMerced_LandUse. The experiments were performed under different upsampling rates of

\times 2

,

\times 3

, and

\times 4

, and compared with MSWSCAN. We compared our results with those of methods such as Bicubic [3], EDSR [24], RCAN [25], MSRN [26], RDN [21], SwinIR [31], and DAT [32].We report standard distortion metrics: PSNR (dB) and SSIM (structural similarity index measure, 0–1; higher is better).

From Table 1, the MSWSCAN achieves the best performance in the RSSCN7 dataset for PSNR and SSIM at upsampling rates of

\times 2

,

\times 3

and

\times 4

. Specifically, the results using bold and underline text highlight the best and second-best performance for these respective metrics.

We can also see that our MSWSCAN outperforms the DAT network in terms of PSNR and SSIM at

\times 2

,

\times 3

, and

\times 4

upsampling rates. At

\times 2

upsampling, MSWSCAN improves the PSNR and SSIM by 0.05 dB and 0.0011, respectively, compared to DAT. At

\times 3

upsampling, the tMSWSCAN improves PSNR and SSIM by 0.05 dB and 0.0018, respectively. At

\times 4

upsampling, MSWSCAN improves PSNR and SSIM by 0.04 dB and 0.0021, respectively.

Next, this section will demonstrate the reconstruction results of various networks on the RSSCN7 dataset from a visualization perspective.

As shown in Figure 7, Figure 8 and Figure 9, the super-resolved images reconstructed by various networks exhibit high similarity to the original HR images, demonstrating their fundamental capability to recover primary structures. In the complex solar panel region of Figure 7, while most methods achieve reasonable reconstruction, our MSWSCAN shows superior performance in delineating thin white boundaries between solar panel units, producing more photorealistic visual details. Particularly in Figure 9, MSWSCAN reconstructs straight black lines with significantly sharper textures compared to MSRN’s blurred outputs and shows finer line continuity than SwinIR and DAT. These visual comparisons confirm that the proposed MSWSCAN generates super-resolved images with enhanced clarity and richer textural details on the RSSCN7 dataset.

As shown in Table 2, the proposed MSWSCAN achieves optimal performance on the WHU-RS19 dataset at ×2, ×3, and ×4 upsampling factors (with the exception of SSIM at the ×4 factor). At the ×2 upsampling factor, the MSWSCAN improves PSNR and SSIM by 0.05 dB and 0.0011, respectively, compared to the sub-optimal DAT network. At the ×3 upsampling factor, the MSWSCAN improves PSNR and SSIM by 0.06 dB and 0.0019, respectively, over the sub-optimal DAT network. At the ×4 upsampling factor, the MSWSCAN achieves a 0.04 dB improvement in PSNR compared to the sub-optimal DAT network, and its SSIM differs by only 0.0002 from the optimal DAT network.

The following section presents visual comparisons of super-resolution performance across different networks on the WHU-RS19 dataset.

Figure 10, Figure 11 and Figure 12 demonstrate that all compared networks reconstruct remote sensing images with high similarity to the HR references on the WHU-RS19 dataset. In Figure 10, while most methods successfully recover the general desert texture patterns, our MSWSCAN produces more natural undulating sand ridge transitions that better match the HR ground truth. The proposed method exhibits superior performance in restoring fine-grained details within the central region of Figure 10, whereas comparative methods generate oversmoothed reconstructions with blurred structures. These visual observations corroborate the quantitative superiority of MSWSCAN in both PSNR and SSIM metrics.

We also conducted experiments on the UCMerced_LandUse dataset, and the results are shown in Table 3.

As demonstrated in Table 3, the proposed MSWSCAN achieves state-of-the-art performance across all upsampling factors (

\times 2

,

\times 3

, and

\times 4

) on the UCMerced_LandUse dataset. The network exhibits consistent improvements over the suboptimal DAT method: (a)

\times 2

scale, +0.06 dB PSNR and +0.0018 SSIM; (b)

\times 3

scale, +0.05 dB PSNR and +0.0021 SSIM; and (c)

\times 4

scale, +0.07 dB PSNR and +0.0033 SSIM. These progressive enhancements verify MSWSCAN’s superior capability in maintaining both structural fidelity and textural details under various magnification requirements.

The following section presents visual comparisons of super-resolution performance across different networks on the UCMerced_LandUse dataset. Figure 13, Figure 14 and Figure 15 demonstrate that all networks generate SR images with satisfactory structural consistency on the UCMerced_LandUse dataset. Three critical observations emerge: (1) In the guidance line reconstruction of Figure 13, MSWSCAN uniquely preserves clear negative space at line terminals (red boxes), matching the HR ground truth with subpixel precision. (2) Aircraft nose edges in Figure 14 show more natural curvature continuity through our method. (3) Figure 15 reveals most networks achieve realistic reconstruction in the black-and-white stripe regions. However, in the yellow “V”-shaped area, the proposed MSWSCAN demonstrates superior recovery in the blank region at the tail of the “V” symbol. Specifically, it clearly reveals a small discontinuity in the blank space at the tail, which makes the shape of the “V” symbol more distinct. In contrast, other models fail to produce a noticeable separation at the tail of the “V” symbol. These visual advantages align with the quantitative superiority (PSNR/SSIM) shown in Table 3, confirming MSWSCAN’s dual excellence in both perceptual quality and metric performance.

In order to further prove the performance of the MSWSCAN proposed in this paper, the complex structures of some images will be enlarged below so that the reconstruction performance of different models at complex structures can be clearly displayed. Figure 16 reveals distinct reconstruction capabilities in the red rectangular area containing intricate HR structures. Conventional methods exhibit significant blurring with complete absence of inter-window boundaries. While SwinIR and DAT partially recover boundary contours, their reconstructions show undersized black rectangular patterns (approximately half of HR dimensions) with low contrast. The proposed MSWSCAN notably preserves both the original pattern geometry and boundary sharpness, achieving the closest visual match to the HR reference.

Figure 17 demonstrates reconstruction performance on high-density architectural patterns, where each grid represents individual rooms. The analysis reveals three reconstruction levels: (1) The reconstructed images of Bicubic are very blurry, with no boundaries between different windows, and no boundary structure similarly to HR images has been reconstructed; (2) RCAN/MSRN/RDN partially restore grids but with intermittent boundary discontinuities; and (3) EDSR/SwinIR/DAT reconstruct basic grid layouts but suffer from blurred inter-cell boundaries in congested areas. Notably, the proposed MSWSCAN maintains continuous boundary definitions matching the HR reference, achieving the highest structural fidelity in these challenging high-density regions.

Limitations and Failure Cases

Figure 18 highlights challenging cases with very long and thin grid structures under curved boundaries (lower-left arc) and dense line intersections. Across methods, horizontal strokes can be partially missing and fine grids become ambiguous due to aliasing and averaging effects in low-variance areas.

For MSWSCAN, the most notable artifacts are (i) slightly blunt vertical strokes with faint double edges near cross-window boundaries, and (ii) attenuated horizontal strokes in dense grids. We attribute it (i) to residual cross-scale misalignment plus limited cross-window consistency of windowed interactions, which becomes visible on long contours and at turning points (e.g., the lower-left arc). We attribute (ii) to a bias of the channel-aware fusion (FFL) towards high-energy texture channels, which can underweight sparse thin-line cues during aggregation.

By comparison, DAT tends to produce straighter and more continuous lines due to stronger long-range coupling, whereas SwinIR emphasizes sharpness but occasionally exhibits aliasing (jagged edges). These patterns indicate that MSWSCAN’s local windowing and channel reweighting are effective for most textured regions but remain less robust on ultra-long, ultra-thin structures and under pronounced curvature. Future work will consider stronger cross-window propagation and thin-structure–aware fusion to improve continuity without sacrificing efficiency.

4.3. Component Ablation Analysis

This section systematically investigates the individual contributions of key components in the proposed MSWSCAN architecture through controlled ablation studies. Three fundamental aspects are examined as follows: (1) the quantitative impact of Multi-scale Windowed Spatial-Channel Attention Group (MSWSCAG) module counts, (2) the functional efficacy of constituent elements within each MSWSCAG (multi-scale mechanisms, window attention, and channel attention), (3) and the optimization effects of feature fusion layers. The experimental configuration maintains identical training protocols across all variants to ensure fair comparison.

4.3.1. The Impact of the Number of Cascaded MSWSCAG Modules on Network Performance

Ablation studies are conducted on the WHU-RS19, RSSCN7, and UCMerced_LandUse datasets. The reconstruction performance is evaluated under

\times 2

,

\times 3

, and

\times 4

upsampling scales by cascading two, four, six, and eight MSWSCAG modules in the network, with quantitative results summarized in Table 4.

As shown in Table 4, both PSNR and SSIM metrics generally improve with increasing MSWSCAG module counts, with the most significant gains observed when scaling from two to four and four to six modules. On the RSSCN7 dataset, increasing modules from two to four yields average improvements of +0.21 dB PSNR and +0.0291 SSIM across three upsampling scales, while four to six modules provide additional gains of +0.13 dB PSNR and +0.0218 SSIM. The six to eight module configuration shows marginal changes:

\times 2

scale (+0.03 dB/+0.0012),

\times 3

scale (+0.01 dB/−0.0002), and

\times 4

scale (−0.01 dB/+0.0001), accompanied by a 4.88M parameter increase. Similar patterns emerge on WHU-RS19: two to four modules achieve +0.24 dB/+0.0314, four to six modules +0.15 dB/+0.0245, while six to eight modules result in

\times 2

(+0.02 dB/−0.0001),

\times 3

(+0.02 dB/+0.0008), and

\times 4

(−0.02 dB/−0.0005) performance fluctuations with equivalent parameter overhead.

On the UCMerced_LandUse dataset, increasing MSWSCAG modules from two to four achieves average gains of +0.23 dB PSNR and +0.0343 SSIM across three upsampling scales, while scaling from four to six modules provides additional improvements of +0.16 dB PSNR and +0.0242 SSIM. However, expanding to eight modules yields diminishing returns:

\times 2

scale (+0.02 dB/+0.0006),

\times 3

scale (+0.01 dB/0.0000 SSIM), and

\times 4

scale (+0.01 dB/+0.0007), despite a 4.88M parameter increase. This analysis confirms that the six- to eight-module expansion offers negligible performance benefits relative to its computational overhead. Consequently, we adopt six MSWSCAG modules in the backbone network to optimally balance reconstruction quality and model complexity.

4.3.2. Influence of MSWSCAG Internal Module on Network Performance

We investigate the effects of the multi-scale (MS) module, window-based spatial attention (WSA) module, and channel attention (CA) module within the MSWSCAG block on network performance. Experiments are conducted on the UCMerced_LandUse dataset with an

\times 2

upsampling factor, using six cascaded MSWSCAG blocks in the backbone. The results are presented in Table 5.

From Table 5, the MSWSCAG (MS+WSA+CA) module achieves improvements of

0.18 dB

in PSNR and

0.0062

in SSIM over the MS-only module, demonstrating the importance of the designed WSCA (WSA+CA) module for capturing global contextual information and aggregating spatial and channel features both intra-block and inter-block. Compared with the WSCA-only module, MSWSCAG further improves PSNR and SSIM by

0.12 dB

and

0.0020

, respectively, validating the effectiveness of the MS module in extracting and fusing multi-scale features.

Additionally, we conduct experiments cascading MS with WSA and CA separately. The MSWSA (MS+WSA) module improves PSNR by

0.10 dB

and SSIM by

0.0044

over the MS-only module, while the MSCA (MS+CA) module yields gains of

0.09 dB

and

0.0045

. However, MSWSCAG outperforms both MSWSA and MSCA by approximately

0.08 dB

in PSNR and

0.0018

in SSIM. These results further confirm the effectiveness of the MS, WSA, and CA modules for remote sensing image reconstruction.

4.3.3. Influence of Feature Fusion Layer on Network Performance

Experiments evaluating the impact of the feature fusion layer were conducted on WHU-RS19 at

\times 2

upsampling with six cascaded MSWSCAG modules. As reported in Table 6, enabling FFL brings a small yet consistent gain over the variant without FFL, namely +0.03 dB PSNR and +0.0008 SSIM. The improvements concentrate on heterogeneous boundaries (e.g., building–vegetation and shoreline transitions), where channel-aware reweighting helps reduce texture drift.

From Table 6, it can be observed that adding a feature fusion layer (FFL) at the end of the backbone network improves the PSNR and SSIM by 0.03 dB and 0.0008, respectively, compared to the network without FFL. Although the performance improvement is relatively modest, the FFL comprises only 4.16k parameters, essentially functioning as a 1 × 1 convolution, which demonstrates the effectiveness of aggregating multi-level features for remote sensing image reconstruction.

4.3.4. Complexity and Runtime

We complement accuracy comparisons with runtime and memory analysis on a unified setup (FP32, batch = 1 unless specified, HR 256 × 256 for inference; PyTorch 2.8.0+cu128 on an NVIDIA GeForce RTX 5090). Results are averaged over repeated forward passes after warm-up; only model forward is timed.

As shown in Table 7, MSWSCAN achieves DAT-level inference efficiency—latency and peak memory are close to DAT (558 vs. 545 ms; 3.03 vs. 3.15 GB)—while SwinIR remains the fastest (315 ms). During training on 64 × 64 inputs (batch = 2), MSWSCAN’s iteration time is slightly higher than DAT (199 vs. 181 ms) and higher than SwinIR, whereas its memory footprint lies between the two baselines (6.39 vs. 4.97/7.59 GB). Together with the accuracy gains on heterogeneous land covers, these results indicate that MSWSCAN offers a practical fidelity–efficiency trade-off, providing more stable cross-scale aggregation at a runtime budget comparable to DAT.

5. Conclusions

In order to mitigate the artifacts and structural distortions that arise during the reconstruction of remote sensing images due to their complex and variable features, we propose the Multi-scale Window Spatial–Channel Attention Network (MSWSCAN). First, we design a multi-scale module that employs three convolutional kernels of different sizes to extract features with diversified receptive fields, and fuse these multi-scale features to enrich the feature representation. Next, within each module, we alternately cascade the Window Spatial Attention (WSA) and channel attention (CA) submodules to dynamically reweight spatial and channel features, thereby enhancing the network’s ability to discriminate spatial features and to capture high-frequency information. Finally, before the reconstruction module, we introduce a feature fusion layer to integrate features from different depths, further improving the expressiveness of deep features and facilitating the reconstruction of remote sensing images with rich texture details. Extensive experiments on public datasets demonstrate that MSWSCAN effectively enriches texture details in reconstructed images, alleviates artifacts and structural distortions, and achieves superior performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), compared to existing methods.

Author Contributions

X.X. (Xiao Xiao) gave some suggestions, revised the manuscript, and conducted experiments; X.X. (Xufeng Xiang) proposed the initial idea, conducted experiments, and wrote the manuscript; J.W. contributed to figure modification and manuscript revision; L.W. conducted experiments and revised the manuscript; X.G. contributed to table modification and manuscript revision; Y.C. collected experimental data and conducted experiments; J.L. assisted in data visualization and improved the clarity of figures; P.H. contributed to reference organization and formatting of the manuscript; J.H. performed supplementary data analysis and assisted in result validation; Z.L. contributed to proofreading and provided constructive comments on presentation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Project of the Joint Fund of the National Natural Science Foundation of China, grant number U21A20446, by the Natural Science Basic Research Program of Shaanxi Province, grant number 2025JC-YBMS-716, by The 2023 Annual Traffic Scientific Research Project of Department of transport of Shaanxi Province “Research on Digital Sensing and Intelligent Decision-Making for Highway Traffic Infrastructure Safety”, project number 23-40X, and by The Key Research and Development Plan Project of Department of science and technology of Shaanxi Province “Research on Highway Traffic Safety Active Prevention and Control Equipment and Early Warning System”, project number 2024SF-YBXM-664.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to express their sincere gratitude to the School of Telecommunications Engineering, Xidian University, for providing experimental facilities and technical support. Special thanks are extended to the Department of Transport of Shaanxi Province and the Department of Science and Technology of Shaanxi Province for their valuable guidance and collaboration in the research on highway traffic safety and intelligent sensing.

Conflicts of Interest

Author Jianqiang Wang and Xingzhi Gao were employed by the company CCCC Civil Engineering Science & Technology Co., Ltd, Xi’an. Author Yang Chen was employed by the company Shanghai Huawei Technologies Co., Ltd., Public Development Department. Author Jun Liu, Peng He and Junhui Han were employed by the company The 41st Research Institute of China Electronics Technology Group Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Brown, L.G. A survey of image registration techniques. ACM Comput. Surv. (CSUR) 1992, 24, 325–376. [Google Scholar] [CrossRef]
Yang, S.; Kim, Y.; Jeong, J. Fine edge-preserving technique for display devices. IEEE Trans. Consum. Electron. 2008, 54, 1761–1769. [Google Scholar] [CrossRef]
Keys, R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160. [Google Scholar] [CrossRef]
Tsai, R.Y.; Huang, T.S. Multiframe image restoration and registration. Adv. Comput. Vis. Image Process. 1984, 1, 317–339. [Google Scholar]
Malkin, K.; Robinson, C.; Hou, L.; Soobitsky, R.; Czawlytko, J.; Samaras, D.; Saltz, J.; Joppa, L.; Jojic, N. Label super-resolution networks. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2018. [Google Scholar]
Li, Z.; Zhang, H.; Lu, F.; Xue, R.; Yang, G.; Zhang, L. Breaking the resolution barrier: A low-to-high network for large-scale high-resolution land-cover mapping using low-resolution labels. ISPRS J. Photogramm. Remote Sens. 2022, 192, 244–267. [Google Scholar] [CrossRef]
Li, Z.; He, W.; Li, J.; Lu, F.; Zhang, H. Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27717–27727. [Google Scholar]
Irani, M.; Peleg, S. Improving resolution by image registration. CVGIP Graph. Model. Image Process. 1991, 53, 231–239. [Google Scholar] [CrossRef]
Kim, S.P.; Bose, N.K.; Valenzuela, H.M. Recursive reconstruction of high resolution image from noisy undersampled multiframe. IEEE Trans. Acoust. Speech Signal Process. 1990, 38, 1013–1027. [Google Scholar] [CrossRef]
Rhee, S.; Kang, M.G. Discrete cosine transform based regularized high-resolution image reconstruction algorithm. Opt. Eng. 1999, 38, 1348–1356. [Google Scholar] [CrossRef]
Nguyen, N.; Milanfar, P. An efficient wavelet-based algorithm for image superresolution. In Proceedings of the 2000 International Conference on Image Processing (Cat. No. 00CH37101), Vancouver, BC, Canada, 10–13 September 2000; pp. 351–354. [Google Scholar]
Stark, H.; Oskoui, P. High-resolution image recovery from image-plane arrays, using convex projections. J. Opt. Soc. Am. A 1989, 6, 1715–1726. [Google Scholar] [CrossRef] [PubMed]
Schultz, R.R.; Stevenson, R.L. A Bayesian approach to image expansion for improved definition. IEEE Trans. Image Process. 1994, 3, 233–242. [Google Scholar] [CrossRef] [PubMed]
Elad, M.; Feuer, A. Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images. IEEE Trans. Image Process. 1997, 6, 1646–1658. [Google Scholar] [CrossRef] [PubMed]
Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1664–1673. [Google Scholar]
Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199. [Google Scholar]
Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II. Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407. [Google Scholar]
Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1874–1883. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar]
Qiu, Y.; Wang, R.; Tao, D.; Cheng, J. Embedded block residual network: A recursive restoration model for single-image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4180–4189. [Google Scholar]
Liu, J.; Zhang, W.; Tang, Y.; Tang, J.; Wu, G. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2359–2368. [Google Scholar]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532. [Google Scholar]
Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.-H. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
Dai, T.; Cai, J.; Zhang, Y.; Xia, S.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII. Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207. [Google Scholar]
Zhang, Y.; Wei, D.; Qin, C.; Wang, H.; Pfister, H.; Fu, Y. Context reasoning attention network for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4278–4287. [Google Scholar]
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 12312–12321. [Google Scholar]
Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-loop matters: Dual regression networks for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5407–5416. [Google Scholar]
Mao, X.; Shen, C.; Yang, Y.B. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
Tai, Y.; Yang, J.; Liu, X.; Xu, C. MemNet: A persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4539–4547. [Google Scholar]

Figure 1. MSWSCAN network architecture.

Figure 2. Multi-scale Windowed Spatial and Channel Attention Group (MSWSCAG) module.

Figure 3. Multi-scale module (MS): three parallel convolutions with kernel sizes 3 × 3, 5 × 5 and 7 × 7; the outputs are concatenated and fused by a final 3 × 3 convolution.

Figure 4. Windowed Spatial and Channel Attention (WSCA) module.

Figure 5. Window space attention (WSA) module.

Figure 6. Channel attention (CA) module.

Figure 7. Visual comparison of

\times 2

super-resolution results on RSSCN7 dataset.

Figure 7. Visual comparison of

\times 2

super-resolution results on RSSCN7 dataset.

Figure 8. Visual comparison of

\times 3

super-resolution results on RSSCN7 dataset.

Figure 8. Visual comparison of

\times 3

super-resolution results on RSSCN7 dataset.

Figure 9. Visual comparison of

\times 4

super-resolution results on RSSCN7 dataset.

Figure 9. Visual comparison of

\times 4

super-resolution results on RSSCN7 dataset.