Article

MSWSR: A Lightweight Multi-Scale Feature Selection Network for Single-Image Super-Resolution Methods

1 College of Information Science and Technology, North China University of Technology, Beijing 100144, China
2 School of Electrical and Control Engineering, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(3), 431; https://doi.org/10.3390/sym17030431
Submission received: 15 February 2025 / Revised: 3 March 2025 / Accepted: 10 March 2025 / Published: 13 March 2025
(This article belongs to the Section Computer)

Abstract

Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have achieved breakthrough progress in reconstruction quality. However, their high computational costs and model complexity have limited their applications in resource-constrained devices. To address this, we propose MSWSR (multi-scale wavelet super-resolution), a lightweight multi-scale feature selection network that exploits both symmetric and asymmetric feature patterns. MSWSR achieves efficient feature extraction and fusion through modular design. The core modules include a multi-scale feature module (MFM) and a gated attention unit (GAU). The MFM employs a symmetric multi-branch structure to efficiently integrate multi-scale features and enhance low-frequency information modeling. The GAU combines the spatial attention mechanism with the gating mechanism to further optimize symmetric feature representation capability. Moreover, a lightweight spatial selection attention (SSA) module adaptively assigns weights to key regions while maintaining structural symmetry in feature space. This significantly improves reconstruction quality in complex scenes. In ×4 super-resolution tasks, compared to SPAN, MSWSR improves PSNR by 0.22 dB on Urban100 and 0.26 dB on Manga109. The model contains only 316K parameters, which is substantially lower than existing approaches. Extensive experiments demonstrate that MSWSR significantly reduces computational overhead while maintaining reconstruction quality, providing an effective solution for resource-constrained applications.

1. Introduction

The single-image super-resolution (SR) method reconstructs high-resolution (HR) images from low-resolution (LR) inputs and plays a crucial role in computer vision and image processing [1]. This technique is widely applied in medical imaging, surveillance, and security to enhance image quality and improve the performance of other computer vision tasks [2,3]. Mobile devices such as the iPhone 16 Pro Max, Huawei P70 Pro, and Xiaomi 15 are evolving rapidly, which underscores the need for efficient and accurate super-resolution methods that improve image display quality while meeting the processing demands of these devices. Investigating super-resolution methods for such devices therefore holds significant practical importance.
In the past decade, convolutional neural networks (CNNs) [4,5,6,7,8,9] have been the cornerstone of super-resolution methods. Recently, however, the emergence of transformer-based approaches [9,10,11,12,13,14,15,16] has started to shift this paradigm, and they have gradually surpassed CNNs in prominence. Specifically, transformers hold a distinct advantage over CNNs through their multi-head self-attention mechanism. This mechanism enables the extraction of global image features, whereas CNNs are structurally limited to capturing local features. To address the performance gap between CNNs and vision transformers (ViTs), several studies have explored strategies, such as increasing the size of convolutional kernels to achieve a larger receptive field. Wu et al. [17] employed large-kernel convolutions as feature mixers to replace attention modules. This approach effectively models long-range dependencies and broad receptive fields with minimal computational overhead. Xie et al. [18] developed LKDN to enhance model performance and computational efficiency through large-kernel attention (LKA) mechanisms. Wang et al. [19] proposed a multi-scale attention network (MAN), which consists of multi-scale large-kernel attention (MLKA) mechanisms and gated spatial attention units (GSAUs). The MLKA mechanism modifies large-kernel attention through multi-scale and gating schemes to obtain rich attention maps at various granularity levels. This design enables the network to aggregate global and local information while avoiding potential blocking artifacts. Empirical evidence shows that performance saturates at a convolution kernel size of 7 × 7. Beyond this threshold, further increases in kernel size not only fail to provide significant improvements but may actually lead to performance degradation. Although Ding et al. [20] proposed that larger kernels could still be utilized through carefully designed kernel structures, the resulting kernels often become over-parameterized, and model performances still saturate before achieving global receptive fields.
To address the aforementioned issues, we propose a lightweight multi-scale wavelet super-resolution network, MSWSR. The core component of the network is the multi-scale wavelet block (MWB), which consists of a multi-scale feature module (MFM) and a gated attention unit (GAU). Through this structural design, MWB effectively integrates multi-scale feature representation with attention mechanisms while maintaining a lightweight architecture and significantly improving performance. Specifically, MFM adopts a multi-branch architecture for the efficient extraction and fusion of multi-scale features, comprising three key branches: (1) A standard 3 × 3 convolution branch extracts basic feature information. (2) A re-parameterization convolution branch dynamically adjusts the receptive field during training through strip convolution, dilated convolution, and 3 × 3 convolution; during inference, these operations are folded into a single equivalent 3 × 3 convolution to improve computational efficiency. (3) The wavelet transform convolution branch (WTConv) [21,22] captures low-frequency features through time-frequency decomposition. This branch expands the receptive field through wavelet transform while using small-kernel convolutions to enhance its low-frequency response modeling capability. The features from these multi-branch structures are unified during the fusion phase to construct rich multi-scale feature representations. The GAU module further enhances feature discriminative ability by integrating spatial attention and gating mechanisms. To dynamically highlight key regional features and suppress redundant information, we introduce a lightweight spatial selection attention (SSA) module. The SSA module achieves precise weight redistribution through the adaptive learning of importance distribution across feature map positions, enabling the model to focus on features crucial to reconstruction quality. This module significantly improves the global feature discriminative ability and task relevance while maintaining computational efficiency. The proposed architecture adopts the MetaFormer [23] paradigm to organically combine the MFM, GAU, and SSA modules. These modules are stacked in a modular structure that maximizes the synergy between multi-scale mechanisms, attention mechanisms, and feature-selection capabilities. As shown in Figure 1, in the ×4 super-resolution task, MSWSR demonstrates significant advantages on the Urban100 dataset. It achieves superior performance compared to state-of-the-art methods while requiring fewer parameters and lower FLOPs. This effectively balances reconstruction quality, model size, and computational efficiency. These results thoroughly validate the effectiveness of the proposed multi-scale feature selection strategy in balancing model efficiency and performance, providing an efficient solution for image super-resolution reconstruction in resource-constrained scenarios. Furthermore, we conducted comprehensive ablation studies to investigate the individual and synergistic contributions of each proposed component. Notably, all experiments were performed under a lightweight model constraint. These experiments quantitatively demonstrate that our multi-scale feature module, gated attention unit, and spatial selection attention module each significantly enhance reconstruction quality without substantially increasing computational complexity.
Overall, our contributions are as follows:
  • We propose a lightweight multi-scale feature selection network (MSWSR) for efficient feature fusion and modeling. Through modular design and multi-scale feature extraction strategies, MSWSR effectively balances model performance and computational complexity.
  • We propose two key components: MFM and GAU. MFM enhances multi-scale feature modeling through a multi-branch structure, while GAU combines spatial attention with the gating mechanism to optimize feature representation through their synergistic interaction.
  • Extensive experiments demonstrate MSWSR’s superior performance on benchmark datasets. With only 316K parameters, it achieves PSNR improvements of 0.22 dB and 0.26 dB on the Urban100 and Manga109 datasets for the ×4 super-resolution task, validating its effectiveness in resource-constrained scenarios.

2. Related Work

2.1. CNN-Based SR Methods

Single-image super-resolution (SISR) methods aim to recover high-resolution images from low-resolution inputs. Convolutional neural network (CNN) methods have become dominant in this field. In 2014, Dong et al. [4] proposed SRCNN. They used a three-layer convolutional network for end-to-end training. This network learned the mapping between low-resolution and high-resolution images and significantly improved reconstruction quality. SRCNN faced computational efficiency challenges. To address this, Dong et al. [24] introduced FSRCNN in 2016. They optimized the network structure and used deconvolution for upsampling. The reduction in convolutional layers made real-time super-resolution possible. Kim et al. [5] developed VDSR with deep residual learning and a deeper network structure to solve the vanishing gradient problem. In 2017, Lim et al. [6] proposed EDSR by removing batch normalization layers from VDSR and deepening the structure, showing excellent performance in high-magnification scenarios. Zhang et al. [25] introduced RDN (Residual Dense Network) in 2018, which incorporated dense residual blocks and showed particular strength in detail recovery and edge sharpening.
Recent years have seen the continued exploration of novel architectures. Dai et al. [15] proposed SAN with second-order attention mechanisms to improve the capture of image details. Niu et al. [14] introduced the Holistic Attention Network (HAN) with global attention mechanisms, showing excellent performance in recovering complex image details, especially for scenes with complex backgrounds. These new attention mechanisms have improved both visual quality and computational efficiency.

2.2. Transformer-Based Architectures

Chen et al. [11] introduced HAT, which combines channel attention with window-based self-attention mechanisms to leverage both global statistics and local features. The incorporation of overlapping cross-attention modules enhances feature interaction between adjacent windows, enabling more precise reconstruction with broader spatial information. Li et al. [12] developed EDT with dual-transformer structures to model low-resolution and high-resolution features separately. EDT optimizes computational efficiency through a two-stage attention mechanism while maintaining high restoration accuracy with reduced computational cost. Ray et al. [16] introduced CFAT with innovative non-overlapping triangular window techniques alongside traditional rectangular windows, which effectively mitigates boundary distortion and expands the model’s spatial processing capability.

2.3. Lightweight SR Models

In recent years, lightweight SISR networks have become a research focus, with researchers proposing various innovative methods to balance model performance and computational complexity for resource-constrained scenarios. Hui et al. [26] proposed the Information Multi-Distillation Network (IMDN), which enhances feature extraction efficiency through a multi-level distillation mechanism for feature splitting and fusion, combined with contrast-aware channel attention (CCA). Building upon this work, Liu et al. [27] further optimized IMDN by introducing the Residual Feature Distillation Network (RFDN) with feature distillation connections (FDC) and shallow residual blocks (SRB), maintaining high super-resolution reconstruction while reducing model complexity. Li et al. [28] developed the Blueprint Separable Residual Network (BSRN), which achieves more efficient feature extraction by replacing traditional convolutions with blueprint separable convolutions (BSConv) and incorporating enhanced spatial attention (ESA) modules. Sun et al. [29] introduced the ShuffleMixer network, designing an efficient feature mixing module by combining large convolution kernels with channel splitting and shuffling operations, achieving good performance with extremely low parameters.
Wang et al. [30] proposed the Feature De-redundancy [8] and Self-Calibration network (FDSCSR) with feature de-redundancy and self-calibration modules (FDSCB). The network incorporates local feature fusion modules (LFFM) to improve computational efficiency and feature integration capability. Xie et al. [18] introduced the large-kernel distillation network (LKDN), which explores the potential of large convolution kernels by combining large-kernel attention (LKA) modules with re-parameterization techniques. Wang et al. [19] proposed the multi-scale attention network (MAN) with multi-scale large-kernel attention (MLKA) mechanisms and gated spatial attention units (GSAUs). This architecture effectively enhances feature representation while avoiding blocking artifacts.
Transformer-based research has opened new directions for lightweight super-resolution networks. Li et al. [31] proposed GRFormer featuring Grouped Residual Self-Attention (GRSA) modules. By introducing group residuals in QKV linear layers and ES-RPB, this network significantly reduces parameters and computational cost. Kim et al. [32] introduced the low-to-high multi-level visual transformers (LMLTs) to capture information at different levels through multi-head mechanisms. The network uses parallel stacking to reduce computational complexity while effectively addressing window boundary issues.
Despite transformers’ advantages in capturing long-range dependencies and global context, their inference speed remains slower than that of CNN-based models, particularly on resource-constrained devices. Consequently, CNNs remain a preferred choice for lightweight super-resolution networks, as they achieve efficient local feature extraction and reconstruction with lower computational complexity through deep residual structures and large convolution kernels. While dynamic convolutions achieve superior performance in specific scenarios, their runtime weight generation mechanism introduces computational overhead that directly conflicts with the lightweight design principles we aim to maintain.
The proposed MSWSR method distinguishes itself from existing lightweight SR methods through its multi-branch parallel architecture, which enables simultaneous multi-scale feature extraction. This architecture ensures comprehensive feature representation without introducing computational redundancy. Additionally, MSWSR employs wavelet transform convolution (WTConv) for efficient receptive field expansion, achieving logarithmic parameter scaling and enhanced low-frequency information modeling. This combination enables superior reconstruction quality while maintaining minimal parameters (316K). It also reduces computational complexity compared to current state-of-the-art approaches. Methods such as LKDN and MAN rely on parameter-intensive large-kernel convolutions, whereas our approach achieves better efficiency.

3. Materials and Methods

3.1. Architecture Overview

Given a low-resolution image I_LR ∈ R^{C×H×W}, where C and H×W represent the number of channels and the spatial resolution, respectively, MSWSR first extracts features X_0 ∈ R^{C×H×W} through a 3×3 convolution. Then, we design an MWB composed of multiple attention blocks, which takes X_0 as the input and produces deeper features X_1 ∈ R^{D×H×W} through feature enhancement. To fully exploit discriminative feature representations, we introduce the spatial selection attention (SSA) module, which adaptively learns attention weights in the spatial dimension to highlight features in regions with strong representation capability while suppressing redundant and noisy information, thereby generating feature maps X_2 ∈ R^{D×H×W} with enhanced discriminative ability. This mechanism of feature extraction and attention enhancement is repeated m times in the network, forming a deep feature extraction chain. The feature extraction process at the k-th iteration can be expressed as
$$X_{k+1} = f_{\mathrm{SSA}}\left(f_{\mathrm{MWB}}(X_k)\right), \quad k \in [1, m-1]$$
where f_SSA denotes the spatial selection attention mapping function and f_MWB represents the mapping function of the multi-scale wavelet block. The SSA module acts as a feature filter in each iteration, continuously optimizing the feature representation capability. After m iterations, we obtain the feature X_m. Finally, the features are mapped to the high-resolution reconstruction space through a 3×3 convolution layer, and global residual connections [33] are employed to fuse the initial features X_0 with the final features to obtain X ∈ R^{D×H×W}. The high-resolution image I_HR ∈ R^{C×rH×rW} is then generated through an image reconstruction layer consisting of a Conv3×3 and PixelShuffle [34], where r denotes the upscaling factor. This multi-stage feature extraction design, integrated with adaptive feature selection mechanisms, not only effectively captures multi-scale discriminative features but also preserves original image information through residual learning, thereby achieving high-quality super-resolution reconstruction. The following sections detail each module.
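For concreteness, the pipeline described above can be sketched in PyTorch as follows. The MWB and SSA blocks are passed in as placeholders, and all class names and default hyper-parameters (12 blocks, 48 channels, matching the configuration in Section 4.1) are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class MSWSRSketch(nn.Module):
    """Sketch of the MSWSR pipeline: shallow conv -> m x (MWB -> SSA) -> conv ->
    global residual -> Conv3x3 + PixelShuffle reconstruction. The mwb/ssa factories
    stand in for the blocks described in Sections 3.2 and 3.3."""
    def __init__(self, in_ch=3, dim=48, num_blocks=12, scale=4, mwb=None, ssa=None):
        super().__init__()
        self.head = nn.Conv2d(in_ch, dim, 3, padding=1)          # shallow feature extraction (X_0)
        self.blocks = nn.ModuleList(
            [nn.Sequential(mwb() if mwb else nn.Identity(),
                           ssa() if ssa else nn.Identity())
             for _ in range(num_blocks)]
        )
        self.body_tail = nn.Conv2d(dim, dim, 3, padding=1)       # maps deep features back
        self.upsampler = nn.Sequential(                          # image reconstruction layer
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        x0 = self.head(x)                  # X_0
        feat = x0
        for blk in self.blocks:            # m iterations of f_SSA(f_MWB(.))
            feat = blk(feat)
        feat = self.body_tail(feat) + x0   # global residual connection
        return self.upsampler(feat)        # I_HR of size (C, rH, rW)
```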

3.2. Multi-Scale Wavelet Block

In this section, we introduce the core module of our network, the MWB, as illustrated in Figure 2. Previous studies have shown that expanding features before activation layers enhances the network’s nonlinear capability [35]. Therefore, we incorporate 1×1 convolutions before and after the MFM for feature expansion and compression, respectively.
In previous multi-scale studies, many network architectures input all features into each branch, leading to increased parameter count and redundant computations. To address this issue, we implemented a slice operation after the 1×1 convolution. For simplicity, we evenly distributed the input features among branches, meaning only a portion of the expanded features was fed into each branch.
$$Z_i = \begin{cases} w(x_i), & i = 1 \\ \mathrm{RepC}(x_i), & i = 2 \\ \mathrm{WT}(x_i), & i = 3 \end{cases}$$
where w(.) represents the 3×3 convolution operation, RepC denotes the re-parameterization convolution mapping function, and WT represents the wavelet convolution operation.
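The slice-and-branch strategy can be illustrated with the minimal PyTorch sketch below. The plain 3×3 convolutions standing in for the RepConv and WTConv branches, the expansion ratio, and the module names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MFMSketch(nn.Module):
    """Sketch of the multi-branch split in MFM: expand with a 1x1 conv, slice the
    channels evenly across the three branches, run each branch on its own slice,
    then fuse and compress with another 1x1 conv."""
    def __init__(self, dim=48, expand=2):
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.branch_conv = nn.Conv2d(hidden // 3, hidden // 3, 3, padding=1)       # plain 3x3 branch
        self.branch_rep = nn.Conv2d(hidden // 3, hidden // 3, 3, padding=1)        # stand-in for RepConv
        self.branch_wt = nn.Conv2d(hidden - 2 * (hidden // 3),
                                   hidden - 2 * (hidden // 3), 3, padding=1)       # stand-in for WTConv
        self.compress = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        z = self.expand(x)
        c = z.shape[1] // 3
        x1, x2, x3 = z[:, :c], z[:, c:2 * c], z[:, 2 * c:]   # slice: each branch sees only a share
        z1 = self.branch_conv(x1)    # Z_1 = w(x_1)
        z2 = self.branch_rep(x2)     # Z_2 = RepC(x_2)
        z3 = self.branch_wt(x3)      # Z_3 = WT(x_3)
        return self.compress(torch.cat([z1, z2, z3], dim=1))
```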
RepConv: The re-parameterization convolution module is designed to enhance local feature extraction efficiency while incorporating dynamic receptive field expansion, as shown in Figure 3. Drawing inspiration from human visual perception mechanisms, particularly the foveal focusing characteristics, this module adaptively captures spatial contexts. The training phase integrates multiple convolutional operations, including standard 3×3 convolutions, decomposed 3×1 and 1×3 convolutions, and dilated convolutions, to establish comprehensive spatial relationships. This architectural design can be mathematically formulated as
$$Y_1 = \mathrm{DWConv}_{3\times3}(x_2) + \mathrm{DWConv}_{1\times3}(x_2) + \mathrm{DWConv}_{3\times1}(x_2) + \mathrm{DWConv}_{2\times2}(x_2)$$
At the inference time, all operations are structurally re-parameterized into a single 3 × 3 convolution, which effectively optimizes computational efficiency while maintaining the model’s representational capacity.
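The following sketch illustrates the folding step under stated assumptions: only the 3×3, 1×3, and 3×1 depthwise branches are folded (the dilated branch is omitted for brevity), and the check at the end verifies that the merged kernel reproduces the multi-branch output.

```python
import torch
import torch.nn.functional as F

def fold_rep_branches(w3x3, w1x3, w3x1, b3x3, b1x3, b3x1):
    """Fold parallel depthwise 3x3, 1x3, and 3x1 branches into one equivalent 3x3
    kernel: zero-pad the smaller kernels to 3x3 and sum weights and biases.
    Valid because convolution is linear in its weights."""
    w = w3x3.clone()
    w += F.pad(w1x3, (0, 0, 1, 1))   # pad height: 1x3 -> 3x3
    w += F.pad(w3x1, (1, 1, 0, 0))   # pad width:  3x1 -> 3x3
    return w, b3x3 + b1x3 + b3x1

# quick check: train-time multi-branch output equals the folded single conv
C = 8
x = torch.randn(1, C, 16, 16)
w3x3, b3x3 = torch.randn(C, 1, 3, 3), torch.randn(C)
w1x3, b1x3 = torch.randn(C, 1, 1, 3), torch.randn(C)
w3x1, b3x1 = torch.randn(C, 1, 3, 1), torch.randn(C)

y_multi = (F.conv2d(x, w3x3, b3x3, padding=(1, 1), groups=C)
           + F.conv2d(x, w1x3, b1x3, padding=(0, 1), groups=C)
           + F.conv2d(x, w3x1, b3x1, padding=(1, 0), groups=C))

w, b = fold_rep_branches(w3x3, w1x3, w3x1, b3x3, b1x3, b3x1)
y_single = F.conv2d(x, w, b, padding=1, groups=C)
assert torch.allclose(y_multi, y_single, atol=1e-4)
```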
WTConv [21] introduces a novel architectural paradigm with distinctive computational advantages. The hierarchical nature of WTConv enables progressive expansion of the receptive field across layers while maintaining remarkable parameter efficiency. More notably, the inherent structure of WTConv layers demonstrates superior capability in low-frequency feature capture compared to standard convolutions, attributed to the iterative wavelet transform decomposition that systematically emphasizes low-frequency components and enhances corresponding layer responses.
Traditional approaches relying on large convolution kernels for global feature extraction often suffer from over-parameterization. In contrast, WTConv leverages sophisticated time-frequency analysis principles to achieve efficient receptive field expansion and orchestrates a CNN low-frequency response through cascading mechanisms. The architecture implements cascaded wavelet transform (WT) decomposition coupled with strategically designed small-kernel convolutions, enabling the focused processing of distinct frequency bands within the progressively expanding receptive fields. Notably, for a k×k receptive field, the architecture achieves logarithmic parameter scaling with k, marking a significant improvement over conventional large-kernel approaches.
As depicted in Figure 4, the WTConv operational pipeline initiates with a wavelet transform for frequency-based content decomposition and downsampling. Subsequently, the framework applies specialized small-kernel depth-wise convolutions to the frequency-specific feature maps before reconstruction through the inverse wavelet transform (IWT).
$$Y = \mathrm{IWT}\left(\mathrm{Conv}\left(W, \mathrm{WT}(x_i)\right)\right)$$
where xi denotes the input tensor and W represents the weight tensor of k × k depth-wise kernels, with input channels four times that of xi. These operations not only segregate convolutions between frequency components but also enable smaller kernels to operate over larger regions of the original input, effectively expanding their receptive field.
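A single-level sketch of this pipeline is given below, assuming a Haar wavelet basis and a 3×3 depthwise sub-band convolution; the actual WTConv layer of [21] cascades several decomposition levels and adds a residual path, which are omitted here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarWTConvSketch(nn.Module):
    """One-level sketch of WTConv: fixed Haar decomposition (stride-2 depthwise conv),
    a small learnable depthwise conv on each sub-band, then reconstruction via the
    inverse (transposed) wavelet transform. Input height/width must be even."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
        self.register_buffer("wt_filter", filt.repeat(channels, 1, 1, 1))  # (4C, 1, 2, 2)
        self.channels = channels
        # small depthwise conv applied per sub-band (the "small kernel" of WTConv)
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=kernel_size // 2, groups=4 * channels)

    def forward(self, x):
        c = self.channels
        y = F.conv2d(x, self.wt_filter, stride=2, groups=c)   # WT: (B, C, H, W) -> (B, 4C, H/2, W/2)
        y = self.subband_conv(y)                              # frequency-specific small-kernel conv
        return F.conv_transpose2d(y, self.wt_filter, stride=2, groups=c)  # IWT back to (B, C, H, W)

# the transform pair is exactly invertible when the sub-band conv is skipped
m = HaarWTConvSketch(channels=4)
x = torch.randn(1, 4, 32, 32)
rec = F.conv_transpose2d(F.conv2d(x, m.wt_filter, stride=2, groups=4),
                         m.wt_filter, stride=2, groups=4)
assert torch.allclose(rec, x, atol=1e-5)
```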
Inspired by [18], we introduce the GAU module to enhance feature representation. GAU integrates single spatial attention with gated linear units (GLUs), thereby reducing both parameter count and computational complexity. To capture spatial information more effectively, GAU employs single-layer depth-wise convolutions for feature map weighting. Given dense transformations X and Y, the key process of GAU can be formulated as:
$$\mathrm{GAU}(X, Y) = H_{\mathrm{DW}}(X) \odot Y$$
where H_DW(·) represents the depth-wise convolution operation and ⊙ denotes element-wise multiplication.
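A minimal sketch of this gating scheme is shown below; the 1×1 projections and the 7×7 depthwise kernel size are assumptions, since the text only specifies a single-layer depthwise convolution for the spatial weighting.

```python
import torch
import torch.nn as nn

class GAUSketch(nn.Module):
    """Sketch of the gated attention unit: a 1x1 conv splits the input into two
    halves X and Y, a single depthwise conv weights X spatially, and the result
    gates Y by element-wise multiplication, i.e. GAU(X, Y) = H_DW(X) * Y."""
    def __init__(self, dim=48):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # spatial weighting H_DW
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, feat):
        x, y = self.proj_in(feat).chunk(2, dim=1)   # dense transformations X and Y
        return self.proj_out(self.dwconv(x) * y)    # gating: H_DW(X) ⊙ Y
```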

3.3. Spatial Selection Attention Module

In contrast to the conventional SE [36] architecture, our approach replaces fully connected layers with 1×1 convolutions, facilitating more efficient feature extraction while achieving parameter efficiency. The module systematically compresses two-dimensional features (H×W) from each channel into a scalar representation, transforming feature maps from [h, w, c] to [1, 1, c] dimensions and generating channel-specific weight values. SSA establishes inter-channel correlations through dual point-wise convolutions, yielding weight values with cardinality matching the channel dimension of the input feature maps. These normalized weights are subsequently applied to modulate channel-wise feature response.
To further elaborate on the design rationale of our SSA module, we provide a comparative analysis with standard attention mechanisms such as SENet and CBAM. SENet primarily focuses on channel-wise relationships through global average pooling followed by fully connected layers. In contrast, our SSA module implements a more parameter-efficient design by replacing the fully connected layers with 1×1 convolutions, significantly reducing the parameter count while maintaining feature representation capacity. CBAM, on the other hand, sequentially applies both channel and spatial attention, creating additional computational overhead, whereas SSA employs a streamlined approach that achieves effective spatial feature weighting with substantially lower computational complexity. Our approach focuses on direct spatial feature modulation through point-wise convolutions. By avoiding separate channel and spatial attention branches, it is better suited to resource-constrained super-resolution applications.
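The description above can be summarized by the following sketch; the reduction ratio and activation choices are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class SSASketch(nn.Module):
    """Sketch of the spatial selection attention (SSA) module as described above:
    global average pooling compresses each HxW map to a scalar ((B, C, H, W) -> (B, C, 1, 1)),
    two point-wise (1x1) convolutions model inter-channel correlations in place of
    fully connected layers, and the normalized weights rescale the feature responses."""
    def __init__(self, dim=48, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: [h, w, c] -> [1, 1, c]
        self.fc = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),            # first point-wise convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),            # second point-wise convolution
            nn.Sigmoid(),                                   # normalized channel weights
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))                    # modulate channel-wise responses
```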

4. Experiments and Results

4.1. Experimental Settings

We train our MSWSR network on the DIV2K [37] dataset, which contains 800 training images, and fine-tune it on the Flickr2K [6] dataset. For evaluation, we employ five benchmark datasets: Set5 [38], Set14 [38], BSD100 [39], Urban100 [40], and Manga109 [41]. To assess reconstruction quality, we adopt two widely used metrics, PSNR and SSIM, computed on the Y channel of YCbCr images. PSNR is a pixel-based image quality assessment metric, defined by the maximum pixel value (denoted as L) and the mean squared error (MSE) between images. Given a ground-truth image and its reconstruction with N pixels, denoted as I_HR and Î_HR, respectively, the Peak Signal-to-Noise Ratio (PSNR) is defined as:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{L^2}{\frac{1}{N}\sum_{i=1}^{N}\left(I_{HR}(i) - \hat{I}_{HR}(i)\right)^2}\right)$$
PSNR, operating at the pixel level, exhibits limitations in capturing localized quality variations and demonstrates inadequate correlation with human visual perception. In contrast, SSIM functions as a perceptual model aligned with the human visual system (HVS), providing a more comprehensive assessment of structural similarity between images. The Structural Similarity Index (SSIM) is defined as:
$$\mathrm{SSIM}(I_{HR}, \hat{I}_{HR}) = \left[C_l(I_{HR}, \hat{I}_{HR})\right]^{\alpha}\left[C_c(I_{HR}, \hat{I}_{HR})\right]^{\beta}\left[C_s(I_{HR}, \hat{I}_{HR})\right]^{\gamma}$$
where C_l, C_c, and C_s denote the luminance, contrast, and structure comparison functions, and α, β, and γ denote weighting parameters that modulate the relative contribution of each corresponding component.
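For reference, a small NumPy sketch of the Y-channel PSNR computation defined above is shown below; it assumes the common ITU-R BT.601 luma conversion used in SR evaluation and is not the authors' evaluation code.

```python
import numpy as np

def rgb_to_y(img):
    """ITU-R BT.601 luma used for Y-channel evaluation (img in [0, 255], shape HxWx3, RGB order)."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(hr, sr, peak=255.0):
    """PSNR on the Y channel, following the definition above: 10 * log10(L^2 / MSE)."""
    y_hr = rgb_to_y(hr.astype(np.float64))
    y_sr = rgb_to_y(sr.astype(np.float64))
    mse = np.mean((y_hr - y_sr) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```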
For our MSWSR configuration, we set the number of MWBs and the channel dimension to 12 and 48, respectively. During training, we randomly crop patches of size 48 × 48 pixels with a batch size of 64. Data augmentation is performed through random rotations of 90°, 180°, and 270° and horizontal flips. We employ the Adan optimizer [42] with β1 = 0.9 and β2 = 0.99 to minimize the L1 loss. The initial learning rate is set to 5 × 10−3, and the model is trained for 500,000 iterations. The implementation is based on PyTorch 1.13.0, with training conducted on an NVIDIA GeForce RTX 4060 Ti GPU (board manufactured by Colorful, Chengdu, China; GPU designed by NVIDIA Corporation, Santa Clara, CA, USA).
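The cropping and augmentation protocol can be sketched as follows; the pairing logic is only an illustration, and torch.optim.Adam appears solely as a stand-in for the Adan optimizer [42], which is a third-party implementation.

```python
import random
import torch
import torch.nn.functional as F

def augment_pair(lr, hr, patch=48, scale=4):
    """Paired random crop (48x48 on the LR image) plus rotation by a multiple of 90°
    and an optional horizontal flip, matching the augmentation described above.
    Tensors are (C, H, W); the HR crop is the corresponding scale-times-larger region."""
    _, h, w = lr.shape
    x = random.randrange(0, w - patch + 1)
    y = random.randrange(0, h - patch + 1)
    lr_p = lr[:, y:y + patch, x:x + patch]
    hr_p = hr[:, y * scale:(y + patch) * scale, x * scale:(x + patch) * scale]
    k = random.randint(0, 3)                        # 0°, 90°, 180°, or 270°
    lr_p, hr_p = torch.rot90(lr_p, k, (1, 2)), torch.rot90(hr_p, k, (1, 2))
    if random.random() < 0.5:                       # horizontal flip
        lr_p, hr_p = torch.flip(lr_p, (2,)), torch.flip(hr_p, (2,))
    return lr_p, hr_p

# L1 loss as in the paper; Adam shown only as a stand-in for the Adan optimizer [42]
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, betas=(0.9, 0.99))
# loss = F.l1_loss(model(lr_batch), hr_batch)
```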

4.2. Comparison with Other Network Architectures

4.2.1. Quantitative Analysis

We evaluated our proposed network against state-of-the-art lightweight single-image super-resolution models: IMDN, RFDN, RLFN, CFSR, and SPAN. Performance testing was conducted on benchmark datasets (Set5, Set14, Urban100, and Manga109) across multiple upscaling factors (×2, ×3, ×4). To ensure a fair comparison, all models were implemented using their publicly available training code and evaluated following standard benchmark protocols. As shown in Table 1, our method demonstrates superior performance across all upscaling factors (×2, ×3, ×4), with particularly notable advantages on the ×4 scale. Compared to SPAN, our approach achieved improvements of 0.22 dB and 0.26 dB on the Urban100 and Manga109 datasets, respectively, with even larger margins over CFSR. Additionally, our method showed gains of 0.10 dB and 0.17 dB on Set14. These results validate our method’s robust capability in reconstructing complex scenes and high-frequency textures. Specifically, MSWSR requires approximately 34% fewer FLOPs than SPAN (257.6G vs. 391.9G) and 27% fewer parameters than RFDN (316K vs. 433K) while delivering higher PSNR values. These percentage-based comparisons make the efficiency advantages of our approach immediately apparent and provide clear context for our model’s computational benefits in resource-constrained environments.
Our method also demonstrates superior performance on the ×2 and ×3 scales. On the ×2 scale, the PSNR improvements over CFSR and SPAN reached 0.15 dB and 0.14 dB on the Set5 and Urban100 datasets, respectively. The performance gains were even more substantial on the ×3 scale, achieving a 0.45 dB improvement on Urban100 and surpassing SPAN by 0.35 dB on Manga109. On the ×4 scale, our model achieves superior performance with only 316K parameters, whereas CFSR and SPAN require 303K and 426.3K parameters, respectively, while delivering lower reconstruction quality. This lightweight design makes our network particularly suitable for practical applications, especially on resource-constrained devices. It is worth noting that further parameter reduction could be achieved through techniques such as weight quantization and network pruning; however, our preliminary experiments showed that these approaches led to significant performance degradation.
Additionally, Figure 5 shows the training curves of our proposed MSWSR, CFSR, and SPAN, all with comparable computational complexity. MSWSR exhibits better convergence in the early stages of training without requiring extensive iterations. We conducted inference time testing on a Redmi 13 Pro smartphone. Our method achieved an inference speed of 85 ms/frame, thus enabling real-time processing on resource-constrained devices. This balance between performance and efficiency makes MSWSR particularly suitable for edge computing scenarios such as mobile photography enhancement and security monitoring. The improvements in super-resolution performance are not only reflected in quantitative metrics but are also significant in practical applications.

4.2.2. Visual Comparison

As shown in Figure 6, our method demonstrates significant visual advantages across multiple benchmark datasets on the ×4 scale, particularly excelling in texture detail and high-frequency information recovery. In the BSDS100 dataset, our network accurately reconstructs architectural surface textures, while other methods exhibit varying degrees of blurring or distortion. For the Barbara image from Set14, our approach faithfully preserves stripe continuity, effectively suppressing the common distortions and artifacts seen in other methods. Furthermore, on the structurally complex Urban100 dataset, our method excels in reconstructing railing details within the red-boxed regions. It maintains both smoothness and continuity of lines while avoiding the common fragmentation and blurring observed in other approaches. Overall, on the ×4 scale, our method demonstrates superior detail reconstruction capability compared to existing approaches, further validating its exceptional performance and practical value at higher upscaling factors.

4.3. Ablation Studies

To ensure the fairness of the experiments, we dynamically adjust the number of MWBs so that the parameter count of every ablation network remains around 310K.

4.3.1. Effects of RepConv and WTConv

To investigate the specific contributions of WTConv and RepConv in enhancing model performance, we designed ablation experiments at the ×4 upscaling rate. Table 2 shows that after individually adding WTConv, the model’s PSNR on the Urban100 dataset improved by 0.44 dB. This demonstrates that WTConv significantly optimizes the model’s reconstruction capability in complex scenarios by expanding the receptive field and enhancing low-frequency feature modeling. Similarly, after introducing RepConv alone, the model’s PSNR on the Urban100 dataset improved by 0.38 dB. This indicates that re-parameterization techniques can effectively enhance local feature extraction capabilities while improving inference efficiency. When WTConv and RepConv are combined, the model’s performance achieves its maximum improvement: on the Urban100 dataset, the PSNR increases by 0.61 dB compared to the baseline model, demonstrating their synergistic effect on capturing low-frequency information and enhancing detail representation. The visual comparison in Figure 7 confirms RepConv’s critical role in preserving high-frequency edge details (e.g., railing sharpness) and WTConv’s necessity for maintaining low-frequency structural coherence. This performance gain beyond their individual contributions effectively proves the rationality and practical value of the modular design.

4.3.2. Effects of SSA

To verify the performance improvement brought by the SSA module, we conducted ablation experiments, with the results shown in Table 3. Upon removing the SSA module, PSNR decreased by 0.08 dB, 0.26 dB, and 0.33 dB on the BSDS100, Urban100, and Manga109 datasets, respectively. These results demonstrate that the SSA module effectively enhances the model’s feature-capturing capabilities by adaptively allocating higher weights to critical regions, leading to significant performance improvements in detail restoration and global modeling.

5. Conclusions

In this paper, we have proposed a network for lightweight super-resolution tasks, named MSWSR, which achieves state-of-the-art performance. MSWSR utilizes the multi-scale feature module (MFM) as a core component. MFM combines re-parameterized convolution and the wavelet transform to efficiently model multi-scale features while significantly reducing computational overhead. In addition, the introduced GAU and lightweight SSA further enhance feature discrimination and optimize feature expression efficiency. Extensive experimental results show that MSWSR significantly outperforms CFSR and SPAN on several benchmark datasets while maintaining a low parameter count (316K). These results fully demonstrate the effectiveness of the method in lightweight image super-resolution tasks. While our MSWSR model has demonstrated exceptional performance on standard benchmark datasets, we recognize opportunities for further exploration. As a future research direction, we plan to extend our approach to address real-world image super-resolution scenarios, where degradation processes are more complex than bicubic downsampling. By adapting our lightweight architecture to handle diverse real-world degradations such as motion blur, sensor noise, and compression artifacts, we aim to further enhance the practical utility of our approach. This extension would leverage the computational efficiency of MSWSR, making it particularly valuable for mobile devices and resource-constrained environments where both processing power and high-quality image reconstruction are essential.

Author Contributions

Methodology, X.Y.; Data curation, Y.X.; Writing—original draft, X.Y. and W.G.; Writing—review & editing, W.S., W.G. and K.N.; Supervision, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Z.; Chen, J.; Hoi, S.C. Deep learning for image super-resolution: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
  2. Yue, T.; Lu, X.; Cai, J.; Chen, Y.; Chu, S. YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO. arXiv 2024, arXiv:2412.19878. [Google Scholar]
  3. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  4. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  5. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  6. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 10 July 2017; pp. 136–144. [Google Scholar]
  7. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  8. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 7 October 2018. [Google Scholar]
  9. Li, M.; Zhao, Y.; Zhang, F.; Luo, B.; Yang, C.; Gui, W.; Chang, K. Multi-scale feature selection network for lightweight image super-resolution. Neural Netw. 2024, 169, 352–364. [Google Scholar] [CrossRef] [PubMed]
  10. Guo, Y.; Tian, C.; Liu, J.; Di, C.; Ning, K. HADT: Image super-resolution restoration using Hybrid Attention-Dense Connected Transformer Networks. Neurocomputing 2025, 614, 128790. [Google Scholar] [CrossRef]
  11. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  12. Li, W.; Lu, X.; Qian, S.; Lu, J.; Zhang, X.; Jia, J. On efficient transformer-based image pre-training for low-level vision. arXiv 2021, arXiv:2112.10175. [Google Scholar]
  13. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 24 November 2021; pp. 1833–1844. [Google Scholar]
  14. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Computer Vision—ECCV. Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207. [Google Scholar]
  15. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
  16. Ray, A.; Kumar, G.; Kolekar, M.H. CFAT: Unleashing Triangular Windows for Image Super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26120–26129. [Google Scholar]
  17. Wu, G.; Jiang, J.; Jiang, J.; Liu, X. Transforming image super-resolution: A ConvFormer-based efficient approach. arXiv 2024, arXiv:2401.05633. [Google Scholar] [CrossRef] [PubMed]
  18. Xie, C.; Zhang, X.; Li, L.; Meng, H.; Zhang, T.; Li, T.; Zhao, X. Large kernel distillation network for efficient single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1283–1292. [Google Scholar]
  19. Wang, Y.; Li, Y.; Wang, G.; Liu, X. Multi-scale attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5950–5960. [Google Scholar]
  20. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  21. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Computer Vision—ECCV 2024, Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025. [Google Scholar]
  22. Daubechies, I. Ten Lectures on Wavelets; SIAM: Philadelphia, PA, USA, 1992. [Google Scholar]
  23. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  24. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part II; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
  25. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  26. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  27. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Computer Vision—ECCV 2020 Workshops, Proceedings of the ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  28. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 833–843. [Google Scholar]
  29. Sun, L.; Pan, J.; Tang, J. Shufflemixer: An efficient convnet for image super-resolution. Adv. Neural Inf. Process. Syst. 2022, 35, 17314–17326. [Google Scholar]
  30. Wang, Z.; Gao, G.; Li, J.; Yan, H.; Zheng, H.; Lu, H. Lightweight feature de-redundancy and self-calibration network for efficient image super-resolution. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–15. [Google Scholar] [CrossRef]
  31. Li, Y.; Deng, Z.; Cao, Y.; Liu, L. GRFormer: Grouped Residual Self-Attention for Lightweight Single Image Super-Resolution. In Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 28 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 9378–9386. [Google Scholar]
  32. Kim, J.; Nang, J.; Choe, J. LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution. arXiv 2024, arXiv:2409.03516. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  34. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  35. Yu, J.; Fan, Y.; Yang, J.; Xu, N.; Wang, Z.; Wang, X.; Huang, T. Wide activation for efficient and accurate image super-resolution. arXiv 2018, arXiv:1808.08718. [Google Scholar]
  36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  37. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  38. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference (BMVC); BMVA Press: Surrey, UK, 2012; pp. 1–10. [Google Scholar]
  39. Martin, D.R.; Fowlkes, C.C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the ICCV, Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423. [Google Scholar]
  40. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  41. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  42. Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9508–9520. [Google Scholar] [CrossRef] [PubMed]
Figure 1. PSNR, FLOPs, and parameter counts of different SISR models on the Urban100 dataset for the ×4 SR task.
Figure 2. The overall architecture of MSWSR with a detailed illustration of its key components: (a) The complete network structure showing the feature extraction pathway, (b) multi-scale wavelet block (MWB), (c) spatial selection attention (SSA) module that adaptively assigns weights to enhance feature discrimination, (d) the multi-scale feature module (MFM) and gated attention unit (GAU).
Figure 3. The overall architecture of RepConv.
Figure 4. An example of the WTConv operation.
Figure 5. Convergence speed comparison on Urban100 dataset for ×4 super-resolution task.
Figure 6. Visual comparison results.
Figure 7. Visual comparison result.
Table 1. Quantitative comparison of average PSNR (dB)/SSIM among lightweight models. The best results are highlighted in red.

Scale | Method | #Param | FLOPs | Set5 (PSNR/SSIM) | Set14 (PSNR/SSIM) | BSDS100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
×2 | IMDN | 694K | 635.4G | 37.89/0.9606 | 33.51/0.9169 | 32.12/0.8990 | 32.00/0.9267 | 38.61/0.9770
×2 | RFDN | 417K | 365.3G | 37.78/0.9606 | 33.35/0.9166 | 32.09/0.8991 | 31.79/0.9254 | 38.29/0.9764
×2 | RLFN | 526K | 461.7G | 37.88/0.9606 | 33.44/0.9168 | 32.13/0.8991 | 31.88/0.9259 | 38.39/0.9766
×2 | CFSR | 298K | 260.2G | 37.86/0.9605 | 33.44/0.9169 | 32.12/0.8992 | 31.77/0.9352 | 38.31/0.9764
×2 | SPAN | 410K | 377.5G | 37.94/0.9608 | 33.47/0.9165 | 32.14/0.8993 | 31.92/0.9265 | 38.30/0.9765
×2 | Ours | 312K | 243.3G | 38.01/0.9610 | 33.71/0.9193 | 32.22/0.9003 | 32.29/0.9301 | 38.86/0.9774
×3 | IMDN | 703K | 643.4G | 34.36/0.9272 | 30.28/0.8412 | 29.05/0.8045 | 28.09/0.8504 | 33.48/0.9438
×3 | RFDN | 424K | 371.4G | 34.18/0.9260 | 30.23/0.8406 | 29.02/0.8037 | 27.90/0.8475 | 33.23/0.9422
×3 | RLFN | 533K | 468.2G | 34.24/0.9266 | 30.26/0.8412 | 29.04/0.8412 | 27.99/0.8489 | 33.28/0.9426
×3 | CFSR | 294K | 266.2G | 34.23/0.9262 | 30.25/0.8406 | 29.04/0.8044 | 27.90/0.8475 | 33.30/0.9428
×3 | SPAN | 417K | 383.5G | 34.28/0.9268 | 30.27/0.8417 | 29.06/0.8049 | 28.04/0.8499 | 33.39/0.9436
×3 | Ours | 307K | 249.6G | 34.40/0.9277 | 30.35/0.8437 | 29.12/0.8067 | 28.22/0.8548 | 33.68/0.9454
×4 | IMDN | 715K | 654.5G | 32.09/0.8942 | 28.54/0.7810 | 27.52/0.7340 | 25.96/0.7819 | 30.33/0.9063
×4 | RFDN | 433K | 380.2G | 32.13/0.8943 | 28.50/0.7795 | 27.51/0.7339 | 25.92/0.7803 | 30.20/0.9051
×4 | RLFN | 543K | 477.3G | 31.97/0.8931 | 28.47/0.7795 | 27.51/0.7342 | 25.88/0.7803 | 30.12/0.9035
×4 | CFSR | 303K | 274.6G | 32.00/0.8930 | 28.49/0.7797 | 27.52/0.7343 | 25.84/0.7781 | 30.15/0.9045
×4 | SPAN | 426.3K | 391.9G | 32.08/0.8942 | 28.53/0.7810 | 27.55/0.7351 | 25.95/0.7812 | 30.34/0.9064
×4 | Ours | 316K | 257.6G | 32.26/0.8966 | 28.67/0.7843 | 27.62/0.7379 | 26.17/0.7896 | 30.60/0.9092
Table 2. Multi-scale ablation experiment. √ indicates that the corresponding convolution method is used in this configuration.

WTConv | RepConv | BSDS100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
√ | – | 27.57/0.7365 | 26.00/0.7844 | 30.39/0.9076
– | √ | 27.54/0.7349 | 25.94/0.7811 | 30.30/0.9061
√ | √ | 27.62/0.7379 | 26.17/0.7896 | 30.60/0.9092
Table 3. SSA ablation experiment. √ indicates that the corresponding SSA is used in this configuration.

SSA | BSDS100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
– | 27.54/0.7351 | 25.91/0.7801 | 30.27/0.9057
√ | 27.62/0.7379 | 26.17/0.7896 | 30.60/0.9092
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
