Article

MSWSR: A Lightweight Multi-Scale Feature Selection Network for Single-Image Super-Resolution Methods

1 College of Information Science and Technology, North China University of Technology, Beijing 100144, China
2 School of Electrical and Control Engineering, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(3), 431; https://doi.org/10.3390/sym17030431
Submission received: 15 February 2025 / Revised: 3 March 2025 / Accepted: 10 March 2025 / Published: 13 March 2025
(This article belongs to the Section Computer)

Abstract

Single-image super-resolution (SISR) methods based on convolutional neural networks (CNNs) have achieved breakthrough progress in reconstruction quality. However, their high computational costs and model complexity have limited their applications in resource-constrained devices. To address this, we propose MSWSR (multi-scale wavelet super-resolution), a lightweight multi-scale feature selection network that exploits both symmetric and asymmetric feature patterns. MSWSR achieves efficient feature extraction and fusion through modular design. The core modules include a multi-scale feature module (MFM) and a gated attention unit (GAU). The MFM employs a symmetric multi-branch structure to efficiently integrate multi-scale features and enhance low-frequency information modeling. The GAU combines the spatial attention mechanism with the gating mechanism to further optimize symmetric feature representation capability. Moreover, a lightweight spatial selection attention (SSA) module adaptively assigns weights to key regions while maintaining structural symmetry in feature space. This significantly improves reconstruction quality in complex scenes. In ×4 super-resolution tasks, compared to SPAN, MSWSR improves PSNR by 0.22 dB on Urban100 and 0.26 dB on Manga109. The model contains only 316K parameters, which is substantially lower than existing approaches. Extensive experiments demonstrate that MSWSR significantly reduces computational overhead while maintaining reconstruction quality, providing an effective solution for resource-constrained applications.

1. Introduction

The single-image super-resolution (SR) method reconstructs high-resolution (HR) images from low-resolution (LR) inputs and plays a crucial role in computer vision and image processing [1]. This technique is widely applied in medical imaging, surveillance, and security to enhance image quality and improve the performance of other computer vision tasks [2,3]. Mobile devices such as the iPhone 16 Pro Max, Huawei P70 Pro, and Xiaomi 15 are evolving rapidly, which underscores the need for efficient and accurate super-resolution methods that improve image display quality while meeting the processing demands of these devices. Investigating super-resolution methods for such devices therefore holds significant practical importance.
In the past decade, convolutional neural networks (CNNs) [4,5,6,7,8,9] have been the cornerstone of super-resolution methods. Recently, however, the emergence of transformer-based approaches [9,10,11,12,13,14,15,16] has started to shift this paradigm, and they have gradually surpassed CNNs in prominence. Specifically, transformers hold a distinct advantage over CNNs through their multi-head self-attention mechanism. This mechanism enables the extraction of global image features, whereas CNNs are structurally limited to capturing local features. To address the performance gap between CNNs and vision transformers (ViTs), several studies have explored strategies, such as increasing the size of convolutional kernels to achieve a larger receptive field. Wu et al. [17] employed large-kernel convolutions as feature mixers to replace attention modules. This approach effectively models long-range dependencies and broad receptive fields with minimal computational overhead. Xie et al. [18] developed LKDN to enhance model performance and computational efficiency through large-kernel attention (LKA) mechanisms. Wang et al. [19] proposed a multi-scale attention network (MAN), which consists of multi-scale large-kernel attention (MLKA) mechanisms and gated spatial attention units (GSAUs). The MLKA mechanism modifies large-kernel attention through multi-scale and gating schemes to obtain rich attention maps at various granularity levels. This design enables the network to aggregate global and local information while avoiding potential blocking artifacts. Empirical evidence shows that performance saturates at a convolution kernel size of 7 × 7. Beyond this threshold, further increases in kernel size not only fail to provide significant improvements but may actually lead to performance degradation. Although Ding et al. [20] proposed that larger kernels could still be utilized through carefully designed kernel structures, the resulting kernels often become over-parameterized, and model performances still saturate before achieving global receptive fields.
To address the aforementioned issues, we propose a lightweight multi-scale wavelet super-resolution network, MSWSR. The core component of the network is the multi-scale wavelet block (MWB), which consists of a multi-scale feature module (MFM) and a gated attention unit (GAU). Through this structural design, MWB effectively integrates multi-scale feature representation with attention mechanisms while maintaining a lightweight architecture and significantly improving performance. Specifically, MFM adopts a multi-branch architecture for the efficient extraction and fusion of multi-scale features, comprising three key branches: (1) A standard 3 × 3 convolution branch extracts basic feature information. (2) A re-parameterization convolution branch dynamically adjusts the receptive field during training through strip convolution, dilated convolution, and 3 × 3 convolution; during inference, these operations are folded into a single equivalent 3 × 3 convolution to improve computational efficiency. (3) The wavelet transform convolution branch (WTConv) [21,22] captures low-frequency features through time-frequency decomposition. This branch expands the receptive field through wavelet transform while using small-kernel convolutions to enhance its low-frequency response modeling capability. The features from these multi-branch structures are unified during the fusion phase to construct rich multi-scale feature representations. The GAU module further enhances feature discriminative ability by integrating spatial attention and gating mechanisms. To dynamically highlight key regional features and suppress redundant information, we introduce a lightweight spatial selection attention (SSA) module. The SSA module achieves precise weight redistribution through the adaptive learning of importance distribution across feature map positions, enabling the model to focus on features crucial to reconstruction quality. This module significantly improves the global feature discriminative ability and task relevance while maintaining computational efficiency. The proposed architecture adopts the MetaFormer [23] paradigm to organically combine the MFM, GAU, and SSA modules. These modules are stacked in a modular structure that maximizes the synergy between multi-scale mechanisms, attention mechanisms, and feature-selection capabilities. As shown in Figure 1, in the ×4 super-resolution task, MSWSR demonstrates significant advantages on the Urban100 dataset. It achieves superior performance compared to state-of-the-art methods while requiring fewer parameters and lower FLOPs. This effectively balances reconstruction quality, model size, and computational efficiency. These results thoroughly validate the effectiveness of the proposed multi-scale feature selection strategy in balancing model efficiency and performance, providing an efficient solution for image super-resolution reconstruction in resource-constrained scenarios. Furthermore, we conducted comprehensive ablation studies to investigate the individual and synergistic contributions of each proposed component. Notably, all experiments were performed under a lightweight model constraint. These experiments quantitatively demonstrate that our multi-scale feature module, gated attention unit, and spatial selection attention module each significantly enhance reconstruction quality without substantially increasing computational complexity.
Overall, our contributions are as follows:
  • We propose a lightweight multi-scale feature selection network (MSWSR) for efficient feature fusion and modeling. Through modular design and multi-scale feature extraction strategies, MSWSR effectively balances model performance and computational complexity.
  • We propose two key components: MFM and GAU. MFM enhances multi-scale feature modeling through a multi-branch structure, while GAU combines spatial attention with the gating mechanism to optimize feature representation through their synergistic interaction.
  • Extensive experiments demonstrate MSWSR’s superior performance on benchmark datasets. With only 316K parameters, it achieves PSNR improvements of 0.22 dB and 0.26 dB on the Urban100 and Manga109 datasets for the ×4 super-resolution task, validating its effectiveness in resource-constrained scenarios.

2. Related Work

2.1. CNN-Based SR Methods

Single-image super-resolution (SISR) methods aim to recover high-resolution images from low-resolution inputs. Convolutional neural network (CNN) methods have become dominant in this field. In 2014, Dong et al. [4] proposed SRCNN. They used a three-layer convolutional network for end-to-end training. This network learned the mapping between low-resolution and high-resolution images and significantly improved reconstruction quality. SRCNN faced computational efficiency challenges. To address this, Dong et al. [24] introduced FSRCNN in 2016. They optimized the network structure and used deconvolution for upsampling. The reduction in convolutional layers made real-time super-resolution possible. Kim et al. [5] developed VDSR with deep residual learning and a deeper network structure to solve the vanishing gradient problem. In 2017, Lim et al. [6] proposed EDSR by removing batch normalization layers from VDSR and deepening the structure, showing excellent performance in high-magnification scenarios. Zhang et al. [25] introduced RDN (Residual Dense Network) in 2018, which incorporated dense residual blocks and showed particular strength in detail recovery and edge sharpening.
Recent years have seen the continued exploration of novel architectures. Dai et al. [15] proposed SAN with second-order attention mechanisms to improve the capture of image details. Niu et al. [14] introduced the Holistic Attention Network (HAN) with global attention mechanisms, showing excellent performance in recovering complex image details, especially for scenes with complex backgrounds. These new attention mechanisms have improved both visual quality and computational efficiency.

2.2. Transformer-Based Architectures

Chen et al. [11] introduced HAT, which combines channel attention with window-based self-attention mechanisms to leverage both global statistics and local features. The incorporation of overlapping cross-attention modules enhances feature interaction between adjacent windows, enabling more precise reconstruction with broader spatial information. Li et al. [12] developed EDT with dual-transformer structures to model low-resolution and high-resolution features separately. EDT optimizes computational efficiency through a two-stage attention mechanism while maintaining high restoration accuracy with reduced computational cost. Ray et al. [16] introduced CFAT with innovative non-overlapping triangular window techniques alongside traditional rectangular windows, which effectively mitigates boundary distortion and expands the model’s spatial processing capability.

2.3. Lightweight SR Models

In recent years, lightweight SISR networks have become a research focus, with researchers proposing various innovative methods to balance model performance and computational complexity for resource-constrained scenarios. Hui et al. [26] proposed the Information Multi-Distillation Network (IMDN), which enhances feature extraction efficiency through a multi-level distillation mechanism for feature splitting and fusion, combined with contrast-aware channel attention (CCA). Building upon this work, Liu et al. [27] further optimized IMDN by introducing the Residual Feature Distillation Network (RFDN) with feature distillation connections (FDC) and shallow residual blocks (SRB), maintaining high super-resolution reconstruction while reducing model complexity. Li et al. [28] developed the Blueprint Separable Residual Network (BSRN), which achieves more efficient feature extraction by replacing traditional convolutions with blueprint separable convolutions (BSConv) and incorporating enhanced spatial attention (ESA) modules. Sun et al. [29] introduced the ShuffleMixer network, designing an efficient feature mixing module by combining large convolution kernels with channel splitting and shuffling operations, achieving good performance with extremely low parameters.
Wang et al. [30] proposed the Feature De-redundancy [8] and Self-Calibration network (FDSCSR) with feature de-redundancy and self-calibration modules (FDSCB). The network incorporates local feature fusion modules (LFFM) to improve computational efficiency and feature integration capability. Xie et al. [18] introduced the large-kernel distillation network (LKDN), which explores the potential of large convolution kernels by combining large-kernel attention (LKA) modules with re-parameterization techniques. Wang et al. [19] proposed the multi-scale attention network (MAN) with multi-scale large-kernel attention (MLKA) mechanisms and gated spatial attention units (GSAUs). This architecture effectively enhances feature representation while avoiding blocking artifacts.
Transformer-based research has opened new directions for lightweight super-resolution networks. Li et al. [31] proposed GRFormer featuring Grouped Residual Self-Attention (GRSA) modules. By introducing group residuals in QKV linear layers and ES-RPB, this network significantly reduces parameters and computational cost. Kim et al. [32] introduced the low-to-high multi-level visual transformers (LMLTs) to capture information at different levels through multi-head mechanisms. The network uses parallel stacking to reduce computational complexity while effectively addressing window boundary issues.
Despite transformers’ advantages in capturing long-range dependencies and global context, their inference speed remains slower than that of CNN-based models, particularly on resource-constrained devices. Consequently, CNNs remain a preferred choice for lightweight super-resolution networks, as they achieve efficient local feature extraction and reconstruction with lower computational complexity through deep residual structures and large convolution kernels. While dynamic convolutions achieve superior performance in specific scenarios, their runtime weight generation mechanism introduces computational overhead that directly conflicts with the lightweight design principles we aim to maintain.
The proposed MSWSR method distinguishes itself from existing lightweight SR methods through its multi-branch parallel architecture, which enables simultaneous multi-scale feature extraction. This architecture ensures comprehensive feature representation without introducing computational redundancy. Additionally, MSWSR employs wavelet transform convolution (WTConv) for efficient receptive field expansion, achieving logarithmic parameter scaling and enhanced low-frequency information modeling. This combination enables superior reconstruction quality while maintaining minimal parameters (316K). It also reduces computational complexity compared to current state-of-the-art approaches. Methods such as LKDN and MAN rely on parameter-intensive large-kernel convolutions, whereas our approach achieves better efficiency.

3. Materials and Methods

3.1. Architecture Overview

Given a low-resolution image I_LR ∈ R^{C×H×W}, where C and H×W represent the number of channels and the spatial resolution, respectively, MSWSR first extracts features X_0 ∈ R^{C×H×W} through a 3×3 convolution. Then, we design an MWB composed of multiple attention blocks, which takes X_0 as the input and produces deeper features X_1 ∈ R^{D×H×W} through feature enhancement. To fully exploit discriminative feature representations, we introduce the spatial selection attention (SSA) module, which adaptively learns attention weights in the spatial dimension to highlight features in regions with strong representation capability while suppressing redundant and noisy information, thereby generating feature maps X_2 ∈ R^{D×H×W} with enhanced discriminative ability. This mechanism of feature extraction and attention enhancement is repeated m times in the network, forming a deep feature extraction chain. The feature extraction process at the k-th iteration can be expressed as
$$X_{k+1} = f_{\mathrm{SSA}}\left(f_{\mathrm{MWB}}(X_k)\right), \quad k \in [1, m-1]$$
where f_SSA denotes the spatial selection attention mapping function and f_MWB represents the mapping function of the multi-scale wavelet block. The SSA module acts as a feature filter in each iteration, continuously optimizing the feature representation capability. After m iterations, we obtain the feature X_m. Finally, the features are mapped to the high-resolution reconstruction space through a 3×3 convolution layer, and global residual connections [33] are employed to fuse the initial features X_0 with the final features to obtain X ∈ R^{D×H×W}. The high-resolution image I_HR ∈ R^{C×rH×rW} is then generated through an image reconstruction layer consisting of a Conv3×3 and PixelShuffle [34], where r denotes the upscaling factor. This multi-stage feature extraction design, integrated with adaptive feature selection mechanisms, not only effectively captures multi-scale discriminative features but also preserves original image information through residual learning, thereby achieving high-quality super-resolution reconstruction. The following sections detail each module.
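For concreteness, the pipeline described above can be sketched in PyTorch as follows. The MWB and SSA blocks are passed in as placeholders, and all class names and default hyper-parameters (12 blocks, 48 channels, matching the configuration in Section 4.1) are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class MSWSRSketch(nn.Module):
    """Sketch of the MSWSR pipeline: shallow conv -> m x (MWB -> SSA) -> conv ->
    global residual -> Conv3x3 + PixelShuffle reconstruction. The mwb/ssa factories
    stand in for the blocks described in Sections 3.2 and 3.3."""
    def __init__(self, in_ch=3, dim=48, num_blocks=12, scale=4, mwb=None, ssa=None):
        super().__init__()
        self.head = nn.Conv2d(in_ch, dim, 3, padding=1)          # shallow feature extraction (X_0)
        self.blocks = nn.ModuleList(
            [nn.Sequential(mwb() if mwb else nn.Identity(),
                           ssa() if ssa else nn.Identity())
             for _ in range(num_blocks)]
        )
        self.body_tail = nn.Conv2d(dim, dim, 3, padding=1)       # maps deep features back
        self.upsampler = nn.Sequential(                          # image reconstruction layer
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        x0 = self.head(x)                  # X_0
        feat = x0
        for blk in self.blocks:            # m iterations of f_SSA(f_MWB(.))
            feat = blk(feat)
        feat = self.body_tail(feat) + x0   # global residual connection
        return self.upsampler(feat)        # I_HR of size (C, rH, rW)
```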

3.2. Multi-Scale Wavelet Block

In this section, we introduce the core module of our network, the MWB, as illustrated in Figure 2. Previous studies have shown that expanding features before activation layers enhances the network’s nonlinear capability [35]. Therefore, we incorporate 1×1 convolutions before and after the MFM for feature expansion and compression, respectively.
In previous multi-scale studies, many network architectures input all features into each branch, leading to increased parameter count and redundant computations. To address this issue, we implemented a slice operation after the 1×1 convolution. For simplicity, we evenly distributed the input features among branches, meaning only a portion of the expanded features was fed into each branch.
$$Z_i = \begin{cases} w(x_i), & i = 1 \\ \mathrm{RepC}(x_i), & i = 2 \\ \mathrm{WT}(x_i), & i = 3 \end{cases}$$
where w(.) represents the 3×3 convolution operation, RepC denotes the re-parameterization convolution mapping function, and WT represents the wavelet convolution operation.
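The slice-and-branch strategy can be illustrated with the minimal PyTorch sketch below. The plain 3×3 convolutions standing in for the RepConv and WTConv branches, the expansion ratio, and the module names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MFMSketch(nn.Module):
    """Sketch of the multi-branch split in MFM: expand with a 1x1 conv, slice the
    channels evenly across the three branches, run each branch on its own slice,
    then fuse and compress with another 1x1 conv."""
    def __init__(self, dim=48, expand=2):
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.branch_conv = nn.Conv2d(hidden // 3, hidden // 3, 3, padding=1)       # plain 3x3 branch
        self.branch_rep = nn.Conv2d(hidden // 3, hidden // 3, 3, padding=1)        # stand-in for RepConv
        self.branch_wt = nn.Conv2d(hidden - 2 * (hidden // 3),
                                   hidden - 2 * (hidden // 3), 3, padding=1)       # stand-in for WTConv
        self.compress = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        z = self.expand(x)
        c = z.shape[1] // 3
        x1, x2, x3 = z[:, :c], z[:, c:2 * c], z[:, 2 * c:]   # slice: each branch sees only a share
        z1 = self.branch_conv(x1)    # Z_1 = w(x_1)
        z2 = self.branch_rep(x2)     # Z_2 = RepC(x_2)
        z3 = self.branch_wt(x3)      # Z_3 = WT(x_3)
        return self.compress(torch.cat([z1, z2, z3], dim=1))
```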
RepConv: The re-parameterization convolution module is designed to enhance local feature extraction efficiency while incorporating dynamic receptive field expansion, as shown in Figure 3. Drawing inspiration from human visual perception mechanisms, particularly the foveal focusing characteristics, this module adaptively captures spatial contexts. The training phase integrates multiple convolutional operations, including standard 3×3 convolutions, decomposed 3×1 and 1×3 convolutions, and dilated convolutions, to establish comprehensive spatial relationships. This architectural design can be mathematically formulated as
$$Y_1 = \mathrm{DWConv}_{3\times3}(x_2) + \mathrm{DWConv}_{1\times3}(x_2) + \mathrm{DWConv}_{3\times1}(x_2) + \mathrm{DWConv}_{2\times2}(x_2)$$
At the inference time, all operations are structurally re-parameterized into a single 3 × 3 convolution, which effectively optimizes computational efficiency while maintaining the model’s representational capacity.
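The following sketch illustrates the folding step under stated assumptions: only the 3×3, 1×3, and 3×1 depthwise branches are folded (the dilated branch is omitted for brevity), and the check at the end verifies that the merged kernel reproduces the multi-branch output.

```python
import torch
import torch.nn.functional as F

def fold_rep_branches(w3x3, w1x3, w3x1, b3x3, b1x3, b3x1):
    """Fold parallel depthwise 3x3, 1x3, and 3x1 branches into one equivalent 3x3
    kernel: zero-pad the smaller kernels to 3x3 and sum weights and biases.
    Valid because convolution is linear in its weights."""
    w = w3x3.clone()
    w += F.pad(w1x3, (0, 0, 1, 1))   # pad height: 1x3 -> 3x3
    w += F.pad(w3x1, (1, 1, 0, 0))   # pad width:  3x1 -> 3x3
    return w, b3x3 + b1x3 + b3x1

# quick check: train-time multi-branch output equals the folded single conv
C = 8
x = torch.randn(1, C, 16, 16)
w3x3, b3x3 = torch.randn(C, 1, 3, 3), torch.randn(C)
w1x3, b1x3 = torch.randn(C, 1, 1, 3), torch.randn(C)
w3x1, b3x1 = torch.randn(C, 1, 3, 1), torch.randn(C)

y_multi = (F.conv2d(x, w3x3, b3x3, padding=(1, 1), groups=C)
           + F.conv2d(x, w1x3, b1x3, padding=(0, 1), groups=C)
           + F.conv2d(x, w3x1, b3x1, padding=(1, 0), groups=C))

w, b = fold_rep_branches(w3x3, w1x3, w3x1, b3x3, b1x3, b3x1)
y_single = F.conv2d(x, w, b, padding=1, groups=C)
assert torch.allclose(y_multi, y_single, atol=1e-4)
```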
WTConv [21] introduces a novel architectural paradigm with distinctive computational advantages. The hierarchical nature of WTConv enables progressive expansion of the receptive field across layers while maintaining remarkable parameter efficiency. More notably, the inherent structure of WTConv layers demonstrates superior capability in low-frequency feature capture compared to standard convolutions, attributed to the iterative wavelet transform decomposition that systematically emphasizes low-frequency components and enhances corresponding layer responses.
Traditional approaches relying on large convolution kernels for global feature extraction often suffer from over-parameterization. In contrast, WTConv leverages sophisticated time-frequency analysis principles to achieve efficient receptive field expansion and orchestrates a CNN low-frequency response through cascading mechanisms. The architecture implements cascaded wavelet transform (WT) decomposition coupled with strategically designed small-kernel convolutions, enabling the focused processing of distinct frequency bands within the progressively expanding receptive fields. Notably, for a k×k receptive field, the architecture achieves logarithmic parameter scaling with k, marking a significant improvement over conventional large-kernel approaches.
As depicted in Figure 4, the WTConv operational pipeline initiates with a wavelet transform for frequency-based content decomposition and downsampling. Subsequently, the framework applies specialized small-kernel depth-wise convolutions to the frequency-specific feature maps before reconstruction through the inverse wavelet transform (IWT).
$$Y = \mathrm{IWT}\left(\mathrm{Conv}\left(W, \mathrm{WT}(x_i)\right)\right)$$
where xi denotes the input tensor and W represents the weight tensor of k × k depth-wise kernels, with input channels four times that of xi. These operations not only segregate convolutions between frequency components but also enable smaller kernels to operate over larger regions of the original input, effectively expanding their receptive field.
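A single-level sketch of this pipeline is given below, assuming a Haar wavelet basis and a 3×3 depthwise sub-band convolution; the actual WTConv layer of [21] cascades several decomposition levels and adds a residual path, which are omitted here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarWTConvSketch(nn.Module):
    """One-level sketch of WTConv: fixed Haar decomposition (stride-2 depthwise conv),
    a small learnable depthwise conv on each sub-band, then reconstruction via the
    inverse (transposed) wavelet transform. Input height/width must be even."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
        self.register_buffer("wt_filter", filt.repeat(channels, 1, 1, 1))  # (4C, 1, 2, 2)
        self.channels = channels
        # small depthwise conv applied per sub-band (the "small kernel" of WTConv)
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=kernel_size // 2, groups=4 * channels)

    def forward(self, x):
        c = self.channels
        y = F.conv2d(x, self.wt_filter, stride=2, groups=c)   # WT: (B, C, H, W) -> (B, 4C, H/2, W/2)
        y = self.subband_conv(y)                              # frequency-specific small-kernel conv
        return F.conv_transpose2d(y, self.wt_filter, stride=2, groups=c)  # IWT back to (B, C, H, W)

# the transform pair is exactly invertible when the sub-band conv is skipped
m = HaarWTConvSketch(channels=4)
x = torch.randn(1, 4, 32, 32)
rec = F.conv_transpose2d(F.conv2d(x, m.wt_filter, stride=2, groups=4),
                         m.wt_filter, stride=2, groups=4)
assert torch.allclose(rec, x, atol=1e-5)
```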
Inspired by [18], we introduce the GAU module to enhance feature representation. GAU integrates single spatial attention with gated linear units (GLUs), thereby reducing both parameter count and computational complexity. To capture spatial information more effectively, GAU employs single-layer depth-wise convolutions for feature map weighting. Given dense transformations X and Y, the key process of GAU can be formulated as:
$$\mathrm{GAU}(X, Y) = H_{\mathrm{DW}}(X) \odot Y$$
where H_DW(·) represents the depth-wise convolution operation and ⊙ denotes element-wise multiplication.
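A minimal sketch of this gating scheme is shown below; the 1×1 projections and the 7×7 depthwise kernel size are assumptions, since the text only specifies a single-layer depthwise convolution for the spatial weighting.

```python
import torch
import torch.nn as nn

class GAUSketch(nn.Module):
    """Sketch of the gated attention unit: a 1x1 conv splits the input into two
    halves X and Y, a single depthwise conv weights X spatially, and the result
    gates Y by element-wise multiplication, i.e. GAU(X, Y) = H_DW(X) * Y."""
    def __init__(self, dim=48):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # spatial weighting H_DW
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, feat):
        x, y = self.proj_in(feat).chunk(2, dim=1)   # dense transformations X and Y
        return self.proj_out(self.dwconv(x) * y)    # gating: H_DW(X) ⊙ Y
```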

3.3. Spatial Selection Attention Module

In contrast to the conventional SE [36] architecture, our approach replaces fully connected layers with 1×1 convolutions, facilitating more efficient feature extraction while achieving parameter efficiency. The module systematically compresses two-dimensional features (H×W) from each channel into a scalar representation, transforming feature maps from [h, w, c] to [1, 1, c] dimensions and generating channel-specific weight values. SSA establishes inter-channel correlations through dual point-wise convolutions, yielding weight values with cardinality matching the channel dimension of the input feature maps. These normalized weights are subsequently applied to modulate channel-wise feature response.
To further elaborate on the design rationale of our SSA module, we provide a comparative analysis with standard attention mechanisms such as SENet and CBAM. SENet primarily focuses on channel-wise relationships through global average pooling followed by fully connected layers. In contrast, our SSA module implements a more parameter-efficient design by replacing the fully connected layers with 1×1 convolutions, significantly reducing the parameter count while maintaining feature representation capacity. CBAM, on the other hand, sequentially applies both channel and spatial attention, creating additional computational overhead, whereas SSA employs a streamlined approach that achieves effective spatial feature weighting with substantially lower computational complexity. Our approach focuses on direct spatial feature modulation through point-wise convolutions. By avoiding separate channel and spatial attention branches, it is better suited to resource-constrained super-resolution applications.
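The description above can be summarized by the following sketch; the reduction ratio and activation choices are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class SSASketch(nn.Module):
    """Sketch of the spatial selection attention (SSA) module as described above:
    global average pooling compresses each HxW map to a scalar ((B, C, H, W) -> (B, C, 1, 1)),
    two point-wise (1x1) convolutions model inter-channel correlations in place of
    fully connected layers, and the normalized weights rescale the feature responses."""
    def __init__(self, dim=48, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: [h, w, c] -> [1, 1, c]
        self.fc = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),            # first point-wise convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),            # second point-wise convolution
            nn.Sigmoid(),                                   # normalized channel weights
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))                    # modulate channel-wise responses
```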

4. Experiments and Results

4.1. Experimental Settings

We train our MSWSR network on the DIV2K [37] dataset, which contains 800 training images, and fine-tune it on the Flickr2K [6] dataset. For evaluation, we employ five benchmark datasets: Set5 [38], Set14 [38], BSD100 [39], Urban100 [40], and Manga109 [41]. To assess reconstruction quality, we adopt two widely used metrics, PSNR and SSIM, computed on the Y channel of YCbCr images. PSNR is a pixel-based image quality assessment metric, defined by the maximum pixel value (denoted as L) and the mean squared error (MSE) between images. Given a ground-truth image and its reconstruction with N pixels, denoted as I_HR and Î_HR, respectively, the Peak Signal-to-Noise Ratio (PSNR) is defined as:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{L^2}{\frac{1}{N}\sum_{i=1}^{N}\left(I_{HR}(i) - \hat{I}_{HR}(i)\right)^2}\right)$$
PSNR, operating at the pixel level, exhibits limitations in capturing localized quality variations and demonstrates inadequate correlation with human visual perception. In contrast, SSIM functions as a perceptual model aligned with the human visual system (HVS), providing a more comprehensive assessment of structural similarity between images. The Structural Similarity Index (SSIM) is defined as:
$$\mathrm{SSIM}(I_{HR}, \hat{I}_{HR}) = \left[C_l(I_{HR}, \hat{I}_{HR})\right]^{\alpha}\left[C_c(I_{HR}, \hat{I}_{HR})\right]^{\beta}\left[C_s(I_{HR}, \hat{I}_{HR})\right]^{\gamma}$$
where C_l, C_c, and C_s denote the luminance, contrast, and structure comparison functions, and α, β, and γ denote weighting parameters that modulate the relative contribution of each corresponding component.
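For reference, a small NumPy sketch of the Y-channel PSNR computation defined above is shown below; it assumes the common ITU-R BT.601 luma conversion used in SR evaluation and is not the authors' evaluation code.

```python
import numpy as np

def rgb_to_y(img):
    """ITU-R BT.601 luma used for Y-channel evaluation (img in [0, 255], shape HxWx3, RGB order)."""
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(hr, sr, peak=255.0):
    """PSNR on the Y channel, following the definition above: 10 * log10(L^2 / MSE)."""
    y_hr = rgb_to_y(hr.astype(np.float64))
    y_sr = rgb_to_y(sr.astype(np.float64))
    mse = np.mean((y_hr - y_sr) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```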
For our MSWSR configuration, we set the number of MWBs and the channel dimension to 12 and 48, respectively. During training, we randomly crop patches of size 48 × 48 pixels with a batch size of 64. Data augmentation is performed through random rotations of 90°, 180°, and 270° and horizontal flips. We employ the Adan optimizer [42] with β1 = 0.9 and β2 = 0.99 to minimize the L1 loss. The initial learning rate is set to 5 × 10−3, and the model is trained for 500,000 iterations. The implementation is based on PyTorch 1.13.0, with training conducted on an NVIDIA GeForce RTX 4060 Ti GPU (board manufactured by Colorful, Chengdu, China; GPU designed by NVIDIA Corporation, Santa Clara, CA, USA).
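The cropping and augmentation protocol can be sketched as follows; the pairing logic is only an illustration, and torch.optim.Adam appears solely as a stand-in for the Adan optimizer [42], which is a third-party implementation.

```python
import random
import torch
import torch.nn.functional as F

def augment_pair(lr, hr, patch=48, scale=4):
    """Paired random crop (48x48 on the LR image) plus rotation by a multiple of 90°
    and an optional horizontal flip, matching the augmentation described above.
    Tensors are (C, H, W); the HR crop is the corresponding scale-times-larger region."""
    _, h, w = lr.shape
    x = random.randrange(0, w - patch + 1)
    y = random.randrange(0, h - patch + 1)
    lr_p = lr[:, y:y + patch, x:x + patch]
    hr_p = hr[:, y * scale:(y + patch) * scale, x * scale:(x + patch) * scale]
    k = random.randint(0, 3)                        # 0°, 90°, 180°, or 270°
    lr_p, hr_p = torch.rot90(lr_p, k, (1, 2)), torch.rot90(hr_p, k, (1, 2))
    if random.random() < 0.5:                       # horizontal flip
        lr_p, hr_p = torch.flip(lr_p, (2,)), torch.flip(hr_p, (2,))
    return lr_p, hr_p

# L1 loss as in the paper; Adam shown only as a stand-in for the Adan optimizer [42]
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, betas=(0.9, 0.99))
# loss = F.l1_loss(model(lr_batch), hr_batch)
```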

4.2. Comparison with Other Network Architectures

4.2.1. Quantitative Analysis

We evaluated our proposed network against state-of-the-art lightweight single-image super-resolution models: IMDN, RFDN, RLFN, CFSR, and SPAN. Performance testing was conducted on benchmark datasets (Set5, Set14, Urban100, and Manga109) across multiple upscaling factors (×2, ×3, ×4). To ensure a fair comparison, all models were implemented using their publicly available training code and evaluated following standard benchmark protocols. As shown in Table 1, our method demonstrates superior performance across all upscaling factors (×2, ×3, ×4), with particularly notable advantages on the ×4 scale. Compared to SPAN, our approach achieved improvements of 0.22 dB and 0.26 dB on the Urban100 and Manga109 datasets, respectively, with even larger margins over CFSR. Additionally, our method showed gains of 0.10 dB and 0.17 dB on Set14. These results validate our method’s robust capability in reconstructing complex scenes and high-frequency textures. Specifically, MSWSR requires approximately 34% fewer FLOPs than SPAN (257.6G vs. 391.9G) and 27% fewer parameters than RFDN (316K vs. 433K) while delivering higher PSNR values. These percentage-based comparisons make the efficiency advantages of our approach immediately apparent and provide clear context for our model’s computational benefits in resource-constrained environments.
Our method also demonstrates superior performance on the ×2 and ×3 scales. On the ×2 scale, the PSNR improvements over CFSR and SPAN reached 0.15 dB and 0.14 dB on the Set5 and Urban100 datasets, respectively. The performance gains were even more substantial on the ×3 scale, achieving a 0.45 dB improvement on Urban100 and surpassing SPAN by 0.35 dB on Manga109. On the ×4 scale, our model achieves superior performance with only 316K parameters, whereas CFSR and SPAN require 303K and 426.3K parameters, respectively, while delivering lower reconstruction quality. This lightweight design makes our network particularly suitable for practical applications, especially on resource-constrained devices. It is worth noting that further parameter reduction could be achieved through techniques such as weight quantization and network pruning; however, our preliminary experiments showed that these approaches led to significant performance degradation.
Additionally, Figure 5 shows the training curves of our proposed MSWSR, CFSR, and SPAN, all with comparable computational complexity. MSWSR exhibits better convergence in the early stages of training without requiring extensive iterations. We conducted inference time testing on a Redmi 13 Pro smartphone. Our method achieved an inference speed of 85 ms/frame, thus enabling real-time processing on resource-constrained devices. This balance between performance and efficiency makes MSWSR particularly suitable for edge computing scenarios such as mobile photography enhancement and security monitoring. The improvements in super-resolution performance are not only reflected in quantitative metrics but are also significant in practical applications.

4.2.2. Visual Comparison

As shown in Figure 6, our method demonstrates significant visual advantages across multiple benchmark datasets on the ×4 scale, particularly excelling in texture detail and high-frequency information recovery. In the BSDS100 dataset, our network accurately reconstructs architectural surface textures, while other methods exhibit varying degrees of blurring or distortion. For the Barbara image from Set14, our approach faithfully preserves stripe continuity, effectively suppressing the common distortions and artifacts seen in other methods. Furthermore, on the structurally complex Urban100 dataset, our method excels in reconstructing railing details within the red-boxed regions. It maintains both smoothness and continuity of lines while avoiding the common fragmentation and blurring observed in other approaches. Overall, on the ×4 scale, our method demonstrates superior detail reconstruction capability compared to existing approaches, further validating its exceptional performance and practical value at higher upscaling factors.

4.3. Ablation Studies

To ensure the fairness of the experiments, we dynamically adjust the number of MWBs so that the parameter count of every ablation network remains around 310K.

4.3.1. Effects of RepConv and WTConv

To investigate the specific contributions of WTConv and RepConv in enhancing model performance, we designed ablation experiments at the ×4 upscaling rate. Table 2 shows that after individually adding WTConv, the model’s PSNR on the Urban100 dataset improved by 0.44 dB. This demonstrates that WTConv significantly optimizes the model’s reconstruction capability in complex scenarios by expanding the receptive field and enhancing low-frequency feature modeling. Similarly, after introducing RepConv alone, the model’s PSNR on the Urban100 dataset improved by 0.38 dB. This indicates that re-parameterization techniques can effectively enhance local feature extraction capabilities while improving inference efficiency. When WTConv and RepConv are combined, the model’s performance achieves its maximum improvement: on the Urban100 dataset, the PSNR increases by 0.61 dB compared to the baseline model, demonstrating their synergistic effect on capturing low-frequency information and enhancing detail representation. The visual comparison in Figure 7 confirms RepConv’s critical role in preserving high-frequency edge details (e.g., railing sharpness) and WTConv’s necessity for maintaining low-frequency structural coherence. This performance gain beyond their individual contributions effectively proves the rationality and practical value of the modular design.

4.3.2. Effects of SSA

To verify the performance improvement brought by the SSA module, we conducted ablation experiments, with the results shown in Table 3. Upon removing the SSA module, PSNR decreased by 0.08 dB, 0.26 dB, and 0.33 dB on the BSDS100, Urban100, and Manga109 datasets, respectively. These results demonstrate that the SSA module effectively enhances the model’s feature-capturing capabilities by adaptively allocating higher weights to critical regions, leading to significant performance improvements in detail restoration and global modeling.

5. Conclusions

In this paper, we have proposed a network for lightweight super-resolution tasks, named MSWSR, which achieves state-of-the-art performance. MSWSR utilizes the multi-scale feature module (MFM) as a core component. MFM combines re-parameterized convolution and the wavelet transform to efficiently model multi-scale features while significantly reducing computational overhead. In addition, the introduced GAU and lightweight SSA further enhance feature discrimination and optimize feature expression efficiency. Extensive experimental results show that MSWSR significantly outperforms CFSR and SPAN on several benchmark datasets while maintaining a low parameter count (316K). These results fully demonstrate the effectiveness of the method in lightweight image super-resolution tasks. While our MSWSR model has demonstrated exceptional performance on standard benchmark datasets, we recognize opportunities for further exploration. As a future research direction, we plan to extend our approach to address real-world image super-resolution scenarios, where degradation processes are more complex than bicubic downsampling. By adapting our lightweight architecture to handle diverse real-world degradations such as motion blur, sensor noise, and compression artifacts, we aim to further enhance the practical utility of our approach. This extension would leverage the computational efficiency of MSWSR, making it particularly valuable for mobile devices and resource-constrained environments where both processing power and high-quality image reconstruction are essential.

Author Contributions

Methodology, X.Y.; Data curation, Y.X.; Writing—original draft, X.Y. and W.G.; Writing—review & editing, W.S., W.G. and K.N.; Supervision, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Z.; Chen, J.; Hoi, S.C. Deep learning for image super-resolution: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
  2. Yue, T.; Lu, X.; Cai, J.; Chen, Y.; Chu, S. YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO. arXiv 2024, arXiv:2412.19878. [Google Scholar]
  3. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  4. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  5. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  6. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 10 July 2017; pp. 136–144. [Google Scholar]
  7. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  8. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 7 October 2018. [Google Scholar]
  9. Li, M.; Zhao, Y.; Zhang, F.; Luo, B.; Yang, C.; Gui, W.; Chang, K. Multi-scale feature selection network for lightweight image super-resolution. Neural Netw. 2024, 169, 352–364. [Google Scholar] [CrossRef] [PubMed]
  10. Guo, Y.; Tian, C.; Liu, J.; Di, C.; Ning, K. HADT: Image super-resolution restoration using Hybrid Attention-Dense Connected Transformer Networks. Neurocomputing 2025, 614, 128790. [Google Scholar] [CrossRef]
  11. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  12. Li, W.; Lu, X.; Qian, S.; Lu, J.; Zhang, X.; Jia, J. On efficient transformer-based image pre-training for low-level vision. arXiv 2021, arXiv:2112.10175. [Google Scholar]
  13. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 24 November 2021; pp. 1833–1844. [Google Scholar]
  14. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Computer Vision—ECCV. Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207. [Google Scholar]
  15. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
  16. Ray, A.; Kumar, G.; Kolekar, M.H. CFAT: Unleashing Triangular Windows for Image Super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26120–26129. [Google Scholar]
  17. Wu, G.; Jiang, J.; Jiang, J.; Liu, X. Transforming image super-resolution: A ConvFormer-based efficient approach. arXiv 2024, arXiv:2401.05633. [Google Scholar] [CrossRef] [PubMed]
  18. Xie, C.; Zhang, X.; Li, L.; Meng, H.; Zhang, T.; Li, T.; Zhao, X. Large kernel distillation network for efficient single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1283–1292. [Google Scholar]
  19. Wang, Y.; Li, Y.; Wang, G.; Liu, X. Multi-scale attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5950–5960. [Google Scholar]
  20. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  21. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Computer Vision—ECCV 2024, Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025. [Google Scholar]
  22. Daubechies, I. Ten Lectures on Wavelets; SIAM: Philadelphia, PA, USA, 1992. [Google Scholar]
  23. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  24. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part II; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
  25. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  26. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  27. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Computer Vision—ECCV 2020 Workshops, Proceedings of the ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  28. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 833–843. [Google Scholar]
  29. Sun, L.; Pan, J.; Tang, J. Shufflemixer: An efficient convnet for image super-resolution. Adv. Neural Inf. Process. Syst. 2022, 35, 17314–17326. [Google Scholar]
  30. Wang, Z.; Gao, G.; Li, J.; Yan, H.; Zheng, H.; Lu, H. Lightweight feature de-redundancy and self-calibration network for efficient image super-resolution. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–15. [Google Scholar] [CrossRef]
  31. Li, Y.; Deng, Z.; Cao, Y.; Liu, L. GRFormer: Grouped Residual Self-Attention for Lightweight Single Image Super-Resolution. In Proceedings of the 32nd ACM International Conference on Multimedia, New York, NY, USA, 28 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 9378–9386. [Google Scholar]
  32. Kim, J.; Nang, J.; Choe, J. LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution. arXiv 2024, arXiv:2409.03516. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  34. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  35. Yu, J.; Fan, Y.; Yang, J.; Xu, N.; Wang, Z.; Wang, X.; Huang, T. Wide activation for efficient and accurate image super-resolution. arXiv 2018, arXiv:1808.08718. [Google Scholar]
  36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  37. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  38. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference (BMVC); BMVA Press: Surrey, UK, 2012; pp. 1–10. [Google Scholar]
  39. Martin, D.R.; Fowlkes, C.C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the ICCV, Vancouver, BC, Canada, 7–14 July 2001; pp. 416–423. [Google Scholar]
  40. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  41. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  42. Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9508–9520. [Google Scholar] [CrossRef] [PubMed]
Figure 1. PSNR, FLOPs, and parameter counts of different SISR models on the Urban100 dataset for the ×4 SR task.
Figure 2. The overall architecture of MSWSR with a detailed illustration of its key components: (a) The complete network structure showing the feature extraction pathway, (b) multi-scale wavelet block (MWB), (c) spatial selection attention (SSA) module that adaptively assigns weights to enhance feature discrimination, (d) the multi-scale feature module (MFM) and gated attention unit (GAU).
Figure 3. The overall architecture of RepConv.
Figure 4. An example of the WTConv operation.
Figure 5. Convergence speed comparison on Urban100 dataset for ×4 super-resolution task.
Figure 6. Visual comparison results.
Figure 7. Visual comparison result.
Table 1. Quantitative comparison of average PSNR (dB)/SSIM among lightweight models. The best results are highlighted in red.

Scale | Method | #Param | FLOPs | Set5 (PSNR/SSIM) | Set14 (PSNR/SSIM) | BSDS100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
×2 | IMDN | 694K | 635.4G | 37.89/0.9606 | 33.51/0.9169 | 32.12/0.8990 | 32.00/0.9267 | 38.61/0.9770
×2 | RFDN | 417K | 365.3G | 37.78/0.9606 | 33.35/0.9166 | 32.09/0.8991 | 31.79/0.9254 | 38.29/0.9764
×2 | RLFN | 526K | 461.7G | 37.88/0.9606 | 33.44/0.9168 | 32.13/0.8991 | 31.88/0.9259 | 38.39/0.9766
×2 | CFSR | 298K | 260.2G | 37.86/0.9605 | 33.44/0.9169 | 32.12/0.8992 | 31.77/0.9352 | 38.31/0.9764
×2 | SPAN | 410K | 377.5G | 37.94/0.9608 | 33.47/0.9165 | 32.14/0.8993 | 31.92/0.9265 | 38.30/0.9765
×2 | Ours | 312K | 243.3G | 38.01/0.9610 | 33.71/0.9193 | 32.22/0.9003 | 32.29/0.9301 | 38.86/0.9774
×3 | IMDN | 703K | 643.4G | 34.36/0.9272 | 30.28/0.8412 | 29.05/0.8045 | 28.09/0.8504 | 33.48/0.9438
×3 | RFDN | 424K | 371.4G | 34.18/0.9260 | 30.23/0.8406 | 29.02/0.8037 | 27.90/0.8475 | 33.23/0.9422
×3 | RLFN | 533K | 468.2G | 34.24/0.9266 | 30.26/0.8412 | 29.04/0.8412 | 27.99/0.8489 | 33.28/0.9426
×3 | CFSR | 294K | 266.2G | 34.23/0.9262 | 30.25/0.8406 | 29.04/0.8044 | 27.90/0.8475 | 33.30/0.9428
×3 | SPAN | 417K | 383.5G | 34.28/0.9268 | 30.27/0.8417 | 29.06/0.8049 | 28.04/0.8499 | 33.39/0.9436
×3 | Ours | 307K | 249.6G | 34.40/0.9277 | 30.35/0.8437 | 29.12/0.8067 | 28.22/0.8548 | 33.68/0.9454
×4 | IMDN | 715K | 654.5G | 32.09/0.8942 | 28.54/0.7810 | 27.52/0.7340 | 25.96/0.7819 | 30.33/0.9063
×4 | RFDN | 433K | 380.2G | 32.13/0.8943 | 28.50/0.7795 | 27.51/0.7339 | 25.92/0.7803 | 30.20/0.9051
×4 | RLFN | 543K | 477.3G | 31.97/0.8931 | 28.47/0.7795 | 27.51/0.7342 | 25.88/0.7803 | 30.12/0.9035
×4 | CFSR | 303K | 274.6G | 32.00/0.8930 | 28.49/0.7797 | 27.52/0.7343 | 25.84/0.7781 | 30.15/0.9045
×4 | SPAN | 426.3K | 391.9G | 32.08/0.8942 | 28.53/0.7810 | 27.55/0.7351 | 25.95/0.7812 | 30.34/0.9064
×4 | Ours | 316K | 257.6G | 32.26/0.8966 | 28.67/0.7843 | 27.62/0.7379 | 26.17/0.7896 | 30.60/0.9092
Table 2. Multi-scale ablation experiment. √ indicates that the corresponding convolution method is used in this configuration.

WTConv | RepConv | BSDS100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
√ | – | 27.57/0.7365 | 26.00/0.7844 | 30.39/0.9076
– | √ | 27.54/0.7349 | 25.94/0.7811 | 30.30/0.9061
√ | √ | 27.62/0.7379 | 26.17/0.7896 | 30.60/0.9092
Table 3. SSA ablation experiment. √ indicates that the corresponding SSA is used in this configuration.

SSA | BSDS100 (PSNR/SSIM) | Urban100 (PSNR/SSIM) | Manga109 (PSNR/SSIM)
– | 27.54/0.7351 | 25.91/0.7801 | 30.27/0.9057
√ | 27.62/0.7379 | 26.17/0.7896 | 30.60/0.9092
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
