Article

Multi-Scale Adaptive Modulation Network for Efficient Image Super-Resolution

School of Information Engineering, Xinjiang Institute of Technology, Aksu 843000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(22), 4404; https://doi.org/10.3390/electronics14224404
Submission received: 25 September 2025 / Revised: 20 October 2025 / Accepted: 24 October 2025 / Published: 12 November 2025
(This article belongs to the Special Issue Intelligent Signal Processing and Its Applications)

Abstract

As convolutional neural networks (CNNs) become gradually larger and deeper, their applicability in real-time and resource-constrained environments is significantly limited. Furthermore, while self-attention (SA) mechanisms excel at capturing global dependencies, they often emphasize low-frequency information and struggle to represent fine local details. To overcome these limitations, we propose a multi-scale adaptive modulation network (MAMN) for image super-resolution. The MAMN mainly consists of a series of multi-scale adaptive modulation blocks (MAMBs), each of which incorporates a multi-scale adaptive modulation layer (MAML), a local detail extraction layer (LDEL), and two Swin Transformer Layers (STLs). The MAML is designed to capture multi-scale non-local representations, while the LDEL complements this by extracting high-frequency local features. Additionally, the STLs enhance long-range dependency modeling, effectively expanding the receptive field and integrating global contextual information. Extensive experiments demonstrate that the proposed method achieves an optimal trade-off between computational efficiency and reconstruction performance across five benchmark datasets.

1. Introduction

In the current digital era, images serve as a critical medium for information, and their quality and resolution directly affect the effectiveness of information transmission and analysis. Single-Image Super-Resolution (SISR) aims to reconstruct low-resolution (LR) images into high-resolution (HR) counterparts through algorithmic processing, thereby restoring fine details and enhancing overall image quality. Beyond basic bicubic interpolation [1], more sophisticated classical methods (e.g., overlapping bicubic interpolation [2] and prior-based approaches [3,4,5]) have also been developed, pushing the performance boundaries of non-learning-based approaches. However, due to the inherently ill-posed nature of SISR, traditional super-resolution (SR) methods [1,2,3,4,5] often struggle to effectively model the complex non-linear mapping relationships between LR and HR images.
Deep learning-based methods [6,7,8,9,10,11,12] have significantly advanced SR by automatically learning complex details and high-frequency information from large datasets. Among these, convolutional neural network (CNN)-based approaches [6,8,13,14] have been widely adopted. However, due to the inherent locality of convolutional operations, shallow CNN models struggle to capture global contextual information, which often leads to distortions when reconstructing large-scale textures and long-range dependencies. To mitigate these issues and enhance representational capacity, CNN architectures have tended to grow both deeper and larger. For instance, RCAN [9] builds an over 400-layer network and contains more than 15 million parameters. Nevertheless, as CNN-based models grow deeper and more complex, they demand substantially more computational resources and memory than lightweight models, thereby hindering their deployment in real-time applications and on resource-constrained devices.
Recently, Vision Transformers (ViTs) [11,12,15,16,17] have been shown to effectively capture global dependencies and improve detail recovery in SR through self-attention (SA). However, SA typically requires high computational resources, especially for HR images. While efficient variants such as window-based [11], permuted [12], and spatial-window [17] models reduce costs, they still face efficiency issues in feature dependency modeling, leading to slow training and inference. Furthermore, recent research has shown that ViTs tend to prioritize low-frequency information, limiting their ability to represent local details [18,19].
To address the above challenges, we propose a multi-scale adaptive modulation network (MAMN), a simple yet efficient architecture that integrates multi-scale adaptive modulation blocks (MAMBs) to achieve a favorable balance between reconstruction quality and computational efficiency. In contrast to merely stacking lightweight convolutional modules, we introduce a Swin Transformer Layer (STL) to better capture long-range dependencies. Specifically, we develop a multi-scale adaptive modulation layer (MAML) that leverages variance-based weighting to dynamically modulate multi-scale non-local representations. Given MAML’s global characteristic, we further design a local detail extraction layer (LDEL) to complement the modulation process with fine-grained local contextual information. Additionally, two Swin Transformer Layers (STLs) are incorporated to enhance the modeling of long-range features. Together, these components form an end-to-end trainable framework that effectively achieves high reconstruction performance with manageable complexity, as demonstrated in Figure 1.
The remainder of this paper is organized as follows: Section 2 provides a comprehensive review of the existing research in SR, including CNN-based, ViT-based, and lightweight approaches, while also highlighting the shortcomings of these methods, which serve as the motivation for the present study. Section 3 details the proposed multi-scale adaptive modulation network (MAMN), including the MAML, LDEL, and STL modules, ensuring the validity of our research. Section 4 presents and discusses experimental results, including quantitative and qualitative comparisons, with supporting data visualized in figures and tables. Finally, Section 5 concludes the paper and suggests directions for future research.
The main contributions of this paper are summarized as follows:
  • We propose a novel multi-scale adaptive modulation layer (MAML) that employs multi-scale decomposition and variance-based modulation to effectively extract multi-scale global structural information.
  • We design a lightweight local detail extraction layer (LDEL) to capture fine local details, complemented by Swin Transformer Layers (STLs) to efficiently model long-range dependencies.
  • Through comprehensive quantitative and qualitative evaluations on five benchmark datasets, we demonstrate that our method achieves a favorable trade-off between computational complexity and reconstruction quality.

2. Related Works

2.1. CNN-Based Super-Resolution

Compared to traditional SR methods, CNN-based approaches can automatically learn complex non-linear mappings from extensive datasets. Owing to this capability, CNN-based models have achieved significant success in SR tasks. For instance, SRCNN [6] pioneered an end-to-end learning framework that directly maps LR inputs to HR outputs. Subsequent models (e.g., FSRCNN [13] and ESPCN [21]) introduced an efficient post-upsampling mechanism that significantly accelerates the inference process while maintaining competitive reconstruction quality. Further advancing the field, VDSR [22] incorporates residual learning to facilitate the training of deeper networks and improve mapping accuracy. However, small networks often struggle to effectively model long-range dependencies due to limited model capacity and small receptive fields, resulting in insufficient image detail and excessive smoothing. To overcome the overly smooth results generated by earlier CNN-based methods [6,13,21], generative adversarial network (GAN)-based SR approaches (e.g., [23,24]) leverage adversarial learning to reconstruct high-resolution images with more realistic textures and enhanced detail. However, these methods suffer from training instability, occasional hallucinated textures, and structural inconsistencies, and still face limitations in maintaining pixel-level precision.
To further enhance representational capacity, many CNN-based SR methods have adopted increasingly deep and wide architectures. For example, EDSR [8] scales up to 43 million parameters, resulting in improved reconstruction accuracy and visual quality. Similarly, RCAN [9] employs a very deep structure of over 400 layers and integrates a channel attention mechanism to adaptively extract informative features from different image regions, thereby enhancing both local detail and global consistency. However, these large-scale models require considerable computational resources and exhibit slow inference speeds, rendering them less suitable for real-time or mobile applications.

2.2. ViT-Based Super-Resolution

Owing to the remarkable ability of the Transformer architecture [25] to model global contextual information, it has attracted growing interest in the SR field. Vision Transformer (ViT) [26], for instance, employs global self-attention (SA) to capture long-range dependencies, significantly improving the reconstruction of fine details and global consistency. This approach has achieved breakthrough performance in various SR tasks, surpassing traditional convolutional neural networks. IPT [15] utilizes an SA mechanism pre-trained on the ImageNet dataset to integrate both global and local features, thereby enhancing high-frequency details and structural integrity. Nevertheless, the standard SA mechanism demands the calculation of pairwise interactions among all image patches, resulting in quadratic growth in computational and memory costs as the image size increases, which severely limits its practicality.
To mitigate these computational challenges, numerous efficient ViT variants have been developed. For instance, SwinIR [11] introduces local window-based SA and cross-layer feature fusion, effectively capturing detailed local structures while maintaining computational efficiency. ELAN [27] incorporates a streamlined long-range attention mechanism that reduces complexity while preserving the ability to model dependencies between distant pixels, thereby improving both detail recovery and structural coherence. Restormer [16] employs adaptive feature fusion strategies to balance local and global information for high-quality image reconstruction. SRFormer [12] achieves efficient large-window SA by transferring computation to the channel dimensions, reducing spatial complexity. Meanwhile, HAT [28] activates a larger proportion of pixels via hybrid attention to enhance representation capacity, and DAT [17] introduces dual aggregation across spatial and channel dimensions (both intra-block and inter-block), significantly improving the model’s expressive capability with a reduced number of channels. Although these methods effectively leverage SA to capture global dependencies and improve long-range feature propagation, they still face growing computational and memory demands as image resolution and network depth increase, ultimately constraining training and inference efficiency and imposing higher hardware requirements.

2.3. Lightweight and Efficient Image Super-Resolution

To balance computational efficiency and resource consumption, numerous lightweight SR methods have been developed. For instance, FSRCNN [13] and ESPCN [21] employ a post-upsampling strategy to accelerate SR processing and alleviate the computational overhead of input upscaling. CARN [7] introduces group convolutions and a cascading mechanism to progressively refine image details, thereby enhancing network performance. IMDN [29] leverages multi-stage information distillation to effectively extract and fuse multi-level features, significantly improving SR quality while reducing computational cost. LatticeNet [30] utilizes serial lattice blocks coupled with backward feature fusion to minimize parameters while preserving competitive reconstruction performance. Meanwhile, LCRCA [31] proposes a lightweight yet efficient deep residual block (DRB) capable of generating more accurate residual information. ShuffleMixer [32] integrates channel shuffling and group convolution to optimize feature reorganization and computation, markedly improving efficiency without sacrificing performance. BSRN [33] adopts blueprint separable convolution to reduce model complexity, and HNCT [34] introduces a hybrid architecture that combines local and non-local priors using both CNN and Transformer components for enhanced SR performance. Furthermore, SAFMN [10] implements a spatially adaptive feature modulation mechanism to dynamically select informative representations, while HDSRNet [35] exploits heterogeneous dynamic convolution for efficient SR. SMFANet [36] contributes a self-modulating feature aggregation (SMFA) module to enhance feature expressiveness in the spatial dimension. Despite this progress in efficiency-oriented designs, achieving an optimal trade-off between reconstruction accuracy and computational efficiency remains an ongoing challenge in lightweight SR.

3. Proposed Method

In this paper, we propose an efficient SR method that integrates a multi-scale adaptive modulation layer (MAML) to capture multi-scale non-local information and a local detail extraction layer (LDEL) to extract fine-grained local details. To further enhance feature refinement and long-range dependency modeling, we introduce a Swin Transformer Layer (STL), which effectively refines the extracted features and facilitates global contextual interaction throughout the network. The proposed method achieves a superior balance between model complexity and reconstruction performance.
As illustrated in Figure 2, the overall architecture of the proposed multi-scale adaptive modulation network (MAMN) comprises three components: a 3 × 3 convolution layer, a series of stacked multi-scale adaptive modulation blocks (MAMBs), and an upsampler layer. Given an LR input $I_{LR} \in \mathbb{R}^{H \times W \times 3}$, we first adopt a 3 × 3 convolution layer to extract shallow features $F_s \in \mathbb{R}^{H \times W \times C}$. These features are then progressively refined through multiple MAMBs to produce more representative deep features. Each MAMB integrates one MAML, one LDEL, and two STLs. To facilitate the reconstruction of high-frequency details, a global residual connection is introduced. Finally, the HR output image $I_{SR} \in \mathbb{R}^{H \times W \times 3}$ is reconstructed through an upsampler layer, which consists of a 3 × 3 convolution followed by a sub-pixel convolution [10]. The entire process can be formulated as follows:
$$F_s = \mathrm{Conv}(I_{LR}), \qquad I_{SR} = G_{UP}\big(G_{DF}(F_s) + F_s\big),$$
where $\mathrm{Conv}(\cdot)$ denotes a 3 × 3 convolution operation, $G_{DF}(\cdot)$ represents a series of stacked MAMBs, and $G_{UP}(\cdot)$ refers to the upsampler layer. Following previous studies [10,32], the loss function combines a mean pixel-wise loss and an FFT-based frequency loss to enhance the recovery of high-frequency details. The overall loss function is defined as follows:
$$\mathcal{L} = \left\| I_{SR} - I_{HR} \right\|_1 + \lambda \left\| \mathcal{F}(I_{SR}) - \mathcal{F}(I_{HR}) \right\|_1,$$
where $I_{HR}$ is the ground-truth HR image, $\mathcal{F}(\cdot)$ denotes the fast Fourier transform, and $\lambda$ is a weighting parameter (empirically set to 0.05).
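To make the overall data flow concrete, the following PyTorch sketch mirrors the pipeline and loss described above: a 3 × 3 convolution for shallow features, stacked MAMBs with a global residual connection, a sub-pixel upsampler, and the combined pixel/FFT loss. The names (MAMNSketch, sr_loss) and the placeholder block argument are illustrative assumptions and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAMNSketch(nn.Module):
    """Minimal sketch of the MAMN pipeline: shallow 3x3 conv -> stacked
    blocks (G_DF) with a global residual -> 3x3 conv + pixel shuffle (G_UP)."""

    def __init__(self, channels=36, num_blocks=8, scale=4, block=None):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        # `block` stands in for the MAMB; nn.Identity keeps the sketch runnable.
        make = block if block is not None else (lambda c: nn.Identity())
        self.body = nn.Sequential(*[make(channels) for _ in range(num_blocks)])
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        fs = self.head(lr)                    # F_s = Conv(I_LR)
        return self.tail(self.body(fs) + fs)  # I_SR = G_UP(G_DF(F_s) + F_s)


def sr_loss(sr, hr, lam=0.05):
    """Pixel-wise L1 loss plus an FFT-based frequency L1 loss."""
    pix = F.l1_loss(sr, hr)
    freq = (torch.fft.rfft2(sr) - torch.fft.rfft2(hr)).abs().mean()
    return pix + lam * freq
```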

3.1. Multi-Scale Adaptive Modulation Layer

Existing methods primarily rely on single-scale features [8,9,34,35], which often fail to capture multi-level details and global information, leading to an incomplete representation of complex image structures. To address this limitation, we propose a multi-scale adaptive modulation module that extracts features at multiple scales and adaptively adjusts the importance of information at each scale. This enables refined integration of global context and enhances the model’s capacity to handle variations in scale. As illustrated in Figure 2b, to reduce model complexity while obtaining multi-scale representations, we first split the input $X_{in} \in \mathbb{R}^{H \times W \times C}$ channel-wise into four components. The first component is processed using a 1 × 1 depth-wise convolution, while the remaining three are fed into feature generation units with progressively increased receptive fields to produce multi-scale features. This procedure can be expressed as follows:
$$X_i = \mathrm{Split}(X_{in}),\; i = 1, 2, 3, 4, \qquad \hat{X}_1 = \mathrm{DWConv}_{1 \times 1}(X_1), \qquad \hat{X}_i = \mathrm{DWConv}_{j \times j}\big(D_k(X_i)\big),\; i = 2, 3, 4,\; j = 3, 5, 7,\; k = 2, 4, 8,$$
where $\mathrm{Split}(\cdot)$ refers to the channel-split operation, $\mathrm{DWConv}_{1 \times 1}(\cdot)$ and $\mathrm{DWConv}_{j \times j}(\cdot)$ are 1 × 1 and $j \times j$ depth-wise convolution layers, $D_k(\cdot)$ represents adaptive max pooling with a downsampling factor of $k$, $X_i \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, $\hat{X}_1 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, and $\hat{X}_i \in \mathbb{R}^{\frac{H}{k} \times \frac{W}{k} \times \frac{C}{4}}$. Here, adaptive max pooling is combined with convolutions of different scales to capture broader contextual information and extract more abstract, global features. Inspired by SMFANet [36], the variance of $X_i$ is employed as a measure of spatial information variability. To enhance the model’s expressiveness and adaptability, we introduce learnable parameters $\alpha$ and $\beta$ to weight the two sources of information (the convolution-processed features and the variance of the input features), allowing the model to dynamically adjust their relative importance based on the training data. This procedure is expressed as:
$$\sigma^2(X_i) = \frac{1}{N}\sum_{j=0}^{N-1} \left(x_j - \mu\right)^2, \qquad i = 1, 2, 3, 4,$$
where $\sigma^2(X_i) \in \mathbb{R}^{1 \times 1 \times \frac{C}{4}}$ is the variance of $X_i$, $N$ is the total number of pixels, $x_j$ denotes the $j$-th pixel value of $X_i$, and $\mu$ is the mean value of all pixels.
Subsequently, a 1 × 1 depth-wise convolution performs channel-adaptive nonlinear modulation on the weighted combination of the multi-scale features and their variance statistics to enhance feature representation. We then obtain the non-local feature representation by applying a 1 × 1 convolution to the concatenated multi-scale features. This operation is formulated as follows:
$$\tilde{X}_i = \mathrm{Conv}_{1 \times 1}\big(\alpha \cdot \hat{X}_i + \beta \cdot \sigma^2(X_i)\big),\; i = 1, 2, 3, 4, \qquad \check{X}_i = U_k(\tilde{X}_i),\; i = 2, 3, 4, \qquad X_{cat} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(\tilde{X}_1, \check{X}_i)\big),\; i = 2, 3, 4,$$
where $\mathrm{Conv}_{1 \times 1}(\cdot)$ is a 1 × 1 convolution operation, $\alpha$ and $\beta$ are learnable parameters, $U_k(\cdot)$ denotes upsampling the feature maps by a factor of $k$ back to the original resolution via nearest-neighbor interpolation, $\tilde{X}_i \in \mathbb{R}^{H \times W \times \frac{C}{4}}$ are the variance-adjusted features, $\check{X}_i \in \mathbb{R}^{H \times W \times \frac{C}{4}}$ are the upsampled features, $\mathrm{Concat}(\cdot)$ represents the channel concatenation operation, and $X_{cat} \in \mathbb{R}^{H \times W \times C}$ is the concatenated feature.
Finally, we use the concatenated feature to modulate the input feature $X_{in} \in \mathbb{R}^{H \times W \times C}$ and obtain the representative feature $X_{out} \in \mathbb{R}^{H \times W \times C}$, which can be expressed as follows:
$$X_{out} = X_{in} \odot \phi(X_{cat}),$$
where $\phi(\cdot)$ denotes the GELU function [37] and $\odot$ is the element-wise product.
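A minimal PyTorch sketch of the MAML is given below, covering the four-way channel split, adaptive max pooling with 3 × 3/5 × 5/7 × 7 depth-wise convolutions, variance modulation with learnable α and β, feature aggregation, and the GELU-gated modulation of the input. The module name, parameter shapes, initialization, and per-branch 1 × 1 projections are assumptions inferred from the formulation above, not the authors’ code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAMLSketch(nn.Module):
    """Sketch of the multi-scale adaptive modulation layer (MAML)."""

    def __init__(self, dim=36):
        super().__init__()
        assert dim % 4 == 0
        c = dim // 4
        self.dw1 = nn.Conv2d(c, c, 1, groups=c)        # branch 1: 1x1 depth-wise conv
        self.kernels, self.strides = (3, 5, 7), (2, 4, 8)
        self.dws = nn.ModuleList(
            [nn.Conv2d(c, c, k, padding=k // 2, groups=c) for k in self.kernels]
        )
        self.alpha = nn.Parameter(torch.ones(4, 1, c, 1, 1))   # learnable weights (assumed init)
        self.beta = nn.Parameter(torch.zeros(4, 1, c, 1, 1))
        self.proj = nn.ModuleList([nn.Conv2d(c, c, 1) for _ in range(4)])
        self.aggr = nn.Conv2d(dim, dim, 1)                     # feature aggregation
        self.act = nn.GELU()

    def forward(self, x):
        h, w = x.shape[-2:]
        xs = torch.chunk(x, 4, dim=1)                          # channel split
        feats = [self.dw1(xs[0])]
        for i, (s, dw) in enumerate(zip(self.strides, self.dws), start=1):
            # adaptive max pooling (factor k) followed by j x j depth-wise conv
            feats.append(dw(F.adaptive_max_pool2d(xs[i], (h // s, w // s))))
        out = []
        for i, (xi, fi) in enumerate(zip(xs, feats)):
            var = xi.var(dim=(-2, -1), keepdim=True)           # per-channel spatial variance
            y = self.proj[i](self.alpha[i] * fi + self.beta[i] * var)
            if i > 0:
                y = F.interpolate(y, size=(h, w), mode="nearest")  # upsample back
            out.append(y)
        x_cat = self.aggr(torch.cat(out, dim=1))
        return x * self.act(x_cat)                             # modulate the input feature
```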

3.2. Local Detail Extraction Layer

Local details play a crucial role in enhancing the recovery of fine image structures. While the MAML captures global multi-scale information, we design a lightweight local detail extraction layer (LDEL) to simultaneously augment local detail representation. As illustrated in Figure 2c, the input $Y_{in} \in \mathbb{R}^{H \times W \times C}$ is first projected via a 1 × 1 convolution to expand its channel dimension. The expanded features $Y_e \in \mathbb{R}^{H \times W \times 2C}$ are then processed by a 3 × 3 depth-wise convolution to encode localized patterns, producing the feature set $Y_l \in \mathbb{R}^{H \times W \times 2C}$. Finally, a 1 × 1 convolution followed by a GELU activation [37] is applied to reduce the channel dimension and generate the refined local feature $Y_{out} \in \mathbb{R}^{H \times W \times C}$. The entire procedure can be formulated as follows:
$$Y_e = \mathrm{Conv}_{1 \times 1}(Y_{in}), \qquad Y_l = \mathrm{DWConv}_{3 \times 3}(Y_e), \qquad Y_{out} = \phi\big(\mathrm{Conv}_{1 \times 1}(Y_l)\big),$$
where $\phi(\cdot)$ and $\mathrm{DWConv}_{3 \times 3}(\cdot)$ follow the previous definitions.
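The LDEL maps directly onto a few standard layers; the sketch below follows the equation above with the channel expansion to 2C stated in the text. The class name is illustrative.

```python
import torch.nn as nn


class LDELSketch(nn.Module):
    """Sketch of the local detail extraction layer (LDEL): 1x1 expansion,
    3x3 depth-wise conv for local patterns, 1x1 reduction + GELU."""

    def __init__(self, dim=36):
        super().__init__()
        hidden = dim * 2
        self.expand = nn.Conv2d(dim, hidden, 1)                               # Y_e
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # Y_l
        self.reduce = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()

    def forward(self, y):
        return self.act(self.reduce(self.dwconv(self.expand(y))))            # Y_out
```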

3.3. Swin Transformer Layer

The Transformer architecture [25] effectively captures long-range dependencies and integrates global information through its SA mechanism, thereby enhancing the flexibility and expressiveness of feature representation. However, classic SA approaches [25,26] suffer from high computational complexity, particularly when processing long sequences, leading to a significant burden on memory and computational resources. To mitigate these limitations, we incorporate the Swin Transformer Layer (STL) from SwinIR [11], which maintains the ability to model long-range interactions while significantly improving computational efficiency.
As shown in Figure 2d, the input $F_{in} \in \mathbb{R}^{H \times W \times C}$ is first normalized by a LayerNorm layer to stabilize the training process. Subsequently, the normalized feature is fed into a multi-head self-attention (MSA) module to effectively capture long-range dependencies. At the end of the MSA, a residual connection is introduced to add the MSA’s output to the original input feature, thereby preserving the original information and enhancing gradient flow. This process can be described as follows:
$$F_s = \mathrm{MSA}\big(\mathrm{LN}(F_{in})\big) + F_{in},$$
where $\mathrm{LN}(\cdot)$ represents a LayerNorm operation and $F_s \in \mathbb{R}^{H \times W \times C}$ is the intermediate feature. $F_s$ is then processed by a second LayerNorm layer for further normalization. The normalized feature is subsequently fed into a multi-layer perceptron (MLP) to perform non-linear transformation and enhancement. Finally, a residual connection is similarly introduced at the output of the MLP to add its output to the feature preceding the second LayerNorm. This procedure can be formulated as follows:
$$F_{out} = \mathrm{MLP}\big(\mathrm{LN}(F_s)\big) + F_s,$$
where $F_{out} \in \mathbb{R}^{H \times W \times C}$ is the output of the STL.
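For reference, the sketch below implements the two pre-norm residual steps above using PyTorch’s standard multi-head attention. The (shifted-)window partitioning used by the actual Swin Transformer layer in SwinIR [11] is omitted for brevity, so this is a simplified stand-in rather than the exact STL; the head count and MLP ratio are assumptions.

```python
import torch.nn as nn


class STLSketch(nn.Module):
    """Simplified pre-norm transformer block: F_s = MSA(LN(F_in)) + F_in,
    F_out = MLP(LN(F_s)) + F_s. Attention is applied over all tokens."""

    def __init__(self, dim=36, heads=6, mlp_ratio=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # tokens: (B, H*W, C)
        n = self.norm1(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)
```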

3.4. Multi-Scale Adaptive Modulation Block

Based on the complementary functions of the MAML, LDEL, and STL, we integrate these modules into a multi-scale adaptive modulation block (MAMB) to extract rich and representative deep features. The input is simultaneously processed through two parallel branches: the MAML, which captures multi-scale non-local contextual information, and the LDEL, which focuses on extracting fine-grained local details. This process can be expressed as follows:
$$X_m = \mathrm{MAML}\big(\mathrm{LN}(X)\big) + X, \qquad X_l = \mathrm{LDEL}\big(\mathrm{LN}(X)\big) + X,$$
Subsequently, the features extracted from these two branches are concatenated along the channel dimensions to form a comprehensive representation that integrates both multi-scale global information and fine local details. The concatenated features are then processed by a 1 × 1 convolutional layer to adjust channel dimensions and compress the feature representation, thereby enhancing computational efficiency and facilitating effective information integration. This process can be formulated as follows:
$$X_c = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(X_m, X_l)\big),$$
The processed features are then passed through two consecutive STLs to enable deeper modeling, further enhancing the network’s ability to capture long-range dependencies. Finally, the output of the STLs undergoes refinement via a 1 × 1 convolutional layer and is combined with the original input features through a residual connection [38]. This design not only improves the representational capacity of the features but also ensures stability during training.
$$Y = \mathrm{Conv}_{1 \times 1}\big(\mathrm{STL}(\mathrm{STL}(X_c))\big) + X,$$
where $X_m, X_l, X_c \in \mathbb{R}^{H \times W \times C}$ represent the intermediate features, while $X \in \mathbb{R}^{H \times W \times C}$ and $Y \in \mathbb{R}^{H \times W \times C}$ denote the input and output of the MAMB, respectively.
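Combining the previous sketches, one MAMB can be assembled as follows. The reuse of the earlier MAMLSketch/LDELSketch/STLSketch classes and the channel-wise normalization (GroupNorm with one group as a LayerNorm stand-in for 4-D features) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class MAMBSketch(nn.Module):
    """Sketch of one multi-scale adaptive modulation block (MAMB)."""

    def __init__(self, dim=36):
        super().__init__()
        self.norm_m = nn.GroupNorm(1, dim)      # LayerNorm stand-in over channels
        self.norm_l = nn.GroupNorm(1, dim)
        self.maml = MAMLSketch(dim)
        self.ldel = LDELSketch(dim)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)  # aggregate the two branches
        self.stl = nn.Sequential(STLSketch(dim), STLSketch(dim))
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        x_m = self.maml(self.norm_m(x)) + x     # multi-scale non-local branch
        x_l = self.ldel(self.norm_l(x)) + x     # local detail branch
        x_c = self.fuse(torch.cat([x_m, x_l], dim=1))
        return self.proj(self.stl(x_c)) + x     # two STLs, 1x1 conv, residual
```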

4. Experimental Results

In this section, we evaluate the performance of the proposed method from both quantitative and qualitative perspectives on five benchmark test datasets.

4.1. Datasets and Implementation Details

Datasets. Similar to previous works [10,11], we employ the DIV2K [8] and Flickr2K [8] datasets for model training. DIV2K provides 800 high-resolution, high-quality images, which effectively support the model in learning rich textures and details. Flickr2K offers 2650 diverse real-world images, which help improve the model’s generalization ability and robustness. LR images are generated by bicubic downscaling of the HR images. For testing, we use five benchmark datasets: Set5 [39], Set14 [40], BSD100 [41], Urban100 [42], and Manga109 [20]. Set5 and Set14 are designed to facilitate quick and fair numerical comparisons with other methods. The BSD100 dataset consists of 100 natural scene images that contain natural noise and complex structures. Urban100 comprises 100 urban images with numerous regular repeating structures (e.g., building windows, staircases, and floor tiles). Manga109 comprises 109 manga-style images characterized by sharp edges and smooth color regions. These datasets collectively cover diverse image types, enabling fair comparison and objective evaluation of different SR models’ performance. The peak signal-to-noise ratio (PSNR) [43] and structural similarity index measure (SSIM) [44] are calculated on the Y channel after converting the images to the YCbCr color space.
The PSNR is employed to quantitatively assess the reconstruction quality of SR images by measuring the pixel-level similarity between the generated high-resolution image and the corresponding ground truth. It is defined as follows:
$$\mathrm{PSNR} = 10 \times \log_{10}\frac{MAX^2}{\mathrm{MSE}},$$
where $MAX$ represents the maximum possible pixel value of the image (typically 255 for 8-bit images), and $\mathrm{MSE}$ denotes the mean squared error between the super-resolved image and the ground-truth high-resolution image. The MSE can be expressed as follows:
$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\big[I(i, j) - K(i, j)\big]^2,$$
where $I(i, j)$ and $K(i, j)$ denote the pixel values at position $(i, j)$ of the ground-truth high-resolution image and the SR image, respectively, while $m$ and $n$ represent the width and height of the images. The SSIM evaluates the perceptual similarity between these two images by comparing brightness, contrast, and structural characteristics. Its calculation is defined as follows:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
where $\mu_x$ and $\mu_y$ represent the average brightness of the ground-truth high-resolution image and the SR image, respectively; $\sigma_x$ and $\sigma_y$ denote their standard deviations; and $\sigma_{xy}$ is the covariance between the two images. Constants $C_1$ and $C_2$ are introduced to stabilize the division, with $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$, where $K_1 = 0.01$, $K_2 = 0.03$, and $L$ indicates the dynamic range of the pixel values (e.g., $L = 255$ for 8-bit images).
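For completeness, the following Python sketch computes both metrics on a single-channel array (the paper evaluates on the Y channel of YCbCr). The simplified SSIM here uses whole-image statistics instead of the sliding Gaussian window of the reference implementation, so it is illustrative only.

```python
import numpy as np


def psnr(gt, sr, max_val=255.0):
    """PSNR between a ground-truth image `gt` and an SR result `sr`."""
    mse = np.mean((gt.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


def ssim(gt, sr, max_val=255.0, k1=0.01, k2=0.03):
    """Global (single-window) SSIM for illustration."""
    x, y = gt.astype(np.float64), sr.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```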
Implementation details. During training, low-resolution (LR) input images are randomly cropped into patches of 64 × 64 pixels and augmented through random horizontal flipping and rotation. A batch size of 16 is used throughout the training process. The MAMN employs 8 MAMBs with 36 feature channels. We train the proposed model with the Adam optimizer [45] using $\beta_1 = 0.9$ and $\beta_2 = 0.99$. The total number of iterations is set to 1,000,000. The initial learning rate is set to $1 \times 10^{-4}$ and decayed to a minimum of $1 \times 10^{-5}$ following a cosine annealing scheme [46]. All experiments are implemented using the PyTorch (torch==2.3.0 + CUDA 11.8) framework and executed on an NVIDIA GeForce RTX 3090 GPU.
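An illustrative optimizer and scheduler setup matching these hyperparameters is shown below. It reuses the MAMNSketch and sr_loss sketches from Section 3, and the training step and data pipeline are assumptions rather than the released training code.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Adam (beta1=0.9, beta2=0.99), LR 1e-4 -> 1e-5 via cosine annealing over 1M iterations.
model = MAMNSketch(channels=36, num_blocks=8, scale=4)
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
scheduler = CosineAnnealingLR(optimizer, T_max=1_000_000, eta_min=1e-5)


def train_step(lr_patch, hr_patch):
    """One iteration on a batch of 64x64 LR crops (batch size 16, data loader omitted)."""
    optimizer.zero_grad()
    loss = sr_loss(model(lr_patch), hr_patch, lam=0.05)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```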

4.2. Comparisons with State-of-the-Art Methods

Quantitative comparisons. To evaluate the performance of the proposed model, we perform comprehensive comparisons with state-of-the-art CNN-based and traditional lightweight SR methods, including Bicubic [1], FSRCNN [13], EDSR-baseline [8], CARN [7], IMDN [29], PAN [47], DPSR [48], LatticeNet [30], LCRCA [31], ShuffleMixer [32], HNCT [34], FDIWN [49], and HDSRNet [35]. Table 1 presents the quantitative comparisons with CNN-based methods for × 2 , × 3 , and × 4 SR across five benchmark datasets. Besides the widely adopted PSNR and SSIM metrics, we also report parameters and FLOPs (computed over three color channels) to assess model complexity, measured with the fvcore library (https://detectron2.readthedocs.io/en/latest/modules/fvcore.html#fvcore.nn.parameter_count (accessed on 25 June 2025)) when upscaling an LR image to 1280 × 720 pixels. Parameters relate to the memory footprint, while FLOPs reflect computational cost. As shown in Table 1, the proposed model achieves the best performance on both × 3 and × 4 SR tasks. For × 2 SR, the proposed MAMN attains the second-best PSNR/SSIM results (excluding Set5 and Urban100) while using only 40% of the parameters of the best model, LatticeNet.
We also compared our approach with attention-based methods, including lightweight dynamic modulation (e.g., SAFMN [10], SMFANet [36], and SRConvNet [50]) and large-scale self-attention (e.g., SwinIR [11], HAT [28], and RGT [51]). As observed in Table 2, compared to similar lightweight models (SAFMN, SMFANet, and SRConvNet), the proposed MAMN consistently achieves the best overall performance while maintaining comparable parameters. At scaling factors of × 3 and × 4 , MAMN attains the highest metrics across all five test datasets. The most significant improvements are observed on the × 3 scale for the Urban100 (28.43/0.8570) and Manga109 (34.20/0.9478) datasets while maintaining reasonable FLOP control (only 21 G at × 4 ). Compared to large-scale models (SwinIR, HAT, and RGT) with parameter counts dozens of times greater than ours, MAMN achieves approximately 95% of HAT’s PSNR performance in × 4 SR tasks while utilizing less than 3.1% of the parameters (0.31 M vs. 10 M–21 M). In terms of computational efficiency, MAMN requires only 1.4–3.5% of the FLOPs of these large models (e.g., 21 G vs. 592 G–1458 G at × 4 scale). Particularly on the Set5 dataset for × 4 SR, the proposed method attains about 98% of HAT’s performance with merely 1.5% of its parameters.
Qualitative comparisons. Visual comparisons for × 3 and × 4 SR are presented in Figure 3 and Figure 4, respectively, using images from the Urban100 dataset. The results demonstrate that the proposed method achieves superior reconstructions compared to other approaches, exhibiting enhanced recovery of fine details and improved structural integrity. Specifically, the proposed method preserves original image details more effectively while reducing blurring and visual distortions. Although all compared methods achieve perceptually plausible results, the proposed approach shows notably stronger performance in reconstructing complex textures and sharp edges, indicating its enhanced capability in handling complex scenarios.
Running time comparisons. To further assess the computational efficiency and practical applicability of various SR methods, we compare the inference times of different methods (including DPSR [48], LatticeNet [30], HNCT [34], ShuffleMixer [32], SAFMN [10], SMFANet [36], and SRConvNet [50]) when performing × 4 SR on 50 images with a resolution of 160 × 120 pixels. The test platform utilizes an Intel(R) i5-13600KF processor (20 cores @ 3.5 GHz), 32 GB system memory, and an NVIDIA GeForce RTX 4060 Ti GPU running on the Windows operating system. Based on the running time comparisons presented in Table 3, the proposed MAMN achieves an average inference time of 0.066 s. This performance demonstrates highly competitive efficiency among state-of-the-art methods.
As shown in Table 1 and Table 3, MAMN achieves superior reconstruction quality across all five benchmark datasets while being 20.5% faster than HNCT (0.066 s vs. 0.083 s). Compared to the lightweight SMFANet [36], which requires 0.034 s, MAMN maintains a better trade-off between performance and speed, achieving significantly higher accuracy despite a moderate increase in inference time. Furthermore, the proposed method demonstrates substantially greater efficiency than several larger models; for example, DPSR [48] requires 0.169 s (156.1% slower), while LatticeNet [30] takes 0.120 s (81.8% slower) to process the same data. Table 1 and Table 3 indicate that the proposed method balances computational efficiency with high reconstruction quality. Specifically, the overall runtime is displayed in Figure 1 (labeled as “Runtime vs. PSNR”).
LAM comparisons. The Local Attribution Map (LAM) [52] highlights significant correlations between red pixels and rectangular reference patches during the reconstruction process. In Figure 5, we compare the LAM visualizations of the proposed method with other efficient SR approaches [10,29,32] and annotate the corresponding Diffusion Index (DI) value below each subfigure. Higher DI values indicate a broader range of pixel participation in the reconstruction. The results demonstrate that the proposed MAMN effectively integrates information from a wider spatial area, leading to superior reconstruction quality.

4.3. Model Analysis

To thoroughly evaluate the contribution of each component in the proposed MAMN, we conduct comprehensive ablation studies under consistent experimental conditions. All ablation experiments are performed at an upscaling factor of × 4 using identical settings to ensure fair comparisons. As summarized in Table 4, the baseline corresponds to the complete MAMN model. The notation “A→B” indicates replacing component A with B, while “None” denotes removing the corresponding operation. The abbreviations represent the following modules: FA (Feature Aggregation), FM (Feature Modulation), MC (Multi-scale Convolution), Down (Downsampling), VM (Variance Modulation), and FFTLoss (Frequency Loss). Performance is evaluated quantitatively on the Set5 [39] and Manga109 [20] datasets.
Effectiveness of the multi-scale adaptive modulation layer. We conduct an ablation study to evaluate the impact of the multi-scale adaptive modulation layer (MAML). Experimental results indicate that removing this module causes PSNR decreases of 0.08 dB on the Set5 dataset and 0.17 dB on the Manga109 dataset, underscoring the critical importance of MAML. To further elucidate its underlying mechanism, we perform a depth analysis of this module.
  • Feature Modulation. The MAML incorporates a feature modulation mechanism to adaptively adjust feature weights. Ablation results show that removing this operation (“w/o FM”) leads to performance degradation of 0.02 dB on Set5 and 0.03 dB on Manga109 compared to the baseline model.
  • Multi-scale representation. To evaluate the effectiveness of multi-scale features in the proposed MAMN, we construct two variant models: “w/o MC” and “w/o Down”. Here, the “w/o MC” configuration replaces the multi-scale depth-wise convolution with a single-scale 3 × 3 depth-wise convolution for spatial feature extraction. As shown in Table 4, the use of multi-scale features yields a PSNR improvement of 0.04 dB on the Manga109 dataset. The “w/o Down” variant, which removes the downsampling operation, confirms that incorporating downsampling can bring about superior PSNR performance. These results demonstrate that multi-scale feature extraction enhances the model’s ability to capture information at different levels of detail, thereby improving SR reconstruction. Furthermore, we employ adaptive max pooling to construct multi-scale representations. In comparison to adaptive average pooling and nearest interpolation, adaptive max pooling more effectively identifies salient features, contributing to improved reconstruction quality.
  • Variance modulation. To enhance the ability to capture non-local information, the proposed MAMN incorporates variance modulation within the MAML branch. An ablation study is conducted by removing this operation to evaluate its contribution. As summarized in Table 4, the absence of variance modulation results in a consistent performance reduction of 0.06 dB on both the Set5 and Manga109 datasets. Furthermore, replacing variance modulation with standard attention mechanisms improves performance but leads to sharp increases in parameters and computational complexity, rising by 58.6% and 81%, respectively. These findings confirm that variance modulation plays a critical role in improving the representational capacity of the model.
  • Feature aggregation. To evaluate the effectiveness of feature aggregation, we construct an ablation model denoted as “w/o FA”, in which the 1 × 1 convolutional layer for the integration of multi-scale features along the channel dimensions is removed. Experimental results show that incorporating feature aggregation improves PSNR by 0.06 dB on the Set5 dataset and 0.07 dB on the Manga109 dataset. These results demonstrate the essential role of multi-scale feature aggregation in improving reconstruction performance.
Effectiveness of the local detail extraction layer. To capture local features and fine-grained details, we design an LDEL branch. An ablation study is performed by removing this branch to evaluate its contribution. As presented in Table 4, the absence of the LDEL results in PSNR decreases of 0.13 dB on the Set5 dataset and 0.36 dB on the Manga109 dataset. These results demonstrate the critical role of the LDEL in extracting and preserving locally important features and structural details.
Figure 6 presents the power spectral density (PSD) maps and corresponding feature visualizations, illustrating the complementary interactions between the MAML and LDEL branch. Through periodic spectrum map transformation, the low-frequency components are shifted to the center, where a brighter central region indicates stronger energy concentration in low-frequency components. The MAML exhibits higher energy density, with pronounced brightness at the center, whereas the F l feature from the LDEL shows more dispersed energy distribution in the peripheral regions compared to the input feature ( F i n ) and the MAML output ( F m ). This contrast highlights the distinct yet complementary roles of the two branches in capturing frequency information.
Figure 7 visualizes the high-frequency components within the feature maps of the input, MAML, and LDEL across different radial regions. The high-frequency components are extracted through the following procedure: First, a low-frequency mask is generated by computing the squared distance from each pixel to the center coordinates within a given radius (r); pixels within or on the circle of radius r are assigned a mask value of 1, while all others are set to 0. The high-frequency mask is then obtained by inverting the low-frequency mask as 1 − mask, such that regions with mask values of 1 correspond to the high-frequency components. Finally, the spectrum masked by the high-frequency mask is transformed back to the spatial domain via the inverse Fourier transform, yielding a complex-valued image; its absolute value is taken to produce a real-valued image, whose mean is computed to quantify the high-frequency energy within the current radial region.
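A compact PyTorch sketch of this radial masking procedure is given below; the function name and the exact mask convention are assumptions based on the description above rather than the authors’ analysis script.

```python
import torch


def high_frequency_energy(feat, radius):
    """Mean high-frequency energy of a 2-D feature map outside a centered
    circle of the given radius, following the FFT masking procedure above."""
    h, w = feat.shape[-2:]
    spec = torch.fft.fftshift(torch.fft.fft2(feat))            # low frequencies at center
    yy, xx = torch.meshgrid(
        torch.arange(h, device=feat.device),
        torch.arange(w, device=feat.device),
        indexing="ij",
    )
    dist2 = (yy - h // 2) ** 2 + (xx - w // 2) ** 2
    low_mask = (dist2 <= radius ** 2).float()                  # 1 inside the circle
    high_mask = 1.0 - low_mask                                  # 1 outside: high frequencies
    img = torch.fft.ifft2(torch.fft.ifftshift(spec * high_mask))
    return img.abs().mean().item()
```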
As shown in Figure 7, the high-frequency component values of the LDEL are significantly higher than those of the Input and MAML at small region radii, indicating that the LDEL retains richer high-frequency information in its initial state and is more effective at capturing fine textural details. As the radial region expands, the high-frequency responses of the LDEL decrease at a relatively gradual rate. In contrast, the MAML exhibits a more observable decline in high-frequency energy, while the input maintains consistently low values throughout. These results demonstrate that the LDEL can stably preserve high-frequency information across varying scale regions, indicating its superior capability in maintaining high-frequency features when processing images of different sizes.
Effectiveness of the Swin transformer layer. The introduced STL can enhance the global representation capability of MAMN by effectively capturing long-range dependencies. To evaluate the contribution of this module, we remove the STL from the MAMB. As summarized in Table 4, the model without STL achieves PSNR values of only 32.01 dB on the Set5 dataset and 30.30 dB on the Manga109 dataset. These results underscore the critical importance of modeling long-range dependencies for high-quality image SR.

5. Conclusions

In this paper, we propose a lightweight and efficient super-resolution network, termed the multi-scale adaptive modulation network (MAMN), which leverages multi-scale adaptive modulation to enhance reconstruction performance for complex details. The core component, the multi-scale adaptive modulation layer (MAML), effectively captures multi-level information while dynamically adjusting feature contributions at different scales. To further improve the recovery of fine details, a local detail extraction layer (LDEL) is proposed. Moreover, we introduce a Swin Transformer Layer (STL) to strengthen long-range feature dependencies and improve contextual coherence. Extensive experimental results demonstrate that the proposed MAMN achieves a highly favorable balance between computational complexity and reconstruction quality across multiple benchmark datasets. However, the self-attention (SA) mechanism inherent in the STL still imposes high memory and computational demands, particularly when processing high-resolution images. Future work will focus on the development of architectural optimization and acceleration strategies (e.g., exploring low-rank attention mechanisms or adaptive token pruning) to substantially reduce the computational costs of Transformer-based models and enhance operational efficiency in HR applications.

Author Contributions

Z.L.’s main contribution was to propose the methodology of this work and write the paper. G.Z. guided the entire research process, participated in the writing of the paper, and secured funding for the research. J.T. participated in the implementation of the algorithm. R.Q. was responsible for algorithm validation and data analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01C461 and 2022D01C460), the “Tianshan Talents” Famous Teachers in Education and Teaching project of Xinjiang Uygur Autonomous Region (2025), and the “Tianchi Talents Attraction Project” of Xinjiang Uygur Autonomous Region (2024TCLJ04).

Data Availability Statement

The data are openly available in a public repository. The code is available at https://github.com/smilenorth1/MAMN-main (accessed on 30 June 2025).

Conflicts of Interest

The authors declare that they have no competing interests. There are no financial, personal, or professional relationships, affiliations, or circumstances that could be construed as influencing the research reported in this manuscript.

Nomenclature

The following nomenclature is used in this manuscript:
CNNs: Convolutional neural networks
SA: Self-attention
MAMN: Multi-scale adaptive modulation network
LR: Low-resolution
MAMB: Multi-scale adaptive modulation block
HR: High-resolution
MAML: Multi-scale adaptive modulation layer
ViT: Vision Transformer
LDEL: Local detail extraction layer
STL: Swin Transformer layer
SISR: Single-image super-resolution
σ²: Variance
D: Adaptive max pooling downsampling
N: Total number of pixels
U: Nearest interpolation upsampling
ϕ: GELU activation
MSA: Multi-head self-attention
MLP: Multi-layer perceptron
PSNR: Peak signal-to-noise ratio
MSE: Mean squared error
SSIM: Structural similarity index measure
MAX: Maximum pixel value
GAN: Generative adversarial network

References

  1. Donya, K.; Abdolah, A.; Kian, J.; Mohammad, H.M.; Abolfazl, Z.K.; Najmeh, M. Low-Cost Implementation of Bilinear and Bicubic Image Interpolation for Real-Time Image Super-Resolution. In Proceedings of the GHTC, Online, 29 October–1 November 2020; pp. 1–5. [Google Scholar] [CrossRef]
  2. Ruangsang, W.; Aramvith, S. Efficient super-resolution algorithm using overlapping bicubic interpolation. In Proceedings of the GCCE, Nagoya, Japan, 24–27 October 2017; pp. 1–2. [Google Scholar] [CrossRef]
  3. Dai, S.; Han, M.; Xu, W.; Wu, Y.; Gong, Y. Soft Edge Smoothness Prior for Alpha Channel Super Resolution. In Proceedings of the CVPR, Minneapolis, MN, USA, 23–28 June 2007; pp. 1–8. [Google Scholar] [CrossRef]
  4. Timofte, R.; De Smet, V.; Van Gool, L. A+: Adjusted Anchored Neighborhood Regression for Fast Super-Resolution. In Proceedings of the ACCV, Singapore, 1–5 November 2015; pp. 111–126. [Google Scholar] [CrossRef]
  5. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image Super-Resolution Via Sparse Representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef] [PubMed]
  6. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the ECCV, Zurich, Switzerland, 6–12 September 2014; pp. 184–199. [Google Scholar] [CrossRef]
  7. Ahn, N.; Kang, B.; Sohn, K. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018; pp. 252–268. [Google Scholar] [CrossRef]
  8. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the CVPRW, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar] [CrossRef]
  10. Sun, L.; Dong, J.; Tang, J.; Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the ICCV, Paris, France, 2–6 October 2023; pp. 13190–13199. [Google Scholar] [CrossRef]
  11. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the ICCVW, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  12. Zhou, Y.; Li, Z.; Guo, C.L.; Bai, S.; Cheng, M.M.; Hou, Q. SRFormer: Permuted Self-Attention for Single Image Super-Resolution. In Proceedings of the ICCV, Paris, France, 2–6 October 2023; pp. 12734–12745. [Google Scholar] [CrossRef]
  13. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016; pp. 391–407. [Google Scholar] [CrossRef]
  14. Zhang, X.; Zeng, H.; Zhang, L. Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices. In Proceedings of the ACMM, Virtual Event, China, 20–24 October 2021; pp. 4034–4043. [Google Scholar] [CrossRef]
  15. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the CVPR, Nashville, TN, USA, 10–25 June 2021; pp. 12299–12310. [Google Scholar] [CrossRef]
  16. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar] [CrossRef]
  17. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual aggregation transformer for image super-resolution. In Proceedings of the ICCV, Paris, France, 2–6 October 2023; pp. 12278–12287. [Google Scholar] [CrossRef]
  18. Dong, J.; Pan, J.; Yang, Z.; Tang, J. Multi-scale residual low-pass filter network for image deblurring. In Proceedings of the ICCV, Paris, France, 2–6 October 2023; pp. 12311–12320. [Google Scholar] [CrossRef]
  19. Park, N.; Kim, S. How Do Vision Transformers Work? In Proceedings of the ICLR, Virtual, 25–29 April 2022. [Google Scholar] [CrossRef]
  20. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  21. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar] [CrossRef]
  22. Kim, J.; Lee, J.; Lee, K. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  23. Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 105–114. [Google Scholar] [CrossRef]
  24. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks. In Proceedings of the ECCV Workshops, Munich, Germany, 8–14 September 2018; pp. 63–79. [Google Scholar] [CrossRef]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the NeurIPS, Red Hook, NY, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar] [CrossRef]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, Vienna, Austria, 3–7 May 2021. [Google Scholar] [CrossRef]
  27. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the ECCV, Tel Aviv, Israel, 23–27 October 2022; pp. 649–667. [Google Scholar] [CrossRef]
  28. Chen, X.; Wang, X.; Zhou, J.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the CVPR, Vancouver, BC, Canada, 18–22 June 2023; pp. 22367–22377. [Google Scholar] [CrossRef]
  29. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the ACMM, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar] [CrossRef]
  30. Luo, X.; Xie, Y.; Zhang, Y.; Qu, Y.; Li, C.; Fu, Y. Latticenet: Towards lightweight image super-resolution with lattice block. In Proceedings of the ECCV, Glasgow, UK, 23–28 August 2020; pp. 272–289. [Google Scholar] [CrossRef]
  31. Peng, C.; Shu, P.; Huang, X.; Fu, Z.; Li, X. LCRCA: Image super-resolution using lightweight concatenated residual channel attention networks. Appl. Intell. 2022, 52, 10045–10059. [Google Scholar] [CrossRef]
  32. Sun, L.; Pan, J.; Tang, J. Shufflemixer: An efficient convnet for image super-resolution. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022; pp. 17314–17326. [Google Scholar] [CrossRef]
  33. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the CVPRW, New Orleans, LA, USA, 19–20 June 2022; pp. 833–843. [Google Scholar] [CrossRef]
  34. Fang, J.; Lin, H.; Chen, X.; Zeng, K. A hybrid network of cnn and transformer for lightweight image super-resolution. In Proceedings of the CVPRW, New Orleans, LA, USA, 19–20 June 2022; pp. 1102–1111. [Google Scholar] [CrossRef]
  35. Tian, C.; Zhang, X.; Wang, T.; Zhang, Y.; Zhu, Q.; Lin, C.-W. A Heterogeneous Dynamic Convolutional Neural Network for Image Super-resolution. Image Video Process. 2024. [Google Scholar] [CrossRef]
  36. Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A Lightweight Self-Modulation Feature Aggregation Network for Efficient Image Super-Resolution. In Proceedings of the ECCV, Milan, Italy, 29 September–4 October 2024; pp. 359–375. [Google Scholar] [CrossRef]
  37. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar] [CrossRef]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  39. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Morel, M.l.A. Low-Complexity Single Image Super-Resolution Based on Nonnegative Neighbor Embedding. In Proceedings of the BMVC, Surrey, UK, 3–7 September 2012; pp. 1–10. [Google Scholar] [CrossRef]
  40. Zeyde, R.; Elad, M.; Protter, M. On Single Image Scale-Up Using Sparse-Representations. In Proceedings of the Curves and Surfaces, Avignon, France, 24–30 June 2010; pp. 711–730. [Google Scholar] [CrossRef]
  41. Arbeláez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour Detection and Hierarchical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 898–916. [Google Scholar] [CrossRef]
  42. Huang, J.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar] [CrossRef]
  43. Korhonen, J.; You, J. Peak signal-to-noise ratio revisited: Is simple beautiful? In Proceedings of the QoMEX, Melbourne, Australia, 5–7 July 2012; pp. 37–38. [Google Scholar] [CrossRef]
  44. Zhou, W.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  45. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar] [CrossRef]
  46. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017. [Google Scholar] [CrossRef]
  47. Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient image super-resolution using pixel attention. In Proceedings of the ECCVW, Glasgow, UK, 23–28 August 2020; pp. 56–72. [Google Scholar] [CrossRef]
  48. Zhang, K.; Zuo, W.; Zhang, L. Deep Plug-and-Play Super-Resolution for Arbitrary Blur Kernels. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019; pp. 1671–1681. [Google Scholar] [CrossRef]
  49. Gao, G.; Li, W.; Li, J.; Wu, F.; Lu, H.; Yu, Y. Feature distillation interaction weighting network for lightweight image super-resolution. In Proceedings of the AAAI, Virtual, 22 February–1 March 2022; pp. 661–669. [Google Scholar] [CrossRef]
  50. Li, F.; Cong, R.; Wu, J.; Bai, H.; Wang, M.; Zhao, Y. SRConvNet: A Transformer-Style ConvNet for Lightweight Image Super-Resolution. Int. J. Comput. Vis. 2025, 133, 173–189. [Google Scholar] [CrossRef]
  51. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X. Recursive Generalization Transformer for Image Super-Resolution. In Proceedings of the ICLR, Vienna, Austria, 7–11 May 2024. [Google Scholar] [CrossRef]
  52. Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the CVPR, Nashville, TN, USA, 19–25 June 2021; pp. 9195–9204. [Google Scholar] [CrossRef]
Figure 1. Model complexity and performance comparison between the proposed method and other lightweight methods on Manga109 [20] for × 4 SR. The left subplot illustrates the relationship between PSNR (a measure of image reconstruction quality), parameters, and FLOPs (a measure of computational complexity) for different SR methods. The sizes of circles in the figure denote the models’ FLOPs. The right subplot illustrates the relationship between PSNR and runtime (a measure of computational efficiency) for different SR methods. The proposed MAMN achieves a better trade-off between computational complexity (reflected by parameters, FLOPs, and runtime) and reconstruction performance (reflected by PSNR) compared to other lightweight SR models.
Figure 2. The upper figure shows the network architecture of the proposed MAMN. The MAMN comprises a 3 × 3 convolution layer, multi-scale adaptive modulation blocks (MAMBs), and an upsampler layer. The key component of an MAMB consists of a multi-scale adaptive modulation layer (MAML), a local detail extraction layer (LDEL), and two Swin Transformer Layers (STLs).
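The composition described in Figure 2 can be summarized structurally as in the following sketch. It is only a reading of the caption: the MAML, LDEL, and STL bodies are replaced by placeholder convolutions, the residual connections and PixelShuffle upsampler are assumptions, and the channel width and block count are illustrative rather than the authors' settings.

```python
import torch
import torch.nn as nn

class MAMB(nn.Module):
    """Structural sketch of a multi-scale adaptive modulation block:
    MAML (multi-scale non-local features) and LDEL (local high-frequency
    details) followed by two Swin Transformer Layers. The sub-modules
    below are placeholders, not the paper's actual layers."""
    def __init__(self, dim: int):
        super().__init__()
        self.maml = nn.Conv2d(dim, dim, 3, padding=1)   # placeholder for MAML
        self.ldel = nn.Conv2d(dim, dim, 3, padding=1)   # placeholder for LDEL
        self.stls = nn.Sequential(                      # placeholders for two STLs
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):
        x = x + self.ldel(self.maml(x))   # assumed residual composition
        return x + self.stls(x)

class MAMNSketch(nn.Module):
    """Shallow 3x3 conv -> stack of MAMBs -> PixelShuffle upsampler."""
    def __init__(self, dim: int = 36, n_blocks: int = 8, scale: int = 4):
        super().__init__()
        self.head = nn.Conv2d(3, dim, 3, padding=1)
        self.body = nn.Sequential(*[MAMB(dim) for _ in range(n_blocks)])
        self.tail = nn.Sequential(
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        feat = self.head(x)
        return self.tail(self.body(feat) + feat)  # global residual is an assumption

if __name__ == "__main__":
    y = MAMNSketch()(torch.randn(1, 3, 48, 48))
    print(y.shape)  # torch.Size([1, 3, 192, 192])
```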
Figure 3. Visual comparisons for ×3 SR on image061 from the Urban100 dataset.
Figure 4. Visual comparisons for ×4 SR on image099 from the Urban100 dataset.
Figure 5. Comparison of local attribution maps (LAMs) [52] and diffusion indices (DIs) [52]. The proposed MAMN can utilize more feature information and reconstruct a more accurate image structure.
Figure 6. The power spectral density (PSD) and feature map visualizations. Low-frequency components are shifted to the center of the spectrum, so a brighter central region indicates stronger low-frequency energy. The MAML activates more low-frequency components in feature F_m, while the LDEL enhances high-frequency representations in feature F_l.
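The spectrum maps in Figure 6 rely on a 2D Fourier transform whose zero-frequency (low-frequency) component is shifted to the center. A minimal sketch of that transformation for a single feature map is given below; the log scaling is our assumption for display purposes.

```python
import torch

def power_spectral_density(feat: torch.Tensor) -> torch.Tensor:
    """Log power spectrum of a 2D feature map with low frequencies centered.

    feat: (H, W) tensor, e.g., one channel of an intermediate feature.
    Returns a (H, W) map whose bright center indicates low-frequency energy.
    """
    spec = torch.fft.fft2(feat)
    spec = torch.fft.fftshift(spec)        # move DC / low frequencies to the center
    psd = spec.real ** 2 + spec.imag ** 2  # power = squared magnitude
    return torch.log1p(psd)                # log scale for display

# Example: a smooth ramp (low-frequency content) concentrates energy at the center.
h = w = 64
yy, _ = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
psd = power_spectral_density(yy)
print(bool(psd[h // 2, w // 2] > psd[0, 0]))  # True: the center (low frequencies) is brightest
```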
Figure 7. High-frequency components of the input, MAML, and LDEL feature maps across different radius ranges. The results show that the LDEL features contain more high-frequency components.
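Figure 7 reports how much spectral energy lies beyond a given radius from the centered origin. One way to compute such a high-frequency profile (our assumption of the measure, not the authors' script) is sketched below.

```python
import torch

def high_freq_ratio(feat: torch.Tensor, radius: float) -> float:
    """Fraction of spectral energy outside a normalized radius (0..1)
    measured from the centered (low-frequency) origin of a 2D map."""
    h, w = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat))
    power = spec.abs() ** 2
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    dist = torch.sqrt(yy ** 2 + xx ** 2)   # normalized distance from the spectrum center
    high = power[dist > radius].sum()
    return (high / power.sum()).item()

# A noisy map keeps more energy at large radii than a smooth one.
smooth = torch.linspace(0, 1, 64).repeat(64, 1)
noisy = torch.rand(64, 64)
print(high_freq_ratio(smooth, 0.25), high_freq_ratio(noisy, 0.25))
```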
Table 1. Comparative results of different CNN-based methods. PSNR/SSIM are calculated on the Y channel. The best and second-best results are marked in red and blue, respectively.
| Scale | Model | Params [K] | FLOPs [G] | Set5 | Set14 | BSD100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ×2 | Bicubic [1] | - | 0.029 | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403 | 30.80/0.9339 |
| | FSRCNN [13] | 12 | 6 | 36.99/0.9564 | 32.70/0.9093 | 31.50/0.8905 | 29.91/0.9020 | 36.44/0.9708 |
| | EDSR_baseline [8] | 1370 | 316 | 37.97/0.9605 | 33.61/0.9174 | 32.14/0.8993 | 31.98/0.9271 | 38.53/0.9769 |
| | CARN [7] | 1592 | 229 | 37.82/0.9601 | 33.59/0.9173 | 32.09/0.8985 | 31.96/0.9264 | 38.36/0.9765 |
| | IMDN [29] | 694 | 161 | 37.99/0.9605 | 33.67/0.9176 | 32.17/0.8994 | 32.17/0.9283 | 38.86/0.9773 |
| | PAN [47] | 261 | 71 | 37.99/0.9605 | 33.63/0.9179 | 32.16/0.8997 | 32.02/0.9272 | 38.68/0.9773 |
| | DPSR [48] | 1296 | 350 | 37.84/0.9601 | 33.55/0.9170 | 32.13/0.8993 | 31.91/0.9264 | 38.19/0.9764 |
| | LatticeNet [30] | 756 | 170 | 38.15/0.9610 | 33.78/0.9193 | 32.25/0.9005 | 32.43/0.9302 | - |
| | LCRCA [31] | 813 | 186 | 38.13/0.9610 | 33.69/0.9184 | 32.22/0.8999 | 32.36/0.9299 | - |
| | ShuffleMixer_base [32] | 394 | 91 | 38.00/0.9606 | 33.67/0.9179 | 32.15/0.8995 | 31.90/0.9256 | 38.80/0.9774 |
| | HNCT [34] | 360 | 82 | 38.08/0.9609 | 33.65/0.9184 | 32.23/0.9003 | 32.22/0.9296 | 38.87/0.9775 |
| | FDIWN [49] | 629 | 112 | 38.07/0.9608 | 33.75/0.9201 | 32.23/0.9003 | 32.40/0.9305 | 38.85/0.9774 |
| | HDSRNet [35] | 1820 | 291 | 37.94/0.9604 | 33.57/0.9169 | 32.13/0.8989 | 32.00/0.9266 | 38.30/0.9765 |
| | MAMN (Ours) | 302 | 80 | 38.12/0.9610 | 33.81/0.9194 | 32.28/0.9009 | 32.36/0.9302 | 39.21/0.9782 |
| ×3 | Bicubic [1] | - | 0.029 | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349 | 26.95/0.8556 |
| | FSRCNN [13] | 12 | 5 | 33.01/0.9142 | 29.53/0.8261 | 28.50/0.7890 | 26.40/0.8073 | 31.04/0.9217 |
| | EDSR_baseline [8] | 1555 | 160 | 34.37/0.9271 | 30.30/0.8416 | 29.08/0.8051 | 28.14/0.8525 | 33.44/0.9439 |
| | CARN [7] | 1592 | 119 | 34.33/0.9267 | 30.31/0.8414 | 29.06/0.8041 | 28.07/0.8500 | 33.50/0.9440 |
| | IMDN [29] | 703 | 76 | 34.36/0.9270 | 30.33/0.8415 | 29.09/0.8044 | 28.17/0.8519 | 33.60/0.9444 |
| | PAN [47] | 261 | 39 | 34.41/0.9272 | 30.37/0.8421 | 29.10/0.8049 | 28.10/0.8509 | 33.57/0.9447 |
| | DPSR [48] | 1296 | 194 | 34.36/0.9271 | 30.27/0.8417 | 29.09/0.8053 | 28.08/0.8512 | 33.30/0.9435 |
| | LatticeNet [30] | 765 | 76 | 34.53/0.9281 | 30.39/0.8424 | 29.15/0.8059 | 28.33/0.8538 | - |
| | LCRCA [31] | 822 | 84 | 34.51/0.9280 | 30.44/0.8432 | 29.15/0.8060 | 28.37/0.8558 | - |
| | ShuffleMixer_base [32] | 415 | 43 | 34.40/0.9272 | 30.37/0.8422 | 29.11/0.8051 | 28.08/0.8497 | 33.68/0.9447 |
| | HNCT [34] | 360 | 38 | 34.47/0.9278 | 30.44/0.8442 | 29.16/0.8072 | 28.29/0.8560 | 33.81/0.9461 |
| | FDIWN [49] | 645 | 52 | 34.52/0.9281 | 30.42/0.8438 | 29.14/0.8065 | 28.36/0.8567 | 33.77/0.9456 |
| | HDSRNet [35] | 2000 | 149 | 34.32/0.9268 | 30.28/0.8409 | 29.05/0.8041 | 28.01/0.8490 | 33.29/0.9431 |
| | MAMN (Ours) | 307 | 36 | 34.55/0.9284 | 30.55/0.8459 | 29.22/0.8082 | 28.43/0.8570 | 34.20/0.9478 |
| ×4 | Bicubic [1] | - | 0.029 | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577 | 24.89/0.7866 |
| | FSRCNN [13] | 12 | 5 | 30.73/0.8695 | 27.73/0.7592 | 27.00/0.7158 | 24.68/0.7313 | 28.00/0.8649 |
| | EDSR_baseline [8] | 1518 | 114 | 32.09/0.8938 | 28.58/0.7813 | 27.57/0.7358 | 26.03/0.7848 | 30.35/0.9067 |
| | CARN [7] | 1592 | 91 | 32.15/0.8948 | 28.61/0.7814 | 26.08/0.7845 | 26.07/0.7844 | 30.47/0.9084 |
| | IMDN [29] | 715 | 41 | 32.21/0.8948 | 28.58/0.7811 | 27.56/0.7353 | 26.04/0.7838 | 30.45/0.9075 |
| | PAN [47] | 261 | 22 | 32.13/0.8948 | 28.61/0.7822 | 27.60/0.7365 | 26.11/0.7854 | 30.51/0.9095 |
| | DPSR [48] | 1333 | 148 | 32.21/0.8956 | 28.68/0.7837 | 27.59/0.7365 | 26.15/0.7872 | 30.54/0.9097 |
| | LatticeNet [30] | 777 | 44 | 32.30/0.8962 | 28.68/0.7830 | 27.62/0.7367 | 26.25/0.7873 | - |
| | LCRCA [31] | 834 | 48 | 32.33/0.8963 | 28.68/0.7822 | 27.62/0.7357 | 26.23/0.7882 | - |
| | ShuffleMixer_base [32] | 411 | 28 | 32.21/0.8953 | 28.66/0.7827 | 27.62/0.7368 | 26.08/0.7835 | 30.65/0.9093 |
| | HNCT [34] | 370 | 22 | 32.30/0.8960 | 28.68/0.7833 | 27.64/0.7388 | 26.20/0.7900 | 30.70/0.9114 |
| | FDIWN [49] | 664 | 28 | 32.23/0.8955 | 28.66/0.7829 | 27.62/0.7380 | 26.27/0.7919 | 30.63/0.9098 |
| | HDSRNet [35] | 1970 | 108 | 32.14/0.8940 | 28.55/0.7804 | 27.56/0.7350 | 26.01/0.7832 | 30.36/0.9067 |
| | MAMN (Ours) | 314 | 21 | 32.35/0.8968 | 28.81/0.7856 | 27.70/0.7398 | 26.39/0.7929 | 31.04/0.9137 |
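As the captions of Tables 1 and 2 state, PSNR/SSIM are computed on the Y channel. The sketch below shows Y-channel PSNR under the usual BT.601 convention; the conversion constants and border cropping are standard practice in SR benchmarking, but the authors' exact evaluation script may differ.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Convert an RGB image in [0, 255] to the luma (Y) channel (BT.601)."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray, crop: int = 4) -> float:
    """PSNR on the Y channel; `crop` removes a scale-sized border, as is common."""
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if crop > 0:
        y_sr, y_hr = y_sr[crop:-crop, crop:-crop], y_hr[crop:-crop, crop:-crop]
    mse = np.mean((y_sr - y_hr) ** 2)
    return float(10.0 * np.log10(255.0 ** 2 / mse))

# Example with random 8-bit images (identical images would give infinite PSNR).
hr = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
sr = np.clip(hr.astype(int) + np.random.randint(-3, 4, hr.shape), 0, 255).astype(np.uint8)
print(f"{psnr_y(sr, hr):.2f} dB")
```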
Table 2. Comparative results of different attention-based methods (e.g., SA-based and dynamic modulation methods). PSNR/SSIM are calculated on the Y channel. The best and second-best results are marked in red and blue, respectively; our results are shown in bold.
| Scale | Model | Params [M] | FLOPs [G] | Set5 | Set14 | BSD100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ×2 | SAFMN_c36n8 [10] | 0.23 | 52 | 38.00/0.9605 | 33.59/0.9176 | 32.15/0.8994 | 31.85/0.9255 | 38.69/0.9771 |
| | SMFANet [36] | 0.19 | 41 | 38.08/0.9607 | 33.65/0.9185 | 32.22/0.9002 | 32.20/0.9282 | 39.11/0.9779 |
| | SRConvNet [50] | 0.39 | 74 | 38.00/0.9605 | 33.58/0.9186 | 32.16/0.8995 | 32.05/0.9272 | 38.87/0.9774 |
| | SwinIR [11] | 11.75 | 2301 | 38.42/0.9623 | 34.46/0.9250 | 32.53/0.9041 | 33.81/0.9427 | 39.92/0.9797 |
| | HAT [28] | 20.71 | 5554 | 38.63/0.9630 | 34.86/0.9274 | 32.62/0.9053 | 34.45/0.9466 | 40.26/0.9809 |
| | RGT [51] | 10.10 | 2255 | 38.59/0.9628 | 34.83/0.9271 | 32.62/0.9050 | 34.47/0.9467 | 40.34/0.9808 |
| | MAMN (Ours) | 0.30 | 80 | 38.12/0.9610 | 33.81/0.9194 | 32.28/0.9009 | 32.36/0.9302 | 39.21/0.9782 |
| ×3 | SAFMN_c36n8 [10] | 0.23 | 23 | 34.35/0.9268 | 30.34/0.8417 | 29.08/0.8048 | 27.94/0.8473 | 33.57/0.9437 |
| | SMFANet [36] | 0.19 | 19 | 34.42/0.9274 | 30.41/0.8430 | 29.16/0.8065 | 28.22/0.8523 | 33.96/0.9460 |
| | SRConvNet [50] | 0.39 | 33 | 34.40/0.9272 | 30.30/0.8416 | 29.07/0.8047 | 28.04/0.8500 | 33.56/0.9443 |
| | SwinIR [11] | 11.94 | 1026 | 34.97/0.9318 | 30.93/0.8534 | 29.46/0.8145 | 29.75/0.8826 | 35.12/0.9537 |
| | HAT [28] | 20.85 | 2499 | 35.07/0.9329 | 31.08/0.8555 | 29.54/0.8167 | 30.23/0.8896 | 35.53/0.9552 |
| | RGT [51] | 10.24 | 1015 | 35.15/0.9329 | 31.13/0.8550 | 29.55/0.8165 | 30.28/0.8899 | 35.55/0.9553 |
| | MAMN (Ours) | 0.31 | 36 | 34.55/0.9284 | 30.55/0.8459 | 29.22/0.8082 | 28.43/0.8570 | 34.20/0.9478 |
| ×4 | SAFMN_c36n8 [10] | 0.24 | 14 | 32.18/0.8948 | 28.60/0.7813 | 27.58/0.7360 | 25.97/0.7809 | 30.43/0.9063 |
| | SMFANet [36] | 0.20 | 11 | 32.25/0.8956 | 28.71/0.7833 | 27.64/0.7377 | 26.18/0.7862 | 30.82/0.9104 |
| | SRConvNet [50] | 0.38 | 22 | 32.18/0.8951 | 28.61/0.7818 | 27.57/0.7359 | 26.06/0.7845 | 30.35/0.9075 |
| | SwinIR [11] | 11.90 | 834 | 32.92/0.9044 | 29.09/0.7950 | 27.92/0.7489 | 27.45/0.8254 | 32.03/0.9260 |
| | HAT [28] | 20.82 | 1458 | 33.04/0.9056 | 29.23/0.7973 | 28.00/0.7517 | 27.97/0.8368 | 32.48/0.9292 |
| | RGT [51] | 10.20 | 592 | 33.12/0.9060 | 29.23/0.7972 | 28.00/0.7513 | 27.98/0.8369 | 32.50/0.9291 |
| | MAMN (Ours) | 0.31 | 21 | 32.35/0.8968 | 28.81/0.7856 | 27.70/0.7398 | 26.39/0.7929 | 31.04/0.9137 |
Table 3. Running time comparisons for ×4 SR. #Avg.Time is the average running time over 50 LR images of size 160 × 120 pixels.
| Method | DPSR [48] | LatticeNet [30] | HNCT [34] | ShuffleMixer [32] | SMFANet [36] | SRConvNet [50] | MAMN (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| #Avg.Time [s] | 0.169 | 0.120 | 0.083 | 0.211 | 0.034 | 0.062 | 0.066 |
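Runtime figures such as those in Table 3 are sensitive to warm-up and to asynchronous GPU execution. The sketch below shows a typical measurement loop with device synchronization; it illustrates the general protocol, not necessarily the exact one used for Table 3.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def average_runtime(model: nn.Module, lr_size=(1, 3, 120, 160), runs=50, warmup=10) -> float:
    """Average forward time in seconds over `runs` LR inputs of shape `lr_size` (N, C, H, W)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*lr_size, device=device)
    for _ in range(warmup):            # warm-up to exclude lazy initialization costs
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()       # wait for all queued GPU kernels before stopping the clock
    return (time.perf_counter() - start) / runs

# Placeholder model; prints the average time per 160x120 LR image.
print(f"{average_runtime(nn.Conv2d(3, 3, 3, padding=1)):.4f} s")
```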
Table 4. Performance analysis of the proposed MAMN and its variants on the Set5 and Manga109 datasets for ×4 SR. Parameters and FLOPs are computed for an input resolution of 320 × 180 pixels. The baseline's results are highlighted in bold.
| Ablation | Variant | Params [K] | FLOPs [G] | Set5 | Manga109 |
| --- | --- | --- | --- | --- | --- |
| Baseline | - | 314 | 21 | 32.35/0.8968 | 31.04/0.9137 |
| Main module | MAML → None | 276 | 19 | ↓0.08/↓0.0011 | ↓0.17/↓0.0018 |
| | LDEL → None | 240 | 17 | ↓0.13/↓0.0021 | ↓0.36/↓0.0043 |
| | STL → None | 119 | 6 | ↓0.34/↓0.0038 | ↓0.74/↓0.0079 |
| MAML | w/o FA | 303 | 20 | ↓0.06/↓0.0008 | ↓0.07/↓0.0009 |
| | w/o FM | 314 | 21 | ↓0.02/↓0.0004 | ↓0.03/↓0.0003 |
| | w/o MC | 308 | 21 | ↓0.02/↑0.0001 | ↓0.04/↓0.0006 |
| | w/o Down | 314 | 21 | ↓0.01/↓0.0002 | ↓0.03/↓0.0004 |
| | AdaptiveMaxPool → AdaptiveAvgPool | 314 | 21 | ↓0.10/↓0.0014 | ↓0.02/↓0.0002 |
| | AdaptiveMaxPool → Nearest interpolation | 314 | 21 | ↓0.07/↓0.0008 | ↓0.12/↓0.0015 |
| | w/o VM | 314 | 21 | ↓0.06/↓0.0009 | ↓0.06/↓0.0007 |
| | VM → Self-attention | 498 | 38 | ↑0.05/↑0.0004 | ↑0.07/↑0.0005 |
| Loss | w/o FFTLoss | 314 | 21 | ↓0.04/↓0.0005 | ↓0.01/↓0.0002 |
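The last ablation row removes an FFT loss term. A frequency-domain L1 penalty of the kind commonly paired with a pixel-domain L1 loss in efficient SR models might look like the sketch below; the exact formulation and weighting used in the paper are not specified here, so both are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFTL1Loss(nn.Module):
    """L1 distance between the 2D Fourier spectra of SR and HR images,
    added to a pixel-domain L1 loss with a small weight (hypothetical value)."""
    def __init__(self, weight: float = 0.05):
        super().__init__()
        self.weight = weight

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        sr_freq = torch.fft.rfft2(sr, norm="ortho")
        hr_freq = torch.fft.rfft2(hr, norm="ortho")
        # L1 on complex spectra, i.e., the mean magnitude of the spectral difference.
        freq_l1 = (sr_freq - hr_freq).abs().mean()
        return F.l1_loss(sr, hr) + self.weight * freq_l1

# Usage sketch on random tensors standing in for SR outputs and HR targets.
loss_fn = FFTL1Loss()
sr, hr = torch.rand(2, 3, 48, 48), torch.rand(2, 3, 48, 48)
print(loss_fn(sr, hr).item())
```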