REMA: A Rich Elastic Mixed Attention Module for Single Image Super-Resolution

Detail preservation is a major challenge for single image super-resolution (SISR). Many deep learning-based SISR methods focus on lightweight network design, but these may fall short in real-world scenarios where performance is prioritized over network size. To address these problems, we propose a novel plug-and-play attention module, rich elastic mixed attention (REMA), for SISR. REMA comprises the rich spatial attention module (RSAM) and the rich channel attention module (RCAM), both built on Rich Structure. Based on the results of our research on the module’s structure, size, performance, and compatibility, Rich Structure is proposed to enhance REMA’s adaptability to varying input complexities and task requirements. RSAM learns the mutual dependencies of multiple LR-HR pairs and multi-scale features, while RCAM accentuates key features through interactive learning, effectively addressing detail loss. Extensive experiments demonstrate that REMA significantly improves performance and compatibility in SR networks compared to other attention modules. The REMA-based SR network (REMA-SRNet) outperforms comparative algorithms in both visual effects and objective evaluation quality. Additionally, we find that module compatibility correlates with cardinality and in-branch feature bandwidth, and that networks with high effective parameter counts exhibit enhanced robustness across various datasets and scale factors in SISR.


Introduction
Single image super-resolution (SISR) aims to rebuild a high-resolution (HR) image from its low-resolution (LR) counterpart. It is widely used in digital multimedia, facial recognition, remote sensing image restoration, medical image processing, and other domains [1], and many SISR algorithms have been proposed, including interpolation, reconstruction, algebraic characteristics, and learning-based methods [2,3]. In recent years, there have been remarkable advancements in deep learning-based SISR algorithms. However, one of the major challenges of deep learning-based algorithms is high-frequency detail preservation. Numerous studies have proposed diverse algorithms to address this challenge, including residual learning [4,5], recursive structures [6][7][8], dense connections [9][10][11], and multi-path learning [12,13]. More recently, attention-based algorithms have gained prominence, notably after the rise of Transformer-based algorithms. Indeed, plenty of studies have proposed attention-based SISR methods [14][15][16][17] to restore details. Most studies prefer to design a specific SISR network utilizing attention rather than a plug-and-play attention module to improve reconstruction quality, resulting in a lack of flexibility. Only a few researchers have proposed flexible attention modules for SR tasks [18][19][20], beyond directly plugging classic attention modules into SR networks [21,22].
In fact, many researchers have focused solely on proposing size-oriented attention modules that enhance performance without increasing, or while even reducing, model complexity. However, in real-world scenarios, a significant number of tasks prioritize performance over size, rather than demanding low-complexity modules with limited performance improvement. Therefore, to address the requirements of various tasks effectively, a flexible module should encompass both size-oriented and performance-oriented characteristics, aspects that are rarely discussed. Moreover, according to our experimental results, some size-oriented modules may function effectively within one network, yet their compatibility with other networks is not guaranteed. This raises a more general question in deep learning: why does a plug-and-play module work in one network but not in another? And what factors influence the performance of a plug-and-play module? These issues are not well studied.
To address the challenges mentioned above, we identify the influential factors affecting the performance of a plug-and-play module and propose rich elastic mixed attention (REMA), a plug-and-play attention module for SISR. For module flexibility, we propose Rich Structure, which allows seamless switching between size-oriented and performance-oriented modes to accommodate various requirements and ensure compatibility with different networks. Rich Structure serves as the basic structure of REMA.
From the attention module's perspective, it is essential to identify the key features affecting SR quality. Thus, we divide SISR into two steps: (1) upsampling LR images to the target size; and (2) minimizing the difference between the resized image and the ground-truth image, succinctly referred to as 'upscaling' and 'denoising'. An effective attention module should highlight key features throughout this process. Building upon the structure of, and inspiration from, CBAM [23], REMA enhances key feature representation in these steps from the spatial and channel aspects by enriching the in-module feature pass-through. Using the proposed Rich Structure, REMA can seamlessly switch between size-oriented and performance-oriented modes, ensuring flexibility for different requirements by controlling the bandwidth of the in-module feature pass-through.
To evaluate the effectiveness of REMA, we integrate it into our proposed modified EDSR [4] and name the resulting model REMA-SRNet. Extensive experiments are conducted on commonly used SR benchmarks. We compare REMA with other comparative algorithms and plug-and-play attention modules. The results demonstrate the effectiveness of Rich Structure, REMA, and REMA-SRNet.
In summary, the main contributions of this paper are as follows:

• We identify the key factors affecting the performance of a plug-and-play module and propose Rich Structure, enabling seamless switching between size-oriented and performance-oriented modes for a plug-and-play module to satisfy the diverse needs of different tasks.

• We propose a SISR attention module based on Rich Structure, called REMA, consisting of RSAM and RCAM. RSAM employs a creative method to enhance performance by learning the LR-HR mapping mode and multi-scale feature fusion. RCAM enhances overall performance via interactive learning, reducing the noise caused by upsampling operations and the dimension-resolution changes introduced by convolution operations. REMA can be easily integrated into networks with various architectures and significantly improves detail reconstruction accuracy at different scale factors.

• Extensive experiments demonstrate that REMA can carry a simple ResNet-backbone SR network to the state of the art while balancing performance and model size. Moreover, the impact of the number of parameters on a module's effectiveness and on the overall network's robustness across different datasets and scale factors is comprehensively discussed in the experiments.
The remainder of this paper is organized as follows: Section 2 provides a brief overview of related work on deep learning-based SISR networks, attention modules, and attention-based SR models. In Section 3, we detail our proposed REMA, including problem analysis, overall structural design, and module architecture. Section 4 validates the effectiveness of our method, compares its performance with existing alternatives, and highlights its significant advantages. Finally, Section 5 summarizes the study and outlines directions for future work.

Deep Learning-Based SR Methods
SRCNN is the first CNN-based end-to-end SISR network [24]. It interpolates the input image to the target size and employs three convolution layers to learn the LR-HR non-linear mapping. SRCNN preserves more details than traditional methods, leading to its widespread adoption. Subsequently, CNN-based SISR methods have gained popularity. Examples include ESPCN [25] and FSRCNN [26], which take LR images as inputs directly to reduce complexity and increase network speed. ESPCN uses sub-pixel convolutional layers as reconstruction layers, while FSRCNN employs deconvolution layers for HR reconstruction.
To enhance performance, many researchers have integrated techniques such as residual learning, dense connections, recursive structures, and multi-scale or multi-level fusion into their networks. For instance, Kim et al. proposed VDSR [5], which deepens the network through residual learning and gradient clipping to improve reconstruction quality. EDSR [4] employs more residual blocks without batch normalization layers to deepen the network and utilizes pixel shuffle to optimize reconstruction performance. Methods like DRRN [7] and DRCN [6] introduce recursive structures to share parameters among layers and deepen the network without significantly increasing the model size. Others, such as RCAN [27], implement a cascading mechanism on a residual network to reuse hierarchical features and balance the number of parameters and accuracy.
Additionally, MSRN [28] creates two sub-branches and uses convolutions of different sizes in a residual block, fusing features interactively to obtain multi-scale features. The multi-scale dense convolutional network (MDCN) [9] densely connects each layer in multi-scale residual blocks to fully utilize multi-scale features within the block. Moreover, ESRGCNN [29] adapts group convolutional residual blocks for multi-level feature fusion and computational cost reduction. UNetSR [30] directly realizes shallow-deep feature fusion via skip connections, akin to the U-Net architecture.
According to these studies, dense connections, recursive learning, multi-scale or multi-level feature fusion, and other techniques share a common goal: to efficiently create and learn features at different scales within the backbone structure, a critical aspect of improving CNN-based SISR algorithms.

Attention and Attention-Based SR Models
Attention is a method used to recalibrate the weights of input features in deep learning, aiding models in focusing on key features. Attention-based modules find wide application in various computer vision tasks. The squeeze-and-excitation (SE) block [31] was introduced to adjust informative features within channels. Woo et al. [23] proposed the convolutional block attention module (CBAM), incorporating both channel and spatial attention to adjust feature weights. Coordinate attention (CA) [32] embeds positional information into channel attention, facilitating the capture of long-range dependencies while preserving precise positional information.
Attention-based methods are also prevalent in SISR tasks. RCAN [27] implements a residual-in-residual (RIR) structure with channel attention, enhancing performance by fusing high- and low-frequency features via skip connections. DRLN [10] combines densely connected layers with residual blocks and incorporates a Laplacian pyramid attention mechanism to enhance image quality. The multi-scale feature fusion block (MSFFB), used with a multi-scale channel and spatial attention module (CSAM) in MCSN [33], facilitates multi-scale feature representation learning, enhancing the feature selection ability of the channel attention module. PAN [18] employs a pixel attention module in the backbone and upscale layers, generating a 3D attention map at the pixel level to improve performance with fewer parameters. PRRN [34] incorporates a progressive representation recalibration block to extract meaningful features by utilizing pixel and channel information and employing a shallow channel attention mechanism for efficient channel importance learning. RNAN [35] proposes residual non-local attention to obtain non-local hybrid attention, further enhancing performance by adaptively adjusting the interdependence between feature channels. Dynamic attention, as used in the attention in attention network (A2N) [36], comprises non-attention branches and composite attention branches to dynamically suppress unnecessary attention adjustments. The non-local sparse attention network (NLSN) [20] optimizes the computational cost of non-local attention via sparse attention. SwinIR [16] and Swin2SR [17] construct networks based on the Vision Transformer, achieving superior performance.
Few studies focus on plug-and-play attention modules for SISR tasks. Wang et al. [19] proposed the lightweight attention module BAM to suppress large-scale feature edge noise while retaining high-frequency features, which is the research most relevant to our topic. BAM includes the adaptive context attention module (ACAM) for noise reduction and the multi-scale spatial attention module (MSAM) for preserving high-frequency details.

Motivation and the Overall Framework
In our proposed module, the objective is to cater to the requirements of both performance-prioritized and size-prioritized tasks. Therefore, the initial focus is on maximizing performance to meet the demands of performance-prioritized tasks. Subsequently, efforts are directed toward controlling the module size to align with the needs of size-prioritized tasks. Consequently, parameter-friendly designs are deliberately not considered during the initial stage of the module design process. This concept permeates the entire module design, distinguishing our approach from others that opt for lightweight structures directly. However, this does not mean module size is unimportant to us; rather, it is a matter of parameter efficiency. A parameter-efficient module should not only trade fewer parameters for a limited performance gain but also be able to boost performance with more parameters, achieving parameter efficiency globally. 'Rich Structure' is proposed for this purpose. Table 1 lists the definitions of the terms, abbreviations, and symbols used in the following text.

Module with Rich Structure
For a module, flexibility involves more than just being plug-and-play; it also involves robustness across different datasets and compatibility with networks of varying characteristics. Identifying the influential factors related to these aspects is crucial. Our experiments reveal that the key factors affecting plug-and-play module performance include the overall shape (cardinality, channel bandwidth, and depth) and task-specific effective algorithms. Hence, we propose REMA based on these considerations.
Current plug-and-play attention modules can be categorized into two types based on cardinality (the number of branches with feature transformation): single-branch modules like CBAM, SE, and PA, and multi-branch modules like CA and BAM. However, our experiments show that single-branch modules, which we define as having a plain structure, exhibit less performance improvement than most multi-branch modules when facing input features with higher complexity. Thus, our method is designed as a multi-branch structure to ensure compatibility.
Attention modules with multiple branches, such as Inception-like [37], ResNeXt-like [38], or Res2Net-like blocks [39], may encounter challenges related to size-oriented designs, leading to reduced robustness and overall performance across various scale factors in the SISR task. These modules adopt a similar approach to parameter control. For instance, prevalent Inception-like modules split the input feature maps along the channel dimension, transform the features, and then concatenate them for fusion. Likewise, ResNeXt and Res2Net employ bottleneck or grouped convolution to split, transform, and aggregate or concatenate features in the final stage. They all follow a 'split-transform-aggregate/concatenate' structure to balance performance and module size, utilizing the bottleneck structure to split input features. Additionally, single-branch attention modules utilize this structure to adjust their size. Figures 1 and 2 illustrate how these methods split features or control module sizes using the dimension reduction ratio (r). In other words, the 'bottleneck' structure can itself become a performance bottleneck under certain conditions. However, the issue does not lie solely with the bottleneck structure. The real concern that deserves more attention is why the focus remains solely on dimension reduction, or in other words, why finding a ratio that minimizes model size while maintaining performance is the predominant research direction. What would happen if a similar bottleneck structure were employed but with increased dimensions, i.e., widening the channel bandwidth for feature pass-through rather than reducing it? Only a few studies have addressed this question, such as [40,41]; their authors approached the topic from the perspective of the entire backbone, comparing widened residual and Inception blocks with deeper backbones and demonstrating the effectiveness of widening the channel bandwidth. Our experiments also prove this from the module perspective. In other words, switching between size-oriented and performance-oriented modules could be unified within the same framework.
Therefore, Rich Structure is proposed as a multi-branch structure with a bi-directionally adjustable channel bandwidth of features in each branch (Figure 2). Specifically, instead of using a 'split-transform-concatenate/aggregate' structure, our proposed method directly copies or rescales the inputs to different scales, transforms features in each branch, and then aggregates them together; in other words, the structure is 'copy/rescale-transform-aggregate'. The overall width of the features in our module is therefore much larger, and the module appears 'fatter', hence the name Rich Structure. On the other hand, dimension reduction (C/r, r ∈ [1, +∞)) is replaced with the proposed elastic adjuster (C × R, R ∈ (0, +∞)). When R ∈ (0, 1), the module functions akin to a 'split-transform-concatenate/aggregate' structure to fulfill the requirements of size-prioritized tasks. Conversely, when R ∈ [1, +∞), the module utilizes additional parameters to enhance performance. Thus, with the help of Rich Structure, REMA can seamlessly switch between size-oriented and performance-oriented modes to ensure flexibility for different requirements.
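As a rough sketch of the idea (function name hypothetical), the elastic adjuster can be read as replacing the bottleneck's divisor r with a multiplier R on the in-branch channel bandwidth:

```python
def branch_width(c_in: int, ratio: float) -> int:
    """In-branch channel bandwidth under the elastic adjuster.

    Classic bottleneck designs compute C / r with r >= 1 (shrink only);
    the elastic adjuster computes C * R with R in (0, +inf), so the same
    knob covers both size-oriented (R < 1) and performance-oriented
    (R >= 1) modes.
    """
    if ratio <= 0:
        raise ValueError("R must be positive")
    return max(1, round(c_in * ratio))

# Size-oriented mode: R = 0.25 behaves like a bottleneck with r = 4.
assert branch_width(64, 0.25) == 16
# Performance-oriented mode: R = 2 widens the branch instead.
assert branch_width(64, 2.0) == 128
```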

Rich Elastic Mixed Attention (REMA)
As mentioned above, the performance-related factors include the shape of the module and task-specific effective algorithms. For the former, we design Rich Structure to ensure compatibility with inputs of varying complexity and flexibility for different tasks. However, the former is far less important than the latter. Thus, RSAM and RCAM are designed based on the characteristics of SISR, and Rich Structure amplifies their effectiveness. RSAM and RCAM function like miniature SR networks within REMA.
The goal of deep learning-based SISR tasks is to minimize the difference between the reconstructed image and the real HR image, which can be expressed by the following formula [42]:

θ̂_F = arg min_{θ_F} L(I_SR, I_y) + λΦ(θ_F)    (1)

where θ_F denotes the parameters of the SR model F, L denotes the loss between the reconstructed image I_SR and the ground-truth HR image I_y, and θ̂_F denotes the model parameters that minimize L. Φ(θ_F) is the regularization term, and λ serves as the trade-off parameter employed to adjust the proportion of the regularization term. In other words, the purpose of deep learning-based SISR models is to find the θ_F that makes I_SR as close to I_y as possible.
From the module perspective, the key is to identify the features that deserve more attention during the process mentioned above. To simplify the problem, we decompose the HR reconstruction process into two steps: upscaling the LR image to the target size, and eliminating the difference in details between the upscaled image and the real HR image. The process can be expressed by the following formula:

I_y = I_LR ⊗ M_up ⊕ D_HR    (2)

where I_LR refers to the low-resolution image and M_up is the LR-HR upscale mapping mode. D_HR denotes the difference between the upscaled LR image I_LR ⊗ M_up and I_y.
Obviously, the key to high-quality HR image reconstruction lies in the accurate estimation of M_up and D_HR. Therefore, inspired by CBAM, which enhances feature representation from both spatial and channel aspects, we propose a rich spatial attention module (RSAM) and a rich channel attention module (RCAM) to improve SISR network performance. Unlike CBAM, we eschew lightweight design and instead increase the cardinality, the in-branch channel dimensions, and the depth. Specifically for SISR tasks, the inadequacy of CBAM and other lightweight attention modules results in a lack of sufficient space for feature maps of various resolutions to learn interactively, which is crucial for M_up and D_HR estimation. Since learning the LR-HR mapping involves avoiding the loss of details due to resolution changes, there should be at least one pair of feature maps with different resolutions. Therefore, a multi-branch structure is employed in both RSAM and RCAM to enrich the in-module feature pass-through and aid SISR networks in learning M_up and D_HR. On the other hand, a multi-branch structure also ensures better multi-scale and multi-level feature fusion for enhancing long-range dependency learning [43], which has already been proven effective in other studies. Thus, we combine multi-scale fusion, LR-HR interactive learning, and attention mechanisms to propose REMA. To verify the effectiveness of REMA for SISR tasks, we apply REMA to a simple ResBlock-based backbone SISR network named REMA-SRNet and compare it with other methods. We apply RSAM and RCAM in parallel in the ResBlock to enhance the backbone performance. Additionally, we fuse features from the LR image and integrate REMA into the reconstruction block to improve performance at high scale factors. The detailed structures of REMA and REMA-SRNet are illustrated in Figure 3. The reconstruction layers utilize bilinear upsampling followed by a 3 × 3 convolution and Leaky ReLU layers. REMA is applied at 4× and 8× upscaling, with a long skip connection from the bilinear-upscaled LR input, followed by a 1 × 1 convolution for dimension alignment. SF denotes the scale factor. For 2×, 4×, and 8× reconstruction, the number of REMA ResBlocks is 16 and the number of reconstruction blocks is 1, 2, and 3, respectively.
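One plausible reading of this 1/2/3 schedule is that each reconstruction block performs a single 2× enlargement, so the block count equals log2 of the scale factor; a minimal sketch under that assumption (function name hypothetical):

```python
import math

def num_reconstruction_blocks(scale_factor: int) -> int:
    # Assumption: each reconstruction block performs one 2x bilinear
    # upsample (followed by a 3x3 convolution and Leaky ReLU), so an
    # sf-times enlargement needs log2(sf) blocks: 1, 2, 3 for 2x, 4x, 8x.
    if scale_factor < 2 or scale_factor & (scale_factor - 1):
        raise ValueError("scale factor must be a power of two >= 2")
    return int(math.log2(scale_factor))

assert [num_reconstruction_blocks(s) for s in (2, 4, 8)] == [1, 2, 3]
```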

Rich Spatial Attention Module (RSAM)
RSAM aims to enhance long-range feature extraction and non-linear LR-HR mapping mode (M_up) learning through dynamic multi-scale feature fusion with spatial attention. The main difference between RSAM and other widely used multi-scale feature fusion methods lies in the construction of the feature pyramid. As shown in Figure 4, in contrast to methods that utilize convolutions with different kernel sizes [44], or lightweight convolutions such as dilated or factorized convolution [45], to learn and fuse features from the same feature maps, RSAM constructs the feature pyramid from rescaled input feature maps based on the scale factor. Thus, regardless of how the scale factor changes, RSAM can learn M_up correctly. Within this module, RSAM generates three LR-HR pairs and scans them with receptive fields of the same size to fuse features. The advantage of our method is that such a design obtains multi-scale features while preserving the complete structural information of the LR-HR mapping, which is key to SISR tasks.
Specifically, RSAM dynamically upsamples and downsamples the input according to the scale factor. Following this, two sub-branches are created to accommodate each additional scale of the input. The rescaled feature maps in all branches are then scanned by a 3 × 3 convolution to acquire multi-scale features. Subsequently, RSAM generates a total of three sets of LR-HR mapping information. Assuming the scale factor is 2×, the generated mapping pairs are 2×, 2×, and 4×, as depicted in Figure 4. Finally, attention maps for the three scales are generated along the spatial dimensions, and the features from each branch are adjusted and enhanced for the LR-HR mapping mode. Further details are provided in Figure 5. The entire process is formulated as follows:

F_RSA^sf = c((cr(F) ⊗ M_main^sf) ⊕ (cr(F_up^sf) ⊗ M_up^sf) ⊕ (cr(F_dn^sf) ⊗ M_dn^sf)),
with M_i^sf = σ(c(AdpMixedPool_i(cr(F_i)))) for each pathway i ∈ {main, up, dn}    (3)

The input feature map is F ∈ R^(C×H×W). RSAM first resizes the input according to the scale factor. The upsampled and downsampled versions of F are represented by F_up^sf ∈ R^(C×(H×sf)×(W×sf)) and F_dn^sf ∈ R^(C×(H/sf)×(W/sf)), respectively, obtained through upsampling µ and downsampling η via bilinear interpolation, where sf denotes the target scale factor of the sampling operation (e.g., 2×, 4×, or 8× in our experiments). After a 3 × 3 convolution layer and ReLU activation cr(·), each pathway employs adaptive average- and max-pooling operations with scale adjustment, followed by concatenation along the channel axis with scale recovery (AdpMixedPool(·), AdpMixedPool_µ(·), and AdpMixedPool_η(·)). Subsequently, a 1 × 1 convolution layer c(·) and a Sigmoid operation σ generate the 2D spatial attention maps M_main^sf ∈ R^(H×W), M_up^sf ∈ R^(H×W), and M_dn^sf ∈ R^(H×W) for each pathway. Element-wise multiplication ⊗ is applied, and the output of each branch is fused via element-wise addition ⊕, yielding the refined output F_RSA^sf of the input F after the dimension is recovered by a 1 × 1 convolution c(·). The channel-bandwidth adjustment is performed by the first and last convolution layers. R denotes the ratio of the elastic adjuster: the output channel count of the first convolution operation is C × R, and the last convolution layer restores the channel count to that of the input.
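A minimal PyTorch sketch of RSAM's three-branch layout may help fix the data flow; all names are hypothetical, and the adaptive mixed pooling with scale adjustment is simplified to plain channel-pooled statistics:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RSAMSketch(nn.Module):
    """Sketch of RSAM: three branches at sf-up, original, and sf-down
    resolutions, each with a 3x3 receptive field and a 2D spatial
    attention map, fused by element-wise addition. Simplified, not the
    paper's exact module."""

    def __init__(self, channels: int, scale_factor: int = 2, ratio: float = 1.0):
        super().__init__()
        mid = max(1, int(channels * ratio))  # elastic adjuster: C * R
        self.sf = scale_factor
        self.expand = nn.Conv2d(channels, mid, 1)    # widen bandwidth
        self.branch = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU())
             for _ in range(3)]
        )
        self.to_map = nn.ModuleList([nn.Conv2d(2, 1, 1) for _ in range(3)])
        self.restore = nn.Conv2d(mid, channels, 1)   # restore bandwidth

    def _attend(self, x, idx):
        feat = self.branch[idx](x)
        # channel-pooled avg/max statistics -> 2D spatial attention map
        stats = torch.cat([feat.mean(1, keepdim=True),
                           feat.amax(1, keepdim=True)], dim=1)
        return feat * torch.sigmoid(self.to_map[idx](stats))

    def forward(self, x):
        h, w = x.shape[-2:]
        y = self.expand(x)
        main = self._attend(y, 0)
        up = self._attend(F.interpolate(y, scale_factor=self.sf,
                                        mode="bilinear", align_corners=False), 1)
        dn = self._attend(F.interpolate(y, scale_factor=1.0 / self.sf,
                                        mode="bilinear", align_corners=False), 2)
        # bring every branch back to the input resolution before fusion
        up = F.interpolate(up, size=(h, w), mode="bilinear", align_corners=False)
        dn = F.interpolate(dn, size=(h, w), mode="bilinear", align_corners=False)
        return self.restore(main + up + dn)
```

The output keeps the input's shape, so the sketch is drop-in inside a residual block; setting `ratio` below 1 reproduces the size-oriented mode.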

Rich Channel Attention Module (RCAM)
After completing the learning process of upscaling, during the denoising stage, RCAM focuses on the pixels that differ from the ground-truth image after rescaling to the same size, aiming to effectively capture such features to minimize D_HR and to highlight these features along the channel dimensions during channel changes. Similar to RSAM, RCAM creates a sub-branch that downscales the input. Additionally, in this sub-branch, the number of channels is adjusted alongside the scale, as differences may arise from both rescale and convolution operations. This sub-branch establishes a middle level between layers for learning multi-level features interactively. Moreover, from a super-resolution perspective, this sub-branch offers an intermediate layer for progressive sampling, enhancing reconstruction quality at high scale factors, a capability not offered by other channel-related attention modules (e.g., CAM and SE) (Figure 6). Further details are provided in Figure 7. The entire RCAM process is formulated as follows:

F_Sub = cr(η(F))    (10)
f_D = f_Main_cr3 ⊖ f_Sub_cr3
M_RCA = σ(FC(ReLU(FC(AvgPool_1×1(f_D)))))
F_RCA = F ⊗ M_RCA

In our experiment, RCAM resizes the input feature F ∈ R^(C×H×W) and utilizes a Convolution-ReLU (CR) layer to create F_Sub ∈ R^(C/2×H/2×W/2) for the sub-branch. Subsequently, a CR layer cr(·) is employed for feature extraction, adjusting the scale and channel number to match the feature maps of the main pathway. Simultaneously, F_Main is filtered by three CR layers to retain features at the original resolution. The intermediate feature maps of the two pathways are denoted f_Main_cr3 and f_Sub_cr3, respectively. Following this, the features that exhibit significant differences (or noise), f_D, when the resolution changes are obtained via an element-wise subtraction operation ⊖. Subsequently, the spatial dimensions are compressed to 1 × 1 using adaptive average pooling AvgPool_1×1(·), followed by FC-ReLU-FC layers and the Sigmoid function σ to generate the attention map M_RCA of f_D, resulting in F_RCA as the adjusted output of the input F.
Like RSAM, the channel-bandwidth adjustment is performed by the first and last convolution layers. R denotes the ratio of the elastic adjuster: the output channel count of the first convolution operation is C × R, and the last convolution layer restores the channel count to that of the input.
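A minimal PyTorch sketch of the main/sub-branch interaction described above (names hypothetical; layer sizes simplified, and the elastic adjuster omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCAMSketch(nn.Module):
    """Sketch of RCAM: a sub-branch halves resolution and channel count,
    features from both branches are aligned and subtracted to expose the
    resolution-change "noise" f_D, and a squeeze + FC-ReLU-FC + sigmoid
    stack turns f_D into channel attention. Simplified, not the paper's
    exact module."""

    def __init__(self, channels: int):
        super().__init__()
        half = max(1, channels // 2)
        hidden = max(1, channels // 4)
        # main pathway: three Convolution-ReLU (CR) layers at full resolution
        self.main = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(3)
        ])
        # sub pathway: halve resolution/channels, then restore both
        self.sub_in = nn.Sequential(nn.Conv2d(channels, half, 3, padding=1), nn.ReLU())
        self.sub_out = nn.Sequential(nn.Conv2d(half, channels, 3, padding=1), nn.ReLU())
        # squeeze + FC-ReLU-FC -> per-channel attention weights
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                nn.Linear(hidden, channels))

    def forward(self, x):
        h, w = x.shape[-2:]
        f_main = self.main(x)
        f_sub = self.sub_in(F.interpolate(x, scale_factor=0.5,
                                          mode="bilinear", align_corners=False))
        f_sub = F.interpolate(self.sub_out(f_sub), size=(h, w),
                              mode="bilinear", align_corners=False)
        f_d = f_main - f_sub  # "noise" exposed by the resolution change
        squeezed = F.adaptive_avg_pool2d(f_d, 1).flatten(1)
        att = torch.sigmoid(self.fc(squeezed)).view(x.size(0), -1, 1, 1)
        return x * att
```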

REMA-Based Backbone
As discussed, efficiently extracting features at various scales in the backbone is crucial for CNN-based SISR algorithms. In REMA-SRNet, REMA is integrated into the residual blocks (REMA ResBlocks) of the backbone. As depicted in Figure 8, during feature extraction, the input features of each residual block layer pass through the block and iteratively generate LR-HR image pairs within the layer. Compared with connection-based algorithms that obtain features at various scales through connection transfer, the REMA-based backbone provides richer features at diverse resolutions.

Experiments

Implementation Details and Datasets
To assess the effectiveness of Rich Structure, REMA, and REMA-SRNet, we employ images from [46] for training and validation, following DIV2K's default split. Evaluation metrics include the peak signal-to-noise ratio (PSNR, dB) and structural similarity (SSIM), computed in RGB space, where higher values indicate superior reconstruction. The best models are selected based on the highest PSNR + SSIM on the DIV2K validation set and evaluated on five commonly used datasets (BSDS100 [47], Set14 [48], Set5 [49], Manga109 [50], and Urban100 [51]), plus an additional three datasets (Historical [52], PIRM [53], and General100 [26]) for a comprehensive study, under upscaling factors of 2×, 4×, and 8×, respectively. HR images are center-cropped to 256 × 256 patches and downscaled via bicubic interpolation to generate LR image pairs for training and testing, without any data augmentation. Optimization employs Adam with an initial learning rate of 0.0001, halved every 50 epochs, with β1 set to 0.9, β2 set to 0.999, and ϵ set to 10^−6. The batch size is set to 1, and training lasts 300 epochs, using PyTorch 2.0.0 on a desktop with an Intel i5-8600 CPU, 64 GB RAM, and an NVIDIA RTX 3090 GPU. The training loss function is L1 loss:

L1 = (1/|P|) Σ_{p∈P} |x(p) − y(p)|

where P represents the calculated area, and p denotes the pixel's position within area P. The pixel values at position p in both the prediction area x(p) and the ground-truth area y(p) are taken into account.
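The halving schedule described above can be sketched as a simple step decay (function name hypothetical):

```python
def learning_rate(epoch: int, base_lr: float = 1e-4, step: int = 50) -> float:
    # Initial Adam learning rate of 1e-4, halved every 50 epochs.
    return base_lr * 0.5 ** (epoch // step)

# Epochs 0-49 train at 1e-4, epochs 50-99 at 5e-5, and so on.
assert abs(learning_rate(0) - 1e-4) < 1e-12
assert abs(learning_rate(75) - 5e-5) < 1e-12
assert abs(learning_rate(100) - 2.5e-5) < 1e-12
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=50` and `gamma=0.5`.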

Evaluation Metrics
We evaluate SR images using two widely used metrics: the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). PSNR serves as an objective metric to assess image quality and measure the degree of difference between an original image and a compressed or distorted version. The PSNR calculation relies on the mean square error (MSE), which quantifies the squared differences between corresponding pixels in the original and reconstructed images. The formulas for MSE and PSNR are as follows:

MSE = (1/(W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} (X(i, j) − Y(i, j))²
PSNR = 10 · log10(255² / MSE)

where W and H are the width and height of the image, (i, j) represents a pixel position, and X and Y denote the super-resolved image and the ground-truth image, respectively.
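A minimal NumPy implementation of this definition (function name ours, assuming 8-bit images with a peak value of 255):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 255.0) -> float:
    """PSNR = 10 * log10(peak^2 / MSE) between SR image x and ground truth y."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A constant error of 10 per pixel gives MSE = 100 -> about 28.13 dB.
a = np.zeros((8, 8))
b = np.full((8, 8), 10.0)
assert abs(psnr(a, b) - 28.1308) < 1e-3
```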
The SSIM formula is as follows:

SSIM(x, y) = ((2µ_x µ_y + C_1)(2σ_xy + C_2)) / ((µ_x² + µ_y² + C_1)(σ_x² + σ_y² + C_2))

Here, µ_x, µ_y, σ_x, σ_y, and σ_xy denote the means, standard deviations, and covariance of the pixels at position p in the prediction map and the ground-truth map. Constants C_1 and C_2 are included to prevent division by zero. The SSIM value falls within the range of (0, 1), with values closer to 1 indicating a superior HR reconstruction effect.
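A single-window sketch of this formula (real SSIM implementations average the statistic over local windows; function name ours, with the common choices C1 = (0.01·peak)² and C2 = (0.03·peak)²):

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, peak: float = 255.0) -> float:
    """SSIM computed once over the whole image (no local windowing)."""
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Identical images have maximal structural similarity.
img = np.random.rand(16, 16) * 255
assert abs(ssim_global(img, img) - 1.0) < 1e-9
```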

Ablation Studies
In this section, ablation studies are conducted to verify the effectiveness of each part of REMA. The experiments span networks with various settings, scale factors, and integration positions, as well as comparisons with other attention modules in REMA-SRNet and other SISR networks. Meanwhile, the effectiveness of Rich Structure is verified by comparing it with REMA variants using Inception- and ResNeXt-like structures. Furthermore, the impact of parameter count on performance and robustness is discussed based on the experimental results.
The baseline model in our experiments is the proposed modified EDSR (REMA-SRNet with its REMA ResBlocks replaced by plain residual blocks). Specifically, the pixel shuffle layer is replaced by bilinear upsampling followed by 3 × 3 convolutions and Leaky ReLU layers, connected with the bilinear-upscaled LR input. To validate the proposed methods, we employ two sets of network configurations: the default (64-16-64) as REMA-SRNet and an alternative (40-16-40) as REMA-SRNet-M. The adjuster ratio R is set to 1 by default. For 2×, 4×, and 8× reconstruction, the number of reconstruction layers is 1, 2, and 3, respectively.
In all experiments, robustness is evaluated across all eight datasets. The Historical dataset specifically assesses a module's capability in handling out-of-distribution (OOD) samples. Additionally, the two network configurations mentioned above (#C_In of 40 or 64) are employed to test module compatibility with varying complexities of input features.

Study of REMA in the Backbone
Figure 3 illustrates the utilization of REMA within a residual block of the backbone networks. To analyze the effectiveness of REMA, five models were constructed in addition to the baseline: RSAM only, RCAM only, RSAM followed by RCAM, RCAM followed by RSAM, and REMA, which employs the two modules together in parallel. The results of all the above ablation experiments are shown in Table 2. The results indicate that employing only RSAM in the backbone enhances PSNR and SSIM across most datasets, except for the Set14 and Historical datasets when the input tensor has 40 channels. RCAM also underperforms on the Historical dataset, attributed to the significant differences between the Historical images and the distribution of the training datasets. Configuring the two modules in parallel (REMA) boosts performance across most datasets. Moreover, with a 64-channel input tensor, all models show significant performance improvements; notably, using RSAM and RCAM separately substantially mitigates the performance reduction on the Historical dataset, and the REMA-equipped backbone demonstrates performance improvements on it as well. Overall, these results affirm the effectiveness of our method.

Study of Rich Structure
This section examines Rich Structure, together with REMA, throughout the subsequent experiments. To verify the effectiveness of Rich Structure and REMA, we first compare REMA with other attention modules. This allows us to identify the key factors influencing a plug-and-play module and to demonstrate the superiority of Rich Structure and REMA. Additionally, we designed ResNeXt and Inception versions of REMA to highlight the advantages of Rich Structure in terms of compatibility and flexibility compared to other popular module structures.

Comparison with Other Attention Modules
We compared the performance of REMA with other attention modules, including CBAM, SE, CA, and BAM, each employed in the same way as REMA. Our experiment includes results for 40- and 64-channel inputs. To ensure a fair comparison, we set the dimension reduction to 1 (C/r, r = 1), meaning no channel compression is applied. The results are presented in Table 3. They indicate that for the 40-channel input, there is no significant difference between REMA and the other attention modules, except CBAM. For the 64-channel input, however, REMA outperforms the other attention modules. Furthermore, comparing the overall improvement when the number of channels grows from 40 to 64, BAM and REMA show much larger gains than the other attention modules in the experiment, as discussed in the next section. On the Historical dataset, all attention modules except REMA reduce performance after being integrated into the residual block.

Study of Plain, Multi-Branch, and Rich Structure
To elucidate the differences in performance gain with increasing input complexity, we analyze these attention modules from a global structural perspective. According to Table 3, modules with a multi-branch structure exhibit a greater performance increase with rising input complexity than plain structures, except for CA. The primary distinction among these modules lies in their cardinality: 1 for the plain modules (SE and CBAM), and 2, 2, and 5 for CA, BAM, and REMA, respectively. Based on the results, cardinality is positively correlated with the overall performance of the modules for a 64-channel input. Thus, cardinality is an influential factor for module compatibility, and higher cardinality enhances a module's performance as input complexity grows.
However, cardinality is not the sole factor influencing performance. Comparing the results of CA and BAM, both with a cardinality of two, there is a performance gap for the 64-channel input. The main difference lies in the in-branch bandwidth. In fact, CA also employs a split-transform-aggregate structure similar to Inception-like blocks. The distinction is that CA splits the features (C × H × W) along H and W rather than C, as shown in Figure 1b, while BAM and REMA directly map the complete input to their branches. This implies that the in-branch features are less informative in CA than in BAM and REMA.
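To make this bandwidth difference concrete, the sketch below (illustrative tensor sizes only, not the modules' actual implementations) contrasts what a CA-style branch sees after pooling a C × H × W map along one spatial axis with the full map that a BAM/REMA-style branch receives.

```python
import numpy as np

C, H, W = 64, 48, 48
feat = np.random.rand(C, H, W)  # a C x H x W feature map

# CA-style split: pool along one spatial axis, so each branch
# only sees a 1-D strip of statistics per channel.
branch_h = feat.mean(axis=2, keepdims=True)  # shape (C, H, 1)
branch_w = feat.mean(axis=1, keepdims=True)  # shape (C, 1, W)

# BAM/REMA-style branch: the complete map is handed to every branch.
branch_full = feat  # shape (C, H, W)

# Elements available to the CA branches vs. a full-bandwidth branch:
ca_elements = branch_h.size + branch_w.size  # C * (H + W)
full_elements = branch_full.size             # C * H * W
```

For this 64 × 48 × 48 example, the two CA branches together see 6144 values, while a single full-bandwidth branch sees 147,456, which is what "less informative in-branch features" means in practice.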
Comparing BAM and REMA, both modules generate spatial and channel attention. The difference lies in our proposed algorithm, which not only enhances SR-related feature representation but also generates richer multi-scale and multi-level features than BAM. BAM is a size-oriented module that balances performance and module size, leaving limited room and more constraints for algorithm design; our proposed Rich Structure is designed to overcome this limitation, and we delve into this topic in the following section. Therefore, in-branch feature richness and task-related algorithms are further influential factors, where richness is defined by the channel bandwidth of the in-branch features and the diversity of features (multi-scale and multi-level).

Study of the Elastic Adjuster
For further investigation, we conducted an experiment to analyze the influence of overall channel bandwidth on performance. The overall channel bandwidth of modules with plain structures, multi-branch structures, and our proposed Rich Structure differs significantly, with the plain structure being much slimmer than the others. We redesigned these modules, replacing dimension reduction with the elastic adjuster (C × R), where R is set to 3 (i.e., the channel bandwidth is widened threefold), to determine how bandwidth affects performance and to verify the effectiveness of the elastic adjuster across different attention modules. The results are presented in Table 4; a dedicated experiment on the elastic adjuster within REMA follows in the next section. For the 40-channel input, the redesigned, wider CBAM and SE show improvements on most datasets, bringing their performance close to that of the original CA and BAM, which had previously performed better. This underscores the in-branch channel bandwidth, which ultimately determines the overall module width, as a key performance-related factor. These results highlight how plain structures and dimension-reduction components realized by bottleneck structures limit performance potential, and they confirm that the proposed elastic adjuster can enhance performance when needed alongside Rich Structure. However, for the 64-channel input, the wider modules show a performance reduction, except for BAM, for which the redesign yields improvements on half of the datasets and reductions on the others, with overall performance close to the original. This indicates a limit to increasing in-branch channel bandwidth for further performance gains.
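The effect of swapping dimension reduction for the elastic adjuster is easy to see with parameter arithmetic. As a hypothetical illustration (an SE-style pair of fully connected layers with biases ignored, not the modules' exact layer counts), reduction maps C → C/r → C, whereas the elastic adjuster widens this to C → C·R → C:

```python
def bottleneck_params(c, r):
    """Weights of a C -> C/r -> C fully connected pair (dimension reduction)."""
    hidden = c // r
    return c * hidden + hidden * c

def elastic_params(c, ratio):
    """Weights of a C -> round(C*R) -> C pair widened by the elastic adjuster."""
    hidden = round(c * ratio)
    return c * hidden + hidden * c

c = 64
classic = bottleneck_params(c, 16)      # typical SE reduction, r = 16
no_reduction = bottleneck_params(c, 1)  # r = 1, as in the comparison above
widened = elastic_params(c, 3)          # R = 3, as in the Table 4 redesigns
```

With C = 64 this gives 512, 8192, and 24,576 weights respectively: widening by R = 3 triples the bandwidth of the r = 1 configuration, matching the ×3 setting used above.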

Study of the Elastic Adjuster in REMA
To analyze the effect of in-branch channel bandwidth in REMA, we varied the elastic adjuster's ratio from 0.5 to 1.5 and compared the performance for R ∈ [0.5, 1) and R ∈ [1, 1.5], representing the size-oriented and performance-oriented modes of REMA, respectively. The results are shown in Table 5.
The results indicate that, for the 40-channel input, the overall performance of size-oriented REMA is lower than that of the performance-oriented mode, showing the same trend as the widened versions of the other attention modules. However, for the 64-channel input, unlike the other widened attention modules, REMA can still benefit from the increased bandwidth on some datasets, including BSDS100, Manga109, Set14, and Urban100. Additionally, the gap between the lowest and highest values for the 64-channel input is small, demonstrating that REMA can flexibly meet different task requirements by switching the elastic adjuster.
There is still a limit to gaining more performance through parameter exchange. This limitation may stem from two aspects: input complexity and task-specific algorithms. Regarding the former, comparing the results of 40_1.5 and 64_0.6, the two have similar parameter counts, yet 64_0.6 performs significantly better than 40_1.5, with the only difference being the number of input channels. This illustrates one reason why models with more parameters do not always yield higher performance, and why a plug-and-play module works in one network but not in another.
Concerning the latter, we compare REMA with the widened version of BAM (64_1.2), both having a multi-branch structure with the elastic adjuster and similar overall channel bandwidth (BAM: 2 × 3, REMA: 5 × 1.2); REMA outperforms BAM on all datasets. Furthermore, the results for R ∈ [0.5, 1) and R ∈ [1, 1.5] demonstrate that a more effective parameter exchange provides extra robustness across datasets, although models with fewer parameters may perform better on certain datasets.
Table 5. The trend of performance changes with different ratios of the elastic adjuster under 4×. #C_in_R denotes the number of channels of the input and the elastic adjuster's ratio. The results for the different input widths are denoted in blue and green; deeper colors represent higher values.

To further investigate how lightweight structures affect performance, we compare Rich Structure (copy/rescale-transform-aggregate) with other size-oriented multi-branch designs. Specifically, we redesign REMA in Inception (split-transform-concatenate) and ResNeXt (split-transform-aggregate) styles. The split operation is achieved by setting the elastic adjuster to 1/3 in RSAM and 1/2 in RCAM, keeping the overall bandwidth the same as that of the input feature. The main difference between the Inception and ResNeXt versions lies in the topology of the transforming branches, which are heterogeneous in Inception and identical in ResNeXt. Hence, we also propose an extra ResNeXt variant that retains the multi-scale and multi-level feature fusion used in REMA, to verify its effectiveness.
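The bandwidth bookkeeping behind these split-style redesigns can be sketched as follows (assuming, for illustration, that the split ratios imply three RSAM branches and two RCAM branches, consistent with REMA's total cardinality of five; the numbers are illustrative, not the modules' measured sizes):

```python
def aggregate_bandwidth(c_in, cardinality, ratio):
    """Total channel bandwidth across branches: cardinality * round(C * ratio)."""
    return cardinality * round(c_in * ratio)

C = 60  # any channel count divisible by 2 and 3 keeps the example exact

# Split-style (Inception/ResNeXt) redesigns keep the aggregate bandwidth
# equal to the input width:
rsam_split = aggregate_bandwidth(C, 3, 1 / 3)  # elastic adjuster = 1/3
rcam_split = aggregate_bandwidth(C, 2, 1 / 2)  # elastic adjuster = 1/2

# Rich Structure instead copies/rescales the full input to each branch
# (cardinality 5, R = 1), so its aggregate bandwidth is 5x the input:
rich = aggregate_bandwidth(C, 5, 1.0)
```

Here rsam_split and rcam_split both equal the input width of 60, while the Rich Structure version aggregates 300 channels of in-branch bandwidth, which is the flexibility that the copy/rescale design buys at the cost of size.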
To comprehensively discuss the parameter efficiency of size-oriented and performance-oriented structures, we also consider the scale factor, for two reasons. First, from the SR-task perspective, a higher scale factor makes SR inference more challenging. Second, from the network perspective, a larger scale factor makes the network more prone to overfitting, since we generate training data by downsampling the ground-truth image at the target scale factor: the input patch becomes very small at 8× (32 × 32), potentially causing overfitting in a module that performs well at 2× and 4×. In other words, 2×, 4×, and 8× represent three situations ranging from low to high difficulty for every parameter that influences performance. Additionally, performance on the Historical dataset receives more attention, as it represents an out-of-distribution (OOD) scenario. We therefore use these factors to test module compatibility and robustness, with the experimental results presented in Table 6. According to the results, Rich Structure outperforms the other versions of REMA. Although the performance of the Inception and ResNeXt_MS versions may approach that of the Rich Structure version on certain datasets or upscale ratios, overall, the Rich Structure version demonstrates the best capability across different datasets and networks, with less likelihood of overfitting. Moreover, ResNeXt_MS outperforms ResNeXt under 2× and 4×, with comparable results under 8×, highlighting the effectiveness of the multi-scale and multi-level feature fusion strategy in REMA. These findings demonstrate the higher compatibility and robustness of our method compared to other popular size-oriented multi-branch structures when applied in the backbone. Again, the results show that extra effective parameters can be exchanged for additional robustness under different scale factors.

Study of REMA in the Reconstruction Layer
Additionally, given the application of REMA in the reconstruction layers at high scale factors, experiments are conducted at 4× and 8×. Figure 3 illustrates the implementation of REMA in the reconstruction layer, with the corresponding results shown in Table 7. In summary, the significance of REMA in the reconstruction blocks increases with larger scale factors. At 4×, it causes a performance decline on most datasets, leading to its exclusion from REMA-SRNet under 2× and 4×. At 8×, however, using REMA in reconstruction improves results on most datasets, although for the 64-channel input the overall enhancement is less evident. Hence, REMA in the reconstruction layers improves performance at high scale ratios under specific conditions.

Study of REMA in Other SISR Networks
For further investigation, we incorporate REMA into UNet-SR, a super-resolution network based on the image segmentation network U-Net. UNet-SR employs skip connections for encoder-reconstruction feature fusion, enhancing reconstruction quality. We use this setup to assess REMA's effectiveness in other networks and to evaluate its impact when integrated into skip connections, extending the experiments beyond the backbone and reconstruction layers, since skip connections were not used in REMA-SRNet for fusing features of varying depth. The results are summarized in Table 8. They show that REMA, when added to the skip connections, surpasses other attention modules at the same position, indicating that REMA remains effective across SR models and positions. In fact, the number of input channels gradually expands, layer by layer, from shallow to deep within the skip connections of UNet-SR. This also suggests that Rich Structure's advantage becomes more pronounced when handling inputs with more filters, where it outperforms the other attention modules.
Table 9 displays the quantitative results for various scaling factors. In summary, REMA-SRNet outperforms the other SOTA methods for 2×, 4×, and 8× upscaling on the benchmark datasets, showcasing its effectiveness. We further discuss the trends in these results from a parameter-efficiency perspective.
The results indicate that large model size does not necessarily equate to high performance, although size and performance show some positive correlation at 4×. As explained earlier, this is because complex models become prone to overfitting as the complexity of the training data decreases with increasing scale factor. For instance, RCAN and DRLN may achieve better results on certain datasets at 2× and 4× but perform worse than lightweight models such as PAN and A2N at 8× due to overfitting. Conversely, while lightweight models may excel at specific scale factors or datasets, they may be insufficient for performance-prioritized tasks or broad compatibility requirements. Parameter efficiency, then, means not only achieving intermediate results with few parameters but also attaining optimal results while keeping the overall model size moderate. Among the models tested, only REMA-SRNet and SwinIR achieve this balance, and REMA-SRNet generally outperforms SwinIR while using only 60% of its parameters (Figure 9).

Visual Comparison of Different Models
We selected reconstructed images from the Urban100, BSDS100, General100, and Set14 datasets to compare reconstruction details. Figure 10 illustrates the HR results of REMA-SRNet and other methods, highlighting smoother lines, better preservation of fine details, and improved textures in the reconstructed images. Specifically, the textures in the super-resolved images 'img_048' and 'img_092' produced by REMA-SRNet are more accurate, and the lines in 'monarch' and '62096' are sharper than those of the other methods.

Conclusions
To address the challenge of detail preservation in SISR tasks, we propose a plug-and-play attention module called REMA. Its core component, Rich Structure, is proposed based on extensive research into how different module structures impact size, compatibility, and performance, and it allows REMA to seamlessly transition between size-oriented and performance-oriented modes depending on the specific requirements of the task. We separate the SR process into two steps, upsampling and denoising, and, building on Rich Structure, design RSAM and RCAM to focus on the key factors of each step. RSAM focuses on the mutual dependency of multiple LR-HR pairs as well as multi-scale features, while RCAM uses interactive learning to emphasize key features, enhancing detail and noise differentiation and generating intermediate features for multi-level feature fusion. Thus, with RSAM and RCAM, REMA enhances the SISR process and the performance of deep learning-based networks by simultaneously improving long-range dependency learning. Together, these components alleviate the issues of algorithm flexibility and detail preservation.
Extensive experiments validate the effectiveness of REMA, showing significant improvements in performance and compatibility compared to other attention modules. Additionally, REMA-SRNet demonstrates superiority over other SISR networks. Our investigations into module compatibility reveal a correlation between cardinality, in-branch feature bandwidth, and compatibility. Further analysis indicates that networks with high effective parameter counts exhibit enhanced robustness across various datasets and scale factors.
Future work will continue to explore the factors influencing module performance and robustness, aiming to improve super-resolution accuracy. We plan to introduce more metrics and explore higher super-resolution ratios, such as 16×. Our goal is to develop a plug-and-play module that can automatically adjust its structure and complexity, ensuring cost efficiency and reducing the need for manual parameter tuning to meet diverse requirements.

Figure 1. Illustration of dimension reduction in size-oriented attention modules. (a) Dimension reduction in SE-like modules and in the channel attention modules of BAM, CBAM, and their variants. (b) Dimension reduction in CA.

Figure 3. Illustration of the proposed REMA-SRNet. The backbone of REMA-SRNet is based on residual blocks incorporating REMA (REMA ResBlock). The reconstruction layers utilize bilinear upsampling followed by a 3 × 3 convolution and Leaky ReLU layers. REMA is applied at 4× and 8× upscaling, with a long skip connection from the bilinear-upscaled LR input, followed by a 1 × 1 convolution for dimension alignment. SF denotes the scale factor. For 2×, 4×, and 8× reconstruction, the number of REMA ResBlocks is 16, and the number of reconstruction blocks is 1, 2, and 3, respectively.

Figure 4. The difference in multi-scale feature generation between RSAM and conventional methods (assuming a 2× scale factor). (a) RSAM (ours) learns multi-scale features and the LR-HR mapping together. (b) Conventional methods (such as ASPP and Inception blocks) can only obtain multi-scale features.

Figure 8. Comparison of the REMA-based backbone and other methods. The trend of scale changes in the backbone is denoted by green waves.

Figure 10. Subjective quality assessment for 4× upscaling on general images from four datasets. The best results are bold and underlined.

Table 1. Implications of nouns, abbreviations, and symbols. MAX is the maximum pixel value range, and MSE stands for mean square error. Higher PSNR values indicate lower distortion and better image quality, typically ranging from 20 to 50; PSNR values exceeding 30 dB are generally considered indicative of good image quality. Recognizing that PSNR is a limited indicator that fails to capture human subjective perception of images, we also utilize SSIM as an evaluation index. SSIM accounts for contrast, brightness, and structural similarity. The SSIM value at pixel position p is calculated as follows:

SSIM(p) = ((2 µx µy + C1)(2 σxy + C2)) / ((µx² + µy² + C1)(σx² + σy² + C2))

Table 2. The effect of each part of REMA in the backbone (4×). #C_In denotes the input tensor's channel count. Numerical comparisons maintain precision to 12 decimal places, with the top two results highlighted in red and blue.

Table 3. Performance comparison in ResBlock between REMA and other attention modules. #C_In denotes the input tensor's channel count. Numerical comparisons maintain precision to 12 decimal places, with the top two results highlighted in red and blue.

Table 4. Performance comparison between widened attention modules. #C_In denotes the input tensor's channel count, _wide denotes the modules with channel bandwidth widened (×3) by the elastic adjuster, and R denotes the ratio of the elastic adjuster. The numerical comparisons are accurate to 12 decimal places, with the best result highlighted in red.

Table 6. Performance comparison between the Rich Structure, ResNeXt, and Inception versions of REMA. ResNeXt_MS represents the ResNeXt version of REMA with multi-scale and multi-level feature fusion. SF denotes the scale factor. The numerical comparisons are accurate to 12 decimal places. The best two results are highlighted in red and blue.

Table 7. REMA in the reconstruction layer. SF denotes the scale factor, and #C_in denotes the input tensor's channel count. RB w/ REMA and RB w/o REMA denote the reconstruction block with and without REMA, respectively. The numerical comparisons are accurate to 12 decimal places. The best result is highlighted in red.

Table 8. Comparison with other attention modules in UNet-SR under 4×. The numerical comparisons are accurate to 12 decimal places. The best two results are highlighted in red and blue.

Table 9. Performance comparison between REMA-SRNet and other comparative methods. HR images are center-cropped and downscaled via bicubic interpolation to generate LR image pairs for training and testing, without any data augmentation. PSNR and SSIM are computed in RGB space. #P denotes the number of parameters (in millions), and SF denotes the scale factor. The numerical comparisons are accurate to 12 decimal places. The best two results are highlighted in red and blue.