Spatial and Channel Aggregation Network for Lightweight Image Super-Resolution

Advanced deep learning-based Single Image Super-Resolution (SISR) techniques aim to restore high-frequency image details and enhance imaging resolution through fast and lightweight network architectures. Existing SISR methodologies struggle to strike a balance between performance and computational cost, which hinders their practical application. In response to this challenge, the present study introduces a lightweight network, the Spatial and Channel Aggregation Network (SCAN), designed to excel in image super-resolution (SR) tasks. SCAN is the first SISR method to employ large-kernel convolutions combined with feature reduction operations. This design enables the network to focus more on challenging intermediate-level information extraction, improving both the performance and the efficiency of the network. Additionally, an innovative 9 × 9 large kernel convolution is introduced to further expand the receptive field. The proposed SCAN outperforms state-of-the-art lightweight SISR methods on benchmark datasets with a 0.13 dB improvement in peak signal-to-noise ratio (PSNR) and a 0.0013 increase in structural similarity (SSIM). Moreover, on remote sensing datasets, SCAN achieves a 0.4 dB improvement in PSNR and a 0.0033 increase in SSIM.


Introduction
Single Image Super-Resolution (SISR) refers to the technique of restoring Low-Resolution (LR) images to High-Resolution (HR) clear images. SISR not only improves the perceptual quality of images but also helps to enhance the performance of other computer vision tasks such as object detection and image denoising [1][2][3]. As a result, it has attracted wide attention from researchers.
Edge computing devices have made remarkable strides [18,19], enabling the deployment of super-resolution algorithms. However, the existing algorithms, due to their high computational complexity, struggle to strike a balance between performance and speed on edge devices. Therefore, there is a pressing need to develop SISR methods that are lightweight and efficient, ensuring their viability for resource-constrained devices.
Accordingly, there have been many outstanding works on lightweight SISR networks [20][21][22][23][24][25][26][27], most of which employ more compact network architectures and utilize ingenious lightweight strategies. These strategies include the use of group convolutions [28], depth-wise separable convolutions [29], dilated convolutions [30], and cross convolutions [31] in place of regular convolutions, as well as neural architecture search [32,33], structural reparameterization [34], efficient attention mechanisms [35], and so on. Despite the increasing availability of lightweight SISR networks, their performance is often severely compromised, making it difficult for them to meet the demands of complex practical applications. There remains room for improvement in the field of lightweight SISR.
Recent research suggests that the remarkable performance of ViT is primarily attributed to its macro architecture [36,37]. By utilizing advanced training configurations and adopting ViT-inspired architectural enhancements, CNNs can achieve performance on par with, or even surpassing, that of ViT, especially when employing large kernel convolutions [38,39]. Following that, Yu et al. [40] presented MetaFormer as a general architecture abstracted from transformers; as shown in Figure 1, the MetaFormer architecture consists of normalization layers, Spatial Mixture (SM) modules, and Channel Mixture (CM) modules. Further investigation by Yu et al. [41] revealed that a pure CNN network with the MetaFormer architecture is more efficient than ViT-based networks in image classification tasks. The Multi-scale Attention Network (MAN) [42] is an exceptional SISR work that employs the MetaFormer architecture and large kernel convolutions. MAN employs large convolutional kernels in its spatial and channel mixture modules to extract information, resulting in more efficient performance compared to ViT-based SR methods. Furthermore, Deng et al. [43] found that deep neural networks tend to encode interactions that are either too complex or too simple, rather than interactions of moderate complexity; MAN also suffers from this drawback. A recent work on the Multi-Order Gated Aggregation Network (MogaNet) [44] proposed feature reduction operations, which compel the network to focus more on challenging intermediate-level information, and achieved excellent results in image classification tasks.
Inspired by the Multi-Order Gated Aggregation Network (MogaNet) [44], we introduced the strategy of large-kernel convolutions coupled with feature reduction in the field of super-resolution for the first time and, based on this design, developed the Triple-Scale Spatial Aggregation attention module to aggregate multi-scale information. Building upon the MetaFormer architecture, we further proposed a Spatial and Channel Aggregation Block (SCAB) to aggregate multi-order spatial and channel information. Furthermore, a 9 × 9 large kernel convolutional layer is introduced at the end of the SCAB module to further expand the receptive field.
As shown in Figure 2, the proposed SCAN can aggregate more contextual information compared to the light versions of MAN [42] and Image Restoration Using Swin Transformer (SwinIR) [45].
The main contributions of this work can be summarized as follows: (1) The Triple-Scale Spatial Aggregation (TSSA) attention module is introduced for the first time, enabling the aggregation of triple-scale spatial information. (2) The Spatial and Channel Aggregation Block (SCAB) is introduced for the first time, capable of aggregating both multi-scale spatial and channel information. (3) The Spatial and Channel Aggregation Network (SCAN), a lightweight and efficient pure CNN-based SISR model that combines the advantages of both CNNs and Transformers, is proposed. (4) Quantitative and qualitative evaluations are conducted on benchmark datasets and remote sensing datasets to investigate the proposed SCAN. As shown in Figure 3, the proposed SCAN achieves a good trade-off between model performance and complexity.

Related Work
Classical Deep Learning-based SISR models. With the rapid development of deep learning techniques, researchers have been actively exploring and studying deep learning-based SISR methods. Compared to traditional approaches, deep learning-based methods can extract more expressive image features from the dataset and adaptively learn the mapping relationship between low-resolution and high-resolution images. Consequently, they have achieved remarkable breakthroughs in this field. Dong et al. [4] first introduced the application of deep learning in the field of SISR by proposing the pioneering CNN-based SISR network model called Super-Resolution CNN (SRCNN), which utilizes three convolutional layers to achieve an end-to-end mapping between low-resolution and high-resolution images, resulting in superior performance compared to traditional methods. Since then, numerous outstanding deep learning-based SISR methods have emerged. Kim et al. [5] introduced the Residual Network (ResNet) [12] into the field of SISR, inspired by the deep convolutional neural network VGG-net, and proposed the Very Deep SR (VDSR) network with 20 weight layers. Kim et al. [48] introduced Recurrent Neural Networks (RNN) into SISR for the first time and proposed the Deep Recursive Convolutional Network (DRCN) with up to 16 recursive layers, combining residual learning to control the parameter count and address the overfitting issues associated with increasing network depth. Tong et al. [8] pioneered the application of the Dense Convolutional Network (DenseNet) [13] in SR and proposed the SRDenseNet model, which effectively fuses low- and high-level features through dense skip connections and enhances the reconstruction of image details using deconvolutional layers. Lim et al.
[6] introduced the Enhanced Deep SR (EDSR) network, which removes Batch Normalization (BN) layers from the SRResNet [49] network and incorporates techniques such as residual scaling. Their work demonstrated a remarkable enhancement in the quality of image reconstruction. Addressing the oversight of high-frequency information recovery in existing deep learning-based SR methods, Wu et al. [14] introduced a new convolutional block, known as the Spatial-Frequency Hybrid Convolution Block (SFBlock). This block is engineered to extract features from both the spatial and frequency domains. It enhances the capture of high-frequency details while simultaneously preserving low-frequency information.
The attention mechanism [50], by assigning different weights to image features based on their importance, enables the network to prioritize crucial information with higher weights. As a result, it is widely employed in various visual tasks to enhance performance and focus on important information. Zhang et al. [11] introduced the attention mechanism into SR for the first time and proposed the Residual Channel Attention Network (RCAN), which utilizes channel attention to adaptively recalibrate the features of each channel based on their interdependencies. This approach enhances the network's representation capability by learning more informative channel features. Subsequently, Dai et al. [16] pointed out that the channel attention mechanism introduced in RCAN only utilizes the first-order statistics of features through global average pooling, which neglects higher-order statistics and hinders the network's discriminative ability. They proposed a second-order attention mechanism and developed the Second-order Attention Network (SAN), which achieved better performance than RCAN. Wu et al. [17] introduced a novel Feedback Pyramid Attention Network (FPAN) for SISR. By leveraging the unique feedback connection structure proposed in their work, the FPAN is capable of enhancing the representative capacity of the network while facilitating information flow within it. Furthermore, it exhibits a proficiency in capturing long-range spatial context information across multiple scales.
Recently, the transformer [39] has made significant breakthroughs in the field of computer vision. Inspired by this, Chen et al. [51] proposed a pre-trained network model called Image Processing Transformer (IPT) for handling various low-level computer vision tasks such as SR, denoising, and rain removal. The IPT model proves to be effective in performing the desired tasks, outperforming most existing methods across different tasks. Liang et al. [45] proposed SwinIR, which captures spatial relationships in images and learns the mapping between low-resolution and high-resolution images, resulting in superior image SR performance.
Guo et al. [39] proposed the Large Kernel Attention (LKA) module and developed the Visual Attention Network (VAN), which outperforms state-of-the-art ViT-based networks in multiple visual tasks. Inspired by LKA, Li et al. [42] proposed Multi-Scale Large Kernel Attention (MLKA) and developed the Multi-Scale Large Kernel Attention Network (MAN) for SISR. MAN outperforms SwinIR with improved performance and reduced computational cost, thanks to the integration of MLKA.
Although these classic SISR methods have achieved impressive performance, they often come with a significant computational cost, making it challenging to apply them in resource-constrained scenarios.
Lightweight SISR models. To facilitate the practical application of SISR methods, researchers have also proposed numerous lightweight SISR methods. Dong et al. [20] introduced Fast SRCNN (FSRCNN), which replaces the pre-upsampling model framework in SRCNN with a post-upsampling model framework. By utilizing a deconvolution layer at the end of the network for upsampling, FSRCNN effectively addresses the high computational complexity of SRCNN. Shi et al. [52] proposed the Efficient Sub-Pixel CNN (ESPCN), which, like FSRCNN, adopts a post-upsampling model framework. However, ESPCN employs sub-pixel convolutional layers for image upsampling, resulting in superior reconstruction performance compared to the FSRCNN network model.
Ahn et al. [21] introduced the Cascading Residual Network (CARN), which enhances network efficiency through the utilization of group convolutions and parameter-sharing blocks. Hui et al. [22] also employed the strategy of group convolutions and introduced the Information Distillation block to construct the Information Distillation Network (IDN), which further improves the quality of image reconstruction while enhancing the speed. Building upon IDN, Hui et al. [23] improved the information distillation blocks and designed the Information Multi-Distillation Block (IMDB) for constructing a lightweight Information Multi-Distillation Network (IMDN), which achieves higher efficiency compared to IDN. Li et al. [53] proposed the linearly assembled pixel-adaptive regression network (LAPAR), which transforms the problem of learning the LR-to-HR image mapping into a linear regression task based on multiple predefined filter dictionaries. This approach achieves good performance while maintaining speed. Luo et al. [54] introduced the lattice block network (LatticeNet), which utilizes multiple cascaded lattice blocks based on a lattice filter bank, as well as a backward feature fusion strategy. Inspired by edge detection methodologies, Liu et al. [31] developed a novel cross convolution approach that enables more effective exploration of the structural information of the features.
Song et al. [26] proposed an efficient residual dense block search algorithm for image SR, which employs a genetic algorithm to search for efficient SR network structures. Huang et al. [32] introduced DLSR, a lightweight image SR method that incorporates a fully differentiable neural architecture search. A key innovation of their work lies in the creation of cell-level and network-level search spaces, which enable the discovery of optimal lightweight models. Zhang et al. [25] introduced the reparameterization concept to SISR and proposed the edge-oriented convolution block for real-time SR (ECBSR). During training, they utilized a multi-branch module to enhance the model's performance. During inference, they transformed the multi-branch module into a single-branch structure to improve runtime speed while maintaining performance. Zhu et al. [35] introduced a lightweight SISR network known as Expectation-Maximization Attention SR (EMASRN). The distinctive aspect of their work is the incorporation of a novel high-resolution expectation-maximization attention mechanism, which effectively captures long-range dependencies in high-resolution feature maps. Although the aforementioned methods achieve high processing speeds, they often struggle to achieve satisfactory image reconstruction quality. In recent lightweight SISR research, there has been a growing trend to combine ViT and LKA to achieve improved reconstruction quality.
Lu et al. [55] proposed the Efficient SR Transformer (ESRT), a hybrid network composed of a CNN backbone and a transformer backbone to address the significant computational cost and high GPU memory consumption of transformers. Sun et al. [56] introduced ShuffleMixer, which employs deep convolutions with large kernel sizes to aggregate spatial information from large regions. Sun et al. [57] introduced the Spatially Adaptive Feature Modulation Network (SAFMN), a CNN SR network based on the ViT architecture, which achieved a balance between performance and model complexity. Wu et al. [58] proposed TCSR, which introduces the neighborhood attention module to achieve more efficient performance than LKA.
While these lightweight methods have achieved a good balance between model complexity and performance, there is still room for improvement.

Proposed Method

Network Architecture
As illustrated in Figure 4, the proposed SCAN consists of a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module.

Shallow feature extraction module (SF):
given an input Low-Resolution (LR) image I_LR ∈ R^(3×H×W), where H and W are the height and width of the LR image, the shallow feature extraction module, denoted by SF(·) and consisting of only a single 3 × 3 convolution, is applied to extract the shallow feature F_p ∈ R^(C×H×W). The process is expressed as follows:

F_p = SF(I_LR).

Deep feature extraction module (DF): the shallow features are then sent to the deep feature extraction module to extract deeper and more abstract high-level features:

F_r = DF(F_p),

where F_r denotes the deep feature maps and DF(·) denotes the deep feature extraction module, which consists of multiple cascaded SCAGs and a single 9 × 9 depth-wise-dilated convolutional layer with dilation ratio d = 4. More specifically, intermediate features F_1, F_2, ..., F_n are extracted step by step, as shown in the following formulas:

F_i = SCAG_i(F_(i−1)), i = 1, 2, ..., n, with F_0 = F_p,
F_r = DWDConv9×9,d=4(F_n),

where SCAG_i(·) denotes the ith SCAG, DWDConv9×9,d=4(·) denotes the 9 × 9 depth-wise-dilated convolutional layer with dilation ratio d = 4, and n denotes the number of SCAGs.
Image reconstruction module: F_r and F_p are sent to the image reconstruction module to complete the super-resolution reconstruction of the image. The process can be described as below:

I_SR = RC(F_r + F_p) + Bicubic(I_LR),

where RC(·) denotes the up-sampling module, which consists of a pixel shuffle operation and a single 3 × 3 convolution, and Bicubic(·) denotes the bicubic interpolation up-sampling operation. Incorporating interpolation at this juncture serves to enhance network performance and expedite network convergence.
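The three-stage pipeline above can be sketched with a minimal PyTorch model. The class and layer names below are illustrative (a plain convolution stands in for the cascaded SCAGs), not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCANSketch(nn.Module):
    """Minimal sketch of the SF -> DF -> reconstruction pipeline (illustrative)."""
    def __init__(self, channels=48, scale=4, n_groups=1):
        super().__init__()
        self.scale = scale
        # Shallow feature extraction SF: a single 3x3 convolution.
        self.sf = nn.Conv2d(3, channels, 3, padding=1)
        # Stand-in for the cascaded SCAGs of the deep feature extractor.
        self.body = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1)
                                    for _ in range(n_groups)])
        # 9x9 depth-wise-dilated conv, d=4: effective kernel 33, so padding 16.
        self.dwd9 = nn.Conv2d(channels, channels, 9, padding=16, dilation=4,
                              groups=channels)
        # Reconstruction RC: 3x3 conv to scale^2 * 3 channels, then pixel shuffle.
        self.rc = nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1)

    def forward(self, x):
        fp = self.sf(x)                    # F_p = SF(I_LR)
        fr = self.dwd9(self.body(fp))      # F_r = DF(F_p)
        sr = F.pixel_shuffle(self.rc(fr + fp), self.scale)
        # Bicubic skip connection, which the paper reports speeds up convergence.
        return sr + F.interpolate(x, scale_factor=self.scale, mode='bicubic',
                                  align_corners=False)
```

For a ×4 model, an 8 × 8 LR input produces a 32 × 32 SR output.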

Spatial and Channel Aggregation Groups (SCAG)
As discussed in the first section, neural networks utilizing MetaFormer [40] architectures similar to ViT have shown tremendous potential. Following MetaFormer, we propose a spatial and channel aggregation module with a MetaFormer-style design.

As shown in Figure 5, SCAG consists of multiple cascaded Spatial and Channel Aggregation Blocks (SCAB) and a single 9 × 9 depth-wise-dilated convolutional layer with dilation ratio d = 4. Given the input feature X, the whole process of SCAB is formulated as:

Y = TSSA(LN(X)) + X,
Z = CA(LN(Y)) + Y,

where LN(·) is layer normalization, employed to enhance network training stability, accelerate convergence, and improve model generalization. TSSA(·) and CA(·) denote the Triple-Scale Spatial Aggregation (TSSA) attention module and the Channel Aggregation (CA) module, which will be introduced in the next sections.
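The two pre-normalized residual sub-blocks of SCAB can be sketched as follows. This is a structural sketch: `GroupNorm` with one group is used here as a channel-wise stand-in for the layer normalization, and the spatial and channel mixers are passed in as generic modules:

```python
import torch
import torch.nn as nn

class SCABSketch(nn.Module):
    """MetaFormer-style block: Y = X + mix_s(LN(X)); Z = Y + mix_c(LN(Y))."""
    def __init__(self, channels, spatial_mixer, channel_mixer):
        super().__init__()
        # GroupNorm with a single group normalizes across all channels,
        # acting as a channel-wise LayerNorm for 2D feature maps.
        self.norm1 = nn.GroupNorm(1, channels)
        self.norm2 = nn.GroupNorm(1, channels)
        self.tssa = spatial_mixer   # spatial mixture (TSSA in the paper)
        self.ca = channel_mixer     # channel mixture (CA in the paper)

    def forward(self, x):
        x = x + self.tssa(self.norm1(x))   # spatial aggregation residual
        return x + self.ca(self.norm2(x))  # channel aggregation residual
```

Any module with matching input/output shape can serve as a mixer in this sketch.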

Triple-Scale Spatial Aggregation (TSSA) Attention Module
As mentioned in the first section, methods based on large kernel convolutions have gradually surpassed ViT-based methods in some computer vision domains. However, there still exists a representation bottleneck in current methods: deep neural networks tend to encode interactions that are either too complex or too simple, rather than interactions of moderate complexity.
In MogaNet [44], the authors introduced feature decomposition operations to encourage the network to pay more attention to intermediate-level information that is often overlooked by deep neural networks. Inspired by MogaNet [44] and CNN methods that employ large kernel convolutions [39,42], we propose the Triple-Scale Spatial Aggregation (TSSA) attention module to aggregate triple-scale context information. As shown in Figure 6, the whole process of TSSA is formulated as:

TSSA(X) = TSGA(FD(X)),

where FD(·) indicates a Feature Decomposition (FD) module used to eliminate redundant feature interactions, and TSGA(·) is a Triple-Scale Gated Aggregation (TSGA) module used to aggregate triple-scale contextual information.
The Feature Decomposition (FD) module can be formulated as:

Y = Conv1×1(X),
FD(X) = GELU(Y + γ_s ⊙ (Y − GAP(Y))),

where Conv1×1(·) and GAP(·) are a 1 × 1 convolutional layer and a global average pooling layer, which extract common local texture and complex global shape, respectively. The term Y − GAP(Y) increases the impact of mid-level information. γ_s ∈ R^(C×1) denotes a scaling factor initialized as zeros, which increases spatial feature diversity by re-weighting. GELU indicates the Gaussian Error Linear Unit [59], a high-performing neural network activation function, used for channel information gathering and redistribution.
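The decomposition step can be sketched in a few lines of PyTorch; the class name is ours, and the per-channel broadcast of the zero-initialized scaling factor is an assumption consistent with the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FDSketch(nn.Module):
    """Feature Decomposition sketch: re-weights deviation from the global average."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)      # Conv1x1
        # Per-channel scaling factor gamma_s, initialized to zeros as in the text.
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.act = nn.GELU()

    def forward(self, x):
        y = self.proj(x)
        gap = F.adaptive_avg_pool2d(y, 1)   # global average pooling, (N, C, 1, 1)
        # Y - GAP(Y) emphasizes mid-level information; gamma re-weights it.
        return self.act(y + self.gamma * (y - gap))
```

Because γ_s starts at zero, the module initially behaves like a plain 1 × 1 projection followed by GELU and learns to emphasize mid-level content during training.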
The Triple-Scale Gated Aggregation (TSGA) module can be formulated as:

Y = DWConv5×5(X),
Y_l, Y_m, Y_h = Split(Y),
TSGA(X) = Conv1×1(SiLU(Conv1×1(X)) ⊗ SiLU(Concat(DWDConv5×5,d=2(Y_l), DWDConv7×7,d=3(Y_m), DWDConv9×9,d=4(Y_h)))),

where DWConv5×5(·) is a 5 × 5 depth-wise convolutional layer and Split(·) divides the feature along the channel dimension. To extract triple-scale features, Y_l, Y_m, and Y_h are sent to 5 × 5 (dilation ratio d = 2), 7 × 7 (d = 3), and 9 × 9 (d = 4) depth-wise-dilated convolutional layers, respectively, and the outputs are concatenated by Concat(·) to form triple-scale contexts. SiLU is the Sigmoid-weighted Linear Unit [60], which possesses both the gating effect of the Sigmoid and stable training properties. Conv1×1 is a 1 × 1 convolutional layer and ⊗ is element-wise multiplication. Finally, the spatial gates are leveraged to learn more local information.
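Assuming the channels are split evenly into three groups and the gate is formed from two SiLU-activated paths (our reading of the description above, not the authors' exact wiring), TSGA can be sketched as:

```python
import torch
import torch.nn as nn

class TSGASketch(nn.Module):
    """Three dilated depth-wise branches plus a SiLU spatial gate (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 3 == 0, "sketch assumes channels divisible by three"
        c = channels // 3
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Branch kernels/dilations from the text: 5x5 d=2, 7x7 d=3, 9x9 d=4.
        # Padding keeps spatial size: (k-1)*d/2 = 4, 9, 16 respectively.
        self.low = nn.Conv2d(c, c, 5, padding=4, dilation=2, groups=c)
        self.mid = nn.Conv2d(c, c, 7, padding=9, dilation=3, groups=c)
        self.high = nn.Conv2d(c, c, 9, padding=16, dilation=4, groups=c)
        self.gate = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        yl, ym, yh = torch.chunk(self.dw5(x), 3, dim=1)   # Split along channels
        ctx = torch.cat([self.low(yl), self.mid(ym), self.high(yh)], dim=1)
        # Spatial gate: element-wise product of two SiLU-activated paths.
        return self.proj(self.act(self.gate(x)) * self.act(ctx))
```

All branches preserve spatial resolution, so the gated product is well defined at every pixel.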

Channel Aggregation (CA) Module
MetaFormer-style architectures, as illustrated in the first section, usually perform Channel Mixing (CM) with a two-layer channel MLP or an MLP equipped with a 3 × 3 depth-wise convolution [61][62][63]. The suboptimal efficiency of traditional MLPs can be attributed to redundant channels. To address this issue, a lightweight channel aggregation module inspired by MogaNet was introduced. As illustrated in Figure 7, the Channel Aggregation (CA) module is formulated as:

CA(X) = Conv1×1(MSFR(GELU(DWConv3×3(Conv1×1(X))))),

where Conv1×1(·) indicates a 1 × 1 convolutional layer, DWConv3×3(·) indicates a 3 × 3 depth-wise convolutional layer, and GELU(·) indicates the Gaussian Error Linear Unit [59], a high-performing neural network activation function. MSFR(·) indicates the Multi-Scale Feature Reallocation module, which is realized through a channel-reducing projection and GELU to aggregate and redistribute channel-specific information.
The Multi-Scale Feature Reallocation (MSFR) module is formulated as:

MSFR(X) = X + γ_c ⊙ (X − GELU(X W_r)),

where W_r indicates the channel-reducing projection. The emphasis on mid-level features is enhanced by performing the X − GELU(X W_r) operation, which subtracts the global channel information. This operation effectively reduces the influence of global channel statistics and enables a stronger focus on mid-level information. γ_c indicates the channel-wise scaling factor initialized as zeros, which increases channel feature diversity.
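Putting the channel-mixing layers and MSFR together gives the following sketch; the expansion ratio and the choice of a projection that reduces to a single channel (our assumption for W_r) are illustrative:

```python
import torch
import torch.nn as nn

class CASketch(nn.Module):
    """Channel Aggregation sketch: 1x1 -> DW 3x3 -> GELU -> MSFR -> 1x1."""
    def __init__(self, channels, expand=2):
        super().__init__()
        hidden = channels * expand
        self.fc1 = nn.Conv2d(channels, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        # MSFR: channel-reducing projection W_r (to one channel here) + re-weighting.
        self.reduce = nn.Conv2d(hidden, 1, 1)
        self.gamma = nn.Parameter(torch.zeros(1, hidden, 1, 1))
        self.fc2 = nn.Conv2d(hidden, channels, 1)

    def msfr(self, x):
        # X + gamma_c * (X - GELU(X W_r)): subtracts aggregated channel statistics.
        return x + self.gamma * (x - self.act(self.reduce(x)))

    def forward(self, x):
        return self.fc2(self.msfr(self.act(self.dw(self.fc1(x)))))
```

The reduced (N, 1, H, W) projection broadcasts over channels, so the subtraction removes a shared channel statistic at each spatial location.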

Experiments

Experimental Setup
Datasets and Evaluation Metrics. The training images consist of 2650 images from Flickr2K [6] and 800 images from DIV2K [64]. We evaluated our models on the following widely used benchmark datasets: Set5 [47], Set14 [65], BSD100 [66], Urban100 [67], and Manga109 [68]. Commonly used data augmentation methods are applied to the training dataset; specifically, we used random combinations of rotations of 0°, 90°, 180°, and 270° and horizontal flipping. The average Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) on the luminance (Y) channel are used as the evaluation metrics.
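As a concrete illustration of the evaluation protocol, the following NumPy sketch computes PSNR on the BT.601 luminance (Y) channel; the helper names are ours, and SSIM is omitted for brevity:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 video-range luminance for an RGB image in [0, 255], shape (H, W, 3)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    """PSNR on the Y channel; inputs are uint8 RGB arrays in [0, 255]."""
    diff = rgb_to_y(sr.astype(np.float64)) - rgb_to_y(hr.astype(np.float64))
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

Identical images give infinite PSNR; in practice SR benchmarks also crop a border of `scale` pixels before computing the metric, which this sketch omits.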
Implementation Details. For a more comprehensive evaluation of the proposed methods, two versions of SCAN were trained to resolve the lightweight SR tasks at different complexities. 1/6 SCAGs and 5/4 SCABs were stacked, and the channel width was set to 48/60 in the corresponding tiny/light SCAN.
Training Details. The model was trained using the Adam optimizer [69] with β1 = 0.9 and β2 = 0.99. The learning rate was initialized as 5 × 10^−4 and scheduled by cosine annealing over the whole 1 × 10^6 training iterations. For the ablation studies, we trained all models for 4 × 10^5 iterations. The decay of the exponential moving average (EMA) [70] was set to 0.999. Only the L1 loss was used to optimize the model. The patch size/batch size was set to 192 × 192/64 and 192 × 192/32 in the corresponding tiny/light SCAN.
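The training configuration above can be sketched with standard PyTorch components; the stand-in model and single training step are purely illustrative:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for SCAN
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
# Cosine annealing over the full 1e6 training iterations.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000_000)
l1_loss = torch.nn.L1Loss()

# Exponential moving average of the weights with decay 0.999.
ema = {k: v.detach().clone() for k, v in model.state_dict().items()}

# One illustrative training step on random 48x48 patches.
lr_batch, hr_batch = torch.randn(2, 3, 48, 48), torch.randn(2, 3, 48, 48)
loss = l1_loss(model(lr_batch), hr_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
for k, v in model.state_dict().items():
    ema[k].mul_(0.999).add_(v.detach(), alpha=0.001)
```

At evaluation time the EMA weights, rather than the raw weights, are typically loaded into the model.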

Comparison with Tiny SR Methods
Quantitative comparisons. To evaluate the performance of the proposed SCAN-tiny, a comparison was made with state-of-the-art tiny SR methods with parameter counts of around 200 k, including Bicubic, SRCNN [4], FSRCNN [20], ShuffleMixer-tiny [56], ECBSR [25], LAPAR-B [53], MAN-tiny [42], and SAFMN [57]. Table 1 shows the quantitative comparison on benchmark datasets for upscale factors of ×2, ×3, and ×4. The number of parameters (Params) and floating-point operations (FLOPs) are also provided, calculated on the 1280 × 720 output. Benefiting from its simple and efficient architecture, the proposed SCAN-tiny achieved comparable performance with fewer parameters and lower memory consumption.
As shown in Table 1, the proposed SCAN surpassed all methods with parameter counts of less than 200 k. Specifically, compared to MAN-tiny ×4, the proposed SCAN-tiny ×4 achieves an average 0.1 dB PSNR gain and 0.00216 SSIM gain on five benchmark datasets, with parameter and computational complexity almost equivalent to those of MAN-tiny ×4. Similarly, compared to ShuffleMixer-tiny ×4, the proposed SCAN-tiny ×4 achieved an average 0.244 dB PSNR gain and 0.00634 SSIM gain on the five benchmark datasets, with only slightly higher parameter and computational complexity than ShuffleMixer-tiny. In addition to networks with fewer than 200 k parameters, comparisons were also made with networks having more than 200 k parameters. Specifically, the proposed SCAN-tiny ×4 uses 53% of the parameters and 18% of the computational complexity of LAPAR-B ×4, achieving an average 0.168 dB PSNR gain and 0.00368 SSIM gain on five benchmark datasets. Moreover, SAFMN, which ranks in the Top 3 for model complexity in the NTIRE2023 ESR Challenge, is also listed. The proposed SCAN-tiny ×4 demonstrates performance comparable to SAFMN ×4 while utilizing only 69% of the parameters and 68% of the computational complexity.
These results validate that the proposed SCAN achieves superior performance with fewer parameters, significantly enhancing computational efficiency.
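The Params and FLOPs figures used throughout these comparisons can be estimated as follows; `count_params` and `conv_flops` are our own helper names, and the FLOPs formula counts multiply-accumulates for a single convolution on the 1280 × 720 output feature map:

```python
import torch

def count_params(model):
    """Trainable parameter count (the paper reports this in units of k)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def conv_flops(in_ch, out_ch, k, h, w, groups=1):
    """Multiply-accumulate count of one k x k conv on an h x w output map."""
    return (in_ch // groups) * out_ch * k * k * h * w

m = torch.nn.Conv2d(48, 48, 3, padding=1)
n_params = count_params(m)     # 48*48*3*3 weights + 48 biases = 20784
flops = conv_flops(48, 48, 3, 1280, 720)
```

A full network's cost is the sum of such terms over all layers; depth-wise layers (groups equal to channels) are correspondingly cheaper.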
Qualitative comparisons. In addition to the quantitative evaluations, a visual comparison is presented between SCAN-tiny and six state-of-the-art tiny SR methods, including Bicubic, SRCNN [4], FSRCNN [20], ECBSR [25], LAPAR-B [53], and MAN-tiny [42]. Figure 8 presents a visual comparison of state-of-the-art methods on the Urban100 (×4) and Set14 (×4) datasets for upscale factors of ×4. The image within the red box is cropped and zoomed in on.
For img096 from Urban100, none of the six methods compared to the proposed SCAN were able to recover realistically sharp edges, and their resulting images exhibit blurriness and artifacts, whereas the images reconstructed by the proposed SCAN closely resemble the HR images. Similarly, for barbara from Set14, only the proposed SCAN is capable of restoring an authentic and clear image of the books. These visual results demonstrate the information extraction capability of the proposed SCAN.
Thanks to its straightforward and efficient architecture, the proposed SCAN model achieves comparable performance while utilizing fewer parameters and consuming less memory. As shown in Table 2, the proposed approach surpasses all methods with a parameter count of around 1000 k.
Specifically, the proposed SCAN was compared with state-of-the-art transformer-based methods such as ESRT and SwinIR-light. The proposed SCAN ×4 uses 77% of the floating-point operations of ESRT ×4, while achieving a 0.36 dB PSNR gain and a 0.00624 SSIM gain on the five benchmark datasets. Moreover, compared to SwinIR-light ×4, the proposed SCAN ×4 achieves an average 0.204 dB PSNR gain and 0.00324 SSIM gain on the five benchmark datasets, with parameters and computational complexity almost equivalent to SwinIR-light ×4.
In addition to transformer-based methods, many CNN-based methods were also compared. Specifically, the proposed SCAN ×4 uses 88% of the parameters and 56% of the computational complexity of TCSR-L ×4, while achieving an average 0.096 dB PSNR gain and 0.00172 SSIM gain on the five benchmark datasets. Moreover, compared to MAN-light ×4, the proposed SCAN ×4 achieves an average 0.085 dB PSNR gain and 0.00016 SSIM gain on the five benchmark datasets, with only slightly higher parameters and computational complexity than MAN-light ×4.
Compared to CNN-based and transformer-based methods, the proposed SCAN achieves superior performance with lower computational complexity, demonstrating the superiority of the SCAN approach.
Qualitative comparisons. In addition to the quantitative evaluations, a visual comparison of the proposed SCAN and six state-of-the-art light SR methods is presented, including Bicubic, LAPAR-A [53], IMDN [23], ESRT [55], SwinIR-light [45] and MAN-light [42]. Figure 9 presents a visual comparison of state-of-the-art methods on the Urban100 (×4) and Set14 (×4) datasets for an upscale factor of ×4. The image within the red box is cropped and zoomed in on.
For img092 from Urban100, the proposed SCAN method can restore images that are nearly indistinguishable from the HR, while the other methods fail to achieve this level of quality, resulting in unacceptable reconstructions. Similarly, for BokuHaSitatakaKun from Manga109, compared with the other methods evaluated, none of the models were able to reliably recover a clear and accurate representation of the complex letter M.

Remote Sensing Image Super-Resolution
In comparison to conventional images, remote sensing images exhibit numerous small targets and complex backgrounds, which place higher demands on image super-resolution algorithms. To validate the effectiveness of the proposed SCAN-light, experiments were conducted on publicly available remote sensing datasets, DIOR [71] and DOTA [72].
Two hundred images were randomly selected from the DIOR dataset, and another two hundred images were chosen from the DOTA dataset, for transfer learning on the proposed model. Sixty images were extracted from the test sets of both the DIOR and DOTA datasets to evaluate the performance of the proposed SCAN-light.
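The sampling protocol above can be sketched as follows; the actual image lists, random seed, and file handling are not published here, so this is only an illustrative stand-in:

```python
import random

def sample_split(image_ids, n_finetune, n_test, seed=0):
    # draw a fine-tuning subset and a disjoint evaluation subset from one dataset,
    # mirroring the 200-image / 60-image protocol described for DIOR and DOTA
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    return ids[:n_finetune], ids[n_finetune:n_finetune + n_test]

# hypothetical ID pool; the real DIOR/DOTA file names would be used instead
finetune_ids, test_ids = sample_split(range(5000), n_finetune=200, n_test=60)
```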
Quantitative comparisons. In order to assess the performance of the proposed SCAN on remote sensing datasets, a comparison is conducted with state-of-the-art lightweight SR methods, including Bicubic, RFDN [24], LAPAR-A [53], IMDN [23], SwinIR-light [45], and MAN-light [42]. Table 3 shows the quantitative comparison on the remote sensing datasets for upscale factors of ×2, ×3 and ×4. The number of parameters (Params) and floating-point operations (FLOPs) are provided, calculated on the 1280 × 720 output. Thanks to its efficient architecture, the proposed SCAN model achieves comparable performance while consuming less memory. As shown in Table 3, the proposed approach surpasses all methods in PSNR and SSIM.
Specifically, compared with MAN-light ×4, the proposed SCAN ×4 employs parameters and floating-point operations similar to the MAN-light model, while achieving an average PSNR improvement of 0.4 dB and an SSIM improvement of 0.0033 on the remote sensing datasets.
Qualitative comparisons. In addition to the quantitative evaluations, a visual comparison of the proposed SCAN and five state-of-the-art light SR methods is presented, including RFDN [24], LAPAR-A [53], IMDN [23], SwinIR-light [45] and MAN-light [42]. Figure 10 presents a visual comparison of state-of-the-art methods on the DIOR and DOTA datasets for an upscale factor of ×4. The images within the red boxes are cropped and zoomed in on.
For img05916 from DIOR, the proposed SCAN demonstrates the ability to restore clear road surfaces and car contours, a feat that other methods struggle to achieve due to the small size of these regions. Similarly, for imgP0047 from DOTA, the proposed SCAN excels in restoring accurate car contours, while other methods often produce indistinct contours that are difficult to recognize as cars.

Ablation Studies
In this section, we conduct ablation studies on some of the designs involved in our SCAN.
Study on TSSA. As previously discussed in Section 3, the FD module and the TSGA module are utilized to aggregate triple-scale context information. To substantiate the effectiveness of the FD and TSGA modules, experiments were conducted by selectively removing either of the two from the SCAN-tiny model and observing the resultant impact on model performance. As shown in Table 4, it is evident that employing either of them leads to an improvement in performance.
This is further demonstrated in Figure 11, which visualizes the feature maps at different stages within the TSSA module. Notably, following the collaborative deployment of the TSGA and FD modules, a substantial augmentation in feature richness is observed.

Study on TSGA.
As previously discussed in Section 3, to extract multi-scale information, the input is partitioned into three equal portions, each of which is fed into one of three depth-wise-dilated convolutional layers of varying scales. To validate the efficacy of this approach, the performance of the triple-scale approach was compared with single-scale approaches of different sizes. The comparison results are presented in Table 5, revealing that the triple-scale approach facilitates a balance between performance and computational costs.

Study on Activation Functions. As previously discussed in Section 3, a sigmoid-weighted linear unit (SiLU) [60] is utilized in the gating branch. To verify the effectiveness of the SiLU, experiments were conducted in which the SiLU was replaced with a ReLU [73], PReLU [74], and GELU [59] in the proposed SCAN-tiny model, and their respective performances were compared. The comparison results are shown in Table 6, revealing that leveraging the SiLU attains superior performance with minimal computational costs.

Study on CA. As previously discussed in Section 3, the introduced CA module consists of a 2-layer channel MLP, a 3 × 3 depth-wise convolutional layer, and a multi-scale feature reallocation (MSFR) module. In Table 7, the results of deploying the 3 × 3 depth-wise convolutional layer and the MSFR module on the proposed SCAN-tiny are presented. Compared with the base model without either component, it is evident that employing the 3 × 3 depth-wise convolutional layer enhances performance. Furthermore, it can also be observed that employing MSFR without the use of large convolutional kernels negatively impacts performance. This is attributed to the pooling operation along the channel dimension, which results in the loss of essential information required for effective processing. For further clarification, this is demonstrated in Figure 12, where feature maps from various stages within the CA module were visualized, revealing a significant increase in feature richness after the collaborative utilization of the DWConv and MSFR modules.
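As a minimal sketch of the two design choices just discussed, the three-way channel partition of TSGA and the SiLU gate, the following illustrates both; the kernel sizes and dilations attached to each group are omitted, since only the tail configuration (Table 8) is spelled out in this section:

```python
import numpy as np

def silu(x):
    # sigmoid-weighted linear unit used in the gating branch: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def triple_scale_partition(x):
    # split a (C, H, W) feature map into three equal channel groups; in TSGA each
    # group feeds a depth-wise-dilated convolution of a different scale
    c = x.shape[0]
    assert c % 3 == 0, "channel count must be divisible by 3"
    return np.split(x, 3, axis=0)

feat = np.arange(48 * 4 * 4, dtype=float).reshape(48, 4, 4)
small, medium, large = triple_scale_partition(feat)  # each (16, 4, 4)
```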

Study on tail in spatial and channel aggregation groups (SCAG) and the deep feature (DF) extraction module. As an innovation, a 9 × 9 depth-wise-dilated convolutional layer with a dilation ratio of d = 4 was employed in the tail of the Deep Feature (DF) extraction module and the spatial and channel aggregation groups (SCAG) for the first time. In Table 8, a comparison was made against the base models without a tail in the spatial and channel aggregation groups (SCAG) and the Deep Feature (DF) extraction module. It was observed that the inclusion of DWDConv (9 × 9, d = 4) as the tail resulted in a notable improvement in performance.

Conclusions
In this paper, a Spatial and Channel Aggregation Network (SCAN) for lightweight SISR was proposed. SCAN incorporates the Triple-Scale Spatial Aggregation Attention (TSSA) module to aggregate spatial information at multiple scales. Additionally, the Channel Aggregation (CA) module is used to aggregate channel information. Feature reduction operations are applied in both the TSSA and CA modules to encourage the network to focus on the mid-level information which is challenging for deep neural networks to aggregate. The core concept of SCAN is the utilization of large-kernel convolutions and feature reduction strategies to extract intermediate features that are challenging to capture in both the spatial and channel dimensions. Moreover, an innovative approach is introduced by using 9 × 9 large convolutional kernels at the end of the attention modules for the first time, aiming to further enlarge the receptive field. As a result, the proposed SCAN achieves highly efficient SR performance on both public benchmark datasets and remote sensing datasets. Extensive experiments demonstrate that, compared to state-of-the-art lightweight SISR methods with similar parameters and FLOPs, the proposed SCAN provides an improvement of 0.13 dB in PSNR and 0.0013 in SSIM on benchmark datasets, and an even more impressive enhancement of 0.4 dB in PSNR and 0.0033 in SSIM on remote sensing datasets.

(3) The Spatial and Channel Aggregation Network (SCAN), a lightweight and efficient pure CNN-based SISR network model that combines the advantages of both CNN and Transformer, is proposed.
(4) Quantitative and qualitative evaluations are conducted on benchmark datasets and remote sensing datasets to investigate the proposed SCAN. As shown in Figure 3, the proposed SCAN achieves a good trade-off between model performance and complexity.

Figure 2 .
Figure 2. Comparison of local attribution maps (LAMs) [46] between the proposed SCAN and other efficient lightweight SR models. The LAMs demonstrate the significance of every pixel in the input LR image with respect to the SR of the patch marked with a red box. Additionally, the contribution area is shown in the third row. It is evident that the proposed SCAN can aggregate more information.


Figure 3 .
Figure 3. Model performance and complexity comparison between the proposed SCAN model and other lightweight SISR methods on Set5 [47] for ×4 SR. The circle sizes indicate the floating-point operations (FLOPs). The proposed SCAN achieves a good trade-off between model performance and complexity.
3.1. Network Architecture
As illustrated in Figure 4, the proposed SCAN consists of the following three components: the Shallow Feature (SF) extraction module, the Deep Feature (DF) extraction module based on cascaded Spatial and Channel Aggregation Groups (SCAG), and the high-quality image reconstruction module.

Figure 4 .
Figure 4. Overview of our spatial and channel aggregation network (SCAN).
… denotes the 9 × 9 depth-wise-dilated convolutional layer with dilation ratio d = 4; … denotes the number of SCAGs. Image reconstruction module: F_r and F_p are sent to the image reconstruction module to complete the super-resolution reconstruction of the image. The process can be described as below:
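The reconstruction formula referenced above did not survive extraction. As a generic stand-in, most lightweight SR heads end with a sub-pixel (pixel-shuffle) rearrangement after a convolution that expands the channels to C·r²; whether SCAN's head matches this exactly is an assumption of this sketch:

```python
import numpy as np

def pixel_shuffle(x, r):
    # rearrange (C*r^2, H, W) -> (C, H*r, W*r): each group of r^2 channels
    # becomes an r x r block of output pixels (standard sub-pixel upsampling)
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

lr_feat = np.arange(16).reshape(4, 2, 2)  # 4 channels = 1 x (r=2)^2
sr = pixel_shuffle(lr_feat, 2)            # shape (1, 4, 4)
```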

Figure 5 .
Figure 5. Structure of spatial and channel aggregation groups (SCAG). Spatial aggregation and channel aggregation block (SCAB): SCAB consists of the following three components: the triple-scale spatial aggregation attention (TSSA) module, the channel aggregation (CA) module, and the layer normalization layer. Given the input feature X, the whole process of SCAB is formulated as:
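The SCAB formula itself is garbled in this copy; assuming the common pre-norm residual form (an assumption, with `tssa` and `ca` standing in for the paper's modules), the data flow reads:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the channel axis of a (C, H, W) feature map
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def scab(x, tssa, ca):
    # X1 = X + TSSA(LN(X)); Y = X1 + CA(LN(X1))  -- assumed pre-norm residual form
    x = x + tssa(layer_norm(x))
    return x + ca(layer_norm(x))

# with zero-output stand-ins, the skip connections alone reproduce the input
zero = lambda t: np.zeros_like(t)
out = scab(np.ones((6, 3, 3)), zero, zero)
```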


Figure 7 .
Figure 7. Structure of the Channel Aggregation (CA) module.

Figure 8 .
Figure 8. Visual comparison of state-of-the-art methods in some challenging cases (×4 SR).


Figure 9 .
Figure 9. Visual comparison of state-of-the-art methods in some challenging cases (×4 SR).


Figure 10 .
Figure 10. Visual comparison of state-of-the-art methods in some challenging remote sensing cases (×4 SR).


Figure 11 .
Figure 11. Feature map at the stage of TSSA.


Figure 12 .
Figure 12. Feature map at the stage of CA.
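The benefit of the DWDConv (9 × 9, d = 4) tail can be quantified: a k × k convolution with dilation d covers an effective window of (k − 1)·d + 1 pixels, while a depth-wise layer pays for it with only C·k² weights. A quick check of the tail variants compared in Table 8:

```python
def effective_kernel(k: int, d: int) -> int:
    # spatial extent covered by a k x k convolution with dilation d
    return (k - 1) * d + 1

def depthwise_weights(c: int, k: int) -> int:
    # weight count of a depth-wise conv: one k x k filter per channel (bias omitted)
    return c * k * k

# Table 8 tail variants: (5,2) -> 9, (7,3) -> 19, (9,4) -> 33
spans = {(k, d): effective_kernel(k, d) for k, d in [(5, 2), (7, 3), (9, 4)]}
```

The 9 × 9, d = 4 tail thus spans a 33 × 33 window, far larger than the 3 × 3 receptive field of a standard lightweight block, at a cost that scales with C rather than C².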

Table 1 .
Quantitative comparison with state-of-the-art methods for image SR on benchmark datasets. 'Multi-Adds' is calculated with a 1280 × 720 GT image. The best and second-best performances are in red and blue, respectively.

Table 2 .
Quantitative comparison with state-of-the-art methods for image SR on benchmark datasets. 'FLOPs' is calculated with a 1280 × 720 GT image. The best and second-best performances are in red and blue, respectively.

Table 3 .
Quantitative comparison with state-of-the-art methods for light image SR on remote sensing datasets. 'FLOPs' is calculated with a 1280 × 720 GT image. The best and second-best performances are in red and blue, respectively.


Table 4 .
Ablation studies on components of TSSA. The impact of the FD and TSGA modules is shown upon SCAN-tiny on the ×4 SR task. 'FLOPs' is calculated with a 1280 × 720 GT image. The best metrics are highlighted in bold for emphasis.

Table 5 .
Ablation studies on the design of the TSGA module. The impact of the multi-scale approach and single-scale approaches is shown upon SCAN-tiny on the ×4 SR task. 'FLOPs' is calculated with a 1280 × 720 GT image. The best metrics are highlighted in bold for emphasis.

Table 6 .
Ablation studies on different types of activation functions. The impact of SiLU, ReLU, PReLU and GELU is shown upon SCAN-tiny on the ×4 SR task. 'FLOPs' is calculated with a 1280 × 720 GT image. The best metrics are highlighted in bold for emphasis.

Table 7 .
Ablation studies on components of CA. The impact of the DWConv 3 × 3 and the MSFR mechanism is shown upon SCAN-tiny on the ×4 SR task. 'FLOPs' is calculated with a 1280 × 720 GT image. The best metrics are highlighted in bold for emphasis.


Table 8 .
Ablation studies on different types of tails. The impact of no tail, DWDConv (5 × 5, d = 2), DWDConv (7 × 7, d = 3) and DWDConv (9 × 9, d = 4) is shown upon SCAN-tiny on the ×4 SR task. 'FLOPs' is calculated with a 1280 × 720 GT image. The best metrics are highlighted in bold for emphasis.