CSINet: A Cross-Scale Interaction Network for Lightweight Image Super-Resolution

In recent years, advancements in deep Convolutional Neural Networks (CNNs) have brought about a paradigm shift in the realm of image super-resolution (SR). While augmenting the depth and breadth of CNNs can indeed enhance network performance, it often comes at the expense of heightened computational demands and greater memory usage, which can restrict practical deployment. To mitigate this challenge, we have incorporated a technique called factorized convolution and introduced the efficient Cross-Scale Interaction Block (CSIB). CSIB employs a dual-branch structure, with one branch extracting local features and the other capturing global features. Interaction operations take place in the middle of this dual-branch structure, facilitating the integration of cross-scale contextual information. To further refine the aggregated contextual information, we designed an Efficient Large Kernel Attention (ELKA) using large convolutional kernels and a gating mechanism. By stacking CSIBs, we have created a lightweight cross-scale interaction network for image super-resolution named “CSINet”. This innovative approach significantly reduces computational costs while maintaining performance, providing an efficient solution for practical applications. The experimental results convincingly demonstrate that our CSINet surpasses the majority of the state-of-the-art lightweight super-resolution techniques used on widely recognized benchmark datasets. Moreover, our smaller model, CSINet-S, shows an excellent performance record on lightweight super-resolution benchmarks with extremely low parameters and Multi-Adds (e.g., 33.82 dB@Set14 × 2 with only 248 K parameters).


Introduction
Single image super-resolution (SR) is a low-level computer vision task that aims to reconstruct a high-resolution (HR) image from a corresponding low-resolution (LR) image, and it is widely used in many applications, such as mobile devices, surveillance systems, autonomous driving, and medical imaging. However, SR is an ill-posed problem, since an identical LR image may be degraded from many different HR images. Therefore, efficiently reconstructing visually faithful HR images from degraded LR inputs remains a challenging task.
To address this issue, Dong et al. [1] proposed SRCNN, marking the first application of deep learning in the field of single image super-resolution. Using only a three-layer convolutional neural network, SRCNN achieved significantly better results than traditional methods. To mitigate the computational demands of SRCNN, Kim et al. [2] introduced the VDSR model, incorporating residual learning to deepen the network to 20 layers and achieve rapid convergence. Lim et al. [3] presented the EDSR model, simplifying the network structure by removing batch normalization (BN) layers.
The main contributions of this work are summarized as follows:
1. We adopted a factorized convolution approach to design a Cross-Scale Interaction Block (CSIB). CSIBs employ a dual-branch structure to extract both local fine-grained features and global coarse-grained features. Furthermore, we utilize interaction operations at the end of the dual-branch structure, facilitating the integration of cross-scale contextual information;
2. We designed an Efficient Large Kernel Attention (ELKA) with limited additional computation for refining and extracting features. Ablation studies validated the effectiveness of this attention module;
3. Comprehensive experiments on benchmark datasets show that our CSINet outperforms most state-of-the-art lightweight SR methods.

Related Work

Lightweight Image SR
To improve network speed while maintaining superior reconstruction results, several lightweight image super-resolution networks have been introduced [1,7,11-14]. These networks can be broadly categorized into three groups: network structure design, knowledge distillation, and pruning. Among the network structure design methods, FSRCNN [1] is the first lightweight super-resolution model. It performs upsampling at the end of the network, significantly improving processing speed, but its reconstruction quality still leaves room for improvement. CARN [14] designs a cascaded residual module based on grouped convolution and adopts a mechanism of local and global cascading to fuse multi-layer features, thereby accelerating the model's running speed. PAN [8] designs self-calibrated blocks with pixel attention and upsampling blocks, achieving competitive performance with only 272 K parameters.
In knowledge distillation methods, IDN [15] uses 1 × 1 and 3 × 3 convolutions to construct an information distillation module, distilling the current feature map through channel separation and achieving real-time performance while maintaining reconstruction accuracy. Based on IDN, IMDN [11] introduces a multi-information distillation module that extracts a portion of useful features at each step and passes the remaining features to the distillation step of the next stage; after completion, the features extracted in each step are concatenated together. Subsequently, RFDN [13] combines feature distillation connections and shallow residual blocks to construct a residual feature distillation block, achieving better performance than IMDN with fewer parameters.
In pruning methods, SCCVLAB [16] uses a fine-grained channel pruning strategy to address image super-resolution, achieving satisfactory results. SMSR [7] prunes redundant computations by learning spatial and channel masks, achieving better performance with improved inference efficiency.
Although the aforementioned methods are lightweight and efficient, the quality of SR reconstruction still requires significant improvement.

Attention Mechanism of Image SR
The attention mechanism, initially developed for natural language processing tasks [17,18], has been widely adopted by researchers in image super-resolution, where it has proven highly effective.
Hu et al. [18] proposed channel attention (CA), which assigns a weight to each feature channel based on its significance and improves the feature representation by amplifying the features with high weights and suppressing those with low weights. Hui et al. [15] enhanced the channel attention mechanism with contrast-aware channel attention (CCA), which assigns channel weights according to the sum of the standard deviation and the mean. Wang et al. [19] introduced efficient channel attention (ECA), which uses 1D convolution to efficiently capture dependencies across channels, making the attention mechanism lighter. These attention mechanisms exhibit state-of-the-art performance in SR tasks [4,8,15]. Some studies have introduced spatial attention to enrich the feature map. Wang et al. [20] proposed non-local attention, which captures global context information by computing pixel-to-pixel dependencies. Nevertheless, this mechanism incurs a substantial computational overhead. To address this issue, Liu et al. [13] proposed enhanced spatial attention (ESA), which reduces the channel dimensions by employing a 1 × 1 convolutional layer followed by a strided convolution to expand the receptive field. A max pooling operation with a large window and stride then focuses on the feature's crucial spatial information. EFDN [10] and BSRN [6] also demonstrate superior performance with ESA.
Guo et al. [21] proposed a novel linear attention mechanism named Large Kernel Attention (LKA) that utilizes the large receptive field of large convolutional kernels to achieve adaptability and long-range correlations similar to self-attention. The LKA attention mechanism has demonstrated excellent performance in various computer vision tasks [22,23]. However, the use of large convolutional kernels in LKA can introduce a significant computational burden. To address this, we decompose the large convolutional kernels in LKA into smaller ones, achieving results comparable to LKA while significantly reducing the computational requirements.

Factorized Convolution
Factorized convolution has emerged as a promising technique in efficient neural network design. It involves breaking down a standard convolution operation into multiple smaller convolution operations, typically to reduce computational complexity and model parameters. This technique has found widespread application in various computer vision tasks, including image classification, object detection, and semantic segmentation.
One common form of factorized convolution is depth-wise separable convolution, where a standard convolution layer is decomposed into two independent operations: a depth-wise convolution and a point-wise convolution. The depth-wise convolution independently filters each input channel spatially, while the point-wise convolution combines the filtered outputs across channels. This factorization significantly reduces the number of parameters, resulting in more efficient models.
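The parameter savings can be verified with a few lines of arithmetic; the channel and kernel sizes below are illustrative, not taken from any particular model:

```python
def conv2d_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise k x k (one filter per input channel) + point-wise 1 x 1."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Example: mapping 64 channels to 64 channels with a 3 x 3 kernel.
standard = conv2d_params(64, 64, 3)                 # 36,864
separable = depthwise_separable_params(64, 64, 3)   # 4,672
print(standard, separable, round(standard / separable, 2))  # ~7.9x fewer
```

The roughly 8x reduction at this configuration is why depth-wise separable convolution is a staple of lightweight network design.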
Recent research has demonstrated the immense potential of factorized convolution in enhancing the efficiency of neural networks. For instance, MobileNet [24] introduced depth-wise separable convolution, creating lightweight models suitable for mobile devices. ERFNet [25] factorized 3 × 3 convolutions into 3 × 1 and 1 × 3 convolutions, achieving substantial performance improvements in semantic segmentation tasks. Subsequent studies like DABNet [26], LEDNet [27], and MSCFNet [28] have further improved upon this technique and successfully applied it to their respective tasks, underscoring the importance of factorized convolution in efficient network design.
While factorized convolution has been successful in tasks such as image classification and object detection, its potential for improving the efficiency of super-resolution neural networks remains largely unexplored.
To address this gap, we propose an innovative approach in this work, applying factorized convolution to super-resolution networks.Our method fully leverages the advantages of factorized convolution to create highly efficient and lightweight architectures capable of delivering high-quality image super-resolution results.

Network Structure
Our CSINet reconstructs HR images by building on RFDN [13] and the blueprint separable residual network (BSRN) [6]. Figure 2 depicts the architecture of CSINet, which comprises four modules: shallow feature extraction, multiple stacked feature aggregation residual groups, dense feature fusion (DFF), and image reconstruction. The shallow feature extraction module extracts low-level image features. The multiple stacked feature aggregation residual groups aggregate and refine features from multiple scales. The dense feature fusion (DFF) module combines features from multiple scales, utilizing the attention mechanism to highlight important features and suppress irrelevant ones. The image reconstruction module then reconstructs the HR image from the fused features.
Shallow Feature Extraction. Given a low-quality input image I_LR ∈ R^{H×W×C}, the shallow feature F_0 is extracted by a 3 × 3 convolutional layer. This process can be expressed as

F_0 = f_{c3×3_s1}(I_LR),

where f_{cn×m_sk} denotes an n × m convolutional layer with stride k for shallow feature extraction; this convolution layer provides a straightforward mapping from the input image space to a higher-dimensional feature space.

Multiple Stacked Feature Aggregation Residual Group. To extract deep features, we use a non-linear mapping module that consists of several stacked feature aggregation residual groups (FARGs). The output F_i of the i-th FARG can be expressed as

F_i = FARG_i(F_{i−1}), i = 1, 2, ..., n,

where FARG_i(·) is the function of the i-th FARG and F_i denotes the corresponding output. More details of the FARG unit are given in Section 3.3.

Dense Feature Fusion (DFF). To combine hierarchical features from all layers, the outputs of these FARGs are concatenated and sent to a DFF module consisting of a 1 × 1 convolution, a GELU activation, and a 3 × 3 convolution. The feature is then refined using an ESA attention module. This procedure can be described as

F_fused = f_ESA(f_{c3×3}(f_GELU(f_{c1×1}([F_1, F_2, ..., F_n])))),

and

I_SR = f_{up,ps}(F_fused + F_0),

where I_SR denotes the super-resolution result of the network and f_{up,ps} indicates the pixel shuffle operation.

Loss Function. We utilize the L1 loss to optimize the parameters of our CSINet model:

L_1 = ‖I_SR − I_HR‖_1,

where I_SR is the super-resolution result of the network, and I_HR denotes the corresponding high-resolution image.

Efficient Large Kernel Attention (ELKA)
Guo et al. [21] introduced an innovative linear attention mechanism known as Large Kernel Attention (LKA), which leverages the expansive receptive field provided by large convolution kernels to attain adaptability and long-range correlation effects akin to self-attention mechanisms. LKA has demonstrated remarkable efficacy, particularly in SR tasks [22,23]. Nonetheless, the utilization of large convolution kernels in LKA imposes a substantial computational burden.
To address this issue, we adopted two pivotal strategies.First, we decomposed the 2D convolution kernel in the deep convolution layer of LKA into a sequence of cascaded horizontal and vertical 1D convolution kernels.Specifically, a K × K spatial convolution was deconstructed into a K × 1 depth-wise convolution and a 1 × K depth-wise convolution.This decomposition effectively curtails the quadratic increase in the number of parameters in LKA as the convolution kernel size grows, all the while preserving performance quality.
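The savings from this decomposition can be checked directly: a K × K depth-wise kernel grows quadratically in K, while the cascaded K × 1 and 1 × K pair grows only linearly. The channel count below is illustrative:

```python
def dw_params_2d(c, k):
    """Depth-wise K x K convolution: one K x K filter per channel."""
    return c * k * k

def dw_params_factorized(c, k):
    """Cascaded K x 1 and 1 x K depth-wise convolutions."""
    return c * k + c * k

# Quadratic vs linear growth in the kernel size (64 channels, illustrative).
for k in (7, 9, 21):
    print(k, dw_params_2d(64, k), dw_params_factorized(64, k))
```

For a 7 × 7 kernel on 64 channels this is 3136 vs 896 weights, and the gap widens rapidly as K grows, which is exactly the quadratic increase the decomposition curtails.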
Secondly, we introduced a 1 × 1 convolution layer both preceding and following the depth-wise convolution operations, facilitating information interaction across channels. We denote this module ELKA; its overall architecture is depicted in Figure 3. The ELKA module consists of three parts: (1) spatial local convolution, comprising two cascaded depth-wise convolutions with kernel sizes of 7 × 1 (DW-Conv7 × 1) and 1 × 7 (DW-Conv1 × 7); (2) spatial global convolution, comprising two cascaded depth-wise dilated convolutions with kernel sizes of 9 × 1 (DW-D-Conv9 × 1) and 1 × 9 (DW-D-Conv1 × 9); and (3) two channel convolutions, applied at the beginning and end of the module. The ELKA operation can be expressed as

F' = f_GELU(f_{c1×1}(F_in)),
A = f_{c1×1}(f_{dw1×9_rd}(f_{dw9×1_rd}(f_{dw1×7}(f_{dw7×1}(F'))))),
F^ELKA_out = A ⊙ F',

where F^ELKA_out denotes the output of the ELKA module; f_GELU(·) is the GELU activation function; f_{dwn×m} indicates the n × m depth-wise convolution operation; f_{dwn×m_rd} is the n × m depth-wise dilated convolution operation with dilation rate d; and ⊙ indicates the Hadamard product.
Compared to the standard LKA design, ELKA achieves comparable performance while exhibiting lower computational complexity and a smaller memory footprint.
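As a rough sanity check, assume the 2-D counterparts of ELKA's kernels would be a 7 × 7 depth-wise and a 9 × 9 depth-wise dilated convolution (an assumption for illustration, not a configuration reported here). The per-channel spatial weight count then compares as follows:

```python
# Per-channel spatial weights in the attention path (1 x 1 layers excluded).
lka_2d = 7 * 7 + 9 * 9        # hypothetical 2-D kernels: 7x7 DW + 9x9 DW-dilated
elka_1d = (7 + 7) + (9 + 9)   # ELKA: 7x1 + 1x7 DW, then 9x1 + 1x9 DW-dilated
print(lka_2d, elka_1d)        # 130 vs 32: ~4x fewer spatial weights per channel
```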

Enhanced Spatial Attention
Enhanced Spatial Attention (ESA) is a lightweight and effective spatial attention mechanism [13], as shown in Figure 4. To reduce the computational cost, the ESA module first reduces the number of channels using a 1 × 1 convolution. To further enlarge the receptive field, ESA halves the size of the feature map using a 1 × 1 convolution with stride 2 and then applies a 7 × 7 max pooling layer with stride 3 to reduce the spatial dimension. After a series of 3 × 3 convolutions that model the interdependence along the feature map's spatial dimensions, bilinear interpolation restores the feature map to its original size, and the result is combined with the features from the initial channel-reduction convolution. A 1 × 1 convolutional layer then restores the number of channels to its initial value. The attention mask is generated by the sigmoid activation function and multiplied by the input features to produce an output feature map with long-distance dependence. Given the input features F^ESA_in of ESA, the preceding operations can be described as

F_1 = f_{c1×1}(F^ESA_in),
F_2 = f_{up,bi}(f_{c3×3}(f_{m7×7_s3}(f_{c1×1_s2}(F_1)))) + F_1,
F^ESA_out = F^ESA_in ⊗ f_sigmoid(f_{c1×1}(F_2)),

where F^ESA_out denotes the output of the ESA module; f_{mn×m_sk} represents the n × m max pooling layer with stride k; f_{up,bi} is the bilinear interpolation upsampling operation; f_sigmoid(·) is the sigmoid activation function; and ⊗ indicates the element-wise product.

Feature Aggregation Residual Group (FARG)
The feature distillation technique introduced in RFDN [13] has proven effective in reducing the number of parameters while improving performance. Nevertheless, recent studies [5] have indicated that eliminating the feature distillation branch can reduce the runtime and computational cost. Motivated by the findings of [29], we have developed the Feature Aggregation Residual Group (FARG) architecture, which is depicted in Figure 2a.
FARG has been designed to be an efficient network module. It comprises two Cross-Scale Interaction Blocks (CSIBs), a 3 × 3 depth-wise convolution layer, and the GELU activation function. To begin, FARG processes input features through a pair of CSIBs, a step critical for obtaining deep and robust feature representations; these blocks are instrumental in extracting and enriching information from the input features. Next, the features are passed through the 3 × 3 depth-wise convolution layer, further enhancing the feature representation by capturing spatial relationships and structural information between features. Subsequently, the GELU activation function applies a nonlinear transformation, introducing more complex nonlinear characteristics that help the model comprehend the intricacies of the data and extract abstract features. Finally, a residual operation combines the identity mapping with the output features, ensuring that the acquired features effectively integrate with the original input and preserve valuable information. This architectural design enhances the module's ability to learn and represent complex data features. The procedure of FARG can be expressed as

F^FARG_out = F^FARG_in + f_GELU(f_{dw3×3}(⃝_i CSIB_i(F^FARG_in))),

where F^FARG_in and F^FARG_out are the input and output of the FARG, respectively; CSIB stands for the Cross-Scale Interaction Block, which will be introduced later; ⃝_i CSIB_i is a CSIB group, i.e., a sequence of CSIB blocks (two CSIBs are applied in our network); and f_GELU(·) is the Gaussian error linear unit activation function.

Cross-Scale Interaction Block (CSIB)
To create an efficient architecture, we propose the efficient Cross-Scale Interaction Block (CSIB), inspired by the work of Romera [25], Wang [27], Li [26], and Gao [28]. The primary focus of the CSIB's design is cross-scale information interaction, taking into consideration the limitations of existing methods in terms of feature representation capability and efficiency. The CSIB incorporates factorized depth-wise dilated convolutions and residual connections for efficient representation learning, as shown in Figure 5c. In contrast to the single-branch structure of the Non-bottleneck-1D module proposed by Romera [25] and the dual-branch structure of the SS-nbt module proposed by Wang [27], CSIB utilizes an effective cross-scale interaction technique to integrate cross-scale contextual information. This architecture is intended to strike a balance between accuracy and parameters, allowing for improved feature representation and enhanced computational efficiency. Firstly, CSIB employs a 1 × 1 convolution layer to decrease the number of parameters and expedite the training process. Following [27,28], we employ a dual-branch structure to simultaneously extract local and multi-scale contextual information. Unlike SS-nbt [27], we replace the factorized convolution with depth-wise factorized convolution to further reduce the parameters in the first branch, which extracts local information. The second branch applies factorization to depth-wise dilated convolutions in order to enlarge the receptive field, thereby capturing global context information. According to previous studies [26,28], dilated convolution may result in gridding artifacts; therefore, we employ depth-wise dilated convolutions with varying dilation rates in different CSIBs. To integrate the cross-scale contextual information of the different branches, we perform an element-wise sum of the feature maps extracted by the 5 × 1 convolutions in the two branches and feed the result to the subsequent 1 × 5 convolution in each branch. In this manner, the feature maps extracted by the two branches can interact.
Considering that concatenation operations are more effective than addition operations, we use concatenation to merge the convolution outputs of the two branches. Because the receptive fields of the two branches have different sizes, a 1 × 1 convolution is used to promote the fusion of the contextual information extracted by the two branches, strengthen information interaction, and improve the feature representation. This is followed by an ELKA module for extracting distinguishing characteristics. Finally, a shortcut connection preserves the preceding features before the result is passed to the subsequent CSIB. The operations of CSIB can be expressed as

F_r = f_{c1×1}(F^CSIB_in),
S = f_{dw5×1}(F_r) + f_{dw-d5×1_rd}(F_r),
F^CSIB_out = F^CSIB_in + f_ELKA(f_{c1×1}([f_{dw1×5}(S), f_{dw-d1×5_rd}(S)])),

where [·, ·] denotes channel-wise concatenation and f_ELKA is the ELKA module described above.

Datasets and Metrics

Following previous research [5,7,10-13], we train our models using the recently popularized DIV2K dataset [30] with 800 high-quality images. Five standard benchmark datasets are used to evaluate our models: Set5 [31], Set14 [32], BSD100 [33], Urban100 [34], and Manga109 [35]. To objectively assess the performance of our model, we convert the images to the YCbCr color space and compute the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics on the luminance channel.
PSNR stands for Peak Signal-to-Noise Ratio and is a measure of image quality that compares the original image to the compressed or distorted image. It is defined as

PSNR = 10 · log10(MAX_I^2 / MSE),

where MAX_I is the maximum pixel value of the image, and MSE is the mean squared error between the original and compressed/distorted images. Higher PSNR values indicate better image quality. SSIM stands for Structural Similarity Index and is a metric that compares the structural similarity of two images, taking into account luminance, contrast, and structure. It is defined as

SSIM(x, y) = ((2µ_x µ_y + C_1)(2σ_xy + C_2)) / ((µ_x^2 + µ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)),

where x and y are the two images being compared, µ_x and µ_y are their respective means, σ_x^2 and σ_y^2 are their respective variances, and σ_xy is their covariance. C_1 and C_2 are constants used to avoid instability when the means are close to zero. The SSIM value ranges between −1 and 1, where a value of 1 indicates perfect similarity.
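A minimal plain-Python sketch of both metrics, operating on flat pixel lists with a single global window; actual SR evaluation uses local windowed SSIM on the luminance channel, as described above:

```python
import math

def psnr(x, y, max_i=255.0):
    """PSNR = 10 * log10(MAX_I^2 / MSE); higher is better."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_i ** 2 / mse)

def ssim_global(x, y, max_i=255.0):
    """Single-window SSIM with the conventional constants C1, C2."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_i) ** 2, (0.03 * max_i) ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

print(psnr([10.0, 20.0], [12.0, 18.0]))
print(ssim_global([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical -> 1.0
```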

Training Details
During the training phase, LR training images are generated by downsampling HR images with scaling factors (×2, ×3, and ×4) using bicubic interpolation in MATLAB R2017a. We apply random horizontal or vertical flips and 90° rotations to the training set. In each mini-batch, inputs consisting of 48 × 48 LR color patches are selected. The Adan optimizer is used to train our model with the parameters β1 = 0.98, β2 = 0.92, β3 = 0.99, and an initial learning rate of 1 × 10^−3. In the training stage, we use the L1 loss to train our network for 1 × 10^6 iterations, and reduce the learning rate by half at 6 × 10^5 and 8 × 10^5 iterations. Subsequently, in the fine-tuning stage, we switch to the L2 loss to fine-tune our network with a learning rate of 2 × 10^−5 for a total of 1 × 10^5 iterations.
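The step schedule above can be sketched as a small helper; the milestone and base-rate values follow the text, and everything else is illustrative:

```python
def learning_rate(iteration, base_lr=1e-3, milestones=(600_000, 800_000)):
    """Halve the learning rate at each milestone iteration (step decay)."""
    lr = base_lr
    for m in milestones:
        if iteration >= m:
            lr *= 0.5
    return lr

print(learning_rate(0))        # 0.001
print(learning_rate(700_000))  # 0.0005
print(learning_rate(900_000))  # 0.00025
```

A framework scheduler such as PyTorch's `MultiStepLR` with `gamma=0.5` implements the same decay.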
We replaced the 3 × 3 convolution in the FARG with a 1 × 1 convolution, creating a smaller CSINet called CSINet-S. We trained CSINet-S using the DIV2K and Flickr2K datasets. During training, the input patch size is set to 64 × 64 and the mini-batch size to 64. The Adan optimizer is used with the parameters β1 = 0.98, β2 = 0.92, β3 = 0.99, and an initial learning rate of 1 × 10^−3. We use the L1 loss to train the network for 1 × 10^6 iterations, halving the learning rate at 6 × 10^5 and 8 × 10^5 iterations. Subsequently, in the fine-tuning stage, we switch to the L2 loss with a learning rate of 2 × 10^−5 for a total of 1 × 10^5 iterations.
The proposed networks are implemented using the PyTorch framework and trained on a single NVIDIA 3090 GPU.

Ablation Study

Effectiveness of Dilation Rate
In deep learning-based image super-resolution methods, the receptive field size of the network is an important factor that affects the ability of the network to capture spatial information from the input image. The dilation rate is a common way to adjust the receptive field size of a CNN. A larger dilation rate means a larger receptive field, which can capture more global contextual information, while a smaller dilation rate means a smaller receptive field, which can capture more local details.
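For stride-1 convolutions, the growth in receptive field from dilation follows a simple formula: a k × k kernel with dilation d covers d·(k−1)+1 pixels per axis, and a cascade accumulates these spans. A quick sketch, assuming 3 × 3 kernels for illustration:

```python
def effective_kernel(k, d):
    """A k x k convolution with dilation d spans d*(k-1)+1 pixels per axis."""
    return d * (k - 1) + 1

def cascade_receptive_field(layers):
    """Receptive field of a stride-1 cascade of (kernel, dilation) convolutions."""
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

# Dilation configuration (1, 3, 3, 5) with 3 x 3 kernels (illustrative):
print([effective_kernel(3, d) for d in (1, 3, 3, 5)])          # [3, 7, 7, 11]
print(cascade_receptive_field([(3, d) for d in (1, 3, 3, 5)]))  # 25
```

This shows why mixing small and large dilation rates enlarges the overall receptive field cheaply while the early low-dilation layers keep sampling densely, helping to avoid gridding artifacts.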
As shown in Table 1, we conducted extensive experiments to investigate the effects of different dilation rates on image super-resolution performance. Specifically, we adopted the concept from [26,28] and tested seven different dilation configurations. Our experimental results demonstrate that the choice of dilation rate has a significant impact on the quality of the super-resolved images.
Among the tested dilation configurations, we found that setting the dilation rates to (1, 3, 3, 5) consistently produces superior results across multiple benchmark datasets. These results are in line with previous studies that have also shown the effectiveness of large dilation rates in image super-resolution tasks.

Effectiveness of CSIB

The CSIB is intended to enhance the model's reconstruction performance by effectively fusing multi-scale features from different branches. This is achieved through its parallel branching and cross-fusion structures. To evaluate the effectiveness of the CSIB, two similar modules were designed for comparative analysis (Figure 6).
The Multi-Branch Feature Fusion Block (MFFB) splits the input features into two branches using channel splitting and halving operations. Multi-scale contextual information is then extracted from these two branches using depth-wise factorized convolution. The Cascade Dilated Fusion Block (CDFB) employs a cascade of three 3 × 3 depth-wise dilated convolutions instead of the two-branch structure used in MFFB. Both of these modules were integrated into corresponding SR networks, named the Multi-Branch Feature Fusion Network (MFFNet) and the Cascaded Dilated Fusion Network (CDFNet), respectively. Extensive experiments were conducted to evaluate the performance of these three SR networks, with the results shown in Table 2. In terms of reconstruction accuracy, the outcomes of these experiments clearly indicate that the CSIB is superior to both MFFB and CDFB. The CSIB achieved greater PSNR and SSIM values, indicating that the reconstructed images were more accurate. In addition, it did so with fewer parameters and at a lower computational cost, proving the efficacy of the interactive fusion structure in the SR reconstruction procedure. Compared to CDFB, CSIB not only requires fewer parameters but also demonstrates a significant performance advantage in reconstruction. This demonstrates the CSIB's ability not only to have a large receptive field but also to effectively combine complementary information from multiple scales to improve the model's representational capabilities.
The visual analysis of CDFB, MFFB, and CSIB is presented in Figure 7. As depicted in Figure 7a, CDFB shows promising results in recovering a portion of the butterfly's streak profile, albeit with some blurring. In contrast, Figure 7b presents MFFB, which stands out due to its ability to extract more details of the stripes. This enhanced performance is attributed to its effective use of multi-scale feature extraction modules, which facilitates the recovery of intricate details with remarkable precision. Furthermore, the proposed CSIB, shown in Figure 7c, also utilizes multi-scale feature extraction modules, leading to a superior restoration performance when compared to the aforementioned models. CSIB excels in reconstructing high-frequency details and edge information with exceptional clarity, as evidenced in the results. The findings highlight the proficiency of CSIB in structural texture restoration and demonstrate the immense potential of deep learning models in image processing applications.

Effectiveness of Factorized Convolution
To validate the effectiveness of factorized convolution, we replaced it with regular convolution in CSIB, denoted as "w/RC".
Upon reviewing the results in Table 3, it is clear that the inclusion of factorized convolution leads to a reduction of 14 K parameters and a decrease of 0.8 G FLOPs compared to regular convolution. Simultaneously, PSNR and SSIM exhibit improvements across all benchmark datasets. Furthermore, the inference time decreased by 1.24 ms. These findings indicate that the introduction of factorized convolution not only enhances the model's lightweight characteristics but also contributes to significant performance improvements.
Table 3. The results pertaining to the inclusion of factorized convolution in CSIB. "w/RC" denotes the scenario where factorized convolution is replaced with regular convolution, while "w/FC" signifies our model using factorized convolution. The inference time is calculated on Set5 with a scaling factor of ×4. The experiments were executed using an NVIDIA 3090 GPU. The best results are colored red.

Effectiveness of Attention Modules

The ablation studies on the two attention modules, ELKA and ESA, are presented in Table 4. The results indicate that ELKA is a highly effective module. We observed a significant decrease in network performance when ELKA was removed: approximately 0.2 dB on the Set5 and Set14 datasets, and over 0.4 dB on the Urban100 and Manga109 datasets. Furthermore, ESA has a positive impact on the model's performance, as evidenced by a substantial decrease in performance when ESA is removed. These findings demonstrate that combining ELKA and ESA can effectively increase the model's capacity. It is noteworthy that ELKA provides a more computationally efficient way to incorporate global information, while ESA modules enhance the local feature representation. Thus, the combination of these attention modules offers a well-balanced and effective solution to improve the model's performance.
To further observe the benefits produced by our ELKA module, we visualize the feature maps before and after ELKA for different FARGs, as shown in Figure 8. It can be observed that the ELKA module enhances high-frequency information, making the edges and structural textures in the output features clearer.

Comparison with the SOTA SR Methods
To verify the effectiveness of the proposed model, we compare our CSINet model with 14 lightweight state-of-the-art SISR methods, including SRCNN [1], VDSR [2], CARN [14], IDN [15], MAFFSRN [36], SMMR [7], IMDN [11], PAN [8], LAPAR-A [12], RFDN [13], Cross-SRN [37], FDIWN [38], RLFN [5], and BSRN [6].The results of the comparisons are presented in Table 5.To assess the model's size, we used two metrics: the number of parameters and the number of operations (Multi-Adds), calculated on a high-resolution image of 1280 × 720.Our method achieved outstanding results on all the datasets with various scaling factors, outperforming most of the other state-of-the-art networks in both the PSNR and SSIM measurements.Despite having fewer parameters and Multi-Adds, our CSINet outperformed techniques such as LAPAR-A, RFDN, Cross-SRN, and even the RLFN, which was awarded second place in the sub-track2 (Overall Performance Track) of the NTIRE 2022 efficient super-resolution challenge.These results illustrate the effective balance between image quality and computational efficiency that our method achieves.We have incorporated the Non-Reference Image Quality Evaluator (NIQE) into our evaluation metrics to provide a more comprehensive analysis of the performance of our model compared to other lightweight models, including VDSR, CARN, IMDN, PAN, EFDN, and RLFN, as shown in Table 6.In the comparison, we computed the NIQE scores for the outputs of our model and the aforementioned lightweight models.The NIQE score measures the naturalness of an image, with lower scores indicating a better image quality.Our model achieved comparable or slightly lower NIQE scores compared to these models, indicating that our model produces images with similar or slightly better naturalness.These results suggest that our model not only performs competitively in terms of traditional evaluation metrics such as PSNR and SSIM but also maintains or enhances the perceptual quality of the super-resolved images 
according to the NIQE score. This demonstrates the effectiveness of our lightweight model in preserving image quality while reducing computational complexity. For Set14, we compared the models' ability to reconstruct the "baboon" and "monarch" images. Our findings suggest that while the SRCNN [1] and VDSR [2] models recovered most of the stripe contours, their reconstructions still exhibited blurriness. In contrast, our proposed model, CSINet, was able to reconstruct high-frequency details with greater clarity.
For the "monarch" image, CSINet was also superior in reproducing the butterfly antennae with greater clarity. On the BSD100 dataset, we evaluated the performance of the models on the "108005" and "148026" images. Our results indicate that Bicubic failed to reproduce the basic texture features when reconstructing the details of the stripes on the tiger. While other models, such as CARN [14], IMDN [11], PAN [8], and EFDN [10], could recover more stripe details, their reconstructed images still exhibited some blurriness. In contrast, CSINet was able to reconstruct high-frequency details with greater clarity, outperforming all the other models. For the "148026" image, CSINet also produced reconstructed images with clear texture and rich details, which were closer to the real images than those of the other models.
Finally, on the Urban100 dataset, we evaluated the models' ability to restore the "img_092" image. Our results suggest that most of the models, except for Bicubic, could restore the horizontal stripes of the building facade but still exhibited some blurriness. In contrast, the reconstructed images from CSINet had clear texture and rich details, coming visibly closer to the ground truth. Similarly, for the "img_062" image in the Urban100 test set, the reconstructed images using Bicubic, SRCNN [1], and VDSR [2] were severely distorted and blurry. While the reconstructed results using CARN [14], IMDN [11], PAN [8], EFDN [10], and E-RFDN [13] were slightly clearer, the glass window grids were distorted and deformed. In contrast, the reconstructed images using the CSINet proposed in this study had clear texture and rich details, which were closer to the real images.
Overall, our qualitative visual comparisons demonstrate that CSINet outperforms other state-of-the-art super-resolution models, recovering high-frequency details that are clearer and closer to the real images.
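The Multi-Adds figures reported in the comparisons above are computed on a 1280 × 720 high-resolution output, following common practice. As a rough illustration of where such counts come from, the sketch below tallies the multiply-accumulate operations of a single convolutional layer; the 64-channel 3 × 3 layer is purely illustrative and is not CSINet's actual configuration.

```python
def conv_mult_adds(c_in, c_out, k, out_h, out_w, groups=1):
    """Multiply-accumulate operations of one k x k conv layer for a given
    output spatial size (bias terms omitted)."""
    return (c_in // groups) * c_out * k * k * out_h * out_w

# Multi-Adds are conventionally reported on a 1280 x 720 HR output.
H, W = 720, 1280

# Illustrative 64-channel 3x3 layer (not CSINet's actual layout).
regular = conv_mult_adds(64, 64, 3, H, W)
depthwise = conv_mult_adds(64, 64, 3, H, W, groups=64)  # depth-wise variant
pointwise = conv_mult_adds(64, 64, 1, H, W)             # 1x1 projection

print(f"regular 3x3:           {regular / 1e9:.2f} G Multi-Adds")
print(f"depth-wise 3x3 + 1x1:  {(depthwise + pointwise) / 1e9:.2f} G Multi-Adds")
```

Summing such per-layer counts over a whole network yields the Multi-Adds totals listed in Table 5.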

Complexity Analysis
The runtime of a network is a crucial metric, even for lightweight SR algorithms. We conducted comparative experiments on the Set5 dataset (×4) to assess the reconstruction speeds of mainstream networks. The experiments were run on an NVIDIA RTX 3090 GPU with 24 GB of RAM. The test images had a spatial resolution of 64 × 64 pixels. After 10 repeated runs, the average inference times were obtained and are presented in Figure 12. It can be observed that our CSINet not only achieves the fastest reconstruction speed but also delivers the best reconstruction quality, demonstrating the significant advantages of our lightweight CSINet. To further validate the lightweight nature of CSINet, we deployed it on the NVIDIA Jetson Xavier NX Developer Kit, known as one of the world's smallest AI supercomputers for embedded MEC systems. We conducted experiments on real-world photos to evaluate the effectiveness of CSINet in the embedded MEC system. In these scenarios, ground-truth images and downsampling kernels were unavailable. As depicted in Figure 13, our method successfully reconstructs sharper and more accurate images compared to state-of-the-art approaches. This indicates that our lightweight model excels in achieving exceptional super-resolution performance, making it highly suitable for deployment in embedded MEC systems.
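The timing methodology described above (warm-up followed by averaging over 10 repeated runs) can be sketched in a framework-agnostic way as follows; the workload here is a placeholder, and for a GPU model the timed callable would additionally need to synchronize the device before returning.

```python
import time

def average_inference_time(run_once, warmup=3, repeats=10):
    """Average wall-clock time of `run_once` over `repeats` timed runs,
    after `warmup` untimed runs to avoid start-up effects.
    NOTE: with a GPU framework, `run_once` must synchronize the device
    before returning (e.g., a CUDA synchronize call), or the measured
    times will only reflect kernel-launch overhead."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) / repeats

# Stand-in CPU workload; replace with a real forward pass on a 64 x 64 input.
t = average_inference_time(lambda: sum(i * i for i in range(10000)))
print(f"average inference time: {t * 1e3:.3f} ms")
```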

Discussions
The effectiveness of the proposed Cross-Scale Interaction Block (CSIB) is a key highlight of our study. CSIB stands out as a crucial component in enhancing the overall performance of CSINet.
Firstly, CSIB is meticulously designed for single image super-resolution (SISR), integrating cross-scale contextual information using depth-wise convolution and dilated convolution. This design choice proves effective in capturing and leveraging contextual details across different scales, contributing to improved image reconstruction.
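The cross-scale pairing can be illustrated with the standard effective-receptive-field formula for dilated convolution, k_eff = d·(k − 1) + 1: a dilated depth-wise convolution covers a wider neighborhood at the same per-channel parameter cost as its undilated counterpart. The 3 × 3 kernel and dilation rate 3 below are illustrative choices, not necessarily CSINet's exact settings.

```python
def effective_kernel(k, d=1):
    """Effective receptive field (per side) of a k x k convolution
    with dilation rate d."""
    return d * (k - 1) + 1

# A plain 3x3 depth-wise conv covers a 3x3 neighborhood (local branch),
# while a 3x3 depth-wise conv with dilation 3 spans 7x7 (global branch),
# at the same number of weights per channel.
print(effective_kernel(3))       # 3
print(effective_kernel(3, d=3))  # 7
```

Concatenating the two branches therefore mixes local and enlarged-context features without the parameter cost of a genuinely large kernel.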
Secondly, the incorporation of Efficient Large Kernel Attention (ELKA) within CSIB further enhances the model's representational capacity. ELKA plays a pivotal role in aggregating relevant features efficiently, contributing to the model's ability to capture intricate details and patterns.
The experimental results underscore the effectiveness of CSIB. Compared with scenarios using regular convolution, the inclusion of factorized convolution within CSIB leads to significant reductions in parameters and FLOPs while simultaneously improving PSNR and SSIM and reducing inference time. This indicates that CSIB not only reduces model complexity but also benefits image quality and computational efficiency.
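To make the factorized-convolution saving concrete, the sketch below compares the parameter count of a regular k × k convolution with its depth-wise + point-wise factorization; the 64-channel 3 × 3 layer is an illustrative example, not CSINet's actual layer shape.

```python
def regular_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias terms omitted)."""
    return c_in * c_out * k * k

def factorized_conv_params(c_in, c_out, k):
    """Depth-wise k x k followed by point-wise 1 x 1 (bias terms omitted)."""
    return c_in * k * k + c_in * c_out

# Illustrative 64-channel 3x3 layer (not CSINet's actual configuration).
full = regular_conv_params(64, 64, 3)     # 36,864 weights
fact = factorized_conv_params(64, 64, 3)  # 4,672 weights
print(f"parameter reduction: {full / fact:.1f}x")
```

Since FLOPs of a convolution scale with its weight count times the output resolution, a similar ratio carries over to the Multi-Adds saving.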
In visual comparisons with state-of-the-art methods, CSINet equipped with CSIB excels in reconstructing high-frequency details with exceptional clarity. This suggests that the designed cross-scale interaction mechanism within CSIB plays a pivotal role in capturing and utilizing contextual information effectively, resulting in superior image reconstruction.
CSIB emerges as a crucial element contributing to the effectiveness of CSINet. Its innovative design and integration within the network significantly improve image quality, demonstrating the efficacy of the proposed cross-scale interaction strategy in the context of lightweight super-resolution.

Conclusions
In this paper, we introduce the Cross-Scale Interaction Network (CSINet), a novel architecture designed for lightweight single image super-resolution (SISR). Specifically, we present a lightweight Cross-Scale Interaction Block (CSIB) tailored for SISR. This block is carefully crafted to integrate cross-scale contextual information using depth-wise convolution and dilated convolution, leading to an effective reduction in model complexity. Additionally, the integration of Efficient Large Kernel Attention (ELKA) enhances the model's representational capacity. The proposed network is characterized by its lightweight nature, with only 366 K parameters. Extensive experiments conducted on benchmark datasets validate that CSINet outperforms the majority of state-of-the-art lightweight SR methods. Remarkably, it achieves superior results with fewer parameters and Multi-Adds, underscoring its efficiency and effectiveness.
In future work, further reducing the model's parameter count and inference time will be crucial for enhancing the applicability of CSINet in real-time scenarios. This optimization will remain a central focus of our ongoing research, with the goal of ensuring the seamless integration of CSINet into real-time application environments.

Figure 1 .
Figure 1. Trade-off between performance and model complexity of other state-of-the-art lightweight models on the BSD100 dataset for ×4 SR. CSINet achieves a higher PSNR with fewer parameters.

Figure 2 .
Figure 2. An overview of our CSINet network. (a) The architecture of the CSINet network; (b) the details of the feature aggregation residual group (FARG).

Figure 7 .
Figure 7. Visualized feature maps processed by different convolution designs. (a) Input feature. (b) Feature processed by the CDFB. (c) Feature processed by the MFFB. (d) Output feature of CSIB.

Figure 8 .
Figure 8. Visualized feature maps of the four FARGs. (a) Feature maps of the four FARGs before ELKA. (b) Feature maps of the four FARGs after ELKA. The values are obtained by averaging the feature maps and normalizing them to the range [0, 1].

Figure 9 .
Figure 9. Visual comparison on the Set14 dataset for ×4 SR.

Figure 13 .
Figure 13. Comparison of super-resolution results on real-world photos; CSINet outperforms state-of-the-art methods on an embedded MEC system.
where f_dw^{n×m} denotes the n × m depth-wise convolution operation, f_dwd^{n×m} denotes the n × m depth-wise dilated convolution operation with dilation rate d, ELKA denotes the efficient large kernel attention module, and Concat(F^CSIB_{3,L}, F^CSIB_{3,R}) is the concatenation of the features generated by the left and right branches of CSIB.

Table 1 .
Investigation of different dilation rates. 'R' denotes the dilation rates of each depth-wise convolution. These results were recorded after 1 × 10^6 iterations without pre-training or fine-tuning. The best results are marked in red.

Table 2 .
Quantitative comparison of three distinct approaches to ×4 SR: MFFNet, CDFNet, and the proposed CSINet. These results were recorded after 1 × 10^6 iterations without pre-training or fine-tuning. The best results are marked in red.

Table 4 .
Comparison of the number of parameters, Multi-Adds, and mean PSNR values obtained without ELKA, without ESA, and with our full CSINet on five datasets for ×4 SR. These results were recorded after 1 × 10^6 iterations without pre-training or fine-tuning. The best results are marked in red.

Table 5 .
Quantitative comparisons of state-of-the-art SR algorithms on five datasets. The best and the second-best results are marked in red and blue, respectively. "Multi-Adds" are computed with a 720p HR image.

Table 6 .
The average NIQE scores for ×4 SR. The best results are marked in red. The visual comparisons of our method with several state-of-the-art methods are presented in Figures 9-11. The results demonstrate the superiority of our method in terms of image quality.