Multi-Scale Cross-Attention Fusion Network Based on Image Super-Resolution



Introduction
Image super-resolution (SR) is a fundamental task in computer vision whose primary goal is to reconstruct a high-resolution (HR) image from a low-resolution (LR) image. Image super-resolution (ISR) reconstruction is an ill-posed problem because multiple HR images may degrade into the same LR image, and details may be lost in the degradation process. Image super-resolution has been widely studied and applied to medical images, remote sensing images, video surveillance, and other fields that need high-frequency information. In recent years, as deep learning has made significant progress in computer vision, it has been applied to more tasks. Compared with image super-resolution methods based on interpolation [1], reconstruction [2], and learning [3,4], deep learning methods can reconstruct high-frequency information more effectively.
SRCNN [5] first applied a convolutional neural network (CNN) to image super-resolution, solving the problem by learning a mapping function from LR input to HR output. Since then, deep CNN-based methods have been widely used in ISR. Following SRCNN, methods such as FSRCNN [6], ESPCN [7], VDSR [8], EDSR [9], LapSRN [10], and DRRN [11] obtain a wider receptive field by deepening the network structure and introduce a residual learning mechanism to alleviate the vanishing-gradient problem that worsens as the network deepens.
Recently, CNN-based methods such as MSRN [12], MSFRN [13], and MSAR [14] have shown that network performance can be further enhanced by making full use of multi-scale feature information to enrich image texture details. However, despite these advances, more work is still needed on the effective fusion of multi-scale features and the deep utilization of attention mechanisms, in particular on how to fully use information at different scales and enhance the expressive ability of features while maintaining network efficiency. To address these problems, this paper proposes a multi-scale cross-attention fusion network (MCFN) for the image super-resolution task. The main contributions of this paper are as follows:
(1) A multi-scale cross-attention fusion network (MCFN) is proposed to achieve full extraction and effective fusion of feature information at different scales and to promote high-quality image reconstruction.
(2) A multi-scale trans-attention module (MTM) is proposed to efficiently extract and fuse multi-scale feature information. The MTM uses a pyramid multi-scale module (PMM) to extract feature information at various scales, which is then input into a cross-attention fusion module (CFM) in a cross-module manner. This approach incorporates a cross-connect strategy that combines channel and spatial attention mechanisms to fuse the multi-scale feature information effectively and capture the correlation between them.
(3) An improved integrated attention enhancement module (IAEM) is proposed to extract more feature information from the middle layers through a dense connection strategy. The module learns the correlation between the middle layers and integrates the feature information of each module effectively.
(4) Objective metrics and visual comparisons on public datasets show that our method is competitive with existing methods. We also verify the effectiveness of the proposed method through extensive ablation studies and experiments.
This paper is organized as follows: Section 2 introduces related work. Section 3 elaborates on the proposed method and its structure. Section 4 presents the experimental results of the method on public benchmark datasets. The last section summarizes the main conclusions of the paper.

Related Works

Deep CNN-Based Image Super-Resolution
Methods based on deep learning have recently been widely used in image super-resolution [15] and have achieved significant advantages over traditional methods. Dong et al. proposed SRCNN [5], the first work to apply a convolutional neural network to image super-resolution. They used a three-layer convolutional neural network to establish an end-to-end mapping between LR images and their corresponding HR images. Kim et al. proposed the VDSR [8] algorithm, which used a deep convolutional neural network and added residual learning to improve on SRCNN. At the same time, the DRCN [16] algorithm was proposed, the first method to introduce recursive learning to realize parameter sharing in SR. Although these early CNN methods improve on traditional methods, they increase the computational cost and produce artifacts. Therefore, Dong et al. proposed the FSRCNN [6] approach, which improves computational efficiency by introducing deconvolution for up-sampling. Shi et al. proposed the ESPCN [7] algorithm, which uses a sub-pixel convolutional layer to upsample the final LR features into the HR output, improving computational performance and achieving a complete end-to-end mapping. Owing to the effectiveness of the sub-pixel convolutional layer, the EDSR [9] algorithm also uses it directly for upsampling and removes the BN layer to reduce the amount of network calculation, reduce the model parameters, and improve image performance. Lai et al. proposed the LapSRN [10] algorithm, which reduces the amount of network calculation by using a cascade structure to enlarge the image progressively during reconstruction. The algorithm of [18] uses hierarchical dense blocks to reconstruct the image, reducing the amount of calculation brought by the dense residual method. These methods show that deep, residual, and dense connections can improve the network's performance. There are other ways to improve network performance.

Multi-Scale Feature Extraction Based on Image Super-Resolution
Multi-scale feature extraction is widely used in object detection [19] and semantic segmentation [20], where it makes full use of information features at different depths to improve accuracy. The classical scheme for multi-scale feature extraction is Inception [21]. The MSFRN [13] algorithm uses a multi-scale extraction module and adds multiple paths for fusion to improve image reconstruction quality. Although these methods are optimized at the network and training levels to enhance the performance of image reconstruction, there is still room for improvement in the extraction and fusion of feature information at different scales.

Attention Mechanism Based on Image Super-Resolution
Attention usually refers to the way the human visual system adaptively focuses on salient areas of visual information. The attention mechanism can therefore help the network focus on essential details. Wang et al. first proposed a non-local neural network for image classification tasks [23]. After that, Hu et al. designed the Squeeze-and-Excitation Network (SENet) [24], which improves image classification performance by introducing a channel attention mechanism. Attention-based networks have also been increasingly applied to image super-resolution (ISR) tasks. Inspired by SENet [25], Zhang et al. introduced the channel attention mechanism into SR [26] to improve image quality. The SAN [27] algorithm recently used a second-order channel attention mechanism to refine features adaptively. In the AIDN [28] algorithm, information recognition ability is enhanced using a refined attention mechanism to improve network performance. In the MSAR [14] algorithm, a multi-scale attention residual module for feature refinement is used to refine the edges at each scale to improve performance. These results suggest that using a multi-scale attention mechanism for feature correlation learning can yield a more comprehensive improvement in performance. We therefore propose a multi-scale cross-attention fusion network (MCFN) to fully extract and effectively fuse image feature information.

Methods
ISR aims to reconstruct a high-resolution image $I_{HR} \in \mathbb{R}^{C \times rH \times rW}$ from a low-resolution image $I_{LR} \in \mathbb{R}^{C \times H \times W}$. Here $H$ and $W$ denote the height and width of the image, $C$ is the number of channels in the color space, and $r$ is the scale factor. LR images are usually obtained by down-sampling the HR image.
This section first presents the overall framework of the multi-scale cross-attention fusion network (MCFN). We then detail each core component, including the pyramid multi-scale module (PMM) in the multi-scale trans-attention module (MTM), the cross-attention fusion module (CFM), and the optimized integrated attention enhancement module (IAEM). In addition, we provide an in-depth analysis and justification of the overall architecture strategy of the network.

Network Framework
We propose a multi-scale cross-attention fusion network architecture, shown in Figure 1, which consists of a shallow feature extraction module (SFM), a deep feature extraction module (DFM), and a feature reconstruction module (FRM).
First, given an input LR image $I_{LR}$, the SFM extracts shallow feature information $F_0 \in \mathbb{R}^{C \times H \times W}$, including edges and corners, through a single 3 × 3 convolution function $f_{sf}(\cdot)$:

$$F_0 = f_{sf}(I_{LR}).$$

$F_0$ is also the input to the deep feature extraction module (DFM). Inside the DFM, $F_0$ is used as the input of the $M$ multi-scale trans-attention modules and an optimized integrated attention enhancement module (IAEM) in order to extract and fuse image feature information. The function of this process is called $f_{DFM}$. In addition, global skip and dense connections are introduced to make the central part of the network focus on high-frequency information, which can be formally expressed as follows:

$$F_1 = f^{1}_{MTM}(F_0), \qquad F_i = f^{i}_{MTM}([F_1, \ldots, F_{i-1}]), \quad i = 2, \ldots, M,$$

$$F_R = f_{DFM}(F_0) = f_{IAEM}(\mathrm{Concat}(F_1, \ldots, F_M)) + F_0,$$

where $f^{i}_{MTM}$ denotes the mapping of the i-th multi-scale trans-attention module and $[\cdot]$ denotes concatenation. $F_i$ denotes the output of the i-th MTM, and its input is a concatenation of the outputs of the previous $i-1$ MTM modules. Concat denotes the connectivity operator, and $f_{IAEM}$ denotes the mapping in which the module learns feature information from the outputs of the $M$ MTMs, enhancing the feature information that carries high-frequency content. The IAEM is designed to enhance the feature layers that contribute the most information and to suppress the feature layers that contain redundant information. Finally, the feature reconstruction module generates a high-resolution image $I_{SR} \in \mathbb{R}^{C \times rH \times rW}$ from the feature information $F_R$, which is upsampled to the required size by sub-pixel convolution:

$$I_{SR} = f_{PixelShuffle}(F_R),$$

where $f_{PixelShuffle}$ denotes sub-pixel convolution, which aggregates low-resolution feature information to reconstruct the image. Currently, loss functions such as $L_1$, $L_2$, perceptual loss, and adversarial loss are commonly used to train SR models. In this paper, we choose the $L_1$ loss to reduce computational complexity. Given a training set $\{I^{i}_{LR}, I^{i}_{HR}\}_{i=1}^{N}$ of $N$ LR images and their corresponding HR images, the $L_1$ loss is defined as:

$$L_1(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_{MCFN}(I^{i}_{LR}) - I^{i}_{HR} \right\|_1,$$

where $f_{MCFN}$ and $\Theta$ denote the proposed functional mapping and its learnable parameters, respectively. The configuration of each module is described in detail next.
Appl. Sci. 2024, 14, 2634
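To make the reconstruction step and the training objective concrete, the following is a minimal pure-Python sketch of the sub-pixel (pixel shuffle) rearrangement and of a mean absolute error loss. It is an illustration only: the function names, the nested-list tensor layout, and the channel ordering (the convention commonly used by deep learning frameworks) are assumptions, not details taken from the paper.

```python
def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for y in range(height * r):
            for x in range(width * r):
                # The position inside each r x r output cell selects the input channel.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = features[src][y // r][x // r]
    return out


def l1_loss(pred, target):
    """Mean absolute error over all elements of two equally shaped (C, H, W) lists."""
    total, count = 0.0, 0
    for p_ch, t_ch in zip(pred, target):
        for p_row, t_row in zip(p_ch, t_ch):
            for p, t in zip(p_row, t_row):
                total += abs(p - t)
                count += 1
    return total / count


# Four 1 x 1 feature maps become one 2 x 2 image when r = 2.
upsampled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2)
```

In this toy call, each group of r² low-resolution channels supplies the r × r pixels of one high-resolution cell, which is why the feature tensor fed to the reconstruction module needs r² times as many channels as the output image.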

Multi-Scale Trans-Attention Module
The multi-scale trans-attention module (MTM) is the core of this method, where the extraction and fusion of multi-scale deep feature information are mainly carried out. Figure 2 shows the pyramid multi-scale module (PMM) and the cross-attention fusion module (CFM).

We constructed a pyramid multi-scale module (PMM) to fully extract feature information and a cross-attention fusion module (CFM) for feature information fusion. We adopt a global residual connection to minimize the loss of information during feature extraction. The PMM we designed extracts features such as detail texture and contour area so that feature information is captured comprehensively. Then, the heads and tails of multiple modules are fed into the cross-attention fusion module in a cross-module manner for correlation learning. The specific process is as follows:

$$F^{i}_{MTM} = f_{IFM}\left(F^{1}_{PMM}, F^{N}_{PMM}\right) + F^{i-1}_{MTM},$$

where $F^{1}_{PMM}$ and $F^{N}_{PMM}$ denote the outputs of the 1st and N-th pyramid multi-scale modules, $f_{IFM}$ denotes the mapping of the cross-attention fusion module, and $F^{i-1}_{MTM}$ is the input of the module, added back through the global residual connection.
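The dataflow described above (a stack of PMMs, the first and last PMM outputs passed to the fusion module, and a global residual) can be sketched as follows. This is a structural illustration under stated assumptions: the PMMs are assumed to be applied one after another, features are flat lists of floats, and `toy_pmm` and `toy_fuse` are hypothetical stand-ins rather than the actual modules.

```python
def mtm_forward(x, pmm_blocks, fuse):
    """Dataflow of one MTM on a flat feature vector (list of floats)."""
    feat = x
    outputs = []
    for block in pmm_blocks:  # assumption: PMMs are applied sequentially
        feat = block(feat)
        outputs.append(feat)
    # The head (first) and tail (last) PMM outputs go to the fusion module.
    fused = fuse(outputs[0], outputs[-1])
    # Global residual: add the module input back to the fused features.
    return [f + xi for f, xi in zip(fused, x)]


# Hypothetical stand-ins so the sketch runs end to end.
def toy_pmm(v):
    return [a + 1.0 for a in v]


def toy_fuse(head, tail):
    return [h + t for h, t in zip(head, tail)]


result = mtm_forward([0.0, 1.0], [toy_pmm] * 3, toy_fuse)
print(result)  # [4.0, 7.0]
```

Passing only the head and tail outputs keeps the fusion module's input size independent of the number of PMMs, which is one plausible reading of the cross-module design.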

Pyramid Multi-Scale Module
Multi-scale CNNs can provide more informative features and help generate high-quality super-resolution images. In order to extract the informative part of all scales more comprehensively, we designed a pyramid multi-scale module for feature extraction, as shown in Figure 2.
In feature extraction, the shallower convolutional layers contain more global information, so extracting more than one kind of detailed texture feature is crucial. Inspired by DeepLab V3 [29] and MobileNetV2 [30], we improve the ASPP module to extract detailed texture feature information. ASPP uses multiple dilated (atrous) convolutions with different dilation rates to obtain receptive fields of different sizes and then uses standard convolutions to fuse the multi-scale feature information. To improve the efficiency and performance of ASPP and reduce its computational overhead, this paper modifies ASPP and proposes the pyramid multi-scale module, which extracts feature information at different scales more effectively. A comparison is shown in Figure 3.
We replaced the dilated convolution module with depthwise separable and point convolutions to improve computational efficiency. Experiments in [30] show that the number of channels has an important impact on overall performance, and that expanding the dimension performs better than compressing it. This paper therefore uses point convolution to expand and restore the dimension and to control the number of channels. Point convolution effectively promotes information exchange between different channels, while depthwise convolution extracts multi-scale feature information on each channel independently. The Leaky ReLU function, which has smaller parameters and better feature extraction ability than ReLU6, is selected in this paper. The representation process is as follows:

$$F^{PMM}_{pwconv} = f^{exp}_{1\times1conv}\left(F^{i-1}_{MTM}\right),$$

$$F_{dwconv} = f_{dwconv,rate=n}\left(F^{PMM}_{pwconv}\right),$$

$$F_{pwconv,rate=n} = f_{lrelu}\left(f^{regain}_{1\times1conv}(F_{dwconv})\right),$$

where $F^{i-1}_{MTM}$ denotes the output of the (i−1)-th MTM, $f^{exp}_{1\times1conv}$ denotes the convolution function that expands the dimension, $F^{PMM}_{pwconv}$ denotes the output after the dimension is expanded, $f_{dwconv,rate=n}$ denotes the depthwise convolution function with dilation rate $n$, $F_{dwconv}$ denotes the output of the depthwise convolution, $f^{regain}_{1\times1conv}$ denotes the convolution function that restores the dimension, $f_{lrelu}$ denotes the Leaky ReLU function, and $F_{pwconv,rate=n}$ denotes the output after the dimension is restored. The process concludes with a global residual connection, which increases the stability of the module. In the formal description of this step, $f^{j}_{PMM}$ denotes the mapping of the PMM, $F_{conv}$ denotes the feature map obtained after convolutional layer processing, and $F_{global}$ denotes the feature information after pooling. Compared with the ASPP module before the improvement, the module's parameters and computational overhead are reduced, and more detailed texture features can be extracted.
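The efficiency argument for replacing standard dilated convolutions with point and depthwise convolutions can be checked with simple weight counts. The sketch below is a back-of-the-envelope illustration: biases are ignored, the helper names are hypothetical, the channel count C = 64 is the value reported in the implementation details, and the expansion factor t = 2 is purely an assumption because the paper text here does not state it.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 point convolution."""
    return c_in * k * k + c_in * c_out


def expand_depthwise_restore_params(c, t, k):
    """1 x 1 expansion to t*c channels, depthwise k x k, 1 x 1 restoration to c."""
    hidden = t * c
    return c * hidden + hidden * k * k + hidden * c


def effective_kernel(k, rate):
    """Side length covered by a k x k kernel with dilation rate `rate`."""
    return k + (k - 1) * (rate - 1)


C = 64  # channel count reported in the implementation details
print(standard_conv_params(C, C, 3))                # 36864
print(depthwise_separable_params(C, C, 3))          # 4672
print(expand_depthwise_restore_params(C, 2, 3))     # 17536 (t = 2 is an assumption)
print([effective_kernel(3, r) for r in (1, 2, 4)])  # [3, 5, 9]
```

Even with the extra expansion and restoration convolutions, the branch uses fewer weights than one standard 3 × 3 convolution at the same width, while the dilation rate enlarges the receptive field at no additional parameter cost.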

Cross-Attention Fusion Module
CNN convolution modules are usually used to extract features and perform simple feature fusion. In order to fuse feature information comprehensively, this paper proposes a cross-attention fusion module (CFM) that learns the correlation of feature information and fuses it. As shown in Figure 2, the PAM and CAM [31] modules are adopted.
In the feature extraction process, the deeper layers can extract more advanced feature information, such as shape information, and reduce deformation during image reconstruction. However, some feature information is lost. To address this problem, we designed a cross-module input method that attends to both shallow and deep feature information so that they complement each other. The CFM contains a channel attention module (CAM) [31] and a position attention module (PAM) [31]. Shallow feature information contains more comprehensive and richer spatial location information. After the position attention module processes it, spatial location features are weighted and selectively aggregated for each location. Deeply extracted feature information often contains rich semantic context, so the output of the position attention module is combined with the deeply extracted feature information and the two are cross-processed. Then, the channel attention module learns the correlation between all channel mappings in order to selectively emphasize interdependent channels. This information is then multiplied with the input features to refine the feature boundaries, and finally the spatial location feature information and the semantic feature information are cross-fused. In the formal description of this process, $f_{CAM}$ and $f_{PAM}$ denote the channel attention and position attention function mappings, $F_{CFM}$ denotes the cross-attention fusion function mapping, and ⊗ denotes element-wise multiplication. The module we designed adopts a strategy of cross-module learning and cross-learning to fuse the correlation between the spatial location and the semantic context of feature information, making the learning process more comprehensive and detailed.

Integrated Attention Enhancement Module
Currently, most SR networks use standard convolutional connections to perform the final deep feature extraction. Adding an extra module enhances the feature learning capability and thus improves the network's overall performance. Therefore, we designed the integrated attention enhancement module (IAEM) according to this assumption. We continue to use the attention mechanism, inspired by DANet [31], and optimize it. We treat the mapping of each deep feature extraction module as a specific response, and the responses of different modules are correlated. The interdependence between module mappings is used to enhance interdependent feature mappings and the feature representation ability of the modules, as shown in Figure 4.
Different from the above CAM module, the input is the deep feature group $F_{IFG} \in \mathbb{R}^{N \times H \times W \times C}$ output from the N multi-scale cross-attention modules, with dimension N × H × W × C. Through the change of dimension, the weight of feature information is re-learned to strengthen the attention paid to high-frequency information. Firstly, the feature group is convolved by a 3D convolution to strengthen the representation of local context features. Then, the sigmoid function is applied to the result to generate the corresponding attention map.
According to the dimensions of the feature groups, we chose a 3D convolution with kernel size 3 and stride 1 to generate the attention maps of the three feature groups. Then, we multiply the attention map element-wise with the original input deep-extracted feature layer and multiply the result by the scale parameter $\mu$ to generate the attention map $B$. Formally, the process is described as:

$$B = \mu \cdot \sigma\left(f_{3dconv}(F_{IFG})\right) \cdot F_{IFG},$$

where $f_{3dconv}$ represents the 3D convolution function, $\sigma$ represents the sigmoid function, $\cdot$ represents element-wise multiplication, and $\mu$ is a learnable weight initialized to 0.
Secondly, this paper reshapes the deep extracted feature group $F_{IFG}$ into a two-dimensional matrix of size N × HWC. The reshaped feature group is then matrix-multiplied with its transpose, and softmax is applied to obtain the attention map $S \in \mathbb{R}^{N \times N}$ that strengthens the correlation between modules. Formally, the process is described as follows:

$$s_{ji} = \frac{\exp\left(F^{i}_{IFG} \cdot F^{j}_{IFG}\right)}{\sum_{i=1}^{N} \exp\left(F^{i}_{IFG} \cdot F^{j}_{IFG}\right)},$$

where $F^{i}_{IFG}$ denotes the i-th row of the reshaped feature group and $s_{ji}$ represents the influence of the i-th module on the j-th module. The attention map of the deep extracted feature layer is obtained by multiplying $S$ with the reshaped feature group and then multiplying the result by the scale parameter $\lambda$. Finally, the two attention maps are summed element-wise to obtain the output $F_{IAEM} \in \mathbb{R}^{H \times W \times NC}$. Formally, the process is described as follows:

$$F^{j}_{IAEM} = B_j + \lambda \sum_{i=1}^{N} s_{ji} F^{i}_{IFG},$$

where $B_j$ is the part of $B$ that belongs to the j-th module and $\lambda$ is a learnable weight initialized to 0. The final feature of each module is thus a weighted sum of the features of all modules together with the original features, which models the long-range semantic dependencies of the entire feature map. Integrating the attention enhancement module in this way, by learning the interdependencies between modules, effectively enhances and optimizes the overall network's performance.
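The module-level attention branch described above (reshape the feature group to N rows, multiply by the transpose, apply softmax, then reweight and scale) can be sketched in pure Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, each module output is assumed to be already flattened to a list of length H·W·C, and the 3D-convolution branch and any residual term are deliberately left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def module_attention(features, scale):
    """features: N rows, each the flattened output of one module.

    Returns (attn, out): attn[j][i] is the softmax-normalised similarity
    between modules j and i, and out[j] is the scaled, attention-weighted
    sum of all module features for module j.
    """
    n = len(features)
    dim = len(features[0])
    # Gram matrix: dot product between every pair of flattened module outputs.
    gram = [[sum(a * b for a, b in zip(features[j], features[i]))
             for i in range(n)] for j in range(n)]
    attn = [softmax(row) for row in gram]
    out = [[scale * sum(attn[j][i] * features[i][d] for i in range(n))
            for d in range(dim)] for j in range(n)]
    return attn, out


attn, out = module_attention([[1.0, 0.0], [0.0, 1.0]], scale=0.5)
```

Because the attention map is only N × N, this branch stays cheap even when each module output is large; the dominant cost is the N² dot products over the flattened features.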

Datasets and Metrics
In this paper, DIV2K [32] is used as the training set of the model; the DIV2K dataset contains 800 training images, 100 validation images, and 100 test images. Five standard test sets are used: Set5 [33], Set14 [34], B100 [35], Urban100 [36], and Manga109 [37]. Following common practice, all training and testing are performed on the luminance channel of the YCbCr color space, and only the Y channel is processed. This paper uses bicubic down-sampling (BI) to obtain the low-resolution (LR) images. The commonly used evaluation metrics PSNR and SSIM are selected for quantitative comparison with other SR methods. Visual results are also provided for a more intuitive comparison with other methods.
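For reference, PSNR on the luminance channel can be computed as in the sketch below. The RGB-to-Y conversion uses the ITU-R BT.601 coefficients that are common in SR evaluation scripts; the paper does not state which conversion it uses, so this choice, the function names, and the nested-list image layout are assumptions.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma for 8-bit RGB values, giving Y in the range [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    sq_diffs = [(a - b) ** 2
                for row_a, row_b in zip(img_a, img_b)
                for a, b in zip(row_a, row_b)]
    mse = sum(sq_diffs) / len(sq_diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of one grey level corresponds to roughly 48.13 dB.
print(round(psnr([[0, 0], [0, 0]], [[1, 1], [1, 1]]), 2))  # 48.13
```

Since PSNR is logarithmic in the mean squared error, gains of a few tenths of a dB between methods correspond to small but consistent reductions in pixel error.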

Implementation Details
In this paper, the LR image is randomly cropped into patches of size 48 × 48 as training input, and the corresponding HR patch size is 48r × 48r, where r is the scale factor. The minibatch size is set to 16, and data augmentation such as horizontal flipping and random 90° rotation is performed on the training set. For the hyper-parameters, this paper sets the number of MTMs to M = 5 and the number of PMMs to N = 7. The model is trained using the ADAM optimizer [3][4][5][6] with β1 = 0.9, β2 = 0.999, and ε = 10^−8, the L1 loss function, and C = 64 channels (number of filters). The initial learning rate is set to 10^−4 and is halved every 200 backpropagation iterations. When training models that increase the image resolution by ×3 and ×4, we adopt the trained ×2 upsampling model as a pre-trained model and further train the ×3 and ×4 models from it. This approach captures the underlying upsampling mechanism and features by first learning with a small (×2) upsampling factor. When this pre-trained model is then trained on the task of upsampling to higher magnifications (×3 and ×4), it can learn the complex details required for the task more efficiently, shortening training time and improving model performance at higher resolutions. This paper uses the PyTorch framework and an NVIDIA GeForce RTX 3090 GPU for training and testing.
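Two pieces of this training setup lend themselves to a short sketch: a step-decay learning-rate schedule and the aligned cropping of LR and HR patches. The code below is a minimal illustration with hypothetical function names; the decay interval of 200 steps and the halving factor are assumptions about how the schedule is meant to be read, and images are plain 2D nested lists.

```python
import random


def step_decay_lr(step, base_lr=1e-4, interval=200, gamma=0.5):
    """Learning rate after `step` steps; the interval unit is an assumption."""
    return base_lr * gamma ** (step // interval)


def paired_random_crop(lr_img, hr_img, patch=48, scale=2, rng=random):
    """Crop a patch x patch LR window and the aligned (patch*scale)^2 HR window."""
    height, width = len(lr_img), len(lr_img[0])
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)
    lr_patch = [row[left:left + patch] for row in lr_img[top:top + patch]]
    hr_patch = [row[left * scale:(left + patch) * scale]
                for row in hr_img[top * scale:(top + patch) * scale]]
    return lr_patch, hr_patch


# Toy images: a 50 x 50 LR image and its 100 x 100 HR counterpart.
lr = [[0] * 50 for _ in range(50)]
hr = [[0] * 100 for _ in range(100)]
lr_patch, hr_patch = paired_random_crop(lr, hr, patch=48, scale=2,
                                        rng=random.Random(0))
print(len(lr_patch), len(hr_patch))  # 48 96
```

Scaling the crop offsets by the same factor r keeps each HR patch spatially aligned with its LR patch, which the pixel-wise L1 loss depends on.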

Comparison with State-of-the-Art Methods
In this section, we compare the performance of the MCFN network in detail with several state-of-the-art network models. The comparison covers the following methods: bicubic interpolation, A+ [38], SRCNN [5], VDSR [8], EDSR-baseline [9], LapSRN [10], CARN [39], IDN [40], MSRN [12], MSFRN [13], MIPN [41], MSCIF [42], and MSAR [14]. Through quantitative analysis and subjective visual evaluation, we aim to assess the performance of each model objectively and to demonstrate the performance of the MCFN network comprehensively. Detailed comparisons were performed at the scaling factors ×2, ×3, and ×4. The specific comparison results are shown in Table 1. These results show that the MCFN network has a clear advantage over recently proposed methods on most of the performance metrics. In particular, compared with MSRN, the larger network presented at ECCV, the MCFN network achieves PSNR and SSIM values that are higher by 0.21 dB and 0.0041, respectively, on the Set14 test set with a scaling factor of 2. On Set5 with a scaling factor of 3, MCFN improves on MIPN by 0.1 dB in PSNR and 0.0009 in SSIM. As the scaling factor increases, the low-resolution image loses more high-frequency information, which limits the high-quality reconstruction of super-resolution images. On the Urban100 dataset, which is rich in detailed information, MCFN outperforms the next-best method, MSAR, by 0.387 dB in PSNR and 0.0105 in SSIM when the scaling factor is 4. In summary, our network exhibits competitive performance, which provides initial evidence for the validity of the network that we designed.
In order to present a more comprehensive picture of the performance of our model, we selected several representative detail regions from different super-resolution images reconstructed at ×2, ×3, and ×4. As shown in Figures 5-8, the selected details are marked with rectangular boxes and enlarged three times so that these key details can be shown and compared more clearly.
In order to present a more comprehensive picture of the performance of our model, we selected several representative detail parts from different super-resolution images.We reconstructed the images with ×2, ×3, and ×4 for these detail parts to show and compare these key details more obviously.As shown in Figures 5-8, the selected details were marked with rectangular boxes and enlarged three times to show and contrast these key details more obviously.It can be observed in these results that the MCFN network shows a significant advantage in most of the performance metrics compared to the recently proposed methods.In particular, compared to the more extensive network MSRN proposed by ECCV, the MCFN network shows higher PSNR and SSIM values by 0.21dB and 0.0041, respectively, on the Set14 test set with a scaling factor of 2. On Set5, with a scaling factor of 3, compared to the MIPN, the MCFN also improves its PSNR and SSIM values by 0.1 dB and 0.0009.As the scaling factor increases, the low-resolution image loses more high-frequency information, limiting the high-quality reconstruction of super-resolution images.In the Ur-ban100 dataset, which is rich in detailed information, MCFN outperforms the following highest method, MSAR, by 0.387 dB and 0.0105 in PSNR and SSIM metrics, respectively, when the scaling factor is four.In summary, our network exhibits recognizable performance, which initially proves the validity of the network that we designed.
In order to present a more comprehensive picture of the performance of our model, we selected several representative detail parts from different super-resolution images.We reconstructed the images with ×2, ×3, and ×4 for these detail parts to show and compare these key details more obviously.As shown in Figures 5-8, the selected details were marked with rectangular boxes and enlarged three times to show and contrast these key details more obviously.It can be observed in these results that the MCFN network shows a significant advantage in most of the performance metrics compared to the recently proposed methods.In particular, compared to the more extensive network MSRN proposed by ECCV, the MCFN network shows higher PSNR and SSIM values by 0.21dB and 0.0041, respectively, on the Set14 test set with a scaling factor of 2. On Set5, with a scaling factor of 3, compared to the MIPN, the MCFN also improves its PSNR and SSIM values by 0.1 dB and 0.0009.As the scaling factor increases, the low-resolution image loses more high-frequency information, limiting the high-quality reconstruction of super-resolution images.In the Ur-ban100 dataset, which is rich in detailed information, MCFN outperforms the following highest method, MSAR, by 0.387 dB and 0.0105 in PSNR and SSIM metrics, respectively, when the scaling factor is four.In summary, our network exhibits recognizable performance, which initially proves the validity of the network that we designed.
In order to present a more comprehensive picture of the performance of our model, we selected several representative detail regions from different super-resolution images reconstructed at ×2, ×3, and ×4. As shown in Figures 5-8, the selected details were marked with rectangular boxes and enlarged three times so that they can be compared more easily. In Figure 5, at a magnification factor of 2, a significant difference in the clarity of the letters reconstructed by the different algorithms can be observed. For example, the letters reconstructed by SRCNN and CARN lack sharpness and contain visible noise. Although the IDN, MSRN, and MIPN methods improve the clarity, some details of the letter shapes are still not recovered. In contrast, the letters reconstructed by MCFN are clearer and less noisy.
In Figure 6, a building at sunset at 3× magnification shows that MCFN performs better in preserving the edge texture and reducing artifacts. Figure 7 shows a car roof image at 4× magnification. MCFN exhibits less distortion and effectively reduces ringing effects, with richer information on the edge contours.
In addition, in Figure 8, the selected sign text in the scene is displayed at a magnification factor of 4; our method improves the edge clarity while also improving the brightness, yielding a better visual effect. In general, our network performs well on objective indicators and shows clear advantages in subjective visual quality.

Study of Ablation of Network Structures
In this part of the study, we demonstrate the effectiveness of each module in the proposed MCFN and its contribution to network performance. We design a series of ablation experiments, as shown in Table 2, in which critical modules are added to or replaced in the network. First, we construct a base network consisting of a series of PMMs, called the PMMs network. The base network adopts a multi-scale mechanism built on depthwise-separable and pointwise convolutions, improving computational efficiency while still extracting adequate feature information at different scales. Then, IAEM was added to the PMMs network to evaluate its effect, denoted as MTMs_PMMs + IAEM. Subsequently, CFMs were added to the PMMs to assess their effect on the network's performance, denoted as MTMs (PMMs + CFM). It is worth noting that we did not perform ablation experiments on the combination of PMMs and CFM alone. Instead, we performed ablation experiments on MTMs (a combination of PMMs and CFMs) together with IAEM, aiming to assess the impact of CFM on performance in the presence of IAEM. To this end, we replaced CFMs with PAMs and CAMs, denoted as MTMs_PAM + IAEM and MTMs_CAM + IAEM. Similarly, to assess the contribution of IAEM, we replaced IAEM with CAM in the MCFN structure, denoted as MTMs + CAM. Although this design may differ from traditional ablation methods, it provides an effective way to assess the interactions of the individual modules. In addition, it fits our available experimental resources, allowing us to perform the most effective performance evaluation under limited conditions. For most of these model variants, we compare PSNR and SSIM values on the Set5, Set14, and B100 test sets over 200 cycles to ensure the validity of the experiments. To show the experimental results more intuitively, we plotted the experimental data of the last 50 cycles as a line graph (Figure 9).
Table 2 and Figure 9a demonstrate a clear trend: adding the fusion and enhancement networks to the base network significantly improves the network's performance metrics, which supports the effectiveness of the individual modules. The performance improvement is pronounced in the MTMs_CAM + IAEM network. This network effectively focuses on critical feature information by learning the relevance between different channels, which demonstrates the importance of correlation learning after deep extraction of high-frequency information. In particular, in the MCFN network, we design a cross-attention fusion module. This module not only learns the spatial locations of shallow feature information through the cross-module learning approach but also combines this spatial information with deep feature information through the cross-connection strategy to learn the channel-wise relevance of the information. This hierarchical approach makes fuller use of the available information. By integrating spatial and channel features, CFM achieves a more comprehensive fusion of information, enabling the network to achieve the best results in several performance metrics.
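The efficiency argument for the base PMMs network rests on replacing standard convolutions with depthwise-separable and pointwise convolutions. The following is a minimal sketch of the usual parameter-count comparison behind that argument. The helper names and the channel width of 64 are illustrative assumptions, not values taken from this paper, and the actual PMM layout is not reproduced here.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 convolution mixing channels
    return depthwise + pointwise


channels = 64  # assumed channel width, for illustration only
for k in (3, 5):
    standard = conv_params(channels, channels, k)
    separable = depthwise_separable_params(channels, channels, k)
    print(f"k={k}: standard={standard}, separable={separable}, "
          f"ratio={standard / separable:.1f}")
```

The saving grows with the kernel size, which is one reason the separable form is attractive in multi-scale blocks that use larger kernels on some branches.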
When analyzing the performance of IAEM, we used CAM as a control group to measure the difference in performance between the two. As shown in Table 2 and Figure 9b, our network achieves higher PSNR and SSIM than the control group on the above test sets. These results demonstrate the effectiveness of our module in learning correlations. In contrast to applying channel attention only at the tail, our integrated attention enhancement network employs a dimensionality transformation technique to fuse feature information from different stages. This strategy enhances the learning of feature information weights and effectively helps the network's performance during the fusion reconstruction process.
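For readers unfamiliar with the control group, the sketch below illustrates one common form of channel attention. This section does not define CAM, so the squeeze-and-excitation style layout used here (global average pooling, a two-layer bottleneck, and a sigmoid gate) is an assumption, as are the function name, the reduction ratio, and the random weights. It is not the IAEM and is not claimed to match the authors' CAM.

```python
import numpy as np


def channel_attention(features: np.ndarray, w_reduce: np.ndarray, w_expand: np.ndarray):
    """Rescale each channel of a (C, H, W) feature map by a learned gate in (0, 1)."""
    squeezed = features.mean(axis=(1, 2))                  # (C,) global average pooling
    hidden = np.maximum(w_reduce @ squeezed, 0.0)          # (C // r,) bottleneck with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))     # (C,) sigmoid gate per channel
    return features * gates[:, None, None], gates


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
num_channels, reduction = 16, 4
features = rng.standard_normal((num_channels, 8, 8))
w_reduce = rng.standard_normal((num_channels // reduction, num_channels)) * 0.1
w_expand = rng.standard_normal((num_channels, num_channels // reduction)) * 0.1

out, gates = channel_attention(features, w_reduce, w_expand)
print(out.shape, gates.shape)
```

A gate of this kind sees only a per-channel summary of a single feature map, whereas the text above describes IAEM as fusing features from different stages.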

Study of Multi-Scale Trans-Module Synthesis
In this part of the study, we analyze the influence of the number of MTMs and the number of PMMs in each MTM on the network performance through a series of ablation experiments. As shown in Table 3, we set the number of MTMs, M, to 4, 5, and 6 and evaluate its impact on the number of parameters and network performance on the Set5 test set. The results show that the PSNR value of the network improves as M increases, but the gain gradually shrinks. In addition, we analyze the number of PMMs, N, setting it to 6, 7, and 8, and record the comparative experimental results in Table 4. The experimental results show that when N increases from 6 to 7, the PSNR value increases by 0.034 dB. However, when N grows to 8, the increase in PSNR is only 0.01 dB. Therefore, to balance the reconstruction quality and the number of parameters, we set the numbers of MTMs and PMMs to 5 and 7, respectively.
As shown in Figure 10, this study compares the number of parameters, the number of floating-point operations (FLOPs), and the average peak signal-to-noise ratio (Avg. PSNR) of MCFN and other advanced methods for 4× magnification (output image resolution of 1280 × 720) on the Set5 dataset. To provide a more direct comparison, the relevant data are summarized in Table 5. Compared with the other methods, MCFN achieves superior performance with low computational overhead. Although not optimal regarding the number of parameters, MCFN has half the number of parameters of MSRN. In summary, MCFN performs well in both model efficiency and objective evaluation indicators.
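The efficiency comparison fixes the output at 1280 × 720 for 4× magnification, which implies a 320 × 180 low-resolution input. The sketch below shows how that input size follows from the scale factor and how the cost of a single convolution layer scales with it. The helper names and the 64-channel 3 × 3 layer are illustrative assumptions; the counting convention used for the paper's FLOPs (for example, whether one multiply-accumulate counts as one or two operations) is not stated in this section, so only multiply-accumulates are counted.

```python
def lr_size(hr_width: int, hr_height: int, scale: int) -> tuple:
    """Low-resolution input size that yields the given output size at this scale."""
    return hr_width // scale, hr_height // scale


def conv_macs(c_in: int, c_out: int, k: int, width: int, height: int) -> int:
    """Multiply-accumulates of a stride-1, same-padded k x k convolution."""
    return c_in * c_out * k * k * width * height


width, height = lr_size(1280, 720, 4)
print(width, height)  # 320 180

# One assumed 64 -> 64 channel 3 x 3 layer evaluated at the low-resolution size.
macs = conv_macs(64, 64, 3, width, height)
print(f"{macs / 1e9:.2f} GMACs")  # about 2.12 GMACs for this single layer
```

Because the cost is proportional to the spatial size, layers that run at the low-resolution size are 16 times cheaper at ×4 than the same layers run at the output resolution.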

Conclusions
This paper proposes a multi-scale cross-attention fusion network (MCFN) to improve reconstruction quality in image super-resolution tasks. The network combines the advantages of multi-scale and attention mechanisms, aiming to extract and fuse the feature information of the image more thoroughly. The multi-scale trans-attention module (MTM) we designed includes the pyramid multi-scale module (PMM) and the cross-attention fusion module (CFM). In the PMM, depthwise-separable and pointwise convolutions are introduced with a residual strategy to extract feature information at each scale while maintaining efficiency. In the CFM, the extracted image features are cross-fused to reconstruct the high-frequency information of the image. At the same time, to effectively fuse the cascaded PMMs, a cross-module learning method is designed to learn the multi-scale information extracted at different depths. In addition, an improved integrated attention enhancement module (IAEM) is inserted at the tail. It fuses the deep features of different stages through dense connections, enhances the learning of feature weights by changing the dimension, and introduces 3D convolution to learn context features, thereby fusing image feature information effectively and improving the quality of image reconstruction.
Experimental results show that MCFN is competitive with existing leading methods in key performance indicators on public benchmark datasets. In particular, for ×4 upsampling on the Set5 dataset, MCFN reached a PSNR of 32.43 dB, 0.14 dB higher than MSAR. In addition, visual comparison shows that the images reconstructed by MCFN have rich texture details and a high level of high-frequency information, further supporting the method's effectiveness. Although MCFN has shown competitive performance in the experiments, we also recognize its limitations. Future work includes training with more realistic datasets to enhance the generalization and practicality of the model, as well as introducing subjective evaluation and other methods to evaluate image quality more comprehensively.

Figure 1. Framework of the multi-scale cross-attention fusion network.

Figure 2. Architecture of the multi-scale trans-attention module. The core consists of the PMM as a multi-scale pyramid module, which extracts feature information at different scales by incorporating depthwise-separable convolution to improve efficiency. In addition, CFM is the cross-attention fusion module, which fully fuses feature information by cross-learning the correlation of shallow and deep PMM output feature information.

Figure 3. Comparative structural diagram of PMM and ASPP.

Figure 4. Architecture of the integrated attention enhancement module.

Figure 5. Visual comparison of our method with other methods (×2).

Figure 6. Visual comparison of our method with other methods (×3).

Figure 7. Visual comparison of our method with other methods (×4).

Figure 8. Visual comparison of the proposed method with other methods in terms of letters (×4).

Figure 9. Line plots of the training process: (a) comparison of the results of the fusion module networks and (b) comparison of the results of the integrated attention enhancement module and CAM.

Figure 10. Visualization of PSNR, parameters, and FLOPs. PSNR values were evaluated on Set5 with a scaling factor of ×4.
Tai et al. proposed the MemNet [17] algorithm, which uses dense blocks for deep networks.Jiang et al. proposed the HDRN module proposed by Szegedy et al., which uses multiple convolution kernels of different sizes at the same level to extract features, obtain various receptive fields, and improve image quality.Recently, multi-scale feature extraction has also been introduced into image super-resolution.Li et al. proposed an MSRN algorithm [12] that uses multi-scale feature extraction to extract image features of different scales adaptively.He et al. proposed the MRFN [22] algorithm, which uses a multi-receptive field module to remove parts of various receptive fields and proposed a new training loss to reduce reconstruction error.Feng et al. proposed the MSRFN

Table 1. Comparison of PSNR and SSIM values on standard datasets. In this table, the bolded numbers indicate the best values on each dataset, while the italicized numbers indicate the second-best values.

Table 3. Analysis of the number of MTMs.

Table 4. Analysis of the number of PMMs in MTM.

Table 5. Comparison of performance, parameters, and FLOPs with some state-of-the-art ISR methods under a scaling factor of 4 on the Set5 dataset. FLOPs are calculated based on 320 × 180 input features.