A Visible and Synthetic Aperture Radar Image Fusion Algorithm Based on a Transformer and a Convolutional Neural Network

Abstract: For visible and Synthetic Aperture Radar (SAR) image fusion, this paper proposes a visible and SAR image fusion algorithm based on a Transformer and a Convolutional Neural Network (CNN). Firstly, the Restormer Block is used to extract cross-modal shallow features. Then, we introduce an improved Transformer–CNN Feature Extractor (TCFE) with a two-branch residual structure. This includes a Transformer branch that introduces the Lite Transformer (LT) and DropKey for extracting global features and a CNN branch that introduces the Convolutional Block Attention Module (CBAM) for extracting local features. Finally, the fused image is output based on the global features extracted by the Transformer branch and the local features extracted by the CNN branch. The experiments show that the proposed algorithm can effectively extract and fuse the global and local features of visible and SAR images, so that high-quality visible and SAR fusion images can be obtained.


Introduction
In recent years, with the continuous development of remote sensing technology, SAR imaging, as an important imaging technology, has become a popular research field. SAR is a microwave remote sensing system characterized by all-weather and all-day operation and a certain penetration capability, and it can provide SAR images with rich structural information. However, SAR images are seriously contaminated by noise, resulting in low signal-to-noise ratios, which makes SAR image interpretation more difficult. In comparison, traditional visible-light sensors receive rich, high-resolution, multi-spectral information from ground objects, but visible-light sensor imaging is easily affected by weather and other factors. On the other hand, although visible-light imaging technology covers a wide range of wavelengths from 400 to 700 nm, its principle of operation relies on the passive reception of spectral information reflected from features and, therefore, has limitations in filtering or highlighting spectral information in specific wavelength bands. In contrast, SAR is capable of actively emitting electromagnetic waves of a specific wavelength band and receiving their reflected signals, thereby accurately capturing and extracting the information they carry, which gives SAR greater flexibility and accuracy in specific tasks. Therefore, the organic fusion of visible-light and SAR images, with their complementary advantages, can significantly enrich the useful information in images, which is of great significance in military reconnaissance, agricultural planning, target extraction, and other image processing work.
In image fusion methods based on deep learning, the auto-encoder is a commonly used fusion model. Its structure mainly consists of three parts: encoder, fusion decision, and decoder. The encoder is primarily used to encode the source images into low-dimensional representations in the latent space, capturing the key features of the images. The decoder reconstructs the original images by receiving the latent representations generated by the encoder. During the training process, an appropriate loss function is designed so that the decoder can reconstruct the input images as accurately as possible. After training, the encoder can encode data from different modalities into low-dimensional representations in the latent space, which are then fused according to the designed fusion method. The fused encoding is input into the decoder for reconstruction. Image fusion methods based on auto-encoders do not require manually designed feature extraction. They can effectively learn key information from the image and achieve fusion in an end-to-end framework, greatly simplifying the fusion process.
Among the many AE fusion frameworks, the auto-encoder approach based on CNN feature extraction and reconstruction has proven to be one of the most effective. The three algorithmic processes shown in Figure 1 are currently the most commonly used for this approach. The processes shown in Figure 1a,b are based on a shared encoder, while the one in Figure 1c is based on private encoders. However, these methods currently have some problems and shortcomings. Firstly, CNNs are convolution-based neural networks with inductive biases and translation invariance; while these characteristics improve the efficiency of feature computation, they restrict the receptive field, leading to weak global feature mining capabilities and difficulty in extracting the global information needed for high-quality fused images [24]. Secondly, forward propagation in the fusion network may lose some important feature information. Lastly, the shared-encoder methods in the figure cannot differentiate features from different modalities, while the private-encoder method overlooks shared features. Unlike CNNs, the Vision Transformer (ViT) [25] model architecture, which has recently become popular in the field of computer vision, utilizes mechanisms such as self-attention, multi-head attention, and positional encoding. This enables the model to effectively capture global dependencies within the input sequence, thereby providing outstanding global feature extraction capabilities. However, ViT-based network models are relatively complex and require substantial computation to achieve better performance.
To address these issues, this paper proposes a more rational fusion network architecture to overcome the shortcomings and challenges in feature extraction and fusion. The fusion algorithm framework designed in this paper is illustrated in Figure 2. Second, addressing the potential loss of important feature information during the fusion process, this paper makes relevant improvements to the Transformer and CNN feature extraction models, enhancing the network's ability to capture important feature information. On one hand, based on the Transformer network structure, we introduce the LT [26] block to balance fusion image quality against computational cost, and the DropKey [27] mechanism in the network's attention layer to adaptively adjust attention weights, making the model focus on more useful information. On the other hand, based on the CNN network model, we add the CBAM module, which enhances the network's focus on important areas by introducing channel and spatial attention mechanisms, thereby reducing the loss of important information.
Third, regarding visible and SAR images, we believe that large-scale environmental features such as the background and contours of the different modalities are highly similar, showing high correlation in global features, whereas the textures and details of the different modalities show differences and independence, demonstrating low correlation in local features. Therefore, we promote the feature extraction capability and effectiveness for the different modalities by increasing the correlation of the global features and reducing the correlation of the local features of the visible and SAR images.
The remainder of this paper is organized as follows: Section 2 introduces the related work on visible and SAR image fusion methods; Section 3 describes, in detail, the visible and SAR image fusion method and the related structures used in this paper; Section 4 introduces the related experimental work and presents the experimental results and analysis; finally, this paper concludes in Section 5.

Related Work
In this section, we mainly introduce some related work on image fusion methods.

CNN
Image fusion methods based on CNNs mainly leverage the powerful feature extraction capabilities of CNN networks, retaining rich detail information in the fused images. In 2017, Liu et al. [28] introduced CNN networks into the field of image fusion. They trained the network using blurred background and foreground images to obtain binarized weight maps; during the testing phase, the source images were combined with the weight maps to produce fused multi-focus images. Subsequently, many researchers introduced CNN network models into traditional methods, infusing rich semantic information into the fused images. For example, Li et al. [29] used the VGG19 network to further process the detail components obtained through multi-scale decomposition, thus preserving rich texture information in the fused images. Liu et al. [30] used a downsampled sequence of convolutional weight maps as the fusion ratio map of two-branch downsampling sequences, avoiding manually designed fusion strategies. These methods share a common issue: they do not fully consider the differing information among images of different modalities.

Attention Mechanism
The attention mechanism is a commonly used module in image processing that focuses on the important features of an image and suppresses unnecessary regional responses. In 2014, the Google DeepMind team used the attention mechanism in an RNN model for image classification, which led to its study and use by many scholars. In general, attention mechanisms can be divided into soft attention, hard attention, and the self-attention used in the field of Natural Language Processing (NLP). The soft attention mechanism can currently be subdivided into channel attention, spatial attention, and their combination [31]. Woo et al. [32] proposed CBAM through a combinatorial analysis of the channel and spatial dimensions and confirmed that network performance is enhanced by an accurate attention mechanism and the suppression of noisy information. CBAM is a feedforward convolutional attention module that combines a channel attention module (CAM) and a spatial attention module (SAM) to enhance the performance of CNNs. It can be integrated into any network model of CNN architecture with negligible computational cost and supports end-to-end training. Currently, CBAM has been applied to a variety of common Convolutional Neural Networks to enhance network performance, such as ResNet [33], VGG [34], and DenseNet [35].
The CAM mainly models the importance of features, and its structure is shown in Figure 3. Its main process applies both maximum pooling and mean pooling, passes the pooled results through a shared MLP to obtain the transformed results, and finally combines the two results and applies the sigmoid function to obtain the channel attention.
The SAM models the importance of spatial locations, and its structure is shown in Figure 4. Its main process first reduces along the channel dimension to obtain the maximum pooling and mean pooling results, respectively, then stitches them into a feature map, which is learned using a convolutional layer.
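As a concrete reference, the CAM and SAM described above can be sketched in PyTorch roughly as follows. This is a minimal sketch: the reduction ratio of 16 and the 7 × 7 spatial kernel are common CBAM defaults, not values stated in this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: max-pool and mean-pool over space, a shared MLP, then a sigmoid gate."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumed default
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))        # mean-pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))         # max-pooling branch
        gate = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * gate

class SpatialAttention(nn.Module):
    """SAM: channel-wise max/mean maps, concatenated and learned by one conv layer."""
    def __init__(self, kernel_size=7):  # kernel size is an assumed default
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        mx = x.amax(dim=1, keepdim=True)          # reduce along the channel dimension
        avg = x.mean(dim=1, keepdim=True)
        gate = torch.sigmoid(self.conv(torch.cat([mx, avg], dim=1)))
        return x * gate

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention()

    def forward(self, x):
        return self.sam(self.cam(x))
```

Because both gates only rescale the input tensor, the module preserves the feature-map shape and can be dropped in front of any convolutional stage, as noted above.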

Transformer and Its Variants
Transformer is a classic NLP model proposed by Vaswani et al. [36] in 2017 that relies entirely on self-attention to compute its inputs and outputs. The Vision Transformer (ViT) was introduced by Dosovitskiy et al. [25] for computer vision applications. Compared to CNNs, ViT and its variants have achieved many advanced results in image processing. For example, Wang et al. [37] proposed PVT, which integrates Transformer into CNN and trains on dense partitions of images to produce high-resolution outputs, overcoming the drawbacks of Transformer for dense prediction tasks. Wu et al. [26] proposed an efficient mobile NLP architecture, LT, which features long- and short-range attention to significantly reduce computational costs. LT is a novel lightweight Transformer network with two enhanced self-attention mechanisms that improve performance for edge deployment. For low-level features, Convolutional Self-Attention (CSA) is introduced: unlike previous approaches that fused convolution and self-attention, CSA introduces local self-attention into the convolution within a kernel of size 3 × 3 to enrich the low-level features in the first stage of LT. For high-level features, Recursive Atrous Self-Attention (RASA) is proposed to compute similarity mappings using multi-scale contexts, employing a recursive mechanism to increase representational power at a marginal additional parameter cost. Zamir et al. proposed Restormer, an efficient Transformer designed for image restoration.
In the image restoration task, although existing Transformer models can overcome the limited receptive field of CNNs and their non-adaptability to the input content, their computational complexity grows quadratically with spatial resolution, so they cannot be applied to the restoration of high-resolution images. In contrast, Restormer, as an efficient Transformer network for image restoration, is applicable to the task of restoring and reconstructing large images by introducing an MDTA module and a new GDFN that models global connectivity.
Improvements to ViT mainly focus on two aspects: on the one hand, enhancing or replacing the original network's ReLU structures, whose non-linearity is insufficient, as exemplified in LT; on the other hand, introducing the DropOut mechanism during the training of Transformer networks to prevent overfitting, thereby helping the model extract more useful feature information. This paper improves the network by introducing the DropKey mechanism into the LT network.

Regularization Method
In machine learning, as the model is continuously optimized, image blocks with a larger share of attention in the current iteration tend to be assigned larger attention weights in the next iteration, which predisposes the model to overfitting. To solve such problems, many machine learning algorithms use related strategies to reduce the test error, collectively known as regularization. The main strategies currently used in deep learning are Parameter Norm Penalties, Early Stopping, DropOut, etc. In 2012, Hinton et al. proposed DropOut, whose principle is to improve the performance of neural networks by preventing the co-adaptation of feature detectors, thereby alleviating overfitting. In that year's image recognition competition, Krizhevsky et al. used the DropOut algorithm in the AlexNet network to prevent overfitting and eventually won the competition. As shown in Figure 5b, DropOut randomly discards attention weights after Softmax normalization, but this breaks the probability distribution of the attention weights and fails to penalize weight peaks, so the model still overfits to locally specific information. In this paper, we use a novel regularization method, DropKey [27], shown in Figure 5c, which implicitly assigns an adaptive operator to each attention block to constrain the attention distribution by randomly dropping some of the key vectors (thus making the distribution smoother) while also encouraging the model to pay more attention to the useful information in other image blocks, helping it capture globally robust features.

Q, K, and V, shown in Figure 5, are the three key components of the self-attention mechanism in the Transformer network, denoted as query vectors, key vectors, and value vectors, respectively; all are obtained from the input matrices by linear transformation, as shown in Figure 5a. In the self-attention mechanism, a weight distribution is obtained by calculating the similarity between the query vector and all the key vectors, and this distribution is used to weight and sum the associated value vectors. Firstly, the inner product (MatMul) of Q with each row vector of K is calculated and, to prevent the inner product from becoming too large, divided by the square root of d_k (Scale), where d_k is the dimension of the K matrix; secondly, the result of the inner product is normalized using Softmax; finally, the resulting Softmax matrix is multiplied with the V matrix to obtain the final output.
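The attention pipeline just described, combined with DropKey, can be sketched as follows. This is a minimal single-head sketch; the drop probability is an assumed hyperparameter. The key point is that DropKey masks key logits *before* Softmax, so the surviving weights still form a valid probability distribution.

```python
import torch
import torch.nn.functional as F

def attention_with_dropkey(q, k, v, drop_prob=0.1, training=True):
    """Scaled dot-product attention (MatMul -> Scale -> Softmax -> MatMul)
    with DropKey: random keys are masked before Softmax, unlike DropOut,
    which zeroes weights after normalization."""
    d_k = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d_k ** 0.5   # MatMul + Scale
    if training and drop_prob > 0:
        # Drop keys by pushing their logits to -inf before normalization,
        # so the remaining attention weights are renormalized automatically.
        mask = torch.rand_like(logits) < drop_prob
        logits = logits.masked_fill(mask, float("-inf"))
    weights = F.softmax(logits, dim=-1)             # each row still sums to 1
    return weights @ v                              # weighted sum of value vectors
```

At inference time (`training=False`), the function reduces to the plain attention computation described above.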

Framework and Methodology
In this section, we introduce the method and framework we propose; we have also designed the corresponding loss functions for this method. The algorithm framework of this paper is shown in Figure 6. Below, we introduce it from four aspects: encoder, fusion strategy, decoder, and loss function.

Global Feature Extraction. Based on the shallow features extracted by the Restormer Block, we use the LT model to extract the global features of the input images. By adopting long- and short-range attention, the LT model focuses more on the global information of images, and its flattened feedforward network structure reduces model parameters, significantly cutting computational costs while maintaining the same performance. At the same time, we introduce the DropKey mechanism in the attention layer, randomly dropping some key values to reduce the model's over-reliance on certain neurons and help capture more robust global features.
Local Feature Extraction. Local feature extraction aims to extract detailed features such as texture information and corner features from image data. The CNN feature extraction network is currently one of the most effective methods for extracting image detail features.
To capture more detailed feature information and reduce the loss of important information during fusion, we introduce the CBAM module at the front end of the CNN feature extraction network. This module adaptively adjusts the importance of different channels and assesses the relevance of different spatial positions, enhancing the network's focus on important areas.

Fusion Strategy
First, a fusion layer is constructed whose main structure is similar to the feature extraction structure of the encoder. We therefore similarly adopt a Transformer network with the LT module and DropKey mechanism, together with a CNN network with the CBAM module, as the fusion strategy. In the first training stage, we fuse and concatenate the global features extracted from the visible and SAR images, then send these concatenated features, along with the local features of the visible and SAR images, to the decoder to reconstruct the original images. The purpose is to train an encoder that extracts global features of the visible and SAR images with higher correlation. In the second training stage, we input the visible and SAR images into the trained encoder, fuse and concatenate the extracted global and local features, and send these concatenated features to the decoder for decoding to reconstruct the fused image.

Decoder
The decoders in the first and second training stages are structurally identical, both using Restormer Blocks as their basic unit, but they differ in function.The decoder in the first training stage mainly receives the global/local features from the visible and SAR images and ultimately reconstructs the original images, while the decoder in the second training stage receives the globally and locally concatenated features of the visible and SAR images and is capable of reconstructing the fused image.

Loss Function
Inspired by reference [39], this paper designs a two-stage training process.As introduced above, the tasks and functions realized in the first and second stages are not completely the same; therefore, we have designed specific loss functions for the training processes of both stages.

Training Stage 1
In training stage 1, the total training loss is calculated as follows:

L_total^1 = α_1 L_MI + α_2 L_SSIM + α_3 L_decomp,

where α_1, α_2, and α_3 are adjustment coefficients, set to 3, 10, and 1, respectively. L_MI, L_SSIM, and L_decomp respectively denote the mutual information loss, the structural similarity loss, and the feature decomposition loss of the visible and SAR images, which are defined as follows:

• Mutual information loss
The MI loss is defined in terms of the mutual information between the original and reconstructed images:

MI(x, y) = H(x) + H(y) − H(x, y),

where x and y represent the original image and the reconstructed image, respectively; H(x) and H(y) represent the information entropies of the original image and the reconstructed image, respectively; and H(x, y) represents the joint information entropy of the source image and the reconstructed image. The loss decreases as the mutual information increases, encouraging the reconstruction to retain the information of the source image.
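A minimal sketch of computing these entropy terms from grey-level histograms is shown below; the bin count of 256 is an assumption, not a value stated in the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a normalized histogram."""
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y, bins=256):
    """MI(x, y) = H(x) + H(y) - H(x, y), estimated from a joint
    grey-level histogram of the two images."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    joint = joint / joint.sum()       # normalize to a joint probability
    hx = entropy(joint.sum(axis=1))   # marginal histogram of x
    hy = entropy(joint.sum(axis=0))   # marginal histogram of y
    hxy = entropy(joint.ravel())      # joint entropy
    return hx + hy - hxy
```

For a constant image, H(y) = 0 and H(x, y) = H(x), so the mutual information is 0, as expected.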

• Structural similarity loss
The SSIM loss accumulates the structural dissimilarity over the visible and SAR reconstruction pairs:

L_SSIM = β [1 − SSIM(x_vis, y_vis)] + β [1 − SSIM(x_sar, y_sar)],

where β represents the adjustment coefficient of 0.5, and SSIM(·, ·) is the structural similarity index, whose specific expression is

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x^2 + μ_y^2 + c_1)(σ_x^2 + σ_y^2 + c_2)),

where x and y represent the original image and the reconstructed image, respectively; μ_x and μ_y represent the means of the original and reconstructed images; σ_x^2 and σ_y^2 represent their variances; σ_xy represents their covariance; c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2 are constants used to maintain stability; and L is the dynamic range of the image pixel values, with k_1 = 0.01 and k_2 = 0.03.
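The SSIM index can be sketched with global image statistics as follows. This is a simplification: standard SSIM implementations average the index over local windows, while this sketch uses whole-image means, variances, and covariance.

```python
import numpy as np

def ssim_index(x, y, k1=0.01, k2=0.03, L=255):
    """Global-statistics SSIM; c1 = (k1*L)^2 and c2 = (k2*L)^2
    stabilize the ratio when means or variances are near zero."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

An identical image pair gives an index of exactly 1; lower values indicate weaker structural similarity.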

• Feature decomposition loss

L_decomp is a loss function of our own design that aims to better distinguish the extracted global feature information from the local feature information. It is defined as

L_decomp = (CC(f_D^SAR, f_D^VIS))^2 / (CC(f_B^SAR, f_B^VIS) + ε),

where CC(·, ·) refers to the correlation coefficient operator; f_D^SAR and f_D^VIS respectively refer to the detailed local features extracted from the SAR and visible images; f_B^SAR and f_B^VIS respectively refer to the global features extracted from the SAR and visible images; and ε is a constant that keeps the denominator positive. Equation (5) is designed based on the viewpoint we proposed earlier: in our view, visible and SAR images should be highly correlated in terms of global feature information, so, in order to preserve the same global information for both types of images, the larger CC(f_B^SAR, f_B^VIS), the better. In terms of local detail feature information, there are certain differences between the two types of images, so, in order to extract richer details, the smaller CC(f_D^SAR, f_D^VIS), the better. Therefore, this paper proposes the above loss function, which decreases exactly when both goals are met.
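One way to realize these two goals in a single differentiable term is a ratio of the detail correlation to the base correlation. The sketch below assumes that form; the ratio structure and the stabilizing constant `eps` are assumptions, chosen so that minimizing the loss raises CC over the global features and lowers CC over the detail features.

```python
import torch

def cc(a, b, eps=1e-8):
    """Pearson correlation coefficient between two flattened feature maps."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def decomp_loss(f_d_sar, f_d_vis, f_b_sar, f_b_vis, eps=1.01):
    """Squared detail correlation over base correlation. Since CC lies in
    [-1, 1], eps > 1 keeps the denominator positive (value is an assumption)."""
    return cc(f_d_sar, f_d_vis) ** 2 / (cc(f_b_sar, f_b_vis) + eps)
```

Highly correlated global features together with decorrelated detail features give a small loss, while the reverse configuration gives a large one.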

Training Stage 2
In training stage 2, the total training loss is calculated as follows:

L_total^2 = α_4 L_int + α_5 L_grad + α_6 L_MI + α_7 L_SSIM + α_8 L_decomp,

where α_4, α_5, α_6, α_7, and α_8 are adjustment coefficients, set to 1, 1, 3, 10, and 1, respectively. On the basis of the first-stage loss function, two terms, L_int and L_grad, are added, where L_int is the intensity loss of the image, which constrains the fused image to maintain an intensity distribution similar to the source images, and L_grad is the gradient loss of the image, forcing the fused image to contain rich texture details. Their specific definitions are

L_int = (1 / (HW)) · || I_f − max(I_vis, I_sar) ||_1,
L_grad = (1 / (HW)) · || |∇I_f| − max(|∇I_vis|, |∇I_sar|) ||_1,

where I and ∇I respectively refer to the operators of the image intensity and gradient magnitude, max(·, ·) denotes the element-wise maximum, and H and W respectively refer to the height and width of the image.
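A sketch of intensity and gradient losses of this kind is given below, under common formulations: the element-wise maximum aggregation of the two sources and the Sobel operator for the gradient magnitude are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def sobel_grad(img):
    """Gradient magnitude |grad I| via Sobel filters; img has shape (B, 1, H, W)."""
    gx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3)
    gy = gx.transpose(2, 3)
    dx = F.conv2d(img, gx, padding=1)
    dy = F.conv2d(img, gy, padding=1)
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-12)

def intensity_loss(fused, vis, sar):
    """Mean L1 distance between the fused intensity and the element-wise
    maximum of the source intensities; l1_loss averages, giving the 1/HW."""
    return F.l1_loss(fused, torch.maximum(vis, sar))

def gradient_loss(fused, vis, sar):
    """Mean L1 distance between |grad fused| and max(|grad vis|, |grad sar|)."""
    return F.l1_loss(sobel_grad(fused),
                     torch.maximum(sobel_grad(vis), sobel_grad(sar)))
```

When the fused image reproduces the sources exactly, both losses vanish; otherwise they penalize deviations in brightness and texture, respectively.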

Experimental Setup and Result Analysis
In this section, we first introduce the dataset used in this experiment, then detail some parameter configurations and the implementation process of the experiment, compare it with existing visible and SAR image fusion methods, and finally, conduct an ablation study to prove the advancement and reference value of our proposed image fusion method.

Dataset Introduction
The dataset used in this experiment is OGSOD-1.0 [40], a publicly available dataset downloaded from the Internet. The SAR images in OGSOD-1.0 were collected by the Chinese Gaofen-3 satellite in the C-band, in Vertical-Vertical (VV) and Vertical-Horizontal (VH) polarization modes. These SAR images are provided by the 38th Research Institute of China Electronics Technology Group Corporation (CETGC), and their resolution is 3 m. The optical images are provided by Google Earth, and their resolution is 10 m. In addition, to increase the diversity of the training set, the original authors obtained permission from Michael Schmitt to extend the dataset with an additional 3000 sample pairs selected from the SEN1-2 [41] dataset. In total, OGSOD-1.0 consists of a training set of 14,665 optical and SAR image pairs and a test set of 3666 SAR-only images, containing more than 48,000 instance annotations. For this experiment, we selected 1048 pairs from the dataset as the training set and 100 pairs as the test set.

Evaluation Metrics
To verify the fusion performance of the algorithm proposed in this paper, the experiment quantitatively evaluates the fusion results using 12 common metrics drawn from four aspects: information-based, structural similarity-based, image feature-based, and human visual perception-based. The information-based metrics include entropy (EN), mutual information (MI), and peak signal-to-noise ratio (PSNR); the structural similarity-based metrics include the Structural Similarity Index Measure (SSIM) and Mean Squared Error (MSE); the image feature-based metrics include Average Gradient (AG), Edge Intensity (EI), Standard Deviation (SD), Spatial Frequency (SF), and the edge information-based index (Qabf); and the visual perception-based metrics include the Sum of Correlated Differences (SCD) and Visual Information Fidelity (VIF). They are categorized in Table 1. Except for the MSE, where a smaller value indicates higher image quality, higher values of all other metrics indicate better image quality after fusion.
Table 1. Classification of quantitative evaluation metrics used in the experiment [42].

Experimental Setup
All algorithm implementations were trained and tested on a high-performance workstation equipped with an Nvidia Tesla A100 GPU with 80 GB of memory and an AMD Ryzen Threadripper PRO 5995WX 64-core CPU. The deep learning framework is PyTorch, using CUDA version 11.7. During the training phase, the input image size was set to 256 × 256, with a total of 140 training epochs, where the first and second phases were 40 and 100 epochs, respectively. The batch size was set to 16, with an initial learning rate of 10⁻⁴, reduced by 50% every 20 epochs.
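The learning-rate schedule described above (halving every 20 epochs from 10⁻⁴) is a standard step decay; in PyTorch it corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`. As a framework-free sketch:

```python
def learning_rate(epoch: int, base_lr: float = 1e-4,
                  step: int = 20, gamma: float = 0.5) -> float:
    """Step-decay schedule: multiply the base rate by gamma
    once every `step` epochs (epochs are 0-indexed)."""
    return base_lr * gamma ** (epoch // step)
```

Over the 140 epochs used here, the rate decays from 1e-4 down to 1e-4 × 0.5⁶ in the final 20-epoch block.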

Qualitative Comparison
To better evaluate the fusion performance of the various algorithms, this experiment selected three pairs of visible-light and SAR images with rich texture details from the test set for comparative display. The original pairs of visible-light and SAR images are shown in Figure 7. From the figure, it can be seen that visible-light images have a better visual effect and clearly express local features such as buildings. However, their contour information is difficult to distinguish from the background. Conversely, SAR images express contour information more fully. Therefore, the fused image should include both local information, such as buildings, and global information, such as terrain contours.

Figure 8 shows a visual comparison of the fused images obtained by our proposed visible and SAR image fusion method and the five methods mentioned above, with red boxes highlighting some detailed comparisons of the fused images from each method. From the comparison results, it can be seen that our proposed method captures more abundant texture details and clearer contour information in the fused images than the other five methods, and the fused images obtained by our method make the target objects more prominent and easier to distinguish from the background, helping us better understand various scenes.
Upon examination of the fused images, it becomes evident that they all exhibit a monochromatic appearance devoid of color. This is a notable departure from the conventional characteristics of visible and SAR fusion images, which we explain here. In our algorithm, we first compress the RGB bands of the visible image into a single channel during the data processing stage, which results in the loss of color information. This is performed to ensure that the visible input and the SAR input have the same number of channels, which facilitates the overall execution of the algorithm.

Quantitative Comparison
To verify the superiority of the proposed algorithm more objectively, Table 2 shows a quantitative index comparison between our method and the other five methods, where the bolded data are the best values for each index. From the experimental data in Table 2, it can be seen that the SSIM metrics obtained by some methods are greater than 1, which appears counterintuitive. This is because we made a small change to the SSIM metric when calculating it (the calculation expression is shown in Equation (9)): the SSIM value reported in this paper is obtained by computing the SSIM between the fused image and the visible image and between the fused image and the SAR image, and then summing the two, which is why the metric may exceed 1. The MI, MSE, CC, PSNR, SCD, VIFF, Qabf, and other metrics are calculated similarly. In order to show the comparison effect more intuitively, we normalized the data in Table 2 and plotted them as a radar chart, as shown in Figure 9.
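The summed-SSIM variant described above can be sketched as follows. Here `ssim_global` is a single-window simplification of the usual sliding-window SSIM, used only to illustrate why the summed score can exceed 1; the function names are ours:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 255.0) -> float:
    """Single-window SSIM over the whole image (a simplification of the
    standard sliding-window formulation)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def ssim_sum(fused: np.ndarray, vis: np.ndarray, sar: np.ndarray) -> float:
    """The paper's variant: SSIM(F, V) + SSIM(F, S), so values can exceed 1."""
    return ssim_global(fused, vis) + ssim_global(fused, sar)
```

Since each term is at most 1, a perfect match to both sources would give a summed score of 2.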
Since this is just a simple representation of the relative merits of the metrics obtained by the different methods, the normalization we adopted is to set the maximum value of each metric to 1, while the values obtained by the other methods are taken as the ratio of their actual value to the maximum actual value for that metric. However, the MSE metric is special in that the smaller its value, the better the quality of the generated fused image. To keep the chart intuitive, we therefore normalize the MSE in the opposite direction: we set the minimum value to 1, while the values obtained by the other methods are the ratio of the minimum actual value to their actual value.
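The normalization procedure for the radar chart can be summarized in a short sketch (the function name is ours):

```python
def normalize_for_radar(scores: dict, lower_is_better: bool = False) -> dict:
    """Scale one metric's scores across methods to (0, 1] for a radar chart.

    Higher-is-better metrics: score / max(scores), so the best method maps to 1.
    Lower-is-better metrics (e.g. MSE): min(scores) / score, so the best
    (smallest) value still maps to 1.
    """
    if lower_is_better:
        best = min(scores.values())
        return {method: best / v for method, v in scores.items()}
    best = max(scores.values())
    return {method: v / best for method, v in scores.items()}
```

For example, MSE values {A: 2.0, B: 1.0} map to {A: 0.5, B: 1.0}, so the lowest-error method reaches the chart's outer ring.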
It can be seen that our proposed method performs well in all metrics except for two (PSNR and MSE) and is the best in the other 10 metrics, demonstrating that our method performs better in visible and SAR image fusion tasks. Specifically, our method performs best on the EN and MI metrics, indicating that it can fully mine and transfer the information from the source images to the fused images; it also performs best on the SSIM index, showing that it can retain the detailed information of the source images, being most similar to them; it performs best on the AG, EI, SD, SF, and Qabf metrics, indicating that the fused images obtained by our method are of higher quality and clarity; and it performs best on the SCD and VIF metrics, demonstrating that the fused images obtained by our method have better visual effects.


Ablation Studies
In this section, we validate the rationality of the different modules through a set of ablation experiments. Specifically, we conducted ablation studies on the dual-branch structure, the residual structure, the DropKey mechanism, the CBAM module, and the two-stage training used in our experiments. Based on this ablation experimental setup, we obtained the experimental results recorded in Table 3. From the comparative results, it is evident that in certain group comparisons, our method exhibited a slight deterioration in several metrics. However, the overall enhancement across the majority of metrics substantiates the rational design of our proposed structure. Furthermore, it is noteworthy that the metrics derived from the dual-branch experiments consistently and significantly surpassed those obtained solely from the CNN branch. As for the results obtained using only the Transformer branch, in the five image feature-based indexes, the Transformer-only method is even superior in four of them, which indicates that the Transformer branch we added has a strong feature extraction capability. However, in terms of overall performance, the dual-branch structure has a greater advantage in the other eight indicators, which indicates that the dual-branch structure adopted in this study is reasonable and effective. Additionally, in the ablation experiments involving DropKey and CBAM, our approach demonstrated notable improvements in the PSNR, MSE, and SSIM metrics. These results suggest that our method preserves more original image information and exhibits superior performance in representing details and textural features.
Additionally, Figure 10 intuitively displays the comparative results of the ablation experiments, highlighting certain aspects within red boxes to showcase detailed contrasts in the fusion images obtained by each group. An analysis of these comparisons reveals that the fusion images produced by our proposed method exhibit superior fusion quality. Specifically, the fusion images generated solely using the CNN branch contain more noise, which could hinder further image processing; certain texture details are lost in the fused image obtained without using residual structures; and the images resulting from only one-stage training also show some loss in textural detail, with weaker contrast between structures such as buildings and their backgrounds compared to our method. Furthermore, it is clearly visible that the fusion images obtained without employing DropKey and CBAM lack detailed textures and appear more blurred, demonstrating the significant role of DropKey and CBAM utilized in our study. In summary, the results of the ablation experiments show that our designed method is effective and rational.

Figure 1. Existing AE fusion algorithm frameworks. (a,b) represent the process based on a shared encoder, while (c) represents the process based on private encoders.


Figure 2. The dual-branch AE fusion algorithm framework designed in this paper.

First, addressing the lack of global feature extraction capability in CNNs, this paper introduces a dual-branch feature extraction network based on a Transformer and a CNN to separately extract and fuse global and local features from visible and SAR images. Second, addressing the potential loss of important feature information during the fusion process, this paper makes relevant improvements to the Transformer and CNN feature extraction models, enhancing the network's ability to capture important feature information. On one hand, based on the Transformer network structure, we introduce the LT [26] block, to balance fusion image quality against computational cost, and the DropKey [27] mechanism in the network's attention layer, to adaptively adjust attention weights so that the model focuses on more useful information. On the other hand, based on the CNN network model, we add the CBAM module, which enhances the network's focus on important areas by introducing channel attention and spatial attention mechanisms, thereby reducing the loss of important information. Third, regarding visible and SAR images, we believe that large-scale environmental features such as background and contours are highly similar across the two modalities, showing high correlation in global features, whereas textures and details differ between modalities and are relatively independent, showing low correlation in local features. Therefore, we promote the feature extraction capability for the two modalities by increasing the correlation of their global features and reducing the correlation of their local features. In summary, the main contributions of this paper are as follows:
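To make the DropKey idea concrete, the following NumPy sketch (our own simplification, not the paper's code) masks attention logits before the softmax, rather than dropping attention weights after it as standard attention dropout does; masked keys therefore receive exactly zero weight and the remaining weights are renormalized by the softmax itself:

```python
import numpy as np

def dropkey_attention(q, k, v, drop_ratio=0.1, rng=None, training=True):
    """Scaled dot-product attention with DropKey: random key masking
    applied to the logits BEFORE the softmax."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # (Lq, Lk) attention logits
    if training and drop_ratio > 0:
        rng = rng or np.random.default_rng()
        mask = rng.random(logits.shape) < drop_ratio
        logits = np.where(mask, -np.inf, logits)  # dropped keys get zero weight
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In the paper, this mechanism sits inside the attention layers of the Transformer branch; at inference time (`training=False`) the masking is disabled, as with ordinary dropout.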
Zamir et al. introduced the Restormer [38] structure, incorporating Multi-Dconv Head Transposed Attention (MDTA) modules and a Gated-Dconv Feed-forward Network (GDFN) for multi-scale local/global representation learning in high-resolution images.

Figure 6. Fusion method structure of this experiment. (a) First stage of training process; (b) second stage of training process; (c) DropKey principle; (d) CBAM principle.


3.1. Encoder
The encoder part is mainly used for feature extraction from the input images and consists of three parts: shallow feature extraction, global feature extraction, and local feature extraction. The specific details are as follows: Shallow Feature Extraction. Initially, the Restormer Block extracts shallow features from the input visible and SAR images, and their global/local features are then extracted from these shallow features. The Restormer Block has been proven to extract shallow image features without increasing computational cost, facilitating multi-scale global/local representation learning suitable for image reconstruction tasks.
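The data flow of the encoder described above can be summarized in a schematic sketch; the class and argument names here are hypothetical, since the paper does not publish code, and the three components stand in for the Restormer Block, the Transformer branch (LT + DropKey), and the CNN branch (with CBAM):

```python
class DualBranchEncoder:
    """Schematic encoder: shallow features feed both branches,
    which return global and local features respectively."""

    def __init__(self, shallow, transformer_branch, cnn_branch):
        self.shallow = shallow                        # e.g. a Restormer Block
        self.transformer_branch = transformer_branch  # global feature extractor
        self.cnn_branch = cnn_branch                  # local feature extractor

    def forward(self, image):
        s = self.shallow(image)           # cross-modal shallow features
        g = self.transformer_branch(s)    # global (contour/background) features
        loc = self.cnn_branch(s)          # local (texture/detail) features
        return g, loc
```

Each modality (visible and SAR) would be passed through such an encoder, yielding the global and local feature pairs that the fusion layer later combines.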

Figure 7. Three pairs of visible and SAR original images.


Figure 8. Comparison of fused images obtained by different methods. Red boxes highlight some detailed comparisons of the fused images from each method.


Figure 9. Radar chart of results of experimental metrics.


(1) Dual-branch structure: In this paper, we design a CNN-based and a Transformer-based dual-branch structure. To prove the effectiveness of the dual-branch structure, we design the following ablation experiments: (a) we use only the Transformer branch to complete the feature extraction, i.e., the CNN branch is replaced by the Transformer branch; (b) we use only the CNN branch to complete the feature extraction, i.e., the Transformer branch is replaced by the CNN branch.
(2) Residual structure: A comparative experiment is conducted between scenarios with and without the introduction of the residual structure.
(3) DropKey: For the Transformer branch, a comparative experiment is conducted between using the DropKey mechanism and not using it.
(4) CBAM: For the CNN branch, an experiment is conducted comparing the use of the CBAM module against not using it.
(5) Two-stage training: This experiment introduced two-stage training to enhance fusion performance. In the ablation study, a one-stage training method directly trains the encoder, fusion layer, and decoder. The number of training epochs is consistent with the total in the two-stage training, both being 140 epochs.
This article proposes a visible and SAR image fusion method based on a dual-branch residual structure combining Transformer and CNN networks. It introduces the LT and DropKey mechanisms into the Transformer-based feature extraction network and incorporates the CBAM module into the CNN-based feature extraction network to better extract global and local features from both modalities. In addition, we have made certain improvements to the overall fusion network architecture: we first fuse and concatenate the global features of the two modalities and then input the concatenated features, together with the local features of each modality, into the decoder for reconstruction. To this end, we have also designed a specific loss function for this task. Finally, through comparative experiments with five other methods and ablation experiments, we have demonstrated the effectiveness and feasibility of our proposed method.

Author Contributions: L.H.: Writing the original draft, Methodology, Investigation, Software; S.S.: Supervision, Validation; Z.Z. (Zhen Zuo): Methodology, Writing (review and editing); J.W.: Methodology, Project administration; S.H.: Methodology, Visualization, Conceptualization, Software; Z.Z. (Zongqing Zhao): Software, Conceptualization; X.T.: Resources, Data curation; S.Y.: Software, Resources. All authors have read and agreed to the published version of the manuscript.

Table 2. Comparison of evaluation metrics for different fusion methods.


Table 3. Results of ablation experiments.