DSA-Net: Infrared and Visible Image Fusion via Dual-Stream Asymmetric Network

Infrared and visible image fusion technologies are used to characterize the same scene using diverse modalities. However, most existing deep learning-based fusion methods are designed as symmetric networks, which ignore the differences between the two modalities and lose source image information during feature extraction. In this paper, we propose a new fusion framework tailored to the different characteristics of infrared and visible images. Specifically, we design a dual-stream asymmetric network with two different feature extraction networks to extract infrared and visible feature maps, respectively. The transformer architecture is introduced in the infrared feature extraction branch, which encourages the network to focus on the local features of infrared images while still capturing their contextual information. The visible feature extraction branch uses residual dense blocks to fully extract the rich background and texture detail information of visible images. In this way, the network provides better infrared targets and visible details for the fused image. Experimental results on multiple datasets indicate that DSA-Net outperforms state-of-the-art methods in both qualitative and quantitative evaluations. In addition, we apply the fusion results to a target detection task, which indirectly demonstrates the fusion performance of our method.


Introduction
Image fusion can combine images of the same scene captured by different sensors to obtain an image with rich information to make up for the shortage of information in single-sensor imaging, which is beneficial to the subsequent application of images. Infrared (IR) and visible (VIS) image fusion is a widely used branch of image fusion applications. The infrared images are obtained by the sensor capturing the infrared wavelength of the scene with significant thermal radiation information, which can effectively distinguish the target even under poor lighting or extreme weather conditions. However, the target contour edges, as well as the background in the infrared images, are always blurred. On the contrary, the visible image records the reflected light captured by the sensor and has rich texture details and structure information, so it is in accordance with human visual cognition. The infrared and visible fusion algorithm combines the advantages of both to generate a fused image with prominent targets and abundant texture information, which is widely used in military reconnaissance [1], industrial production [2], civilian surveillance [3], and other fields [4].
The purpose of infrared and visible image fusion is to extract and integrate the essential feature information from source images acquired by distinct imaging devices into a single fused image. Therefore, extracting the significant features of the fusion image is one of the central problems. Over the past few decades, numerous fusion methods have been proposed by researchers, which can be roughly divided into two categories: traditional fusion methods [5][6][7] and deep learning-based fusion methods [8][9][10]. Traditional fusion methods measure pixels' salience in the spatial domain or transform domain, and later ablation experiments, generalization experiments, and applications in target detection. Finally, conclusions are given in Section 5.

Related Work
In this section, we first review the existing infrared and visible image fusion algorithms in Section 2.1, followed by a brief introduction to the transformer in Section 2.2.

Deep Learning-Based Fusion Methods
In 2017, Liu et al. [20] first used CNN to extract image features and design fusion strategies to achieve image fusion tasks. However, this method was limited to multi-focus image fusion tasks. Later, Liu et al. [21] presented an analogous approach for infrared and visible image fusion. They used CNN to obtain the weight map and obtained the fused image through a series of post-processing. These two methods use CNN for feature extraction only; other parts of the fusion framework still need to be designed manually without completely removing the traditional algorithms. In 2018, Li et al. [22] proposed an encoding-decoding framework that uses dense concatenation in the encoder to fully extract feature maps and also designs a fusion strategy to combine the extracted features. Then, the decoder is utilized to decode the features and reconstruct the fused image. However, this approach still requires the manual design of fusion rules and cannot fully achieve end-to-end fusion. In 2019, Ma et al. [23] introduced generative adversarial networks (GAN) into infrared and visible image fusion, using a pair of simple generators and discriminators to obtain fused images. Subsequently, Ma et al. [24] proposed a model called DDcGAN, which uses a dual discriminator to reduce the loss of source image information. However, GAN is not stable in unsupervised learning tasks, and the fused image edge contours may be blurred. In 2020, Xu et al. [25] proposed a unified fusion model that trains the model by learning multiple fusion tasks continuously to avoid catastrophic forgetting, storage, and computation problems. In 2021, Liu et al. [26] proposed a deep network for infrared and visible image fusion using a feature learning module with a fusion learning mechanism to optimize the fusion effect. In 2022, Tang et al. [27] proposed a Y-shape fusion framework and used a dynamic transformer module to acquire local features and important contextual information. 
These methods have designed symmetric models or single-branch models for feature extraction of infrared and visible source images, ignoring the differences in modal features between the two, resulting in the loss of intermediate extracted feature details.

Transformer
In 2017, Vaswani et al. first proposed the transformer [28] to capture more long-range information; by employing multi-head self-attention, it overcomes the inherent difficulty of CNNs in retaining long-range information. Since then, transformers have swept the field of natural language processing (NLP) [29,30]. In 2020, Dosovitskiy et al. proposed the vision transformer (ViT) for image classification [31], which was the first application of a transformer in the field of vision. Since then, transformers have been extensively developed in the field of vision, for example, a new transformer network for medical image segmentation [32], an end-to-end video instance segmentation method [33], a pure transformer for semantic segmentation [34], and even better visual transformer models [35,36] for other vision tasks. Recently, transformers have also been widely used in image fusion tasks. In 2021, VS et al. [37] proposed a transformer-based multi-scale fusion strategy that captures local and global features using spatial CNN branches and transformer branches for multi-scale feature fusion. Zhao et al. [38] used density nets for encoding and a dual transformer to focus and integrate information from the infrared and visible images. Subsequently, Fu et al. [39] presented a patch pyramid transformer (PPT) for image fusion; a patch transformer is designed to transform the image into a series of patches, and a pyramid transformer is then used for feature extraction. Rao et al. [40] developed a lightweight fusion framework by combining a transformer and adversarial learning, where a generator was designed for generating the fused image and two discriminators for optimizing the perceptual quality of the fused images.

Proposed Method
In this section, we present the structure of our method and the loss function in detail.

Framework Overview
Figure 1 shows the framework of our proposed network. The main framework consists of three parts: the infrared feature extraction module, the visible feature extraction module, and the merge module. Since infrared images contain strong thermal radiation information and can effectively distinguish targets, we use a combination of CNN and transformer for infrared image feature extraction. The transformer is designed to model global dependencies, and its network architecture is shown in Figure 2. Visible images are rich in texture details and background information, so we use densely connected convolution layers to extract local features; this network architecture is shown in Figure 3. Using these two branches, the useful information from the source images can be fully extracted. Then, these features are concatenated and fed into a merge module. This module uses convolution layers with skip connections to form a continuous memory mechanism, which can adaptively learn more effective features from previous and current local features and stabilize the training of the wider network. The decoding block is used for subsampling and for generating the fusion result and consists, in order, of a convolution layer with a kernel size of 3 × 3, batch normalization (BN), and a rectified linear unit (ReLU). Since our network is an end-to-end network, the output of the network is the fused image.

Infrared Feature Extraction Module
Infrared images have strong target information. In order to obtain infrared feature maps with local enhancement, we use a combination of CNN and transformer to extract infrared image features. The transformer, shown in Figure 2, adopts multi-head self-attention and has a good capability for exploring global contextual features. It consists of two LayerNorm layers, multi-head self-attention (MSA), and a multi-layer perceptron (MLP). LayerNorm normalizes the features, which keeps the statistical properties of different channels similar and enhances the generalization ability of the model. After normalization, the features are linearly projected into multiple feature subspaces to obtain the queries Q, keys K, and values V. Then, multiple independent scaled dot-product attention operations are performed in parallel, as shown in Figure 4a. Compared with single-head attention, MSA can effectively prevent the model from over-focusing on its own location when encoding information about the current location. The scaled dot-product attention is shown in Figure 4b. The similarity matrix is obtained as the dot product of Q and K. Scale denotes the scaling operation, which prevents the variance of the similarity matrix from becoming too large and makes the training gradient updates more stable. Mask handles padded positions: instead of the ordinary zero padding, these positions are filled with negative infinity, so that after normalization by the Softmax layer their attention weights are 0 and they do not affect the subsequent operations. Finally, the attention map is generated by multiplying the values V by the corresponding attention weights. The attention process can be expressed as Equation (1):
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
where $d_k$ is the dimension of the keys. The MLP is shown in Figure 4c and consists of fully connected layers, the GELU activation function, and Dropout. The MLP layer performs nonlinear transformations on the features, which makes the model better adapted to complex image tasks. Moreover, the MLP layer can extract higher-level features from the input features, which can represent information such as objects and backgrounds in an image. The residual connections in the transformer effectively alleviate vanishing gradients and the degradation of the weight matrix. Through the above operations, the transformer uses the self-attention mechanism to establish the relationships between image features, which captures the global information and long-distance dependencies in an image.
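The scaled dot-product attention with a negative-infinity mask described above can be sketched in a few lines. This is a minimal single-head illustration in plain Python lists, not the implementation used in this work; the function names are hypothetical, and the sketch assumes every query keeps at least one unmasked key.

```python
import math

def softmax(row):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0,
    # so masked positions receive exactly zero attention weight.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: n x d_k, K: m x d_k, V: m x d_v (lists of lists).

    mask: optional n x m booleans, True = keep, False = padded position.
    Assumption: each query row has at least one True entry.
    """
    scale = math.sqrt(len(Q[0]))  # sqrt(d_k) keeps the score variance bounded
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Similarity between this query and every key, then scaling.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        if mask is not None:
            # Padded positions are filled with -inf before the softmax.
            scores = [s if keep else float("-inf")
                      for s, keep in zip(scores, mask[i])]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(d_v)])
    return out
```

With one query, two keys, and the second key masked out, the output equals the first value vector, which shows that padded positions do not influence the result.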

Visible Feature Extraction Module
Visible images have a higher spatial resolution and contain more texture details. Therefore, we design a visible feature extraction module with a residual dense block (RDB) [41] to extract the visible features. In this module, we first extract shallow visible features using three convolution layers, followed by deep feature extraction using the RDB. The RDB consists of five convolution layers, as shown in Figure 3. Each convolution layer can access the features of all previous layers through local dense connections, thus making full use of the features of each layer. The final convolution layer filters all the previous features and adaptively controls the output information. Finally, the shallow and deep features are combined through a residual connection, which improves gradient flow and effectively prevents vanishing gradients. Through this module, the local features of the visible image are fully extracted, the degradation of intermediate features is prevented, and a feature map with rich texture details is obtained.
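To make the dense connectivity concrete, the following sketch only tracks how many input channels each convolution layer of an RDB sees. It assumes that four of the five layers are densely connected and the fifth fuses all previous features; the function name, the input width of 64, and the growth rate of 32 are hypothetical values chosen for illustration and are not taken from this work.

```python
def rdb_channel_plan(in_channels, growth, num_dense_layers=4):
    """Return (input channels of each dense layer, input channels of the fusion layer).

    Each dense layer receives the block input concatenated with the outputs
    of all earlier dense layers, so its input width grows by `growth` per layer.
    """
    plan = []
    channels = in_channels
    for _ in range(num_dense_layers):
        plan.append(channels)   # block input + all earlier layer outputs
        channels += growth      # this layer contributes `growth` new channels
    # The final layer sees every previous feature map and would map them
    # back to `in_channels` so the residual addition is shape-compatible.
    return plan, channels

dense_inputs, fusion_input = rdb_channel_plan(in_channels=64, growth=32)
print(dense_inputs, fusion_input)
```

Under these assumed widths, the fusion layer receives far more channels than the block input, which is why a final filtering layer is needed before the residual connection.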

Merge Module
The features obtained from the infrared feature extraction module and the visible feature extraction module are concatenated as the input to the merge module. The merge module consists of ten convolution layers and skip connections. Each convolution layer consists of a 3 × 3 convolution, BN, and a ReLU activation function. The first five convolution layers are used to extract the deep features of the infrared and visible images. As the network depth increases, feature degradation is more likely to arise, which is addressed by incorporating skip connections. The skip connections also pass the features learned by the previous layer to the current layer, which achieves feature reuse. Finally, the last five convolution layers decode the extracted deep features to obtain the fused image.

Loss Function
Since our method is unsupervised, the loss function plays a crucial role in the fusion effect. It is an important challenge to fully retain the features of the source images, such as the salient targets in infrared images and the detailed textures in visible images. Therefore, in order to fully retain the source image information, our loss function consists of three types of loss terms: structure loss $L_{ssim}$, intensity loss $L_{int}$, and gradient loss $L_{grad}$. The structure loss constrains the similarity between the fused image and the source images. The intensity loss constrains the fused image to maintain an intensity distribution similar to that of the source images, while the gradient loss enforces the presence of additional texture details in the fused image. The loss function of the network can be expressed as follows:
$$L_{total} = \alpha L_{ssim} + \beta L_{int} + \gamma L_{grad}$$
where $\alpha$, $\beta$, and $\gamma$ are the weighting factors of the three loss terms, which are used to balance the total loss. The structure loss ensures that the fused image has structural information similar to that of the source images, and can be expressed as
$$L_{ssim} = \lambda_{ssim}^{VIS}\left(1 - SSIM(I_F, I_{VIS})\right) + \lambda_{ssim}^{IR}\left(1 - SSIM(I_F, I_{IR})\right)$$
where $I_{VIS}$, $I_{IR}$, and $I_F$ denote the visible image, the infrared image, and the fused image, respectively. $\lambda_{ssim}^{VIS}$ and $\lambda_{ssim}^{IR}$ represent the SSIM loss weights between the fused image and the visible and infrared images. $SSIM(\cdot)$ denotes the structural similarity operation between the fused image and a source image, which is defined as follows:
$$SSIM(X, Y) = \frac{2\mu_X\mu_Y + C_1}{\mu_X^2 + \mu_Y^2 + C_1} \cdot \frac{2\sigma_X\sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2} \cdot \frac{\sigma_{XY} + C_3}{\sigma_X\sigma_Y + C_3}$$
where $\mu$ denotes the mean, $\sigma_X$ and $\sigma_Y$ denote the standard deviations, and $\sigma_{XY}$ denotes the covariance. $C_1$, $C_2$, and $C_3$ are constants that keep the formula stable when $\mu^2$ or $\sigma_X\sigma_Y$ is close to 0. SSIM thus constrains the loss and distortion of the fused image in terms of the similarity of brightness, contrast, and structural information.
The SSIM loss function only weakly constrains pixel intensity, while the significant targets in infrared images have high pixel intensity. Therefore, we also design the intensity loss to retain the infrared targets in the source image:
$$L_{int} = \frac{1}{HW}\left(\lambda_{int}^{VIS}\left\|I_F - I_{VIS}\right\|_2 + \lambda_{int}^{IR}\left\|I_F - I_{IR}\right\|_2\right)$$
where $\lambda_{int}^{VIS}$ and $\lambda_{int}^{IR}$ represent the intensity loss weights between the fused image and the visible and infrared images. $H$ and $W$ denote the height and width of the fused image, and $\|\cdot\|_2$ is the $l_2$-norm. In addition, we use the gradient loss to constrain the fused image to retain the detailed textures of the visible image as well as the target edges of the infrared image:
$$L_{grad} = \frac{1}{HW}\left(\lambda_{grad}^{VIS}\left\|\nabla I_F - \nabla I_{VIS}\right\|_2 + \lambda_{grad}^{IR}\left\|\nabla I_F - \nabla I_{IR}\right\|_2\right)$$
where $\lambda_{grad}^{VIS}$ and $\lambda_{grad}^{IR}$ represent the gradient loss weights between the fused image and the visible and infrared images, and $\nabla$ denotes the gradient operator. By optimizing the above loss function, the fused image can well retain the structural information, intensity information, and gradient information of the source images. We hope that the fused image retains more structural information of the visible image, the combined gradient information of both images, and more intensity information of the infrared image. Therefore, the loss weights described above should meet the following conditions:
$$\lambda_{ssim}^{VIS} > \lambda_{ssim}^{IR}, \quad \lambda_{int}^{VIS} < \lambda_{int}^{IR}, \quad \lambda_{grad}^{VIS} = \lambda_{grad}^{IR}$$
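The three-term loss can be illustrated with a small pure-Python sketch on images stored as lists of rows with values in [0, 1]. Several choices here are assumptions made only for illustration: SSIM is computed from global image statistics rather than sliding windows, the structure loss uses the common 1 − SSIM form, the gradient is a forward difference, the constants C1 to C3 follow a common convention, and the λ weights are hypothetical values that merely favor visible structure and infrared intensity. The defaults for α, β, and γ are the values reported in the experimental section.

```python
import math

def _flat(img):
    return [p for row in img for p in row]

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Three-term SSIM from global statistics (assumption: no sliding window).
    c3 = c2 / 2
    fx, fy = _flat(x), _flat(y)
    n = len(fx)
    mx, my = sum(fx) / n, sum(fy) / n
    vx = sum((p - mx) ** 2 for p in fx) / n
    vy = sum((q - my) ** 2 for q in fy) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(fx, fy)) / n
    sx, sy = math.sqrt(vx), math.sqrt(vy)
    luminance = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    contrast = (2 * sx * sy + c2) / (vx + vy + c2)
    structure = (cov + c3) / (sx * sy + c3)
    return luminance * contrast * structure

def l2_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(_flat(a), _flat(b))))

def gradient(img):
    # Forward-difference gradient magnitude |dx| + |dy|, zero at the far border.
    h, w = len(img), len(img[0])
    return [[abs(img[i][min(j + 1, w - 1)] - img[i][j])
             + abs(img[min(i + 1, h - 1)][j] - img[i][j])
             for j in range(w)] for i in range(h)]

def fusion_loss(fused, vis, ir, alpha=1.1, beta=10.0, gamma=10.0,
                lam_ssim=(0.6, 0.4), lam_int=(0.4, 0.6), lam_grad=(0.5, 0.5)):
    # Each lam_* pair is (visible weight, infrared weight); hypothetical values.
    hw = len(fused) * len(fused[0])
    l_ssim = (lam_ssim[0] * (1 - ssim_global(fused, vis))
              + lam_ssim[1] * (1 - ssim_global(fused, ir)))
    l_int = (lam_int[0] * l2_distance(fused, vis)
             + lam_int[1] * l2_distance(fused, ir)) / hw
    gf, gv, gi = gradient(fused), gradient(vis), gradient(ir)
    l_grad = (lam_grad[0] * l2_distance(gf, gv)
              + lam_grad[1] * l2_distance(gf, gi)) / hw
    return alpha * l_ssim + beta * l_int + gamma * l_grad
```

When the fused image equals both source images, every term vanishes and the loss is numerically zero; any disagreement with either source makes it positive.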

Experiments
The experimental configuration and experimental details will be outlined in Section 4.1. Then, we present the comparison methods and objective evaluation metrics in Section 4.2.
The ablation experiments on the network structure are presented in Section 4.3, demonstrating the rationality of our network structure. The comparison experiments and generalization experiments are presented in Sections 4.4 and 4.5, respectively, revealing the superiority of our proposed method. Finally, we perform target detection task-driven eval-

Experimental Configuration and Experimental Details
Two mainstream datasets, the TNO dataset and the RoadScene dataset, were used in this work. We collected 51 and 83 infrared and visible image pairs from these two datasets, respectively. Then, 50 pairs were randomly selected from the RoadScene dataset as the training data, while all the remaining image pairs were used as the test data. To obtain sufficient training samples, the training data were expanded using an overlapping cropping strategy, which is a widely used data augmentation method in the image domain. In our experiments, the RGB images in the RoadScene dataset were converted to the YUV color model, the Y channel was used for image fusion, and finally, the fused images were converted back to RGB images.
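The color handling described above can be sketched as follows. The text does not state which YUV variant was used, so this example assumes the BT.601 luma weights with the classic analog U/V scaling; all function names are hypothetical. Only the Y channel is replaced by the fused luminance, while U and V are taken from the visible image.

```python
# Assumed BT.601 luma weights and analog YUV chroma scaling factors.
KR, KG, KB = 0.299, 0.587, 0.114
U_SCALE, V_SCALE = 0.492, 0.877

def rgb_to_yuv(r, g, b):
    y = KR * r + KG * g + KB * b
    return y, U_SCALE * (b - y), V_SCALE * (r - y)

def yuv_to_rgb(y, u, v):
    # Exact algebraic inverse of rgb_to_yuv.
    r = y + v / V_SCALE
    b = y + u / U_SCALE
    g = (y - KR * r - KB * b) / KG
    return r, g, b

def recolor_fused(rgb_vis, fused_y):
    """Combine the fused Y channel with the chroma (U, V) of the visible image.

    rgb_vis: rows of (r, g, b) tuples; fused_y: rows of fused luminance values.
    """
    out = []
    for rgb_row, y_row in zip(rgb_vis, fused_y):
        new_row = []
        for (r, g, b), y_f in zip(rgb_row, y_row):
            _, u, v = rgb_to_yuv(r, g, b)       # keep visible chroma
            new_row.append(yuv_to_rgb(y_f, u, v))  # swap in fused luminance
        out.append(new_row)
    return out
```

If the fused luminance equals the original Y channel, `recolor_fused` returns the visible image unchanged up to floating-point error, which is a quick sanity check for the conversion pair.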
Specifically, 40,964 pairs of infrared and visible image patches of size 120 × 120 were generated for network training. Since the cropping strategy is only used for data expansion, it is not applied to the test data; fusion results are generated by feeding the entire image into the trained model. In our experiments, the number of epochs was 25 and the batch size was fixed at 29. The learning rate was set to 0.001, and the Adam optimizer was used for model optimization. The three weighting factors α, β, and γ in the loss function were set to 1.1, 10, and 10, respectively. All experiments were conducted on a computer with an Intel(R) Core(TM) i9-10920X CPU @ 3.50 GHz and an NVIDIA GeForce RTX 3090 GPU. The proposed deep model was implemented in the PyTorch framework.
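An overlapping cropping strategy of this kind can be sketched with a sliding window. The patch size of 120 comes from the text, but the stride is not reported, so the value of 60 below is purely an assumption, and the function names are hypothetical.

```python
def patch_origins(height, width, patch=120, stride=60):
    # Top-left corners of all patches that fit entirely inside the image.
    # stride < patch makes neighbouring patches overlap (stride is assumed).
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]

def crop_pairs(ir, vis, patch=120, stride=60):
    """Crop aligned patches from a registered infrared/visible pair.

    ir, vis: images of identical size stored as lists of rows.
    Returns a list of (ir_patch, vis_patch) tuples.
    """
    if len(ir) != len(vis) or len(ir[0]) != len(vis[0]):
        raise ValueError("source images must have the same size")
    pairs = []
    for r, c in patch_origins(len(ir), len(ir[0]), patch, stride):
        ir_patch = [row[c:c + patch] for row in ir[r:r + patch]]
        vis_patch = [row[c:c + patch] for row in vis[r:r + patch]]
        pairs.append((ir_patch, vis_patch))
    return pairs
```

Because the same origins are used for both modalities, every infrared patch stays pixel-aligned with its visible counterpart, which the loss terms rely on.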

Comparison Methods and Evaluation Indicators
To ensure a thorough evaluation of the proposed algorithm, we performed experiments on both the RoadScene and TNO datasets. We compared our approach with nine state-of-the-art methods, including three representative traditional methods, namely GF [42], ADF [43], and IVFusion [44], and six deep learning-based methods, namely DenseFuse [22], GAN-FM [45], DDcGAN [24], YDTR [27], CUFD [46], and DATFuse [47]. The implementations of all nine methods are publicly available, and we set their optional parameters as reported in the original papers.
For quantitative evaluation, six metrics were selected to objectively assess the fusion performance: the structural similarity index measure (SSIM) [48], mean square error (MSE) [49], correlation coefficient (CC) [50], peak signal-to-noise ratio (PSNR) [51], the sum of the correlations of differences (SCD) [52], and the Chen-Blum metric (Q_CB) [53]. SSIM evaluates the structural loss and distortion of the fused images from the perspective of the human visual system, and MSE calculates the error between the fused images and the source images. CC measures the degree of linear correlation between the fused images and the source images. PSNR measures the ratio of peak power to noise power in the fused images. SCD measures how much information from each source image is contained in the fused images. Q_CB evaluates the quality of the fused images based on a model of the human visual system. Larger SSIM, CC, PSNR, SCD, and Q_CB values and smaller MSE values indicate better fusion performance.
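Three of these metrics (MSE, PSNR, and CC) are simple enough to sketch directly. The example below follows a convention that is common in the fusion literature, averaging each metric over the two source images and deriving PSNR from the averaged MSE; whether the cited definitions [49-51] match this exactly is an assumption, and the function names are hypothetical. Images are lists of rows with values in [0, peak].

```python
import math

def _flat(img):
    return [p for row in img for p in row]

def mse(a, b):
    fa, fb = _flat(a), _flat(b)
    return sum((p - q) ** 2 for p, q in zip(fa, fb)) / len(fa)

def psnr_from_mse(err, peak=1.0):
    # Ratio of peak power to noise power in decibels; infinite for zero error.
    return float("inf") if err == 0 else 10.0 * math.log10(peak ** 2 / err)

def cc(a, b):
    # Pearson correlation coefficient; undefined for constant images.
    fa, fb = _flat(a), _flat(b)
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    num = sum((p - ma) * (q - mb) for p, q in zip(fa, fb))
    den = math.sqrt(sum((p - ma) ** 2 for p in fa)
                    * sum((q - mb) ** 2 for q in fb))
    return num / den

def fusion_scores(fused, ir, vis, peak=1.0):
    # Assumed convention: average each metric over both source images.
    err = (mse(fused, ir) + mse(fused, vis)) / 2
    return {
        "MSE": err,
        "PSNR": psnr_from_mse(err, peak),
        "CC": (cc(fused, ir) + cc(fused, vis)) / 2,
    }
```

A fused image that is a shifted copy of a source image has a CC of 1 with that source but a nonzero MSE, which illustrates why the metrics are reported together rather than individually.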

Ablation Experiments
To investigate the effectiveness of our asymmetric network structure, transformerbased infrared feature extraction module, and RDB-based visible feature extraction module, we performed ablation validation on the TNO and RoadScene datasets. We divided the model structure into five groups of types. (a) Transformer-based dual-stream symmetric network (D-Trans)-in order to verify the effectiveness of the asymmetric network structure, we applied the infrared feature extraction module of this paper to visible image feature extraction and constructed a symmetric network. (b) RDB-based dual-stream symmetric network (D-RDB)-to verify the effectiveness of the asymmetric network structure, we applied the visible feature extraction module of this paper to infrared image feature ex-  6 show the fusion results of the TNO and RoadScene datasets, respectively. To allow for better comparison, we zoomed in for a close-up of a local area in each fusion result. From the fusion results of D-Trans and D-RDB, it can be seen that we changed the asymmetric network to a two-stream symmetric network, resulting in blurred edges of infrared targets and insufficient clarity of the scene, which indicates that the proposed asymmetric network has better complementary information preservation capabilities. In the absence of the transformer module, the infrared feature extraction module cannot capture the infrared protruding target well due to the failure to build the long-distance dependency. Therefore, the infrared character target in Figure 5 and the clouds in the sky in Figure 6 are relatively blurry. As for the case without RDB, we can see that its results fail to fully extract the visible details; although the infrared target is better maintained, the background texture of the fused image and the landmark lines of the RoadScene are not clear enough. 
In addition, the E-FEM fusion is not well preserved in both infrared target and texture details, which proves the effectiveness of the transformer and RDB for infrared and visible feature extraction, respectively.

Quantitative Comparisons
To evaluate the ablation experiments more objectively, we assessed the quality of their fusion results using image quality metrics. Table 1 shows the objective results on the two datasets. The top-performing results are highlighted in bold, and the second-best results are underlined. Our final method has the best overall ranking on both the TNO and RoadScene datasets. Together with the subjective evaluation, this demonstrates the effectiveness of our network structure and of the individual modules in the network.
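As an illustration of how best and second-best entries and an overall ranking can be derived from a table of metric values, the following minimal sketch ranks methods per metric and averages the ranks. The method names, the numbers, and the use of the mean rank as the overall score are assumptions made for this example; they are not the values of Table 1, and the text does not state how the overall ranking was computed.

```python
# Placeholder scores for three hypothetical methods; NOT values from Table 1.
scores = {
    "method_a": {"SSIM": 0.70, "PSNR": 60.0, "MSE": 0.060},
    "method_b": {"SSIM": 0.65, "PSNR": 62.0, "MSE": 0.045},
    "method_c": {"SSIM": 0.72, "PSNR": 63.0, "MSE": 0.050},
}

# Assumed direction of each metric: True means a larger value is better.
higher_is_better = {"SSIM": True, "PSNR": True, "MSE": False}


def rank_per_metric(scores, higher_is_better):
    """Return {metric: [method names ordered from best to worst]}."""
    order = {}
    for metric, larger_wins in higher_is_better.items():
        order[metric] = sorted(
            scores, key=lambda m: scores[m][metric], reverse=larger_wins
        )
    return order


def mean_rank(order):
    """Average 1-based rank of each method across all metrics (assumed aggregate)."""
    methods = next(iter(order.values()))
    return {
        m: sum(names.index(m) + 1 for names in order.values()) / len(order)
        for m in methods
    }


order = rank_per_metric(scores, higher_is_better)
overall = mean_rank(order)

for metric, names in order.items():
    # names[0] would be set in bold and names[1] underlined in a results table.
    print(f"{metric}: best={names[0]}, second={names[1]}")
print("overall (best first):", sorted(overall, key=overall.get))
```

Other aggregates, such as counting first places, are equally plausible; the mean rank is used here only because it is simple to state.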

Comparative Experiments
To fully evaluate the fusion performance of our approach, we first compared the proposed method with nine other algorithms on the RoadScene dataset.

Qualitative Comparisons
We randomly selected 50 of the 83 infrared and visible image pairs as the training set and used the remaining 33 image pairs as the test set. As shown in Figure 7, our fusion results outperform the other methods in visual quality and in the integration of complementary information. To show the difference more clearly, we zoom in on the area in the red box. The traditional methods ADF and GF lose clothing texture details, and their salient targets are only moderately prominent. The IVFusion results do not match human visual perception and look unnatural. DDcGAN and DATFuse retain texture details, but their image quality is poor, with significant artifacts. DenseFuse, CUFD, and GAN-FM maintain the infrared intensity information and have high overall contrast, but the detail information of the visible images is severely weakened (e.g., stripes on clothes, bicycle markings on the ground), and YDTR introduces too much useless information.
Figure 8 shows a second set of source images and the fused results of the different methods. All nine methods have their own advantages but still have some drawbacks compared with our method. Specifically, both IVFusion and DDcGAN are visually inferior to all other methods. In terms of texture detail preservation, GAN-FM and YDTR are inevitably contaminated by infrared thermal radiation information, which blurs the background and visible features (e.g., patterns in the zoomed-in regions, distant tree branches); however, they do retain sufficient infrared salient target information. In contrast, ADF, GF, DenseFuse, CUFD, and DATFuse balance visible and infrared information, highlighting salient targets while retaining rich texture details. They are still inferior to DSA-Net: in the enlarged area of the red box, only our method clearly shows the pattern on the clothes. In summary, only our method effectively integrates the complementary information from the source images and simultaneously ensures the visual quality of the fused image.
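The random split of the 83 image pairs into 50 training pairs and 33 test pairs, described at the start of this subsection, could be reproduced along the following lines. This is a minimal sketch: the integer pair identifiers, the function name `split_pairs`, and the fixed seed are assumptions for illustration, and the split actually used in the experiments is not specified in the text.

```python
import random


def split_pairs(pair_ids, n_train, seed=0):
    """Shuffle a copy of pair_ids with a fixed seed and split it into train/test lists."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(pair_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers 0..82 standing in for the 83 image pairs.
pair_ids = list(range(83))
train_ids, test_ids = split_pairs(pair_ids, n_train=50)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split repeatable, which matters when the same test pairs are reused for the quantitative comparison.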

Quantitative Comparisons
We selected 33 image pairs from the RoadScene dataset for quantitative evaluation. The quantitative results for the six statistical metrics are shown in Figure 9 and Table 2. For each metric, the best and second-best results among all methods are marked in bold and underlined, respectively. Our method is stable and performs well on the RoadScene dataset, ranking highly across all metrics. Moreover, the fused images are strongly correlated with the source images, indicating that our method is highly compatible with the human visual system. Together, the qualitative and quantitative results on the RoadScene dataset demonstrate that our method generates fused images that match human visual perception while preserving as much information from the source images as possible.
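To make the reference-based metrics more concrete, the sketch below computes three of the metrics reported in this paper (MSE, PSNR, and CC) with NumPy. Averaging each metric over the infrared and visible source images is a common convention in the fusion literature and is an assumption here, as are the function names, the 8-bit peak value of 255, and the random test images; the exact formulas used for the tables are not reproduced in this section.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two images of equal shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    return float("inf") if err == 0 else float(10.0 * np.log10(peak ** 2 / err))


def cc(a, b):
    """Pearson correlation coefficient between the pixel values of two images."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.corrcoef(a, b)[0, 1])


def fusion_score(metric, ir, vis, fused):
    """Average a reference-based metric over both source images (assumed convention)."""
    return 0.5 * (metric(ir, fused) + metric(vis, fused))


# Random stand-in images; a real evaluation would load registered image pairs.
rng = np.random.default_rng(0)
ir = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
vis = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
fused = 0.5 * (ir + vis)  # naive pixel averaging, used only as a placeholder fusion

for name, fn in [("MSE", mse), ("PSNR", psnr), ("CC", cc)]:
    print(name, round(fusion_score(fn, ir, vis, fused), 3))
```

Note that a lower MSE is better, while higher PSNR and CC values are better, which has to be taken into account when ranking methods.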

Generalization Experiments
Generalization performance is an important aspect of evaluating deep learning-based methods. Therefore, we provide generalization experiments on the TNO dataset to demonstrate the generalizability of the proposed approach. Note that our fusion model is trained on the RoadScene dataset and tested directly on the TNO dataset.

Subjective Results
As shown in Figure 10, the fusion results obtained by the other methods on the TNO dataset introduce some meaningless information, which is reflected in the loss of texture details and the weakening of salient targets. To visualize the effect of the fused images, we zoom in on the region with rich texture details in the red box. Compared with our method, the three traditional methods, ADF, IVFusion, and GF, lose some door frame details because the texture details in the background region are contaminated by thermal radiation information. IVFusion in particular fails to preserve the useful information of the source images, and its overall visual effect is poor. DDcGAN has a limited ability to extract texture details from the visible image; it suffers from distortion and cannot preserve the sharp edges of the target. For YDTR, the intensity of the salient targets is diminished to varying degrees, and the overall contrast is low. DenseFuse, CUFD, DATFuse, and GAN-FM are less affected by useless information, but texture details are still lost, and the edges of the door frame are not clear enough. Overall, DSA-Net provides a good visual effect. On the one hand, our method maintains clear background information, such as bright skies, layered grasses, and door frames with distinct edges; on the other hand, the main salient information from the infrared image is clearly highlighted.
The second set of source images and the fused results of the different methods are shown in Figure 11. Our fusion results are better than those of the other nine methods in terms of visual effect, preservation of texture details, and salient targets. CUFD, DATFuse, GAN-FM, and YDTR do not retain enough texture details: the background information of tree branches and railings is blurred, and the infrared intensity is too low, resulting in low contrast between light and dark in the fused image. Although ADF and GF retain relatively clear background information, the texture-rich areas in the red box show that our method still achieves the best fusion effect.

Figure 11. Fusion images of "forest trail" for different methods, along with their respective source images.

Objective Results
We selected 51 image pairs from the TNO dataset for quantitative evaluation. Figure 12 displays the performance metrics for each fusion result, while Table 3 shows the average metric values for these fusion methods. For each metric, the best and second-best results among all methods are marked in bold and underlined, respectively. As can be seen in Figure 12 and Table 3, our method has significant advantages on SSIM, MSE, CC, PSNR, and QCB on the TNO dataset. This implies that our fused images have the best visual effect and contain rich texture information and infrared salient target information. In addition, our method ranks second in SCD, which indicates that it transfers sufficient source image information to the fused images.
In summary, extensive qualitative and quantitative results on the TNO dataset show that our method generalizes well, is stable, and retains sufficient texture details and intensity information. We attribute this advantage to two aspects. On the one hand, we design an asymmetric network for the different modal characteristics of infrared and visible images, preserving the thermal radiation information of infrared images and the texture details of visible images, respectively. On the other hand, we embed the transformer into the CNN, which allows the network to preserve global and local features to the maximum extent.

Detecting Performance
Target detection is an important research direction in computer vision, and its performance reflects the semantic information integrated into the fused images. To evaluate the target detection performance of the fused images, we used the YOLOX detector [54]. We conducted experiments on 50 randomly selected image pairs from the MFNet dataset, including 25 nighttime pairs and 25 daytime pairs that depict urban scenes. Figures 13-15 show some typical source images and the detection results of the different methods. From the visualization results, we find that visible images contain rich background information, but salient targets are difficult to detect in them, while infrared images provide sufficient semantic information about salient targets (e.g., people), and the targets have high contrast with the background, which helps detectors find them. Different fusion algorithms can integrate the complementary information of these two images; however, fusion and detection performance differ between methods. For example, in the 00004N scene, IVFusion, GF, DDcGAN, YDTR, CUFD, and DATFuse detect only one person, and ADF, DenseFuse, and GAN-FM detect two people, while our method detects three people. A similar situation occurs in the 00726N scene, where only our method and CUFD accurately detect people, cars, and trucks. This shows that our method fully integrates the intensity information of infrared images and the texture information of visible images and is suitable for subsequent image applications.
In daytime images, the confidence with which pedestrians are detected is lower in visible images than in infrared images due to daytime illumination factors, and some pedestrian targets may not be detected. In the 00420D scene, DDcGAN cannot preserve the sharp edges of pedestrians and other objects, resulting in low confidence for both targets. Due to the interference of useless information, ADF, IVFusion, and YDTR leave some targets undetected. The poor fusion of GF, GAN-FM, CUFD, and DATFuse leads to lower confidence in detecting pedestrians and other objects than in the source images. In contrast, our method and DenseFuse fully integrate the semantic information in the source images, preserving the source image targets and details. Compared with the other methods, our fused images allow all targets to be detected, with confidence levels closer to those of the source images, which demonstrates the advantage of our method for facilitating advanced vision tasks.

Objective Results
To further evaluate the performance of the different methods on the detection task, we use the mean average precision (mAP) for quantitative evaluation. The mAP takes values between 0 and 1; the closer it is to 1, the better the detection performance. mAP@0.5 and mAP@0.9 denote the mAP values at IoU thresholds of 0.5 and 0.9, respectively. The results are shown in Table 4, and it can be seen that our method performs well under both thresholds. In terms of mAP@0.9 in particular, the fused images of our method have a clear advantage and rank first in terms of average precision. In terms of mAP@0.5, GAN-FM performs best and our method is second, while GAN-FM performs poorly at the other threshold. This further indicates the stability of our method. Based on the above subjective and objective analysis, we conclude that the fused images of the proposed method perform well in the image fusion task and also help improve target detection performance.
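For readers unfamiliar with the metric, the following minimal sketch shows how an average precision value at a given threshold can be computed for one class on one image. It assumes that the thresholds in mAP@0.5 and mAP@0.9 are applied to the box overlap (intersection over union, IoU), as is conventional. The box format, the greedy matching rule, the all-point interpolation, and the example boxes are likewise assumptions for illustration; this is not the evaluation code used with YOLOX in the experiments.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold):
    """AP for one class; detections is a list of (score, box), ground_truth a list of boxes."""
    if not ground_truth:
        return 0.0
    matched = set()
    hits = []  # 1 = true positive, 0 = false positive, in descending score order
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Greedily match to the best still-unmatched ground-truth box.
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched.add(best_idx)
            hits.append(1)
        else:
            hits.append(0)
    # Precision and recall after each detection.
    precisions, recalls, true_pos = [], [], 0
    for k, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / k)
        recalls.append(true_pos / len(ground_truth))
    # Make precision non-increasing, then integrate it over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical boxes: two ground-truth objects and three scored detections.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (21, 21, 31, 31)), (0.3, (50, 50, 60, 60))]
ap_50 = average_precision(detections, ground_truth, iou_threshold=0.5)
ap_90 = average_precision(detections, ground_truth, iou_threshold=0.9)
print(ap_50, ap_90)
```

The mAP is then the mean of such AP values over all object classes. The stricter threshold penalizes loosely localized boxes, which is why a method can rank well at 0.5 and poorly at 0.9.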

Conclusions
In this paper, we propose a new end-to-end network for infrared and visible image fusion. To account for the characteristics of the two modalities, we design dual-stream asymmetric branches to extract infrared and visible image features: the transformer captures global information and long-distance dependencies in infrared images, and residual dense blocks fully extract texture details in visible images. Finally, the captured features are merged by the main path to further retain important information. This approach preserves both the texture details of visible images and the thermal radiation targets of infrared images. We conducted extensive comparison and generalization experiments on the RoadScene and TNO datasets. The experimental results show that our approach outperforms existing techniques in both subjective and objective evaluations, demonstrating its performance and generalization ability. Target detection experiments on the MFNet dataset show the benefit of our approach for high-level vision tasks. In future research, we will further refine the network details and design more appropriate loss functions to improve our method. We will also extend our method into a unified fusion model and apply it to other multimodal image fusion tasks, such as medical image fusion and remote sensing image fusion.