IFormerFusion: Cross-Domain Frequency Information Learning for Infrared and Visible Image Fusion Based on the Inception Transformer

Abstract: Current deep learning-based image fusion methods cannot sufficiently learn the features of images in a wide frequency range. Therefore, we propose IFormerFusion, which is based on the Inception Transformer and cross-domain frequency fusion. To learn features from high- and low-frequency information, we design the IFormer mixer, which splits the input features along the channel dimension and feeds them into parallel high- and low-frequency mixers to achieve linear computational complexity. The high-frequency mixer adopts a convolution path and a max-pooling path, while the low-frequency mixer adopts a criss-cross attention path. Considering that high-frequency information relates to texture detail, we design a cross-domain frequency fusion strategy, which trades high-frequency information between the source images. This structure can sufficiently integrate complementary features and strengthen the capability to retain texture. Experiments on the TNO, OSU, and Road Scene datasets demonstrate that IFormerFusion outperforms other methods in objective and subjective evaluations.


Introduction
The image fusion technique aims to fuse images captured by different sensors to generate a fused image with better human visual effects and scene representation. As a result of thermal-based imaging, the infrared sensor can work in low-illuminance environments and all weather conditions and generate infrared images to emphasize prominent targets. The infrared image contains rich, low-frequency information, but it lacks texture detail. The visible image contains a rich texture and detailed spatial features, but it is susceptible to influence by illuminance and weather conditions. Due to the good complementarity of infrared and visible images, the fused image can provide high-quality results for person re-identification [1], object tracking [2], remote sensing [3,4], and salient object detection [5].
Image fusion methods can be categorized into traditional methods and deep learning-based methods. Traditional methods usually consist of three parts: a manually designed representation model to extract features, fusion strategies applied to the feature maps or weight maps, and an inverse feature extractor to reconstruct images. Traditional methods can be further categorized into spatial domain methods [6] and frequency domain methods [7][8][9]. Spatial domain methods are usually simple to compute, but they preserve edges poorly. Frequency domain methods usually adopt domain transform operations and fusion procedures, which are more complex [10]. Although traditional fusion methods can achieve good fusion results, the manual design of transformation algorithms and fusion strategies can result in higher computational complexity, limiting the fusion performance. Conversely, deep learning-based models can extract features and generate high-performance fused images without the need for a complex manual design. Deep learning-based methods can be categorized into four types: convolutional neural network (CNN)-based fusion methods [11,12], auto-encoder-based fusion methods [13,14], generative adversarial network (GAN)-based fusion methods [15,16], and transformer-based fusion methods [17,18]. Although these methods can generate fused images with good performance, some issues remain. CNN-based, auto-encoder-based, and GAN-based fusion methods generally use convolutional layers to extract features. Thus, these methods cannot establish long-range dependencies because of the limited receptive field of convolutional layers. The vision transformer uses attention mechanisms to build long-range relationships between image patches. However, there are still some drawbacks. Firstly, the existing methods fail to learn information in a wide frequency range, which is important for infrared and visible image fusion tasks. Secondly, the existing methods lack consideration for the relationship between the frequency information of the source images.
To address these issues, we propose IFormerFusion, which is based on the Inception Transformer and cross-domain frequency fusion. On the one hand, we design the IFormer mixer based on the Inception Transformer, which adopts convolution/max-pooling paths to process high-frequency information and a criss-cross attention path to process low-frequency information. The IFormer mixer is designed to bring high-frequency information into the Transformer structure and learn information in a wide frequency range. On the other hand, cross-domain frequency fusion trades high-frequency information between the source images to guide the model to retain more high-frequency information. IFormerFusion has four parts: feature extraction, cross-domain frequency fusion, feature reconstruction, and fused image reconstruction. The feature extraction part extracts comprehensive features in a wide frequency range. The cross-domain frequency fusion part learns features in a wide frequency range and trades high-frequency information between the source images to strengthen the texture-retaining capability of the model. The fusion results are concatenated and fed to the Inception Transformer-based feature reconstruction part to reconstruct deep features. Finally, a CNN-based fused image reconstruction part reconstructs the fused image. The main contributions of this work can be summarized as follows:

• We propose an infrared and visible image fusion method, IFormerFusion, which can efficiently learn features from source images in a wide frequency range. IFormerFusion can sufficiently retain texture details and maintain the structure of the source images.
• We design the IFormer mixer, which consists of convolution/max-pooling paths and a criss-cross attention path. The convolution/max-pooling paths learn high-frequency information, while the criss-cross attention path learns low-frequency information.
• The cross-domain frequency fusion trades high-frequency information between the source images to sufficiently learn comprehensive features and strengthen the capability to retain texture.
• Experiments conducted using the TNO, OSU, and Road Scene datasets show that IFormerFusion obtains better results in both visual quality evaluation and quantitative evaluation.

Vision Transformer
The transformer, which was designed for natural language processing (NLP), has achieved notable success in a broad range of computer vision tasks, such as image classification [19,20], semantic segmentation [21,22], and object detection [23][24][25][26][27][28]. In 2020, Dosovitskiy et al. proposed the vision transformer, which splits an image into a sequence of flattened 16 × 16 patches and regards them as words in a piece of text [29]. The vision transformer then embeds the patches linearly and adopts a self-attention mechanism, whose computational complexity is quadratic in the image size; this is the bottleneck for pixel-level tasks (such as image fusion) that require dense prediction. In 2021, Liu et al. proposed the Swin Transformer, in which self-attention is computed within local windows and the window partitioning is shifted in successive blocks [21]. Thus, the Swin Transformer has linear complexity with respect to the image size [21]. In 2022, Si et al. observed that the vision transformer is weak at learning information in a wide frequency range and proposed the Inception Transformer, which can sufficiently learn comprehensive features with both high- and low-frequency information [30]. The input is split into three partitions and fed into the Inception mixer to learn features in a wide frequency range. The Inception mixer gains efficiency from a channel-splitting mechanism that feeds parallel convolution/max-pooling paths and a self-attention path, which together learn information within a wide frequency range [30].
Remote Sens. 2023, 15, 1352

Deep Learning-Based Image Fusion Methods
In the beginning, deep learning models were only utilized to extract features and generate weight maps for fusion [11]. The majority of fusion methods retained the structure of the traditional fusion framework. As more researchers designed networks and loss functions, CNN-based methods gradually diverged from traditional frameworks. In 2021, Long et al. designed a parallel, aggregated, residual dense block consisting of a dense block path and a residual dense block path and proposed RXDNFuse [12]. In 2019, Li et al. introduced an auto-encoder-based fusion network, DenseFuse [13]. Auto-encoder-based fusion methods benefit from the high interpretability of the traditional fusion methods and the feature extraction ability of the CNN. Thus, many researchers focus on improving each part of the auto-encoder structure. In 2020, to fuse the extracted features more effectively and reconstruct the feature maps, Li et al. proposed NestFuse, which adopts a nest-connected decoder network and an attention-based fusion strategy [31]. In 2021, Xu et al. proposed a learnable fusion strategy for the first time, which could quantitatively measure the classification significance of feature maps by using the back-propagation integral gradient of the classification results [32]. In 2022, Wang et al. proposed Res2Fusion, which adopts a multi-scale feature extraction strategy without down-sampling and establishes long-distance feature dependency through nonlocal attention mechanisms [33].
Some researchers utilized the generative adversarial network (GAN) to fuse images in an implicit manner. In 2019, Ma et al. first introduced GAN to infrared and visible image fusion and proposed FusionGAN [15]. Subsequently, they proposed other GAN-based fusion methods (DDcGAN [34] in 2020 and GANMcC [16] in 2021). DDcGAN is utilized to fuse multi-resolution images. GANMcC is utilized to generate fused images with a balance between the gradient and intensity of the source images.
To further improve the fusion effects, some researchers replaced the CNN layers with transformer structures. In 2022, Liu et al. proposed MFST, which adopts a self-adaptive transformer fusion strategy [14]. In 2022, Rao et al. proposed TGFuse, which adopts a transformer-based generator [35]. In 2022, Wang et al. proposed SwinFuse, which adopts residual Swin Transformer blocks as the encoder network. Also in 2022, building on the Swin Transformer, Ma et al. proposed a cross-domain long-range fusion method, which includes inter- and Swin Transformer-based cross-domain modules to extract and fuse deep features [18].
Different from the methods mentioned above, we propose IFormerFusion, which sufficiently learns the features of the images in a wide frequency range through the IFormer mixer, which consists of parallel convolution, max-pooling, and attention paths. Moreover, we design a cross-domain frequency fusion strategy to sufficiently integrate complementary features and strengthen the ability to retain texture.

Overall Framework
Let $I_1, I_2 \in \mathbb{R}^{H \times W \times C}$ represent the input images of IFormerFusion, where $C$ is the channel number, and $H$ and $W$ are the image sizes. As shown in Figure 1, IFormerFusion is constructed with four parts: feature extraction, cross-domain frequency fusion, feature reconstruction, and fused image reconstruction.
Feature Extraction: The input images or features are first embedded by convolutional embedding layers, which extend the channel dimension of the input to the required number. In this experiment, the required channel number is 60. The embedding result can be expressed as:
$$\varphi = \mathrm{Embed}(I), \tag{1}$$
where $\mathrm{Embed}(\cdot)$ denotes the convolutional embedding layer. We then design feature extraction blocks called base blocks. In each base block, the input feature is split along the channel dimension into three partitions, two high-frequency partitions (H1 and H2) and a low-frequency partition (L), which are fed into the IFormer mixer. Next, a feed-forward network (FFN) is deployed to refine the result of the IFormer mixer. Layer normalization (LN) is adopted before both the FFN and the split operation in the IFormer mixer, which is described in detail in the next section. Moreover, a residual connection is adopted in the mixer. The computation can be expressed as:
$$\varphi_{h_1}, \varphi_{h_2}, \varphi_{l} = \mathrm{Split}(\mathrm{LN}(\varphi)),$$
$$\varphi_{M} = \mathrm{Mixer}(\varphi_{h_1}, \varphi_{h_2}, \varphi_{l}) + \varphi,$$
$$\varphi_{F} = \mathrm{FFN}(\mathrm{LN}(\varphi_{M})),$$
where $\varphi_{h_1}$ and $\varphi_{h_2}$ represent the high-frequency information, $\varphi_{l}$ represents the low-frequency information, $\varphi_{M}$ represents the residual mixed results, and $\varphi_{F}$ represents the extracted features. Feature extraction consists of $N_1$ base blocks. In this experiment, $N_1$ is 3.
Cross-Domain Frequency Fusion: The cross-domain frequency fusion part is designed to learn features in a wide frequency range and fuse cross-domain frequency information. It consists of $N_2$ cross blocks. The cross block has the same structure as the base block in feature extraction, except that the high-frequency information of the two input features is exchanged. The embedding results of input features $I_1$ and $I_2$ can be expressed as:
$$\varphi^{i} = \mathrm{Embed}(I_i), \quad i = 1, 2.$$
The split partitions of $\varphi^{i}$ are $\varphi^{i}_{h_1}$, $\varphi^{i}_{h_2}$, and $\varphi^{i}_{l}$ ($i = 1, 2$), which can be expressed as:
$$\varphi^{i}_{h_1}, \varphi^{i}_{h_2}, \varphi^{i}_{l} = \mathrm{Split}(\mathrm{LN}(\varphi^{i})).$$
The mixed results $\varphi^{1}_{M}$ and $\varphi^{2}_{M}$ can be expressed as:
$$\varphi^{1}_{M} = \mathrm{Mixer}(\varphi^{2}_{h_1}, \varphi^{2}_{h_2}, \varphi^{1}_{l}) + \varphi^{1},$$
$$\varphi^{2}_{M} = \mathrm{Mixer}(\varphi^{1}_{h_1}, \varphi^{1}_{h_2}, \varphi^{2}_{l}) + \varphi^{2}.$$
The extracted features $\varphi^{1}_{F}$ and $\varphi^{2}_{F}$ can be expressed as:
$$\varphi^{i}_{F} = \mathrm{FFN}(\mathrm{LN}(\varphi^{i}_{M})), \quad i = 1, 2.$$
In this experiment, $N_2$ is 6. The results $\varphi^{1}_{F}$ and $\varphi^{2}_{F}$ are concatenated and fed to the feature reconstruction part. The concatenated result $\varphi_{F}$ can be expressed as:
$$\varphi_{F} = \mathrm{Concat}(\varphi^{1}_{F}, \varphi^{2}_{F}).$$
Feature Reconstruction: The results of the cross-domain frequency fusion part are fed into the feature reconstruction part, which consists of $N_3$ reconstruction blocks. The reconstruction block is half of the base block, with a single path to process the concatenated result. In this experiment, $N_3$ is 3.
Fused Image Reconstruction: Finally, a simple CNN-based part consisting of two convolutional layers is utilized to reconstruct the fused image.
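The exchange of high-frequency partitions in a cross block can be illustrated with a minimal NumPy sketch. It only shows the channel bookkeeping, not the learned layers. The function names `split_channels` and `cross_exchange` are hypothetical, and the 15/15/30 channel split of the 60 embedded channels is an assumption made for illustration, since the split ratio is not stated here.

```python
import numpy as np

def split_channels(x, c_h1, c_h2):
    """Split an (H, W, C) feature map along the channel axis into two
    high-frequency partitions and one low-frequency partition."""
    h1 = x[..., :c_h1]
    h2 = x[..., c_h1:c_h1 + c_h2]
    low = x[..., c_h1 + c_h2:]
    return h1, h2, low

def cross_exchange(phi1, phi2, c_h1, c_h2):
    """Build the mixer inputs of both branches with the high-frequency
    partitions swapped between the two sources."""
    a_h1, a_h2, a_l = split_channels(phi1, c_h1, c_h2)
    b_h1, b_h2, b_l = split_channels(phi2, c_h1, c_h2)
    branch1 = (b_h1, b_h2, a_l)  # low frequency of source 1, high frequency of source 2
    branch2 = (a_h1, a_h2, b_l)  # low frequency of source 2, high frequency of source 1
    return branch1, branch2

rng = np.random.default_rng(0)
phi1 = rng.standard_normal((8, 8, 60))  # embedded features of source 1
phi2 = rng.standard_normal((8, 8, 60))  # embedded features of source 2
# Assumed split: 15 + 15 high-frequency channels, 30 low-frequency channels.
branch1, branch2 = cross_exchange(phi1, phi2, c_h1=15, c_h2=15)
print([p.shape for p in branch1])
```

Each branch keeps its own low-frequency partition, so the residual connection and the FFN can still be applied per source after mixing.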

IFormer Mixer
The architecture of the IFormer mixer is shown in Figure 2. After the input feature is split into three partitions along the channel dimension, high-frequency and low-frequency mixers are adopted to learn features in a wide frequency range. The high-frequency mixer has a max-pooling (MaxPool) path, which consists of max pooling followed by a linear layer [36], and a parallel convolution path, which consists of a linear layer followed by a depthwise convolution (DwConv) layer [37]. The low-frequency mixer has a criss-cross attention path, which consists of average pooling, criss-cross attention (CC-Atten) [38], and up-sampling (UpSample). The detailed computation is as follows.
Given the input $X \in \mathbb{R}^{H \times W \times C}$, $X$ is divided along the channel dimension into two parts, the high-frequency part $X_h \in \mathbb{R}^{H \times W \times C_h}$ and the low-frequency part $X_l \in \mathbb{R}^{H \times W \times C_l}$, where $C_h + C_l = C$. Then, $X_h$ and $X_l$ are assigned to the high-frequency mixer and the low-frequency mixer, respectively. The details of the high- and low-frequency mixers follow.
High-frequency mixer: Considering the sharpness sensitivity of the maximum filter and the detail perception of the convolution operation, we propose two high-frequency paths that take advantage of the sharpness sensitivity of max pooling and the detail perception capability of the convolution layers to learn high-frequency information. Firstly, the input $X_h$ is divided into $X_{h1} \in \mathbb{R}^{H \times W \times C_h/2}$ and $X_{h2} \in \mathbb{R}^{H \times W \times C_h/2}$. $X_{h1}$ is fed into the max-pooling path, and $X_{h2}$ is fed into the parallel convolution path. The outputs of the high-frequency mixer, $Y_{h1}$ and $Y_{h2}$, can be expressed as:
$$Y_{h1} = \mathrm{FC}(\mathrm{MaxPool}(X_{h1})),$$
$$Y_{h2} = \mathrm{DwConv}(\mathrm{FC}(X_{h2})),$$
where $\mathrm{FC}$ denotes the linear layer.
Low-frequency mixer: Considering the strong capability of the attention mechanism for learning global representations, we use criss-cross attention [38] to establish long-range dependencies and learn low-frequency information. However, in image fusion tasks, dense prediction incurs a high computational cost because of the large resolution of the feature maps in the low-frequency mixer. Therefore, we utilize an average-pooling operation to reduce the scale of $X_l$ before the criss-cross attention operation and an up-sampling layer to restore the original scale. In this experiment, the kernel size and stride of the average pooling are 4, and the scale factor of the up-sampling layer is also 4. This branch can be defined as
$$Y_{l} = \mathrm{UpSample}(\mathrm{CC}(\mathrm{AvgPool}(X_{l}))),$$
where $Y_l$ is the output of the low-frequency mixer, and $\mathrm{CC}$ represents criss-cross attention. Finally, the outputs $Y_{h1}$, $Y_{h2}$, and $Y_l$ are concatenated along the channel dimension and fused to obtain $Y_c$:
$$Y_{c} = \mathrm{Concat}(Y_{h1}, Y_{h2}, Y_{l}).$$
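The data flow through the three paths can be traced with a small NumPy sketch that follows only the tensor shapes. It is not the authors' implementation: the learned layers (linear, depthwise convolution, criss-cross attention) are replaced by identity stand-ins, the 3 × 3 max-pooling window and nearest-neighbour up-sampling are assumptions, and the even 30/30 split between high- and low-frequency channels is chosen only for illustration. The average-pooling kernel and stride of 4 and the 128 × 128 input follow the settings given in the text.

```python
import numpy as np

def max_pool3x3(x):
    """3x3 max pooling with stride 1 and same-size output on an (H, W, C) array."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(windows, axis=0)

def avg_pool(x, k):
    """Non-overlapping k x k average pooling (kernel size = stride = k)."""
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def upsample(x, k):
    """Nearest-neighbour up-sampling by an integer factor k (assumed mode)."""
    return x.repeat(k, axis=0).repeat(k, axis=1)

def mixer_sketch(x, c_h, attention=lambda t: t):
    """Shape-level sketch of the mixer; `attention` stands in for criss-cross attention."""
    x_h, x_l = x[..., :c_h], x[..., c_h:]
    x_h1, x_h2 = x_h[..., :c_h // 2], x_h[..., c_h // 2:]
    y_h1 = max_pool3x3(x_h1)  # max-pooling path (linear layer omitted)
    y_h2 = x_h2               # convolution path (identity stand-in)
    y_l = upsample(attention(avg_pool(x_l, 4)), 4)  # low-frequency path
    return np.concatenate([y_h1, y_h2, y_l], axis=-1)

x = np.random.default_rng(0).standard_normal((128, 128, 60))
y_c = mixer_sketch(x, c_h=30)
print(y_c.shape)  # (128, 128, 60)
```

With a pooling factor of 4, the attention stand-in sees a 32 × 32 map instead of 128 × 128, which is the reduction in positions that the average pooling is introduced to achieve.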

Loss Function
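Three loss functions are adopted for IFormerFusion in the training phase, which are explained as follows. Let $I_f$ denote the fused image generated from the source images $I_1$ and $I_2$.
The structural similarity (SSIM) loss $L_{SSIM}$ can be expressed as:
$$L_{SSIM} = \alpha \left(1 - \mathrm{SSIM}(I_f, I_1)\right) + \beta \left(1 - \mathrm{SSIM}(I_f, I_2)\right),$$
where $\alpha = \beta = 0.5$ in this experiment and $\mathrm{SSIM}(\cdot)$ represents the structural similarity operation [39]. Inspired by IFCNN [40] and SwinFusion [18], we deploy the intensity loss to supervise the model to capture potential intensity information. The intensity loss $L_{Int}$ can be expressed as:
$$L_{Int} = \frac{1}{HW} \left\| I_f - \max(I_1, I_2) \right\|_1,$$
where $\|\cdot\|_1$ represents the $l_1$-norm, $\max(\cdot)$ selects the element-wise maximum value, and $H$ and $W$ represent the image sizes. We deploy the texture loss [18] to evaluate the texture details, which can be extracted by the maximum function. Thus, the texture loss $L_{Text}$ can be expressed as:
$$L_{Text} = \frac{1}{HW} \left\| \, |\nabla I_f| - \max(|\nabla I_1|, |\nabla I_2|) \, \right\|_1,$$
where $\nabla$ represents the Sobel gradient operator, $|\cdot|$ represents the absolute value operation, $\max(\cdot)$ selects the element-wise maximum value, and $\|\cdot\|_1$ represents the $l_1$-norm. Finally, the total loss function $L_{total}$ is a weighted sum of all loss functions, which can be expressed as:
$$L_{total} = \lambda_1 L_{SSIM} + \lambda_2 L_{Int} + \lambda_3 L_{Text},$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weights that balance each loss.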
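The intensity and texture losses described above can be sketched in NumPy as follows. This is a minimal illustration on single-channel images, not the training code: the helper names are hypothetical, the gradient magnitude is approximated as the sum of the absolute horizontal and vertical Sobel responses, and edge-replicated padding is assumed. The SSIM term is omitted.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel using edge-replicated padding."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def sobel_magnitude(img):
    """Approximate |grad I| as |G_x| + |G_y| (an assumed definition)."""
    return np.abs(filter3x3(img, SOBEL_X)) + np.abs(filter3x3(img, SOBEL_Y))

def intensity_loss(fused, src1, src2):
    """Mean absolute difference to the element-wise maximum of the sources."""
    return float(np.abs(fused - np.maximum(src1, src2)).mean())

def texture_loss(fused, src1, src2):
    """Mean absolute difference between the fused gradient magnitude and
    the element-wise maximum of the source gradient magnitudes."""
    target = np.maximum(sobel_magnitude(src1), sobel_magnitude(src2))
    return float(np.abs(sobel_magnitude(fused) - target).mean())

rng = np.random.default_rng(0)
ir = rng.random((128, 128))   # stand-in for a normalized infrared patch
vis = rng.random((128, 128))  # stand-in for a normalized visible patch
fused = np.maximum(ir, vis)   # a trivial fusion rule, used only as an example
print(intensity_loss(fused, ir, vis), texture_loss(fused, ir, vis))
```

Taking the mean over all pixels corresponds to the $1/(HW)$ normalization of the $l_1$-norm. The trivial maximum rule drives the intensity term to zero but not necessarily the texture term, which is why the two losses are weighted separately.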
All the experiments in this paper are conducted on an Intel(R) Xeon(R) Silver 4210 CPU and an NVIDIA GeForce RTX 3090 GPU. The model is implemented in PyTorch. The batch size is eight. The images are randomly cropped to 128 × 128 patches and normalized to [0, 1]. The Adam optimizer is used. The model is trained for 200 epochs. The initial learning rate is 0.001, and it is halved at epochs 20, 40, 80, 120, and 180.
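The step-wise learning-rate decay described above can be written as a short plain-Python function. This is a sketch under the assumption that epochs are counted from 0 and that each halving takes effect at the start of the listed epoch; the function name `learning_rate` is hypothetical.

```python
MILESTONES = (20, 40, 80, 120, 180)

def learning_rate(epoch, base_lr=1e-3, milestones=MILESTONES, gamma=0.5):
    """Learning rate in effect during `epoch`, multiplied by `gamma`
    once for every milestone that has already been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

for epoch in (0, 19, 20, 40, 80, 120, 180, 199):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same schedule is commonly obtained with `torch.optim.lr_scheduler.MultiStepLR` using these milestones and `gamma=0.5`, although the text does not state which scheduler was used.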

Evaluation Metrics
A total of six metrics are selected for evaluation, including mutual information (MI) [47], fast mutual information (FMI) [48], the peak signal-to-noise ratio (PSNR), the structural similarity index measurement (SSIM) [39], visual information fidelity (VIF) [49], and Qabf [50]. These metrics measure the performance of the fusion method from different aspects. Suppose the infrared image, the visible image, and the fused image are $I$, $V$, and $F$, respectively, and let $X \in \{I, V\}$ denote a source image. Their detailed definitions are described as follows.
The mutual information metric is a quality index that measures the amount of information transferred from the source images to the fused image. Mutual information is a fundamental concept in information theory and measures the dependence of two random variables. The mutual information metric can be expressed as:
$$\mathrm{MI} = \mathrm{MI}(I, F) + \mathrm{MI}(V, F),$$
where $\mathrm{MI}(I, F)$ and $\mathrm{MI}(V, F)$ are the amounts of information transferred from the infrared image and the visible image to the fused image, respectively. The MI between two random variables can be calculated by the Kullback-Leibler measure, which can be expressed as:
$$\mathrm{MI}(X, F) = \sum_{x, f} P_{X,F}(x, f) \log \frac{P_{X,F}(x, f)}{P_X(x) P_F(f)},$$
where $P_{X,F}(x, f)$ is the joint histogram of the source image $X$ and the fused image $F$, and $P_X(x)$ and $P_F(f)$ are the marginal histograms of the source image $X$ and the fused image $F$, respectively.
The fast mutual information metric calculates the regional mutual information between the corresponding windows in the fused image and the source images [48]. For each window $i$, the mutual information $\mathrm{Info}_i(X, F)$ is normalized by the entropies $H_i(X)$ and $H_i(F)$ of $X$ and $F$, and the results for the two source images, $\mathrm{Info}(I, F)$ and $\mathrm{Info}(V, F)$, are combined to give the FMI value.
The peak signal-to-noise ratio metric is the ratio of the maximum possible power of a signal to the power of the destructive noise that affects its representation accuracy. The PSNR between the fused image and a source image $X$ can be expressed as:
$$\mathrm{PSNR}(X, F) = 10 \log_{10} \frac{\max(F)^2}{\mathrm{MSE}(X, F)},$$
where $\max(F)$ is the maximum value in the fused image, and $\mathrm{MSE}(I, F)$ and $\mathrm{MSE}(V, F)$ are the mean-squared errors, which can be expressed as:
$$\mathrm{MSE}(X, F) = \frac{1}{MN} \sum_{i} \sum_{j} \left( X(i, j) - F(i, j) \right)^2,$$
where $X(i, j)$ and $F(i, j)$ are the values of $X$ and $F$ in row $i$ and column $j$, and $M$ and $N$ are the width and height of the images, respectively.
The structural similarity index measurement metric calculates the similarity of the fused image and the source images in terms of luminance, contrast, and structure [39]. The SSIM between a source image $X$ and the fused image $F$ can be expressed as:
$$\mathrm{SSIM}(X, F) = \frac{2 \mu_x \mu_f + C_1}{\mu_x^2 + \mu_f^2 + C_1} \cdot \frac{2 \sigma_x \sigma_f + C_2}{\sigma_x^2 + \sigma_f^2 + C_2} \cdot \frac{\sigma_{xf} + C_3}{\sigma_x \sigma_f + C_3},$$
where $\mu_x$ and $\mu_f$ are the mean values of $X$ and $F$; $\sigma_x$ and $\sigma_f$ are the standard deviations of $X$ and $F$; $\sigma_{xf}$ is the covariance of $X$ and $F$; $C_1$, $C_2$, and $C_3$ are $(0.01L)^2$, $(0.03L)^2$, and $\frac{1}{2}(0.01L)^2$, respectively; and $L$ is the dynamic range of the pixel values, which is 255.
The visual information fidelity metric measures image distortions, including additive noise, blurs, and changes [49]. VIF is derived from the quantification of two types of mutual information: the mutual information between the input and the output of the human visual system (HVS) channel (described via a stationary white Gaussian noise model) when no distortion channel is present (i.e., the reference mutual information) and the mutual information between the input of the distortion channel and the output of the HVS channel for the test image [51]. The visual information fidelity is the ratio of the latter to the former, each summed over a collection of specific sub-bands. On this basis, the visual information fidelity for fusion (VIFF) is used to evaluate the fused image.
The Qabf metric measures the similarity of the edges transferred from the source images to the fused image. Qabf can be expressed as:
$$Q^{abf} = \frac{\sum_{i} \sum_{j} \left( Q^{I,F}(i, j) \, w^{I}(i, j) + Q^{V,F}(i, j) \, w^{V}(i, j) \right)}{\sum_{i} \sum_{j} \left( w^{I}(i, j) + w^{V}(i, j) \right)},$$
where $w^{X}(i, j)$ is the weight matrix of the source image $X$ and $Q^{X,F}(i, j)$ is the edge information transferred from the source image to the fused image. $Q^{X,F}(i, j)$ can be expressed as:
$$Q^{X,F}(i, j) = Q^{X,F}_{g}(i, j) \, Q^{X,F}_{a}(i, j),$$
where $Q^{X,F}_{g}(i, j)$ and $Q^{X,F}_{a}(i, j)$ are the retained edge intensity and direction at pixel $(i, j)$, respectively.
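As an illustration of the histogram-based definitions above, the following NumPy sketch computes the mutual information, the mean-squared error, and the PSNR for 8-bit grayscale images. It is not the evaluation code used in the experiments: the function names are hypothetical, 256 grey-level bins and a base-2 logarithm are assumptions, and the PSNR is the per-source form with the maximum taken from the fused image.

```python
import numpy as np

def mutual_information(x, f, bins=256):
    """MI (in bits) between two 8-bit images, from their joint histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), f.ravel(), bins=bins,
                                 range=[[0, 256], [0, 256]])
    p_xf = joint / joint.sum()             # joint distribution
    p_x = p_xf.sum(axis=1, keepdims=True)  # marginal of the source image
    p_f = p_xf.sum(axis=0, keepdims=True)  # marginal of the fused image
    nz = p_xf > 0                          # skip empty bins to avoid log(0)
    return float(np.sum(p_xf[nz] * np.log2(p_xf[nz] / (p_x @ p_f)[nz])))

def mse(x, f):
    """Mean-squared error between two images of the same size."""
    return float(np.mean((x.astype(float) - f.astype(float)) ** 2))

def psnr(x, f):
    """PSNR in dB; undefined when the two images are identical (MSE = 0)."""
    return float(10.0 * np.log10(float(f.max()) ** 2 / mse(x, f)))

rng = np.random.default_rng(0)
ir = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
vis = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
# A plain average is used as a stand-in fusion rule for the example.
fused = ((ir.astype(float) + vis.astype(float)) / 2).astype(np.uint8)

fusion_mi = mutual_information(ir, fused) + mutual_information(vis, fused)
print(fusion_mi, psnr(ir, fused), psnr(vis, fused))
```

The mutual information of an image with itself equals its entropy, which gives a quick sanity check for the histogram code.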

Results on the TNO Dataset
Visual Quality Evaluation: Four pairs of images are selected to evaluate the visual effects of IFormerFusion and nine comparable methods, as shown in Figure 3. Some targets and details are highlighted with boxes to draw attention to them. DenseFuse, NestFuse, RFN-Nest, and SwinFuse extract features through a well-designed encoder and adopt an additional attention-based or learnable fusion strategy. Their fused images have lower brightness and contrast than those of the other methods, which indicates a weaker ability to retain detailed texture information. In Figure 3a, the fused images of DenseFuse, NestFuse, and SwinFuse cannot display the text information of the pedestrian and the store billboard. In Figure 3b, the target on the building is hard to distinguish from the background. In Figure 3c,d, the tree branches lack details. The fused images of GANMcC retain the sharpness of the target in the infrared images. However, the background information, such as the plants and buildings, is fuzzy and lacks details. Res2Fusion offers better visual effects than the above methods. In Figure 3a, the store billboard and pedestrian details are retained from the source images. Nevertheless, in Figure 3b, the target over the building is still not clear. In Figure 3c, the details of the tree branches are retained, but in Figure 3d, more specific details of the tree are lost. IFCNN, SwinFusion, U2Fusion, and IFormerFusion generate fused images that balance the gradient and intensity information. The targets are displayed with rich texture details and sharp edges. The details of background information, including the plants and buildings, are retained. Among them, IFormerFusion obtains the best visual effects and retains rich texture details and sharp edges.
Quantitative Evaluation: To quantitatively evaluate IFormerFusion and the comparable methods, six metrics are calculated using 21 pairs of images. The result is shown in Figure 4. The average value of each metric is shown in Table 1. IFormerFusion obtains the best values in MI, FMI, PSNR, VIFF, and Qabf, and the second-best value in SSIM. More specifically, the highest MI and FMI values show that the proposed method can transfer more feature and edge information from the source images to the fused image. The highest PSNR value shows that the proposed method causes the least information distortion during fusion. The highest VIFF value shows that the proposed method has more effective visual information. The highest Qabf value shows that the proposed method can obtain more visual information from the source images. DenseFuse has the highest SSIM value, which indicates an advantage in maintaining structural information. However, DenseFuse is weaker in terms of the other metrics. Overall, the result indicates that IFormerFusion produces the best fusion results.

Results on the OSU Dataset
Visual Quality Evaluation: A pair of images is selected to evaluate the visual effects of IFormerFusion and nine comparable methods. The results are shown in Figure 5. In the visible image, the pedestrian in the shadow of the building is hard to distinguish, while they are clearly observable in the infrared image. In the fused images of DenseFuse, GANMcC, NestFuse, and RFN-Nest, the gradient of the pedestrian is close to that of the building. Moreover, the edge of the pedestrian is blurred by the background building in GANMcC and RFN-Nest. In the fused images of IFCNN, Res2Fusion, SwinFuse, SwinFusion, and IFormerFusion, the brightness of the pedestrian is similar to that in the infrared image, which indicates that the information in the infrared image is well retained. In the fused images of GANMcC and RFN-Nest, the sculpture on the lawn is blurred. The contrast of the lawn in the fused images of IFCNN, NestFuse, SwinFuse, and U2Fusion is discordant because it is more influenced by the infrared image. Thus, the texture detail of the lawn is lost. The lawn in the fused images of Res2Fusion, SwinFusion, and IFormerFusion has better visual effects. Overall, the proposed method, IFormerFusion, can retain texture detail and fuse with a balance of the gradient and intensity.
Quantitative Evaluation: To quantitatively evaluate IFormerFusion and the comparable methods, six metrics are calculated using 40 pairs of images. The result is shown in Figure 6. The average value of each metric is shown in Table 2. IFormerFusion obtains the best results in terms of PSNR, VIFF, and Qabf, which indicates that the proposed method has the least information distortion and obtains the best visual effects during fusion. The proposed method lags behind the best method by a narrow margin in terms of MI, FMI, and SSIM. Thus, the result indicates that IFormerFusion can transfer a large amount of information from the source images and produces the best results.

Results on the Road Scene Dataset
Visual Quality Evaluation: A pair of images is selected to evaluate the visual effects of IFormerFusion and nine comparable methods. The result is shown in Figure 7. In the fused images of DenseFuse, GANMcC, Res2Fusion, and RFN-Nest, the tree on the left is hard to distinguish from the background. On the right of these images, the texture of the tire rim is poor, and the people standing in that area are blurred. In the fused image of NestFuse, although the tree branches are clear, artifacts exist in the car light, which indicates that NestFuse cannot balance the light information. In the fused images of IFCNN, SwinFuse, SwinFusion, U2Fusion, and IFormerFusion, the details of the tree branches can be distinguished on the left. On the right, the tire rim is sharp, and the people in front of the car can be distinguished.

Quantitative Evaluation: To quantitatively evaluate IFormerFusion and the comparison methods, six metrics are calculated on 221 pairs of images from the Road Scene dataset. The average value of each metric is shown in Table 3. NestFuse obtains the best results in terms of MI and FMI. However, as shown in Figure 7, artifacts exist on the car, and these artifacts introduce extra information into the fused images, which increases the MI and FMI values. IFormerFusion obtains the best values in terms of PSNR, VIFF, and Qabf, and the second-best value for SSIM, behind DenseFuse. DenseFuse also obtains the second-best result in terms of PSNR. However, DenseFuse produces unsatisfactory results on the other metrics; in other words, DenseFuse can retain the structure of the source images, but its fused images have worse visual effects and carry less information. Overall, the results demonstrate that IFormerFusion produces the best fusion results.

Computation Efficiency
We also report the computational efficiency of IFormerFusion and the comparison methods in Table 4. DenseFuse, IFCNN, and NestFuse have a high computational efficiency because these methods consist of convolutional layers and simple fusion strategies. RFN-Nest and SwinFuse have a low computational efficiency because these methods use attention-based fusion strategies. Although GANMcC and U2Fusion also consist of convolutional layers, their computation is more complex, so these methods have a lower computational efficiency. Res2Fusion and SwinFusion have the lowest computational efficiency because they use more complex attention mechanisms. In contrast, the proposed IFormerFusion has a competitive computational efficiency and retains linear computational complexity with respect to the image size. All methods are tested using their publicly available code.
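A per-pair runtime comparison of this kind can be measured along the following lines. This is only a sketch under stated assumptions: `fuse_fn` stands for any hypothetical callable that takes an infrared image and a visible image and returns a fused image, and the warm-up count is an illustrative choice, not the protocol used for Table 4.

```python
import time


def average_runtime(fuse_fn, image_pairs, warmup=1):
    """Return the mean wall-clock seconds per image pair for a fusion callable.

    fuse_fn is a hypothetical callable: fuse_fn(infrared, visible) -> fused.
    The first `warmup` pairs are run once beforehand and not timed, so that
    one-off start-up costs do not distort the average.
    """
    pairs = list(image_pairs)
    if not pairs:
        raise ValueError("image_pairs must contain at least one pair")
    for infrared, visible in pairs[:warmup]:
        fuse_fn(infrared, visible)
    start = time.perf_counter()
    for infrared, visible in pairs:
        fuse_fn(infrared, visible)
    return (time.perf_counter() - start) / len(pairs)
```

When the fusion runs on a GPU, the timer would additionally need to wait for all pending device work to finish before reading the clock, otherwise the measured time can be too short.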

Conclusions
We propose IFormerFusion, a cross-domain frequency information learning network for infrared and visible image fusion based on the Inception Transformer. The IFormer mixer, which is based on the Inception Transformer, comprises a high-frequency mixer with max-pooling and convolution paths and a low-frequency mixer with a criss-cross attention path. The high-frequency mixer takes advantage of convolution and max-pooling to capture high-frequency information. The low-frequency mixer establishes long-range dependencies, and the criss-cross attention reduces the computational complexity. Thus, the IFormer mixer can learn information in a wide frequency range. Moreover, the high-frequency information is traded in the cross-domain frequency fusion part to achieve sufficient integration of complementary features and to strengthen the capability to retain texture. The proposed IFormerFusion can comprehensively learn features from high- and low-frequency information to retain texture details and maintain the structure.
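The cost saving from criss-cross attention can be illustrated by counting query-key pairs. The sketch below is not taken from the IFormerFusion implementation; it assumes the usual criss-cross formulation, in which each position attends only to the positions in its own row and column during a single pass, and the function name `attention_pairs` is hypothetical.

```python
def attention_pairs(height, width):
    """Count query-key pairs for full attention and one criss-cross pass.

    Full self-attention: every position attends to all height * width positions.
    Criss-cross attention (assumed formulation): every position attends to the
    height + width - 1 positions that share its row or its column.
    """
    positions = height * width
    full = positions * positions
    criss_cross = positions * (height + width - 1)
    return full, criss_cross
```

For a 64 × 64 feature map, `attention_pairs(64, 64)` returns 16,777,216 pairs for full attention and 520,192 pairs for a single criss-cross pass.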
We conducted experiments on the TNO, OSU, and Road Scene datasets and compared IFormerFusion with nine advanced deep learning methods using six metrics. The results demonstrate that IFormerFusion performs well at preserving structure and retaining texture details. In addition, IFormerFusion balances the intensity of the targets and the background in the fused image. IFormerFusion also has a competitive computational efficiency.