Real-Time Semantics-Driven Infrared and Visible Image Fusion Network

This paper proposes a real-time semantics-driven infrared and visible image fusion framework (RSDFusion). A novel semantics-driven image fusion strategy is introduced in image fusion to maximize the retention of significant information of the source image in the fusion image. First, a semantically segmented image of the source image is obtained using a pre-trained semantic segmentation model. Second, masks of significant targets are obtained from the semantically segmented image, and these masks are used to separate the targets in the source and fusion images. Finally, the local semantic loss of the separation target is designed and combined with the overall structural similarity loss of the image to instruct the network to extract appropriate features to reconstruct the fusion image. Experimental results show that the RSDFusion proposed in this paper outperformed other comparative methods on both subjective and objective evaluation of public datasets and that the main target of the source image is better preserved in the fusion image.


Introduction
Due to technical limitations of hardware devices or optical imaging, images captured with a single type of sensor and recording device can only capture partial information and cannot effectively and comprehensively describe the scene being imaged [1]. For example, visible light images typically contain rich texture detail information but are susceptible to extreme environments, occlusions, etc., which can cause targets in the scene to be lost. Infrared sensors capture the thermal radiation emitted by objects and are effective in detecting prominent targets such as pedestrians and vehicles but lack a textural description of the scene [2]. As an important branch of image processing, image fusion plays an important role in this regard by effectively integrating complementary information between images captured by different sensors and recording devices and generating images that approximate the fused scene by performing reconstruction from multiple modal samples. Fusion images have complementary properties and better scene representation than the source images, enabling effective applications such as target recognition [3], clinical diagnosis [4], semantic segmentation [5], remote sensing monitoring [6], and applications in other areas.
Infrared and visible images have become a popular means for image fusion due to their powerful complementary information. In recent years, many infrared and visible image fusion algorithms have been proposed, which broadly fall into two categories: traditional algorithms and deep learning-based algorithms. Traditional image fusion algorithms mainly include methods based on multiscale transform [7,8], methods based on sparse representation [9,10], methods based on low-rank representation [11], methods based on subspace [12], and methods based on hybrid techniques [13]. Although the existing traditional image fusion methods achieve good fusion results in most cases, problems still need to be solved. First, traditional manually designed fusion strategies cannot be adapted to complex fusion scenes, which limits their fusion results. Second, traditional methods do [...]
1. This paper proposes a real-time semantics-driven framework for infrared and visible image fusion that makes full use of semantic information to have the fusion images retain clearer semantic targets;
2. This paper proposes a local target content loss that guides the fusion network to locally fuse the same targets extracted from infrared and visible images, effectively improving the local target quality of the fusion images;
3. Experimental results show that the proposed algorithm outperforms existing popular fusion algorithms in both subjective visualization and objective evaluation.
The rest of the paper is structured as follows: Section 2 briefly describes the work related to deep learning-based image fusion methods and semantic segmentation. Section 3 provides a specific description of the proposed fusion method. Section 4 presents comparative validation experiments. Section 5 summarizes the work performed in this paper.

Related Work
This section reviews the available popular image fusion algorithms and provides a brief introduction to semantic segmentation.

Deep Learning-Based Image Fusion Methods
With deep learning techniques gaining ground in image processing, deep learning-based methods have been widely used in image fusion tasks. Among them, autoencoder (AE)-based frameworks pre-train an autoencoder on large datasets to perform feature extraction and image reconstruction and then fuse the extracted features using hand-designed fusion strategies. For example, Li et al. proposed the DenseFuse method [14], which has three parts: an encoder, a feature fusion layer, and a decoder. In the encoder, dense blocks are introduced to obtain deeper features while improving the propagation of features from earlier layers. In the fusion layer, element-wise addition and L1-norm strategies are used for feature fusion. In addition, they proposed a multiscale fusion method, NestFuse [15], which uses a multiscale encoder-decoder architecture with nest connections to enhance deep feature extraction and uses spatial/channel attention models in the fusion layer instead of the addition and L1-norm strategies. Although the above approaches achieve good performance, their handcrafted fusion strategies are difficult to adapt to complex fusion scenarios.
CNN-based frameworks are end-to-end image fusion frameworks that can effectively avoid the disadvantages of handcrafted fusion strategies. Liu et al. first proposed a CNN-based fusion method using 16 × 16 paired image blocks to train the network to build decision maps [16]. However, the authors pointed out that the trained network was only suited to the multi-focus image fusion task and not to other fusion tasks. Zhang et al. proposed a general fusion framework, IFCNN [17], which first extracts salient features from the source image using two convolutional layers to determine the image class; then, it selects a specific fusion strategy to fuse the features according to the input image type; finally, it generates a fused image. Tang et al. proposed a semantics-aware image fusion network, SeAFusion [18]. It improves the performance of fused images in high-level vision tasks by cascading a semantic segmentation module after the image fusion module and feeding the semantic information back to the image fusion module using a semantic loss.
GAN-based frameworks are also end-to-end image fusion frameworks; they avoid hand-designed fusion strategies by using adversarial learning between generators and discriminators. Ma et al. first proposed FusionGAN [19]. This framework constrains the fusion image to obtain more information from the source images by establishing an adversarial game between the generator and the discriminator. However, the fusion images generated using this framework retain infrared salient targets poorly. Therefore, Ma et al. introduced a dual-discriminator fusion framework, DDcGAN [20]. This framework has two discriminators that separately compare the fusion image with the infrared and visible images. Subsequently, Li et al. presented AttentionGAN [21], which incorporates a multiscale attention mechanism into the GAN architecture to improve the retention of significant features in the source image by enhancing the attention region with the generator and discriminator.
In general, both traditional and deep learning-based approaches emphasize the quality and metrics of the fusion image as a whole while ignoring the importance of the main target objects. In practice, the information of interest is what needs to be observed, while information of no interest can be relatively ignored. Therefore, the targets in the image should be segmented and assigned appropriate weights according to the needs of different real tasks. Semantic segmentation is an effective technique for dividing an image into target regions.

Semantic Segmentation
Semantic segmentation is one of the fundamental tasks in computer vision, and its main function is to let the computer segment objects based on the semantics of the image and determine what and where each object is at the pixel level. In recent years, semantic segmentation methods based on deep learning have attracted much attention due to their excellent performance and powerful generalization capabilities. First, Long et al. proposed a semantic segmentation method, FCN [22], which innovatively applies deep learning to semantic segmentation. FCN replaces the fully connected layers of the network with convolutional layers to achieve semantic segmentation at arbitrary resolutions. Ronneberger et al. proposed the UNet architecture [23], an encoder-decoder architecture originally designed for medical image segmentation; this network preserves the high-level semantic information and low-level positional information of the source image using skip connections. Chen et al. used atrous (dilated) convolution instead of traditional convolution to increase the density of features while maintaining the spatial resolution [24]. Xie et al. proposed SegFormer, a segmentation algorithm based on a transformer with a multilayer perceptron [25]. Since semantic segmentation can classify images at the pixel level based on their semantic features, it can be an effective method to distinguish different target objects in an image.
Some practical schemes combine the image fusion task and the semantic segmentation task. For example, Zhou et al. used a mask with semantic information to divide the source images into foreground and background regions and used a generative adversarial model to fuse the infrared foreground with the visible foreground and the infrared background with the visible background [26]. Tang et al. proposed SeAFusion, which cascades a semantic segmentation module after the fusion module. This approach uses a semantic loss to feed high-level semantic information back into the fusion module, thus improving the performance of fused images in high-level vision tasks. However, we believe that, under the same computational budget, it is more important to select the salient target objects in the source image to be enhanced according to the scene, while the remaining objects can be relatively ignored. For this purpose, we propose a new semantics-driven image fusion algorithm, RSDFusion.

Proposed Method
This section introduces the proposed semantics-driven infrared and visible light image fusion framework, RSDFusion.

General Framework
Image fusion is a technique for extracting and integrating important information from source images. The key to this technique is how to select significant information in the source images. In different scenarios, different target information has different importance. For example, in autonomous driving, information on people, vehicles, and roads is more important than other information. In other words, the targets in the source image need to be weighted differently according to different scenes. In this context, semantic segmentation techniques can help to select the important targets in the source image. Therefore, we designed a semantics-driven image fusion network using semantic segmentation, so that the fusion image can more effectively preserve the semantic information of important targets in the source image according to the needs.
The general framework of our proposed RSDFusion is shown in Figure 1. In the first step, the infrared image ($I_{ir}$) and visible image ($I_{vi}$) are input into the fusion network ($F(\cdot)$) to obtain a fusion image ($I_F$), which is expressed as
$$I_F = F(I_{ir}, I_{vi}).$$
Meanwhile, $I_{ir}$ and $I_{vi}$ are fed into the semantic segmentation network ($S(\cdot)$) to generate the mask ($I_M$). The source image target ($I_T$) is obtained with the weight function ($W(\cdot)$) and the mask, which can be expressed as
$$I_T = W(I_{ir}, I_{vi}, I_M).$$
Then, the fusion image target ($I_{FT}$) is obtained by applying the mask to the fusion image, which can be expressed as
$$I_{FT} = I_M \odot I_F,$$
where $\odot$ denotes element-wise multiplication. Finally, the structural loss function ($ST(\cdot)$) is used to calculate the structural loss between the fusion and source images, and the semantic loss function ($SE(\cdot)$) is used to calculate the semantic loss between the fusion image target and the source image target. The structural loss and semantic loss are summed to obtain the total loss ($L_{total}$), which is backpropagated through the network to update the network parameters. The $L_{total}$ function is defined as follows:
$$L_{total} = ST(I_F, I_{ir}, I_{vi}) + SE(I_{FT}, I_T).$$
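The data flow of the framework can be illustrated with a minimal Python sketch on tiny nested-list images. It is not the authors' implementation: `fuse` (a plain average standing in for the network F), `weighted_target` (a stand-in for W), and the mean absolute difference used for both loss terms are hypothetical placeholders chosen only to make the steps concrete.

```python
def fuse(ir, vi):
    """Stand-in for the fusion network F: pixel-wise average (assumption)."""
    return [[(a + b) / 2 for a, b in zip(r_ir, r_vi)] for r_ir, r_vi in zip(ir, vi)]


def apply_mask(img, mask):
    """Keep only the pixels where the binary mask is 1."""
    return [[p * m for p, m in zip(r_i, r_m)] for r_i, r_m in zip(img, mask)]


def weighted_target(ir, vi, mask, w_ir=0.5, w_vi=0.5):
    """Stand-in for W: weighted blend of the sources inside the mask (assumption)."""
    blend = [[w_ir * a + w_vi * b for a, b in zip(r_ir, r_vi)] for r_ir, r_vi in zip(ir, vi)]
    return apply_mask(blend, mask)


def mean_abs_diff(a, b):
    """Placeholder distance used for both loss terms in this sketch."""
    diffs = [abs(x - y) for r_a, r_b in zip(a, b) for x, y in zip(r_a, r_b)]
    return sum(diffs) / len(diffs)


def total_loss(ir, vi, mask):
    fused = fuse(ir, vi)                      # fusion image
    target = weighted_target(ir, vi, mask)    # source image target
    fused_target = apply_mask(fused, mask)    # fusion image target
    structural = 0.5 * mean_abs_diff(fused, ir) + 0.5 * mean_abs_diff(fused, vi)
    semantic = mean_abs_diff(fused_target, target)
    return structural + semantic


ir = [[0.0, 1.0], [1.0, 0.0]]
vi = [[1.0, 1.0], [0.0, 0.0]]
mask = [[0, 1], [1, 0]]
loss = total_loss(ir, vi, mask)
```

In a real training loop, the value returned by `total_loss` would be backpropagated through the fusion network; the segmentation masks are computed beforehand and are not updated.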

Network Architecture
The architecture of RSDFusion is shown in Figure 2. It has two main parts: a feature extraction network and a feature reconstruction network. The feature extraction network contains two separate feature extractors. Each feature extractor contains a 1 × 1 convolutional block and three RABlocks adapted from ResBlock [27]. Each RABlock consists of two 3 × 3 convolutional blocks with BN layers, a spatial attention module, and a 1 × 1 convolutional layer on the skip connection. The spatial attention module contains two 1 × 1 convolution layers and a sigmoid function. RABlock improves the ability of the feature extraction network to focus on essential features while mitigating vanishing or exploding gradients. The activation function in the network is the leaky rectified linear unit (LReLU). This function helps to speed up network training while avoiding the problem of neurons that stop learning. In addition, because of the differences between infrared and visible images, the two extractors have identical network structures but independent convolutional parameters.
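To make the spatial attention module more concrete, the following pure-Python sketch applies two 1 × 1 convolutions and a sigmoid at every pixel and uses the result to gate the features. Several details are assumptions rather than statements from the paper: the LReLU between the two convolutions, its negative slope of 0.2, the single-channel attention map, and the multiplicative gating.

```python
import math


def leaky_relu(x, slope=0.2):
    # The slope value is an assumption; the paper does not state it.
    return x if x >= 0 else slope * x


def conv1x1(pixel, weights, bias):
    """A 1x1 convolution at one pixel is a linear map across channels."""
    return [sum(w * c for w, c in zip(row, pixel)) + b for row, b in zip(weights, bias)]


def spatial_attention(feature_map, w1, b1, w2, b2):
    """feature_map is an H x W x C nested list; returns a gated map of the same shape."""
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            hidden = [leaky_relu(v) for v in conv1x1(pixel, w1, b1)]
            logit = conv1x1(hidden, w2, b2)[0]       # one attention value per pixel
            gate = 1.0 / (1.0 + math.exp(-logit))    # sigmoid, in (0, 1)
            out_row.append([gate * c for c in pixel])
        out.append(out_row)
    return out


# With all-zero weights the gate is sigmoid(0) = 0.5 at every pixel.
features = [[[1.0, 2.0]]]
gated = spatial_attention(features, w1=[[0.0, 0.0]], b1=[0.0], w2=[[0.0]], b2=[0.0])
```

Because a 1 × 1 convolution only mixes channels, the attention value at each pixel depends on that pixel's features alone, which keeps the module cheap.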
The feature reconstruction network contains three 3 × 3 convolutional blocks and one 1 × 1 convolutional block. Among them, the activation function used after the 1 × 1 convolutional block is the hyperbolic tangent function (Tanh). This function ensures that the fusion image has the same range of variation as the input image.

Semantic Segmentation Module
We used the semantic segmentation network SegFormer to semantically segment the infrared and visible dataset RoadScene and manually corrected the segmented images to obtain a new infrared and visible dataset with semantically segmented images, RSS (RoadScene-Seg). The SegFormer network is a lightweight segmentation network based on a transformer and a multilayer perceptron that has few parameters, trains quickly, and performs well. The structure of the SegFormer network is shown in Figure 3. It uses an encoder-decoder structure: the encoder consists of four transformer blocks that output features at different scales, and the decoder uses a lightweight multilayer perceptron (MLP) to aggregate the multiscale features and an UpSample layer to recover the original resolution. The network is trained on the ADE20K dataset [28], which covers a wide range of scenes and object classes. After the semantic segmentation of the image, each target object in the semantic image can be easily extracted by following the palette of the ADE20K dataset.
Figure 3. The specific architecture of SegFormer.
A total of 1908 image sets are included in the RSS dataset. Each image set contains an infrared image, a visible image, an infrared semantically segmented image, and a visible semantically segmented image. The segmented source image contains a large number of semantic objects, but not all of them are necessary. Therefore, semantics-based image fusion is concerned with how to select the necessary semantic target objects. In this paper, we argue that the targets selected from the source image should suit the task to be performed. For different usage scenarios, the appropriate key target objects should be selected. For example, the segmentation of people, cars, and roads is crucial in autonomous driving. Therefore, in this image fusion process, the semantic targets of people, cars, and roads are given higher priority and are the important semantic target objects to be retained in the final fusion image. In addition, the semantic segmentation images of infrared and visible images are very different, and some semantic objects only exist in infrared or visible images. Therefore, to avoid losing important targets from the source images, we use masks for people and cars from the infrared images, masks for roads and plants from the visible images, and masks for the sky from both the infrared and visible images.
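The palette-based target extraction described above can be sketched as follows. The colours in `PALETTE` are hypothetical placeholders, not the real ADE20K palette values, and images are represented as nested lists to keep the example self-contained.

```python
# Hypothetical palette: these RGB values are placeholders, not real ADE20K colours.
PALETTE = {
    "person": (255, 0, 0),
    "car": (0, 0, 255),
    "road": (128, 128, 128),
}


def class_mask(seg_image, colour):
    """Binary mask that is 1 where the segmented image has the given colour."""
    return [[1 if px == colour else 0 for px in row] for row in seg_image]


def extract_target(image, mask):
    """Keep the source pixels inside the mask and zero out everything else."""
    return [[p * m for p, m in zip(r_img, r_mask)] for r_img, r_mask in zip(image, mask)]


seg_ir = [
    [(255, 0, 0), (0, 0, 255)],
    [(128, 128, 128), (255, 0, 0)],
]
infrared = [[10, 20], [30, 40]]

person_mask = class_mask(seg_ir, PALETTE["person"])
person_pixels = extract_target(infrared, person_mask)
```

Following the selection rule in the text, the person and car masks would be taken from the infrared segmentation and the road and plant masks from the visible segmentation.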

Loss Function
To better preserve the source image's overall structure and enhance the fusion image's semantic information, we propose a new loss function. It consists of a global structural loss ($L_{SSIM}$) and a local semantic object loss ($L_{semantic}$). Our loss function is defined as follows:
$$L_{total} = \alpha L_{SSIM} + \beta L_{semantic},$$
where $L_{SSIM}$ mainly constrains the global structural features in the fusion image, while $L_{semantic}$ constrains the fusion image to retain more detailed features and semantic information of the source image targets. $\alpha$ and $\beta$ are adjustment factors that balance the global structural loss and the local semantic loss. The global structural loss ($L_{SSIM}$) is obtained by summing the structural losses between the fusion image and the infrared and visible images as follows:
$$L_{SSIM} = \omega_1 \left(1 - SSIM(I_f, I_{ir})\right) + \omega_2 \left(1 - SSIM(I_f, I_{vi})\right),$$
where $I_f$ is the fusion image; $I_{ir}$ is the infrared image; $I_{vi}$ is the visible image; and $\omega_1$ and $\omega_2$ are the structural loss coefficients, which control the degree of influence of the infrared and visible structural features on the fusion image during the fusion process. $\omega_1$ and $\omega_2$ have values in the range [0, 1], and $\omega_1 + \omega_2 = 1$. The SSIM [29] function considers three elements of the image: luminance, contrast, and structure. The SSIM formula is as follows:
$$SSIM(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y),$$
where $l(x, y)$ compares the global brightness, $c(x, y)$ compares the global contrast, and $s(x, y)$ compares the global structure of the fusion image and the source image.
The local semantic target loss ($L_{semantic}$) drives the locally important targets of the fusion image toward those of the source images. Infrared and visible images are segmented with a semantic segmentation model, and the semantic objects in the segmented images are extracted using a mask. However, a large number of semantic objects are extracted from the source images. For the application scenario of the training set used, we set the main targets as people, vehicles, sky, roads, and plants. The local semantic object loss function is defined as follows:
$$L_{semantic} = \lambda_1 L_{person} + \lambda_2 L_{car} + \lambda_3 L_{sky} + \lambda_4 L_{road} + \lambda_5 L_{green} \tag{9}$$
where $\lambda_1, \ldots, \lambda_5$ weight the individual targets and the local semantic loss of the person is defined as
$$L_{person} = \frac{1}{N}\left(\omega_3 \left\| F_{person} - IR_{person} \right\|_2 + \omega_4 \left\| F_{person} - VIS_{person} \right\|_2\right),$$
where $F_{person}$ is the person pixel in the fusion image, $IR_{person}$ is the person pixel in the infrared image, $VIS_{person}$ is the person pixel in the visible image, $\|\cdot\|_2$ represents the $l_2$-norm, $N$ represents the number of pixels within the mask that have a value of 1, and $\omega_3$ and $\omega_4$ denote the person pixel significance coefficients in the infrared and visible images. Both coefficients lie in the range [0, 1], and $\omega_3 + \omega_4 = 1$.
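A minimal sketch of how the weighted loss terms could be combined is given below. It assumes that each per-target loss is the weighted sum of the l2-norms of the masked differences to the infrared and visible images, divided by the number of mask pixels N; this reading, the function names, and all numeric weights are illustrative assumptions, and the SSIM term is passed in as a precomputed number.

```python
import math


def masked_l2(a, b, mask):
    """l2-norm of the difference between two images, restricted to mask pixels."""
    total = 0.0
    for r_a, r_b, r_m in zip(a, b, mask):
        for x, y, m in zip(r_a, r_b, r_m):
            total += m * (x - y) ** 2
    return math.sqrt(total)


def target_loss(fused, ir, vis, mask, w_ir, w_vis):
    """Per-target loss: weighted l2 distances to both sources, divided by N."""
    n = sum(sum(row) for row in mask)
    if n == 0:  # the target is absent from this image
        return 0.0
    return (w_ir * masked_l2(fused, ir, mask) + w_vis * masked_l2(fused, vis, mask)) / n


def semantic_loss(fused, ir, vis, masks, lambdas, omegas):
    """Weighted sum of the per-target losses (person, car, sky, road, plants)."""
    return sum(
        lambdas[name] * target_loss(fused, ir, vis, masks[name], *omegas[name])
        for name in masks
    )


def total_loss(l_ssim, l_semantic, alpha, beta):
    """Balance the global structural term against the local semantic term."""
    return alpha * l_ssim + beta * l_semantic


fused = [[1.0, 0.0]]
ir = [[4.0, 0.0]]
vis = [[1.0, 0.0]]
masks = {"person": [[1, 0]]}
l_sem = semantic_loss(fused, ir, vis, masks, lambdas={"person": 2.0}, omegas={"person": (0.5, 0.5)})
l_tot = total_loss(0.1, l_sem, alpha=1.0, beta=0.5)
```

Setting the weight of a target to zero removes it from the objective, which is how the choice of important targets for a given scene can be expressed in this formulation.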

Experimental Validation
In this section, we first present the experimental setup of this work. Then, we compare the proposed method with nine other representative methods on public datasets. Finally, we further validate the proposed method with ablation and efficiency experiments.

Experimental Configuration
(1) Datasets: We performed quantitative and qualitative experiments on our algorithms using the RoadScene and TNO datasets. The RoadScene dataset is an infrared and visible light dataset for autonomous driving containing 221 image pairs including people, vehicles, and roads. The TNO dataset is currently the most classic dataset for infrared and visible light image fusion tasks, and it consists of 60 image pairs from different military scenes taken with cameras at different wavelengths.
(2) Evaluation metrics: The evaluation of fusion image performance is mainly divided into two categories: subjective evaluation and objective evaluation. Subjective evaluation is generally based on the visual perception of the user. Usually, the more infrared salient targets and visible detailed textures the fusion image contains, the better the user's subjective evaluation is. Objective evaluation uses suitable quantitative metrics to assess the performance of the fusion network. In this paper, the following six objective metrics were selected: entropy (EN) [30], standard deviation (SD) [31], mutual information (MI) [32], structural similarity index measure (SSIM) [29], visual information fidelity (VIF) [33], and the sum of the correlations of differences (SCD) [34].
The mathematical formula of EN, which measures the amount of information in an image, is as follows:
$$EN = -\sum_{l=0}^{L-1} p_l \log_2 p_l,$$
where $L$ is the total number of gray levels and $p_l$ is the normalized histogram value of gray level $l$ in the fusion image. A larger EN indicates more information contained in the fusion image and better performance of the network. MI is a metric that quantifies how much information has passed from the source images to the fusion image. It mainly computes the correlation between the source and fusion images. The MI between a source image $X$ and the fusion image $F$ is computed as follows:
$$MI_{X,F} = \sum_{x,f} P_{X,F}(x, f) \log \frac{P_{X,F}(x, f)}{P_X(x) P_F(f)},$$
where $P_{X,F}(x, f)$ is the joint histogram of the source and fusion images, and $P_X(x)$ and $P_F(f)$ are the marginal histograms of the source and fusion images. The higher MI is, the more information of the source image the fusion image contains. SD is a metric of a fused image's contrast and distribution. The SD mathematical formula is as follows:
$$SD = \sqrt{\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\left(F(i, j) - \mu\right)^2},$$
where $F(i, j)$ is the pixel value of the $H \times W$ fusion image at position $(i, j)$ and $\mu$ is the average of the fusion image pixels. Because the human visual system pays a lot of attention to high-contrast regions, fusion results with higher SD have better contrast. SSIM is a metric that measures the structural similarity between the fusion image and the source image, and its formula is shown in Equation (3); its range is [−1, 1]. When SSIM is equal to 1, the two images are identical. A higher SSIM therefore indicates that the fusion image preserves more of the structure of the source images, although an excessively high value can also reflect a fusion image that closely copies the source images and lacks distinctiveness.
VIF is a metric that quantifies the amount of information shared by the fusion and source images based on the human visual system and natural scene statistics; its formula is given in [33]. SCD is a metric for assessing the information richness of a fusion image by measuring the differences between the source and fusion images; its formula is given in [34].
RSDFusion was compared with nine popular methods: two traditional algorithms, namely, MDLatLRR [11] and GTF [35]; three AE-based methods, namely, DenseFuse [14], NestFuse [15], and RFN-Nest [36]; three CNN-based algorithms, namely, IFCNN [17], SeAFusion [18], and U2Fusion [37]; and one GAN-based algorithm, FusionGAN [19]. The fusion results for comparison were generated with the publicly available implementations of these nine methods.
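The histogram-based metrics above can be illustrated with a short pure-Python sketch that operates on flat lists of gray levels. It is a sketch under stated assumptions rather than the evaluation code used in the experiments: logarithms are taken to base 2, SD is computed as the population standard deviation, and SCD follows its commonly used definition as the sum of two Pearson correlations between each source image and the difference of the fused image and the other source. VIF is omitted because it requires a multiscale model.

```python
import math
from collections import Counter


def entropy(pixels):
    """EN: Shannon entropy of the gray-level histogram, in bits."""
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())


def standard_deviation(pixels):
    """SD: population standard deviation of the pixel values."""
    mu = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mu) ** 2 for p in pixels) / len(pixels))


def mutual_information(source, fused):
    """MI between one source image and the fused image, from joint histograms."""
    n = len(source)
    joint, p_x, p_f = Counter(zip(source, fused)), Counter(source), Counter(fused)
    return sum(
        (c / n) * math.log2((c / n) / ((p_x[x] / n) * (p_f[f] / n)))
        for (x, f), c in joint.items()
    )


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)


def scd(src_a, src_b, fused):
    """SCD (assumed definition): r(F - B, A) + r(F - A, B)."""
    diff_a = [f - b for f, b in zip(fused, src_b)]
    diff_b = [f - a for f, a in zip(fused, src_a)]
    return pearson(diff_a, src_a) + pearson(diff_b, src_b)


en = entropy([0, 0, 255, 255])
sd = standard_deviation([0, 0, 255, 255])
mi_same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
scd_value = scd([1, 2, 3, 4], [4, 3, 2, 1], [5, 5, 5, 5])
```

In the last example the fused signal is the exact sum of the two sources, so each difference reproduces one source and SCD reaches its maximum of 2.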
(3) Training setup: We used the RoadScene dataset to train our RSDFusion model. From it, 180 pairs of infrared and visible images were selected; then, a semantic segmentation network was used to segment these images. Each set of images after segmentation contained an infrared image, an infrared semantic image, a visible image, and a visible semantic image. A sliding window of 256 × 384 with a step size of 32 was used to crop the images to obtain more training images. After cropping, a total of 1908 image sets were obtained for training. For testing, 21 pairs of typical images were chosen from each of the TNO and RoadScene datasets for comparison experiments. The hyperparameters in network training were defined as follows: the training batch size was 16; the number of iterations was 10; the learning rate was $5 \times 10^{-4}$; and the optimizer was Adam. The proposed method was implemented on the PyTorch platform. All experiments were performed on an Intel i9-11900 CPU and an NVIDIA GeForce RTX 3090 GPU.
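The sliding-window cropping used to enlarge the training set can be sketched as follows. The sketch assumes that 256 is the window height and 384 the window width, and that windows that would extend past the image border are discarded; both are assumptions, and the image sizes in the example are hypothetical.

```python
def crop_origins(height, width, win_h=256, win_w=384, step=32):
    """Top-left corners of all full windows that fit inside the image."""
    return [
        (top, left)
        for top in range(0, height - win_h + 1, step)
        for left in range(0, width - win_w + 1, step)
    ]


def crop(image, top, left, win_h=256, win_w=384):
    """Cut one window out of an image stored as a list of rows."""
    return [row[left:left + win_w] for row in image[top:top + win_h]]


# A hypothetical 320 x 448 image yields 3 vertical x 3 horizontal positions.
origins = crop_origins(320, 448)
```

The same origins must be applied to the infrared image, the visible image, and both semantic images of a set so that the four crops stay aligned.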

Comparative Experiments
We compared RSDFusion with nine other methods on the RoadScene dataset to evaluate its performance.
(1) Qualitative comparison: To visually compare the performance difference between our method and other algorithms, we selected two pairs of typical source images from the RoadScene dataset for subjective evaluation, both of which contained semantic segmentation objects to be extracted and focused on in subsequent evaluations and which were captured in the daytime and nighttime, respectively. The results of the compared algorithms are shown in Figures 4 and 5. In the figures above, we selected a salient object (i.e., red box) from each set of fusion images and enlarged it in the lower right corner of the fusion image for observation. As shown in Figure 4, the infrared target was severely missing with U2Fusion and could not be captured significantly. There was a lack of background information with FusionGAN and GTF, such as the lack of detailed texture information of clouds in the sky. The RFN-Nest fusion image had low contrast, and the overall brightness was dark. The infrared target information was retained with IFCNN and MDLatLRR, but the target saliency was insufficient. In contrast, SeAFusion and RSDFusion could retain high-quality infrared targets. In particular, RSDFusion provided more detailed texture information while highlighting the infrared target, which is more realistic in terms of visual effects.
As shown in Figure 5, RSDFusion retained high-quality infrared targets and relatively more visible detailed texture information. Only SeAFusion and our method preserved the mountains and trees in the background, but the contrast between the main targets and the background was higher in the RSDFusion fusion images. The above qualitative comparison shows that RSDFusion can extract high-quality targets while preserving rich detailed texture information. The targets extracted with the semantic network mechanism better preserve the target information and edge areas of the source image, thus producing images that are more comfortable to view.
(2) Quantitative comparison: A total of 21 pairs of images from the RoadScene dataset were selected for quantitative comparison. The objective comparison results of each method on six metrics are shown in Figure 6 and Table 1. On the RoadScene dataset, RSDFusion achieved the best results, with significant advantages in four metrics, EN, SD, MI, and VIF, and the rest of the metrics were at an average level. Among them, the highest value of EN indicates that the fusion images generated using RSDFusion were richer in information. The highest value of VIF indicates that the fusion image had the best visual effect. The value of SSIM indicates that the fusion image generated using RSDFusion had higher brightness and contrast than the source image, which reduces the similarity of brightness and contrast in this index.
Table 1. Twenty-one pairs of images from the RoadScene dataset were compared quantitatively on six metrics, namely, EN, SD, MI, SSIM, VIF, and SCD. The red color represents the best result, and the blue color represents the second-best result.
In summary, extensive experiments on the RoadScene dataset show that our algorithm has superior semantic target object extraction and detailed texture retention, and the generated fusion images have superior visual quality. We attribute this to the following factors: First, we accurately capture the extracted targets using semantic loss, which improves the network's control over local image regions. Second, using the global structural loss function, we constrain the fusion image to retain more detailed texture information of the source image while ensuring the extraction of the key targets.

Generalization Comparison
To evaluate the generalization performance of the model, further comparisons were performed on the TNO dataset.
(1) Qualitative comparison: Two pairs of typical source images were selected from the TNO dataset for subjective evaluation; they were taken in clear and foggy scenes at night, respectively. Their comparison results are shown in Figures 7 and 8. First, as shown in Figure 7, the infrared target was severely lost with RFN-Nest, and no infrared target could be clearly detected. Second, the targets produced with MDLatLRR and DenseFuse were too bright compared with the targets in the infrared images, and their edges were blurred. In addition, infrared target information was retained with GTF, IFCNN, and NestFuse, but the infrared targets were contaminated by information from the visible images. U2Fusion, STDFusion, SeAFusion, and RSDFusion could retain high-quality infrared targets. In particular, RSDFusion had more detailed texture information while highlighting the infrared target, which is more realistic in visual effect. According to Figure 8, RSDFusion is the only method that preserved both the visible detail features on the right and the infrared flow targets.
(2) Quantitative comparison: A total of 21 pairs of images from the TNO dataset were selected for quantitative comparison. The objective evaluation results are shown in Figure 9 and Table 2. Similar to the results on the RoadScene dataset, RSDFusion achieved the best results on the TNO dataset, with significant advantages in the EN, SD, MI, and VIF metrics. The advantages were more obvious than on the RoadScene dataset because of the greater brightness and contrast between the main target and the background of the source images in the TNO dataset. In addition, our algorithm still had the best stability on the TNO dataset. Overall, qualitative and quantitative experiments show that RSDFusion has a good generalization ability.

Efficiency Comparison
In image fusion tasks, operational efficiency is a significant factor in evaluating the performance of image fusion models. The fusion model designed in this paper, RSDFusion, is a lightweight network that must run efficiently to support real-time image fusion. The efficiency of RSDFusion and nine other methods was tested on the RoadScene and TNO datasets in the same hardware environment. As shown in Table 3, all methods using deep learning had significant advantages in terms of operational efficiency. In particular, our method outperformed all other methods except IFCNN on the TNO and RoadScene datasets. Overall, RSDFusion has a short and stable average running time, which enables real-time image fusion.
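One way to measure the average running time and its stability is sketched below. The function names and the single warm-up pass are illustrative assumptions rather than the protocol used for Table 3; for a model running on a GPU, the device would additionally need to be synchronized before each timestamp is read.

```python
import statistics
import time


def average_runtime(fuse_fn, image_pairs, warmup=1):
    """Return the mean and standard deviation of the per-pair running time in seconds."""
    # Warm-up runs are excluded so that one-off start-up costs do not skew the mean.
    for ir, vi in image_pairs[:warmup]:
        fuse_fn(ir, vi)
    times = []
    for ir, vi in image_pairs:
        start = time.perf_counter()
        fuse_fn(ir, vi)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.pstdev(times)


# Toy stand-in for a fusion model: averages two numbers.
mean_s, std_s = average_runtime(lambda ir, vi: (ir + vi) / 2, [(1, 2), (3, 4), (5, 6)])
```

A small standard deviation relative to the mean corresponds to the stable running time that real-time use requires.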

Ablation Experiments
In our model, semantic loss and structural loss are significant components of the training network. To verify the soundness of the designed semantic and structural loss functions, we trained the fusion model without semantic and structural loss, respectively. As shown in Figure 10, if trained without semantic loss, the network unintentionally retains the most significant information of the source image, resulting in poor distinction between the main semantic target and the background of the final fusion image. In the case of training without structural loss guidance, the quality of the fusion image is low, with problems such as blurred target edges and a lack of detailed texture information. In contrast, the fusion result of RSDFusion preserves not only the semantic targets of the source image but also the rich texture details of the source image.
As shown in Table 4, the full RSDFusion model achieves the best values on the four metrics EN, SD, MI, and VIF. Thus, our method retains clearer and richer semantic targets and produces better visual quality.

Conclusions
In this paper, we propose a new real-time semantics-driven infrared and visible image fusion framework, RSDFusion. The framework introduces semantic segmentation into image fusion, divides the source image into several semantic objects, designs an adapted loss function for its important semantic target objects, and combines it with global structural loss to drive network training. As a result, the fusion images we generate not only preserve the important targets of the source images but also retain a large amount of the texture detail information of the source images. The method also has the limitations that the generality of the task needs to be improved and the fusion network needs to be retrained when the target of interest for the task changes. In extensive qualitative and quantitative experiments on the RoadScene and TNO datasets, our fusion method RSDFusion outperformed other popular methods in both subjective visualization and objective metric measures. Furthermore, the efficient operation of our algorithm supports the real-time fusion of source images.