Entropy

26 April 2023

Multi-Modality Image Fusion and Object Detection Based on Semantic Information

1 School of Software Technology, Dalian University of Technology, Dalian 116620, China
2 International School of Information Science & Engineering, Dalian University of Technology, Dalian 116620, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning Models and Applications to Computer Vision

Abstract

Infrared and visible image fusion (IVIF) aims to provide informative images by combining complementary information from different sensors. Existing IVIF methods based on deep learning focus on strengthening the network with increasing depth but often ignore the importance of transmission characteristics, resulting in the degradation of important information. In addition, while many methods use various loss functions or fusion rules to retain complementary features of both modes, the fusion results often retain redundant or even invalid information. In order to accurately extract the effective information from both infrared images and visible light images without omission or redundancy, and to better serve downstream tasks such as target detection with the fused image, we propose a multi-level structure search attention fusion network based on semantic information guidance, which realizes the fusion of infrared and visible images in an end-to-end way. Our network has two main contributions: the use of neural architecture search (NAS) and the newly designed multi-level adaptive attention block (MAAB). These methods enable our network to retain the typical characteristics of the two modes while removing information that is useless for the detection task from the fusion results. In addition, our loss function and joint training method can establish a reliable relationship between the fusion network and subsequent detection tasks. Extensive experiments on the new dataset (M3FD) show that our fusion method has achieved advanced performance in both subjective and objective evaluations, and the mAP in the object detection task is improved by 0.5% compared to the second-best method (FusionGAN).

1. Research Background and Introduction

A single sensor has its limitations, and it is challenging to create a thorough, credible, and accurate description of a multitude of scenarios involving people, vehicles, roads, traffic lights, and so on. This has emerged as the biggest obstacle to the ability of intelligent systems to carry out various complex tasks. The core capabilities of the entire intelligent system, such as information gathering and intelligent cognition, have advanced in recent years thanks to the rapid development of multi-mode sensors [1]. Among them, visible and infrared images, which serve as the primary visual data sources for intelligent systems, are crucial for a variety of perception tasks.
Infrared and visible images have very different imaging principles and feature representations. While visible images can more effectively present the scene texture details and retain the illumination intensity information, infrared images aim to highlight the overall contour characteristics of the object. However, due to hardware and environment factors, blur and halos may appear in infrared images. Therefore, it is crucial to understand how to fully exploit their benefits and combine infrared and visible images.
Additionally, the demands for multimodal fusion and subsequent downstream tasks, such as object detection and semantic segmentation, are growing rapidly with the vigorous development of video surveillance and automated driving. However, the majority of current approaches focus on designing networks or models that improve the visual quality of the fused image, ignoring the fact that a fused image which only matches human vision can hardly meet the perceptual requirements of subsequent tasks, such as object detection and segmentation.
The approach suggested in this paper aims to address the existing practical issues. In order to effectively address the underlying visual issues in challenging environments, this method employs two techniques: multi-layer semantic information guidance and a neural network architecture search. These techniques work together to enhance the visual quality and effects of fused images as well as the performance of downstream tasks.
Aiming to fully utilize semantic information and conduct target detection-oriented fusion research, this paper employs a neural network search scheme and attention mechanism to process various downstream tasks. The main research work and contribution of this paper can be divided into two main parts:
  • In order to reduce feature redundancy and preserve complementary information, we designed a multi-level adaptive attention block (MAAB) in the network, which allows our network learning to retain rich features at different scales, and more efficiently and effectively integrate high-level semantic information.
  • To discard the limitations of the existing manually constructed neural network structure, we introduce a neural architecture search (NAS) in the construction of the overall network structure, so as to adaptively search the network structure that is suitable for the current fusion task.
This paper will be divided into the following sections.
Section 1 introduces the research background and the significance of IVIF, as well as the subsequent target detection. This section briefly introduces the research status of a deep learning scheme of multimodal image fusion [2,3,4], and puts forward the main work and innovation of this paper.
Section 2 introduces the research status of infrared and visible image fusion, including traditional methods and deep learning methods [5,6,7,8].
In Section 3, an infrared visible image fusion and detection algorithm guided by semantic information is proposed. This section introduces the specific network structure, the construction of loss function, and the design of the search space.
In Section 4, experiments are carried out to verify the effectiveness of the method. Moreover, the final section summarizes this paper.

3. The Proposed Method

3.1. Method Motivation

As mentioned in the introduction, the goal of IVIF is to retain complementary information and remove redundant information from the two modal images. At present, most fusion methods achieve high-quality fusion results by increasing the network depth or manually designing various loss functions. We anticipate that the foreground object in the fused image will be nearer to the infrared image, consistent with the top-down attention mechanism of human vision, which directs our focus toward the highlighted regions of the infrared image. In addition, saliency representation has achieved great success in the field of computer vision; it extracts salient features from images in a comparable way. For infrared and visible image pairs, the inherent contrast between the two modalities allows a saliency representation to distinguish the foreground target from the background details.
Moreover, it is well known that high-level semantic information can effectively guide subsequent downstream tasks. When performing infrared-visible image fusion and subsequent target detection simultaneously, incorporating such rich semantic information is expected to enhance feature representation and improve target detection results.
Finally, considering that manually designed network structures may have difficulty adapting to the introduction of high-level semantic information, we propose using neural architecture search with a high upper limit and good effectiveness to determine sub-operations. Thus, we propose a multi-level attention-guided search fusion network based on semantic information. By incorporating saliency information constraints into the loss function, designing multi-level attention modules, and introducing NAS, we expect that this method can generate visual fusion results with clear thermal targets and rich real details.

3.2. Network Architecture

The network structure is composed of three parts: a high-level semantic network, a multi-layer adaptive attention block, and a searched residual network. The overall structure of the whole network is shown in Figure 1; each part will be introduced separately below.
Figure 1. The overall architecture of the proposed method.

3.2.1. High-Level Semantic Network

In this paper, we use the pre-trained VGG16 network as the basic framework of the high-level semantic network ϕ. VGGNet has a simple structure, which is suitable for image classification tasks. However, in other research fields of computer vision, researchers have found that networks with excellent pre-trained weights generalize well when transferred to other image data. Therefore, VGGNet is still often used to extract high-level image features.
As shown in Figure 2, we extract high-level semantic information at three different scales in VGG16. The larger the feature map, the richer the details and textures it carries, while smaller feature maps reflect, to some extent, the overall pixel distribution and salient areas. The extracted features are f_u^2, f_u^1, and f_u^0, with dimensions (64, H, W), (128, H/2, W/2), and (256, H/4, W/4), respectively.
Figure 2. The overall architecture of the high-level semantic network (part of VGG16).
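As a minimal sketch of this multi-scale feature extraction, the snippet below taps a torchvision VGG16 after the relu1_2, relu2_2, and relu3_3 layers; these tap points are an assumption chosen because they match the stated channel counts and resolutions.

```python
import torch
import torchvision

# Pre-trained VGG16 backbone; only its convolutional "features" are used.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()

def semantic_features(img: torch.Tensor):
    """img: (B, 3, H, W), preprocessed as expected by torchvision's VGG16."""
    feats, x = [], img
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            if idx in (3, 8, 15):   # relu1_2, relu2_2, relu3_3
                feats.append(x)
            if idx == 15:
                break
    # [f_u^2 (B,64,H,W), f_u^1 (B,128,H/2,W/2), f_u^0 (B,256,H/4,W/4)]
    return feats
```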

3.2.2. Multi-Layer Adaptive Attention Block

To adapt the image features of the two modes to their semantic information at different scales, this paper proposes a multi-layer adaptive attention block (MAAB) [47,48].
As shown in Figure 3, under the guidance of a similar spatial attention mechanism, the module is designed to have an adaptive structure for processing at different scales, where i denotes the number of downsampling operations applied. As the feature map size continuously decreases, we reduce it in the order of i = 2, 1, 0 to ensure a minimum feature size and prevent excessive loss of structural information.
Figure 3. The architecture of MAAB_i, i = 2, 1, 0.
Specifically, the module first combines the feature information of the two modalities through the searched convolution layer and then aggregates information using average-pooled and max-pooled features. After i downsampling operations, the module fully captures both local and global feature information. Skip connections are added at the corresponding convolution layers to enrich the amount of information and compensate for some of the losses. These intermediate features, which contain spatially significant information, are then sent to the channel attention module introduced below to obtain richer and more effective deep semantic information [34,35,36].
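A minimal sketch of this spatial stage follows; the channel counts, the single 3 × 3 convolution standing in for the searched layer, and the bilinear upsampling back to the input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAABSpatial(nn.Module):
    """Illustrative spatial stage of MAAB_i: fuse the two modality features,
    aggregate average- and max-pooled responses over i downsampling steps,
    and add a skip connection to compensate for information loss."""
    def __init__(self, channels: int, i: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)   # stands in for the searched conv
        self.refine = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.i = i

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        x = self.fuse(torch.cat([feat_ir, feat_vis], dim=1))
        skip = x
        for _ in range(self.i):                       # i = 2, 1, or 0 downsampling steps
            avg = F.avg_pool2d(x, 2)
            mx = F.max_pool2d(x, 2)
            x = self.refine(torch.cat([avg, mx], dim=1))
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return x + skip                               # skip connection
```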
As shown in Figure 4, the input features undergo a squeeze operation in the channel domain, generating a channel descriptor by integrating the feature map over the spatial dimensions. This descriptor embeds the global distribution of the channel-wise responses, allowing the network to use information from the global receptive field. The two channel descriptors are multiplied element-wise and passed through a softmax function to generate a weight map representing the degree of dependence between channels through a self-gating mechanism. The weight map is then element-wise multiplied with the other channel descriptor and reshaped into the original three-dimensional feature representation. Similarly, in order to reduce information loss in the process, a skip connection links the original input feature map to the final output [49].
Figure 4. The architecture of channel attention.
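The following sketch illustrates this channel-attention step; using average- and max-pooled descriptors is an assumption, since the text does not specify how the two descriptors are obtained.

```python
import torch
import torch.nn.functional as F

def channel_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, H, W). Squeeze the spatial dimensions into two channel
    descriptors, gate them with a softmax over channels, re-weight the
    input, and keep a skip connection to the original features."""
    b, c, _, _ = x.shape
    avg_desc = x.mean(dim=(2, 3))                    # squeeze: (B, C)
    max_desc = x.amax(dim=(2, 3))                    # second descriptor: (B, C)
    weights = F.softmax(avg_desc * max_desc, dim=1)  # self-gating across channels
    out = x * weights.view(b, c, 1, 1)               # reshape and re-weight
    return out + x                                   # skip connection
```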

3.2.3. Searched Residual Network

The method employed in this study uses different scales of semantic information, and determining how to use this information manually can be challenging. To improve the network’s adaptability to semantic information, the study incorporates a neural network structure search to determine the network structure adaptively [47,48].
Considering that the original task of the directed acyclic graph used by DARTS is image classification, and that residual blocks and dense link blocks have exhibited outstanding performance in image fusion tasks, the search network proposed in this paper fixes the form of the residual network and only searches for the weights of the single-layer convolution. This not only reduces the search time and computation cost but also better adapts NAS to the image fusion task.
Constructing an effective search space is a critical task in the neural architecture search (NAS) domain. A well-designed search space should balance the complexity of the candidate operations with their ability to improve the network architecture. In this paper, we present ten effective and efficient operations that were selected from the search space, as illustrated in Figure 5:
Figure 5. The candidate operations in the search space.
  • Conv 1 × 1
  • Conv 3 × 3
  • Conv 5 × 5
  • DilConv 3 × 3
  • DilConv 5 × 5
  • ResConv 1 × 1
  • ResConv 3 × 3
  • ResConv 5 × 5
  • Conv 1 × 3 + 3 × 1
  • Conv 1 × 5 + 5 × 1
We integrated these operations, referred to as searched operations, into the encoders and decoders of both visible and infrared networks, as demonstrated in Figure 1.
The process of discovering searched operations involves three steps. First, a mixed operation is constructed by computing a weighted sum of the ten operations in the search space, with the weight coefficients represented by a weight matrix. Second, the training loss, which incorporates the weight matrix, is minimized through gradient descent optimization. The weight matrix is updated iteratively until convergence. Finally, the operation with the maximum weight in the weight matrix is chosen as the Searched Operation, and the search process is terminated.
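A minimal sketch of this search step, in the spirit of DARTS [45], is given below; only a subset of the candidate operations is spelled out, and the operation definitions and initialization are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One searched operation: a softmax-weighted sum of candidate operations,
    whose architecture weights form one row of the weight matrix."""
    def __init__(self, channels: int):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, 1),                          # Conv 1x1
            nn.Conv2d(channels, channels, 3, padding=1),               # Conv 3x3
            nn.Conv2d(channels, channels, 5, padding=2),               # Conv 5x5
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),   # DilConv 3x3
            nn.Conv2d(channels, channels, 5, padding=4, dilation=2),   # DilConv 5x5
            # ... residual and asymmetric candidates omitted for brevity
        ])
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.candidates)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)   # one row of the weight matrix
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def derive(self) -> nn.Module:
        """After the search converges, keep the candidate with the largest weight."""
        return self.candidates[int(self.alpha.argmax())]
```

During the search, the alpha vectors are updated by gradient descent on the training loss until convergence; calling derive() afterwards fixes the searched operation for retraining.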
Once we have identified each searched operation through the search process, the architecture of the visible and infrared networks is fully determined. We then retrain the network using the training data to obtain the training results. Integrating the searched operations leads to better performance compared to using manually designed convolutional operations in the encoder and decoder. The experimental results demonstrate the effectiveness and efficiency of this NAS approach.

3.3. Loss Function

The setting of the loss function is an important part of solving computer vision problems by using the deep learning method. This section will introduce each part of the loss function. In this section, I represents the image, and I A , I B , and I F represent the input visible image, infrared image, and fused image, respectively.

3.3.1. Pixel Loss Function

Considering that this is an image fusion task, we use the most common pixel-level constraint to guide the network training, which is defined as follows:
L_{pixel} = \frac{1}{HW}\left\| I_F - I_A \right\|_2^2 + \frac{1}{HW}\left\| I_F - I_B \right\|_2^2

3.3.2. Structure Loss Function

Overall, we expect the final fused image to be structurally similar to the source images, so the most commonly used structural similarity index (SSIM) is adopted as a loss function, which is defined as follows:
SSIM_{F,X} = \sum_{x,f} \frac{2\mu_x \mu_f + C_1}{\mu_x^2 + \mu_f^2 + C_1} \cdot \frac{2\sigma_x \sigma_f + C_2}{\sigma_x^2 + \sigma_f^2 + C_2} \cdot \frac{\sigma_{xf} + C_3}{\sigma_x \sigma_f + C_3},
L_{SSIM} = \left(1 - SSIM_{F,A}\right) + \left(1 - SSIM_{F,B}\right).
SSIM_{F,X} represents the structural similarity between the source image X (X denotes the source images A and B) and the fused image F; x and f, respectively, represent the patches of the source image and the fused image within the sliding window; σ_{xf} is the covariance of the source image patch and the fused image patch; σ_x and σ_f represent the standard deviations of the patches; μ_x and μ_f are the mean values of the source image patches and fused image patches; and C_1, C_2, C_3 are parameters used to stabilize the computation [50].
The brightness information of an object’s surface is related to the illumination of its environment and the object’s reflection coefficient. In natural scenes, an object’s structure and material are generally independent of ambient illumination. This means that the reflection coefficient is only related to the object itself. By taking this into consideration, we can explore the structural information in an image by separating the influence of illumination on the object.
From an image composition standpoint, the structural similarity index (SSIM) characterizes structural information as a feature that portrays the object’s arrangement in the scene, while being invariant to brightness and contrast. It models image distortion by combining three distinct elements: brightness, contrast, and structure. The mean is used as the estimate of brightness, the standard deviation as the estimate of contrast, and the covariance as the measure of structural similarity. Compared with the traditional measurement indicators, such as PSNR, the structural similarity is more consistent with the judgment of human eyes on image quality in the measurement of image quality.
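For reference, a minimal SSIM sketch is given below; it uses uniform sliding windows and the common two-term simplification with C_3 = C_2/2, since the exact window and constants are not specified here.

```python
import torch
import torch.nn.functional as F

def ssim(x: torch.Tensor, y: torch.Tensor, win: int = 11,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Mean SSIM between two (B, 1, H, W) images in [0, 1], with uniform windows."""
    mu_x = F.avg_pool2d(x, win, stride=1)
    mu_y = F.avg_pool2d(y, win, stride=1)
    var_x = F.avg_pool2d(x * x, win, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, stride=1) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, win, stride=1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim_map.mean()

def ssim_loss(fused: torch.Tensor, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
    """L_SSIM = (1 - SSIM(F, A)) + (1 - SSIM(F, B))."""
    return (1 - ssim(fused, vis)) + (1 - ssim(fused, ir))
```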

3.3.3. Gradient Loss Function

In addition, image gradient information can be used to represent texture details and scene structures. In order to enrich the details in the fused image, this paper uses gradient loss [51] to constrain texture factors, as follows
L_{grad} = \frac{1}{HW}\left\| \nabla I_F - \max\left(\left|\nabla I_A\right|, \left|\nabla I_B\right|\right) \right\|_2^2,
where ∇ denotes the gradient operation using the Sobel operator. The Sobel operator is an important processing method in the field of computer vision, mainly used to obtain the first-order gradient of a digital image. HW represents the product of the height and width of the image.
Technically, it is a discrete difference operator used to calculate an approximation of the gradient of the image brightness function. It weights the gray values of the pixels in a local neighborhood, and locations where the response reaches an extremum are taken as edges for edge detection. Applying this operator at any point in the image produces the corresponding gradient vector. In general, the Sobel operator produces good detection results and has a smoothing effect on noise, which is why it is widely used in image fusion. This paper calculates the Sobel operator in the X direction and Y direction, respectively, and combines them to obtain the final result. As shown above, the same background region often exhibits large gradient differences between images of different modalities. We take the maximum of the gradient information of the two modalities when calculating the loss function, which reduces the impact of modal differences.
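A minimal sketch of this gradient loss follows; combining the X and Y responses into a magnitude and using a mean-squared reduction in place of the explicit 1/HW factor are implementation assumptions.

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels for the X and Y directions (single-channel images).
SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_gradient(img: torch.Tensor) -> torch.Tensor:
    """Gradient magnitude of a (B, 1, H, W) image via the Sobel operator."""
    gx = F.conv2d(img, SOBEL_X.to(img), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def gradient_loss(fused: torch.Tensor, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
    """L_grad: match the fused gradient to the element-wise maximum of the
    source gradients, reducing the impact of modality differences."""
    target = torch.maximum(sobel_gradient(vis), sobel_gradient(ir))
    return F.mse_loss(sobel_gradient(fused), target)
```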

3.3.4. Target-Aware Loss Function

In this paper, we propose a weighted average technique based on a visual saliency map (VSM) to design the pixel-level loss function. VSM [52] can depict and highlight visual structures, regions, or objects in images, making it useful in various computer vision and computer graphics applications. In image fusion, VSM can reflect the salient features of the image, providing a regional semantic information constraint based on the distribution of pixel features.
Considering the simplicity and effectiveness of the method, this paper uses the method in [53] to construct the VSM. The algorithm defines pixel-level saliency based on the contrast of a pixel with all other pixels. Let P_i represent the intensity value of a pixel i in image I. The saliency value V(i) of pixel i is defined as
V(i) = \left|P_i - P_1\right| + \left|P_i - P_2\right| + \cdots + \left|P_i - P_N\right|,
where N denotes the total number of pixels in image I. The saliency values of two pixels are equal if two pixels have the same intensity value. Thus, we can rewrite this as follows:
V(i) = \sum_{j=0}^{G-1} M_j \left|P_i - P_j\right|,
where j denotes a pixel intensity, M_j represents the number of pixels whose intensity equals j, and G is the number of gray levels (256 in this paper). Then, V(i) is normalized to [0, 1]. Let V_B denote the VSM of the infrared image. The target-aware loss function can then be written as follows:
L_{target} = \frac{1}{HW}\left\| I_F \cdot V_B - I_B \cdot V_B \right\|_1
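The histogram-based VSM and the target-aware loss can be sketched as follows; the 8-bit quantization, min–max normalization, and mean-based L1 reduction are assumptions consistent with the description above.

```python
import torch

def visual_saliency_map(img: torch.Tensor, levels: int = 256) -> torch.Tensor:
    """Histogram-based VSM: V(i) = sum_j M_j * |P_i - P_j|, normalized to [0, 1].
    img: (H, W) tensor with values in [0, 1]."""
    gray = (img * (levels - 1)).round().long()                        # quantize to G gray levels
    hist = torch.bincount(gray.flatten(), minlength=levels).float()   # M_j
    j = torch.arange(levels, dtype=torch.float32)
    dist = (j.view(-1, 1) - j.view(1, -1)).abs()                      # |i - j| for all level pairs
    sal_per_level = (hist * dist).sum(dim=1)                          # saliency of each gray level
    vsm = sal_per_level[gray]                                         # look up per pixel
    return (vsm - vsm.min()) / (vsm.max() - vsm.min() + 1e-12)

def target_loss(fused: torch.Tensor, ir: torch.Tensor, vsm_ir: torch.Tensor) -> torch.Tensor:
    """L_target: L1 distance between fused and infrared images, weighted by the infrared VSM."""
    return torch.mean(torch.abs(fused * vsm_ir - ir * vsm_ir))
```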

3.3.5. Total Fusion Loss Function

Consequently, combining all of the loss functions above, the following total fusion loss function guides the learning of image fusion.
L_{fusion} = L_{pixel} + \alpha L_{SSIM} + \beta L_{grad} + \gamma L_{target},
where L_{pixel} is the pixel loss, and L_{SSIM} and L_{grad} are the structure loss and gradient loss, respectively; α, β, and γ are the trade-off parameters. In the settings of the neural architecture search, we also use this loss function on both the training set and the validation set, defined as follows:
L_{train} = L_{fusion} = L_{val}
In this way, the overall goal of the network can be consistently ensured, and the convergence of the structure and parameters can be facilitated.
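Putting the pieces together, a sketch of the total fusion loss (reusing the loss sketches above and the trade-off values reported in Section 4) might look like this:

```python
import torch
import torch.nn.functional as F

def fusion_loss(fused, vis, ir, vsm_ir, alpha=0.3, beta=0.5, gamma=1.2):
    """Total loss L_fusion = L_pixel + a*L_SSIM + b*L_grad + g*L_target.
    ssim_loss, gradient_loss, and target_loss are the sketches defined earlier."""
    l_pixel = F.mse_loss(fused, vis) + F.mse_loss(fused, ir)   # mean-squared stand-in for (1/HW)||.||^2
    return (l_pixel
            + alpha * ssim_loss(fused, vis, ir)
            + beta * gradient_loss(fused, vis, ir)
            + gamma * target_loss(fused, ir, vsm_ir))
```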

4. Experiments

We conducted experimental evaluations on three datasets, namely TNO, Roadscene, and M3FD. For the image fusion task, we selected 180 images from the 4500 images in the 3 datasets and converted them to grayscale. To enhance the data content and make better use of the images, we randomly cropped these images to generate 20 K pictures of size 64 × 64 for network training. During training, these image blocks were normalized to [−1, 1] before being fed to the network.
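A sketch of this patch preparation is shown below; the per-pixel scaling from [0, 255] to [−1, 1] is an assumption about how the normalization is applied.

```python
import torch

def random_patch_pair(vis: torch.Tensor, ir: torch.Tensor, size: int = 64):
    """Crop one aligned size x size patch from a registered grayscale pair
    (each of shape (1, H, W), values in [0, 255]) and normalize to [-1, 1].
    Repeating this yields the ~20 K training patches."""
    _, h, w = vis.shape
    top = int(torch.randint(0, h - size + 1, (1,)))
    left = int(torch.randint(0, w - size + 1, (1,)))
    crop = lambda x: x[:, top:top + size, left:left + size]
    norm = lambda x: crop(x).float() / 127.5 - 1.0
    return norm(vis), norm(ir)
```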
In the search process of the neural architecture search, we used Adam [54] as the optimizer on the training set and set the learning rate to 1 × 10⁻⁵. On the validation set, we used SGD as the optimizer and set the learning rate and the weight decay of the weight matrix to 1 × 10⁻² and 1 × 10⁻⁴, respectively. The batch size during training was set to 16, and the number of epochs was set to 100. After determining the structure, we used the Adam optimizer during training and set the learning rate to 2 × 10⁻⁵. The batch size during this training was set to 64, and the number of epochs was set to 20.
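A sketch of this bi-level optimizer set-up is given below; `fusion_net` and `arch_params` are placeholder names for the fusion network and the architecture weight matrix.

```python
import torch
import torch.nn as nn

def build_search_optimizers(fusion_net: nn.Module, arch_params):
    """Adam updates the network weights on the training set; SGD updates the
    architecture weight matrix on the validation set (rates as reported above)."""
    weight_optim = torch.optim.Adam(fusion_net.parameters(), lr=1e-5)
    arch_optim = torch.optim.SGD(arch_params, lr=1e-2, weight_decay=1e-4)
    return weight_optim, arch_optim

def build_retrain_optimizer(searched_net: nn.Module):
    """After the architecture is fixed, the derived network is retrained with Adam."""
    return torch.optim.Adam(searched_net.parameters(), lr=2e-5)
```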
In object detection, we used 4200 pairs of images labeled in M3FD for network training. We used YOLOv5 as the object detection network with the pre-trained YOLOv5s model. In object detection training, the image size was set to 320 × 320, and the other parameters followed the official default settings.
Our approach was implemented on PyTorch with an NVIDIA Tesla V100 GPU. The tuning parameters α , β , and γ were set to 0.3, 0.5, and 1.2, respectively.
First, as shown in Figure 6 below, we present a heat map of the neural architecture search results. The heat map depicts the weight matrix introduced in Section 3.2.3. The left subfigure illustrates the infrared network, while the right subfigure portrays the visible network. Each column in the figure corresponds to one of the ten operations in the search space, and each row corresponds to a searched operation in either the encoder or the decoder. The intensity of the color signifies the weight of the operation, with darker shades denoting greater weights. Ultimately, the operation with the highest weight is selected as the searched operation, which is highlighted by the red box.
Figure 6. Heat map of search results by the training strategy.

4.1. Results of Infrared and Visible Image Fusion

We evaluated the fusion performance of our method by comparing it with seven state-of-the-art methods: DenseFuse [37], FusionGAN [42], RFN [39], GANMcC [55], IFCNN [40], MFEIF [56], and U2Fusion [57].

4.1.1. Qualitative Comparisons

(1)
Qualitative Comparisons on TNO Datasets
The intuitive qualitative results on two typical image pairs from the TNO dataset are shown in Figure 7. The boxes in the figure indicate the target objects and textured background in the fused image that are of our interest.
Figure 7. Qualitative comparison of typical image pairs in the TNO dataset.
The first pair of famous images depicts a residential area under night surveillance. The infrared image clearly shows useful information that is difficult to distinguish in the visible image, such as human figures, clouds, shrubs, and wire circles. As shown in the red boxes, our method effectively preserves the contrast information of the image, and the fused image renders the shrubs as a background with an even distribution of pixels, while other methods, such as FusionGAN and GANMcC, generate blurred backgrounds. As shown in the green boxes, our approach produces more apparent outlines of the target objects, while other methods, such as RFN-Nest, are strongly affected by the visible image, leaving the outline of the target object blurred and difficult to identify.
The second pair of images depicts a daytime surveillance scene. The human figure in the visible image blends into the background, while the target object is evident in the infrared image. The outline of the street light pole is difficult to identify in the infrared image and more apparent in the visible image. As shown in the red box, our method sharpens the contours of the target object compared to the visible image and effectively absorbs the valuable information from the infrared image, while other methods, such as RFN-Nest and U2Fusion, generate blurred, low-brightness target objects. As shown in the green box, our method effectively preserves the gradient information and clearly shows the outline of the street light pole as a detailed background, while other methods, such as FusionGAN, generate fused images with an uneven distribution of pixels in the background, making it difficult to distinguish the outlined details.
(2)
Qualitative Comparisons on Roadscene Datasets
The intuitive qualitative results of two typical image pairs from the Roadscene dataset are shown in Figure 8. The boxes in the figure indicate the target objects and textured background in the fused image that are of our interest. The grayscale fusion result is treated as the Y channel and merged with the Cb and Cr channels of the visible image to form a YCbCr image, which is then converted to the RGB color space to obtain the color fusion image.
Figure 8. Qualitative comparison of typical image pairs in the Roadscene dataset.
Two classic pairs of images depict street scenes at night. In contrast to the previous pairs, both pairs include artificial light sources, which pose a unique challenge to image fusion.
In the first group of images, the target object in the visible image is not identifiable, while the background details in the infrared image have a chaotic pixel distribution. In the second group of images, the vehicle in the visible image blends in with the blackness, while the overexposure caused by the artificial light source blurs the background details in the infrared images. Our method shows stronger contrast and richer color details compared to FusionGAN and GANMcC, as shown in the green box in the first set and the red box in the second set. It also avoids overexposure that causes ghosting compared to MFEIF. Moreover, as shown in the red box in the first group and the green box in the second group, our method effectively enhances the contours of the target object.
(3)
Qualitative Comparisons of M3FD Datasets
The intuitive qualitative results of two typical image pairs from the M3FD dataset are shown in Figure 9. The boxes in the figure indicate the target objects and textured background in the fused image that are of our interest.
Figure 9. Qualitative comparison of typical image pairs in the M3FD dataset.
These two sets of images evaluate the effectiveness of the different fusion approaches under typical and unique challenging conditions.
The first set of images shows a night street scene. The visible image has overexposure due to artificial light sources, making it difficult to recognize the human silhouette. The infrared image cannot capture the artificial light sources and loses text information on the signboard. As shown in the green box, our method generates the fused image with the brightest and most visible target object compared to other methods. As shown in the red box, our method produces an image with the highest contrast, good distribution of background details, and precise text information on the signboard.
The second set of images depicts a mountain scene during the daytime. The presence of smoke in the visible image makes it impossible to identify the target object. Still, the infrared image can capture the target object information through the smoke. As shown by the green box, our method fully absorbs the knowledge of the target object in the infrared image, with clear contour edge information and good brightness. As shown by the red box, the fused image generated by our method has high contrast, good pixel distribution, and rich color details compared to other methods in representing background details.

4.1.2. Quantitative Comparisons

In order to make a more comprehensive quantitative comparison, we selected six common fusion indicators in the field of image fusion as the evaluation benchmark, including three referenced indexes: mutual information (MI) [58], structural similarity (SSIM) [50], and the sum of correlation differences (SCD) [59], and three non-referenced evaluation indexes: entropy (EN) [60], standard deviation (SD) [61], and spatial frequency (SF) [62].
  • The quantity of information transmitted from the source image to the fused image is measured using MI. A larger MI indicates that more information from the source image pair is maintained in the fused image.
  • The human visual system is sensitive to picture loss and distortion, and this is modeled using SSIM. It has a good correlation with fusion performance.
  • SCD displays the correlation between the source and fused images. A larger SCD indicates a higher fusion performance.
  • EN measures the information in the fused image. A higher EN typically denotes improved fusion performance.
  • The contrast and pixel distribution of the fused image are reflected by SD. A larger SD typically denotes a more aesthetically pleasing fused image.
  • SF represents the overall gradient distribution of the image in the spatial domain. The texture and edges become richer as the SF becomes larger.
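For reference, the three non-referenced metrics above can be computed directly from the fused image; the sketch below assumes grayscale inputs in [0, 1] and 8-bit quantization for the entropy histogram.

```python
import torch

def entropy(img: torch.Tensor, levels: int = 256) -> torch.Tensor:
    """EN: Shannon entropy of the gray-level histogram."""
    hist = torch.histc(img, bins=levels, min=0.0, max=1.0)
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * torch.log2(p)).sum()

def standard_deviation(img: torch.Tensor) -> torch.Tensor:
    """SD: reflects contrast and the spread of the pixel distribution."""
    return img.std()

def spatial_frequency(img: torch.Tensor) -> torch.Tensor:
    """SF: combines row- and column-wise first differences (richer texture -> larger SF)."""
    rf = (img[..., :, 1:] - img[..., :, :-1]).pow(2).mean()
    cf = (img[..., 1:, :] - img[..., :-1, :]).pow(2).mean()
    return torch.sqrt(rf + cf)
```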
The specific results are shown in Table 1 and Table 2 below.
Table 1. Quantitative comparison display of the TNO datasets. The best result is highlighted in red whereas the second-best one is highlighted in blue.
Table 2. Quantitative comparison display of the Roadscene datasets. The best result is highlighted in red whereas the second-best one is highlighted in blue.
In the TNO dataset, as shown in Table 1, our method consistently delivers the highest or second-highest mean compared to the other methods. The lower standard deviation, on the other hand, illustrates the consistency of our approach across cases. Our method specifically obtains the highest EN, SD, and SF, which shows that it produces fused images with rich information, good pixel distribution, strong contrast, and rich texture. The high SSIM score reflects the visually pleasing effect of our fusion method.
In the Roadscene dataset, as shown in Table 2, our approach yields the highest average values under MI, EN, SD, and SF evaluations. The highest MI and EN values show how effectively our method can transfer information from the source images to the fused images. The highest SD and SF values demonstrate how the fused images generated by our method are aesthetically pleasing and maintain their sharp edges and rich textures.

4.2. Results of IVIF and Object Detection

In the ensuing object detection section, we took the following steps to ensure that the experiment was fair: for each fusion method, the pre-trained YOLOv5s model was retrained for 300 epochs on that method's fused images, and the corresponding detection network parameters were obtained before detection. That is, a pre-trained YOLOv5s model was used and fine-tuned on the fused images.
Among the datasets used in this paper, only the M3FD dataset has object annotations, so the relevant experiments were carried out on this dataset. Different from our previous work, this time we adopted a new training set and test set. The 4200 images in the dataset are divided into a test set of 800 images and a training set of 3400 images, split by scene so that no scene in the test set appears in the training set.
The quantitative results of object detection in the M3FD dataset are shown in Table 3. Almost all fusion approaches show excellent detection results in general. Except for a few label categories, the detection accuracy of the fusion approaches outperforms that obtained using solely visible or infrared images. By highlighting the representation of infrared thermal objects and activating the rich information in both modalities through high-level semantic information, our method has greater advantages on this dataset featuring challenging scenes.
Table 3. Quantitative results (precision) of object detection in the M3FD dataset among all of the image fusion methods plus detectors. The best result is highlighted in red whereas the second-best one is highlighted in blue.

4.3. Ablation Studies

4.3.1. Study on Model Architectures

We study our method's model architecture and further validate the effectiveness of several individual components, as shown in Table 4. The focus of the network structure presented in this study is the introduction of NAS and the MAAB module built for semantic information. For these two components, we designed corresponding ablation experiments.
Table 4. Quantitative results of image fusion in the TNO and Roadscene datasets using different networks. The best result is highlighted in red whereas the second-best one is highlighted in blue.
When NAS is not used, the single-layer searched convolution layer is replaced with a Conv 3 × 3 (stride = 1) operation. When MAAB is not used, a combination of Conv 3 × 3 (stride = 1) + BN + ReLU is used instead. Figure 10 shows the visualization results of fused images generated by models with different degrees of ablation, where w/o means “without”.
Figure 10. Qualitative comparison of results using different networks.
Without MAAB, the utilization of high-level semantic information suffers from adaptation issues, and the overall contrast and color are poor. Without NAS, the original features cannot make full use of the semantic information, resulting in a significant loss of detail. Both components have favorable effects on the final fusion result.
The relevance of MAAB and NAS to the network can also be seen in the quantitative comparison presented in the table. Our full model ranks second in EN only because the M1 variant produces overly noisy pixel distributions, which inflates its entropy and interferes with that metric.

4.3.2. Analyzing the Training Loss Functions

We discuss the impact of the different loss functions on our method. As seen in Figure 11, the resulting image is substantially deteriorated without the SSIM constraint; moreover, the contrast and brightness are drastically altered. Without the gradient constraint, the edges of the image become blurred and more information is lost. Without the salient-target constraint supplied by the VSM, the brightness of the figures in the image is lowered, as is their contrast with the environment, degrading the viewing effect.
Figure 11. Qualitative comparison of loss functions. From left to right: w/o L_{SSIM}, w/o L_{grad}, w/o L_{target}, and the full L_{fusion}.
Overall, the loss functions we investigated do their jobs and have a significant impact on the training process.

5. Discussion

Overall, the paper presents a valuable contribution to the field of IVIF and object detection by proposing a multi-level structure search attention fusion network based on semantic information guidance. Future research could expand on these findings to enhance the accuracy and robustness of object detection algorithms in fused images under challenging conditions.

6. Conclusions

In this paper, a multi-level structure-search attention fusion network based on semantic information guidance is proposed. The multi-level adaptive attention block (MAAB) is designed in the network to reduce feature redundancy and preserve complementary information. The neural architecture search (NAS) is introduced in constructing the overall network structure to eliminate the limitations of the existing manually constructed neural network structure. Extensive experiments on the new dataset (M3FD) show that our fusion method has achieved advanced performance in both subjective and objective evaluations, and the mAP in the target detection task is improved by 0.5%.

Author Contributions

Conceptualization, Y.L. and W.Z.; methodology, Y.L.; software, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, X.Z.; visualization, X.Z.; supervision, W.Z.; project administration, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61922019.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The program code and data that support the plots discussed within this paper are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hall, D.L.; Llinas, J. An introduction to multisensor data fusion. Proc. IEEE 1997, 85, 6–23. [Google Scholar] [CrossRef]
  2. Liu, R.; Liu, J.; Jiang, Z.; Fan, X.; Luo, Z. A bilevel integrated model with data-driven layer ensemble for multi-modality image fusion. IEEE Trans. Image Process. 2020, 30, 1261–1274. [Google Scholar] [CrossRef]
  3. Liu, J.; Shang, J.; Liu, R.; Fan, X. Attention-guided global-local adversarial learning for detail-preserving multi-exposure image fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5026–5040. [Google Scholar] [CrossRef]
  4. Jiang, Z.; Zhang, Z.; Yu, Y.; Liu, R. Bilevel modeling investigated generative adversarial framework for image restoration. Vis. Comput. 2022, 1, 1–13. [Google Scholar] [CrossRef]
  5. Ma, L.; Ma, T.; Liu, R.; Fan, X.; Luo, Z. Toward Fast, Flexible, and Robust Low-Light Image Enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 5637–5646. [Google Scholar]
  6. Liu, R.; Ma, L.; Zhang, J.; Fan, X.; Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10561–10570. [Google Scholar]
  7. Liu, J.; Jiang, Z.; Wu, G.; Liu, R.; Fan, X. A unified image fusion framework with flexible bilevel paradigm integration. Vis. Comput. 2022, 1, 1–18. [Google Scholar] [CrossRef]
  8. Liu, R.; Jiang, Z.; Fan, X.; Luo, Z. Knowledge-driven deep unrolling for robust image layer separation. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 1653–1666. [Google Scholar] [CrossRef] [PubMed]
  9. Nencini, F.; Garzelli, A.; Baronti, S.; Alparone, L. Remote sensing image fusion using the curvelet transform. Inf. Fusion 2007, 8, 143–156. [Google Scholar] [CrossRef]
  10. Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys. Technol. 2017, 82, 8–17. [Google Scholar] [CrossRef]
  11. Liang, J.; He, Y.; Liu, D.; Zeng, X. Image fusion using higher order singular value decomposition. IEEE Trans. Image Process. 2012, 21, 2898–2909. [Google Scholar] [CrossRef]
  12. Li, H.; He, X.; Tao, D.; Tang, Y.; Wang, R. Joint medical image fusion, denoising and enhancement via discriminative low-rank sparse dictionaries learning. Pattern Recognit. 2018, 79, 130–146. [Google Scholar] [CrossRef]
  13. Zhu, Z.; Yin, H.; Chai, Y.; Li, Y.; Qi, G. A novel multi-modality image fusion method based on image decomposition and sparse representation. Inf. Sci. 2018, 432, 516–529. [Google Scholar] [CrossRef]
  14. Liu, Y.; Chen, X.; Ward, R.K.; Wang, Z.J. Image fusion with convolutional sparse representation. IEEE Signal Process. Lett. 2016, 23, 1882–1886. [Google Scholar] [CrossRef]
  15. Li, H.; Liu, L.; Huang, W.; Yue, C. An improved fusion algorithm for infrared and visible images based on multi-scale transform. Infrared Phys. Technol. 2016, 74, 28–37. [Google Scholar] [CrossRef]
  16. Ibrahim, R.; Alirezaie, J.; Babyn, P. Pixel level jointed sparse representation with RPCA image fusion algorithm. In Proceedings of the 38th International Conference on Telecommunications and Signal Processing, Prague, Czech Republic, 9–11 July 2015; pp. 592–595. [Google Scholar]
  17. Liu, C.; Qi, Y.; Ding, W. Infrared and visible image fusion method based on saliency detection in sparse domain. Infrared Phys. Technol. 2017, 83, 94–102. [Google Scholar] [CrossRef]
  18. Shibata, T.; Tanaka, M.; Okutomi, M. Visible and near-infrared image fusion based on visually salient area selection. In Proceedings of the Digital Photography XI International Society for Optics and Photonics, San Francisco, CA, USA, 27 February 2015; Volume 4, p. 94. [Google Scholar]
  19. Gan, W.; Wu, X.; Wu, W.; Yang, X.; Ren, C.; He, X.; Liu, K. Infrared and visible image fusion with the use of multi-scale edge-preserving decomposition and guided image filter. Infrared Phys. Technol. 2015, 72, 37–51. [Google Scholar] [CrossRef]
  20. Rajkumar, S.; Mouli, P.C. Infrared and visible image fusion using entropy and neuro-fuzzy concepts. In Proceedings of the ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I; Springer: Berlin/Heidelberg, Germany, 2014; pp. 93–100. [Google Scholar]
  21. Zhao, J.; Cui, G.; Gong, X.; Zang, Y.; Tao, S.; Wang, D. Fusion of visible and infrared images using global entropy and gradient constrained regularization. Infrared Phys. Technol. 2017, 81, 201–209. [Google Scholar] [CrossRef]
  22. Bai, X. Morphological center operator based infrared and visible image fusion through correlation coefficient. Infrared Phys. Technol. 2016, 76, 546–554. [Google Scholar] [CrossRef]
  23. Liu, J.; Wu, Y.; Huang, Z.; Liu, R.; Fan, X. Smoa: Searching a modality-oriented architecture for infrared and visible image fusion. IEEE Signal Process. Lett. 2021, 28, 1818–1822. [Google Scholar] [CrossRef]
  24. Huang, Z.; Liu, J.; Fan, X.; Liu, R.; Zhong, W.; Luo, Z. ReCoNet: Recurrent Correction Network for Fast and Efficient Multi-modality Image Fusion. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 539–555. [Google Scholar]
  25. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration. arXiv 2022, arXiv:2205.11876. [Google Scholar]
  26. Jiang, Z.; Zhang, Z.; Fan, X.; Liu, R. Towards all weather and unobstructed multi-spectral image stitching: Algorithm and benchmark. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3783–3791. [Google Scholar]
  27. Liu, R.; Jiang, Z.; Yang, S.; Fan, X. Twin adversarial contrastive learning for underwater image enhancement and beyond. IEEE Trans. Image Process. 2022, 31, 4922–4936. [Google Scholar] [CrossRef]
  28. Jiang, Z.; Li, Z.; Yang, S.; Fan, X.; Liu, R. Target Oriented Perceptual Adversarial Fusion Network for Underwater Image Enhancement. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6584–6598. [Google Scholar] [CrossRef]
  29. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Front. Neurosci. 2019, 13, 95. [Google Scholar] [CrossRef] [PubMed]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  31. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  32. Li, H.; Wu, X.j.; Durrani, T.S. Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Phys. Technol. 2019, 102, 103039. [Google Scholar] [CrossRef]
  33. Li, H.; Wu, X.J.; Kittler, J. Infrared and visible image fusion using a deep learning framework. In Proceedings of the International Conference on Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2705–2710. [Google Scholar]
  34. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
  35. Chen, X.; Teng, Z.; Liu, Y.; Lu, J.; Bai, L.; Han, J. Infrared-Visible Image Fusion Based on Semantic Guidance and Visual Perception. Entropy 2022, 24, 1327. [Google Scholar] [CrossRef]
  36. Hou, J.; Zhang, D.; Wu, W.; Ma, J.; Zhou, H. A generative adversarial network for infrared and visible image fusion based on semantic segmentation. Entropy 2021, 23, 376. [Google Scholar] [CrossRef]
  37. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef]
  38. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep image decomposition for infrared and visible image fusion. arXiv 2020, arXiv:2003.09210. [Google Scholar]
  39. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar] [CrossRef]
  40. Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
  41. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 139–144. [Google Scholar]
  42. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  43. Ma, J.; Xu, H.; Jiang, J.; Mei, X.; Zhang, X.P. DDcGAN: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Trans. Image Process. 2020, 29, 4980–4995. [Google Scholar] [CrossRef] [PubMed]
  44. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. arXiv 2022, arXiv:2203.16220. [Google Scholar]
  45. Liu, H.; Simonyan, K.; Yang, Y. Darts: Differentiable architecture search. arXiv 2018, arXiv:1806.09055. [Google Scholar]
  46. Cai, H.; Zhu, L.; Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv 2018, arXiv:1812.00332. [Google Scholar]
  47. Saini, S.; Agrawal, G. (m) slae-net: Multi-scale multi-level attention embedded network for retinal vessel segmentation. In Proceedings of the 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI), Victoria, BC, Canada, 9–12 August 2021; pp. 219–223. [Google Scholar]
  48. Chen, L.; Liu, C.; Chang, F.; Li, S.; Nie, Z. Adaptive multi-level feature fusion and attention-based network for arbitrary-oriented object detection in remote sensing imagery. Neurocomputing 2021, 451, 67–80. [Google Scholar] [CrossRef]
  49. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  50. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  51. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Detfusion: A detection-driven infrared and visible image fusion network. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4003–4011. [Google Scholar]
  52. Cheng, M.M.; Mitra, N.J.; Huang, X.; Torr, P.H.; Hu, S.M. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef]
  53. Zhai, Y.; Shah, M. Visual attention detection in video sequences using spatiotemporal cues. In Proceedings of the 14th ACM International Conference on Multimedia, Santa Barbara, CA, USA, 23–27 October 2006; pp. 815–824. [Google Scholar]
  54. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  55. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2020, 70, 1–14. [Google Scholar] [CrossRef]
  56. Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 105–119. [Google Scholar] [CrossRef]
  57. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef] [PubMed]
  58. Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 313–315. [Google Scholar] [CrossRef]
  59. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  60. Roberts, J.W.; Van Aardt, J.A.; Ahmed, F.B. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522. [Google Scholar]
  61. Aslantas, V.; Bendes, E. A new image quality metric for image fusion: The sum of the correlations of differences. Aeu-Int. J. Electron. Commun. 2015, 69, 1890–1896. [Google Scholar] [CrossRef]
  62. Cui, G.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Opt. Commun. 2015, 341, 199–209. [Google Scholar] [CrossRef]
