Advanced Driving Assistance Based on the Fusion of Infrared and Visible Images

Obtaining key and rich visual information under sophisticated road conditions is one of the core requirements for advanced driving assistance. In this paper, a novel end-to-end model is proposed for advanced driving assistance based on the fusion of infrared and visible images, termed FusionADA. In our model, we aim to extract and fuse the optimal texture details and salient thermal targets from the source images. To achieve this goal, our model builds an adversarial framework between a generator and a discriminator. Specifically, the generator aims to generate a fused image with basic intensity information together with the optimal texture details from the source images, while the discriminator forces the fused image to restore the salient thermal targets from the source infrared image. In addition, our FusionADA is a fully end-to-end model, avoiding the manual design of complicated activity level measurements and fusion rules required by traditional methods. Qualitative and quantitative experiments on the publicly available RoadScene and TNO datasets demonstrate the superiority of our FusionADA over state-of-the-art approaches.


Introduction
Smart cities have become a new focus of global city development, covering smart transportation, smart security, smart communities, and so on. Among them, advanced driving assistance is an indispensable and effective tool that plays a pivotal role in smart transportation. The core of a smart city is a high degree of information fusion, and the same holds for advanced driving assistance. In the advanced driving assistance scene, a large number of information sensing devices monitor, connect and interact with objects and pedestrians in the environment online [1]. Among these, infrared and visible sensors are the most widely used types, whose working wavelengths are 300-530 nm and 8-14 µm, respectively.
The value of combining infrared and visible sensors lies in the fact that visible images capture reflected light and thus represent abundant texture details, while infrared images capture thermal radiation, which can emphasize thermal targets even in poor lighting conditions or under severe occlusion [2][3][4]. Based on this strong complementarity, the fused results can show abundant texture details together with salient thermal targets. Therefore, infrared and visible image fusion is undoubtedly a significant and effective technique for advanced driving assistance, benefiting both the automatic detection of the system and the driver's visual perception.
For infrared and visible image fusion, many methods have been proposed in the past few years. They can be divided into six categories according to their schemes: pyramid methods [5,6], neural network-based methods [7], wavelet transformation-based methods [8], sparse representation methods [9,10], salient feature methods [11,12], and other methods [13]. These fusion methods consist of three main parts, i.e., (i) domain transform, (ii) activity level measurement, and (iii) fusion rule design. The biggest criticism is that most existing methods require manually designing complex activity level measurements and fusion rules, which leads to additional time consumption and complexity.
The development of the smart city is inseparable from the empowerment of artificial intelligence (AI). In particular, the powerful feature extraction capability of deep learning has attracted more and more attention [14,15]; a detailed exposition of deep learning-based fusion methods is given in Section 2.2. These methods have opened a new direction for image fusion and achieved excellent results. However, they have not completely broken away from the shackles of traditional methods, because the deep learning framework is typically applied only to some small parts, e.g., feature extraction, while the overall fusion process still follows traditional frameworks.
In addition, both traditional and deep learning-based methods suffer from a common predicament, i.e., information attenuation. Specifically, the extracted (or to-be-fused) information, including texture details and salient thermal targets, is attenuated to varying degrees by the weight selection accompanying the fusion process.
To address the above issues and improve the performance of advanced driving assistance, in this paper we propose a new fusion method fully based on deep learning, called FusionADA. For convenience, we abbreviate the source visible image, the source infrared image, and the fused image as VI, IR and I_F, respectively. First of all, deep learning runs through the whole model, and manually designing complex activity level measurements and fusion rules is not required; thus our FusionADA is a fully end-to-end model. Furthermore, FusionADA can overcome the predicament of information attenuation, with respect to both texture details and salient thermal targets. On the one hand, since texture details can be characterized by gradient variation, on top of the major intensity information we employ a max-gradient loss to guide the fused image to learn the optimal texture details from the source images. On the other hand, with a labeled mask reflecting the domains of the salient thermal targets, we establish a specific adversarial framework of two neural networks, i.e., a generator and a discriminator, based on conditional generative adversarial networks (GANs). Rather than a whole image, the real data only refers to the salient thermal targets of the source infrared image limited by the labeled mask M, i.e., IR ⊗ M, while the fake data refers to the corresponding regions of the fused image, i.e., I_F ⊗ M, which forces the fused image to restore the salient thermal targets from the source infrared image. In conclusion, our FusionADA can be trained to generate a fused image with the optimal texture details and salient thermal targets in a fully end-to-end way without information attenuation.
The main contributions of this paper can be summarized in two aspects. (i) To improve the performance of advanced driving assistance, we propose a new fully end-to-end infrared and visible image fusion method, which requires no manual design of complex activity level measurements and fusion rules. (ii) To overcome the predicament of information attenuation, we employ the max-gradient loss and adversarial learning to learn the optimal texture details and restore the salient thermal targets, respectively.
The rest of this paper is arranged as follows. Section 2 presents related work, with a conspectus of advanced driving assistance and existing deep learning-based fusion methods. A detailed introduction of our FusionADA, together with its motivation, is presented in Section 3. Section 4 shows the fusion performance of FusionADA on the public infrared and visible image fusion datasets RoadScene and TNO, compared with other state-of-the-art methods in terms of both qualitative visual effects and quantitative metrics; it also reports an ablation experiment on adversarial learning. Conclusions are drawn in Section 5.

Related Work
In this section, we provide brief explanations of the advanced driving assistance in smart transportation and deep learning-based fusion methods.

Advanced Driving Assistance
Advanced driving assistance refers to an integrated system combining a camera detection module, a communication module and a control module, which is of great benefit for vehicle driving tasks. Such systems differ in their operating principles and in the level of assistance offered to drivers, and can be classified according to the monitored environment and the sensors used [16]. These systems do not act completely autonomously; they only provide relevant information to drivers and assist them in taking key actions. The proposed infrared and visible image fusion method relies on exteroceptive sensors, and the fused results are shown on a screen as visual assistance to drivers; they can also be incorporated into automatic recognition by smart transportation.

Infrared and Visible Image Fusion Based on Deep Learning
In the last several years, breakthroughs in deep learning have driven the vigorous development of artificial intelligence, which also provides new ideas for infrared and visible image fusion. Fusion methods based on deep learning can be roughly divided into two categories: convolutional neural network (CNN)-based models and GAN-based models [17]. Among the CNN-based methods, Liu et al. [18] first established a deep convolutional neural network to generate both the activity level measurement and the fusion rule, which was also applied to fusing infrared and visible images. Innovatively, Li et al. [19] used a dense-block architecture to extract more useful features from the source images in the encoding process, followed by a decoder to reconstruct the fused image. Besides, a convolutional sparse representation was introduced by Liu et al. [20] for image fusion, where a hierarchy of layers was built by deconvolutional networks. As for the GAN-based methods, Ma et al. [21,22] proposed FusionGAN to fuse infrared and visible images by adversarial learning, which was also the first time GANs were adopted for the image fusion task. Xu et al. [23] achieved fusion via a conditional generative adversarial network with dual discriminators (DDcGAN), in which a generator accompanied by two discriminators is employed to enhance the functional information in IR and the texture details in VI.

Proposed Method
In this section, combining the characteristics of infrared and visible images and the fusion target, we give a detailed introduction to the proposed method, including our fusion formulation, the network architectures of generator and discriminator, and the definitions and formulations of loss functions.

Fusion Formulation
The training procedure of our proposed FusionADA is illustrated in Figure 1. Infrared images can distinguish targets from their background based on dissimilarity in thermal radiation, but they lack rich texture details. In contrast, visible images show relatively richer texture details with high spatial resolution, but they fail to highlight the salient targets. Besides, for a given area of the two source images, either the infrared or the visible image may own the better texture details. Given an infrared image IR and a visible image VI, the ultimate goal of our FusionADA is to learn a fuse-generator G conditioned on them and constrained by a content loss. With the labeled mask M reflecting the domains of salient thermal targets, the fused image I_F multiplied by the mask, i.e., I_F ⊗ M, is encouraged to be realistic enough and close enough to the real data, i.e., IR ⊗ M, to fool the discriminator D. Meanwhile, the discriminator aims to distinguish the fake data (I_F ⊗ M) from the real data (IR ⊗ M). Accordingly, the objective function of the adversarial learning can be formulated as follows:

min_G max_D E[log D(IR ⊗ M)] + E[log(1 − D(I_F ⊗ M))], (1)

After the continuous optimization of the generator and the adversarial learning between the generator and the discriminator, the fused image finally possesses the optimal texture details and salient thermal targets in a fully end-to-end way.
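To make the mask-restricted adversarial game concrete, the construction of the real and fake data fed to the discriminator can be sketched as follows (a minimal NumPy sketch; the array shapes and values are illustrative):

```python
import numpy as np

def masked_pair(ir, fused, mask):
    """Build the discriminator's real/fake inputs.

    Only the regions covered by the labeled mask M (the salient
    thermal targets) are exposed to the discriminator; everything
    else is zeroed out, so adversarial learning acts on the target
    regions alone rather than on the whole image.
    """
    real = ir * mask      # IR (x) M : real data
    fake = fused * mask   # I_F (x) M : fake data
    return real, fake

# Toy 4x4 example: the mask marks a 2x2 "pedestrian" region.
ir = np.full((4, 4), 0.9)
fused = np.full((4, 4), 0.5)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

real, fake = masked_pair(ir, fused, mask)
```

Outside the mask both inputs are identically zero, so the discriminator's gradient signal only concerns the salient-target regions.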

Network Architecture
Fuse-Generator G: As shown in Figure 1, the Fuse-Generator can be regarded as an encoder-decoder structure. In the encoder, a separate branch extracts information from each source image. Adopting the idea of DenseNet [19], each layer is directly connected to all subsequent layers in a feed-forward manner. Since the information extracted from each source image is not the same, the internal parameters of the two branches also differ. There are four convolutional layers in each branch, and each convolutional layer consists of padding, convolution, and the corresponding activation function, i.e., the leaky rectified linear unit (LReLU). To avoid the blurring of image edges caused by "SAME" padding, the padding mode of all convolution layers is set to "VALID"; an explicit padding operation placed before each convolution keeps the size of the feature maps unchanged and matched to the source images. The kernel sizes of the first two convolutional layers are set to 5, while those of the latter two are set to 3. The strides of all convolutional layers are set to 1. Since each convolutional layer outputs 16 feature maps, the final concatenated output has 128 feature maps.
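The size bookkeeping for this "explicit padding plus VALID convolution" scheme can be checked with a short sketch (pure Python; the 128 × 128 input size is illustrative):

```python
def same_size_padding(kernel_size: int) -> int:
    # For a stride-1 "VALID" convolution with an odd kernel (5 or 3
    # here), padding (k - 1) // 2 pixels on each side keeps the
    # spatial size of the feature maps unchanged.
    return (kernel_size - 1) // 2

def valid_conv_out_size(in_size: int, kernel_size: int,
                        pad: int, stride: int = 1) -> int:
    # Standard output-size formula for a padded "VALID" convolution.
    return (in_size + 2 * pad - kernel_size) // stride + 1

# Each encoder branch: two 5x5 layers followed by two 3x3 layers.
size = 128
for k in [5, 5, 3, 3]:
    size = valid_conv_out_size(size, k, same_size_padding(k))
# size is still 128 after all four layers.
```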
The decoder is used for channel reduction and fusion of the extracted information. The kernel sizes of all its convolutional layers are uniformly set to 1 with strides of 1, so the sizes of the feature maps do not change and no padding operations are needed. The activation function of the last convolutional layer is Tanh. The specific numbers of output channels in all layers are summarized in Table 1.

Discriminator D: The discriminator is added to form an adversarial relationship with the generator. Its input is the real data, i.e., IR ⊗ M, or the fake data, I_F ⊗ M, and its output is a scalar estimating the probability that the input comes from the real data rather than the fake data. There are only three convolutional layers in the discriminator, which is much simpler than the Fuse-Generator. The strides of all convolutional layers are set to 2. After the fully connected layer, the scalar is obtained by the Tanh activation function.

Loss Functions
The loss functions in our work are composed of the Fuse-Generator loss L_G and the discriminator loss L_D.

Fuse-Generator Loss L_G
The Fuse-Generator loss includes a content loss L_G^con and an adversarial loss L_G^adv, which are used to extract and reconstruct the basic intensity information together with the optimal texture details, and to restore the thermal infrared salient targets, respectively. With the weight λ controlling the trade-off between the two terms, the Fuse-Generator loss is defined as follows:

L_G = L_G^con + λ L_G^adv, (2)

Among them, the content loss L_G^con has two parts: the basic-content loss L_SSIM for extracting and reconstructing the basic intensity information, and the max-gradient loss L_gra for obtaining the optimal texture details:

L_G^con = L_SSIM + η L_gra, (3)

where η is used to balance the intensity information and the gradient variation. Specifically, L_SSIM is formalized as follows:

L_SSIM = ω (1 − SSIM_{IR,F}) + (1 − SSIM_{VI,F}), (4)

where ω is employed to balance the contributions of the two source images. SSIM_{X,F} is the metric measuring the similarity between two images in terms of three factors, i.e., brightness, contrast and structure, which is mathematically defined as follows:

SSIM_{X,F} = ((2 μ_x μ_f + C_1) / (μ_x^2 + μ_f^2 + C_1)) · ((2 σ_x σ_f + C_2) / (σ_x^2 + σ_f^2 + C_2)) · ((σ_{xf} + C_3) / (σ_x σ_f + C_3)), (5)

where X and F in our work refer to a source image and the fused image, respectively. x and f denote image patches of X and F, μ and σ are the average values and standard deviations, and σ_{xf} is the covariance. C_1, C_2 and C_3 are parameters that keep the metric stable.
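As an illustration, a single-window version of the SSIM term and of L_SSIM can be sketched in NumPy as follows (real implementations average SSIM over local sliding windows; the constants C_1-C_3 here are illustrative values, not those of the paper):

```python
import numpy as np

def ssim_global(x, f, c1=1e-4, c2=9e-4, c3=4.5e-4):
    """Single-window SSIM: luminance * contrast * structure."""
    mu_x, mu_f = x.mean(), f.mean()
    sig_x, sig_f = x.std(), f.std()
    sig_xf = ((x - mu_x) * (f - mu_f)).mean()  # covariance
    l = (2 * mu_x * mu_f + c1) / (mu_x**2 + mu_f**2 + c1)
    c = (2 * sig_x * sig_f + c2) / (sig_x**2 + sig_f**2 + c2)
    s = (sig_xf + c3) / (sig_x * sig_f + c3)
    return l * c * s

def l_ssim(ir, vi, fused, omega=1.23):
    # omega trades off the IR and VI similarity terms; a perfect
    # reconstruction of both sources would drive this loss to 0.
    return omega * (1 - ssim_global(ir, fused)) + (1 - ssim_global(vi, fused))
```

A sanity check: an image is maximally similar to itself, so ssim_global(x, x) evaluates to 1 and the corresponding loss term vanishes.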
Using only the basic-content loss L_SSIM would cause information attenuation in the texture details. Therefore, we further employ the max-gradient loss L_gra to obtain the optimal texture details. L_gra is mathematically formalized as follows:

L_gra = (1 / (H W)) ‖∇I_F − g_max‖_1, (6)

where H and W are the height and width of the source images, and ∇(·) denotes calculating the gradient map. The idea of L_gra is to make the gradient map of the fused image, ∇I_F, as similar as possible to the optimal gradient map of the source images, g_max, which is mathematically defined as follows:

g_max = round(|∇IR| / (|∇IR| + |∇VI| + eps)) ⊗ ∇IR + round(|∇VI| / (|∇IR| + |∇VI| + eps)) ⊗ ∇VI, (7)

where round(·) denotes the rounding operation, which turns each ratio into a binary selection mask so that, at every pixel, the gradient with the maximum absolute value is taken, and eps is a very small value preventing the denominator from being 0. The adversarial loss L_G^adv further restores the thermal infrared salient targets from the source IR image in the fused image, and is defined as:

L_G^adv = E[log(1 − D(I_F ⊗ M))], (8)

where M is the labeled mask reflecting the domains of the salient thermal targets. When minimizing L_G^adv, I_F ⊗ M is encouraged to be realistic enough and close enough to the real data, i.e., IR ⊗ M, to fool the discriminator D.
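The max-gradient mechanism can be illustrated with a small NumPy sketch (the finite-difference gradient operator here is a stand-in assumption for whatever gradient operator the model actually uses):

```python
import numpy as np

def grad_map(img):
    # Simple finite-difference gradient magnitude (illustrative).
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return np.abs(gx) + np.abs(gy)

def g_max(ir, vi, eps=1e-8):
    # Pixel-wise selection of the stronger gradient: round(.) turns
    # the magnitude ratio into a 0/1 selection mask, so at each pixel
    # the gradient with the larger absolute value is kept.
    a, b = grad_map(ir), grad_map(vi)
    take_ir = np.round(a / (a + b + eps))
    return take_ir * a + (1 - take_ir) * b

def l_gra(ir, vi, fused):
    # Mean absolute difference between the fused gradient map and
    # the element-wise optimal gradient map of the sources.
    return np.mean(np.abs(grad_map(fused) - g_max(ir, vi)))
```

With a flat visible image and an infrared image containing an edge, fusing exactly the infrared image drives this loss to zero, while a flat fused image is penalized.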

Loss of Discriminator L_D
The discriminator loss L_D is the term that forms the adversarial relationship with the Fuse-Generator adversarial loss L_G^adv. L_D is formulated as follows:

L_D = −E[log D(IR ⊗ M)] − E[log(1 − D(I_F ⊗ M))], (9)

so that the discriminator learns to assign high probabilities to the real data IR ⊗ M and low probabilities to the fake data I_F ⊗ M.
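If the discriminator outputs a probability, this bookkeeping can be sketched as follows (a hedged illustration of a standard cross-entropy discriminator loss, not necessarily the exact form used in the paper):

```python
import math

def d_loss(d_real, d_fake, eps=1e-8):
    """Batch-averaged discriminator loss: minimize the negative of
    log D(real) + log(1 - D(fake)). d_real / d_fake are lists of the
    discriminator's probability outputs on IR (x) M and I_F (x) M
    batches; eps guards the logarithms."""
    n = len(d_real)
    return -sum(math.log(r + eps) + math.log(1 - f + eps)
                for r, f in zip(d_real, d_fake)) / n
```

A perfect discriminator (real → 1, fake → 0) attains a loss near zero, while a maximally confused one (both → 0.5) incurs a loss of 2 log 2.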

Experimental Results and Analysis
In this section, in order to show the superiority of our proposed FusionADA, we first compare it with 7 state-of-the-art fusion methods on the publicly available dataset RoadScene (https://github.com/hanna-xu/RoadScene (accessed on 1 December 2021)) qualitatively. Furthermore, we employ 8 metrics to evaluate the fusion results through quantitative comparisons. In addition, an ablation experiment on adversarial learning is conducted. Finally, we show the fusion results of our FusionADA on another publicly available dataset, TNO (https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029 (accessed on 1 December 2021)).

Experimental Settings
Dataset and Training Details. The training dataset consists of 45 aligned infrared and visible image pairs with different scenes selected from RoadScene. To improve training performance, cropping and decomposition are applied as expansion strategies before training to obtain a larger dataset. Specifically, the training data are uniformly cropped into 4736 patch pairs of size 128 × 128. There are 30 image pairs for testing. In the test phase, only the trained generator is used to generate the fused image, and the two input source images only need to share the same resolution.
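The patch-based expansion can be sketched as follows (the stride and image sizes are illustrative assumptions; only the 128 × 128 patch size comes from the paper):

```python
import numpy as np

def crop_patches(ir, vi, patch=128, stride=64):
    """Expand one aligned IR/VI pair into aligned training patch pairs
    by sliding a patch-sized window over both images in lockstep."""
    h, w = ir.shape
    pairs = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            pairs.append((ir[y:y + patch, x:x + patch],
                          vi[y:y + patch, x:x + patch]))
    return pairs

# A hypothetical 256x320 aligned pair yields 3 x 4 = 12 patch pairs.
pairs = crop_patches(np.zeros((256, 320)), np.ones((256, 320)))
```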
Since this work is based on adversarial learning, we design a training strategy to keep the generative adversarial network stable and to balance the adversarial relationship between the generator and the discriminator. The overall idea is to find the loss values at which the generator and the discriminator are in balance, and to optimize the generator or the discriminator toward its respective loss value through a variable number of optimization steps. The training procedure of FusionADA is summarized in Algorithm 1. The λ, η, and ω are set to 3, 100 and 1.23 in Equations (2)-(4), respectively.

Algorithm 1. Training procedure of FusionADA.
Train Discriminator D:
• Sample n VI patches {V_1, ..., V_n} and n corresponding IR patches {I_1, ..., I_n};
• Acquire generated data {F_1, ..., F_n};
• Update discriminator parameters θ_D by GradientDescentOptimizer to minimize L_D in Equation (9); (step I)
• While L_D > L_max and N_D < 10, repeat step I, N_D ← N_D + 1.
Train Generator G:
• Sample n VI patches {V_1, ..., V_n} and n corresponding IR patches {I_1, ..., I_n};
• Acquire generated data {F_1, ..., F_n};
• Update generator parameters θ_G by RMSPropOptimizer to minimize L_G in Equation (2); (step II)
• While L_D < L_min and N_G < 10, repeat step II, N_G ← N_G + 1;
• While L_G > L_Gmax and N_G < 10, repeat step II, N_G ← N_G + 1.
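The balancing strategy can be sketched as follows (the threshold values L_max, L_min, L_Gmax and the dummy training steps are illustrative; real steps would run the optimizers mentioned above and return the current losses):

```python
def balanced_round(train_d_step, train_g_step,
                   l_max=0.6, l_min=0.3, l_g_max=2.0, max_iters=10):
    """One round of the balancing strategy. Each *_step callable
    performs a parameter update and returns the current (L_D, L_G)."""
    # Step I: train D until L_D drops below L_max (bounded repeats).
    n_d = 0
    l_d, l_g = train_d_step()
    while l_d > l_max and n_d < max_iters:
        l_d, l_g = train_d_step()
        n_d += 1
    # Step II: train G until L_D recovers above L_min and L_G falls
    # below L_Gmax (bounded repeats).
    n_g = 0
    l_d, l_g = train_g_step()
    while (l_d < l_min or l_g > l_g_max) and n_g < max_iters:
        l_d, l_g = train_g_step()
        n_g += 1
    return l_d, l_g

# Dummy steps standing in for real optimizer updates.
state = {"ld": 1.0, "lg": 3.0}
def d_step():
    state["ld"] -= 0.2          # D improving: its loss decreases
    return state["ld"], state["lg"]
def g_step():
    state["ld"] += 0.05         # G improving: D's loss rises, L_G falls
    state["lg"] -= 0.5
    return state["ld"], state["lg"]

l_d, l_g = balanced_round(d_step, g_step)
```

The bounded repeat counts (here 10, as in Algorithm 1) prevent either network from being over-trained within a single round.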

Comparison Algorithms and Evaluation Metrics
In order to verify the effectiveness of our FusionADA, we show intuitive results of our work alongside 7 other state-of-the-art infrared and visible fusion methods, including gradient transfer fusion (GTF) [24], fourth-order partial differential equations (FPDE) [25], hybrid multi-scale decomposition (HMSD) [26], DenseFuse [19], proportional maintenance of gradient and intensity (PMGI) [27], unified unsupervised image fusion (U2Fusion) [28], and generative adversarial network with multi-classification constraints (GANMcC) [29]. Among them, GTF, FPDE and HMSD are fusion methods based on the traditional framework, while DenseFuse, PMGI, U2Fusion and GANMcC are deep learning-based fusion methods. Besides the intuitive evaluation, to assess the fused results more accurately, we employ eight metrics to evaluate the fusion performance of these eight methods: standard deviation (SD) [30], spatial frequency (SF) [30], entropy (EN) [31], mean gradient (MG) and edge intensity (EI) [32], which measure the fused image itself, and feature mutual information (FMI), the sum of the correlations of differences (SCD) [33], and visual information fidelity (VIF) [33], which measure the correlation between the fused image and the source images. Specifically, SD, SF, EN, MG and EI evaluate the contrast, frequency, amount of information, details, and gradient amplitude of the edge points in the fused image, respectively. FMI evaluates the amount of feature information transferred from the source images to the fused image. SCD and VIF measure the sum of the correlations of differences and the information fidelity, respectively.
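Three of the reference-free metrics have simple closed forms; they can be sketched in NumPy as follows (for images scaled to [0, 1]; the bin count and normalizations are illustrative conventions, as metric definitions vary slightly across papers):

```python
import numpy as np

def sd(img):
    # Standard deviation: a proxy for global contrast.
    return float(img.std())

def en(img, bins=256):
    # Shannon entropy of the intensity histogram: amount of information.
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sf(img):
    # Spatial frequency: energy of row/column first differences.
    rf = np.diff(img, axis=1) ** 2
    cf = np.diff(img, axis=0) ** 2
    return float(np.sqrt(rf.mean() + cf.mean()))
```

A flat image scores zero on all three, while an image with spread-out intensities scores higher, matching the intuition that larger values indicate richer fused results.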

Qualitative Comparisons
Figures 2-5 show four representative and intuitive fusion results of the eight methods on infrared and visible images from the RoadScene dataset. Compared to the seven other fusion methods, our fused results show three obvious advantages. First, the salient thermal targets are characterized clearly in our fused images, such as the pedestrians in Figures 2, 3 and 5, and the driver and passenger who got off the bus halfway in Figure 4 (all shown in the green boxes). The targets in the fused results of the other methods all look dimmer than in ours. Thanks to the salient thermal targets in our fused images, drivers and machines can identify targets more easily and accurately, which facilitates subsequent operations. Second, the scenes in our fused results show richer texture details, such as the schoolbag in Figure 2, the signs in Figures 3 and 4, and the pavement marking in Figure 5 (all shown in the enlarged red boxes). Some scenes in the results of the other methods appear fuzzier. Rich texture details are more conducive to scene understanding for drivers and machines. Last but not least, our results look cleaner than those of the other methods, without redundant fog or noise.

Quantitative Comparisons
To obtain a more comprehensive and objective evaluation of the experimental results, we randomly selected 30 test pairs of infrared and visible images to further perform quantitative comparisons of our FusionADA with the competitors on the eight fusion metrics. Each test image pair is aligned with the same resolution. The resulting values are summarized in Table 2. It is worth noting that our FusionADA almost always reaches the optimal or suboptimal mean values on the eight metrics. For the metrics MG, EI, FMI and SCD, our FusionADA achieves comparable results with suboptimal average values, trailing the best method only by a narrow margin. It can be concluded that our results contain stronger contrast, richer texture details and more information, and are closer to the source images with less distortion.
In addition, we provide the mean and standard deviation of the runtime of the eight methods in Table 3. Although our FusionADA does not achieve optimal efficiency, its runtime is still comparable.

Ablation Experiment of Adversarial Learning
The adversarial learning with a labeled mask is further employed in our FusionADA to restore the salient thermal targets from the source infrared image. To show the effect of adversarial learning, the following comparative experiments are conducted: (a) the adversarial learning is not applied; (b) the adversarial learning is applied. The experimental settings in other parts of the ablation experiments are the same. As can be seen from Figure 6, the fused results with adversarial learning own more salient thermal targets, which is more beneficial to the drivers and machines to identify the targets. Therefore, we can conclude that adversarial learning plays an important part in the fusion process.

Figure 6. Results on whether adversarial learning is applied. From left to right: source VIS image, source IR image, fused results of (a) without adversarial learning, fused results of (b) our FusionADA (with adversarial learning).

Generalization Performance
Our FusionADA also performs well on other datasets. To demonstrate this, we choose the TNO dataset and carry out experiments without retraining the fusion methods. In particular, we choose two state-of-the-art methods, HMSD and GANMcC, which perform well on RoadScene, as the comparative methods. The intuitive fusion results are presented in Figure 7. By contrast, our results exhibit less noise, more clearly characterized salient thermal targets and richer texture details, from which we can conclude that our FusionADA has good generalization performance and obtains excellent fusion results on other datasets.

Conclusion
In this paper, we propose a novel end-to-end fusion model for advanced driving assistance, called FusionADA, to obtain key and rich visual information under sophisticated road conditions. Specifically, FusionADA fuses the images produced by infrared and visible sensors. For drivers and machines, salient thermal targets and rich texture details of the scene are indispensable for identifying targets easily and accurately. Therefore, in our model, we guide the generator to generate the optimal texture details from the source images. Meanwhile, we build an adversarial framework with a labeled mask to further restore the salient thermal targets from the source infrared image. In addition, FusionADA works in a fully end-to-end way, which avoids manually designing complicated activity level measurements and fusion rules. Extensive experimental results reveal that our FusionADA not only presents better visual performance than other state-of-the-art methods, but also preserves the maximum or near-maximum amount of features from the source images.