Two-Branch Feature Interaction Fusion Method Based on Generative Adversarial Network

Abstract: This study proposes a fusion method for infrared and visible images based on feature interaction. Existing fusion methods can be classified into two categories: those based on a single-branch network and those based on a two-branch network. Generative adversarial networks are widely used in single-branch fusion methods, which ignore the differences in feature extraction caused by different input images. Most two-branch fusion methods use convolutional neural networks, which do not take into account the inverse promotion of fusion results and lack interaction between different input features. To remedy the shortcomings of these fusion methods and better utilize the features from the source images, this study proposes a two-branch feature interaction method based on a generative adversarial network for visible and infrared image fusion. In the generator, a two-branch feature interaction approach was designed to extract features from different inputs and realize feature interaction through network connections between the branches. In the discriminator, a double-classification discriminator was used for visible and infrared images. Extensive comparison experiments with state-of-the-art methods demonstrate the advantages of the proposed generative adversarial network based on two-branch feature interaction, which can enhance the texture details of objects in the fusion results and reduce the interference of noise information from the source inputs. These advantages were also confirmed in generalization experiments on object detection.


Introduction
Infrared images and visible images are obtained from different sensors that capture different types of information about the same scene. Infrared images provide texture detail and contrast based on the intensity of the heat emitted by objects [1]. Conversely, visible images are strongly affected by environments with strong light, weak light, or smoke, which increases the difficulty of capturing effective information and introduces a large amount of noise interference. Feature complementation and feature fusion are therefore both necessary for these two classes of images [2]. Accordingly, novel fusion methods have been designed around the differences in image information, which can filter out noise and improve the diversity of information in the fused images.
All of the above traditional and deep-learning-based approaches address some problems and provide new ideas, but several challenges have not yet been overcome: (a) Existing fusion methods based on generative adversarial networks and two-branch fusion methods based on convolutional neural networks have inspired us to develop new fusion ideas, but new challenges have also emerged. (b) Existing fusion methods based on generative adversarial networks adopt a single-branch mode for feature extraction [1,2,16,17]. These methods cannot avoid the loss of infrared image features caused by interference information from visible images, which reduces scene contrast and weakens texture information in occluded regions. (c) Some existing fusion methods based on convolutional neural networks use a two-branch mode to extract features [11][12][13][14][15]. We draw inspiration from their individual feature extraction, but they also face some problems. Feature correlation between different source images is reduced in a simple two-branch design. Moreover, constraining the feature extraction and reconstruction of the network only through loss functions may retain too much useless information in the fusion results when the source images are disturbed by smoke or strong light.
To address the three aforementioned issues, this study designed a two-branch fusion model for infrared and visible images. The main contributions of the proposed model are as follows: (1) Regarding issue (a): This study designed a generative adversarial network based on two-branch feature interaction for infrared and visible image fusion. The advantages of two-branch feature extraction and the adversarial training of generative adversarial networks were both exploited in the proposed model. (2) Regarding issue (b): This study designed a generator based on two-branch feature extraction to extract features from visible and infrared images. The two-branch feature extraction mode was designed to address the loss of texture features in infrared images caused by large areas of interference information from visible images. (3) Regarding issue (c): This study designed a two-branch feature extraction mode with feature interaction and enhanced the feature correlation of the two branches through layer-hopping connections. The feature similarity between the source images and the fused results was enhanced by the discriminator optimizing the generator.

Related Works
In recent years, visible and infrared image fusion algorithms based on traditional methods have become well established. With the wide application of deep learning, neural networks have also been widely used in the design of infrared and visible image fusion algorithms. This section first introduces some traditional methods for image fusion, then algorithms based on convolutional neural networks (CNNs), and finally fusion algorithms based on generative adversarial networks (GANs).

The Traditional Fusion Method
Feature extraction, feature fusion, and feature reconstruction have been redesigned to optimize the function of the fusion image using different mathematical methods in traditional methods. Four representative traditional methods are described in detail below.
In multi-scale transform (MST) fusion methods [4][5][6], the core of the algorithm is to obtain multi-scale representations of the input images through multi-scale transformations. Specific fusion rules are designed to obtain the multi-scale fusion coefficients, which are related to the correlation and activity between pixels in the multi-scale representations of the images. Finally, the fusion image is obtained by applying the inverse multi-scale transform to the fused coefficients.
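As a concrete illustration of the MST pipeline (multi-scale decomposition, coefficient fusion, inverse transform), the following sketch fuses two grayscale images with a toy two-level Laplacian-style pyramid. The 2×2 average-pooling transform and the max-abs/average fusion rules are illustrative simplifications, not the specific transforms of [4][5][6].

```python
import numpy as np

def down(x):
    # 2x2 average pooling (a stand-in for Gaussian blur + decimation)
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):
    # nearest-neighbour upsampling back to twice the size
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels=2):
    pyr, cur = [], img
    for _ in range(levels):
        low = down(cur)
        pyr.append(cur - up(low))   # band-pass (detail) layer
        cur = low
    pyr.append(cur)                 # coarsest approximation
    return pyr

def fuse_mst(a, b, levels=2):
    pa, pb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    fused = [np.where(np.abs(la) >= np.abs(lb), la, lb)     # max-abs rule for details
             for la, lb in zip(pa[:-1], pb[:-1])]
    fused.append(0.5 * (pa[-1] + pb[-1]))                   # average the base layer
    out = fused[-1]
    for lap in reversed(fused[:-1]):                        # inverse transform
        out = up(out) + lap
    return out
```

Fusing an image with itself reconstructs it exactly, since the detail layers store precisely what the down/up round trip loses.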
Sparse representation (SR) fusion methods [3] transform the two source images into single-scale feature vectors via a linear combination over a dictionary and then fuse and reconstruct the feature vectors to obtain the fused result.
Low-rank representation fusion methods [7,8] are divided into three steps: first, the low-rank parts and salient features of the source images are extracted; then, the different extracted features are fused separately; finally, the fused features are reconstructed. The authors in [20] designed MDLatLRR, an image decomposition strategy based on inexact augmented Lagrange multipliers, and used a weighted-average method to fuse the different features.

Convolutional-Neural-Network-Based Fusion Methods
The convolutional neural network is a type of neural network that is commonly used in computer vision tasks. It consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layer applies a set of filters to input images, which helps to extract features such as edges and textures. The pooling layer then reduces the spatial size of the feature maps by taking the maximum or average value in each region. The fully connected layer connects all neurons from the previous layers to the output layer for classification or regression.
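The three layer types described above can be seen in a toy PyTorch model; the layer sizes here are arbitrary illustration, not any network from this paper.

```python
import torch
import torch.nn as nn

# A toy CNN with the three layer types: convolution, pooling, fully connected.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # filters extract edges/textures
    nn.ReLU(),
    nn.MaxPool2d(2),                            # halves the spatial size
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),                 # fully connected head
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
y = model(x)
print(y.shape)                  # torch.Size([1, 10])
```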
Convolutional neural networks have become the first choice for image fusion methods as networks have evolved, and they are now widely used for infrared and visible image fusion [21][22][23][24][25][26]. CNN-based fusion methods fall into two kinds: single-branch and two-branch fusion methods.
The authors in [27] proposed U2Fusion, a unified framework for multiple types of image fusion based on a single-branch network. An adaptive method was used to estimate feature significance from both source images during feature extraction and feature retention. Although it achieved good results in fusing visible and infrared images, the training set of the model was derived solely from multi-focus image data, which led to poor performance in extracting and preserving feature information from other types of images. Moreover, choosing an appropriate fusion strategy remains an issue, since different types of image fusion require different fusion strategies.
Most fusion methods based on convolutional neural networks use two-branch networks to obtain features from the source images. An image fusion framework based on a CNN was proposed by Zhang et al. [12], which proved suitable for various types of image fusion, such as visible and infrared images, multi-exposure images, and multi-modal medical images. First, a fully convolutional network was used, so that training was end-to-end without the need for manual intervention or post-processing. Second, the model was trained only on multi-focus image datasets, owing to the construction of a dedicated multi-focus dataset. Finally, this CNN-based image fusion model with perceptual loss also set a precedent. The authors in [11] created a network known as RFN, which uses different encoders to encode the two source images, fuses the encoded features, and then passes them to a decoder to obtain the fusion results. The authors in [13] created DIDFuse, which uses different encoders to decompose the source images into background parts and detail parts to obtain different feature maps, and then uses decoders to restore the features to visible or infrared images. The network is well optimized through this encoding and decoding. The authors in [14] designed SeAFusion, which uses different branches to extract features from the different source images, with gradient residual-dense blocks in each branch to improve the network's ability to acquire features through convolution. Such feature extraction methods provided us with good inspiration. In the method designed by Li et al. [28], base parts and detail parts of the source images are obtained by different branches, and then a weighted average and a deep learning network are used to fuse the base parts and detail parts, respectively. The fused image consists of these two parts.
The above fusion methods based on convolutional neural networks solve some challenges in the fusion field, but some problems remain open. These methods rely only on the design of the loss function, which limits the feature extraction ability of the network and introduces noise into the feature reconstruction, resulting in many details being covered by noise information.

Generative-Adversarial-Network-Based Methods
The generative adversarial network constitutes one of the most promising methods of weakly supervised learning. This model can be divided into a generator part and a discriminator part. The generator in a GAN typically consists of one or more deep neural networks that take random noise as input and produce fake data samples as output. The discriminator, on the other hand, is a binary classifier that takes data samples as input and outputs a probability score indicating whether the input is real or fake. The adversarial game between these two parts contributes to the optimization of the entire fusion network. Goodfellow et al. [29] designed the GAN model, and this concept of adversarial learning has brought new vitality to the deep learning field. The first fusion model, FusionGAN, was proposed by Ma et al. [2]; it is a fusion framework based on adversarial learning for visible and infrared image fusion tasks. The function of the generator is to obtain fusion results that include not only intensity features from the infrared images but also gradient features from the visible images. Meanwhile, the function of the discriminator is to make the fused images retain more detail features from the visible images. A specific discriminator was designed to judge the authenticity of visible images and fusion results, thereby promoting the corresponding generator to produce more feature-rich results.
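A minimal sketch of the adversarial game on toy 1-D data, assuming standard PyTorch components; the architectures and data here are placeholders for illustration, not FusionGAN.

```python
import torch
import torch.nn as nn

# Minimal GAN pieces: G maps noise -> sample, D scores real vs. fake.
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)

real = torch.randn(8, 2) + 3.0   # toy "real" distribution
noise = torch.randn(8, 4)

# Discriminator step: push D(real) -> 1 and D(G(z)) -> 0
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(noise).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: push D(G(z)) -> 1, i.e. try to fool the discriminator
g_loss = bce(D(G(noise)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The `.detach()` in the discriminator step is what keeps the two updates adversarial: the generator is not penalized while the critic is being trained.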
DDcGAN [17] was designed for fusing source images of different resolutions. The source images with different resolutions are passed into the encoder and decoder parts of the generator to obtain fusion images. The fused images are fed into different discriminators, both of which simultaneously promote the generator. Ma et al. [1] used multiple classifiers in generative adversarial networks (GANMcC) to balance the degree of information retention from the different source images in the fused results. Moreover, a new content loss function was introduced in GANMcC, which includes two types of loss: gradient loss and intensity loss. In addition, different content loss functions were adopted for different source images.
Unfortunately, existing generative adversarial networks used in the fusion domain all adopt a single-branch mode, and such feature extraction methods have difficulty in specialized feature extraction for different source images.

Proposed Method
The proposed fusion model based on two-branch feature complementation is described in this section. The formulation of the problem is introduced first. Secondly, the proposed fusion network is introduced in detail, including the generator based on two-branch feature complementarity and the double-classification discriminator based on layer-hopping connections. The loss functions used to optimize the discriminator and generator are then introduced. Finally, this section provides certain details of model training and testing.

Problem Formulation
A two-branch feature-complementary generator is proposed. Visible images I_v (of size h × w × 3) are passed into one branch of the proposed generator, and infrared images I_r (of size h × w × 1) are passed into the other branch. Feature extraction is performed via convolution on both source images. The features obtained from each convolution layer of the infrared branch are passed into the corresponding layer of the other feature extraction branch for feature complementarity. The features from both branches generate fused images I_f (of size h × w × 3) after feature fusion and feature reconstruction. The fusion results of the two-branch feature-complementary generator, the visible images, and the infrared images are each fed to a dedicated discriminator based on layer-hopping connections. The identification results are obtained and guide the optimization through the loss function.

The Fusion Model Based on Two-Branch Feature Complementation
In this section, the proposed fusion network is described in detail, including a generator based on two-branch feature complementarity and a double-classification discriminator based on layer-hopping connections, as shown in Figure 1. Visible and infrared images are fed into the two branches of the proposed generator separately. In each branch, features are extracted by convolution. It is worth noting that the features in each layer of the infrared branch are passed into the other branch. Then, the features of the two branches are fused and reconstructed to obtain the fusion results. The fusion results, visible images, and infrared images are passed into the double-classification discriminator based on layer-hopping connections in turn. The discriminator result is used to promote the fusion performance of the two-branch generator through the updated loss function. The two-branch generator and updated discriminator enhance the performance of the entire network through this interaction. The two-branch design improves the feature extraction ability for the different source images. Unlike previous two-branch networks, this study designed pixel-wise addition between corresponding convolution layers of the two branches and proposed the idea of feature complementarity for different features during feature extraction. This design promotes feature complementarity between the two source images in the process of extracting features from the visible and infrared images.
Visible images and infrared images are passed into the two branches of the proposed generator, respectively, as shown in Figure 2. The result of each convolution layer in the infrared branch is not only passed into the next convolution layer of that branch, but also added pixel-wise to the corresponding convolution layer result of the visible branch and passed into the next convolution layer of the visible branch. Subsequently, the results of the last convolution layers of the two branches are added and passed into four subsequent convolution layers for feature fusion and feature reconstruction. Each convolution block in the proposed generator backbone consists of a convolution layer, a batch-norm layer, and a ReLU layer.
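The cross-branch connection described above can be sketched as follows; channel widths, layer counts, and kernel sizes are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def cbr(cin, cout):
    # Conv + BatchNorm + ReLU block, as used throughout the generator backbone
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class TwoBranchGenerator(nn.Module):
    """Sketch of a two-branch generator with cross-branch pixel addition.
    Three extraction layers per branch and four reconstruction layers are
    illustrative choices, not the published configuration."""
    def __init__(self, ch=32, depth=3):
        super().__init__()
        self.vis = nn.ModuleList([cbr(3, ch)] + [cbr(ch, ch) for _ in range(depth - 1)])
        self.ir = nn.ModuleList([cbr(1, ch)] + [cbr(ch, ch) for _ in range(depth - 1)])
        self.recon = nn.Sequential(cbr(ch, ch), cbr(ch, ch), cbr(ch, ch),
                                   nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, vis, ir):
        v, r = vis, ir
        for lv, lr in zip(self.vis, self.ir):
            r = lr(r)
            v = lv(v) + r          # infrared features injected into the visible branch
        return self.recon(v + r)   # add the branches, then fuse and reconstruct

g = TwoBranchGenerator()
fused = g(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
print(fused.shape)  # torch.Size([1, 3, 64, 64])
```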

The Architecture of Double-Classification Discriminator Based on Layer-Hopping Connections
Fusion models using a single infrared-image discriminator may reduce the degree of feature retention from the visible images. Therefore, this study designed a double-classification discriminator to solve the problem of missing features from the visible image. In addition, the feature extraction capability of the discriminator network was enhanced, and its ability to distinguish images was improved, by introducing layer-hopping connections.
As shown in Figure 3, the double-classification discriminator consists of three 3 × 3 convolution layers, a 5 × 5 convolution layer, and an 8 × 8 convolution layer, all of which are equipped with batch-norm and ReLU layers. The result of the five convolution layers is classified by an activation layer. The classification results for the different images are used to guide the generator, through the loss function, to retain different reconstruction features. The two-branch generator and double-classification discriminator enhance the feature extraction and reconstruction capabilities of the network through the action of the loss functions.
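A hedged sketch of such a discriminator stack follows; the strides and channel widths are assumptions, and the layer-hopping connections are omitted for brevity.

```python
import torch
import torch.nn as nn

def block(cin, cout, k, s=2):
    # Conv + BatchNorm + ReLU; stride-2 downsampling is an assumed choice
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU())

# Three 3x3 convs, one 5x5 conv, one final 8x8 conv, then a sigmoid activation.
disc = nn.Sequential(
    block(3, 16, 3), block(16, 32, 3), block(32, 64, 3),
    block(64, 64, 5),
    nn.Conv2d(64, 1, 8, stride=1, padding=0),  # 8x8 conv collapses the 8x8 map
    nn.Sigmoid(),
)

score = disc(torch.randn(2, 3, 128, 128))
print(score.shape)  # torch.Size([2, 1, 1, 1])
```

With stride-2 blocks, a 128 × 128 input shrinks to 8 × 8 before the final 8 × 8 convolution reduces it to a single probability score per image.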

Loss Functions
This section first introduces the lightweight loss functions of the two-branch generator, which include a traditional loss function, a content loss function, and an adversarial loss function. Then, the proposed loss function of the double-classification discriminator is introduced.

Loss Functions of the Generator Based on Two-Branch Feature Complementarity
Recently, GAN-based image fusion methods have introduced content loss into the loss function. However, this is not sufficient to solve some less visible problems, such as an insufficient ability to extract gradient information from both the visible and infrared images, which results in a large gradient difference between the source images and the fused images. To solve these problems, this study designed the traditional loss function L_tra, the updated content loss function L_con, and the adversarial loss function V_Adv(G), improving the similarity between the source images and the fused results.
The global loss function of the two-branch generator, L_G, consists of three parts:

L_G = a · L_tra + b · L_con + c · V_Adv(G)

Here, L_tra denotes the traditional loss, L_con the content loss, and V_Adv(G) the adversarial loss; a, b, and c are the corresponding weighting parameters. The three parts of the loss function are described in detail next.
In L_tra, I_f^t(m,n) denotes the pixel in row m and column n of the t-th fused image, I_r^t(m,n) denotes the corresponding pixel of the t-th infrared image, and I_v^t(m,n) denotes the corresponding pixel of the t-th visible image. α and β are the corresponding weighting parameters.
The second part is the content loss L_con. In its formula, L and H denote the width and height of the input data, respectively; ‖·‖_F denotes the matrix Frobenius norm; ∇ denotes the Laplace gradient operator; and ξ balances the proportions of the two terms in the updated content loss function.
The third part is the updated adversarial loss V_Adv(G), where I_f^t denotes the t-th fused image and T is the number of fused images. The value that the proposed generator expects discriminator D to assign to fake data is denoted by d_1, and the value that it expects D to assign to real data is denoted by d_2.
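The formulas themselves are not reproduced in this excerpt. The following is a hedged reconstruction from the symbol definitions above, assuming a least-squares adversarial form; the published equations should be taken as authoritative.

```latex
% Traditional loss: weighted pixel distances to both source images
L_{tra} = \frac{1}{T}\sum_{t=1}^{T} \frac{1}{HL}\sum_{m=1}^{H}\sum_{n=1}^{L}
          \Big[ \alpha \big(I_f^t(m,n) - I_r^t(m,n)\big)^2
              + \beta  \big(I_f^t(m,n) - I_v^t(m,n)\big)^2 \Big]

% Content loss: intensity fidelity to the infrared image plus
% gradient fidelity to the visible image, balanced by \xi
L_{con} = \frac{1}{HL}\Big( \lVert I_f - I_r \rVert_F^2
          + \xi \, \lVert \nabla I_f - \nabla I_v \rVert_F^2 \Big)

% Adversarial term: the generator drives D(I_f) toward the "real" label d_2
V_{Adv}(G) = \frac{1}{T}\sum_{t=1}^{T} \big( D(I_f^t) - d_2 \big)^2
```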

Loss Functions of the Double-Classification Discriminator Based on Layer-Hopping Connections
A discriminator for the infrared image alone guides the corresponding generator to retain only the features of the infrared images. Under such a discriminator, the semantic features of the visible image cannot realize the complementary function of the different feature types. The proposed model should therefore apply the same processing steps and calculation methods in the discriminator to the two different input images. Accordingly, this study designed a double-classification discriminator with the redefined loss function L_Dis, described in detail below.
L_Dis can be divided into three parts. In order to apply the same processing steps and calculation methods to infrared images as to visible images, this study designed V_Dis(I_r), which represents the estimated value for the infrared image. V_Dis(I_v) and V_Dis(I_f) make up the rest, representing the estimated values for the visible image and the fused image, respectively.
The first part is V_Dis(I_r), where I_r^n denotes the n-th infrared image, N denotes the total number of input infrared images, and D(I_r^n) denotes the estimated value of the discriminator. The value that the discriminator is expected to assign to real infrared data is represented by d_1.
The next part is V_Dis(I_v), where I_v^n denotes the n-th visible image, N denotes the total number of input visible images, and D(I_v^n) denotes the result of the discriminator. The value that the discriminator is expected to assign to real visible data is denoted by d_2.
The last part is V_Dis(I_f), where I_f^n denotes the n-th fused image, N denotes the total number of fused images, and D(I_f^n) denotes the result of the discriminator. The value that the discriminator is expected to assign to fake data is denoted by d_3.
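Since the formulas are not reproduced in this excerpt, the following is a hedged least-squares reconstruction consistent with the label definitions above (each image class driven toward its own target value); consult the published equations for the exact form.

```latex
L_{Dis} = V_{Dis}(I_r) + V_{Dis}(I_v) + V_{Dis}(I_f)

V_{Dis}(I_r) = \frac{1}{N}\sum_{n=1}^{N} \big( D(I_r^n) - d_1 \big)^2
V_{Dis}(I_v) = \frac{1}{N}\sum_{n=1}^{N} \big( D(I_v^n) - d_2 \big)^2
V_{Dis}(I_f) = \frac{1}{N}\sum_{n=1}^{N} \big( D(I_f^n) - d_3 \big)^2
```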

Training Details
This study randomly selected 50 groups of visible and infrared images to train and optimize the proposed model in Python 3.7. By cropping each group of visible and infrared images with a step size of 14, groups of 136 × 136 training patches were obtained. t groups of 136 × 136 source image patches from the training dataset were selected as the inputs of the two-branch generator, whose outputs were fused image patches of the same size. Then, t patches of the fusion images, together with the t corresponding infrared patches and t corresponding visible patches, were passed to the discriminator. We trained the two-branch generator and double-classification discriminator n times with the Adam optimizer [30] to obtain the most effective generator, as shown in Algorithm 1.
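Algorithm 1 is not reproduced in this excerpt; the alternating update it describes can be sketched as follows, where `generator`, `discriminator`, and the two loss helpers are placeholders for the components defined above.

```python
import torch

def train(generator, discriminator, loader, g_loss_fn, d_loss_fn, E=10):
    """Hedged sketch of the alternating GAN training loop: E epochs over
    m-sized batches of 136x136 patch triples. The loss helpers stand in for
    the generator and discriminator objectives described in the text."""
    opt_g = torch.optim.Adam(generator.parameters())
    opt_d = torch.optim.Adam(discriminator.parameters())
    for epoch in range(E):
        for vis, ir in loader:
            fused = generator(vis, ir)
            # update the double-classification discriminator first,
            # detaching the fused batch so the generator is not penalized
            d_loss = d_loss_fn(discriminator, vis, ir, fused.detach())
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # then update the two-branch generator against the refreshed critic
            g_loss = g_loss_fn(discriminator, vis, ir, fused)
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return generator
```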
In practice, this study empirically set E = 10, m = 8, and k as the ratio of the total number of patches to m. The parameters of the two-branch generator and the double-classification discriminator were set as follows: a = 1, b = 50, and c = 50; α = 0.5, β = 0.5, and ξ = 5. d_1 was a random number between 0.6 and 1.1, d_2 a random number between 0.7 and 1.2, and d_3 a random number between 0 and 0.3.
During testing, four common datasets were selected to verify the proposed approach through cropping images without overlapping them. Then, a batch of image patches generated by a two-branch generator was spliced together to produce the fusion image.
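The non-overlapping crop-and-splice step used at test time can be sketched as follows, assuming for simplicity that the image sides are divisible by the patch size.

```python
import numpy as np

def split_patches(img, p):
    # Crop a grayscale image into non-overlapping p x p patches
    h, w = img.shape
    return [img[i:i + p, j:j + p] for i in range(0, h, p) for j in range(0, w, p)]

def splice_patches(patches, h, w, p):
    # Stitch the (fused) patches back into the full-size image, row-major order
    out = np.empty((h, w), dtype=patches[0].dtype)
    idx = 0
    for i in range(0, h, p):
        for j in range(0, w, p):
            out[i:i + p, j:j + p] = patches[idx]
            idx += 1
    return out
```

In the pipeline described above, each patch would pass through the two-branch generator between the split and splice steps.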

Experiments
In this section, the dataset, comparison method, and evaluation index designed in the comparison experiment are first detailed. Subsequently, comparative tests with twelve methods are carried out on four public datasets, and object detection task analysis and time complexity analysis are introduced in detail. Finally, this study designed an ablation experiment to verify the necessity of each part of the network and summarized the experimental part.

Metrics
Six evaluation metrics were selected to quantitatively compare the proposed model with the other methods. They include pixel-level similarity measures between the input and output images, as well as measures computed on the output image alone. The six evaluation metrics are as follows:
EN [31]: Information entropy measures the information richness of the fusion results. For fusion images obtained by different methods, the larger the entropy value, the more features from the source images are described in the fusion image.
SF [32]: Spatial frequency reflects the gray-level change rate of the image. The larger the value, the sharper the fusion image.

PSNR: The peak signal-to-noise ratio is a quality metric based on the noise content of the image; the higher the value, the higher the pixel quality of the fusion image.
VIF [33]: Visual information fidelity is an evaluation index based on information fidelity, which models the relationship between the image and human visual distortion. VIF always lies in the range [0,1]; the closer it is to 1, the lower the image distortion and the more the retained information agrees with human visual perception.
AG [34]: The average gradient measures the gradient information of the image and thereby its sharpness; the higher the value, the clearer the fusion result.

MSE: The mean square error measures the difference between the source images and the fused results. The lower the value, the higher the similarity between the source images and the fused results.
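For reference, here are minimal NumPy implementations of four of the metrics above (EN, SF, AG, MSE); PSNR and VIF are omitted. The histogram and gradient conventions used here are common choices, and published evaluation code may differ in detail.

```python
import numpy as np

def en(img, bins=256):
    # Information entropy of an 8-bit image, from its normalized histogram
    p = np.histogram(img, bins=bins, range=(0, 256))[0] / img.size
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def sf(img):
    # Spatial frequency: energy of row/column first differences
    rf = np.sqrt(np.mean(np.diff(img.astype(float), axis=0) ** 2))
    cf = np.sqrt(np.mean(np.diff(img.astype(float), axis=1) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

def ag(img):
    # Average gradient: mean magnitude of horizontal/vertical gradients
    gx, gy = np.gradient(img.astype(float))
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2))

def mse(a, b):
    # Mean square error between two images
    return np.mean((a.astype(float) - b.astype(float)) ** 2)
```

A constant image has zero SF and AG, and an image split evenly between two gray levels has an entropy of exactly 1 bit, which makes these implementations easy to sanity-check.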

Comparison Experiment
In this part, we conducted comparative tests with twelve comparison methods on four public datasets. The following are the quantitative analysis and qualitative analysis.

Qualitative Analysis
It can clearly be observed that TarDAL, MFEIF, DDcGAN, SeAFusion, U2Fusion, and IFCNN could not avoid retaining excessive useless information from the visible images, such as the smoke interference in Figure 4 and the strong-light interference in Figure 5. Although GTF, MDLatLRR, FusionGAN, GANMcC, and RFN could retain useful scene information from the infrared images under smoke or strong-light interference, they retained only contrast information and lacked much of the contour and texture information.
Conversely, the proposed method did a good job of preserving useful contrast and texture details from both source images when there was a lot of useless information in the visible images. In addition, the proposed method could extract and retain the weak texture and brightness information of infrared images well when the visible images could not capture effective information, as shown in Figure 4.
To further illustrate the advantages of the proposed approach, this study also conducted comparative tests on the TNO and RoadScene datasets. The proposed method again avoided the interference caused by useless smoke information and effectively used the information from the infrared images, as shown in Figure 6. The proposed method also showed good contrast as well as rich texture information in night scenes with strong light on the road, as shown in Figure 7.

Quantitative Analysis
We randomly selected 70 pairs of visible and infrared images from each of the MFNet, M3FD, and RoadScene datasets, as well as 25 pairs from the TNO dataset. The images were quantitatively compared and analyzed across the six evaluation metrics.
As shown in Tables 1-4, the proposed method outperformed all comparison methods in SF and AG on all datasets, indicating that the proposed fusion images have the highest spatial frequency and average gradient, i.e., the sharpest results. The other evaluation indicators also ranked in the top four among the methods, indicating that the proposed model can not only retain different features from both source images but also reconstruct the gradient information in the fused images.
In summary, the proposed fused image not only had more similar information from both source images, but the gradient information was also more satisfying for human visual observation.

Comparative Experiments Based on Object Detection
This study tested the performance of different fusion results in object detection using the MFNet dataset in order to better verify the advantage of the proposed fusion image in weakening noise information. Public YOLOv5 was used as the target detection network, and the results were divided into visual results and quantitative results, which are introduced separately below.
First, two groups of test results were randomly selected for visual comparison, as shown in Figure 8. From the first group, it is not difficult to see that object detection based on the fused images of this study was superior to the other twelve methods in night images with strong-light interference. Similarly, there were no false detections in the fusion results of this study, which is also an advantage over the source images. In the second group, the results of this study allowed not only the large targets to be detected with higher accuracy, but also all the small targets in the images.
Subsequently, 90 groups of object detection results from the MFNet dataset were randomly selected for quantitative analysis using three indicators: Recall, AP@0.5, and mAP@[0.5:0.95], as shown in Table 5. Among these indicators, the fusion images of this study clearly achieved higher numerical results than the other fusion images. The outstanding ability of infrared images to capture heat information enabled them to achieve the highest detection accuracy and recall, but they lacked the necessary texture, as illustrated in Figure 8.
In summary, the proposed method has been convincingly proven to be able to comprehensively retain specific information from different sources, and the interference of noise information was reduced according to the high detection accuracy of the object detection.

Efficiency Comparison
This study compared the runtime of the proposed method and the twelve comparison methods on the four datasets. All methods were tested on a GPU. It is evident from Table 6 that the proposed method and TarDAL performed well on all datasets. In particular, the proposed method ran fastest on the RoadScene dataset and was second only to TarDAL on the M3FD, MFNet, and TNO datasets. Its faster runtime makes the proposed approach more likely to meet the needs of subsequent high-level vision tasks.

Ablation Experiment
The functional necessity of each part of the proposed fusion method was validated by an ablation experiment conducted on the M3FD dataset. The related qualitative and quantitative analyses are discussed below.
Qualitative analysis: As shown in Figure 9, Module 1 contains only the double-classification discriminator, Module 2 adds two-branch feature extraction to the generator, and Module 3 further introduces the two-branch feature interaction.
It can be clearly observed that the introduction of the double-classification discriminator overcame the weak edge information caused by a single discriminator, although the richness of the texture information and the contrast differences still needed improvement. The two-branch feature extraction generator then solved the problem of important information being overwritten when the two source image channels are simply concatenated. Finally, the design of the feature interactions further enhanced the global contrast between different objects and improved the detailed texture inside objects.
Quantitative analysis: To prove the necessity of each component more convincingly, this study randomly selected 70 pairs of images from the M3FD dataset for quantitative analysis, as shown in Table 7. Although the introduction of the double-classification discriminator did not significantly improve information entropy and the other indicators, it greatly improved the visual quality of the images. The two-branch feature extraction generator enhanced the gradient information of the fused images but also retained useless noise information. The design of the feature interactions then filtered this noise information, avoiding the weakened texture features and vanishing contrast caused by excessive retention of useless information from the visible images.
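For reference, the information entropy (EN) indicator used in such quantitative comparisons is typically the Shannon entropy of the gray-level histogram; a minimal sketch under that standard definition is:

```python
import numpy as np

def entropy(img, levels=256):
    """Shannon entropy (EN) of an 8-bit grayscale image in bits.
    Higher values indicate more information content in the fusion
    result; a constant image scores 0."""
    hist = np.bincount(img.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()          # gray-level probabilities
    p = p[p > 0]                   # drop empty bins (log of 0)
    return float(-np.sum(p * np.log2(p)))
```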

Discussion
The proposed method achieved the highest scores on the SF and AG indicators, and its averages ranked in the top three on the other four indicators. Furthermore, the visualization results demonstrated the strong robustness of the proposed method even when the visible images contained noise. In the object detection experiment, the quantitative comparison provided numerical evidence of the advantage of the proposed method in preserving contour information, while the visual analysis showed directly that the proposed method produced fusion results better suited to object detection, especially when the input images contained a significant amount of noise.
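For clarity on the two leading indicators, spatial frequency (SF) and average gradient (AG) are both gradient-based sharpness measures computed from first differences of the fused image. The sketch below follows their commonly used definitions (formulations of AG vary slightly across papers, so this is an assumption rather than the exact variant used here):

```python
import numpy as np

def spatial_frequency(img):
    """SF: root of the summed mean-squared horizontal (row) and
    vertical (column) first differences; higher means richer detail."""
    img = img.astype(float)
    rf = np.diff(img, axis=1)   # horizontal differences
    cf = np.diff(img, axis=0)   # vertical differences
    return float(np.sqrt(np.mean(rf ** 2) + np.mean(cf ** 2)))

def average_gradient(img):
    """AG: mean local gradient magnitude, a proxy for sharpness."""
    img = img.astype(float)
    gx = np.diff(img, axis=1)[:-1, :]   # crop to a common shape
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))
```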
In conclusion, the proposed method enhanced the texture detail and contrast that are otherwise neglected when visible images are disturbed by a large amount of noise information, as demonstrated by the comparison experiments on different datasets and by the object detection performance tests against existing advanced methods.

Conclusions and Future Work
In this paper, a two-branch feature interaction fusion network based on generative adversarial networks was proposed for visible and infrared images. First, the generator was designed as a two-branch network to strengthen the preservation of edge features and texture information from both visible and infrared images. Second, the feature interaction design of the two-branch network filtered the influence of noise out of the fusion results. Meanwhile, the double-classification discriminator enhanced the ability to guide the feature extraction of the generator in the proposed approach. However, the proposed method is only suitable for infrared and visible image fusion; extending it to other multimodal image fusion tasks remains difficult due to the lack of datasets and requires further study.

Funding: This work was funded by the China Southern Power Grid science and technology program "Research on remote safety control technology of power field operation based on infrared and visible multi-source image fusion". This work was also partially supported by the Yunnan Province Ten Thousand Talents Program and Yunnan Normal University PhD Research Initiation Project 01000205020503148.