Attention Mechanism Based Semi-Supervised Multi-Gain Image Fusion

Abstract: High-dynamic-range imaging technology is an effective method to overcome the limitations of a camera's dynamic range. However, most current high-dynamic imaging technologies are based on the fusion of multiple frames with different exposure levels. Such methods are prone to various phenomena, for example motion artifacts, detail loss and edge effects. In this paper, based on a dual-channel camera that can output two different gain images simultaneously, a semi-supervised network structure based on an attention mechanism is proposed to fuse multi-gain images. The proposed network structure comprises encoding, fusion and decoding modules. First, the U-Net structure is employed in the encoding module to extract important detailed information from the source image to the maximum extent. Simultaneously, the SENet attention mechanism is employed in the encoding module to assign different weights to different feature channels and emphasize important features. Then, the feature maps extracted by the encoding module are fused by the fusion module and input to the decoding module for reconstruction to obtain a fused image. Experimental results indicate that the fused images obtained by the proposed method demonstrate clear details and high contrast. Compared with other methods, the proposed method improves fused image quality relative to several indicators.


Introduction
In the traditional camera structure, due to the physical limitations of the CCD, it is difficult for a single camera exposure to capture the entire dynamic range of a scene as perceived by the human eye [1], which significantly affects the visual effect of the image. Currently, image quality can be improved using various image enhancement technologies [2]; however, these incur a certain loss of image detail and color fidelity. HDR imaging technology obtains a wide-dynamic-range image by fusing multiple frames of the same scene captured at different exposures, which can effectively overcome the narrow dynamic range of cameras and improve image quality. Existing shooting methods can only obtain multiple frames with different exposures by fixing the camera and shooting multiple times in a short period while adjusting the exposure time. Due to relative movement between the camera and the target within the frame time, motion artifacts easily occur, which makes the subsequent fusion difficult. Effectively restoring image details, avoiding motion artifacts and reducing storage space have become topics of interest in the computer vision field.

The training component of the network structure is shown in Figure 1a, and we train the network directly without a fusion module. Figure 1b shows the testing component of the network. Here, the encoding and decoding network uses the optimal weights obtained from the training network. The inputs to the network layer are multi-gain images (HG and LG images), and the output is a fused image. To extract multi-scale features from the image, a convolution layer C1 is added before U-Net to extract coarse features, e.g., edges in the image. The feature map output by U-Net is concatenated with the output from C1 and then fused by C2, which helps retain the context and semantic information of the image.
The feature map output by the convolution layer then passes through a lightweight network, SENet, which assigns different weights to different channels to extract image detail information. For simplicity, we refer to this network as U-SENet (U-Net + SENet).
Through the above operations, the low-level and high-level information of the image are exploited to generate two-branch feature maps, which are then fused using an appropriate fusion strategy. Therefore, the fused feature map contains information on all scales of HG and LG images. Finally, the fused feature map is sent to the decoding network for reconstruction to obtain a high-dynamic range image. The specific parameters of U-SENet's network structure are given in Table 1.
To extract the detailed information at each scale of the image, the training and testing modules include the U-Net network, which is a multi-scale, symmetrical, fully convolutional neural network comprising two parts: contraction and expansion paths. The contraction path, which is used for image feature extraction, primarily comprises convolution and pooling layers, and the expansion path, which is used to reduce the number of feature map channels, primarily comprises convolution and deconvolution layers. When up-sampling in the expansion path, feature maps with the same dimensions from the feature extraction component and the up-sampling component are concatenated to fuse the multi-scale information.
The encoding structure, fusion structure, decoding structure and loss function are described in the following.

Encoding Structure
The proposed U-SENet feature extraction network structure primarily comprises two parts: the U-Net and the SE layer. First, the multi-gain images are input to C1, with a filter kernel size of 3 × 3, for coarse feature extraction to obtain a 16-channel feature map that contains the edge information of the original image. The feature map output from C1 is input into U-Net to extract deep features; the output feature map still has 16 channels. The U-Net structure diagram is shown in Figure 2a. An SE layer is connected after C1 and after U-Net, respectively, to retain important information, such as edges, and suppress unnecessary information, such as noise. The structure of the SE layer is shown in Figure 2b. The skip connection operation facilitates gradient propagation and speeds up model convergence [20]; thus, we implement this operation in the encoding network to prevent gradient information from being lost during the convolution process. Similarly, a skip connection is made between the output of the latter SE module and that of the previous SE module. Note that the two SE modules in the network are the same size. The encoding network has the following obvious advantages. First, the network structure of each branch learns the same features from the input image; thus, the types of feature maps output by convolution layers C1 and C2 are the same. Second, the ability to adaptively assign different weights to the channels helps strengthen useful features. Third, shallow information is retained through the skip connection operations; thus, all significant features that eventually enter the fusion layer can be fully utilized.
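The squeeze-excitation-scale pipeline of the SE layer can be sketched in numpy as follows. The fully connected weights w1 and w2 are random placeholders here (in the real network they are learned), and the reduction ratio of 4 is assumed for illustration:

```python
import numpy as np

def se_layer(feature_map, w1, w2):
    """Squeeze-and-Excitation channel reweighting (sketch).

    feature_map: (C, H, W) array; w1: (C, C//r); w2: (C//r, C).
    """
    # Squeeze: global average pooling over spatial dims -> (C,)
    z = feature_map.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid gives per-channel weights
    s = np.maximum(z @ w1, 0.0)          # ReLU
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))  # sigmoid, shape (C,)
    # Scale: reweight each channel by its learned importance
    return feature_map * s[:, None, None]

# Example with C = 16 channels (as after C1) and reduction ratio r = 4
rng = np.random.default_rng(0)
fmap = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((16, 4)) * 0.1
w2 = rng.standard_normal((4, 16)) * 0.1
out = se_layer(fmap, w1, w2)
print(out.shape)  # (16, 8, 8)
```

The sigmoid gate keeps each channel weight in (0, 1), so informative channels (e.g., edges) are preserved while noisy ones are attenuated.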


Fusion Module
As shown in Figure 1, in the fusion module, the output feature maps of the two encoding branches (LG and HG) are of the same type and number; thus, we fuse the corresponding feature maps by addition. The advantage of this fusion method is that features of the same type are combined by adding the pixel values at the same position in the corresponding feature maps. This method effectively utilizes the gradient information in the feature maps, such that the fused feature map contains both the details of bright areas in the LG image and the details of dark areas in the HG image, which helps retain significant information. The calculation process is given in Equation (1):

f_m(i, j) = Σ_k φ_m^k(i, j),  k ∈ {HG, LG}  (1)

where m ∈ {1, 2, · · · , M}, M = 64 is the number of feature maps, (i, j) represents the pixel position in the feature map, k refers to the index of the input image from which the feature map is obtained, φ_m^k represents the m-th feature map and f_m is the m-th fused feature map.
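The addition strategy of Equation (1) amounts to an elementwise sum over the two branches' feature map stacks; a minimal numpy sketch, with random arrays standing in for the encoder outputs:

```python
import numpy as np

def fuse_add(phi_hg, phi_lg):
    """Addition fusion of Equation (1): f_m(i, j) is the sum of the
    HG- and LG-branch feature maps at the same channel and position."""
    assert phi_hg.shape == phi_lg.shape
    return phi_hg + phi_lg

# Two stacks of M = 64 feature maps from the HG and LG encoding branches
rng = np.random.default_rng(1)
hg = rng.standard_normal((64, 32, 32))
lg = rng.standard_normal((64, 32, 32))
fused = fuse_add(hg, lg)
print(fused.shape)  # (64, 32, 32)
```

Because both branches produce the same feature types in the same order, a plain sum needs no learned fusion weights, which keeps the test-time fusion module parameter-free.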

Decoding Module
As shown in Figure 1, the decoding module comprises convolutional layers C3, C4, C5 and C6. Here, the filter kernel size of each convolutional layer is 3 × 3, the stride is 1 and the output of each layer is used as the input to the subsequent layer. Each time the feature map passes through a convolutional layer, the number of channels is halved, and the number of channels of the final convolutional layer's output is 1, which helps reduce the number of network model parameters. The fused feature map is input to the decoding module, and, by performing the convolution operation on the fused feature map and summing the convolved feature maps, a reconstructed output image is obtained, as shown in Equation (2):

x_j^l = f( Σ_{i ∈ M_j} x_i^{l−1} ∗ W_{ij}^l + b_j^l )  (2)

In Equation (2), M_j is the input map sequence, W is the transpose of the convolution kernel, b is the bias, ∗ represents the convolution operation, f is the activation function and x_j^l is the output feature map of the l-th convolutional layer. The weights are updated by gradient descent, as in Equations (3) and (4):

W^l ← W^l − η ∂E/∂W^l  (3)

b^l ← b^l − η ∂E/∂b^l  (4)

In Equations (3) and (4), E is the error cost function and η is the learning rate for gradient descent.
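For illustration, a naive numpy forward pass of one such layer in the spirit of Equation (2): the ReLU activation, zero padding and the 16 → 8 channel halving are assumptions for this sketch, and the loop implementation favors clarity over speed:

```python
import numpy as np

def conv_layer(x, w, b):
    """One convolutional layer: sum over input maps, add bias, activate.

    x: (C_in, H, W), w: (C_out, C_in, k, k), b: (C_out,).
    Stride 1 with zero ('same') padding, as in layers C3-C6.
    """
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero-pad spatial dims
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                # Sum over all input maps M_j at this position, plus bias
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o]) + b[o]
    return np.maximum(out, 0.0)  # activation f (ReLU assumed)

# Channel halving as in the decoder, e.g. 16 -> 8 channels with 3x3 kernels
rng = np.random.default_rng(4)
x = rng.standard_normal((16, 8, 8))
w = rng.standard_normal((8, 16, 3, 3)) * 0.1
y = conv_layer(x, w, np.zeros(8))
print(y.shape)  # (8, 8, 8)
```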

Loss Function
The network loss function [23] is given in Equation (5):

L = L_p + λ L_ssim  (5)

Here, L_p represents the pixel loss of the image, calculated as the mean-squared error over the N pixels:

L_p = (1/N) Σ_{n=1}^{N} (O(n) − I(n))²

where I, O represent the input and output images, respectively. L_ssim, which represents the loss of structural similarity, is expressed as follows:

L_ssim = 1 − SSIM(I, O)
Here, SSIM is a measurement index representing the similarity between two images [19]. Note that L_ssim lies in the range [0, 1], which differs from the magnitude of L_p; thus, we use λ to balance the two losses. We set λ = 1000.

Figure 3 shows a dual-channel multi-gain camera and its imaging effect designed by our research team. The camera is designed based on the CMOS sensor GSENSE400BIS, and its main parameters are shown in Table 2. The camera outputs images with a resolution of 4096 (H) × 2048 (V), comprising two single-channel grayscale images arranged left and right (left: HG; right: LG; size: 2048 × 2048 each). To capture the details of bright and dark areas in the scene simultaneously, each pixel in the scene is sampled once by the imaging unit at each of the two gains, yielding two images of the same target scene with different gain values. Note that the grid lines in the image are auxiliary reference lines set because the camera is in the debugging process.
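Under the assumptions that L_p is a mean-squared pixel error and L_ssim = 1 − SSIM, with a simplified single-window SSIM rather than the usual windowed version, the combined loss of Equation (5) with λ = 1000 can be sketched as:

```python
import numpy as np

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global (single-window) SSIM between two images in [0, 255]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def total_loss(inp, out, lam=1000.0):
    """L = L_p + lambda * L_ssim; L_p as MSE and L_ssim = 1 - SSIM
    are both assumptions for this sketch."""
    l_p = np.mean((out - inp) ** 2)
    l_ssim = 1.0 - ssim_global(inp, out)
    return l_p + lam * l_ssim

img = np.random.default_rng(2).uniform(0, 255, (64, 64))
print(total_loss(img, img))  # identical images give (near-)zero loss
```

Because SSIM is bounded in a small range while the pixel error can be large, λ = 1000 keeps the structural term from being swamped during training.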
The computer configuration is as follows: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz, 32 GB memory, NVIDIA GeForce RTX 2080 Ti, Windows 10 64-bit operating system, Python 3.6.


Dataset and Training Strategy
To train encoding and decoding structures with superior performance and better feature extraction and reconstruction capabilities, we used grayscale images to train the weights of the network. Note that multi-gain images have no ground-truth image; therefore, we use the public MS-COCO dataset [24] to train the network. The training set includes 15,073 images, and the validation set includes 10,316 images, which is used to verify the network's reconstruction capability after each iteration. All images are center-cropped to 256 × 256 and converted to grayscale. The learning rate is 1 × 10−4, and the batch size is 12.
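The preprocessing step (center-crop to 256 × 256 and grayscale conversion) might look as follows in numpy; the BT.601 luminance weights are an assumption, as the text does not specify the conversion:

```python
import numpy as np

def center_crop_gray(img, size=256):
    """Center-crop an image to size x size and convert RGB to grayscale.

    img: (H, W, 3) uint8 array; returns a (size, size) float array.
    Luminance weights are the common ITU-R BT.601 coefficients (assumed).
    """
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2  # crop from the middle
    patch = img[top:top + size, left:left + size].astype(np.float64)
    # Weighted sum over the color axis -> single-channel luminance
    return patch @ np.array([0.299, 0.587, 0.114])

rgb = np.random.default_rng(3).integers(0, 256, (480, 640, 3), dtype=np.uint8)
gray = center_crop_gray(rgb)
print(gray.shape)  # (256, 256)
```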
The advantage of this training mode is that, with the network weights fixed, a suitable fusion strategy can be selected adaptively in the testing stage to fuse the feature maps.

Validation
The first two rows in Figure 4 are the original experimental images, and the fourth row shows the experimental results obtained by the proposed algorithm. To verify the effectiveness of the proposed algorithm, four types of scenes were captured by the self-developed camera: insufficient lighting in the laboratory (scenes #1 and #2), normal indoor scenes (scene #4), outdoor scenes with strong light (scenes #3 and #6) and scenes with excessive local indoor light (scene #5). These scenes were used to verify the algorithm's effectiveness in normal scenes, its ability to retain scene details under strong and weak light conditions and its ability to suppress halo effects when a local area is too bright.

In this experiment, we set the number of epochs to 20, 30 and 40 to train the network. The corresponding loss curves are presented in Figure 5, which shows that both validation and test loss decreased relatively quickly during the initial training period. As the number of iterations increased, the loss decreased increasingly slowly. With 20 epochs, the network did not converge completely. For 30 and 40 epochs, the training loss tended to be stable, the validation loss fluctuated within a low range and the network converged. Therefore, we selected 30 epochs in this study.

In scenes #1 and #2, the textures in the HG and LG images are preserved, and the detailed texture in the red rectangular area of the image appears more natural. In scenes #3 and #6, the edges and textures of the clouds under strong light are clearer. In addition, the halo effect at the light source in scene #5 is suppressed effectively, and scene #4 shows that the proposed algorithm is equally effective under normal conditions. To verify the effectiveness of the attention mechanism, we removed the SE layer from the proposed network and performed training and testing.
The third row in Figure 4 shows the experimental results obtained without the SE layer. As can be seen, compared to the fusion result of the proposed algorithm, the quality of the fused image is lower without the SE layer, and the details and textures of overexposed regions cannot be recovered effectively. For example, in the roof area in scene #3, it is difficult to retain the details of insufficiently illuminated areas. The visual effect is poor, which demonstrates that the SE layer effectively retains important details in the source image and improves the quality of the fused image.


Experimental Result and Analysis
Generally, the evaluation of image fusion algorithms is divided into subjective and objective evaluations [25]. A subjective evaluation is a qualitative evaluation of the visual effect after fusion, and an objective evaluation is a quantitative evaluation of various indicators of the fused image. Here we evaluate the performance of the proposed algorithm from these two perspectives.
We randomly selected 12 sets of images from the captured dataset [26] to compare the performance of the proposed algorithm to similar algorithms. The compared algorithms include the DSIFT algorithm [11], the multi-exposure image fusion algorithm proposed by Mertens [3], a multi-exposure image fusion algorithm based on structural patch decomposition (SPD-MEF) [12] and the Deepfuse fusion method proposed by Prabhakar et al. [18]. Figure 6 shows the fusion results of six image pairs randomly selected from the 12 groups. The first two rows show the original multi-gain images, the third to sixth rows show the fusion results of the Mertens, DSIFT, SPD-MEF and Deepfuse algorithms, respectively, and the seventh row shows the fusion result of the proposed method. As can be seen, the fused images obtained by the conventional Mertens, DSIFT and SPD-MEF methods are prone to local blur, low contrast and insufficient local details. Deepfuse cannot extract deep-scale details of the image because its feature extraction module is relatively simple, using only two convolutional layers.

Table 3 indicates each fusion method's performance for the images in Figure 6 relative to information entropy (EN) [27], average gradient (AG) [28], mutual information (MI) [28] and multi-level structural similarity (MSSIM) [29], for a total of 30 fused images. Note that greater index values indicate better image fusion performance (numbers in bold indicate optimal values). Figure 7 shows the trend curves corresponding to the data given in Table 3. Here, the X-axis represents each corresponding group of images and the Y-axis represents the index value. As shown, compared to the other methods, the proposed method demonstrates improvement in the various indicators: EN is improved by 5.19%-7.66%, AG by 4.8%-35.34%, MI by 7.24%-79.54% and MSSIM by 2.5%-43.11%. Clearly, the proposed method shows a great advantage in improving image quality.
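As a concrete illustration of two of these indices, the following numpy sketch computes EN and AG for an 8-bit grayscale image. The AG formula used here is one common variant (the mean RMS of horizontal and vertical finite differences), so exact values may differ from those reported in Table 3:

```python
import numpy as np

def entropy(img):
    """Information entropy (EN) of an 8-bit grayscale image, in bits."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    """Average gradient (AG): one common definition based on the RMS
    of horizontal and vertical pixel differences."""
    img = img.astype(np.float64)
    gx = img[:, 1:] - img[:, :-1]    # horizontal differences
    gy = img[1:, :] - img[:-1, :]    # vertical differences
    gx, gy = gx[:-1, :], gy[:, :-1]  # align shapes
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

flat = np.full((64, 64), 128, dtype=np.uint8)
print(entropy(flat), average_gradient(flat))  # both are zero for a flat image
```

Higher EN indicates richer gray-level content and higher AG indicates sharper detail, which is why both rise for well-fused images.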
Time complexity is also important for HDR imaging; here we mainly compare with Deepfuse. The training parameters of Deepfuse and the proposed method are shown in Table 4. In the experimental environment given in Table 4, the training time of our method is about 15 h, whereas Deepfuse saturates in about 10 min with fewer epochs; the average time to fuse an image during testing is 5.32 s for our method versus 0.58 s for Deepfuse. Thus, compared to Deepfuse, the proposed method does not have an advantage in time complexity. The main reason is that we use U-Net as the feature extraction network and add the attention mechanism network SENet; the entire network structure is deep, which leads to higher time complexity.

Figure 8 shows the fusion results of six other sets of images corresponding to different scenes. Here, for each scene, the proposed method obtained a better fusion effect. Figure 9 shows an enlarged comparison of various texture details. As can be seen, the Mertens and DSIFT algorithms are insufficient in extracting image details. The shaded area above the computer due to the reflection of sunlight in Figure 9a and the clouds above the house in Figure 9c could not be recovered effectively. The SPD-MEF algorithm retains the details of the image; however, the fused image retains too much brightness information of the LG image, the overall image is darker and obvious halos appear, as shown in Figure 9a (above the computer area) and Figure 9b (the flashlight has an obvious halo effect). The fused image obtained by Deepfuse demonstrates high contrast and uniform brightness distribution; however, there is room to improve the extraction of image details. The shadow areas above the computer in Figure 9a and the details of the keyboard table in Figure 9d were not extracted completely.
The fusion results obtained by the proposed method demonstrate higher contrast, more uniform brightness distribution and better detail restoration. In addition, the proposed method can effectively avoid halo effects, which provides a good visual effect.

Conclusions
In this paper, we have proposed a semi-supervised network to fuse multi-gain images captured by a dual-channel camera. The two multi-gain images are generated by the camera hardware simultaneously; thus, they are naturally immune to the motion artifacts that tend to occur in traditional multi-exposure image fusion methods. The proposed method extracts texture details of multi-gain images through U-Net, emphasizes more valuable information using an attention-mechanism SE layer and implements a skip connection mechanism to achieve effective extraction of deep image features. Comparative experimental results demonstrate that the quality of the fused image obtained by the proposed method is higher, which effectively expands the dynamic range of the image. In addition, the proposed method achieves good results relative to various indexes, such as EN, AG, MI and MSSIM. Despite these successes, the experiments also exposed a disadvantage: high time complexity. We will address this problem to further improve the method's performance in the future.